Projects
Current Projects
The aim of this project is to develop a search engine specific to the legal domain using natural language processing, text mining, big data solutions, and machine learning algorithms. We also aim to develop a matching algorithm specific to the field of law that runs faster and performs better, integrate it into our search engine interface, and make it available to lawyers. Poster |
Despite rapidly increasing crime and lawsuit rates in Turkey, very few people know their own rights and the law. New crimes, new cases and new legislation emerge every day. Keeping up with all of this is very difficult, not only for citizens but also for people working in the field of law. Our goal is to make laws and litigation outcomes more accessible and understandable to everyone. That is why we want to create a question answering (Q&A) system that gives the most accurate answers to users' legal questions. Poster |
We created a knowledge graph to address issues in the legal system, such as long durations and lack of synchronization. Using a dataset of 13,990 thesis documents, we constructed a graph based on semantic connections between nodes. By analyzing keywords, we identified similarities and commonalities across cases from different fields. Our approach used fastText and a fine-tuned version of it developed in BIGDataLab, along with BERT for language representation. The experimental results show promise, and the graph visualization highlights the relationships between keywords. Poster |
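As a rough illustration of how semantic connections between keywords can become graph edges, the sketch below links two keywords when the cosine similarity of their embedding vectors passes a threshold. The toy 3-dimensional vectors, the keyword names, the `build_graph` helper, and the 0.8 threshold are all assumptions made for the example; they stand in for real fastText or BERT embeddings and are not the project's actual data or settings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_graph(embeddings, threshold):
    """Return {keyword: set of neighbours}, linking pairs with similarity >= threshold."""
    graph = {keyword: set() for keyword in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical toy vectors standing in for fastText or BERT keyword embeddings.
embeddings = {
    "contract": [1.0, 0.0, 0.0],
    "lease": [0.9, 0.1, 0.0],
    "theft": [0.0, 1.0, 0.0],
}

graph = build_graph(embeddings, threshold=0.8)
for keyword, neighbours in graph.items():
    print(keyword, sorted(neighbours))
```

With these toy vectors, "contract" and "lease" end up connected while "theft" stays isolated. With real embeddings the threshold would need tuning.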
We developed a petition generation tool to assist lawyers in automating or semi-automating the process of creating legal documents. The tool utilizes a language model and semantic search to generate petition templates based on client statements and retrieve similar petitions from a dataset. We collected over 6,000 petitions from various sources and fine-tuned the Google/mt5-small model for transfer learning. The results show promise, with a BLEU score of 12.10. The tool has the potential to streamline the text generation process in the legal domain. Poster |
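For readers unfamiliar with BLEU, the sketch below computes a simplified, single-reference, sentence-level score on a 0 to 100 scale: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is only meant to show what the metric measures. Reported scores are normally computed at corpus level with a standard tool and tokenizer, and the whitespace tokenization and example sentences here are assumptions, not the project's evaluation setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference sentence BLEU (0-100), without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences, not taken from the petition dataset.
reference = "the tenant requests termination of the lease"
print(round(bleu(reference, reference), 2))
print(round(bleu("the tenant requests termination of the contract", reference), 2))
```

An exact match scores 100, and the score drops quickly as higher-order n-grams stop matching, which is why generated long-form text typically scores far below that.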
Previous Projects
In this study, a new method called Domain Classification, and a natural language generator system that applies it, were developed to improve model performance in natural language processing studies on corpora of Turkish legal texts. In short, the method holds that a deep learning model trained in the field of law performs better when it is trained on a specialized sub-field dataset classified by legal discipline. To test the method during development, the natural language generator was designed with an architecture based on Recurrent Neural Networks that can work as a hybrid and can be trained and run even on low-end devices, drawing on interdisciplinary study. In addition, the texts produced by the generator in different fields of Turkish law were examined, and the areas in which the Domain Classification method and the generator could benefit lawyers and the judicial system in general were discussed. |
We developed an NLP framework to analyze speeches in the Grand National Assembly of Turkey. Our solution includes a search engine that efficiently retrieves speeches from the reliable "Tutanak" records. We applied sentiment analysis using machine learning models, achieving an accuracy of 87.11% with the Logistic Regression model. The framework enables people to understand and analyze political text, promoting informed participation in political processes. Future work involves incorporating topic extraction techniques for more relevant search results. Poster |
One of the applications of Natural Language Processing (NLP) is processing free text to extract information. Information extraction takes various forms, such as Named Entity Recognition (NER), which detects named entities in free text. The biomedical named-entity extraction task is about extracting named entities such as drugs, diseases, and organs from texts in the medical domain. In our study, we improve models commonly used in this domain, such as the biLSTM+CRF model, by using transformer-based language models such as BERT and its domain-specific variant BioBERT in the embedding layer. We conduct experiments on several benchmark biomedical datasets using a variety of combinations of models and embeddings, such as BioBERT+biLSTM+CRF, BERT+biLSTM+CRF, Fasttext+biLSTM+CRF, and Graph Convolutional Networks. Our results show a clear improvement of 4% to 13% on several datasets when the baseline biLSTM+CRF model is initialized with pretrained language models such as BERT, and especially with a domain-specific one such as BioBERT. |
We have developed a Turkish chatbot and question answering system to address the need for rapid and accurate information during the university selection period. The system offers quick and concise answers to university candidates, eliminating the hassle of searching through lengthy paragraphs or waiting for forum responses. Through intent detection using logistic regression and LinearSVC models, the system achieves an accuracy score of 95.5% in differentiating between questions and information. User trials have shown positive feedback, indicating that the chatbot serves as an efficient guide for university candidates. Poster |
In this project, we developed an algorithmic trading system that incorporates machine learning models to predict the direction of stocks in the BIST30 index. We utilized financial price data, financial sentiment analysis from Twitter, and official KAP news as data sources for training the models. By considering additional data beyond stock prices, we aimed to enhance the accuracy of future predictions and increase profits. The system followed a control flow that involved data collection, preprocessing, sentiment analysis, model training, and imputation for missing data. Experimental results showed that incorporating KAP sentiment scores improved the accuracy of the predictions. Our model achieved an average accuracy of 56% and a maximum accuracy of 77.4% for 2-class classification. This project represents a significant step in predicting stock direction by leveraging social media sentiments and official disclosures, contributing to the field of algorithmic trading. Poster |
In this project, we addressed the problem of sharing long stories on Twitter through images instead of text due to the platform's character limit. Our objective was to develop a solution using deep learning algorithms to predict if an image contains text, perform Optical Character Recognition (OCR) to detect and extract text from the image, and convert the file type from jpg or png to docx. We collected a dataset of 6000 labeled images and achieved an accuracy rate of approximately 87.5% for training sets using deep learning. The OCR process was implemented using the Pytesseract module for both English and Turkish characters with error-free results. The project successfully tackled the challenge of transporting data contained within images on Twitter and opened possibilities for further improvements, including higher accuracy rates, multi-language support, enhanced user interfaces, and additional file type options. Poster |
In this project, we focused on the task of word sense disambiguation (WSD) and applied deep learning and machine learning techniques to improve the accuracy of WSD systems. The goal was to enable machines to understand the sense of words in a similar way to humans. We implemented the semantic diffusion kernel, a novel approach that captures semantic relations between words, and evaluated its performance using different datasets and experiments. The results showed that the semantic diffusion kernel achieved higher accuracy compared to the base kernel in the SupWSD framework. We also worked on optimizing the algorithm and implementing it in the libsvm library for further improvements. This project contributes to the field of WSD and lays the foundation for future enhancements and experiments in word sense disambiguation using advanced techniques. Poster |
In this project, we addressed the problem of grammar and spell checking for the Turkish language, focusing on the challenges posed by rich but noisy textual data in social media. We aimed to develop a solution that quickly and accurately corrects spelling mistakes in user input, using a system based on an encoder-decoder architecture. The system uses two networks, one operating at the word level and the other at the character level, to handle both known and unknown words. The results showed promising precision, recall, and F1 scores, outperforming existing rule-based tools such as Zemberek. The model's success in reducing vocabulary size and improving correction accuracy makes it a valuable preprocessing step for Turkish Natural Language Processing (NLP) applications, provided it is trained with sufficient data. However, it is limited by memory constraints and by the scarcity of Turkish normalization datasets. Overall, the project offers a more efficient and accurate way to handle noisy Turkish text from social media. Poster |
In this project, the focus was on solving the problem of word sense disambiguation using kernel methods. Word sense disambiguation involves determining the correct meaning of a word based on its context. The team proposed two semantic kernel methods: the Abstract Feature Kernel (AFK) and the Relevance Value Kernel (RVK). These kernels aimed to capture semantic connections and contribute to the accurate classification of words into different senses. Experimental results showed that the semantic kernel methods outperformed the baseline linear kernel in terms of F1 scores, indicating their effectiveness in word sense disambiguation. The team also discussed future work, including the exploration of additional semantic kernels with larger and more balanced datasets. Overall, the project contributes to the advancement of word sense disambiguation techniques and highlights the potential of kernel methods in this domain. Poster |
In this project, the goal was to forecast electricity demand using machine learning algorithms and relevant features from the electricity market. The project used technical features such as bilateral agreements, day-ahead planning, and the balancing power market to predict electricity demand. Mutual Information was used to select the most relevant features for predictive modeling. Three machine learning models (Decision Tree, Linear Regression, and Random Forest) were trained and tested on different time periods. Model accuracy was evaluated using R2, MAPE, MAE, and MSE. The results showed high accuracy, with R2 values of up to 98% and a MAPE of around 1.4%. The project demonstrates the effectiveness of machine learning in electricity demand forecasting and its potential to optimize power system planning and operational costs. The data came from electricity market platforms and other reliable sources. Poster |
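The four evaluation metrics named above have short textbook definitions, sketched here in plain Python. The demand values are made up for the example and are not the project's data; MAPE as written assumes that no actual value is zero.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """Mean squared error."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no actual value is zero)."""
    return 100.0 * mean(abs((a - p) / a) for a, p in zip(actual, predicted))


def r2(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly demand values (MWh), invented for illustration.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print("MAE :", mae(actual, predicted))
print("MSE :", mse(actual, predicted))
print("MAPE:", round(mape(actual, predicted), 2), "%")
print("R2  :", round(r2(actual, predicted), 3))
```

MAPE is scale-free, which makes it convenient for comparing forecasts across periods with different demand levels, while MAE and MSE are expressed in the units of the demand series (and its square).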
Murat Can Ganiz (Principal Investigator), Berna Altınel (Research Assistant, PhD student) |
Mithat Poyraz, Burak Görener, Murat Diker
Dilara Torunoğlu (COME MS), Abdülkerim Canbay (COME), Hamdi Atacan Oğul (COME) |
Zeynep Hilal Kilimci (COME MS), Işıl Çoşkun (ISE-graduated with honors) |
İsmail Murat Engün (ISE-graduated with honors), Süleyman Kaan Yeloğlu (ISE), Abdülhadi Çelenlioğlu (ISE) |
Mithat Poyraz, Çağla Şahinli
Abnormal Event Detection on BGP Traffic: Erkan Köşlük (ISE-graduated), Inigo Ortiz de Urbina (COME Erasmus)
Intelligent Focused Web Crawler Project: Mithat Poyraz (COME MS), Duygu Taylan (COME-graduated with honors)
Preprocessing Methods for Turkish Text Classification |
Spectral Algorithms for Document Clustering |