Projects - Big Data and Text Analytics (BIGDaTA) Research Lab

Current Projects

Semantic Search Engine on Law Domain.

The aim of the project is to develop a search engine specific to the law domain using natural language processing, text mining, big data solutions, and machine learning algorithms. We also aim to develop a matching algorithm specific to the field of law that runs faster and performs better, integrate it into our search engine interface, and make it available to lawyers. Poster

Q&A Systems in Law

Despite the rapidly increasing crime and lawsuit rates in Turkey, very few people know their own rights and the laws that apply to them. New crimes, new cases, and new legislation emerge every day. Keeping up with all of these is very difficult, not only for citizens but also for people working in the field of law. Our goal is to make laws and litigation outcomes more accessible and understandable to everyone. That is why we want to create a question answering (Q&A) system that gives the most accurate answers to users' questions in the field of law. Poster

Knowledge Graph about Legal Concepts

We created a knowledge graph to address issues in the legal system, such as long durations and lack of synchronization. Using a dataset of 13,990 thesis documents, we constructed a graph based on semantic connections between nodes. By analyzing keywords, we identified similarities and commonalities between different fields of cases. Our approach used fastText, a fine-tuned version of it developed in BIGDataLab, and BERT for language representation. The experimental results show promise, and the graph visualization highlights the relationships between keywords. Poster
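The idea of linking keywords by semantic similarity can be illustrated with a minimal sketch. It assumes each keyword already has an embedding vector (for example from fastText or BERT). The three-dimensional vectors, the keyword names, and the 0.8 similarity threshold are made up for illustration and are not taken from the project.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_keyword_graph(embeddings, threshold=0.8):
    """Return an adjacency dict linking keywords whose similarity >= threshold."""
    graph = {keyword: {} for keyword in embeddings}
    for a, b in combinations(embeddings, 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold:
            # Undirected edge, weighted by the similarity score.
            graph[a][b] = sim
            graph[b][a] = sim
    return graph


# Hypothetical toy embeddings; real ones would come from a trained model.
toy_embeddings = {
    "contract": [0.9, 0.1, 0.0],
    "lease": [0.8, 0.2, 0.1],
    "homicide": [0.0, 0.1, 0.9],
}

graph = build_keyword_graph(toy_embeddings)
for keyword, neighbours in graph.items():
    print(keyword, "->", sorted(neighbours))
```

With these toy vectors only "contract" and "lease" end up connected. A lower threshold would produce a denser graph, so the threshold is a tuning choice rather than a fixed value.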

Automatic / Semiautomatic Text Generation in Law

We developed a petition generation tool to help lawyers automate or semi-automate the creation of legal documents. The tool uses a language model and semantic search to generate petition templates based on client statements and to retrieve similar petitions from a dataset. We collected over 6,000 petitions from various sources and applied transfer learning by fine-tuning the google/mt5-small model. The results show promise, with a BLEU score of 12.10. The tool has the potential to streamline the text generation process in the legal domain. Poster
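For readers unfamiliar with BLEU, the following is a minimal sketch of an unsmoothed sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. It is not the evaluation script used in the project. The example sentences are invented, and the function returns a value between 0 and 1, whereas a figure such as 12.10 is presumably reported on a 0 to 100 scale (an assumption).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference (token lists)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Invented example sentences, for illustration only.
reference = "the court accepted the petition".split()
candidate = "the court rejected the petition".split()
print(sentence_bleu(candidate, reference, max_n=2))
```

Because no smoothing is applied, short texts with no matching 4-gram score zero at the default `max_n=4`. Corpus-level tools usually aggregate counts over many sentences to avoid this.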

Previous Projects

Language Modeling in Turkish Legal Corpus: Improving Model Performance with Domain Classification by Using Recurrent Neural Networks

This study proposes a new method called Domain Classification, together with a natural language generator system that applies it, to improve model performance in natural language processing studies on the corpus of Turkish legal texts. In short, the method holds that a deep learning model trained in the field of law performs better when it is trained on a specialized sub-field dataset classified according to legal disciplines. To test the method, the natural language generator was designed with a Recurrent Neural Network architecture that can work as a hybrid and can be trained and run even on low-resource devices, drawing on an interdisciplinary approach. In addition, texts produced by the generator in different fields of Turkish law were examined, and the study discusses the areas in which the Domain Classification method and the natural language generator could benefit lawyers and the judicial system in general.

Natural Language Processing Framework to Analyze Speeches in Grand National Assembly of Turkey

We developed an NLP framework to analyze speeches in the Grand National Assembly of Turkey. Our solution includes a search engine that efficiently retrieves speeches from the reliable "Tutanak" records. We applied sentiment analysis using machine learning models, achieving an accuracy of 87.11% with the Logistic Regression model. The framework enables people to understand and analyze political text, promoting informed participation in political processes. Future work involves incorporating topic extraction techniques for more relevant search results. Poster

Biomedical Named Entity Recognition Using Transformers with biLSTM + CRF and Graph Convolutional Neural Networks

One application of Natural Language Processing (NLP) is processing free-text data to extract information. Information extraction takes various forms, such as Named Entity Recognition (NER), which detects named entities in free text. The biomedical named-entity extraction task is about extracting named entities such as drugs, diseases, and organs from texts in the medical domain. In our study, we improve models commonly used in this domain, such as the biLSTM+CRF model, by using transformer-based language models like BERT and its domain-specific variant BioBERT in the embedding layer. We conduct experiments on several benchmark biomedical datasets using a variety of combinations of models and embeddings, such as BioBERT+biLSTM+CRF, BERT+biLSTM+CRF, fastText+biLSTM+CRF, and Graph Convolutional Networks. Our results show a clear improvement of 4% to 13% on several datasets when the baseline biLSTM+CRF model is initialized with pretrained language models such as BERT, and especially with a domain-specific one like BioBERT.
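To make the output of such a tagger concrete, here is a minimal sketch of turning per-token labels into entity spans. It assumes the common BIO labelling scheme, which the description above does not confirm. The sentence, the tag names `DRUG` and `DISEASE`, and the function name are hypothetical.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical tagger output for an invented sentence.
tokens = ["Aspirin", "reduces", "the", "risk", "of", "heart", "attack", "."]
tags = ["B-DRUG", "O", "O", "O", "O", "B-DISEASE", "I-DISEASE", "O"]
print(extract_entities(tokens, tags))
```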

Concept and Implementation of a Turkish Chatbot Service

We have developed a Turkish chatbot and question answering system to address the need for rapid and accurate information during the university selection period. The system offers quick and concise answers to university candidates, eliminating the hassle of searching through lengthy paragraphs or waiting for forum responses. Through intent detection using logistic regression and LinearSVC models, the system achieves an accuracy score of 95.5% in differentiating between questions and information. User trials have shown positive feedback, indicating that the chatbot serves as an efficient guide for university candidates. Poster

Algorithmic Trading using KAP and Twitter Sentiments with Machine Learning

In this project, we developed an algorithmic trading system that incorporates machine learning models to predict the direction of stocks in the BIST30 index. We utilized financial price data, financial sentiment analysis from Twitter, and official KAP news as data sources for training the models. By considering additional data beyond stock prices, we aimed to enhance the accuracy of future predictions and increase profits. The system followed a control flow that involved data collection, preprocessing, sentiment analysis, model training, and imputation for missing data. Experimental results showed that incorporating KAP sentiment scores improved the accuracy of the predictions. Our model achieved an average accuracy of 56% and a maximum accuracy of 77.4% for 2-class classification. This project represents a significant step in predicting stock direction by leveraging social media sentiments and official disclosures, contributing to the field of algorithmic trading. Poster
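Two steps of this flow, imputation and 2-class evaluation, can be sketched in a few lines. The description does not say which imputation method was used, so forward filling is an assumption here. The prices, sentiment scores, and the naive sign-based predictor are invented for illustration and are not the project's model.

```python
def forward_fill(values, default=0.0):
    """Replace None entries with the most recent observed value."""
    filled = []
    last = default  # used only if the series starts with missing values
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def direction_labels(prices):
    """Label each day 1 if the next close is higher than the current one, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(prices, prices[1:])]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Invented daily data: sentiment is missing on days without news.
closes = [10.0, 11.0, 10.5, 10.8, 10.2]
sentiment = forward_fill([0.2, None, None, -0.4, None])

actual = direction_labels(closes)
# Naive stand-in predictor: positive sentiment means "up" for the next day.
predicted = [1 if score > 0 else 0 for score in sentiment[:-1]]
print(sentiment, actual, predicted, accuracy(predicted, actual))
```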

Deep Learning Based Image Classification and OCR for Twitter

In this project, we addressed the problem of sharing long stories on Twitter through images instead of text due to the platform's character limit. Our objective was to develop a solution using deep learning algorithms to predict if an image contains text, perform Optical Character Recognition (OCR) to detect and extract text from the image, and convert the file type from jpg or png to docx. We collected a dataset of 6000 labeled images and achieved an accuracy rate of approximately 87.5% for training sets using deep learning. The OCR process was implemented using the Pytesseract module for both English and Turkish characters with error-free results. The project successfully tackled the challenge of transporting data contained within images on Twitter and opened possibilities for further improvements, including higher accuracy rates, multi-language support, enhanced user interfaces, and additional file type options. Poster

An Application of Deep-learning and Machine-learning Techniques to All-Words

In this project, we focused on the task of word sense disambiguation (WSD) and applied deep learning and machine learning techniques to improve the accuracy of WSD systems. The goal was to enable machines to understand the sense of words in a similar way to humans. We implemented the semantic diffusion kernel, a novel approach that captures semantic relations between words, and evaluated its performance using different datasets and experiments. The results showed that the semantic diffusion kernel achieved higher accuracy compared to the base kernel in the SupWSD framework. We also worked on optimizing the algorithm and implementing it in the libsvm library for further improvements. This project contributes to the field of WSD and lays the foundation for future enhancements and experiments in word sense disambiguation using advanced techniques. Poster

Grammar and Spell Checking for Turkish Language

In this project, we addressed the problem of grammar and spell checking for the Turkish language, specifically focusing on the challenges posed by rich but noisy textual data in social media. We aimed to develop a solution that can quickly and accurately correct spelling mistakes in user input by implementing an encoder-decoder architecture based system. The system utilizes two networks—one operating at the word level and the other at the character level—to handle both known and unknown words. The results showed promising precision, recall, and F1 scores, outperforming existing rule-based algorithms such as Zemberek. The model's success in reducing vocabulary size and improving correction accuracy makes it a valuable preprocessing step for Turkish Natural Language Processing (NLP) applications, provided it is trained with sufficient data. However, limitations exist due to memory constraints and the availability of Turkish normalization dataset resources. Overall, this project presents a significant advancement in grammar and spell checking for the Turkish language, offering a more efficient and accurate solution for handling noisy textual data in social media. Poster

Kernel Methods For Word Sense Disambiguation

In this project, the focus was on solving the problem of word sense disambiguation using kernel methods. Word sense disambiguation involves determining the correct meaning of a word based on its context. The team proposed two semantic kernel methods: the Abstract Feature Kernel (AFK) and the Relevance Value Kernel (RVK). These kernels aimed to capture semantic connections and contribute to the accurate classification of words into different senses. Experimental results showed that the semantic kernel methods outperformed the baseline linear kernel in terms of F1 scores, indicating their effectiveness in word sense disambiguation. The team also discussed future work, including the exploration of additional semantic kernels with larger and more balanced datasets. Overall, the project contributes to the advancement of word sense disambiguation techniques and highlights the potential of kernel methods in this domain. Poster
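The general idea behind a semantic kernel can be shown with a small sketch. This is a generic form, k(d1, d2) = (d1 S) · (d2 S), where S is a term-to-term proximity matrix. It is neither AFK nor RVK, whose definitions are not given here. The vocabulary, the matrix values, and the context vectors are hypothetical.

```python
def linear_kernel(d1, d2):
    """Plain dot product: only exactly shared terms contribute."""
    return sum(a * b for a, b in zip(d1, d2))


def project(d, S):
    """Row vector d times matrix S: spreads each term's weight to related terms."""
    return [sum(d[i] * S[i][j] for i in range(len(d))) for j in range(len(S[0]))]


def semantic_kernel(d1, d2, S):
    """Generic semantic kernel: dot product after projecting through S."""
    return linear_kernel(project(d1, S), project(d2, S))


# Hypothetical vocabulary of context words: ["loan", "credit", "river"].
# S encodes that "loan" and "credit" are semantically close (0.8).
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
context_a = [1, 0, 0]  # "bank" seen near "loan"
context_b = [0, 1, 0]  # "bank" seen near "credit"
context_c = [0, 0, 1]  # "bank" seen near "river"

print(linear_kernel(context_a, context_b))       # no shared words
print(semantic_kernel(context_a, context_b, S))  # related words now count
print(semantic_kernel(context_a, context_c, S))
```

The linear kernel treats the first two contexts as unrelated because they share no word, while the semantic kernel gives them a positive similarity. This is the kind of gain such kernels aim for in an SVM classifier.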

Machine Learning Based Electricity Demand Forecasting

In this project, the goal was to forecast electricity demand using machine learning algorithms and relevant features from the electricity market. The project used technical features such as bilateral agreements, day-ahead planning, and the balancing power market to predict electricity demand. Mutual Information was used to select the most relevant features for predictive modeling. Three machine learning models (Decision Tree, Linear Regression, and Random Forest) were trained and tested on different time periods. The models were evaluated using R2, MAPE, MAE, and MSE. The results showed high accuracy, with R2 values reaching up to 98% and a MAPE of around 1.4%. The project demonstrates the effectiveness of machine learning in electricity demand forecasting and its potential to optimize power system planning and operational costs. The data came from electricity market platforms and other reliable sources. Poster
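The four evaluation metrics named above have simple definitions, sketched here in plain Python. The demand values are invented and only show how the numbers are computed. They are not the project's data.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented demand values (arbitrary units) and forecasts.
actual_demand = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]

print(mae(actual_demand, forecast))
print(mse(actual_demand, forecast))
print(mape(actual_demand, forecast))
print(r2(actual_demand, forecast))
```

Note that MAPE weights errors relative to the true value, so the same absolute error counts more at low demand than at high demand.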

Large Scale Supervised Learning for Unstructured Big Data Analytics

Deep Learning Algorithms for Supervised Word Embeddings

Social Media Bot Detection using Big Data Analytics

A Deep Learning based Word Embedding Framework for Mining of Turkish Documents

A Preprocessing Framework for Twitter Bot Detection using NoSQL DBs and Python based Data Engineering Tools

The Preprocessing Framework for Twitter Bot Detection aims to identify and detect bot accounts on social media platforms. It utilizes features such as user preferences, tweet characteristics, and periodic patterns to train machine learning algorithms. The framework achieves an accuracy of 86% using Gradient Boosted Trees. Detecting bot accounts is crucial to maintaining platform integrity and combating fake news and manipulation. Poster
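One way a periodic-pattern feature could be computed is sketched below: the coefficient of variation of the gaps between consecutive posts. This is a hypothetical feature chosen for illustration, not necessarily one the framework uses. The timestamps (in seconds) are invented.

```python
import statistics


def interval_regularity(timestamps):
    """Coefficient of variation of the gaps between consecutive posts.

    Values near 0 mean almost perfectly periodic posting. Returns None
    when there are too few posts or all posts share one timestamp.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return None
    return statistics.pstdev(gaps) / mean_gap


# Invented posting times in seconds since the first post.
scheduled_account = [0, 600, 1200, 1800, 2400]  # posts every 10 minutes
irregular_account = [0, 40, 900, 950, 7200]     # bursty, uneven posting

print(interval_regularity(scheduled_account))
print(interval_regularity(irregular_account))
```

A value like this would be one column among many in the feature table passed to a classifier such as Gradient Boosted Trees. On its own it cannot separate bots from regular human posters.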

A Data Collection and Preprocessing Framework for Opinion Mining and Sentiment Analysis using NoSQL DBs and Python based Data Engineering Tools

Concept based aspect aware Sentiment Analysis using NoSQL DBs and Python based Data Engineering Tools or Machine Learning Libraries for Big Data (Apache Spark / MLlib / Mahout)

University Student Profiling and Comparison from Twitter using NoSQL DBs and Python based Data Engineering Tools

Class Based Semantics for Supervised Word Sense Disambiguation (WSD)

Opinion Leader Detection on Social Networks

Social Network Opinion Leaders (SNOL) aims to detect topic-based opinion leaders by analyzing network structures and user-specific features. The methodology involves topic modeling to label tweets and user modeling to identify opinion leaders based on centrality measures and user-specific characteristics. Experimental results show that SNOL outperforms PageRank in spreading information. Future work includes applying SNOL to dynamic networks for real-time identification of opinion leaders. Technologies utilized include LDA, various classifiers, and SVM. References to related research papers are provided. Poster
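For context, the PageRank baseline that SNOL is compared against can be sketched with a short power iteration. This shows only the baseline, not SNOL itself. The follower graph, the user names, and the default damping factor of 0.85 are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it links to; every
    target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical follower graph: an edge points from a follower to the followed user.
followers = {
    "user_a": ["user_d"],
    "user_b": ["user_d"],
    "user_c": ["user_d", "user_a"],
    "user_d": ["user_a"],
}

scores = pagerank(followers)
for user, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(user, round(score, 3))
```

PageRank uses only the link structure. The description above suggests SNOL additionally draws on topic labels and user-specific features, which this sketch does not model.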

Semantic Supervised and Unsupervised Term Weighting Metrics

Semi-Supervised Semantic Text Classification Algorithms (Random Walk Algorithms, Manifold Regularization)

Concept-level Analysis of Natural Language (bag-of-concepts based language processing systems)

Information Extraction, specifically Named Entity Recognition (NER) algorithms for Highly Noisy and Short Turkish Texts (Tweets)

HR Analytics using NoSQL DBs and Python based Data Engineering Tools or Machine Learning Libraries for Big Data (Apache Spark / MLlib / Mahout)

TÜBİTAK 3501 Career Award, 111E239, Development of Semantic Semi-Supervised Algorithms for Textual Data Mining (Successfully completed with several SCI Journal and International Conference publications)
Murat Can Ganiz (Principal Investigator), Berna Altınel (Research Assistant, PhD student)

Using Domain Knowledge for Improving Text Mining Algorithms
Mithat Poyraz, Burak Görener, Murat Diker

Semantic Smoothing Algorithms for Text Mining
Dilara Torunoğlu (COME MS), Abdülkerim Canbay (COME), Hamdi Atacan Oğul (COME)

Smoothing Methods for Bayesian Algorithms
Zeynep Hilal Kilimci (COME MS), Işıl Çoşkun (ISE-graduated with honors)

Semi-supervised Learning Algorithms for Text Mining
İsmail Murat Engün (ISE-graduated with honors), Süleyman Kaan Yeloğlu (ISE), Abdülhadi Çelenlioğlu (ISE)

Concept Ranking Algorithms for Natural Language Processing
Mithat Poyraz, Çağla Şahinli

Abnormal Event Detection on BGP Traffic
Erkan Köşlük (ISE-graduated), Inigo Ortiz de Urbina (COME Erasmus)

Intelligent Focused Web Crawler Project
Mithat Poyraz (COME MS), Duygu Taylan (COME-graduated with honors)

Preprocessing Methods for Turkish Text Classification

Spectral Algorithms for Document Clustering