Learning from imbalanced data in text classification

Author name: Alexandros Chatzaras
Title: Learning from imbalanced data in text classification
Year: 2024-2025
Supervisor: Ilias Zavitsanos

Summary

This thesis investigates the performance of several standard and advanced machine learning techniques for text classification in the context of imbalanced datasets. The research focuses on four well-established algorithms (Decision Tree, Random Forest, Support Vector Machine (SVM), and Logistic Regression) alongside two advanced methods: the DynAmic self-Paced sampling enSemble (DAPS) algorithm and Example-Dependent Cost-Sensitive Learning. These approaches are evaluated on the 20-Newsgroups and Clickbait datasets under varying levels of class imbalance and different text representations. Our goal is to assess whether the DAPS algorithm and Example-Dependent Cost-Sensitive Learning can improve classification performance compared to standard classifiers in scenarios with high class imbalance.

The DAPS algorithm utilizes dynamic sampling and instance weighting to address overlapping regions in the data, while Example-Dependent Cost-Sensitive Learning incorporates the financial impact of misclassifying each individual instance into the learning process. To evaluate these methods, 32 dataset variants were created by applying text representations such as TF-IDF, Bag-of-Words, Word2Vec, and GloVe, and by inducing different levels of class imbalance. Experimental results indicate that cost-sensitive methods, particularly when paired with Random Forest, consistently outperform standard classifiers across a range of imbalance ratios, especially with Word2Vec and GloVe embeddings. The DAPS algorithm also demonstrated superior performance with Random Forest and SVM classifiers, particularly in datasets with low imbalance ratios. However, its effectiveness varied depending on the type of text representation.
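The core idea of example-dependent cost-sensitive learning can be illustrated with a minimal sketch: each training instance carries its own misclassification cost, which is passed to the classifier as a per-sample weight. The toy corpus, the 5:1 cost ratio, and the use of scikit-learn's `sample_weight` mechanism below are illustrative assumptions, not the thesis's actual pipeline or datasets.

```python
# Minimal sketch of example-dependent cost-sensitive learning with a
# Random Forest over a TF-IDF representation. The texts, labels, and
# cost values are illustrative assumptions, not the thesis's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "cheap pills buy now", "limited offer click here",      # rare positive class
    "meeting moved to friday", "see attached report",
    "lunch at noon today", "quarterly numbers look fine",
    "project update enclosed", "call me when you are free",
]
labels = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # induced imbalance: 2 vs 6

# Represent the texts as TF-IDF vectors (one of the representations used).
X = TfidfVectorizer().fit_transform(texts)

# Example-dependent costs: here we assume missing a rare positive is five
# times as costly as a false alarm; real costs would be per-instance values.
costs = np.where(labels == 1, 5.0, 1.0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, labels, sample_weight=costs)  # costs act as instance weights
pred = clf.predict(X)
```

In this formulation the learner minimizes a cost-weighted loss rather than plain error, so errors on high-cost instances are penalized more heavily during training.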

Both DAPS and the cost-sensitive methods underperformed with Bag-of-Words representations, where standard algorithms were more successful. Despite the resource-intensive nature of cost-sensitive methods, their robustness in handling severe imbalance is a key finding. The datasets created during this research and the corresponding code are made available for future exploration and replication. The study concludes that while advanced methods like DAPS and cost-sensitive learning significantly improve classification on imbalanced text datasets, their effectiveness depends on the text representation and the computational resources available. Future research should explore extending these methods to other algorithms, reducing their computational cost, and experimenting with a broader range of datasets and imbalance levels to further optimize their application.