Extensive performance evaluation of popular relational and non-relational data stores for full-text search

Author name: Charalampos Apostolakis
Title: Extensive performance evaluation of popular relational and non-relational data stores for full-text search
Year: 2024-2025
Supervisor

Christos Tryfonopoulos

Summary

In this thesis, we perform an extensive performance evaluation of relational and non-relational database systems, comparing their full-text search capabilities. We build upon previous relevant work by extending it in four key directions. First, we include additional systems in our evaluation, such as Neo4j, MySQL, MariaDB, Sphinx Search, and Redis, alongside those previously studied. Second, we focus on reproducibility by using Docker containers for our experimental setup, so that our results can be reliably replicated. Third, we enhance the study by using newer and more diverse datasets, including the updated Crossref database and a new Tweet dataset. Fourth, we extend the performance comparison to a broader set of full-text queries, including term importance, proximity, fuzziness, and thesaurus searches, alongside exact phrase matching, wildcard, and Boolean search queries.

The DBMSs tested were chosen after extensive research on available open-source or free-tier systems that offer full-text search capabilities without licensing costs. The selected data stores include both relational (PostgreSQL, MySQL, MariaDB) and non-relational systems (MongoDB, Neo4j, Redis), as well as specialized search engines (Apache Solr, Elasticsearch, Sphinx Search). This breadth allows for a comprehensive comparison of full-text search capabilities across different architectures. Finally, we provide a detailed scorecard presenting the performance of each system on several aspects, such as query handling, data insertion, and indexing efficiency. The scorecard highlights the strengths and weaknesses of each system in terms of scalability, flexibility, search performance, and accuracy across different query types.