Author name | Anna Spyropoulou |
---|---|
Title | Estimating the usefulness of biological data representations in data analysis tasks |
Year | 2017-2018 |
Supervisor | George Giannakopoulos |
The increasing size and richness of available data has revolutionized how Big Data is used and represented, and bioinformatics is one of the research domains most affected. Data scientists employ signal processing methods to analyze the information contained in a growing volume of deoxyribonucleic acid (DNA) sequence databases of human and model organisms. Those methods require transforming the genomic information into a numerical representation in the form of a feature vector, which enables further analysis and meta-analysis by AI algorithms. This thesis examines classification difficulty estimation applied to biological data: given a collection of annotated datasets in the biology / bioinformatics domain, we aim to produce a model that estimates how challenging a classification task will be on new data from a similar domain. The chosen approach uses data-driven machine learning methods in a metalearning setting. First, we construct a metadataset consisting of a) instances of a series of statistical metafeatures and b) a numeric ground truth derived from the aggregated classification scores of different classifiers. To achieve this, we process multiple instances of input data that encompass different representations of biological information, and we apply sampling techniques for data augmentation. We then build difficulty estimation models in a metalearning fashion by applying different regression algorithms to the metadataset. A large-scale experimental evaluation shows that the approach is effective, outperforming the statistical baseline in terms of performance and robustness. Additionally, we provide information on metafeature evaluation, as well as performance and stability rankings for the representation, regression and classification models used.
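To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a single metadataset row might be assembled. It is not the thesis implementation. The k-mer frequency representation, the four metafeatures, the mean as the aggregation of classifier scores, and all function names are assumptions chosen for illustration, and the sequences, labels and scores are placeholder values rather than results.

```python
from collections import Counter
from itertools import product
from math import log2
from statistics import mean, pvariance


def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Map a DNA string to normalized k-mer frequencies.

    Assumption: k-mer counting stands in for the representations
    studied in the thesis, which the abstract does not enumerate.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[km] for km in kmers) or 1  # avoid division by zero
    return [counts[km] / total for km in kmers]


def metafeatures(X, y):
    """Compute a few simple statistical metafeatures (illustrative choice)."""
    n = len(y)
    class_probs = [c / n for c in Counter(y).values()]
    return {
        "n_instances": n,
        "n_features": len(X[0]),
        "class_entropy": -sum(p * log2(p) for p in class_probs),
        "mean_feature_variance": mean(pvariance(col) for col in zip(*X)),
    }


def metadataset_row(X, y, classifier_scores):
    """Combine metafeatures with a numeric target.

    Assumption: the target is the mean of the per-classifier scores.
    """
    row = metafeatures(X, y)
    row["difficulty_target"] = mean(classifier_scores)
    return row


# Placeholder data: four toy sequences with two balanced classes.
sequences = ["ACGTACGT", "AAAACCCC", "GGGGTTTT", "ACACACAC"]
labels = ["a", "b", "b", "a"]
X = [kmer_vector(s, k=2) for s in sequences]

# Placeholder scores standing in for the outputs of three classifiers.
row = metadataset_row(X, labels, classifier_scores=[0.5, 0.75, 1.0])
print(row)
```

Repeating this over many datasets, representations and samples would yield a table on which a regression model could be trained to predict the target from the metafeatures, which is the metalearning step the abstract describes.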