Biological data representations and the estimation of usefulness for analysis tasks

Author name: Nikolaos Nikolaou
Title: Biological data representations and the estimation of usefulness for analysis tasks

George Giannakopoulos



Classification is a prominent example of applied machine learning; classifying documents, images, and audio clips are a few of the applications that appear in the related literature. Numerous approaches and algorithms have been designed and implemented to perform classification, but the “No Free Lunch” theorem, in line with extensive experimentation, indicates that no single algorithm is best across all problem settings. This implies that each setting, essentially each dataset, has unique traits that change the difficulty of the related classification problem. The aim of this thesis is to investigate whether there exist per-dataset descriptive measures, which we term “meta-features”, that can help estimate the difficulty of a related classification task. To this end:
• we study the related literature to identify such meta-features and report the findings;
• we propose and examine new meta-features: (a) the statistical characteristics of the instance overlap hyperspace, (b) the pairwise feature correlation and (c) the intrinsic dimensionality of a dataset, estimated via the fractal dimension;
• starting from 10 datasets from the biomedical domain (DNA sequences encoded with different representation approaches, both graph-based and numerical), we generate a multitude of datasets with varying meta-feature values, creating a benchmark set of datasets for estimating classification performance;
• we apply different classification algorithms to each dataset, classifying the represented DNA sequences by their species of origin (the target label of the classification task); this yields an estimate of the expected classification performance on a given dataset, irrespective of the classification algorithm used;
• we study the correlation of the meta-features with the expected classification performance;
• we study how well the meta-features predict the expected classification performance, through a meta-feature-based regression model.
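As a rough illustration of the last steps, the sketch below computes one candidate meta-feature, the mean absolute pairwise feature correlation, and fits a plain least-squares model from meta-feature values to performance scores. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function names are hypothetical, the aggregation of pairwise correlations into a single mean is an assumption, and ordinary linear regression stands in for whatever regression model the thesis actually uses.

```python
import numpy as np


def mean_abs_pairwise_correlation(X):
    """Mean absolute Pearson correlation over all distinct feature pairs.

    X is an (n_instances, n_features) array. Summarising the pairwise
    correlations by their mean is an assumption made for this sketch.
    Constant features would yield NaN and should be removed beforehand.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # (n_features, n_features)
    upper = np.triu_indices_from(corr, k=1)    # each pair once, no diagonal
    return float(np.mean(np.abs(corr[upper])))


def fit_linear_meta_model(meta_features, performance):
    """Least-squares fit of performance on meta-features.

    meta_features: (n_datasets, n_meta_features) array, one row per dataset.
    performance:   (n_datasets,) array of performance scores.
    Returns the intercept followed by one coefficient per meta-feature.
    """
    M = np.asarray(meta_features, dtype=float)
    y = np.asarray(performance, dtype=float)
    design = np.column_stack([np.ones(len(M)), M])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


if __name__ == "__main__":
    # Synthetic toy data only; these are not results from the thesis.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 1))
    redundant = np.hstack([base, base + 0.01 * rng.normal(size=(200, 1))])
    independent = rng.normal(size=(200, 2))
    print(mean_abs_pairwise_correlation(redundant))
    print(mean_abs_pairwise_correlation(independent))
```

In such a setup, each benchmark dataset would contribute one row of meta-feature values and one performance score, and the fitted coefficients would indicate how each meta-feature relates to the expected performance.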
We find that the two families of representations (numerical vs. graph-based) give different results with respect to which meta-features correlate with the expected performance. On the other hand, in several cases in both settings, the meta-features can be combined into an additive model with statistically significant predictive ability for the expected classification performance. The proposed meta-features expressing the statistical characteristics of the instance overlap hyperspace appear to contribute consistently to effective prediction. These results indicate a promising direction for future study and experimentation, which can improve our intuition about the difficulty of classification tasks.