Machine Learning methods for Missing values in longitudinal data and multivariate time series

Author name: Christos Platias
Title: Machine Learning methods for Missing values in longitudinal data and multivariate time series
Year: 2017-2018
Supervisor: Georgios Petasis

Summary

Handling missing values in a dataset is a long-standing problem across many disciplines, such as health care, geosciences, biology, and medicine. Missing values can arise from different sources, such as mishandled samples, measurement errors, non-response, or deleted values. The main difficulty is that many algorithms cannot run on incomplete datasets. Several methods exist for handling missing values, including “SoftImpute”, “k-nearest neighbor”, “mice”, “MatrixFactorization”, and “missForest”. However, performance comparisons of these methods are hard to find, because most studies treat imputation as an intermediate step of a regression or classification task and report only that task’s performance. In addition, comparisons with existing scientific work are difficult because few publications use open-access datasets. With this in mind, the thesis had three goals. The first was to find and use open datasets from real use cases, so that anyone can access them and compare experimental results. The second was to propose a new imputation method; to this end, two approaches were developed, one based on autoencoders and one on bagging. The third was to compare some of the most frequently used methods for missing data imputation. To achieve this, 13 different methods were tested on four real-world, publicly available datasets.
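
The summary does not describe the evaluation protocol, so the following is only a minimal sketch of one common way to compare imputation methods directly rather than through a downstream task: hide a random subset of known values, impute them, and measure the error on the hidden entries only. The synthetic data, the 20% missing rate, the helper names `mask_at_random` and `imputation_rmse`, and the use of scikit-learn's `SimpleImputer` and `KNNImputer` are all assumptions made for illustration; they stand in for the thesis's real datasets and its 13 methods and are not taken from it.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer


def mask_at_random(X, missing_rate, rng):
    """Return a copy of X with random entries set to NaN, plus the boolean mask."""
    mask = rng.random(X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def imputation_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the entries that were artificially removed."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)

# Synthetic stand-in for a real dataset (assumption): 5 correlated features
# generated from 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
X_true = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Hide roughly 20% of the values completely at random (assumed rate).
X_missing, mask = mask_at_random(X_true, 0.2, rng)

# Two simple baselines standing in for the methods compared in the thesis.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}

# Impute with each method and score it on the hidden entries.
scores = {
    name: imputation_rmse(X_true, imputer.fit_transform(X_missing), mask)
    for name, imputer in imputers.items()
}

for name, rmse in scores.items():
    print(f"{name}: RMSE on hidden entries = {rmse:.3f}")
```

Scoring only the masked entries keeps the comparison about imputation quality itself, which is the kind of direct comparison the summary says is hard to find in the literature. On real data, the masking would be applied to values that were actually observed.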