Deploying Scalable Deep Learning on HPC Infrastructure

Author name: Nikolaos Nikoloutsakos
Title: Deploying Scalable Deep Learning on HPC Infrastructure
Year: 2018-2019
Supervisor: Christos Tryfonopoulos


Summary

The aim of this thesis is to explore ways to parallelize and distribute deep learning training on multi-core and distributed systems, and to compare the tools used for deploying models at scale. Models become deeper and more compute-intensive as they tackle more complex tasks, and as the volume of training data grows, training times can become prohibitively long. In deep learning it is hard to know in advance which ideas will work, so researchers typically aim to try out many different solutions in the shortest possible time. One way to speed up training is to use hardware accelerators such as GPUs or TPUs; to go even faster, a model can be trained across multiple machines, each equipped with multiple hardware accelerators. In this thesis we describe the problem from a theoretical perspective, present approaches to its parallelization, and evaluate them on benchmark datasets on a real HPC system. There are two principal ways to distribute the training step, commonly known as model parallelism and data parallelism. For the tool benchmarks we focus on the data parallelism approach, which is supported by widely used frameworks such as TensorFlow, PyTorch and Horovod. We also present our work on training a deep learning model with distributed training frameworks, with the aim of providing means for efficient resource utilization when scaling out. In our benchmarks we evaluate the parallel models both in terms of predictive accuracy and in terms of computational performance, namely speedup, efficiency and scalability. We also examine the model quality implications of naively scaling out versus scaling out using the linear scaling rule and learning rate warmup. The tools and methods used in our benchmarks enabled us to efficiently scale out a 1000-class image classification problem to 16 GPU workers without degrading the resulting model quality.
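To illustrate the data-parallel setup discussed above, the following is a minimal sketch of distributed training with Horovod and PyTorch that applies the linear scaling rule and learning rate warmup. The model, the dummy dataset and the hyperparameters (base_lr, warmup_epochs, batch size) are illustrative placeholders and not the configuration used in the thesis experiments.

# Minimal data-parallel training sketch with Horovod + PyTorch, showing the
# linear scaling rule (lr grows with the number of workers) and lr warmup.
# Model, data and hyperparameters are placeholders, not the thesis code.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
import horovod.torch as hvd

hvd.init()                                   # one process per GPU worker
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin each process to its local GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy 1000-class "image" data standing in for the real benchmark dataset.
images = torch.randn(512, 3, 32, 32)
labels = torch.randint(0, 1000, (512,))
dataset = TensorDataset(images, labels)
# Each worker reads a distinct shard of the data.
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Placeholder model; a real experiment would use a deep CNN.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1000)).to(device)

base_lr = 0.01                               # lr tuned for a single worker (assumed value)
scaled_lr = base_lr * hvd.size()             # linear scaling rule
warmup_epochs = 5                            # assumed warmup length
optimizer = optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

criterion = nn.CrossEntropyLoss()
for epoch in range(30):
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    # Warmup: ramp lr linearly from base_lr to scaled_lr over the first epochs.
    if epoch < warmup_epochs:
        lr = base_lr + (scaled_lr - base_lr) * (epoch + 1) / warmup_epochs
    else:
        lr = scaled_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch} lr {lr:.4f} loss {loss.item():.4f}")

Launched with one process per GPU (for example via horovodrun), each worker trains on its own data shard while Horovod averages the gradients across workers; the warmup ramps the learning rate up gradually so that the larger effective batch size does not destabilize early training.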