Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms

Authors

  • Denis Shaikhislamov, Research Computing Center of Lomonosov Moscow State University
  • Andrey Sozykin, Ural Federal University; N.N. Krasovskii Institute of Mathematics and Mechanics
  • Vadim Voevodin, Research Computing Center of Lomonosov Moscow State University

DOI:

https://doi.org/10.14529/jsfi190404

Abstract

Neural networks are becoming increasingly popular both in science and in industry, mostly because solutions based on neural networks achieve state-of-the-art results in domains previously dominated by traditional methods, e.g. computer vision and speech recognition. To achieve these results, however, neural networks become progressively more complex and therefore require much more training; training a modern neural network can take weeks. This problem can be addressed by parallelizing the training and running it on modern clusters and supercomputers, which can significantly reduce the training time. Faster training is essential for data scientists, because it allows them to obtain results sooner and move on to the next decision.

In this paper we provide an overview of the distributed training capabilities offered by popular modern deep learning frameworks, both in terms of the provided functionality and of performance. We consider multiple hardware configurations: training on multiple GPUs and on multiple computing nodes.
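
Several of the frameworks considered in the survey (e.g. TensorFlow, Keras, Horovod) support data-parallel training, in which each worker holds a full copy of the model, processes its own share of each batch, and the gradients are averaged across workers. The sketch below is not taken from the paper; it is a minimal illustration, assuming Horovod with TensorFlow's Keras API, of how such training is typically set up. The model, dataset and hyperparameters are placeholders.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each worker process to a single GPU (one process per GPU).
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder model and data: a small fully connected net on MNIST.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate with the number of workers and wrap the
    # optimizer so that gradients are averaged across workers via allreduce.
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    model.compile(optimizer=opt,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Broadcast the initial weights from rank 0 so that all workers
    # start from the same state.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(x_train, y_train, batch_size=64, epochs=1,
              callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)

Such a script is normally launched with one process per GPU, for example: horovodrun -np 4 python train_hvd.py on a single node, or via mpirun across several nodes (the script name train_hvd.py is just an example).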

Published

2020-01-29

How to Cite

Shaikhislamov, D., Sozykin, A., & Voevodin, V. (2020). Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms. Supercomputing Frontiers and Innovations, 6(4), 57–83. https://doi.org/10.14529/jsfi190404