Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms


  • Denis Shaikhislamov Research Computing Center of Lomonosov Moscow State University
  • Andrey Sozykin Ural Federal University N.N. Krasovskii Institute of Mathematics and Mechanics
  • Vadim Voevodin Research Computing Center of Lomonosov Moscow State University



Neural networks are becoming increasingly popular both in science and in industry, largely because solutions based on neural networks achieve state-of-the-art results in domains previously dominated by traditional methods, e.g. computer vision and speech recognition. However, to achieve these results, neural networks have become progressively more complex and therefore require far more training: training a neural network today can take weeks. This problem can be addressed by parallelizing neural network training on modern clusters and supercomputers, which can significantly reduce training time. Fast training is essential for data scientists, since it allows them to obtain results sooner and move on to the next decision.

In this paper we provide an overview of the distributed learning support offered by popular modern deep learning frameworks, both in terms of the functionality provided and the performance achieved. We consider multiple hardware configurations: training on multiple GPUs and on multiple computing nodes.
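To make the data-parallel scheme used by most of these frameworks concrete, the following is a minimal pure-Python sketch (not taken from any particular framework): each worker computes gradients on its shard of the batch, the gradients are averaged (the role an MPI-style allreduce plays in tools such as Horovod), and every worker applies the same update. The one-parameter linear model, the toy data, and the worker count are illustrative assumptions.

```python
# Sketch of synchronous data-parallel SGD for a 1-D model y = w * x.

def grad(w, shard):
    # Gradient of the mean squared error on one worker's data shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, batch, n_workers, lr=0.01):
    # 1. Split the global batch evenly across workers.
    shards = [batch[i::n_workers] for i in range(n_workers)]
    # 2. Each worker computes a local gradient (in parallel in practice).
    local_grads = [grad(w, s) for s in shards]
    # 3. "Allreduce": average the gradients so all workers agree.
    g = sum(local_grads) / n_workers
    # 4. Identical weight update on every worker keeps replicas in sync.
    return w - lr * g

# Toy data generated by the true model y = 3 * x.
data = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(100):
    w = train_step(w, data, n_workers=4)
print(round(w, 3))  # converges to 3.0
```

With equal-size shards, averaging the per-shard gradients is exactly the global-batch gradient, which is why synchronous data parallelism is numerically equivalent to single-device training with a larger batch.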






How to Cite

Shaikhislamov, D., Sozykin, A., & Voevodin, V. (2020). Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms. Supercomputing Frontiers and Innovations, 6(4), 57–83.