### Optimizing Deep Learning RNN Topologies on Intel Architecture

#### Abstract

Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes that stem from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed and while they are hot in cache. Additionally, we bring the time-step loop into our cell to further increase weight reuse and to amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g., for RNN, the forward pass is up to ~3× faster, whereas the backward/weight update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions, and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they outperform the vectorized, reduced-accuracy vendor-optimized (Intel SVML) libraries by 1.6–2.6×, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel’s latest Cascade Lake architecture.
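The blocking and fusion idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is plain, unvectorized Python written for readability, and every name in it (`rnn_cell_blocked`, `rnn_forward`, `tanh_pade`, the weight names `W`, `R`, `b`, and the tile sizes `bk`, `bn`) is an assumption made for this example. It assumes the vanilla RNN update h_t = tanh(W x_t + R h_{t-1} + b), computes each output tile completely, and applies the activation to that tile right away instead of in a separate pass over the full GEMM result. The optional `tanh_pade` is one simple rational approximation (the [3/2] Padé approximant, clamped to [-1, 1], with a maximum error of roughly 0.02); the paper's own approximations are not reproduced here.

```python
import math

def tanh_pade(x):
    """[3/2] Pade approximant of tanh, clamped to [-1, 1]. Low accuracy, for illustration."""
    x2 = x * x
    y = x * (15.0 + x2) / (15.0 + 6.0 * x2)
    return max(-1.0, min(1.0, y))

def rnn_cell_reference(W, R, b, x, h):
    """Unfused baseline: full matrix products first, activation in a second pass."""
    K, C, N = len(W), len(W[0]), len(x[0])
    z = [[b[k]
          + sum(W[k][c] * x[c][n] for c in range(C))
          + sum(R[k][j] * h[j][n] for j in range(K))
          for n in range(N)] for k in range(K)]
    return [[math.tanh(v) for v in row] for row in z]

def rnn_cell_blocked(W, R, b, x, h, bk=2, bn=2, act=math.tanh):
    """Blocked variant: each (bk x bn) output tile is accumulated and activated
    immediately, i.e. while it would still be resident in cache."""
    K, C, N = len(W), len(W[0]), len(x[0])
    out = [[0.0] * N for _ in range(K)]
    for k0 in range(0, K, bk):            # tiles are independent work items
        for n0 in range(0, N, bn):
            for k in range(k0, min(k0 + bk, K)):
                for n in range(n0, min(n0 + bn, N)):
                    acc = b[k]
                    for c in range(C):    # input contribution W @ x
                        acc += W[k][c] * x[c][n]
                    for j in range(K):    # recurrent contribution R @ h
                        acc += R[k][j] * h[j][n]
                    out[k][n] = act(acc)  # fused element-wise step
    return out

def rnn_forward(W, R, b, xs, h0, **kwargs):
    """Time-step loop kept inside the cell so the same weights are reused each step."""
    h = h0
    for x in xs:
        h = rnn_cell_blocked(W, R, b, x, h, **kwargs)
    return h

# Tiny hypothetical problem: K=3 hidden units, C=2 inputs, N=3 minibatch, 2 time steps.
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
R = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3], [0.2, -0.3, 0.1]]
b = [0.05, -0.05, 0.0]
xs = [
    [[1.0, 0.5, -1.0], [0.0, -0.5, 1.0]],
    [[-0.5, 1.0, 0.25], [0.75, 0.0, -0.25]],
]
h0 = [[0.0] * 3 for _ in range(3)]

h_exact = rnn_forward(W, R, b, xs, h0)                 # fused tiles, exact tanh
h_approx = rnn_forward(W, R, b, xs, h0, act=tanh_pade)  # fused tiles, approximate tanh
```

Passing the activation as a parameter mirrors the accuracy/performance trade-off mentioned in the abstract: the same blocked loop can be paired with a more or less accurate `act` without changing the tiling logic.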
