Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

Jack Dongarra; M. Abalenkovs; A. Abdelfattah; M. Gates; A. Haidar; J. Kurzak; P. Luszczek; S. Tomov; I. Yamazaki; A. YarKhan

doi:10.14529/jsfi150405

Authors

Jack Dongarra, University of Tennessee, Knoxville
M. Abalenkovs, University of Manchester, Manchester
A. Abdelfattah, University of Tennessee, Knoxville
M. Gates, University of Tennessee, Knoxville
A. Haidar, University of Tennessee, Knoxville
J. Kurzak, University of Tennessee, Knoxville
P. Luszczek, University of Tennessee, Knoxville
S. Tomov, University of Tennessee, Knoxville
I. Yamazaki, University of Tennessee, Knoxville
A. YarKhan, University of Tennessee, Knoxville


Abstract

We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, standalone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries, in particular the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries.
Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models, especially with regard to hardware trends and the ease of programming the high-performance numerical software that current applications need, in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
