Heterogeneous parallel computing: from clusters of workstations to hierarchical hybrid platforms
This paper surveys the state of the art in the design and implementation of data-parallel
scientific applications on heterogeneous platforms. It covers both traditional approaches, originally
designed for clusters of heterogeneous workstations, and the most recent methods developed for
modern multicore and multi-accelerator heterogeneous platforms.