Communication Complexity of the Fast Multipole Method and its Algebraic Variants
A combination of hierarchical tree-like data structures and data access patterns from fast multipole methods and hierarchical low-rank approximation of linear operators from H-matrix methods appears to form an algorithmic path forward for efficient implementation of many linear algebraic operations of scientific computing at the exascale. The combination provides asymptot- ically optimal computational and communication complexity and applicability to large classes of operators that commonly arise in scientific computing applications. A convergence of the mathe- matical theories of the fast multipole and H-matrix methods has been underway for over a decade. We recap this mathematical unification and describe implementation aspects of a hybrid of these two compelling hierarchical algorithms on hierarchical distributed-shared memory architectures, which are likely to be the first to reach the exascale. We present a new communication complexity estimate for fast multipole methods on such architectures. We also show how the data structures and access patterns of H-matrices for low-rank operators map onto those of fast multipole, leading to an algebraically generalized form of fast multipole that compromises none of its architecturally ideal properties.
M. Abduljabbar, R. Yokota, and D. Keyes. Asynchronous execution of the fast multipole method using charm++. arXiv:1405.7487, 2014.
A. H. Baker, R. D. Falgout, T. Gamblin, T. V. Kolev, M. Schulz, and U. M. Yang. Scaling algebraic multigrid solvers: On the road to exascale. In C. Bischof, H.-G. Hegering, W. E. Nagel, and G. Wittum, editors, Competence in High Performance Computing, pages 215– 226. Springer, 2012.
M. Bebendorf and J. Ostrowski. Parallel hierarchical matrix preconditioners for the curl-curl operator. Journal of Computational Mathematics, 27(5):624–641, 2009.
J. B ́edrof, E. Gaburov, and S. Portegies Zwart. A sparse octree gravitational N-body code that runs entirely on the GPU processor. Journal of Computational Physics, 231:2825–2839, 2012.
S. B ̈orm. Approximation of integral operators by matrices with adaptive bases. Computing, 74(3):249–271, 2005.
M. Burtscher and K. Pingali. An efficient CUDA implementation of the tree-based Barnes Hut N-body algorithm. In GPU Computing Gems, chapter 6. Elsevier, 2011.
S. Chandrasekaran, M. Gu, and T. Pals. A fast ULV decomposition solver for hierarchically semiseparable representations. SIAM J. Matrix Anal. Appl., 28(3):603–622, 2006.
E. Corona, P.-G. Martinsson, and D. Zorin. An O(N) direct solver for integral equations on the plane. arXiv:1303.5466, 2013. submitted to SIAM Journal of Scientific Computing.
J. Dongarra, P. Beckman, T. Moore, P. Aerts, G. Aloisio, J. Andre, D. Barkai, J. Berthou, T. Boku, B. Braunschweig, F. Cappello, B. Chapman, Xuebin C., A. Choudhary, S. Dosanjh, T. Dunning, S. Fiore, A. Geist, W. Gropp, R. Harrison, M. Hereld, M. Heroux, A. Hoisie, K. Hotta, J. Zhong, Y. Ishikawa, F. Johnson, S. Kale, R. Kenway, D. Keyes, W. Kramer, J. Labarta, Al. Lichnewsky, T. Lippert, R. Lucas, B. Maccabe, S. Matsuoka, P. Messina, P. Michielse, B. Mohr, M. S. Mueller, W. E. Nagel, H. Nakashima, M. E. Papka, D. Reed, M. Sato, E. Seidel, J. Shalf, D. Skinner, M. Snir, T. Sterling, R. Stevens, F. Streitz, R. Sugar, S. Sumimoto, W. Tang, J. Taylor, R. Thakur, A. Trefethen, M. Valero, A. Van Der Steen, J. Vetter, P. Williams, R. Wisniewski, and K. Yelick. The international exascale soft- ware project roadmap. International Journal of High Performance Computing Applications, 25(1):3–60, 2011.
E. Gaburov, J. B ́edrof, and S. Portegies Zwart. Gravitational tree-code on graphics process- ing units: Implementation in CUDA. arXiv:1005.5384v1, 2010.
A. Gillman, P. M. Young, and P.-G. Martinsson. A direct solver with O(N) complexity for integral equations on one-dimensional domains. Frontiers of Mathematics in China, 7(2):217–247, 2012.
L. Grasedyck and W. Hackbusch. Construction and arithmetics of H-matrices. Computing, 70(4):295–334, 2003.
L. Grasedyck, R. Kriemann, and S. Le Borne. Parallel black box HLU preconditioning for elliptic boundary value problems. Computing and Visualization in Science, 11(4-6):273–291, 2008.
L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Com- putational Physics, 73(2):325–348, 1987.
N. A. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. Journal of Computational Physics, 227:8290–8313, 2008.
W. Hackbusch. A sparse matrix arithmetic based on H-matrices. part I: Introduction to H-matrices. Computing, 62(2):89–108, 1999.
W. Hackbusch and S. B ̈orm. Data-sparse approximation by adaptive H2-matrices. Comput- ing, 69(1):1–35, 2002.
W. Hackbusch, B. Khoromskij, and S.A. Sauter. On H2-matrices. In H. Bungartz, R. Hoppe, and Zenger C., editors, Lectures on Applied Mathematics, pages 9–29. Springer, 2000.
T. Hamada and K. Nitadori. 190 Tflops astrophysical N-body simulation on cluster of GPUs. In SC ’10 Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
T. Hamada, R. Yokota, K. Nitadori, T. Narumi, K. Yasuoka, M. Taiji, and K. Oguri. 42 Tflops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, 2009.
H. Ibeid, R. Yokota, and D. E. Keyes. A performance model for the communication in fast multipole methods on hpc platforms. arXiv:1405.6362v1, 2014.
T. Ishiyama, K. Nitadori, and J. Makino. 4.45 Pflops astrophysical N-body simulation on K computer – The gravitational trillion-body problem. In Proceedings of the 2012 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Anal- ysis, SC ’12, 2012.
P. Jetley, L. Wesolowski, F. Gioachin, L. V. Kal ́e, and T. R. Quinn. Scaling hierarchical N-body simulations on GPU clusters. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
A. Kawai, T. Fukushige, and J. Makino. $ 7.0/Mflops astrophysical N-body simulation with treecode on GRAPE-5. In Proceedings of the 1999 ACM/IEEE conference on Supercomput- ing, pages 1–6, 1999.
P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Fran- zon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, Al. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems. Technical report, DARPA, 2008.
H. Langston, M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. Re-introduction of communication-avoiding FMM-accelerated FFTs with GPU acceleration. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, 2013.
I. Lashuk, A. Chandramowlishwaran, H. Langston, T.-A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, L. Ying, D. Zorin, and G. Biros. A massively parallel adaptive fast multipole method on heterogeneous architectures. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
H. C. Plummer. On the problem of distribution in globular star clusters. Monthly Notices of the Royal Astronomical Society, 71:460–470, 1911.
A. Rahimian, I. Lashuk, K. Veerapaneni, A. Chandramowlishwaran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure, J. Vetter, R. Vuduc, D. Zorin, and G. Biros. Petascale di- rect numerical simulation of blood flow on 200k cores and heterogeneous architectures. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Com- puting, Networking, Storage and Analysis, SC ’10, 2010.
M. J. Stock and A. Gharakhani. Toward efficient GPU-accelerated N-body simulations. AIAA Paper, 2008-608, 46th AIAA Aerospace Sciences Meeting and Exhibit, Reno, Nevada, Jan. 7 - 10:1–13, 2008.
H. Sundar, R. S. Sampath, and G. Biros. Bottom-up construction and 2:1 balance refinement of linear octrees in parallel. SIAM Journal on Scientific Computing, 30(5):2675–2708, 2008.
S.-H. Teng. Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation. SIAM Journal on Scientific Computing, 19(2):635–656, 1998.
L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
R. Vandebril, M.Van Barel, G. Golub, and N. Mastronardi. A bibliography on semiseparable matrices*. CALCOLO, 42(3-4):249–270, 2005.
M. S. Warren, T. C. Germann, P. S. Lomdahl, D. M. Beazley, and J. K. Salmon. Avalon: An Alpha/Linux cluster achieves 10 Gflops for $150k. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing, pages 1–10, 1998.
M. S. Warren and J. K. Salmon. Astrophysical N-body simulation using hierarchical tree data structures. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 570–576, 1992.
M. S. Warren, J. K. Salmon, D. J. Becker, M. P. Goda, and T. Sterling. Pentium pro inside: I. a treecode at 430 Gigaflops on ASCI red, II. price/performance of $ 50/Mflop on Loki and Hyglac. In Proceedings of the 1997 ACM/IEEE conference on Supercomputing, pages 1–16, 1997.
S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications in Applied Mathematics and Computa- tional Science, 52(4):65–76, 2009.
J. Xia, S. Chandrasekaran, M. Gu, and X.-S. Li. Fast algorithms for hierarchically semisep- arable matrices. Numerical Linear Algebra with Applications, 17(6):953–976, 2010.
R. Yokota and L. A. Barba. Treecode and fast multipole method for N-body simulation with CUDA. In GPU Computing Gems, chapter 9. Morgan Kaufmann, Emerald edition, 2011.
R. Yokota, T. Narumi, R. Sakamaki, S. Kameoka, S. Obi, and K. Yasuoka. Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence. Computer Physics Communications, 180:2066–2078, 2009.
R. Yokota, J. Pestana, H. Ibeid, and D. E. Keyes. Fast multipole preconditioners for sparse matrices arising from elliptic equations. arXiv:1308.3339v2, 2014.