Design and Implementation of the PULSAR Programming System for Large Scale Computing

Jakub Kurzak; Piotr Luszczek; Ichitaro Yamazaki; Yves Robert; Jack Dongarra

doi:10.14529/jsfi170101

Authors

Jakub Kurzak University of Tennessee, Knoxville
Piotr Luszczek University of Tennessee, Knoxville
Ichitaro Yamazaki University of Tennessee, Knoxville
Yves Robert ENS Lyon, Lyon
Jack Dongarra University of Tennessee, Knoxville

DOI:

https://doi.org/10.14529/jsfi170101

Abstract

The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.

References

Ahmed, H.M., Delosme, J.M., Morf, M.: Highly concurrent computing structures for matrix arithmetic and signal processing. Computer 15(1), 65–82 (1982), DOI: 10.1109/MC.1982.1653828

Allen, E., Culpepper, R., Nielsen, J., Rafkind, J., Ryu, S.: Growing a syntax. In: ACM SIGPLAN Foundations of Object-Oriented Languages workshop. ACM, Savannah, GA, USA (2009)

Annaratone, M., Arnould, E., Gross, T., Kung, H.T., Lam, M., Menzilcioglu, O., Webb, J.A.: The Warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers C-36(12), 1523–1538 (1987), DOI: 10.1109/TC.1987.5009502

Arpaci, R., Culler, D., Krishnamurthy, A., Steinberg, S., Yelick, K.: Empirical evaluation of the CRAY-T3D: A compiler perspective. In: Proceedings of the 22nd Annual International Symposium on Computer Architecture. pp. 320–331. IEEE, Santa Margherita Ligure, Italy (June 22-24 1995), iSSN: 1063-6897, Print ISBN: 0-89791-698-0, INSPEC Accession Number: 5086797

Augonnet, C.: Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System’s Perspective. Phd thesis, Universit‘e Bordeaux 1 (December 2011)

Barada, H., El-Amawy, A.: Systolic architecture for matrix triangularisation with partial pivoting. IEE Proceedings E, Computers and Digital Techniques 135(4), 208–213 (1987), ISSN: 0143-7062

Barnes, G.H., Brown, R.M., Kato, M., Kuck, D.J., Slotnick, D.L., Stokes, R.A.: The ILLIAC IV computer. IEEE Transactions on Computers C-17(8), 746–757 (1968), DOI: 10.1109/TC.1968.229158

Bojanczyk, A.W., Brent, R.P., Kung, H.T.: Numerically stable solution of dense systems of linear equations using mesh-connected processors. SIAM J. Sci. Stat. Comput. 5(1), 95–104 (1984), DOI: 10.1137/0905007

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, H., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J.: Distibuted Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA. Tech. rep., Innovative Computing Laboratory, University of Tennessee (apr 2010), http://icl.cs.utk.edu/news_pub/submissions/ut-cs-10-660.pdf

Cannon, L.E.: A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. thesis, Montana State University (1969)

Carlson, W.W., Draper, J.M., Culler, D.E., Yelick, K., Brooks, E., Warren, K.: Introduction to UPC and language specification. Tech. Rep. CCS-TR-99-157, CCS (May 13 1999)

Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the chapel language. International Journal of High Performance Computing Applications 21(3), 291–312 (August 2007)

Chapel language specification 0.750 (2007), http://chapel.cs.washington.edu/spec-0.750.pdf, accessed: 2017-02-15

Chapman, B., Curtis, T., Pophale, S., Koelbel, C., Kuehn, J., Poole, S., Smith, L.: Introducing OpenSHMEM, SHMEM for the PGAS community. In: PGAS’10: The Fourth Conference on Partitioned Global Address Space Programming Models. PGAS, ACM, New York, NY, USA (October 12-15 2010)

Choi, J.: A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. In: High Performance Computing on the Information Superhighway, 1997. HPC Asia ’97. pp. 224–229. IEEE, Seoul (April-May 1997), DOI: 10.1109/HPC.1997.592151

Choi, J., Walker, D.W., Dongarra, J.J.: PUMMA: parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency Computat.: Pract. Exper. 6(7), 543–570 (1994), DOI: 10.1002/cpe.4330060702

Comon, P., Robert, Y.: A systolic array for computing BA −1 . IEEE Transactions on Acoustics, Speech and Signal Processing 35(6), 717–723 (1987), DOI: 10.1109/TASSP.1987.1165208

Consortium, U.: Upc language specifications, v1.2. Tech. Rep. LBNL-59208, Lawrence Berkeley National Laboratory (2005)

Darcy, J.: Writing robust IEEE recommended functions in ”100% pure Java”(tm). Tech. Rep. CSD-98-1009, Computer Science Division, University of California, Berkeley (Oct 1998)

Dennis, J.B., Gao, G.R.: On the feasibility of a codelet based multi-core operating system. In: 4th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM’14). Edmonton, Alberta, Canada (August 24 2014)

Dongarra, J., Graybill, R., Harrod, W., Lucas, R., Lusk, E., Luszczek, P., McMahon, J., Snavely, A., Vetter, J., Yelick, K., Alam, S., Campbell, R., Carringon, L., Chen, T.Y., Khalili, O., Meredith, J., Tikir, M.: Darpa’s hpcs program: History, models, tools, languages. Advances in Computers: High Performance Computing 72, 1–100 (2008), iSBN: 978-0-12-374411-1, ISSN: 0065-2458

Evans, D.J.: Systolic Algorithms (Topics in Computer Mathematics). Routledge (1991), ISBN: 2881248047

Fanfarillo, A., Burnus, T., Cardellini, V., Filippone, S., Nagle, D., Rouson, D.W.I.: Open-coarrays: Open-source transport layers supporting Coarray Fortran compilers. In: 8th International Conference on Partitioned Global Address Space Programming Models. ACM, Eugene, Oregon, USA (October 7-10 2014)

Fisher, A.L., Kung, H.T.: Special-purpose VLSI architectures: General discussions and a case study. In: Kung, S.Y., Kailath, T., Whitehouse, H.J. (eds.) VLSI and Modern Signal Processing, pp. 153–169. Prentice Hall (1984), ISBN: 013942699X

Fortes, J.A.B., Wah, B.W.: Systolic arrays-from concept to implementation. Computer 20(7), 12–17 (1987), DOI: 10.1109/MC.1987.1663616

Fox, G.C., Otto, S.W., Hey, A.J.G.: Matrix algorithms on a Hypercube I: Matrix multiplication. Parallel Comput. 4(1), 17–31 (1987), DOI: 10.1016/0167-8191(87)90060-3

Gao, G.R., Sterling, T., Stevens, R., Hereld, M., Zhu, W.: Parallex: A study of a new parallel computation model (2007)

Gentleman, W.M., Kung, H.T.: Matrix triangularization by systolic arrays. In: SPIE Proceedings Vol. 298, Advances in Laser Scanning Technology. pp. 19–26. Society for Photo-Optical Instrumentation Engineers, Bellingham, WA (1981)

Gregory, J., McReynolds, R.: The SOLOMON computer. IEEE Transactions on Electronic Computers EC-12(6), 774–781 (1963), DOI: 10.1109/PGEC.1963.263560

Huss-Lederman, S., Jacobson, E.M., Tsao, A., Zhang, G.: Matrix multiplication on the Intel Touchstone Delta. Concurrency: Pract. Exper. 6(7), 571–594 (1994), DOI: 10.1002/cpe.4330060703

Jin, G., Adhianto, L., Mellor-Crummey, J., III, W.N.S., Yang, C.: Implementation and performance evaluation of the hpc challenge benchmarks in coarray Fortran 2.0. In: 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS). Anchorage, AK, USA (May 16-20 2011)

Johnson, K.T., Hurson, A.R., Shirazi, B.: General-purpose systolic arrays. Computer 23(11), 20–31 (1993), DOI: 10.1109/2.241423

Jr., G.L.S., Allen, E., Chase, D., Flood, C., Luchangco, V., Maessen, J.W., Ryu, S.: Fortress (Sun HPCS language). In: Encyclopedia of Parallel Computing, pp. 718–735. Springer, New York Dordrecht Heidelberg London (2011), DOI 10.1007/978-0-387-09766-4

Kaiser, H., Brodowicz, M., Sterling, T.: Parallex: An advanced parallel execution model for scaling-impaired applications (2009)

Kaiser, H., Brodowicz, M., Sterling, T.: Parallex. 2012 41st International Conference on Parallel Processing Workshops 0, 394–401 (2009)

Kale, L.V., Krishnan, S.: CHARM++: A portable concurrent object oriented system based on C++. SIGPLAN Not. (10), 91–108 (Oct.), DOI: 10.1145/167962.165874

Krishnamurthy, A., Culler, D., Yelick, K.: Evaluation of architectural support for global address-based communication in large-scale parallel machines. Tech. Rep. CSD-98-984, Computer Science Division, University of California, Berkeley (1998)

Krishnamurthy, A., Yelick, K.: Optimizing parallel programs with explicit synchronization. In: Proceedings of the ACM SIGPLAN ’95 Conference on Programming Language Design and Implementation. pp. 196–204. ACM (Jun 1995)

Krishnamurthy, A., Yelick, K.: Analyses and optimizations for shared address space programs. Journal of Parallel and Distributed Computation 38(2), 130–144 (November 1 1996)

Kuck, D.J.: ILLIAC IV software and application programming. IEEE Transactions on Computers C-17(8), 758–770 (1968), DOI: 10.1109/TC.1968.229159

Kung, H.T.: Why systolic architectures? Computer 15(1), 37–46 (1982),DOI: 10.1109/MC.1982.1653825

Kung, H.T., Leiserson, C.E.: Systolic arrays (for VLSI). In: Sparse Matrix Proceedings. pp. 256–282. Society for Industrial and Applied Mathematics (1978), ISBN: 0898711606

Kung, S.Y., Lo, S.C., Jean, S.N., Hwang, J.N.: Wavefront array processors-concept to implementation. Computer 20(7), 18–33 (1987), DOI: 10.1109/MC.1987.1663617

Liblit, B.: Local Qualification Inference for Titanium. http://www.cs.berkeley.edu/liblit/lqi/ (Aug 26 1998), cS263/CS265 semester project report.

Liblit, B., Aiken, A.: Type systems for distributed data structures. In: In the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). pp. 199–213. Boston, Massachusetts (19–21 Jan 2000)

Luk, F.T.: Triangular processor array for computing singular values. Lin. Alg. Appl. 77, 259–273 (1986), DOI: 10.1016/0024-3795(86)90171-0

Marjanovi ́c, V., Labarta, J., Ayguad ́e, E., Valero, M.: Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In: Proceedings of the 24th ACM International Conference on Supercomputing. pp. 5–16. ICS ’10, ACM, New York, NY, USA (2010), DOI: 10.1145/1810085.1810091

Mellor-Crummey, J., Adhianto, L., Jin, G., III, W.N.S.: A new vision for coarray fortran. In: The Third Conference on Partitioned Global Address Space Programming Models. Ashburn, VA, USA (October 5-8 2009)

Mellor-Crummey, J., Adhianto, L., Scherer, W.N.: A critique of co-array features in fortran 2008 working draft j3/07-007r3 (February 2008), paper J3 08-126 of the Fortran 2008 J3 standard working group

Miyamoto, C., Liblit, B.: Themis: Enforcing Titanium Consistency on the NOW. http://www.cs.berkeley.edu/~liblit/themis/ (Dec 1997), cS262 semester project report.

Numrich, R.W., Reid, J.K.: Co-array Fortran for parallel programming. ACM SIGPLAN Fortran Forum 17(2), 1–31 (Aug 1998), DOI: 10.1145/289918.289920

Quinton, P., Robert, Y.: Systolic Algorithms & Architectures. Prentice Hall (1991), ISBN: 0138807906

Robert, Y.: Impact of Vector and Parallel Architectures on the Gaussian Elimination Algorithm. Manchester University Press (1991), ISBN: 0470217030

Scherer, W.N., Adhianto, L., Jin, G., Mellor-Crummey, J., Yang, C.: Hiding latency in Coarray Fortran 2.0. In: PGAS’10: The Fourth Conference on Partitioned Global Address Space Programming Models. PGAS, New York, NY, USA (October 12-15), DOI: 10.1145/2020373.2020387

Shterenlikht, A., Margetts, L., Cebamanos, L., Henty, D.: Fortran 2008 CoArrays. ACM SIGPLAN Fortran Forum 34(1), 10–30 (Apr 2015), DOI:10.1145/2754942.2754944

Sorensen, D.C.: Analysis of pairwise pivoting in Gaussian elimination. IEEE Transactions on Computers C-34(3), 274–278 (1985), DOI: 10.1109/TC.1985.1676570

Tan, A.T., Falcou, J., Etiemble, D., Kaiser, H.: Automatic task-based code generation for high performance domain specific embedded language. International Journal of Parallel Programming (2015)

Van De Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurrency Computat.: Pract. Exper. 9(4), 255–274 (1997), DOI: 10.1002/(SICI)1096-9128(199704)9:43.0.CO;2-2

Report on the experimental language X10, version 1.0.1 (December 2006), http://x10.sourceforge.net/docs/x10-101.pdf, accessed: 2017-02-15