Towards A Data Centric System Architecture: SHARP

Authors

  • Richard Graham, Mellanox Technologies
  • Gil Bloch, Mellanox Technologies
  • Devendar Bureddy, Mellanox Technologies
  • Gilad Shainer, Mellanox Technologies
  • Brian Smith, Mellanox Technologies

DOI:

https://doi.org/10.14529/jsfi170401

Abstract

Increased system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures. The SHARP technology is a step towards a data-centric architecture, in which data is manipulated throughout the system. This paper introduces a new SHARP optimization and studies aspects that impact application performance in a data-centric environment. The use of UD-Multicast to distribute aggregation results is introduced, reducing the latency of an eight-byte MPI_Allreduce() across 128 nodes by 16%. The use of reduction trees that avoid the inter-socket bus further improves the eight-byte MPI_Allreduce() latency across 128 nodes, with 28 processes per node, by 18%. The distribution of latency across the processes in the communicator is studied, as is the capacity of the system to process concurrent aggregation operations.

Published

2017-12-29

How to Cite

Graham, R., Bloch, G., Bureddy, D., Shainer, G., & Smith, B. (2017). Towards A Data Centric System Architecture: SHARP. Supercomputing Frontiers and Innovations, 4(4), 4–16. https://doi.org/10.14529/jsfi170401