Developing an Architecture-independent Graph Framework for Modern Vector Processors and NVIDIA GPUs

Ilya V. Afanasyev

doi:10.14529/jsfi200404

Authors

Ilya V. Afanasyev, Lomonosov Moscow State University

Abstract

This paper describes the first attempt to develop an architecture-independent graph framework, named VGL, designed for modern architectures with high-bandwidth memory. VGL currently supports two classes of architectures: NEC SX-Aurora TSUBASA vector processors and NVIDIA GPUs. However, its flexible software structure makes it easy to extend to other architectures. VGL lets users select the most suitable architecture for solving a specific graph problem on given input data, which, in turn, allows it to significantly outperform existing frameworks and libraries developed for modern multicore CPUs and NVIDIA GPUs. Because VGL uses an identical set of computational and data abstractions for all architectures, users can easily port graph algorithms between target architectures without any source code modifications. Additionally, we show how graph algorithms should be implemented and optimised for NVIDIA GPU and NEC SX-Aurora TSUBASA architectures, demonstrating that the two architectures share multiple properties and hardware features.
