Evaluating the Performance of OpenMP Offloading on the NEC SX-Aurora TSUBASA Vector Engine

Tim Cramer; Boris Kosmynin; Simon Moll; Manoel Römmer; Erich Focht; Matthias S. Müller

doi:10.14529/jsfi210204

Authors

Tim Cramer RWTH Aachen University
Boris Kosmynin RWTH Aachen University
Simon Moll NEC Corporation
Manoel Römmer RWTH Aachen University
Erich Focht NEC Corporation
Matthias S. Müller RWTH Aachen University

Abstract

The NEC SX-Aurora TSUBASA vector engine (VE) follows the tradition of long vector processors for high-performance computing (HPC). The technology combines vector computing capabilities with the popularity of the standard x86 architecture by integrating the VE as an accelerator. To reduce the burden of porting code to different accelerator types, the OpenMP specification is designed to be a single parallel programming model for all of them. Besides the availability of compiler and runtime implementations, both functionality and performance are important for the usability and acceptance of this paradigm. In this work, we present LLVM-based solutions for OpenMP target device offloading from the host to the vector engine and vice versa (reverse offloading). To this end, we use our source-to-source transformation tool sotoc as well as the native LLVM-VE code path. We assess the functionality and present the first performance numbers for real-world HPC kernels. We discuss the advantages and disadvantages of the different approaches and show that our implementation is competitive with OpenMP runtime implementations for GPUs. Our work gives scientific programmers new opportunities and flexibility for developing scalable OpenMP offloading applications for the SX-Aurora TSUBASA.
