Evaluating the Performance of OpenMP Offloading on the NEC SX-Aurora TSUBASA Vector Engine
DOI: https://doi.org/10.14529/jsfi210204
Abstract
The NEC SX-Aurora TSUBASA vector engine (VE) follows the tradition of long vector processors for high-performance computing (HPC). The technology combines vector computing capabilities with the popularity of the standard x86 architecture by integrating the VE as an accelerator. To decrease the burden of porting code to different accelerator types, the OpenMP specification is designed to be a single parallel programming model for all of them. Besides the availability of compiler and runtime implementations, both functionality and performance are important for the usability and acceptance of this paradigm. In this work, we present LLVM-based solutions for OpenMP target device offloading from the host to the vector engine and vice versa (reverse offloading). For this purpose, we use our source-to-source transformation tool sotoc as well as the native LLVM-VE code path. We assess the functionality and present the first performance numbers for real-world HPC kernels. We discuss the advantages and disadvantages of the different approaches and show that our implementation is competitive with OpenMP runtime implementations for GPUs. Our work gives scientific programmers new opportunities and flexibility for developing scalable OpenMP offloading applications for SX-Aurora TSUBASA.
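To make the notion of OpenMP target device offloading concrete, the following is a minimal sketch in C of a loop offloaded with standard OpenMP directives. It is not taken from the paper: the function name `saxpy_offload` and the choice of kernel are illustrative assumptions, and nothing here is specific to sotoc or the LLVM-VE code path. If the code is compiled without OpenMP support, or no target device is available, the directive is ignored or falls back to the host and the loop still produces the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): computes y = a * x + y.
 * The loop is offloaded to a target device if one is available;
 * otherwise it runs on the host. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* map(to:) copies x to the device before the region;
     * map(tofrom:) copies y to the device and back afterwards. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source can be built for different accelerators by changing only the compiler's offload target, which is the portability argument the abstract makes for OpenMP.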
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.