Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT
DOI:
https://doi.org/10.14529/jsfi190301
Abstract
Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multi-core stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a generalizable methodology for reproducible measurements accompanied by state-of-the-art performance models. Our open-source toolchain and collected results are publicly available in the "Intranode Stencil Performance Evaluation Collection" (INSPECT).
We present the underlying methods, models and tools involved in gathering and documenting the performance behavior of a collection of typical stencil patterns across multiple architectures and hardware configuration options. Our aim is to endow performance-aware application developers with reproducible baseline performance data and validated models to initiate a well-defined process of performance assessment and optimization. All data is available for inspection: source code, produced assembly, performance measurements, hardware performance counter data, single-core and multicore Roofline and ECM (execution-cache-memory) performance models, and machine properties. Deviations between measured performance and performance models become immediately evident and can be investigated. We also give hints as to how INSPECT can be used in practice for custom code analysis.
License
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-Non Commercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.