Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT

Julian Hornich, Julian Hammer, Georg Hager, Thomas Gruber, Gerhard Wellein


Stencil algorithms have been receiving considerable interest in HPC research for decades. The techniques used to approach multicore stencil performance modeling and engineering span basic runtime measurements, elaborate performance models, detailed hardware counter analysis, and thorough scaling behavior evaluation. Due to the plurality of approaches and stencil patterns, we set out to develop a generalizable methodology for reproducible measurements accompanied by state-of-the-art performance models. Our open-source toolchain and collected results are publicly available in the "Intranode Stencil Performance Evaluation Collection" (INSPECT).

We present the underlying methods, models and tools involved in gathering and documenting the performance behavior of a collection of typical stencil patterns across multiple architectures and hardware configuration options. Our aim is to endow performance-aware application developers with reproducible baseline performance data and validated models to initiate a well-defined process of performance assessment and optimization. All data is available for inspection: source code, produced assembly, performance measurements, hardware performance counter data, single-core and multicore Roofline and ECM (execution-cache-memory) performance models, and machine properties. Deviations between measured performance and performance models become immediately evident and can be investigated. We also give hints as to how INSPECT can be used in practice for custom code analysis.
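As a rough illustration of how a measurement can be compared against a Roofline prediction, the following minimal sketch computes the standard Roofline bound (the lower of the compute peak and memory bandwidth times arithmetic intensity) and the relative deviation of a measured value from it. The function names and all numbers are hypothetical placeholders chosen for this example; they are not taken from INSPECT or its toolchain.

```python
def roofline_bound(peak_gflops: float, bandwidth_gbs: float,
                   flops_per_byte: float) -> float:
    """Roofline upper bound in GFLOP/s.

    The bound is the smaller of the compute roof and the
    bandwidth roof (memory bandwidth times arithmetic intensity).
    """
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def relative_deviation(measured: float, predicted: float) -> float:
    """Signed relative deviation of a measurement from a model prediction."""
    return (measured - predicted) / predicted


# Hypothetical machine and kernel parameters (assumptions for illustration).
peak = 100.0      # compute peak in GFLOP/s
bandwidth = 50.0  # memory bandwidth in GB/s
intensity = 0.5   # arithmetic intensity in FLOP/byte

predicted = roofline_bound(peak, bandwidth, intensity)
measured = 20.0   # hypothetical measured performance in GFLOP/s

print(f"predicted: {predicted} GFLOP/s")
print(f"deviation: {relative_deviation(measured, predicted):+.0%}")
```

With these placeholder values the kernel is bandwidth-bound, and a measured value below the bound shows up as a negative deviation that would then be investigated.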

Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)