Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors
DOI: https://doi.org/10.14529/jsfi200204

Abstract
We propose several improvements to the execution-cache-memory (ECM) model, an analytic performance model for predicting single- and multicore runtime of steady-state loops on server processors. The model is made more general by strictly differentiating between application and machine models: an application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a processor. Moreover, new first principles underlying the model’s estimates are derived from common microarchitectural features implemented by today’s server processors to make the model more architecture independent, thereby extending its applicability beyond Intel processors.
We introduce a generic method for determining machine models, and present results for relevant server-processor architectures by Intel, AMD, IBM, and Marvell/Cavium. Considering this wide range of architectures, the set of features required for adequate performance modeling is surprisingly small.
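The separation of application and machine models described above can be illustrated with a minimal sketch. The numbers and dictionary layout below are hypothetical, and the composition rule is a simplified, fully non-overlapping variant of the ECM model (core time plus the sum of all inter-level cache-line transfer times); real machine models distinguish which contributions overlap.

```python
# Hypothetical machine model: an abstraction of performance-relevant
# hardware properties, expressed in CPU cycles per cache line.
machine = {
    "T_OL": 4.0,                 # overlapping core cycles (arithmetic)
    "T_nOL": 2.0,                # non-overlapping core cycles (loads/stores)
    "cycles_per_cl": {           # data-transfer cycles between hierarchy levels
        "L1-L2": 2.0,
        "L2-L3": 4.0,
        "L3-Mem": 9.0,
    },
}

def ecm_single_core(machine, cachelines):
    """Simplified non-overlapping ECM composition:
    T = max(T_OL, T_nOL + sum of data-transfer contributions)."""
    transfer = sum(machine["cycles_per_cl"][lvl] * n
                   for lvl, n in cachelines.items())
    return max(machine["T_OL"], machine["T_nOL"] + transfer)

# Hypothetical application model: a streaming kernel moving three
# cache lines per unit of work through every memory-hierarchy level.
kernel = {"L1-L2": 3, "L2-L3": 3, "L3-Mem": 3}
print(ecm_single_core(machine, kernel))  # cycles per unit of work -> 47.0
```

Swapping the `machine` dictionary while keeping `kernel` fixed mirrors the paper's central idea: the same application model can be evaluated against machine models of different processor architectures.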
To validate our approach, we compare performance predictions to empirical data for an OpenMP-parallel preconditioned CG algorithm, which includes compute- and memory-bound kernels. Both single- and multicore analyses show that the model exhibits average and maximum relative errors of 5% and 10%, respectively. Deviations from the model and the insights gained are discussed in detail.
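The multicore side of such an analysis typically assumes that cores scale in-cache contributions linearly while the shared memory interface eventually saturates. The following sketch uses that simplifying assumption with hypothetical cycle counts; it is not the paper's full multicore model, which accounts for architecture-specific overlap behavior.

```python
import math

def multicore_estimate(t_ecm, t_l3mem, n):
    """Simplified ECM-style multicore scaling: n cores divide all work,
    but the shared L3-memory transfer time serializes and sets a floor."""
    return max(t_ecm / n, t_l3mem)

def saturation_point(t_ecm, t_l3mem):
    """Smallest core count at which the memory bottleneck dominates."""
    return math.ceil(t_ecm / t_l3mem)

# Hypothetical single-core estimate of 47 cycles, of which 27 cycles
# are the serialized memory transfer: bandwidth saturates at 2 cores.
print(saturation_point(47.0, 27.0))  # -> 2
print(multicore_estimate(47.0, 27.0, 4))  # -> 27.0 (memory-bound)
```

Comparing such saturation predictions against measured scaling curves is one way the relative errors quoted above can be obtained for memory-bound kernels.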