Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors

We describe a universal modeling approach for predicting single- and multicore runtime of steady-state loops on server processors. To this end we strictly differentiate between application and machine models: An application model comprises the loop code, problem sizes, and other runtime parameters, while a machine model is an abstraction of all performance-relevant properties of a CPU. We introduce a generic method for determining machine models and present results for relevant server-processor architectures by Intel, AMD, IBM, and Marvell/Cavium. Considering this wide range of architectures, the set of features required for adequate performance modeling is surprisingly small. To validate our approach, we compare performance predictions to empirical data for an OpenMP-parallel preconditioned CG algorithm, which includes compute- and memory-bound kernels. Both single- and multicore analysis shows that the model exhibits average and maximum relative errors of 5% and 10%. Deviations from the model and insights gained are discussed in detail.


INTRODUCTION
The architectural differences among processor models of different vendors (and even among models of a single vendor) lead to a diverse server-processor landscape in the high-performance computing market. On the other hand, several analytic performance models, such as the Roofline model [10,23] and the executioncache-memory (ECM) model [6,12], show that many relevant performance features can be described using a few key assumptions and a small set of numbers such as bandwidths and peak execution rates. In this work we introduce a structured method of establishing and describing those assumptions and parameters that best summarize the features of a multicore server processor. It has satisfactory predictive power in terms of performance modeling of (sequences of) steady-state loops but is still simple enough to be carried out with pen and paper. The overarching goal is to allow comparisons among microarchitectures not based on benchmarks alone, which have narrow limits of generality, but based on abstract, parameterized performance models that can be used to attribute performance differences to one or a few parameters or features. As a consequence, reasoning about code performance from an architectural point of view becomes rooted in a scientific process.

Main contributions
We describe an abstract workflow for predicting the runtime and performance of sequential and parallel steady-state loops (or sequences thereof) with regular access patterns on multicore server CPUs. The core of the method is an abstract formulation of the ECM model, which is currently the only analytic model capable of giving accurate single-and multicore estimates.
We show that a separation between the machine model, which contains hardware features alone, and the application model, which comprises loop code and execution parameters, is possible with some minor exceptions.
We describe a formalized way to establish a machine model for a processor architecture and present results for Intel Skylake SP and, for the first time, for AMD Epyc, IBM Power9, and Marvell/Cavium ThunderX2 CPUs. The degree of data-transfer overlap in the memory hierarchy is identified as a key determinant for the singlecore in-memory performance of data-bound code.
The feasibility of the approach is demonstrated by predicting runtime and performance of a preconditioned conjugate-gradient (PCG) solver and comparing estimates to empirical data for all investigated processors. ECM predictions for the AMD, Cavium, and IBM CPUs have not been published before.

Outline
This paper is structured as follows. In Section 2 we detail our testbed and methodology. Section 3 describes, in general terms, our modeling approach including application model, machine model, and the modeling workflow. Section 4 shows how machine models can be constructed by analyzing data from carefully chosen microbenchmarks and gives results for the four CPU architectures under consideration. In Section 5 we validate the model by giving runtime and performance predictions for a PCG solver and comparing them to measurements. Finally, Section 6 puts our work in context of existing research and Section 7 summarizes and concludes the paper.

METHODOLOGY AND TESTBED
In this section we point out some relevant high-level properties, while details will be discussed later. Note that we generally take care to run the optimal instruction mix for all benchmark kernels (i.e., using the most recent instruction sets available on the hardware at hand, with appropriate unrolling in place to enable optimal instruction-level parallelism). Compiler peculiarities are commented on where necessary. To minimize interference from the operating system, NUMA balancing was disabled. Transparent huge pages were used by default. The simultaneous multi-threading feature was ignored throughout. Measurements were carried out on repeated loop traversals so timer resolution was not an issue. Runto-run variations were small (generally below 2%) and will thus not be reported. An overview of the investigated processors is provided in Table 1. The AMD Epyc 7451 (EPYC) has a hierarchical design comprising four ccNUMA nodes per socket and six cores per domain. L3 cache segments of 8 MiB each are shared among the three cores of a core complex (CCX). The Uncore of the processor (i.e., the L3 cache, memory interface, and other I/O circuitry) is clocked at a fixed 2.66 GHz. Although the cores support the AVX2 instruction set, 32-byte (B) wide SIMD instructions are executed in two chunks of 16 B by only 16-B wide hardware, so that an effective SIMD width of 16 B applies.
Although the Intel Xeon Skylake Gold 6148 (SKL) has a base core frequency of 2.4 GHz and a wide range of Turbo settings, we fix the clock speed to 2.2 GHz in all our experiments in order to avoid the automatic clock-speed reduction when running AVX-512 code [14]. The Uncore frequency is set to its nominal value of 2.4 GHz. These choices are not a limitation of generality since all procedures described in this work can be carried out for any clock-speed setting. SKL also features a boot-time configuration option of sub-NUMA clustering (SNC), which splits the 20-core chip into two ccNUMA nodes, each comprising ten cores (while the full L3 is still available to all cores). This improves memory-access characteristics and is thus a recommended operating mode for HPC in our opinion. The last-level cache (LLC) prefetcher was turned on for the same reason.
The Cavium/Marvell ThunderX2 CN9980 (TX2) implements the ARMv8.1 ISA with 128-bit NEON SIMD extensions that support double-precision floating-point arithmetic for a peak performance of two 16-B wide FMA instructions per cycle and core. The 32-core chip runs at a fixed 2.2 GHz clock speed, while the L3 cache runs at half the core speed. The victim L3 cache is organized in 2 MiB slices but shared among all cores of the chip.
The Power9 processor used for our investigations is part of an IBM 8336 GTX data analytics/AI node. Being an implementation of the Power ISA v3.0, the core supports VSX-3 (128-bit wide) SIMD instructions. A 512 KiB L2 cache is shared between each pair of cores. The victim L3 cache is segmented, with eleven slices of 10 MiB each, and each slice can act as a victim cache for others [18].
The suite [5] version 4.3.3 was used in several contexts: likwid-pin for thread-core affinity, likwid-perfctr for counting hardware performance events, and likwid-bench for low-level loop benchmarking (with customized kernels for TX2 and PWR9). Instruction latency and throughput we measured using the ibench tool. 1 Where compiled code was required, we used the compiler versions and flags indicated in Table 1.

MODELING APPROACH
Just like the Roofline model, the ECM model is an analytic performance model for streaming loop kernels with regular data-access patterns and a uniform amount of work per loop iteration. Unlike Roofline, however, ECM favors an analytic approach. As a result, the model can give single-and multicore estimates with high accuracy without relying on a large number of measurements. Moreover, the analytic nature enables the evaluation of different hypotheses with respect to a processor's performance behavior by investigating which of them lead to a model that best describes empirical performance, thereby enabling deeper insights than measurement-based approaches such as Roofline. See Section 6 for a more thorough comparison of the models and their predictive powers.
Two major shortcomings of the ECM model concern its loose formulation and the resulting lack of portability: In its current form, the model mixes general first principles and Intel-specific microarchitectural behavior into a set of rules that make it difficult to apply it to other processors. In the following, we untangle the original model: First, several truly general (i.e., microarchitecture-independent) first principles and their rationales are laid out. Next, application and machine models that address code-and microarchitecture-specific properties are covered (in addition, we provide general instructions on how to determine machine models for new microarchitectures in Section 4). Finally, the workflow of the new model is demonstrated.

Model assumptions
The model assumes that the single-core runtime is composed of different runtime components. These include the time required to execute instructions in the core (T core ) and the runtime contributions that result from carrying out the necessary data transfers in the memory hierarchy (e.g., T RegL1 the time to transfer data between the register file and the L1 cache, T L1L2 for L1-L2 transfers, and so on). Depending on the architecture, some or all of these components may overlap. The single-core runtime estimate is therefore derived from the runtime components by putting them together according to the architecture's overlap capabilities.
If no shared resources are involved, single-core performance is assumed to scale linearly with the number of active cores for the multicore estimate. In practice, however, at least one shared resource (the memory interface) will be involved. The model takes conflicts on shared resources into account by modeling contention and the resulting waiting times in an analytical way. In the following, some particularities of modern server processors that simplify runtime modeling are discussed.
Today's server processors typically feature superscalar, out-oforder cores that support speculative execution and implement pipelined execution units. Fig 1a shows the execution of instructions corresponding to a simple vector sum (C[i]=A[i]+B[i]) for a data set in the L1 cache on a hypothetical core. The core has a two-cycle latency for add and load instructions. When the loop begins execution, each of the two load units can execute a load instruction. Since there is a two-cycle load latency, inputs for the add instruction will only be available after two cycles. However, due to speculative execution, the core can continue to execute two load instructions from the next loop iterations in each cycle. Once input data is available, the core can begin executing an add instruction in each cycle. Eventually, after another two-cycle latency (that of the add instruction), the core can begin executing a store instruction in each cycle. Once this latency-induced wind-up phase of four cycles is complete, instruction latency no longer impacts runtime; instead, the runtime is determined by the throughput of instructions. Although latencies might be higher on real processors, the wind-up phase can be neglected even for short loops with only hundreds of iterations. This leads to one of the key assumptions of the ECM model: In the absence of loop-carried dependencies and data-access delays from beyond the L1 data cache, the runtime of a single loop iteration can be approximated by the time that is required to retire the instructions of a loop iteration. With loop-carried dependencies in place, the inter-iteration critical path is a good estimate of the runtime. Due to speculative execution, load/store instructions are decoupled from the arithmetic instructions of a particular loop iteration. This leads to the further assumption that the time to retire arithmetic instructions and the time to retire load/store instructions can overlap.
The next set of assumptions concerns data transfers in the memory hierarchy. The relationship between latency and bandwidth is well understood, so most designs typically provide a sufficient number of buffers to track outstanding cache-line transfers to allow for the saturation of the data-transfer link between adjacent cache levels. Fig 1b shows such a design with more than two buffers to track outstanding transfers to hide a two-cycle latency. Sometimes, however, the number of buffers is insufficient, leading to a deterioration of bandwidth. Figure 1c shows a variant with only two buffers: After two cycles, no more transfer-tracking buffers are available, which prevents the initiation of new transfers. Only after a previous transfer completes and the buffer tracking this transfer is freed can a new transfer request be initiated. As a result, the data link is idle for one cycle, reducing the attainable bandwidth in practice to two-thirds of the theoretical value. On some of the investigated processors this problem can be observed for transfers between the LLC and main memory. This can be attributed to significant latencies caused by the increasingly complex on-chip networks required to accommodate the growing number of cores of modern CPUs.
The model assumes that data links can typically be fully saturated because a sufficient amount of buffers is available and adequate prefetching (be it hardware, software, or both) results in full utilization of these buffers. As a result, runtime contributions of data transfers can typically be calculated by dividing data volumes by the theoretical bandwidths of the corresponding links; the model does, however, include an optional latency penalty to cover edge cases such as the one shown in Fig 1c. Therefore, the runtime contribution of data transfers between memory hierarchy levels i and j is the sum of the actual data transfer time and an optional latency penalty:

Application model
An application model condenses all of the code-related information required to give runtime estimates for a particular loop.
It comprises arithmetic and load-store operations carried out during one loop iteration as well as parameters that influence data transfers in the memory hierarchy. Most prominently, the latter includes the data-set size(s), which determine in which level of the memory hierarchy data resides, yet it may also cover information about blocking size(s) and the scheduling strategy.

Machine model
Machine models comprise selected key information about processors. Despite being limited to few architectural properties, the data included in machine models is sufficient to give meaningful performance estimates. With respect to scope, the contents of machine models can be separated into two parts: the execution capabilities of cores, and details about the memory hierarchy. In the following, each of the two components is discussed in detail.
The part concerning in-core execution capabilities deals with the cores' properties that determine the runtime contribution of two-cycle latency  instruction execution. As discussed in Section 3.1, throughput is a key determinant for single-core runtime, so throughput limits (in operations per cycle) of relevant operations are included. To address loop-carried dependencies, latencies for the corresponding instructions must be included. Moreover, the machine model includes information about potential bottlenecks that limit operation throughput: On most architectures, different functional units share the same execution port, which implies that operations associated with units served by the same port cannot cannot begin execution in the same cycle. Finally, most modern core designs have one or the other peculiar shortcoming that prevents them from fully utilizing the core's load/store units. 2 The second part of the machine model covers information about the cache hierarchy. This entails everything needed to calculate the volume of data transfers for a loop: the number of cache levels, their effective 3 sizes, write-through vs. write-back policy, victim/exclusive vs. inclusive, etc. For example, a victim cache typically implies additional traffic since it receives both modified and unmodified cache lines (CLs) from the overlying cache, whereas a non-victim cache only receives modified CLs. In order to get from data volumes to runtime contributions of individual data paths, the machine model also requires data about the available bandwidth between adjacent caches, and whether transfers take place over a single bi-directional link or over two uni-directional links. Moreover, if an architecture provides an inadequate number of buffers to track outstanding transfers, the corresponding latency penalties must be included. Finally, the second part of the machine model contains a description of which transfers in the memory hierarchy can occur simultaneously. 4 2 Most modern cores feature one store and two load units but only have two addressgeneration units (AGUs), which means that in each cycle only two of the three load/store units can be supplied with memory addresses if complex addressing modes (e.g., base plus scaled offset) are used. In addition to the two-AGU shortcoming, the EYPC's cores have only two data paths between the register file and the L1 cache. 3 For several reasons (imperfect cache replacement strategies, prefetchers preempting data that could have otherwise been reused, etc.) the effective capacity of a cache is lower than its nominal size. In practice, the heuristic of halving the theoretical cache size delivers good estimates for the effective size. 4 As will be demonstrated later, we find that in practice, this rarely discussed architectural feature turns out to be much more important for single-core in-memory performance than other more prominent features such as SIMD width or cache bandwidths.

Performance prediction workflow
An overview of the performance-prediction workflow is provided in Figure 2. As indicated in the figure, the process can be divided into four steps: First, the runtime contribution of performing operations in the core (with all data coming from L1) is determined. Next, the runtime contributions of data transfers in the memory hierarchy are calculated (to this end, data transfer volumes in the memory hierarchy need to be determined). In a third step, the previously determined runtime contributions are put together to arrive at a single-core runtime estimate. Finally, based on the singlecore estimate from the previous step, multicore predictions can be given. In the following, each of the steps is discussed in detail.
3.4.1 Contributions of instruction execution in the core. The fact that some architectures cannot overlap data transfers between the register file and the L1 cache on one hand and the L1 and L2 caches on the other makes it necessary to separate the runtime contribution of operations into two components: T comp , which are cycles in which only computational operations occur, and T RegL1 , which are cycles in which at least one load or store operation takes place.
To estimate T RegL1 , first the numbers of load and store operations (n and n ) are determined by counting their occurrences in the loop body; the numbers are then divided by the respective throughputs, ω and ω , taking additional constraints specified in the machine model into account (e.g., a limited throughput for the overall number of load/store operations per cycle, ω / , caused by a limited number of AGUs). The corresponding runtime contribution is the maximum of all components: The number of cycles in which no load/store operations are carried out is determined in a similar way: Operation counts are found in the loop body. Each count is then divided by the operation's throughput documented in the machine model. As before, additional constraints have to be considered: For example, executionport conflicts (cf.   The fact that cores have an upper limit to the number of instructions they can retire per cycle can be modeled by dividing the total number of operations by a corresponding instruction-throughput limit ω total . Finally, loop-carried dependencies are accounted for by including the contribution of the longest cross-iteration dependency chain, T dep , when determining the overall runtime by applying the maximum to all individual contributions: 3.4.2 Contributions of data transfers in the memory hierarchy. Before the runtime contributions of data transfers can be determined, the data volumes transferred over the various data paths in the memory hierarchy need to be established. To this end, the location of the data set(s) in the memory hierarchy is derived from the data-set size(s) specified in the application model. Then, the load/store operations documented in the application model are revisited: For each operation, the corresponding data set is identified, and the transfers required to get the data from its current location in the memory hierarchy to the L1 cache are recorded. Along with the required transfers, the data volume is determined (e.g., four bytes per single-or eight bytes per double-precision floatingpoint number). Note that full CL transfers need to be taken into account even when CLs are only partially read or written (e.g., for strided but regular access). Moreover, the model will degenerate in case of truly random access patterns as latency contributions will dominate in this case [2]. Note that this process requires keeping track of previous data access to detect possible data reuse. While this can be done manually for kernels with simple data-access patterns, analysis of complex patterns is best left to cache simulators (e.g., pycachesim [8]). If necessary, the resulting numbers can be validated by measuring the actual data volumes using hardware performance events (e.g., with [21] or [5]). Once the data volumes have been established, the runtime contribution T i j of data transfers between levels i and j of the memory hierarchy can be calculated: The process works by first calculating the time the data link(s) connecting levels i and j are actually busy transferring data. To calculate this data-link busy time, T data i j , the data volumes transferred in each direction are divided by the bandwidth b of the link over which the data is transferred. The two directional components T i →j and T i ←j are then combined according to the information provided in the machine model. If there is a single bi-directional link over which transfers in both directions take place, the combined data-link busy time is the sum of both contributions. If there are two dedicated uni-directional links over which the transfers can take place, the overall data-link busy time is the maximum of both contributions. The overall data-transfer time, T i j , is given by the sum of the previously determined data-link busy time and (if applicable) the corresponding latency penalty specified in the machine model.

Combination of runtime contributions for single-core estimate.
To arrive at a single-core runtime prediction, the previously determined components are put together according to the overlap capabilities specified in the machine model. To this end, first, all non-overlapping components are added up. The result is then included in the set of overlapping components, and the total runtime estimate is the maximum of the resulting set: The following example will clarify the process: When discussing the model assumptions in Section 3.1, it was established that T comp and T RegL1 overlap on all processors. Let us further assume that the architecture under consideration has a multi-ported L1 cache, which enables the cache to simultaneously communicate with the register file and the L2 cache. Assuming no overlap of other transfers, the runtime estimate for an in-memory data set on this processor would be T = max(T comp ,T RegL1 ,T L1L2 ,T L2L3 + T L3Mem ). The runtime estimate T can be converted into a performance estimate P by dividing the amount of workW carried out in one loop iteration by the runtime estimate for the same, and multiplying the result with the core frequency: P = f core · W /T .
For our investigations f core was fixed, so converting from runtime to performance estimates is trivial. In practice, however, f core is often set dynamically on the authority of the operating system, the processor, or even the user. However, f core is virtually constant during the execution of a particular steady-state loop. This is because the metric used by the underlying mechanism (e.g., DVFS) to select f core does not change while the processor is in a steady state. For a particular kernel, f core can thus be measured via hardware performance events. For each kernel of a multi-loop application, f core value must be determined individually. See [12] for an investigation of the model's ability to deal with different core and Uncore frequencies.
3.4.4 Multicore prediction based on single-core estimate. Multicore estimates require as inputs the single-core runtime estimate T , and the time the memory interface is busy transferring dataT data Mem , which is the sum of all data-link busy times that involve the main memory (e.g., in a memory hierarchy with a victim L3 cache, where memory sends data to L2 and receives modified CLs from L3, T data Mem = T data L2Mem + T data L3Mem ). In the absence of shared resources (e.g., if the entire data set fits into core-private or scalable 5 shared caches), single-core performance P is expected to scale linearly with the number of active cores n, so the multicore estimate for n active cores is just P(n) = nP. If shared resources, such as the main memory interface, are involved, resource conflicts and the resulting waiting times must be considered. Here we employ a statistical model that is motivated by first principles: The utilization of the memory bus u is the probability of another core encountering a busy bus. For a single core, the utilization is given by the ratio of the time the memory interface is busy transferring data and the overall runtime estimate: u(1) = T data Mem /T . If multiple cores are active, the utilization is expressed recursively: In the numerator, the memory-bus busy time is multiplied with the number of active cores n since multiple cores are using the memory interface. The denominator is the expanded expression for the runtime estimate T , where a conflict time has been added to the data-transfer time involving the memory interface. This conflict time represents the average time that a core encountering a busy memory bus has to wait for the bus to become available to it. The conflict time encountered in a scenario with n active cores is given by multiplying the probability of a core hitting a busy memory bus, which corresponds to the memory utilization of the remaining cores, u(n − 1), with the time the other n − 1 cores are using the interface. This results in T conf = u(n − 1)(n − 1)p 0 , with p 0 being an empirical fit parameter. 6 For performance estimates, the memory-bus utilization is multiplied with the performance to be expected with fully saturated bandwidth: P(n) = u(n)P Sat . The memory-saturation performance, P Sat , corresponds to the bandwidth limitation of the Roofline model and is determined by dividing the amount of work per loop iteration by the memory-bus busy time, and multiplying the result with the core frequency: P Sat = W /T Mem · f core .

MACHINE MODEL CONSTRUCTION 4.1 Method to determine machine models
In the ideal case, all of the data required for a machine model would be available in vendor data sheets. In practice, however, this is rarely the case because important information is deemed irrelevant or, more likely, intellectual property and therefore omitted from specifications. Moreover, the interaction of different parts of the processor might lead to situations in which vendor-specified numbers are not attainable (see, e.g., the discussion on load/store throughput in Section 3.3). In the following, a method is presented that allows to establish machine models in cases where relevant information is missing, or the documented specifications turn out to be impractical for some reason.

Instruction throughput and latency.
At fixed core clock speed f core , the time t it takes the core to execute a large number n of independent 7 instructions of type i is measured. The throughput of the instruction is then τ i = n/(t f core ). Since we will usually use a work unit of one (high-level) loop iteration in the modeling procedure, the instruction throughput is multiplied by the appropriate SIMD width w to get the operation throughput To measure latency, an artificial data-dependency chain is introduced by making each instruction use the output of the previous instruction as its input. This forces each new instructions to be held at a reservation station until the previous instruction has completed. The holding time corresponds to the instruction's latency.
While implementing these two strategies sounds simple in theory, deriving a suitable instruction mix from a high-level language implementation can be difficult in practice because compiler optimizations get in the way. We solve this problem by side-stepping the compiler and hand-crafting the necessary code in assembly language. To automate the process, the ibench tool was developed, which comprises a measurement framework and a number of assembly-code files for the most widespread instructions of AMD, IBM, ARM, and Intel processors.

Topology and data flow in the memory hierarchy.
Information about the topology of the memory hierarchy, such as the number of caches, their sizes and properties (write-back vs. -through, victim vs. non-victim) are often well documented in vendor data sheets. Even if this is not the case, the data is easy to obtain, for most processors provide access to it over a well-defined interface. In case of x86, for instance, the cpuid instruction can be used to extract detailed information about the memory hierarchy, including the capacity, associativity, number of sets, inclusiveness, cacheline size, and more for each level in the hierarchy. Other processors offer similar mechanisms, and the Linux sysfs file system provides an architecture-independent interface to obtain the necessary data.
Information about data flow (i.e., the path data takes from a particular level in the memory hierarchy to reach a core's L1 cache) can be derived from the topology information. In most cases, only stores require special attention to determine whether store-misses trigger a write-allocate for the missed CL or if some optimization (as the one implemented in Marvell's ThunderX2) detects whether a full CL is written to avoid the write-allocate. Such details can be derived with the help of hardware performance events, which can be used to record the data volumes exchanged between different levels of the memory hierarchy.

Bandwidth, latency, and overlap in the memory hierarchy.
In most cases, only the bandwidths of selected caches are documented by vendors. Cache bandwidths can be determined by selecting a set of reasonable bandwidth candidates (e.g., 16, 32, and 64 B/cy), and examining which of the corresponding estimates best agrees with empirical data. To the best of our knowledge, no vendor publishes data on the overlap properties of their processors' memory hierarchies, so this data needs to be determined in a similar way.
The process of comparing estimates to empirical data is iterative: Once the bandwidth and overlap properties for a particular memory level have been established, the numbers can be used as input for different bandwidth and overlap assumptions in the next memory level. In the following the process is demonstrated on the SKL processor for the well-known triad [15]. On the SKL processor, one loop iteration of the triad (A[i]=B[i]+s*C[i]) comprises two loads (one from each of the input arrays B and C), one fused multiply-add (FMA) (to calculate the result), and one store (to write the result to the output array A). Using ibench, the following operation throughputs were established: ω = 16/cy, ω = 16/cy, ω = 8/cy, and ω / = 16/cy. According to equations (1), (2), and (4), for a data set in the L1 cache this leads to a single-iteration runtime estimate of In Figure 3a we compare this prediction to measurements. Note that the estimate corresponds to the lower limit of runtime, which is actually attained by the running code if the loop is long enough.
If the data set resides in the L2 cache, a total of 32 B are transferred between the L1 and L2 caches per iteration: 8 B for each of the double-precision floating-point numbers from the input arrays B and C, 8 B for the write-allocate to A, and 8 B for evicting the updated element of A to the L2 cache. Bandwidth assumptions of 16, 32, and 64 B/cy yield estimates for T L1L2 of two, one, and one-half cycle, respectively. Figure 3b compares the estimates to empirical data. The assumptions of no overlap and a bandwidth of 64 B/cy match the measurements strikingly well; incidentally, the L1-L2 cache bandwidth as advertised by Intel is also 64 B/cy. With L1-L2 cache bandwidth and overlap properties established, we can move on to the L3 cache. The data exchanged between the L2 and L3 caches is 48 B because each of the three eight-byte reads from L3 (two from the input arrays B and C, one write-allocate from the target array A) triggers the eviction of data replaced in the L2 cache to the victim L3. L2-L3 bandwidth assumptions of 16, 32, and 64 B/cy yield estimates for T L2L3 of 3, 1.5, and 0.75 cy, respectively. Figure 3c compares estimates derived from the different bandwidth and overlap assumptions to empirical data for a data set in the L3 cache. In this case we find that that assumptions of no overlap and a bandwidth of 32 B/cy agree very well with the measurement. Finally, for in-memory data sets, only different overlap assumptions must be made, since the sustained memory bandwidth is determined by measurement (55 GB/s for one SNC domain, which for f core =2.2 GHz is 25 B/cy). Figure 3d compares the resulting estimates to empirical data (black line) and we find that in memory, too, no overlap of data transfers occurs.
In addition to runtime measurements obtained with SNC mode and the LLC prefetcher (PF) enabled, Figure 3d also shows data where these features were disabled. This is to demonstrate that in some settings, bandwidth and overlap are not sufficient to describe the empirical behavior in a satisfying manner. Then, a latency penalty must be added to data transfer times (see Section 3.1). Table 2 shows the machine models that result from applying the previously introduced method to the processors from the testbed.

Results for investigated processors
The upper part of the table lists relevant operation throughput (ω) and instruction latency (λ) values. The center part lists bandwidths and latency penalties (if applicable) in the memory hierarchy. Note that in cases where two numbers are provided (e.g., 64+16 B/cy for PWR9's L1-L2 bandwidth), two uni-directional data paths exist between the caches. In such instances, the first number corresponds to the bandwidth of sending data from the underlying to the overlying cache, and second number to the bandwidth in the opposite direction. Note that listed memory bandwidth corresponds to that of a single NUMA node (SNC node on SKL, Zeppelin on EPYC, full-chip on TX2 and PWR9). Memory bandwidths are specified as ranges, since different data-access patterns exhibit slightly varying sustained memory bandwidths. The last part of the table contains overlap capabilities and additional information on cache types.

CASE STUDY: PCG
We use a matrix-free PCG solver to demonstrate the viability of our approach in real-world scenarios. The solver is preconditioned using the well-known symmetric Gauss-Seidel iteration and relies on the second-order finite-difference method for discretization. We use it to solve the steady-state heat equation in 2D. The sparse matrix entries are not stored explicitly but hard-coded into a 2D five-point stencil representation. Hence, the solver is similar to the well-known HPCG but shows a more interesting phenomenology:   As opposed to HPCG, where all loops are limited by data transfers due to explicit matrix storage, our preconditioner is bound by incore pipeline hazards. All computations and data storage are in double precision. Algorithm 1 shows the entire PCG method. It is composed of a matrix-free sparse-matrix-vector multiplication (SpMVM) which we refer to as , a symmetric Gauss-Seidel pre-conditioner ( ), and three BLAS-1 routines: product, vector , and . The code is implemented in C++ and parallelized with OpenMP. The Gauss-Seidel kernels, which have loop-carried dependencies, are parallelized using a well-known wavefront technique that preserves the numerical behavior of the serial code [7]. The preconditioner can be vectorized by, e.g., coloring methods, but this would alter the convergence and render the loops data bound, which is not the scenario we want do showcase (see above).

Application models
The total problem size (n i ×n j ) was chosen to be n i = 25000 (inner, leading dimension) and n j = 2000 (outer dimension), so that all arrays reside in main memory. In the following, application models for all of the PCG components are presented. x = x + λp -> 10: r norm = r, r -> 12: z = Pr -> preconditioner 13: for i = 1 : n i − 1 do 3: Algorithm 3 High-level representation of forward sweep 1: for j = 1 : n j − 1 do 2: entails two loads, one FMA, one multiplication, and one store. The product (d+=x[i]*y[i]) and (n+=x[i]*x[i]) have two and one load(s), respectively, along with an FMA. These kernels can be fully and effectively vectorized by all compilers.
For kernels with cache reuse such as and , reusedistance analysis (best done using the layer condition [3]), blocking factors, parallelization strategies, and scheduling techniques have to be taken into account. The kernel is shown in Algorithm 2, with w * representing different weights obtained from the matrix A. The kernel requires two FMAs, two additions, one multiplication, one store, and five load operations. SIMD vectorization is straightforward, but in contrast to the BLAS kernels, different loads can hit different memory hierarchy levels depending on the reuse distance. For the considered inner dimension of n i = 25000 and outer (j) loop parallelization employed in our code, the layer condition would require 4n i elements per thread to fit in a cache. The lowest (i.e., outermost) cache that satisfies this criterion will only have a miss for one of the four elements on the right-hand side, while the cache levels above it will have three. Storing to implies a write-allocate through the whole memory hierarchy on all processors, and, at some point, the writing back of the newlycomputed data to memory.
The kernel is a symmetric operator comprising a forward and a backward sweep. The forward sweep ( ) is shown in Algorithm 3, and requires two FMAs, one multiplication, one store, and three load operations. The kernel is similar to , but it reads from z j,i −1 and writes to z j,i , causing a loop-carried dependency. A wavefront technique can be used to parallelize the kernel [7], and the corresponding layer-condition criterion requires 3n i elements to fit in a cache. The outermost cache that satisfies this condition will have only two load misses on the right-hand side, while the others would have three. The backward sweep (not shown here for brevity) is similar, but loops are traversed in reverse direction and w c r j,i in is replaced with z j,i . The analysis of the kernel follows the same approach, but there is one less load miss.
Both loops have loop-carried dependencies, preventing SIMD vectorization. As a result, a critical path analysis is required. In the element z j,i written in a particular iteration is read in the next as z j,i −1 . The actual delay caused by this dependency can vary depending on the code generated by the compiler. Figure 4 shows the result when using the Intel compiler and the critical path of the generated instruction mix includes one FMA and one multiplication. The ARM clang compiler produces code that does not keep z j,i in a register across loop iterations, leading to an extra delay caused by storing and loading the element. Due to its particular unrolling strategy, IBM's compiler generates a combination of the two previous variants.

Runtime predictions
In the following, the proposed model is validated by comparing the model estimates to empirical performance for the and kernels, as well as the full PCG algorithm. Note that estimates correspond to the runtime of a single high-level loop iteration.

5.2.1
Single-core. On the SKL processor, retiring the kernel's multiplication and FMA operations takes T comp ≈ 0.0625 cy. The one store and two load operations take T RegL1 ≈ 0.1875 cy. Per-iteration data-transfer volumes are 24 B between the L1 and L2 caches (one load each from x and y, one write to y), 32 B between the L2 and L3 caches (one load each from x and y, and two corresponding evicts since the L3 is a victim cache), and 24 B between L3 and main memory (see L1-L2 transfers). Using the bandwidths documented in the machine model, this results in contributions of T L1L2 = 0.375 cy, and T L2L3 = 1 cy. For the measured memory bandwidth of 60 GB/s, which for f core = 2.2 GHz corresponds to a bandwidth of 27.3 B/cy, T L3Mem is 0.88 cy. Since all data transfers are non-overlapping, the runtime estimates are T L1 = 0.1875 cy, T L2 = 0.5625 cy, T L3 = 1.5625 cy, and T Mem = 2.4425 cy.
Intermediate and final single-core estimates for on SKL, and all other processors, are given in Table 3. Cases where data volumes change in the victim L3 cache (depending on whether the input data resides in the L3 or main memory) are indicated by listing two numbers in the table, the former corresponding to the data-transfer time estimate for data in the L3, the latter for data in memory.
These single-core estimates are compared to empirical data in Figure 5. The data indicates that the model manages to describe empirical performance on all investigated processors with high accuracy.

Multicore.
The and kernels were selected to investigate the model's capability to accurately describe multicore performance. Being a data-bound streaming kernel, proves particularly suitable to investigate the memory subsystem of the investigated processors and their scaling behavior.
, on the other hand, is core bound for all architectures when executed on a single core. However, when increasing the number of cores, NUMA properties turn out to have a significant impact on performance. Figure 6 shows the multicore scaling of on all architectures up to a full socket using "close" thread affinity (i.e., filling cores consecutively through ccNUMA domains). For SKL we observe the typical saturation behavior (at ≈ 2.2 GIT/s = 53 GB/s) of bandwidth-bound code within a single SNC domain. Using the second SNC domain doubles the bandwidth and hence performance by a factor of two as predicted by the model. The scaling behavior of EPYC exposes its main hardware features: Within a single CCX (three cores) the shared L3 bandwidth does not scale across the cores and hits a maximum of 32 B/cy. The best bandwidth attained on a single CCX is 30 GB/s compared to 33 GB/s for the entire cc-NUMA domain (a "Zeppelin" die); we speculate that this is a faint echo of non-scalable L3 cache. Scaling across the Zeppelin dies is linear as expected. On the TX2, we initially observed a significant deviation: The compiler-generated code (blue line) fell short of the model by as much as 40% for a single core and 10% after saturation. The prompted investigation revealed that ARM's compiler did not generate prefetch instructions, which prove imperative for best The kernel is latency bound due to the loop-carried dependency discussed in Section 5.1. There are two peculiarities that make predictions of the parallel kernel challenging: First, the wavefront parallelization requires a barrier synchronization after each inner loop traversal. For the chosen problem size, the corresponding OpenMP-barrier was found to cause non-negligible overhead. We addressed this by benchmarking the OpenMP barrier for all relevant compiler-hardware combinations and included the barrier time as additional overhead. Secondly, although parallel firsttouch page placement works fine for all other loops, the parallelwavefront algorithm accesses data in parallel across the inner dimension. Since data placement is done with static OpenMP scheduling across the outer dimension, this leads to all threads accessing the same ccNUMA domain most of the time during the sweeps. It turns out that this effect can be incorporated into the model as well. To this end, the sustained memory bandwidth is measured across all ccNUMA domains with data residing in only one domain. This data can then be used as a bandwidth limit when using multiple ccNUMA domains on SKL and EPYC. Figure 7a compares performance estimtes to measurements for across the cores of a socket on all architectures. The deviation from the model is generally smaller than 10% when using multiple NUMA domains, and below 5% when looking at a single ccNUMA domain. The results  indicate that the model with enhancements described above (barrier overhead, ccNUMA contention) delivers a good qualitative and quantitative description of the performance behavior.

Composition.
With estimates for individual kernels in place we can now present multicore-scaling data for the full PCG algorithm. Composing the model from single-loop predictions is simple due to the time-based formulation of the ECM model [19]. In case of PCG we have three invocations of , two of , one forward-and backward-sweep each, as well as one of . Figure 7b shows the comparison of the model with measurements for all four architectures. Again, the general model error is below 10%, and less than 5% when looking at single ccNUMA domains. The slightly larger deviation beyond 12 cores on TX2 can be attributed to the fact that we use compiler-generated code instead of hand-crafted assembly for the CG solver on this machine. The lack of prefetching causes a 10-15% performance breakdown of data-bound loops beyond the saturation point (see Figure 6c), which we ignore in the model. On EPYC and SKL we observe very low performance for OpenMP reductions across ccNUMA domains (much larger than the considered OpenMP barrier) with the Intel compiler, causing the slight deviation beyond one domain.

RELATED WORK
There are two capable analytic (in the sense of "first principles") performance models for steady-state loop code on multicore CPUs: the Roofline model [10,23] and the ECM model [6,12,20]. Both have been subject to intense study, refinements, and validation, and their areas of applicability are well understood. However, while there is ample data available for Roofline on a wide variety of architectures [13,16], one drawback of previous applications of the ECM model [2,4,11,19,20,22,24] is that they were mostly restricted to Intel processors. We provide the first thorough crossarchitecture study of the model.
The Roofline model has the attractive property that it can be easily separated into a machine part (memory and cache bandwidths, peak performance) and an application part (computational intensity). There is no previous work that has done the same with the ECM model. A comparison between Roofline and ECM for several stencil algorithms can be found in [20]. A drawback of the Roofline model is that it requires a large amount of phenomenological input such as measured bandwidths for all core counts and all memory hierarchy levels [13], while the ECM model only needs the saturated memory bandwidth and the machine model (i.e., overlap assumptions).
Advanced curve-fitting and machine-learning techniques combined with hardware performance monitoring data have been used in the past to model the performance of code [1,17]. Although these approaches are useful in practical settings, e.g., for predicting program runtimes with a goal of optimized resource scheduling, the deepest insights are gained through first-principles models such as Roofline or ECM.

CONCLUSION
We have shown that it is possible to set up a well-defined workflow for modeling the serial and parallel runtime of steady-state (sequences of) loops with regular data access patterns using the analytic ECM performance model. One can, with minor exceptions, cleanly separate machine properties from application properties. Four multicore server processors were investigated, and we could demonstrate that despite their obvious differences the main properties needed to set up a useful machine model can be summarized in a few parameters. The performance, including scalability across cores and ccNUMA domains, of an OpenMP-parallel preconditioned CG solver with wavefront-parallel Gauss-Seidel sweeps could be described with a modeling error of 5% or less in most cases.
We found the overlapping property of transfers across data paths in the cache hierarchy to be the pivotal architectural feature governing single-core performance for data-bound loops. A design with very strong in-core performance (e.g., via wide SIMD execution) but a non-overlapping memory hierarchy may well be inferior to a weak core with strong overlap, as our comparison of Skylake SP and AMD Epyc shows. The architecture with the lowest incore computational performance, Power9, came out first in serial and parallel memory-bound performance. The Cavium ThunderX2 processor can compensate its rather low in-core performance with good memory bandwidth and a large core count.
All modeling procedures carried out in this paper were done by hand. Some components, e.g., the construction of a runtime prediction from code and a (given) machine model, can be supported by tools [9]; others, such as the derivation of overlapping properties, would be very hard to automate. However, the purpose of performance modeling is not just prediction but also insight, and manual analysis sharpens the view on the relevant details.

ACKNOWLEDGMENTS
We thank Thomas Gruber for helping to port the tool suite to IBM's P 9 architecture. We also thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for providing access to their P 9 cluster.