Performance Limits Study of Stencil Codes on Modern GPGPUs

paper


Introduction
The theoretical performance peak of the modern GPU exceeds 10 TFLOPS for single precision, and this could be a trillion cell updates per second for a variety of stencil schemes.However, it is well known that this performance can not be achieved, since the stencil codes are not compute-bound [7].The global memory throughput limit for modern GPU corresponds to approximately 1% of their peak computing performance.Furthermore, in the implementations of the stencil codes, other factors, such as data access overhead and latency, limit the performance.
In this paper we study the performance limits of different algorithmic approaches, applied to a sample problem, and aim to find the highest limit of the achievable performance efficiency for the stencil computing.For this, we develop the implementation using the accumulated experience of CUDA programming so as to minimize the performance losses.We use an advanced quantitative performance model to make a thorough analysis of the performance bottlenecks, and develop it further to account for the latency of different levels of GPU memory.
Increasing the performance efficiency of the stencil implementation is an intricate task, and multiple factors should be take into account.The cell update requires its data and the data from its neighbours to be accessed.Thus, each cell data is accessed multiple times during one time iteration, and the exact way this access is performed depends on the algorithm.The cache hierarchy is developed so as to accelerate the data accesses with high locality in time and space.To take advantage of it in the stencil computing, the tiling and blocking techniques are used [5, 8-10, 13, 14, 20, 27, 30].These involve a modification of the data space traversal and decomposition of this task into subtasks that may be given to different processing units.Various polyhedral optimization techniques are based on this idea [6,22].
Temporal tiling [5,8,13,14,20,27,30] is the method to perform several cell updates on the same data portion that is located in the fast memory.After this, more data will be sent,

Problem Statement and Cross Stencil
The following (2r + 2)-point stencil computation is constructed (Fig. 1): Here r is the stencil radius (half-width), The values of α ±l for specific r may be readily computed manually or generated with scripts [19].

Stepwise Algorithm
By the word "stepwise" we denote the most common way of stencil implementation on GPU [32].The data is localized in the global device memory.This algorithm is the most intuitive   to implement with the tools available in CUDA [15].However, even in this case care must be taken to get close to the optimal performance.
The two time layers u n and u n−1 are stored in the global memory.The computation kernel computes one update for each cell according to the stencil (1).Each thread is assigned to one cell [32].In total, K/threads CUDA-blocks are executed, where threads is the number of threads per block (usually 1024 or 512 or 256).This CUDA-kernel is executed N times in a CPU loop.
The stepwise algorithm is illustrated in terms of the problem dependency graph in Fig. 2. One point represents one u n k computation.Inside the outlined areas there are no data dependencies.The domain size in x is limited by the size of the device memory, which means that about 10 9 cells may be stored.There is a data dependency between adjacent outlined areas, as shown by the arrow.After a CUDA-kernel computes one such area and exits, the data is synchronized, and the next loop iteration is started.

Performance Testing
The dependency of the performance of the stepwise kernel on the problem size has been tested.The number of cell updates per second ( cell•step s ) is chosen as a unit for performance P .Thus, P = K•N time where time is the time per program run in seconds.In Fig. 3 this dependency is shown for r = 1, 2, 3 and for the two GPU: GTX 1070 (Pascal) and RTX 2070 (Turing).We have tested single precision (32 bit) and double precision (64 bit).Note, that target applications for desktop GPU require only single precision floating point operations, since its double precision performance is much lower.At the start the linear increase is observed, due to low occupancy [1] of high parallel GPU device.The number of threads that may be started on an Streaming Multiprocessor (SM) depends on the required registers number (up to 64 thousand 32 bit registers per SM) and the shared memory size per thread (up to 48-64KB).The stepwise kernel does not use the shared memory explicitly and uses no more than 32 registers per thread.Thus, each SM may process up to 2048 cells simultaneously.There are 15 SM on GTX 1070 and 36 SM on RTX 2070.Therefore, 30 thousand and 72 thousand cells are required to utilize all resources of GTX1070 and RTX2070 correspondingly.This is in accordance with the cell number K value at which the performance P stops the linear increase.It gradually saturates, then falls to the constant P = P sw ∼ 10 10 cell•step s .We note that in the single-precision computation the performance does not depend on the stencil radius.This is an evidence that the current implementation is memory-bound, since the operation count increases with the stencil radius.For double precision a slight variation in performance for different r is observed.In that case, the problem is closer to compute-bound domain since the peak computing performance for double precision is much lower.
The peak at K = 10 5 is clearly seen in the single-precision performance, which is higher than P sw .It is explained by the use of the L2 cache.Since the problem is memory-bound, P is determined by the global memory bandwidth.When the data may be localized in the L2 cache (2MB for GTX 1070, 4MB for RTX 2070), L2 memory bandwidth determines the performance.It is 1.5-2 times higher.This size corresponds to K = 2.5 • 10 5 and K = 5 • 10 5 cells with single precision, which is exactly the location of the end of the peak and transition to P sw .

Roofline Model
The Roofline [21] is a visual performance model that provides an estimate of the performance limit from the two fundamental ceilings: peak computing performance and memory bandwidth: ( Here Π alg is the algorithm achievable performance, and Π peak is the theoretical peak computing performance in the elementary operations per unit of time.Θ is the memory bandwidth.
Dalg is the arithmetic intensity of the algorithm, where O alg is the number of operations in the algorithm and D alg is the data traffic to and from memory.
Usually, Π peak and Θ are taken from the device specification.However, since these values depend on the frequency which may vary in wide range, they should be calculated accurately as follows.Arithmetic instruction throughput τ is the number of instructions that may be executed per cycle.These values can be found in [3], and are determined by the GPU architecture.The number of SM µ can be obtained by CUDA Runtime API.To get the actual frequency of SM ν SM , the monitoring tools (such as nvidia-smi or nvidia-settings) are run during the code execution, since the frequency may depend on many factors.Then, Π peak = τ µν SM .Similarly, with the use of CUDA Runtime API the memory bus width β can be found out.With the use of monitoring tools the memory frequency ν mem is measured during the code run.Thus, Θ = 2βν mem .The factor of 2 is explained by the Graphics Double Data Rate SDRAM (GDDR SDRAM) feature.All these values are collected in Tab. 1 and Tab. 2. The unit for measurement of the performance Π is one Fused Multiply Add (FMA) operation, which may be one or two floating point operations (FLOP), while the symbol P is used for performance in cell updates.The compiler might optimize all operations to FMA.However, we have decided to force compiler to use FMA operations whenever possible via intrinsic (built-in) functions.
For one cell update O alg = (2r + 1) FMA cell•step operations are required.The operations are grouped manually into FMA as follows: To compute u n+1 k , (2r + 1) values are loaded from the u n layer, one value is loaded from the u n−1 layer, and one value is saved: (2r+3) values total.However, since the update is stepwise, the neighboring values that are updated by other threads may be stored in the L2 cache.Actually, each thread only loads u n−1 k and u n k to, and writes u n+1 k from the global memory.To estimate the required number of the L2 accesses it is necessary to take into account the possibility of misaligned access to the cache lines of the neighbor cells.The number of accesses to L1 about twice more that the number of accesses to L2, and it is much faster than the global memory.Thus, the L1 exchange may be neglected in the Roofline model.Data throughput is D alg = 12 The performance Π alg [ GFMA s ] is computed from the performance P sw [ cell•step s ], that was obtained as a saturated performance for large K in the performance tests, and O alg [ FMA cell•step ] as The Roofline graph for our implementation of the stepwise algorithm with data localization in the global memory is plotted in Fig. 4.
For single precision, our results fall into the memory-bound domain.The performance is limited by the global memory bandwidth.The overhead is observed to be negligible, as the result is close to the ceiling.With the increase of the stencil radius r the performance increases, due to the increase in the arithmetic intensity.However, peak performance to memory bandwidth ratio is about 100 times more then optimal ratio for stencil calculations with stepwise algorithm.
The double-precision peak performance on the chosen GPU is 32 times lower than singleprecision peak ones, so performance to memory bandwidth ratio is closer to operation intensity.At r = 1 the result falls into the memory-bound area, and as the arithmetic intensity increases, the result comes closer to the compute-bound domain, and reaches it at r = 4.The performance is 40-50% of Π peak and does not increase in the compute-bound domain.
It is likely that if a Tesla architecture GPU is used, the Roofline would look similar for single and double precision, since that architecture is better suited for the general purpose computing.This means that for the consumer-level GPU even the stepwise algorithm is enough to reach the compute-bound domain at double precision, since the double precision performance is very low.Hereafter, only single precision implementations are considered.

Recursive Domain Decomposition
For single-precision, to move the problem closer to the compute-bound domain, the arithmetic intensity should be increased.For this purpose, we propose a Recursive Domain Decomposition (RDD) algorithm to localize the cell data in the SM registers.The classical Domain Decomposition (Fig. 5) divides the computing domain into equal domains, and each domain is assigned to its processor element.The local memory of that processor element stores u n k−1 and u n k layer data for its domain, and the element computes cell updates for them layer by layer.Each step it exchanges data with the neighboring elements.However, one device contains at least 2 levels of parallelism, SM and CUDA threads.Thus, we choose to implement 2-level RDD (Fig. 6).This approach is close to the tiling (blocking) methods, such as [9], since the performance gain is achieved by the use of the faster memory for each process.Each SM is assigned a domain that corresponds to the size of its register file.In turn, inside these domains thread-level decomposition occurs.Each thread is assigned with not one, like in the stepwise algorithm, but a localized group (grp) of cells.Each SM contains 64 thousand 32 bit registers.In the ideal case, two registers per cell are required (u n k and u n k−1 ).Thus, up to 32 thousand cells may be stored and processed per SM.Including some possible overhead, the actual number may be estimated as 20 ÷ 25 thousand cells per SM.That is, 300 ÷ 375 thousand cells total per 15 SM of GTX 1070, 720 ÷ 900 thousand cells total per 36 SM of RTX 2070.This is more than enough for many 1D problems.

Synchronization
For the data exchange between SMs the L2 cache is used, and for the data exchange between threads the shared memory is used.The inter-block synchronization was an open issue up to the release of CUDA 9. Since CUDA 9 for devices with Compute Capability 6.1 or higher, CUDA Cooperative Groups (CG) is an universal API for synchronization [15].It may be applied to a group of threads of arbitrary size, from one warp to all threads of several devices, situated on one node.This is a barrier synchronization, that is, a chosen group of threads is synchronized altogether.This mechanism seems convenient for our purposes, and we have applied it to our code.
Along with it, we choose to manually implement and test the performance gain of the local block synchronization with a semaphore, the classic synchronization primitive.The semaphore limits the number of parallel processes that can read and write the shared data.In this case two neighboring blocks work with the same data: one reads and the other writes.The pointer array stores the semaphores, assigned to the data section which is subject to the racing condition.The length of the array is equal to twice the number of blocks, since each domain has two regions, which are accessed by other blocks: at the start and at the end.The semaphore data type is integer and it has two states: READABLE (0) and WRITEABLE (1).Before reading the data chunk from the other block, the corresponding semaphore is checked to be READABLE, using while loop terminated by a semicolon.After the read from memory, the value WRITABLE is written into the semaphore, since no other block requires this data.The data can now be overwritten by the adjacent block.
CUDA CG synchronization appears to be inefficient for our problem (see the comparison in Fig. 7).It is the possible reason that the synchronization is called for the whole grid of blocks at once, and all SMs are synchronized.This is superfluous for ensuring the correctness of the data read.Actually, it is sufficient to wait only for the two blocks (in 1D) to complete the work up to this point.If the synchronization of one block with only its neighbors was implemented in CG, it would possibly lead to the acceleration of the memory access.

Performance Testing
The RDD algorithm is defined by the three parameters: the number of cells per thread grp, the number of threads in a block and the number of blocks.The number of the domains of the first level of the decomposition is equal to the number of threads, the number of the domains of the second level is equal to the number of blocks.The latter is equal to or divisible by the number of SM in GPU.In the tests we take the maximum possible number of cells, thus grp•threads•blocks ≈ const, any two of the parameters may be varied.We have tested various threads and blocks parameters."Heavy" blocks and threads, that is, the blocks that use so much shared memory and register space that it may be assigned to an SM only one-by-one, have proven to be more efficient.We fix the number of threads per block to 256, since it is preferable to use all of the 256 2 = 65536 registers, and the CUDA compiler does not allow more than 256 registers per thread.
In Fig. 7 the dependency of P on grp is shown with threads = 256 and maximum possible blocks.The dashed line in the saturated stepwise performance is P sw .The two groups of lines correspond to the two types of implemented block synchronization approaches.The cooperative groups approach seems to be not efficient enough.Indeed, with this type of synchronization, the Performance dependency on the number of cells per thread grp parameter in the RDD algorithm performance hardly exceeds the performance of the stepwise algorithm.In the following we use only the manual synchronization method with semaphores.Low grp provide less performance.The performance grows linearly up to grp = 32.For grp in the 32 ÷ 80 range the performance shows little variation.It shows that the performance of the code with grp = 32 and 2 blocks per SM and the performance of the code with grp = 64 and 1 block per SM are close.As grp exceeds some peak value, 64 or 80, the abrupt fall of performance is seen.At this point, the data could not be localized in the registers, and register spilling [3] occurs.
In total, the obtained performance is P RDD ∼ (300 ÷ 600) • 10 9 , depending on the stencil radius r.

Roofline Model
Since the two memory levels are engaged in the RDD algorithm, the two bandwidths are to be considered.Thus, the Roofline is first defined by Here the denominations are as in section 2.2, and the extra subscript denotes the type of memory (L2 cache or shared).As before, O alg = (2r + 1) FMA cell•step .The data throughput is estimated as follows.In each block, there are grp • threads cells.Each block exchanges 4r values of 4B size.Thus, assuming grp = 64 and threads = 256, Similarly, each thread has grp cells and exchanges 4r values, The operational intensity is , inter-thread via shared memory .Table 3. GPU characteristics: α is the shared memory efficiency, µ is the number of SM, ϕ is the number of shared memory banks per SM, β is the bank width, ν gr is the graphics clock rate, Θ sh is the shared memory bandwidth The shared memory bandwidth may be calculated as Θ sh = αµϕβν gr , all these parameters and their descriptions are gathered in Tab. 3. The L2 cache bandwidth may be estimated as Θ L2 = (3 ÷ 5)Θ [2], where Θ is taken from Tab. 2. Now, it is easy to see that Θ sh • I alg,sh Θ L2 • I alg,L2 on actual threads values, therefore the inequality (5) can be reduced to The Roofline model ( 6) is plotted in Fig. 8.The markers show the performance for different values of grp.The color of a marker signifies the value of r.
The resulting performance for highest points is about 50% of the theoretical peak value.This result was obtained by localization of data in the registers and "heavy" blocks and threads (grp 1) Such efficiency is enough for many applications.However, in the Roofline estimate (Fig. 8) we see that another limitation acts as a bottleneck.Thus, we propose the following algorithm.

Recursive Domain Decomposition with Halo
Further increase in performance seems to be impeded by the high latency of L2 cache.This fact prevents the complete use of the L2 memory bandwidth.This may be mitigated by the introduction of a halo of redundant compute [20,30].
The key idea of halo is simple: instead of the exchange of D data each step, the exchange of ∼ H • D data each H steps takes place (Fig. 9), that is a kind of temporal tiling.Since the data is required in each step for correct simulation, the domains are overlapped.In the overlapped domain (halo) on the interface of two SMs, the cells are updated on both SMs (Fig. 9).At start, both SMs store the correct values of all cells, that are assigned to be updated on them, and r more values at each side that are required for the update.After the first update, the data is not enough, so r cells on each side of the domain are incorrect after the second update.The number of correct values is decreased in each step between the synchronization events each H time steps.Then, the data exchange takes place to actualize both domains.The size of the halo is 2r(H − 1).The data to be exchanged are the cells of the half of the halo, and the r more values that are necessary for the stencil computation on the next step, 2r(H−1) 2 + r total.Since the stencil requires two time layers, 2r(H−1) 2 − r more values are required from the previous time step.The data required for the exchange from the two time steps is outlined by the exchange box in Fig. 9.
Thus, we choose to improve the RDD algorithm to the RDD with Halo (RDDhalo) algorithm by introducing the overlapped region in the inter-thread and the inter-block interfaces.Two important remarks should be made here.First of all, the overhead for redundant computation appears in comparison to the RDD, since some cells are computed twice.However, the size of the halo (2r(H − 1)) is smaller than the size of the domain (grp • threads ∼ 15 ÷ 20 • 10 3 ) by several orders of magnitude.With the increase in H the overhead may become significant, but the goal of hiding the latency by decreasing the number of synchronization events is achieved much earlier.Second, while the redundant computations may be skipped, it is better to perform these computations nonetheless.This is due to the SIMT (Single Instruction, Multiple Thread) architecture of GPU, which dictates the homogeneity of the thread computing.Thus, the conditional statements are not used in the implementation, and the incorrect cells continue to be updated until the synchronization event.
Although the halo was implemented both for inter-thread (H b ) and inter-block (H g ) interfaces, only the inter-block halo has a significant influence on the performance.The shared memory latency is low enough so that H b = 1 (no halo) or H b = 2 is enough to completely conceal it.

Roofline Model
The operational intensity is twice lower than in the RDD algorithm, since two time layers are exchanged each time.Nevertheless, it is still high enough for the problem to be compute-bound.
On the other hand, the latency overhead limits the performance.We may write where O sync = KHO alg is the operation count between synchronizations, K is the number of cells, Λ sync = min r,grp t step is the inter-block synchronization time, where t step is the elapsed time to perform one step of RDD algorithm (H g = 1).Λ sync = 0.84µs and Λ sync = 0.90µs for GTX 1070 and RTX 2070, respectively.The Π peak and O sync /Λ sync limits are visualized as a Roofline in Fig. 10.The markers show the performance of our implementation for different values of grp, H g and H b .The graph confirms the attention to the latency overhead, and the fact that it is overcome with the introduction of halo.The highest performance we achieved and the parameters that were used are presented in Tab. 4.

Conclusion
In this work, we have brought to attention several insights on the high performance stencil code GPU implementation.We have carried out performance limits analysis with the Roofline model.In this paper we have detailed the method for acquisition of both hardware (memory bandwidth of all levels) and algorithmic (operational intensity) parameters.Since the results that were obtained by our code match the obtained performance limits precisely, we can conclude that the implementation has minimal overhead and that the Roofline estimation was correct.
We have found a way to get a performance of more than 90% of the compute-bound limitation, and more than 80% for larger stencil radius.According to our previous experience, in the practice of implementation of codes in scientific computing, the omnipresent issue is to evaluate the progress on performance optimization.That is, whether the code has reached the performance limit, or it may be further improved by increasing the operational intensity, utilizing better programming tools, reducing overhead.The impact of the current paper is the evidence that the stencil codes performance may be driven close to the peak if the computations are localized in the highest level of the memory hierarchy of GPU, namely in the register file.In the stepwise implementation, the obtained efficiency value is ∼ 10 10 cell•step s , and in the RDDhalo algorithm, it is ∼ 10 12 cell•step s .Both correspond closely to the Roofline estimate.The RDDhalo performance is compute-bound and reaches 92% of the peak computing performance.
We assume that it is the computing performance peak limit in any further stencil codes implementations.Such codes are relevant, for example, in electromagnetic wave simulation with the FDTD method, elastic wave simulation with the Levander scheme.More complex multi-level numerical schemes often use the cross stencil as one of the computation stages as well [17].As is, the wave equation simulation is used both in optics and in seismic codes.In case the purpose of the simulation is the solution of inverse problem, such as image reconstruction in seismic exploration, double precision is excessive and single precision will suffice.
Any other stencil code, including the 3D extension of the current scheme, is more complex than the one used in this work.So the obtained performance efficiency value is unlikely to be surpassed on the current hardware.The data size in the problem fits the register file, which is comparatively small.However, the common trend in newer GPU is the increase of the register file space.It is 3.75 MB on Kepler, 6 MB on Maxwell, 14 MB on Pascal, 20 MB on Volta [16].
For 3D problems, our approach may be used in temporal tiling algorithms.As an example, in the wavefront tiling, one wavefront slice may fit in the register file.In case it is large enough, the memory transactions with the global memory may be concealed, and the performance in the tile would determine the performance efficiency, save for the newly introduced overhead.We intend to use the underlying algorithm in our projects on developing 3D wave simulation codes with LRnLA algorithms [26].
Performance Limits Study of Stencil Codes on GPGPUs 98 Supercomputing Frontiers and Innovations

Figure 1 .
Figure 1.Cross stencils for r = 1, 2, 3: 3 time layers, 2r + 1 points in space on the middle layer; the blue dots are read and the green dot is updated in the stencil computation

Figure 3 .
Figure 3.The stepwise algorithm performance dependence on problem size

12 FMAB , for single precision 2r+1 24 FMAB
for single precision and D alg = 24 B cell•step for double precision.The arithmetic intensity isI alg =O alg D alg = , for double precision .

Figure 4 .
Figure 4. Roofline model for the stepwise algorithm implementation.The markers show the highest performance result obtained with our code

Figure 6 .
Figure 6.Two-level Recursive Domain Decomposition (RDD) Algorithm.Domains of the first level are outlined in blue, domains of the second level are outlined in orange

Figure 7 .
Figure 7. Performance dependency on the number of cells per thread grp parameter in the RDD algorithm

Figure 8 .
Figure 8.The Roofline model against shared memory bandwidth for the RDD algorithm

Figure 9 .
Figure 9. RDDHalo algorithms on the dependency graph.The cells that are computed correctly and stored on the two adjacent SM are shown in two colors.The data that should be sent in the synchronization is outlined in colored boxes

Figure 10 .
Figure 10.Latency limitation.The markers show the results of the test runs with different halo and grp parameters.Larger marker corresponds to larger H g

Table 2 .
GPU characteristics: β is the memory bus width, ν mem is the memory clock rate, Θ is the memory bandwidth GPU β [ B clock ] ν mem [ Gclock s ] Θ = 2βν mem [

Table 4 .
The highest performance results obtained with RDDhalo GPU r grp H b H g P , Gcell•step