Accelerating Seismic Redatuming Using Tile Low-Rank Approximations on NEC SX-Aurora TSUBASA


Introduction
Exploration geophysics is an applied branch of geophysics that uses several physical measurements at the surface of the Earth (e.g., seismic, gravity, electromagnetic) to estimate the physical properties of the first few kilometers of the subsurface. Originally developed with the aim of mapping anomalies corresponding to mineral or hydrocarbon accumulations, these methods are nowadays also used in the context of geothermal exploration, carbon capture, and storage evaluation and monitoring, as well as to assess the integrity of the near subsurface for offshore wind farms.
Seismic reflection is a popular remote sensing technique that utilizes reflected seismic waves to produce high-resolution images of the geological structures as well as estimates of elastic properties of the subsurface. Its success is motivated by the fact that the waves propagating in the subsurface and being recorded at the surface of the Earth are governed by the well-known elastic wave equation; various techniques have been developed over the years to harness the information contained in such recordings; see [36] for a detailed treatise. With the aim of imaging subsurface discontinuities, seismic data recorded at the surface of the Earth must be numerically re-positioned at locations in the subsurface where reflections have originated, a process generally referred to as redatuming by the geophysical community [7]. Historically, this process has been carried out by numerically time-reversing the data recorded along an open boundary of surface receivers into the subsurface. Despite its simplicity, such an approach is only able to handle seismic energy from primary arrivals (i.e., waves that interact only once with the medium discontinuities), failing to explain multi-scattering in the subsurface. As a result, seismic images are contaminated by artificial reflectors if data are not pre-processed prior to imaging such that multiples are removed from the data. In the last decade, a novel family of methods has emerged under the name of Marchenko redatuming [12,31,34]. Such methods allow for accurate redatuming of the full-wavefield recorded seismic data including multiple arrivals. This is achieved by solving an inverse problem, whose adjoint modeling can be shown to be equivalent to the standard single-scattering redatuming method of [7]. Whilst being more accurate, this new approach calls for the solution of an inverse problem that requires repeated application of the so-called multi-dimensional convolution (MDC) operator and its adjoint. Mostly because of the extremely expensive nature of these operators in terms of complexity and memory footprint, the Marchenko redatuming method has not been widely adopted by the geophysical community yet. These operators also show poor scalability due to their inherent memory-bound behavior.
In this paper, we report on accelerating these expensive operators by using Tile Low-Rank (TLR) approximations to perform one of the most time-consuming computational kernels, i.e., the Matrix-Vector Multiplication (MVM) operation, on the NEC SX-Aurora TSUBASA vector computing hardware solution. In fact, MVM is the workhorse of Marchenko redatuming for seismic imaging since it is repeatedly applied for hundreds of frequencies at each step of the iterative process. Instead of operating the MVM on the original dense data structure, our numerical technique consists of (1) splitting the matrix into tiles with elements contiguously stored in memory, (2) compressing the tile matrix using TLR approximations (e.g., using randomized SVD [22]) up to an application-dependent accuracy threshold, and (3) performing the MVM directly on the compressed TLR data storage. This translates into a reduction of the number of floating-point operations, while further saving memory footprint. The latter is especially critical when working with large 3D seismic datasets. This TLR algorithmic redesign of the MVM introduces a load imbalance between low and high frequencies that does not occur with the traditional dense MVM. Indeed, TLR matrices associated with high frequencies reveal higher ranks than those from low frequencies. We design and implement a load balancing technique to map frequencies to processing units so that the overall application's idle time is limited. Our MPI+OpenMP TLR-MVM implementation saturates the second generation of high bandwidth memory (HBM2) of the SX-Aurora TSUBASA cards and maintains decent scalability when increasing the number of vector engines. We assess the accuracy of TLR-MVM and demonstrate its numerical robustness on representative 3D seismic datasets. We benchmark our TLR-MVM on two distributed-memory systems. It reaches up to 3X performance speedup against the dense MVM counterpart from the NEC scientific library on 128 NEC SX-Aurora TSUBASA cards. Thanks to HBM2, it further attains up to 67X performance speedup (i.e., 110 TB/s) compared to the dense MVM from Intel MKL when running on 128 dual-socket 20-core Intel Cascade Lake nodes with DDR4 memory.
The contributions of this paper are as follows. We democratize the Marchenko redatuming method for seismic imaging simulations by integrating TLR-MVM as the core computational engine for solving the inverse problem. We conduct performance profiling of our TLR-MVM code using the NEC Ftrace profiler tool and identify hot spots and room for improvement. In particular, we mitigate the load imbalance engendered by TLR-MVM when operating on matrices from several frequencies in an embarrassingly parallel fashion. We highlight the performance advantage of TLR-MVM over the traditional dense MVM on Intel x86 and NEC vector computing hardware solutions, without suffering any deterioration of the image quality. We evaluate our implementation using the roofline model [35] and show how TLR-MVM is able to leverage HBM2 technology.
The remainder of the paper is as follows. Section 2 presents related work on low-rank matrix approximations. Section 3 describes the seismic Marchenko redatuming method. Section 4 recalls the general TLR-MVM algorithm. Section 5 provides implementation details of multiple inlined TLR-MVM calls and introduces the necessary load balancing strategies. Section 6 shows the TLR impact on numerical accuracy using proxy 3D seismic datasets. Section 7 reports performance analysis and experimental results of TLR-MVM on two distributed-memory systems composed of x86 and vector computing architectures. Section 8 discusses potential future work along this research direction and we conclude in Section 9.

Related Work
Low-rank matrix approximation is a class of algebraic compression methods that exploit the data sparsity of large matrices. This becomes critical when performing linear algebra operations on these large operators, since both the algorithmic complexity and the memory footprint can be reduced [13,19].
For instance, in the context of wave-equation-based seismic processing methods, the estimation of primaries by sparse inversion (EPSI) [23] suggests that low-rank approximations of the integral operators may be utilized to reduce the storage and computation cost of the MVM. That work, however, is only applied as a proof of concept to 2D datasets using the Hierarchical Semi-Separable (HSS) compression data format. While HSS provides linear complexity, it may face challenges in compressing 3D datasets, resulting in an increase of the arithmetic complexity. Moreover, one of the main reasons that slows down the wide adoption of low-rank approximations in scientific applications on current petascale supercomputers is the lack of support for advanced numerical kernels in the vendor numerical libraries. Indeed, H-matrix computations require the development of new kernels that are versatile enough to effectively support a range of arithmetic intensities, while exhibiting low overheads during kernel launch. On the contrary, the flat TLR matrix approximation is a pragmatic approach that represents a compromise between algorithmic complexity and software development/deployment on emerging HPC platforms.
Referred to as batched matrix operations [1,10,16], the idea behind these advanced numerical kernels is to simultaneously execute many linear algebra kernels accessing different matrices so that one may achieve high hardware occupancy. While the support from optimized vendor libraries has improved over the last few years, developers may still have to implement their own kernels (e.g., on GPUs) or simply fall back to the OpenMP for loop pragma to execute kernels in batched mode. The former raises concerns about software sustainability, while the latter may not extract the full performance of the underlying hardware architecture.
The authors in [27] have introduced the algorithm of a single TLR-MVM to enhance real-time performance when identifying atmospheric turbulence for ground-based telescopes. Based on batched MVM, their TLR-MVM implementations rely on OpenMP due to standardization constraints required by the computational astronomy community for code sustainability and portability purposes. Performance results have been reported on several cutting-edge architectures.
In this paper, we extend this previous work [27] to process multiple TLR-MVMs (i.e., a batch of batched MVMs), as required by the seismic redatuming method, and deploy the application to distributed-memory systems equipped with x86 nodes and vector computing engines. Given the few but strong cores (i.e., eight cores) on NEC SX-Aurora TSUBASA cards compared to x86 architectures, the NEC hardware solution makes up for the low core count with vectorization and high bandwidth memory (HBM2) to achieve high performance. This is quite different from GPU architectures that promote massive parallelism via the single instruction, multiple data (SIMD) paradigm. Therefore, porting to NEC vector engines represents a similar effort to deploying on x86, i.e., with high user productivity, with the favorable exception that HBM2 may provide a significant performance boost to memory-bound kernels compared to x86's DDR4 memory technology. To further maximize performance on NEC vector engines, it is also paramount to fully utilize the vector units. All in all, the hardware design of NEC SX-Aurora TSUBASA cards facilitates the deployment of these advanced numerical kernels, which intrinsically drives the performance of low-rank matrix approximations.
To our knowledge, this is the first time TLR-MVM is successfully applied to 3D datasets using NEC vector computing hardware solutions in the context of seismic redatuming.

The Seismic Redatuming Method
Seismic redatuming is the process of numerically re-positioning seismic data physically recorded at the surface of the Earth to any location of interest in the subsurface. Whilst historically able to target only so-called primary arrivals in the recorded data, recent theoretical advances have led to the creation of so-called Marchenko redatuming, which is capable of handling full-wavefield seismic data including any order and type of internal scattering. This entails an inverse problem to be solved that can be expressed concisely as a system of two coupled equations (Equation 1) [28], where R and R^* are the so-called convolution and correlation integral operators, Θ is a time-space window, I is the identity operator, f^- and f_m^+ are, respectively, the up-going focusing function and the coda of the down-going focusing function to invert for, and f_d^+ on the left-hand side of Equation 1 is the direct component of the down-going focusing function, which can be obtained by numerical modelling in a reference velocity medium. Finally, the overall down-going focusing function can be constructed as f^+ = f_d^+ + f_m^+ (Equation 2). For simplicity, Equation 1 can be written compactly as d = Mf, where d, f and M are the overall data, model, and Marchenko operator of the problem we wish to solve.
Once the focusing functions are retrieved, the up- and down-going separated Green's functions g^- and g^+ can be computed by evaluating a further set of equations. Note that, due to the extremely large size of these matrices, whilst the problem is written in a compact matrix-vector formulation, its numerical solution is performed using matrix-free operators and iterative solvers such as conjugate gradient least-squares (CGLS) or LSQR [30]. Focusing now our attention on the multi-dimensional convolution (MDC) integral operator, which represents the most expensive computation in the overall chain of operations, its inner working can be written more explicitly as the forward integral in Equation 3; similarly, the adjoint of such an operator can be written as in Equation 4, where F and F^{-1} represent the forward and inverse Fourier transforms, ω is the angular frequency, and x_A, x_B and x_R represent spatial locations, with the latter two spanning the integration domain ∂D. ω_max is used to indicate that the output of the forward Fourier transform is truncated to contain only frequencies where the signal spectrum resides. Finally, R(ω, x_B, x_R) represents the kernel of the integral operator in the frequency-space domain and can be created upfront by applying the Fourier transform along the time axis of the physically recorded seismic data R(t, x_B, x_R). Moreover, once the spatial integral is discretized, the kernel simply becomes a stack of matrices (one for each frequency ω within the specified frequency spectrum of the seismic data) and the integral can be interpreted as a batched matrix-vector multiplication (MVM) operation. This is true for both the forward and adjoint computations, with the main difference that the latter requires such matrices to be transposed and complex conjugated. Finally, in order to validate our statement that the operators R and R^H represent the most expensive computations in the solution of the inverse problem in Equation 1, a single iteration of CGLS is evaluated and the overall computational time is divided into atomic contributions (Figure 1). A single-node implementation of the Marchenko redatuming equations is used in this example, as provided by the PyLops framework for large-scale inverse problems [29]. More specifically, since an iteration of CGLS requires the application of both a forward (M) and an adjoint (M^H) pass, we observe that almost ninety percent of the time is spent in evaluating R (and R^*) and their adjoints, while the remaining time is roughly split between other computations involved in the M operator and vector-vector operations in the CGLS step. Finally, we observe a slight time difference in the computation of the forward and adjoint passes; this results from the transposition of the matrix stack in the adjoint step. Moreover, complex conjugation must be performed on each element of the kernel. In practice, as explained in more detail in [30], complex conjugation is applied to the input and output vectors, which are much smaller than the kernel, so the kernel itself is not transposed. Nevertheless, since the matrices in the stack are stored in row-major order in main memory, the timing of the forward step is more favourable.
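To make the batched interpretation concrete, the following minimal C sketch shows how, once the spatial integral is discretized, the forward MDC operator reduces to one independent dense MVM per frequency (function and variable names are illustrative, not those of our implementation):

```c
#include <complex.h>
#include <stddef.h>

/* Illustrative sketch: the discretized MDC forward operator reduces to one
 * dense MVM per frequency, y_f = R_f * x_f, where R_f is the nr-by-ns kernel
 * matrix at frequency index f (stored row-major), x_f the input wavefield
 * slice at that frequency, and y_f the corresponding output slice. */
void mdc_forward_batched(size_t nfreq, size_t nr, size_t ns,
                         const float complex *R,  /* nfreq * nr * ns, row-major */
                         const float complex *x,  /* nfreq * ns */
                         float complex *y)        /* nfreq * nr */
{
    for (size_t f = 0; f < nfreq; f++) {
        const float complex *Rf = R + f * nr * ns;
        const float complex *xf = x + f * ns;
        float complex *yf = y + f * nr;
        for (size_t i = 0; i < nr; i++) {
            float complex acc = 0.0f;
            for (size_t j = 0; j < ns; j++)
                acc += Rf[i * ns + j] * xf[j];
            yf[i] = acc;
        }
    }
}
```

The adjoint pass would loop over the same stack of matrices with their conjugate transposes, which is why the transposition cost discussed above only appears in the adjoint step.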
Alongside advances in processing algorithms, the size and scale of seismic surveys have increased since the late 20th century. Nowadays, large-scale high-resolution 3D surveys are routinely acquired where the recorded data can easily be on the order of several Terabytes. Recently, the implementation of the 3D Marchenko equations has been discussed in [30] and [11]. In both cases, special attention has been placed on the implementation of the integral operator and the handling of kernels that cannot directly fit in the main memory of a single compute node. In the former approach, the embarrassingly parallel nature of the batched MVM is leveraged by reading different frequency batches into the main memory of multiple compute nodes only once, prior to solving the inverse problem in Equation 1. The latter approach, on the other hand, utilizes the ZFP-based compression algorithm of [26] to reduce the size of the reflection response to be stored on disk and the read-in time to memory. The authors report a compression factor of four for this lossless compression when applied to their frequency-space reflection seismic data. In fact, even when the data is compressed, on-the-fly decompression is still required to be able to perform the computations in Equations 3 and 4. In both implementations, however, no attempt is made to expedite the dense MVM required in both the forward and adjoint processes.
Whilst the redatuming Equation 1 is relatively new to the field of geophysics, integral operators of the kind of R (Equation 3) and R^H (Equation 4) are common to a number of other wave-equation-based seismic processing methods, such as surface-related multiple elimination (SRME [33]), estimation of primaries by sparse inversion (EPSI [20]), and up/down deconvolution [6], just to name a few. This work may therefore have a broader impact beyond the Marchenko redatuming technique.

Background on TLR-MVM
We briefly recall here the algorithmic design of the tile low-rank matrix-vector multiplication (TLR-MVM) kernel [27]. While the traditional dense MVM stores the matrix elements in column-major data layout, TLR-MVM first splits the dense matrix into tiles in which elements are contiguous in memory. This tiling technique is important in terms of memory access as it shortens the strided memory accesses to better fit in the high levels of the memory subsystem. Once the tile matrix data structure is constructed, TLR-MVM compresses each dense tile in an embarrassingly parallel fashion using an algebraic method of choice (e.g., rank-revealing QR, randomized SVD, etc.). The numerical lossy compression depends on the accuracy threshold required by the application to sustain its numerical robustness. Figure 2 represents the TLR-MVM operation A × x = y, with A ∈ R^{m×n}, x ∈ R^n, y ∈ R^m. The matrix A is split into 6-by-7 tiles with a tunable tile size parameter nb. After compression using the accuracy threshold ε, each tile is decomposed into U and V bases containing the k most significant singular values Σ_{i,j} and their associated singular vectors, where the rank k is the smallest value for which the tile approximation error, measured in the Frobenius norm, falls below the accuracy threshold ε. The selected singular values Σ_{i,j} may be absorbed by one of the bases after applying the corresponding scaling operations, which facilitates the MVM implementation. Once compressed, the tiles may have different ranks k, which may create load imbalance situations. Moreover, Level-2 BLAS MVM kernels are inherently memory-bound and their performance solely depends on the memory bandwidth. Therefore, it is critical to optimize memory accesses to avoid additional data motion between main memory and the last-level cache. While the tiling technique helps for the traditional dense MVM, TLR-MVM ends up dealing with several new compressed U and V data structures that may not be stored contiguously in memory. As introduced in [27], we stack the U bases together and the V bases together and design a new MVM algorithm that leverages the TLR data structure.
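As an illustration of the truncation step, the sketch below selects the rank of a single tile from its singular values; it assumes a relative Frobenius-norm criterion, which may differ in detail from the exact rule used in our implementation:

```c
#include <math.h>
#include <stddef.h>

/* Sketch (assumption): given the singular values sigma[0..n-1] of a dense
 * tile, sorted in decreasing order, return the smallest rank k such that the
 * Frobenius norm of the discarded tail stays below eps times the tile's
 * Frobenius norm. */
size_t truncation_rank(const float *sigma, size_t n, float eps)
{
    double total = 0.0, tail;
    for (size_t i = 0; i < n; i++)
        total += (double)sigma[i] * sigma[i];
    tail = total;
    for (size_t k = 0; k < n; k++) {
        if (sqrt(tail) <= eps * sqrt(total))
            return k;               /* keep the first k singular triplets */
        tail -= (double)sigma[k] * sigma[k];
    }
    return n;                       /* tile is effectively full rank */
}
```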
The actual TLR-MVM operation can then proceed with the following three successive computational phases: (1) we multiply the stacked V bases with the vector x, (2) we reshuffle the intermediate results from phase 1 so that they are ordered consistently with the U bases, and (3) we multiply the stacked U bases with the result of the reshuffling phase 2 and compute the final result vector y.
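The following C sketch illustrates the three phases on a per-tile basis; the actual implementation operates on the stacked bases and batches the small products, so the data layout and names below are simplifying assumptions:

```c
#include <complex.h>
#include <stddef.h>
#include <string.h>

/* Per-tile sketch of TLR-MVM (assumed layout, not the stacked-bases layout of
 * the optimized code): tile (i,j) of size nb x nb is approximated as
 * U[t] * V[t]^T with t = i*nt + j, U[t] and V[t] of size nb x k (column-major),
 * k = rank[t].  A is mt x nt tiles, x has nt*nb entries, y has mt*nb entries,
 * and work must hold at least nb entries. */
void tlr_mvm(size_t mt, size_t nt, size_t nb,
             float complex **U, float complex **V, const size_t *rank,
             const float complex *x, float complex *y, float complex *work)
{
    memset(y, 0, mt * nb * sizeof(float complex));
    for (size_t i = 0; i < mt; i++) {
        for (size_t j = 0; j < nt; j++) {
            size_t t = i * nt + j, k = rank[t];
            const float complex *Vt = V[t], *Ut = U[t], *xj = x + j * nb;
            /* Phase 1: w = V^T * x_j (k entries) */
            for (size_t r = 0; r < k; r++) {
                float complex acc = 0.0f;
                for (size_t p = 0; p < nb; p++)
                    acc += Vt[r * nb + p] * xj[p];
                work[r] = acc;
            }
            /* Phase 2 (reshuffle) is implicit here because w is consumed
             * immediately; the optimized code reorders the stacked results. */
            /* Phase 3: y_i += U * w */
            for (size_t p = 0; p < nb; p++) {
                float complex acc = 0.0f;
                for (size_t r = 0; r < k; r++)
                    acc += Ut[r * nb + p] * work[r];
                y[i * nb + p] += acc;
            }
        }
    }
}
```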
In this paper, we extend the real precision arithmetic used for computational astronomy [27] to complex precision arithmetic in order to support seismic imaging applications. Furthermore, we implement a driver to launch several TLR-MVM kernels, i.e., one for each frequency ω, coming from the discretization of the spatial integral R(ω, x B , x R ) (see Section 3), as explained in the next section.

Launching Multiple TLR-MVM Kernels for Seismic Redatuming
In this section, we describe the implementation of a driver that launches multiple TLR-MVM kernels to support the workload of seismic redatuming. We first identify the challenges introduced by TLR-MVM on 3D seismic datasets and propose optimization techniques to address them.

Challenges with 3D Seismic Datasets
The 3D seismic dataset used in this study contains stacked matrices for 150 frequencies. This is representative of the workload of seismic redatuming, although the number of frequencies may be further increased to include higher frequencies and produce seismic wavefields and images of higher resolution. The size of each matrix is m = n = 9801. The matrices considered herein are square, but rectangular matrices are also supported. The relationship between the frequency index F_i and the value of the frequency is given by f_{F_i} = i / (d_t n_t), where d_t = 0.0025 and n_t = 1201. Figure 3 shows the rank analysis of the 3D seismic datasets. In particular, Figure 3a highlights the rank distribution of the F_i matrices for i = 0, 50, 100 and 149. We set nb = 256 and ε = 0.001. As expected, the figure captures how the rank distribution shifts to the right, with higher ranks for matrices corresponding to higher frequency components of the data. The red vertical line in Figure 3a shows the rank limit nb/2 = 128. If the rank distribution goes beyond this red line, the accumulated sizes of the bases for a single tile become larger than the size of the original dense tile. On the contrary, if the rank distribution stays to the left of the red line, as pictured in Figure 3a, TLR-MVM remains competitive compared to dense MVM. Figure 3b reports the summation of all ranks for a given frequency matrix. The total rank summation is a good metric to evaluate the algorithmic complexity. We observe an increase in rank for higher frequencies, which corroborates the analysis of the rank distribution in Figure 3a. For this particular 3D seismic dataset, we can observe from Figure 3b that the number of floating-point operations (FLOPS) of TLR-MVM, on a log scale, has a near-linear relationship with the frequency matrix index. This relationship provides insights on how to orchestrate the TLR-MVM scheduling for all frequencies, given the workload heterogeneity. It is now clear that one of the main challenges is the load imbalance introduced by TLR-MVM within and across all frequency matrices, compared to the homogeneous dense MVM. We implement two optimization techniques and present their corresponding pseudo-codes in Figure 4. The codes are written in C and rely on MPI+OpenMP programming models.
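For reference, the two quantities discussed above, the frequency associated with a matrix index and the break-even rank of a tile, can be computed as in this small sketch (constants taken from the dataset described above; function names are illustrative):

```c
#include <stddef.h>

/* Frequency value of matrix index i, with dt the time sampling (seconds) and
 * nt the number of time samples: f_i = i / (dt * nt). */
double frequency_of_index(size_t i, double dt, size_t nt)
{
    return (double)i / (dt * (double)nt);   /* e.g., i = 100 -> ~33.3 Hz */
}

/* A tile of size nb is worth compressing only if its rank stays below nb/2,
 * since the U and V bases of rank k store 2*k*nb elements vs nb*nb. */
size_t break_even_rank(size_t nb)
{
    return nb / 2;                          /* e.g., nb = 256 -> 128 */
}
```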
We address in the subsequent sections how these techniques mitigate the load imbalance overhead.

The Merge-Phase Strategy for Intra-Node Load Balancing
The Merge-Phase strategy is designed to achieve intra-node load balancing by evenly distributing the computation over the threads (or processing units). A collection of frequencies F is no longer processed one frequency at a time. Instead, the strategy fuses each individual phase across all frequencies. This increases the workload of the OpenMP loops within each phase, generates a larger number of computational tasks, and reduces the scheduling overheads. The default static scheduling mode may increase data locality, especially when dealing with small datasets. For large datasets, static scheduling may lose its performance advantage and may create idle time while work is still available. We therefore enable OpenMP dynamic scheduling so that the runtime has opportunities to prevent idle time by scheduling tasks as soon as they enter the ready state. While this strategy alleviates the intra-node load imbalance bottleneck, the inter-node load imbalance remains an issue, as observed in Figure 3b.
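A minimal sketch of the Merge-Phase idea described above is given below: the per-frequency loops of a given phase are fused into one OpenMP loop over (frequency, tile-column) pairs and scheduled dynamically. The helper function name and the task granularity are illustrative assumptions:

```c
#include <omp.h>
#include <stddef.h>

/* Sketch: instead of running phase 1 for one frequency at a time, all
 * (frequency, tile-column) tasks are fused into a single dynamically
 * scheduled OpenMP loop, so threads that finish cheap (low-rank) tasks
 * immediately pick up the remaining work. */
void merged_phase1(size_t nfreq, size_t nt,
                   void (*phase1_tile_col)(size_t f, size_t j))
{
    #pragma omp parallel for schedule(dynamic)
    for (size_t t = 0; t < nfreq * nt; t++) {
        size_t f = t / nt;      /* frequency index   */
        size_t j = t % nt;      /* tile-column index */
        phase1_tile_col(f, j);  /* V^T * x for all tiles in column j of freq f */
    }
}
```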

The ZigZag Mapping Strategy for Inter-Node Load Balancing
To leverage performance on distributed-memory systems, we evenly allocate frequency matrices across the compute nodes. Dynamic load balancing on distributed-memory systems is a challenging approach that may require data movement to compensate for the idle time. However, we have identified a relationship between the frequency index and the corresponding FLOPS, as explained in Section 5.1. We therefore design the ZigZag mapping strategy, which statically maps frequencies to processing units to achieve inter-node load balancing. Figure 4b highlights the ZigZag pattern for frequency mapping using 150 frequencies on eight Vector Engines (VE).
In the first sweep over the VEs, we map the frequencies in increasing index order. We then map the next set of frequencies in decreasing index order over the VEs. We repeat this ZigZag pattern until all frequencies are mapped to the VEs. This strategy has several advantages. Not only does it exploit the near-linear relationship between frequency index and FLOPS, it also balances the memory footprint per VE. For instance, given that the on-chip memory capacity is limited to 48GB on NEC VEs, the ZigZag mapping strategy allows memory-intensive applications to scale. Furthermore, this mapping is performed offline and does not incur runtime overheads. It is important to mention, however, that this linear relationship may not be characteristic of all seismic datasets. Network interconnect congestion may further exacerbate the load imbalance, even if the load is properly distributed. Therefore, we believe dynamic load balancing is an interesting research direction and we leave it as future work.
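For concreteness, a minimal sketch of the ZigZag assignment described above is shown below; the function name is illustrative, and the actual mapping is computed offline:

```c
#include <stddef.h>

/* Sketch of the ZigZag mapping: frequencies are assigned to vector engines
 * (VEs) in increasing index order on even sweeps and decreasing order on odd
 * sweeps, so that each VE receives a mix of cheap (low-rank, low-frequency)
 * and expensive (high-rank, high-frequency) matrices. */
int zigzag_ve_of_frequency(size_t freq_idx, size_t num_ve)
{
    size_t sweep = freq_idx / num_ve;   /* which sweep over the VEs   */
    size_t pos   = freq_idx % num_ve;   /* position within the sweep  */
    return (sweep % 2 == 0) ? (int)pos : (int)(num_ve - 1 - pos);
}
```

With 150 frequencies and eight VEs, frequencies 0 to 7 are assigned to VEs 0 to 7, frequencies 8 to 15 to VEs 7 down to 0, and so on, so that each VE accumulates a comparable mix of cheap and expensive matrices.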

Algorithmic Complexity and Code Balance
As discussed in [27], the number of floating-point operations (FLOPS) of the dense MVM is 2mn and the memory bandwidth can be calculated as B(mn + n + m)/t, with B(W) the number of bytes of W elements (stored in single complex precision) and t the execution time. The FLOPS and memory bandwidth of TLR-MVM are 4K × nb and B(2K × nb + 4K + m + n)/t, respectively, with K the sum of the ranks across all tiles of any single frequency matrix (see Figure 2) and nb the tile size.
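These counts can be turned into a small helper for estimating the per-frequency cost of both variants (a sketch applying the formulas above, with 8 bytes per single-complex element; the struct and function names are illustrative):

```c
#include <stddef.h>

/* Per-frequency operation and traffic counts from the formulas above:
 * dense MVM performs 2*m*n operations and moves m*n + n + m elements;
 * TLR-MVM performs 4*K*nb operations and moves 2*K*nb + 4*K + m + n
 * elements, where K is the sum of the tile ranks of that matrix. */
typedef struct { double flops; double bytes; } mvm_cost;

mvm_cost dense_mvm_cost(size_t m, size_t n)
{
    mvm_cost c = { 2.0 * m * n, 8.0 * ((double)m * n + n + m) };
    return c;
}

mvm_cost tlr_mvm_cost(size_t m, size_t n, size_t nb, size_t K)
{
    mvm_cost c = { 4.0 * K * nb, 8.0 * (2.0 * K * nb + 4.0 * K + m + n) };
    return c;
}
```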
In the seismic application, suppose there are F frequency matrices; the overall FLOPS and memory bandwidth of dense MVM are then F × 2mn and F × B(mn + n + m)/t, respectively. The overall FLOPS and memory bandwidth of TLR-MVM for all F frequency matrices are 4K_F × nb and B(2K_F × nb + 4K_F + F(m + n))/t, respectively, where K_F is the sum of all ranks across all frequencies. According to the FLOPS calculation of TLR-MVM, the rank sum K of each frequency matrix plays an important role. If the rank sum K is not large enough, the algorithm may not saturate the memory bandwidth due to suboptimal hardware occupancy.

TLR Impact on Numerical Accuracy
To begin with, we consider the same synthetic geological model used in [30] (their Figure 3) and compare the reconstruction of the total Green's function g = g^+ + g^- obtained using the original R kernel and the TLR-compressed R kernel with different combinations of tile size and accuracy. The observation that seismic data in the frequency domain can be compressed in a low-rank form is not surprising when considering the structure of the matrices representing seismic data in the frequency domain, as shown in Figure 5a. Such a matrix represents the seismic data at a single frequency (f = 33Hz) with sources placed along the rows and receivers along the columns. In other words, the element (i, j) of this complex-valued matrix contains the 33Hz frequency coefficient of the seismic recording at the i-th source and j-th receiver. Looking at the insert in Figure 5a, we can see how the entire matrix is composed of smaller diagonally dominant submatrices. This is due to the fact that we are dealing with 3D seismic data and each submatrix represents the responses of a single source line to a single receiver line. Moving from one source line to the next leads to the block pattern of this matrix in the row space, whilst moving from one receiver line to the next leads to the block pattern in the column space. Another important observation made by [23] is that, within the available seismic bandwidth, kernel matrices at higher frequencies are of higher rank, i.e., they require more basis functions to be approximated at the selected accuracy. Whilst not discussed by the authors of [23], this imbalance in the rank of the different matrices inevitably leads to a load imbalance in the computation of the batched matrix-vector multiplications (MVM), as explained in Section 5.1. The effect of solving the inverse problem in Equation 1 using different TLR-compressed R kernels is presented in Figure 6. More specifically, Figure 6a shows the reconstructed Green's function along eight different receiver lines (shown in the insert) using the original uncompressed reflection response. Such an estimate has been shown in [30] to be very accurate and is used here as our benchmark. Figures 6b-d show the reconstructed Green's function using TLR-MVM with different compression parameters: nb = 256 and ε = 0.001, 0.005, 0.01. Figure 6e shows the element-wise absolute value difference between Figure 6a and Figure 6b. Figure 6f shows 10× the values in Figure 6e. Figures 6g-h similarly show 10× the element-wise absolute value difference from Figure 6a using the corresponding compression parameters.

Numerical Accuracy Assessment
We use the Signal-to-Noise Ratio (SNR) to quantify the error shown in Figure 6: SNR = −20 log10 of the relative error between the reconstructed and reference wavefields, so the larger the SNR, the better the approximation. Next, we investigate the impact of compression on the SNR for various tile sizes and accuracy thresholds, as shown in Figure 7a. We compute the ratio between the FLOPS executed by dense MVM and by TLR-MVM. This ratio of FLOPS savings corresponds to how many fewer FLOPS are performed by TLR-MVM compared to dense MVM. For instance, TLR-MVM achieves an SNR of around 30 while performing 2.4× fewer FLOPS than dense MVM when using nb = 128 and ε = 0.001. We use a colormap to show the corresponding SNR value. There is a general trend that a more restrictive accuracy threshold is needed for smaller tile sizes. Figure 7b shows the relationship between FLOPS savings and SNR. We choose a threshold SNR value of 40 that satisfies the quality requirement of the application. We observe two sets of compression parameters (nb = 256, ε = 0.001 and nb = 512, ε = 0.001) which satisfy this requirement. Indeed, under these two sets of compression parameters, the TLR-MVM algorithm does not affect the final image quality.
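A minimal sketch of this metric, assuming the relative error is measured in the l2 norm over the reconstructed wavefield samples (the function name is illustrative):

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

/* Sketch, assuming SNR = -20 * log10( ||approx - ref||_2 / ||ref||_2 ):
 * the larger the SNR, the better the approximation.  Assumes a non-zero
 * reference wavefield. */
double snr_db(const float complex *ref, const float complex *approx, size_t n)
{
    double err = 0.0, nrm = 0.0;
    for (size_t i = 0; i < n; i++) {
        float complex d = approx[i] - ref[i];
        err += (double)crealf(d * conjf(d));
        nrm += (double)crealf(ref[i] * conjf(ref[i]));
    }
    return -20.0 * log10(sqrt(err) / sqrt(nrm));
}
```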
Having completed the numerical accuracy assessment and identified these two sets of parameters, we can now study their impact on performance on the hardware systems considered in this paper.

Experimental Results
This section reports the performance results of our TLR-MVM implementation, compares it against dense MVM, and highlights the impact of the optimization techniques introduced in Sections 5.2 and 5.3.

Environment Settings
The experiments are carried out on two systems. The first system is a 16-node NEC B300-8 cluster, where each node is equipped with 8 NEC SX-Aurora TSUBASA-20B Vector Engines (VE). The memory capacity of each VE is 48GB. The second system is a single x86 node with a dual-socket 20-core Intel Cascade Lake. We refer to this shared-memory node as CSL in the subsequent sections. The memory capacity of the Intel server is 350GB. The OS of both systems is Linux RedHat 7.7. For the NEC server, we use the NEC Compilers tools to compile the code and rely on the NEC Numeric Library Collection for the vendor-optimized BLAS implementation. For the Intel server, we use Intel Parallel Studio 2019 to compile the code and link it against the Intel Math Kernel Library for the vendor-optimized BLAS implementation. To analyze the performance results, we rely on the NEC Ftrace profile analysis tool. All experiments are run in single complex precision arithmetic.

Performance Results on Synthetic Datasets
We first study TLR-MVM on synthetic datasets to demonstrate the robustness and software capabilities of our TLR-MVM implementation. We randomly generate U and V bases in single complex precision. Without loss of generality, we employ square matrices with m = n = 10000 because this size is close to that of the real seismic datasets. The rank k is set to nb/4 and is constant across all tiles. This specific rank translates into a theoretical FLOPS saving factor of 2, i.e., TLR-MVM performs half as many FLOPS as dense MVM. Figures 8a and 8b show the time to solution and memory bandwidth using the synthetic datasets, respectively, on the single Intel CSL node and one NEC VE. We test three tile sizes, i.e., nb = 128, 256, 512. We also show the memory bandwidth obtained by the STREAM benchmark. This is the upper limit of sustained memory bandwidth from DDR4 memory (i.e., Intel CSL) and HBM2 (i.e., NEC VE). For Intel CSL, we run one MPI process per socket with 20 OpenMP threads. For the single NEC VE, we run one MPI process with 8 OpenMP threads. We see that a single NEC VE is much faster than Intel CSL thanks to HBM2 technology, with up to 85% of bandwidth saturation. On the NEC VE, the configuration with the best time / bandwidth (annotated with a star) runs in less than 375 microseconds and exceeds 1TB/s. Figure 8c shows the time breakdown of the three phases of TLR-MVM on Intel CSL and one NEC VE. As expected, the computational phases 1 and 3 are the most time-consuming, with the batched MVM capturing most of the elapsed time. We also compare TLR-MVM with dense MVM, as implemented in the Level-2 BLAS CGEMV kernel of the vendor-optimized libraries. TLR-MVM reaches 8.94X and 1.58X performance speedups compared to dense MVM on Intel CSL and one NEC VE, respectively. Compared to the theoretical FLOPS saving factor of 2, our TLR-MVM implementation obtains a higher performance speedup on Intel CSL thanks to a full saturation of the memory bandwidth, while the dense MVM implementation from Intel MKL does not appear to saturate the bandwidth and seems to be poorly optimized. On the NEC VE card, our TLR-MVM implementation achieves a lower performance speedup when compared against the dense vectorized MVM from NEC NLC. As shown in Figure 8b, there may still be some room for improvement for our TLR-MVM implementation in further saturating the main memory.
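The theoretical saving factor of 2 quoted above follows directly from the operation counts of the code balance discussion: with a constant rank k = nb/4 for each of the (m/nb)(n/nb) tiles,

\[ K = \frac{m}{nb}\cdot\frac{n}{nb}\cdot\frac{nb}{4} = \frac{mn}{4\,nb}, \qquad \mathrm{FLOPS}_{\mathrm{TLR}} = 4K\,nb = mn, \qquad \frac{\mathrm{FLOPS}_{\mathrm{dense}}}{\mathrm{FLOPS}_{\mathrm{TLR}}} = \frac{2mn}{mn} = 2. \]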

Performance Results on 3D Seismic Datasets
In this section, we run TLR-MVM against the 3D seismic datasets and conduct experiments using 150 frequencies. Each frequency matrix is of size 9801 × 9801. As identified in Section 6, we use nb = 256 and ε = 0.001 to compress the dense matrices into tile low-rank (TLR) matrices, which preserves the application's accuracy while delivering performance on the NEC VE-based system. Figure 9a compares the time to solution for each frequency matrix on the Intel CSL and NEC VE systems when running the dense MVM and TLR-MVM. The time for dense MVM is constant across all frequencies on both systems. However, the time for TLR-MVM increases with the frequency index. This trend confirms the outcome of the rank statistics analysis made in Section 5.1, where the ranks (i.e., the workloads) grow with the index of the frequency matrix. For high frequencies, TLR-MVM on the NEC VE outperforms its counterpart on Intel CSL by up to a factor of 4, which is aligned with the results on synthetic datasets from Section 7.2. Figure 9b shows the same performance analysis but with the sustained bandwidth obtained for each frequency matrix on the two systems. The TLR-MVM bandwidth increases with the frequency index and exceeds 1TB/s for high frequencies. On Intel CSL, the dense MVM implemented in the CGEMV kernel from MKL shows limited sustained bandwidth, while TLR-MVM on the same system is able to saturate the bandwidth, as already demonstrated on synthetic datasets in Section 7.2. However, there is a clear load imbalance issue when looking at all frequencies and their respective makespans.

Performance Impact of Optimization Techniques
In this section, we assess the performance impact of the two previously introduced optimization techniques from Sections 5.2 and 5.3. We investigate four scenarios to solve the load imbalance issue: the reference TLR-MVM, the Merge-Phase TLR-MVM, the reference TLR-MVM with the ZigZag mapping strategy, and the Merge-Phase TLR-MVM with the ZigZag mapping strategy. We conduct experiments on the 3D seismic dataset with 150 frequencies on 8 NEC VEs. We use NEC Ftrace analysis to get the FLOPS count of each OpenMP thread to illustrate how the various strategies can mitigate the load imbalance overhead. The default mode of NEC Ftrace is Vector Operation profiling; we need to set the environment variable VE_PERF_MODE to VECTOR-MEM to get the FLOPS count. Figure 10 reports the FLOPS count of each thread for the aforementioned scenarios. Figure 10b shows how the Merge-Phase strategy improves the load balancing among threads within each VE. Figure 10c highlights how the ZigZag mapping strategy further distributes the workload evenly between NEC VE cards. Figure 10d pictures the performance impact when both strategies are activated. It permits achieving both inter-node and intra-node load balancing. It is also noteworthy that the workload among threads from VE 0 to VE 5 is higher than on VEs 6 and 7. This is because the number of frequencies (i.e., 150) is not divisible by the number of VEs (i.e., 8). Therefore, the first 6 VEs end up having one more frequency matrix to process. Figure 11a compares the time to solution of our TLR-MVM implementation with the various optimization techniques against the vendor-optimized dense MVM on the overall application using a single Intel CSL node and a single NEC VE. TLR-MVM achieves up to 16X performance speedup against dense MVM on Intel CSL. This speedup factor is due to a suboptimal implementation of the MKL multithreaded CGEMV kernel. Moreover, TLR-MVM scores up to 3.3X performance speedup compared to dense MVM on a single NEC VE. This result is on par with the theoretical FLOPS saving factor of 3.9 reported in Figure 7a. The optimization techniques do not significantly impact the TLR-MVM performance variants on the single NEC VE since the number of OpenMP threads is limited to 8. However, they do impact the TLR-MVM performance on the Intel CSL system, due to the large thread count, i.e., a total of 40 threads. Figure 11b shows the roofline performance model using a single NEC VE, while combining the Merge-Phase + ZigZag mapping strategies. We select frequency indices 1, 50, 100, and 150 and show how the TLR-MVM performance increases with the frequency index and gets near the HBM2 sustained bandwidth. Although the gain in time is important, the fine-grained computation of TLR-MVM may prevent matrices with low frequencies from getting closer to the sustained bandwidth on the NEC VE system due to low vector unit utilization. Figures 12a and 12b show the bandwidth (left y-axis) and time to solution (right y-axis) for the overall simulation with 150 frequency matrices. The figures report performance for each of the four strategies on Intel CSL and 8 NEC VEs. By combining the Merge-Phase + ZigZag strategies, we achieve the best time to solution and over 85% of the STREAM benchmark memory bandwidth for both systems, thanks to a better hardware occupancy. Our TLR-MVM achieves around 9TB/s aggregated bandwidth on 8 NEC VEs: this bandwidth score is approximately equivalent to 36 Intel CSL nodes. This result highlights the performance advantage of HBM2 over DDR4 memory technology.

Performance Scalability
We scale up the number of VEs as well as the problem size to study the performance scalability of our TLR-MVM implementation with both the Merge-Phase and ZigZag strategies activated. We use linear interpolation to enlarge the original 3D seismic datasets from 150 to 596 frequency matrices. Figure 13a shows scalability results on 596 frequency matrices on up to 128 NEC VEs. Our TLR-MVM implementation reaches 67% of the sustained bandwidth and 77.7% of linear scalability when using 128 NEC VEs. Figure 13b is a close-up view for up to 32 VEs. Note that the reference and Merge-Phase strategies can only run starting from 5 VEs because there is not enough memory on fewer VEs to host all frequency matrices. The ZigZag mapping strategy alleviates this bottleneck and permits balancing the memory allocation as well. Figure 13c shows a time-to-solution comparison of the vendor-optimized dense MVM CGEMV kernel against TLR-MVM on up to 128 VEs. The average acceleration of TLR-MVM over dense MVM is 3.13× on NEC VEs. This number is again on par with the theoretical FLOPS saving factor of 3.9 reported in Figure 7a, since the interpolation used to extend the 3D seismic datasets is linear. When comparing against CGEMV from MKL on Intel CSL, our TLR-MVM implementation achieves a 67X performance speedup when using 128 VEs. This corresponds to an aggregated bandwidth of around 110 TB/s.

Discussions and Future Work
In this work, our attention has primarily focused on reducing the memory footprint of the kernel of the multi-dimensional convolution operator as well as improving its computational efficiency. Both goals have been largely achieved by means of the proposed TLR-MVM implementation. Given the nature of the inverse problem that we wish to solve (Equation 1), several questions remain to be answered in our future endeavors. First, whilst we currently focus on improving the inner working of the R operator, the algorithm requires four slightly different computations to be implemented, namely R, R^H, R^*, and (R^*)^H = R^T. The latter three computations call for the application of the transpose and/or complex conjugate of the kernel to the input vector. Similar to its dense counterpart [30], we envision a TLR-MVM implementation where complex conjugation is applied to the input and output vectors instead of to the kernel itself. Moreover, given that the overall kernel is divided into tiles, the application of the transposed kernel is expected to benefit from the fact that elements of different rows of the U and V bases are closer in memory than their dense counterparts. Whilst this suggests a similar workload for the forward and adjoint passes, numerical validation is required. Moreover, whilst our focus has so far not included the forward and inverse FFTs that comprise part of the operator in Equation 3, future research will investigate the possibility of beginning their computations as soon as some of the TLR-MVMs have been executed, without waiting for the computations over the entire frequency range to be finalized. Another opportunity lies in the fact that the inverse problem we wish to solve can be slightly modified to include more than one spatial coordinate x_B at a time [30]; in other words, our batched TLR-MVM can be replaced by a batched tile low-rank matrix-matrix multiplication (TLR-MMM) where each column of the input matrix represents the wavefield originating from a different virtual source.

Conclusion
In this paper, we investigate and deploy Tile Low-Rank (TLR) Matrix-Vector Multiplication (MVM) to accelerate 3D seismic application workloads using vector computing hardware solutions based on the NEC SX-Aurora TSUBASA architecture. We propose and implement strategies to mitigate the load imbalance overheads that inherently emerge from such workloads. Our TLR-MVM implementation not only improves the overall performance but also permits scaling up in terms of memory footprint. Thanks to its fine-grained computations and memory-friendly data layout, our TLR-MVM implementation can leverage the HBM2 technology of NEC vector engines, which translates into a performance boost compared to the Intel CSL architecture with DDR4 memory technology. We assess the accuracy of our TLR matrix approximations and demonstrate the numerical robustness of our method by investigating the impact of compression on the subsurface image quality, using the signal-to-noise ratio as a quantitative metric. We then employ the roofline performance model to show how TLR-MVM is able to effectively extract performance from the underlying architecture. In a distributed-memory environment, our TLR-MVM implementation reaches around 110 TB/s of aggregated bandwidth on 128 NEC vector engines, which corresponds to a 67X performance speedup against the vendor-optimized dense MVM (i.e., MKL) with the same number of Intel CSL nodes. We believe these results are promising in the context of 3D seismic imaging. TLR-MVM may enable further increases in problem size and eventually improve the quality of 3D land surveys.