Parallel Numerical Algorithm for Solving Advection Equation for Coagulating Particles

In this work we present a parallel implementation of numerical algorithm solving the Cauchy problem for equation of advection of coagulating particles. This equation describes time-evolution of the concentration f ( x, v, t ) of particles of size v at the point x at the time-moment t . Our numerical algorithm is based on use of total variation diminishing (TVD) scheme and perfectly matching layers (PML) for approximation of advection operator along spatial coordinate x and utilization of the fast numerical method for evaluation of coagulation integrals exploiting low-rank decomposition of coagulation kernel coeﬃcients and fast FFT-based implementation of convolution operation along particle size coordinate v . In our work we exploit one-dimensional domain decomposition approach along spatial coordinate x because it allows to avoid use of parallel FFT implementations which are very expensive in terms of data exchanges and have poor parallel scalability. Moreover, locality of ﬁnite-diﬀerence operator from TVD-scheme along x coordinate allows to obtain good scalability even for computing clusters with slow network interconnect due to modest volumes of data necessary for synchronization exchanges between times integration steps.


Introduction
Coagulation and fragmentation processes stand in the basement of a wide class of physical phenomena starting from micro-polymer chains growth [6,17] and finishing at the scale of stars formation from interstellar dust [5,8].The very first model of aggregation was suggested by Smoluchowski in 1916 [24] and generalized by Hans Muller [18] into the form of the following integro-differential equation: This equation describes dynamics of concentration f (t, v) of the particles of size v per unit volume.The first term in the right-hand side corresponds to growth of concentration due to coalescence of the aggregates of sizes v and v − u.The second term describes decrease of the concentration due to their coalescence of the particles of size v with other particles.Kernel coefficients K(v, u) = K(u, v) ≥ 0 correspond to rates of aggregation process and have to be derived for each concrete application.
If the initial conditions f (t = 0, v) are known, then Cauchy problem for the coagulation equation is defined and can be solved numerically.Under natural assumptions [8] the solution of the Cauchy problem for semi-infinite size domain v ∈ [0, ∞) can be approximated by its finite part v ∈ [0, V max ].In fact, the class of mathematical models of potential interest is extremely wide and may include description of various physical effects: inception and sinks of particles of concrete sizes [2], unary fragmentation process [1] due to instabilities of big clusters and binary collisional fragmentation process [5], and many others [3,7,12,14,20].Even though the list of phenomena for application of spatially homogeneous aggregation and fragmentation equations is really huge [25,26], there is an even broader class of spatially inhomogeneous models [19,22].In this work we concentrate on analysis of advection-coagulation equation [28]: where function f (t, x, v) corresponds to concentration per unit volume of particles of size v at the point with coordinate x at moment t.Coagulating particles are transported along the axis of nonnegative coordinate x with velocity c(v).The velocities of the coalescence processes are again determined by the values of function K(u, v).In the similar manner with classical coagulation equation one needs initial conditions f (t = 0, x, v) and one boundary condition e.g.f (t, x = 0, v) for definition of the Cauchy problem for advection-coagulation equation and its further numerical investigation.
Recently, we proposed an efficient numerical method solving the Cauchy problem for this equation [28] and demonstrated its efficiency for the concrete examples of modelling problems.We revisit description of this algorithm in Section 1.
Nevertheless, even modest simulations presented in our previous report required quite a lot of CPU-time.In this work we propose a parallel implementation of this algorithm.We should also emphasize, that there exists a special parallel algorithm allowing to perform evaluation of aggregation and fragmentation sums of the similar structure as coagulation integrals.Even though, a reasonable speedup of calculations was presented at [16], the parallel scalability of the algorithm is relatively poor [16].This fact lies in motivation to exploit one-dimensional domain decomposition approach along spatial coordinate x for advection-coagulation equation but not along axis v and not two-dimensional domain decomposition.
In Section 2 we present a detailed description of novel parallel algorithm.Section 3 is devoted to tests of performance of the proposed algorithm at different computing clusters and Lomonosov supercomputer.We obtain speedup of calculations more than by 300 times.This allows us consider a broader class of problems of potential interest for mathematical modelling and studies of advection-coagulation processes.In conclusions we discuss the presented results and possible directions for further generalization of the presented ideas.

Numerical Method, Revisited
In this section we revisit description of the fast numerical algorithm for advectioncoagulation equation from work [28].First of all, we use an explicit Euler time-integration scheme with time step ∆t for solution of the Cauchy problem: where A(f ) is an approximation of the advection operator and S(f ) approximates Smoluchowski integrals.Let us denote M as number of grid nodes along particle size axis v with grid step ∆v and parameters v 1 = 0, v M = M • ∆v and N as number of spatial grid nodes along x with step ∆x, x 1 = 0, x N = N • ∆x.Finally, we denote f n i,j as value of numerical solution at grid node (x j , v i ) at the moment n • ∆t.
For S(f ) we use skeleton decomposition of coagulation kernel and fast algorithms of linear algebra with overall algorithmic complexity of solution scheme to be O(M R log M ) at each grid node along x axis [13].Whereas without utilization of these ideas the complexity becomes O(M 2 ).We discuss this method with more details in Section 2.2.We get significant acceleration for evaluation of Smoluchowski integrals if the rank R of the coagulation kernel is a modest number.For approximation of the advection part A(f ) we exploit well-known TVD scheme [10] and PML layers [4] allowing us to keep monotonicity of numerical method.Detailed relations for advection part are presented in Section 2.3.

Handling Smoluchowski Integrals
Skeleton decomposition of coagulation kernel helps accelerate computations by reduction of the algorithmic complexity of the numerical evaluation of Smoluchowski integrals [27,29].
With use of skeleton decomposition we can perform transformation of the first integral: Following this idea we also transform the second integral: Transformation of integrals written as sum of R lower-triangular convolutions, hence, with the use of FFT algorithm, total algorithmic complexity of the numerical integration using the relations above yields to O(M R log M ) operations at each from N points of spatial grid.Hence, total cost of evaluation of Smoluchowski integrals is O(N M R log M )

Approximation of Advection Part
For approximation of the advection equation we use the following TVD-scheme.
where C j is a flux limiter function that ensures that no artificial oscillations occur and has to satisfy the following restrictions For simplicity of notations we revisit this scheme ignoring volume index i because at each grid point along size coordinate v i we perform these operations along spatial axis x.In fact, there are a lot of possible choices for C j which were introduced by many authors [10,11].In our work, we use the one that is called "Monotonized Central Flux Limiter" [10].
The last thing that has to be done with respect to numerical solution of advection equation is incorporation of PML to emulate the absence of the right boundary condition in the grid.Hence, we modify the advection equation into following: where where l is width of calculation domain along axis x without PML, d is width of PML and W , m are PML parameters and formal assumption f | x=l+d ≡ 0 .

Parallel Algorithm
There are two possible ways of parallel implementation of our numerical method with use of p processors.First of all, we can perform simultaneous calculation by allocation of fixed particle size intervals to different processors.This is one-dimensional domain decomposition along axis v.But this option turns out to be bad due to quite poor parallel scalability of the algorithm for evaluation of the Smoluchoswki coagulation integrals what was reported in [16].
The reason lies in the fact that fast calculation of coagulation convolution typed integrals via FFT has a very limited resource of speedup on real parallel systems.Here we refer to work [9] where authors demonstrated that main problem in parallel execution of one-dimensional FFT is  One parallel time-integration step with use of the suggested parallel algorithm.Areas bordered by black lines represent the grid areas allocated to separated processors problematic due to extremely high communication costs (comparable with computational).Such bad parallel scalability of evaluation of the coagulation integrals also leads to loss of motivation to exploit two-dimensional domain decomposition along both v and x coordinates in application to our problem.However another option of execution of parallel calculations is to distribute them along the spatial coordinate x.Such distribution of data among processors corresponds exactly to uniform one-dimensional domain decomposition into the segments along spatial coordinate x.This approach allows to obtain very modest requirements in terms of number of communications -it is necessary just to synchronize the boundaries of spatial segments corresponding to neighboring processes for evaluation of advection operator with use of TVD-scheme stencil.In other words, we can allocate integrals at different points of the physical space to different processors and then transmit the data at the borders of the corresponding intervals.Thus, on the input each from p processes has set of grid nodes X and values of f n (x i , v j ) at this domain X × (v 1 , . . ., v M ).
Total complexity of time-integration step for each processor is O N RM log M p operations and requires O(pM ) data elements for O(p) data exchange operations.
We present our approach informally in Fig. 1.It shows the structure of the parallel numerical scheme and the areas of the grid which communicate with each other.Areas bordered by black lines represent the grid areas allocated to separated processors.Red stripes indicate how calculations are performed.Blue stripes pose organization of data transmission between the neighboring processes.
During stage corresponding to Fig. 1a each process evaluates the coagulation integrals along vertical axis v (lines 1-3 the pseudocode) and keeps two additional vectors for the boundaries assignment (presented as white stripes).Stage from Fig. 1b corresponds to synchronization of the particle size distributions along axis v (from blue stripes into red of the neighboring process) for the edge coordinates x between the processes (lines 4-5 of the pseudocode).It is necessary for the correct work of the TVD-scheme stencil.During stage from Fig. 1c the advection operator is evaluated along axis x for each particle size v (horizontal stripes corresponding to lines 6-10).All in all, during stage from Fig. 1d we provide calculation of the time-step result for both coordinates v and x (lines 11-16 of the pseudocode).
However, we should mention that grid parameter affects the parallel the efficiency of presented algorithm.Number of grid points along spatial coordinate x must be greater or equal to number of used processes.Otherwise there will be a lack of opportunity to distribute the data among the processors.We assume that the best parallel performance can be achieved in case of uniform domain decomposition when number of grid points along x can be divisible by a number of used processes.In the next section we present benchmarks of the proposed algorithm.
Algorithm 1 Operations performed by each process for j ∈ V do 13: end for 15: end for 16: return f n+1 (x i ; v j ) defined for X × (v 1 , . . ., v M )

Benchmarking Parallel Implementation
In our numerical tests we investigated the Cauchy problem for the advection-coagulation equation with zero-value initial conditions and the following left boundary condition: Thus, we model a constant flow of particles from one side of the grid to the other.Also, calculations differ with respect to coagulation kernels.Tests were conducted for constant kernel K(u, v) ≡ 1, which rank R = 1 and for ballistic coagulation kernel with greater values of ranks.Grid parameters were set as ∆x = ∆v = 10 −2 and ∆t = 0.005.Scalability of parallel implementation was tested on several different computing clusters, so we assume that the results are robust.Moreover, our calculations also differ with respect to mesh and coagulation kernel.Surprisingly, even for a system with slow interconnect (see Tab. 4 increasing costs of data exchanges between processors after each time integration step.Speedup also saturates with satisfactory results when the size of problems is not large enough (see Tab. 2).

Conclusions
In this work we propose a parallel implementation of the numerical method solving advection-coagulation equation from work [28].The numerical scheme is parallelized along spatial axis x which promotes a good speedup of calculations.In the presented benchmarks we show that our scheme of finding the numerical solution of the problem might be accelerated by hundreds of times giving an almost linear speedup with respect to the amount of used CPU cores.
There is also an opportunity for additional increase of performance of the presented method.As far as coalescence integrals are evaluated sequentially at each processor their calculation can be additionally accelerated by use of GPU and Cuda technology.We will apply our ideas to more intricate multicomponent coagulation models [7,15,23] requiring special implementations of multidimensional convolutions based on utilization of low-rank tensor decompositions [15,21].
(a) Computation of S(f n ) (b) Data exchange (c) Computation of A(f n ) (d) Calculation of f n+1

Figure 1 .
Figure 1.One parallel time-integration step with use of the suggested parallel algorithm.Areas bordered by black lines represent the grid areas allocated to separated processors

Table 1 .
Strong scalability of the proposed parallel algorithm for different clusters.Visualized data corresponds to results presented in Tab. 1 -4.We obtain almost linear speedup if number of processors is less than or equal to 128 Supercomputer Lomonosov, tests for regular4 partition with 40 Gb/s QDR interconnect.Constant coagulation kernel, R = 1, N = M = 8192, 32 time-integration steps Supermicro cluster of Skoltech 1 Gb/s Ethernet) and for heterogeneous cluster of Marchuk Institute of Numerical Mathematics (see Tab. 2) we obtain a reasonable speedup of calculations.Besides the fact that we obtain a good scalability of the algorithm for different clusters, we also see its good performance for different problem sizes (for each of Tab. 1 -4 we tried a different problem size).Tables 1 -4 and Fig.2indicate that the scalability of parallel program is almost linear at the amount of nodes per dimension lower than or equal to 128.If the number of processors goes beyond that number, the payoffs from parallel implementation slowly decrease due to

Table 2 .
Cluster of Marchuk Institute of Numerical Mathematics RAS; test for heterogeneous cluster consisting of many different models of CPUs with 32 Gb/s QDR interconnect.Ballistic coagulation kernel, R = 12, N = M = 2048, 32 time-integration steps