String-Wave Direct Parallel Solver for Sparse System of Linear Equations

The article discusses a parallel algorithm of solving linear algebraic equations systems for symmetric sparse matrices, which allows to split a large task into many small subtasks, thereby both increasing performance and reducing memory consumption. It is based on a method of simultaneous calculation of intermediate values during matrix factorization with maintaining load balancing on processors so that when the ﬁnal result of the left parts of the factorization is obtained, the right parts of the factorization do not depend on them. This approach allows the initial stiﬀness matrix to be represented as a product of a large number of simple matrixes and solve a system of linear algebraic equations in the form of a sequence of solutions by substitution. To reduce the ﬁlling of sparse factorization matrices, an approximate minimum degree method was used, which, in addition to being one of the most eﬃcient and fastest ones existing at the moment, allows the developed algorithm to distribute the load of calculations more evenly. The developed method is implemented in APM Ltd. software products for systems with shared memory, but it can also be performed for distributed memory systems.


Introduction
Currently, the finite element method (FEM) is actively used for engineering calculations in the field of construction and machine-building [1].The use of FEM involves solving systems of linear algebraic equations of large dimension.The matrices of such systems, called stiffness matrixes, have a number of properties that depend on the structure and number of nodes of the finite element grid, as well as the nature of the problem.These matrices are symmetric and sparse to a considerable extent.
With the help of the system of linear algebraic equations, such tasks as: static, dynamic calculation, calculation of natural frequencies, stability, nonlinear calculations, etc. are solved.In general, system of linear algebraic equations has the following form [2].
where K -is the stiffness matrix, X -are displacement vectors, F -are load vectors for a finite element system, n -is the dimension of the stiffness matrix; m -is the number of loads.Due to the fact that the stiffness matrix is sparse, it is presented in the CSR (Compressed Sparse Rows) [3] string sparse form, where the storage structure uses three one-dimensional arrays: 1) an array of non-zero elements of the stiffness matrix line by line; 2) an array of column numbers of non-zero elements line by line; 3) an array of the location of the first non-zero element in each row.This format allows to reduce the amount of source data and decrease the number of operations performed on them significantly, i.e., when calculating, memory consumption is reduced and performance increases.
The article is organized as follows.Section 1 is devoted to the problem of factorization of symmetric sparse matrices and the need to reorder them.In section 2 we describe the main methods for reordering sparse matrices and compares the performance of the AMD and Cuthill-McKee algorithms.Section 3 contains a description of the developed string-wave algorithm for solving system of linear algebraic equations and computational experiments of its application.Conclusion summarizes the study and points directions for further work.

Matrix Factorization
Today, two different approaches to solving systems of linear algebraic equations are used: direct solution and iterative methods.Each approach has certain advantages, but direct methods are of the most common use.The main problem of iterative methods is the frequent poor conditionality of stiffness matrices of large dimension when its determinant is equal to a small value or tends to zero.And even the preconditioning process does not absolutely guarantee the convergence of the solution.In order to simplify the solution of system of linear algebraic equations with a symmetric stiffness matrix, the following factorization is used [4].
where L -is the lower triangular matrix (stored in sparse format), in which all diagonal elements are equal to one, D -is the diagonal matrix (stored as a one-dimensional array).The solution of system of linear algebraic equations LDL T X = F in this case is reduced to the sequential solution of three simplest issues by the substitution method.
The main problem with this factorization is filling a sparse triangular matrix L with a large number of non-zero elements, which entails significant memory consumption and an increase in the number of computational operations on the elements of this matrix.Therefore, before the factorization procedure, the rows and columns of the original stiffness matrix are reordered.To do this, we find a special permutation matrix P such that

Permutation of Stiffness Matrix
The permutation matrix P is a matrix in which each row and column contains only one element equal to one, and the others equal to zero.This matrix is stored as an array of column numbers with single elements.To find the permutation matrix, the Cuthill-McKee algorithm is often used due to the simplicity of its implementation, but such an algorithm is not optimal from the point of reducing the filling of factorization matrices with nonzero elements.Algorithms based on the application of the minimum degree method are more effective.In practice, this method, in its original form, is not used due to its considerable complexity.Two of its modifications are widespread: the Multiple Minimum Degree method (MMD) and the Approximate Minimum Degree method (AMD) [5].The latter is used in the software products of the company APM Ltd. Figure 1 shows the result of calculating the lower triangular matrix L with permutation using the Cuthill-McKee algorithm and the approximate minimum degree method.Table 1 shows the number of received non-zero elements of this matrix.After the permutation procedure, the factorization of the stiffness matrix is performed directly.In the classical version, LDL T factorization calculated as follows

Description of the Algorithm
Expression (5) does not lend itself well to parallelization due to the large number of information dependencies.In addition, the calculations are not balanced, which significantly reduces their efficiency [6].To eliminate these problems, an algorithm adapted for parallel computing was developed.The triangular matrix of factorization L can be visually represented as a musical instrument like a harp (Fig. 2).Each harp string corresponds to a column of the matrix L. The strings are played alternately.Playing on the j-th string means finding the final values of the elements of the j-th column L. At the same time, each string propagates an undamped sound wave towards the strings located to the right of it.That is, each j-th column of the matrix L partially changes the values of the elements of all columns with a number higher than j.Thus, it is possible to calculate the effect of the current column on all columns located to the right in parallel.

Figure 2. Visual representation of the factorization matrix in the form of a harp
Mathematically, the string-wave direct solver can be described as follows.First, the view matrices are assigned the corresponding values of the stiffness matrix.
Further, the elements of these matrices are changed with accordance to the algorithm described above.
Since the factorization matrix L is sparse, not all right columns are affected by the current one.In the course of the conducted research, it was revealed that only those columns which number corresponds to the row numbers of non-zero elements of the current column change.This circumstance makes it possible to eliminate unnecessary iterations of the parallel algorithm cycle.When solving system of linear algebraic equations of large dimensions by a direct method, there is often not enough memory.The string-wave algorithm allows to reduce the size of the problem, for example, by applying the following factorization.
where L 1 are the first columns of the triangular factorization matrix L, B is a submatrix of smaller dimension obtained by the influence of L 1 (Fig. 3).In turn, a similar factorization can Figure 3. Reducing the dimension of the problem be applied to the matrix B. As a result, the string-wave direct solver allows us to obtain the following expression This factorization of the stiffness matrix allows to split a large problem into many small subtasks, where the solution of system of linear algebraic equations is also found by substitution Speedup and efficiency parameters were used to evaluate the efficiency of parallel computing [7].Speedup refers to the execution time ratio of a sequential algorithm to the execution time of a parallel algorithm on p processors The efficiency of using processors by a parallel algorithm in solving a problem is defined as the ratio of acceleration to the number of processors The results of computational experiments evaluating the effectiveness of the parallel algorithm are presented in Tab. 2. The algorithm is implemented using OpenMP [8].Calculations were performed on a six-core Intel Core i7-8700K processor with 32 GB of DDR4 RAM.To conduct computational experiments, the APM Structure3D module of APM WinMachine software was used, in which models of various dimensions were loaded and when measuring the execution time of the parallel algorithm, the time for the formation of the stiffness matrix was taken into account.Thereby, the string-wave direct solver effectively works in conjunction with the AMD reordering algorithm, breaking a large task into many small subtasks with the least information dependencies, which allows its parallel implementation to achieve optimal values of speedup and efficiency parameters.

Conclusion
The article presents a new algorithm for solving systems of linear algebraic equations for symmetric sparse matrices.This algorithm includes the AMD method for reordering sparse matrices and the string-wave method for solving system of linear algebraic equations of our own design.It is successfully used in software products of the company APM Ltd.The developed solver is used in such software modules as static and dynamic calculations, calculation of natural frequencies and stability when solving a generalized eigenvalue problem, calculation of harmonic vibrations and many others.
The developed algorithm is parallel and implemented using OpenMP technology.Computational experiments show that with an increase in the number of processors and the volume of input data, the speedup increases without a decrease in efficiency.That is the developed parallel algorithm is scalable.
Further research will focus on improving its performance and implementation using CUDA technology for NVIDIA graphics accelerators and MPI technology for distributed memory systems.
This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.

Figure 1 .
Figure 1.The result of factorization after reordering the matrix

Table 1 .
Results of matrix reordering algorithms

Table 2 .
Evaluating the effectiveness of a parallel algorithm Dimension of the stiffness matrix, n Speedup, S p Efficiency, E p