Accelerating advanced preconditioning methods on hybrid architectures

Many problems, in diverse areas of science and engineering, involve the solution of large-scale sparse systems of linear equations. In most of these scenarios, they are also a computational bottleneck, and therefore their efficient solution on parallel architectures has motivated a tremendous volume of research. This dissertation targets the use of GPUs to enhance the performance of the solution of sparse linear systems using iterative methods complemented with state-of-the-art preconditioning techniques. In particular, we study ILUPACK, a package for the solution of sparse linear systems via Krylov subspace methods that relies on a modern inverse-based multilevel ILU (incomplete LU) preconditioning technique. We present new data-parallel versions of the preconditioner and the most important solvers contained in the package that significantly improve its performance without affecting its accuracy. Additionally, we enhance existing task-parallel versions of ILUPACK for shared- and distributed-memory systems with the inclusion of GPU acceleration. The results obtained show a considerable reduction in the runtime of the methods, as well as the possibility of addressing large-scale problems efficiently.


Introduction
Sparse systems of linear equations appear in many areas of knowledge, such as circuit simulation, optimal control, quantum mechanics or economics [1,2]. For example, they emerge as a natural consequence of the discretization of PDEs, which makes their efficient solution extremely relevant. A number of current real-world applications (for example, three-dimensional PDEs) involve linear systems with millions of equations and unknowns. Direct solvers such as those based on Gaussian Elimination (GE) [3], which apply a sequence of matrix transformations to reach an equivalent but easier-to-solve system, today fall short when solving large-scale problems because of their excessive memory requirements, impractical time to solution and implementation complexity.
In these cases, iterative methods pose an appealing alternative. Methods of this sort can be interpreted as searching for the solution of the linear system inside a predefined subspace, taking one step in a given search direction at each iteration of the solver. A large number of methods exist that differ from each other in how this subspace is defined, and in the criteria under which the search direction and step size are chosen.
In general, it is desired that the iterative method rapidly improves the initial guess, with the purpose of finding an acceptable solution to the system in a small number of steps. However, this is often impaired by unavoidable rounding errors due to the use of finite-precision arithmetic, which are amplified according to the numerical properties of each problem.
To remedy these shortcomings, preconditioning techniques are applied. In a broad sense, these techniques aim to transform the original linear system into an equivalent one that presents better numerical properties, so that an iterative method can be applied and converge faster to a solution.
The development of effective preconditioners is an active field of research in applied mathematics. Among the most versatile and widely-used general-purpose preconditioners, Incomplete LU (ILU) factorizations occupy a place of privilege. These are based on computing an LU factorization of the matrix where some of the entries that become nonzero during the process are dropped (replaced by 0) to keep the factors sparse.
However, an active line of research is devoted to make ILUs more efficient and reliable, so they can be applied in a wider spectrum of problems.
Among the most recent advances in this field, ILUPACK (http://ilupack.tu-bs.de) stands out as a package for the solution of sparse linear systems via Krylov subspace methods that relies on an inverse-based multilevel ILU (incomplete LU) preconditioning technique [4]. A remarkable characteristic of ILUPACK is its unique control of the growth in the magnitude of the inverse of the triangular factors during the approximate factorization process.
Unfortunately, the favorable numerical properties of ILUPACK's preconditioner in the context of an iterative solver come at the cost of expensive construction and application procedures, especially for large-scale sparse linear systems. This high computational cost motivated the development of task-parallel implementations of ILUPACK for shared-memory and message-passing platforms [5,6,7]. However, despite showing good performance and scalability results, these variants of ILUPACK are limited to the solution of symmetric positive-definite (SPD) linear systems, and they slightly modify the preconditioner to exploit task-parallelism, which can impact its numerical properties and convergence.
Due to their rapid development over the last two decades, GPUs are now ubiquitous, making their efficient use crucial to obtain good performance on the most recent hardware for scientific computing. Even in sparse linear algebra, where the computational intensity of the operations is generally low and does not allow full advantage to be taken of the computational power that GPUs provide, the high memory bandwidth of these devices offers significant acceleration opportunities.
Prior to this dissertation, no parallel versions of ILUPACK existed aside from the above-mentioned task-parallel implementations. It is therefore interesting to analyze the data-level parallelism in ILUPACK, on the one hand to develop new efficient parallel versions that do not suffer from the limitations of the previous variants, and on the other to enhance the performance of those variants. In this line, the use of hardware accelerators, and in particular of GPUs, is a valuable tool and a clear path to explore.
This work extends the summary of the dissertation submitted to the Latin-American Contest of PhD Theses (CLTD) of CLEI in 2019. The main new contributions summarized in this manuscript are:
1. The evaluation of a task- and data-parallel implementation of the BiCG solver for single-GPU systems that relies on GPU streams (Section 5.2).
2. A detailed evaluation of the data-parallel ILUPACK preconditioner using synchronization-free sparse triangular solvers, performed on 86 matrices from the SuiteSparse matrix collection (Section 6).
3. The development of a data-parallel variant of the BiCGStab method with the ILUPACK preconditioner, especially tailored for low-power devices. The experimental evaluation is performed on the Jetson AGX Xavier board, one of the latest low-power computing platforms from NVIDIA (Section 8).

Objectives
The main goal of the thesis is to advance the state of the art in efficient parallel implementations of modern preconditioning techniques. In particular, we are interested in the use of hardware accelerators to leverage the data-parallelism of iterative solvers working together with advanced incomplete-factorization methods. As discussed previously, ILUPACK is a prominent example of such solvers, but its remarkable numerical results come at the expense of a high complexity, which implies costly construction and application procedures. Therefore, to achieve our principal goal we have set the following specific objectives:
• Enable the use of the GPU to accelerate ILUPACK's multilevel preconditioner, harnessing the data-parallelism in the operations that compose its application for different matrix types.
• Evaluate and improve ILUPACK solvers, identifying the principal factors that limit their performance.
• Extend ILUPACK by adding new accelerated solvers.
• Enhance the existing task-parallel versions of ILUPACK by enabling the exploitation of data-parallelism.

Sparse systems of linear equations
Gaussian Elimination can only take partial advantage of the coefficient matrix being sparse. Although there are some exceptions for specific classes of matrices, such as banded or tridiagonal ones, the problem of fill-in makes it necessary to abandon this technique for the general sparse case in large-scale scenarios.
An iterative method for the solution of linear systems is one that, starting from an initial guess x_0 to the solution vector x, is capable of iteratively refining it until (under certain conditions) an approximation x_k that is acceptably close to x is reached, i.e., ‖Ax_k − b‖ is relatively small for some vector norm. In order to avoid the difficulties that make GE impractical, the process should preserve the sparsity of the problem throughout. This is achieved by relying on a matrix primitive that is well suited for sparse matrices, namely the matrix-vector product.
The intuitive concept behind these methods is that one should be able to improve an approximate solution x_k by finding an appropriate linear combination of b, x_k and Ax_k. If one is to use all the information available in the expression Ax = b, it is reasonable to set the initial guess x_0 to some multiple of b, which implies that x_0 ∈ span{b}. Then, following the previous idea, we should replace the current solution x_0 with a linear combination of itself and Ax_0. In this case the current solution x_1 belongs to the subspace span{b, Ab}. After k iterations, it is clear that the current iterate will belong to

K_k(A, b) = span{b, Ab, A^2 b, ..., A^(k-1) b}.

This is called the Krylov subspace of dimension k for A and b.
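As a small illustration (a sketch of ours, not taken from the dissertation), the following Python snippet checks numerically that the iterates of a simple Richardson iteration, x_{j+1} = x_j + ω(b − Ax_j) with x_0 = ωb, indeed stay inside the Krylov subspaces defined above:

```python
import numpy as np

def krylov_basis(A, b, k):
    """Columns span K_k(A, b) = span{b, Ab, ..., A^(k-1) b}."""
    V = np.empty((b.size, k))
    V[:, 0] = b
    for j in range(1, k):
        V[:, j] = A @ V[:, j - 1]
    return V

rng = np.random.default_rng(0)
A = np.eye(5) + 0.1 * rng.standard_normal((5, 5))
b = rng.standard_normal(5)

# Richardson iteration: every iterate is a linear combination of the
# previous one and its product with A, so x_k lies in K_{k+1}(A, b).
omega = 0.5
x = omega * b
iterates = [x.copy()]
for _ in range(3):
    x = x + omega * (b - A @ x)
    iterates.append(x.copy())

for k, xk in enumerate(iterates):
    V = krylov_basis(A, b, k + 1)
    coeffs, *_ = np.linalg.lstsq(V, xk, rcond=None)
    # x_k is (numerically) representable in the Krylov basis.
    assert np.linalg.norm(V @ coeffs - xk) < 1e-8
```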
The challenge of this sort of procedure lies on how to construct these linear combinations such that an acceptable approximation is reached in just a few iterations, and this is what gives place to a myriad of different methods. In fact, although there are optimal methods that are able to find the actual solution of the linear system in exact arithmetic, the convergence of the iterations will be determined by the spectral properties of the matrix, and the use of finite precision arithmetic can complicate things even further.
Fortunately, when the properties of the matrix A are not favorable, one can still multiply the coefficient matrix by a certain matrix M^(-1) and solve

M^(-1) A x = M^(-1) b

instead of Ax = b. Now the approximate solution x_k will belong to the subspace K_k(M^(-1)A, x_0). The matrix M is known as a preconditioner, as its purpose is to improve the "condition number" of the iteration matrix, which is closely related to the way a matrix amplifies the numerical errors due to finite precision when, for instance, performing a matrix-vector product. Broadly speaking, M^(-1) is chosen so that it resembles A^(-1) in some way, but this choice is anything but trivial and is the subject of active research. Moreover, if inverting M or solving a linear system with M is required, such inversion or system should be easier to solve than the original one, or the whole exercise would be pointless.
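The effect can be observed with the simplest preconditioner of all, the Jacobi (diagonal) preconditioner M = diag(A). The sketch below (our own toy example; the test matrix and tolerances are made up, and scipy's CG stands in for a generic Krylov solver) counts iterations with and without preconditioning:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy SPD tridiagonal matrix with a strongly varying diagonal, so that
# diagonal scaling noticeably improves the condition number.
n = 200
main = 2.0 + np.arange(n, dtype=float)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")
b = np.ones(n)

# Jacobi preconditioner: applying M^{-1} is a cheap elementwise scaling.
Minv = spla.LinearOperator((n, n), matvec=lambda v: v / main)

iters = {"plain": 0, "prec": 0}
def counter(key):
    def cb(xk):
        iters[key] += 1
    return cb

x1, info1 = spla.cg(A, b, callback=counter("plain"))
x2, info2 = spla.cg(A, b, M=Minv, callback=counter("prec"))

assert info1 == 0 and info2 == 0          # both converged
assert np.linalg.norm(A @ x2 - b) < 1e-3  # accurate solution
assert iters["prec"] <= iters["plain"]    # preconditioning does not hurt
```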

ILUPACK
Consider the linear system Ax = b, where A ∈ R^(n×n) is large and sparse, and both the right-hand side vector b and the sought-after solution x contain n elements. ILUPACK provides software routines to calculate an inverse-based multilevel ILU preconditioner M, of dimension n × n, which can be applied to accelerate the convergence of Krylov subspace iterative solvers. The implementation of all solvers in ILUPACK follows a reverse-communication approach, in which the backbone of the method is performed by a serial routine that is repeatedly called. This routine is re-entered at different points, and sets flags before exiting so that operations such as the SpMv, the application of the preconditioner, and convergence checks can then be performed by external routines implemented by the user. This is aligned with the decision adopted by ILUPACK to employ SPARSKIT 1 as the backbone of the solvers. When using an iterative solver enhanced with the ILUPACK preconditioner, the application of the preconditioner, which occurs (at least once) per iteration of the solver, is the most demanding task from the computational point of view.

The computation of the preconditioner proceeds in three steps:
1. Initially, a pre-processing stage scales A by a diagonal matrix D̂ ∈ R^(n×n) and reorders the result by a permutation P̂ ∈ R^(n×n): Â = P̂^T D̂ A D̂ P̂.
2. An incomplete factorization next computes Â = LDU + E, where L, U^T ∈ R^(n×n) are unit lower triangular factors, D ∈ R^(n×n) is (block) diagonal, and E is a small term that contains the entries dropped during the process. In some detail, Â is processed in this stage to obtain the partial ILU factorization

P^T Â P = [ B F ; G C ] ≈ [ L_B 0 ; L_G I ] [ D_B 0 ; 0 S_C ] [ U_B U_F ; 0 I ].

Here, P ∈ R^(n×n) is a permutation matrix, ‖L^(-1)‖, ‖U^(-1)‖ ≤ κ, with κ a user-predefined threshold, and S_C represents the approximate Schur complement assembled from the "rejected" rows and columns.
3. The process is then restarted with A = S_C (until S_C is void or "dense enough" to be handled by a dense solver), yielding a multilevel approach.
At level l, the multilevel preconditioner can be recursively expressed as

M_l = [ L_B 0 ; L_G I ] [ D_B 0 ; 0 M_(l+1) ] [ U_B U_F ; 0 I ],

where L_B, D_B and U_B are blocks of the factors of the multilevel LDU preconditioner (with L_B, U_B^T unit lower triangular and D_B diagonal), and M_(l+1) stands for the preconditioner computed at level l + 1.
This means that the application of the preconditioner requires, at each level, two sparse matrix-vector products (SpMv), the solution of two linear systems with a coefficient matrix of the form LDU, and a few vector kernels. The derivation of the procedure is described in [8].
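The per-level structure (two LDU-type solves, two SpMvs, and a recursive call on the Schur complement) can be sketched as follows. This is a minimal dense Python model of ours, not ILUPACK code: it uses exact block factorizations and dense solves as stand-ins for the incomplete factors and sparse triangular solves, so the "preconditioner" it builds is actually an exact solve, which makes it easy to verify.

```python
import numpy as np

def build_levels(A, nB=2):
    """Exact multilevel factorization (no dropping): each level keeps the
    leading block B and the rectangular blocks F, G, and recurses on the
    Schur complement S_C = C - G B^{-1} F."""
    levels = []
    while A.shape[0] > nB:
        B, F = A[:nB, :nB], A[:nB, nB:]
        G, C = A[nB:, :nB], A[nB:, nB:]
        levels.append({"B": B, "F": F, "G": G})
        A = C - G @ np.linalg.solve(B, F)   # (here: exact) Schur complement
    levels.append({"coarse": A})            # small dense coarsest level
    return levels

def apply_prec(levels, r):
    """Apply M^{-1} r: per level, two LDU-type solves and two SpMvs."""
    lev = levels[0]
    if "coarse" in lev:
        return np.linalg.solve(lev["coarse"], r)   # dense solve at the bottom
    nB = lev["B"].shape[0]
    w = np.linalg.solve(lev["B"], r[:nB])                 # first LDU solve
    zC = apply_prec(levels[1:], r[nB:] - lev["G"] @ w)    # SpMv + recursion
    zB = w - np.linalg.solve(lev["B"], lev["F"] @ zC)     # SpMv + second solve
    return np.concatenate([zB, zC])

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # well-conditioned test matrix
b = rng.standard_normal(8)
levels = build_levels(A.copy())
x = apply_prec(levels, b)
# With exact Schur complements, applying the "preconditioner" solves Ax = b.
assert np.allclose(A @ x, b)
```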

Task-parallel ILUPACK
The task-parallel version of ILUPACK employs Nested Dissection (ND) [9] to reorder the original matrix such that it can be partitioned into a number of decoupled blocks, and then delivers a partial Incomplete Cholesky (IC) factorization with some differences with respect to the sequential procedure. The main change is that the computation is restricted to the leading block, so that the rejected pivots are moved to the bottom-right corner of the leading block instead of the bottom-right corner of the whole matrix; see Figure 1. Although the recursive definition of the preconditioner is still valid in the task-parallel case, some recursion steps are now related to the edges of the corresponding preconditioner DAG. Therefore, different DAGs involve distinct recursion steps, yielding distinct preconditioners, which nonetheless exhibit numerical properties close to those of the preconditioner obtained with the sequential ILUPACK.
As the definition of the recursion is maintained, the operations to apply the preconditioner remain valid. However, to complete the recursion step in the task-parallel case, the DAG has to be traversed twice per solve z_(k+1) := M^(-1) r_(k+1) at each iteration of the Preconditioned Conjugate Gradient (PCG) solver: once from bottom to top, and a second time from top to bottom (with the dependencies/arrows reversed in the DAG). More details can be found in [6]. The cost of the preconditioner application is, at the same time, dominated by sparse triangular system solves and sparse matrix-vector products.

Related work
Many research works have reported important benefits for the solution of sparse linear algebra problems on many-core platforms. However, many of these efforts address only non-preconditioned versions of the methods, where the sparse matrix-vector product (SpMv) is the main bottleneck. For example, Buatois et al. [10] implemented the CG method in conjunction with a Jacobi preconditioner, using the block compressed sparse row (BCSR) format. Later, Bell and Garland [11] addressed the SpMv, including several sparse storage formats, that became the basis for the development of the CUSP library [12].
A parallel CG solver with preconditioning for the Poisson equation optimized for multi-GPU architectures was presented by Ament et al. [13]. Sudan et al. [14] introduced GPU kernels for the SpMv and the block-ILU preconditioned GMRES in flow simulations, showing promising speed-ups. At the same time, Gupta completed a master thesis [15] implementing a deflated preconditioned CG for Bubbly Flow.
Naumov [16] produced a solver for triangular sparse linear systems in a GPU, one of the major kernels for preconditioned variants of iterative methods such as CG or GMRES. Later, the same author extended the proposal to compute incomplete LU and Cholesky factorizations with no fill-in (ILU0) in a GPU [17]. The performance of the aforementioned algorithms strongly depends on the nonzero pattern of the coefficient matrix.
Li and Saad [18] studied the GPU implementation of SpMv with different sparse formats to develop data-parallel versions of CG and GMRES. Taking into account that the performance of the triangular solve for CG and GMRES, preconditioned with IC and ILU respectively, was rather low on the GPU, a hybrid approach was proposed in which the CPU is leveraged to solve the triangular systems.
In 2016, Lukarski et al. [19] proposed a hybrid CPU-GPU implementation of multi-elimination ILU preconditioners on the GPU.
Recently, He et al. [20] presented a hybrid CPU-GPU implementation of the GMRES method preconditioned with a threshold-based ILU preconditioner that controls the fill-in of the factors. In the same work, the authors also propose a new algorithm to compute the SpMv on the GPU.
Some of these ideas are currently implemented in libraries and frameworks such as CUSPARSE, CUSP, CULA, PARALUTION and MAGMA-sparse.

Data-parallel variants
ILUPACK's multilevel preconditioner is stored as a linked list of structures that contain the information computed at each level. Concretely, a level contains pointers to the submatrices that form the ILU factorization: the B submatrix that comprises the LDU-factored upper-left block and the G, F rectangular matrices, along with the diagonal scaling and permutation vectors that correspond to D̂, P̂, and P; see Section 2.1. The case of symmetric matrices is analogous; the differences are that only the lower triangular factor L and the (block-)diagonal D of the LDL^T factorization are stored explicitly, and that G is not stored because it equals F^T.
As shown in Figure 2, the computational cost required to apply the preconditioner is dominated by the sparse triangular system solves (SpTrSV) and SpMvs. The Nvidia cuSparse library [21] provides efficient GPU implementations of these two kernels that support several common sparse matrix formats, so it is convenient to rely on this library. The rest of the operations are mainly vector scalings and re-orderings, which gain certain importance only for highly sparse matrices of large dimension, and are accelerated in our codes via ad-hoc CUDA kernels.

Given the modified-CSR (MCSR) storage layout adopted by ILUPACK for the LDL^T and LDU factors, and the native format handled by the cuSparse library (CSR), a layout reorganization is necessary before the corresponding kernel can be invoked. In the current implementation, this process is performed by the CPU during the calculation of the preconditioner. The analysis phase required by the cuSparse solver, which gathers information about the data dependencies and aggregates the rows of the triangular matrix into levels, is executed only once for each level of the preconditioner, and runs asynchronously with respect to the host CPU.
In the Symmetric Positive-Definite (SPD) case, the matrix-vector products that need to be computed during the application of the preconditioner, at each level, are of the form x := Fv and x := F^T v. To avoid the use of the slow SpMvT operation provided by cuSparse, we allocate both F and F^T on the GPU, using a cuSparse routine to transpose the matrix in the device. We find this performance-storage trade-off acceptable, since the memory footprint of the symmetric solver does not exceed that of the general non-symmetric case. However, in the case of the BiCG solver, the application of the preconditioner involves G and F as well as their respective transposes. Here, storing both transposes can significantly exceed the memory requirements of the non-symmetric versions of ILUPACK, as two additional matrices need to be stored. For this reason, our approach in this case keeps only G and F in the accelerator, and makes use of the cuSparse SpMvT operation.
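The two options, explicit transpose storage versus an implicit transposed product, can be mimicked on the CPU with scipy (a toy illustration of the trade-off only; matrix sizes and names are made up, and scipy's CSC view stands in for cuSparse's slower SpMvT path):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(2)
F = sp.random(6, 4, density=0.4, format="csr", random_state=2)

# Option 1 (SPD solver): store F and an explicit F^T, both in CSR, so that
# every product is a plain row-major SpMv; memory for F is doubled.
Ft = F.T.tocsr()            # analogous to transposing once on the device

v = rng.standard_normal(4)
w = rng.standard_normal(6)
y1 = F @ v                  # x := F v
y2 = Ft @ w                 # x := F^T w without a transposed-SpMv kernel

# Option 2 (BiCG, memory-constrained): keep only F and multiply by its
# transpose implicitly, mirroring the slower transposed-SpMv path.
y2_implicit = F.T @ w

assert np.allclose(y2, y2_implicit)
```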
As can be observed in Figure 3, the acceleration factors measured for a scalable 3D PDE and for benchmark problems from the SuiteSparse Matrix Collection (SSMC) 2, on an Intel Core i3-3220 CPU connected to an Nvidia Tesla K20 GPU, are between 1.5× and 6×.
In the case of GMRES, we managed to accelerate the application of the preconditioner by similar factors, which turned the orthogonalization of the Krylov basis into a relevant fraction of the runtime. For this reason, we developed a GPU version of the orthogonalization procedure together with a hybrid implementation of GMRES that leverages the best type of architecture for each operation while, at the same time, limiting the data transfers. Since we already off-loaded the SpMv appearing in GMRES to the accelerator, the basis vectors of the modified Gram-Schmidt orthogonalization (MGSO) reside in the GPU. The output of the process is the current basis vector, orthogonalized with respect to the remaining vectors, and the coefficients of the current row of the Hessenberg matrix. This matrix is small, and the subsequent application of rotations and triangular solve expose little parallelism, so it is natural to keep it in the CPU memory. The coefficients of the matrix are calculated serially via vector products with the basis vectors, so the GPU computes n flops for each coefficient that is transferred back to the CPU. A similar approach was studied in [20].

Table 1 shows the results for the enhanced variant of GMRES, with speed-ups for the orthogonalization procedure between 3× and 7×. Combining this with the previous enhancements, we obtain speed-ups of up to 5.8× for the entire solver.
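The MGSO step described above can be sketched in a few lines of Python (our own textbook-style sketch, not ILUPACK's implementation): each Hessenberg coefficient is a dot product against one basis vector, which in the hybrid scheme is computed on the GPU and shipped back to the CPU as a single scalar.

```python
import numpy as np

def mgs_step(V, w):
    """One modified Gram-Schmidt step of the Arnoldi process: orthogonalize
    w against the columns of V. The coefficients h form one column of the
    (small) Hessenberg matrix, which the hybrid scheme keeps on the CPU."""
    k = V.shape[1]
    h = np.zeros(k + 1)
    for j in range(k):
        h[j] = V[:, j] @ w          # dot product (on the GPU in the hybrid
        w = w - h[j] * V[:, j]      # scheme), then an axpy update
    h[k] = np.linalg.norm(w)
    return w / h[k], h

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 20))
v0 = rng.standard_normal(20)
V = (v0 / np.linalg.norm(v0)).reshape(-1, 1)
for _ in range(5):
    v_new, h = mgs_step(V, A @ V[:, -1])
    V = np.column_stack([V, v_new])

# The Krylov basis is orthonormal up to rounding.
assert np.allclose(V.T @ V, np.eye(V.shape[1]), atol=1e-10)
```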

Leveraging task and data parallelism in BiCG
The BiCG method was first derived by Lanczos [22] in 1952 as a variation of the two-sided Lanczos algorithm to compute the eigenvalues of a non-symmetric matrix A. In a broad sense, it is based on maintaining two parallel recurrences, one for matrix A and the other for A^T, and imposing bi-conjugacy and bi-orthogonality conditions between the vectors of each recurrence. Figure 6 offers an algorithmic description of the method, detailing the computational kernel that corresponds to each operation.
It follows from the algorithmic description of the BiCG that, contrary to most iterative linear solvers, in which strict data dependencies serialize the sequence of kernels appearing in the iteration, the two recurrences involved in the BiCG method are quasi-independent. Moreover, in the preconditioned version of the method, there is no data dependence between the application of the transposed and non-transposed preconditioner inside the iteration, exposing coarse-grain parallelism at the recurrence level.

[Figure 6: Algorithmic formulation of the preconditioned BiCG method. Each operation of the iteration (the updates of s_k, p_k, q_k and the scalars ρ_k, τ_k) is annotated with its computational kernel: SpMv plus preconditioner application, axpy, or dot product.]

Considering this context, we first rearranged the operations in the BiCG method so that these two sequences are isolated, making it possible to execute them concurrently. This idea is summarized in Figure 7, where we group the operations of the BiCG in three sets, namely Set A, Set B, and a third set that contains the rest of the operations. Each of the first two sets contains a single SpMv and the application of one of the preconditioners. Although Set A also contains a dot product and two axpy operations before the synchronization point, these kernels have almost negligible computational cost in general, and this distribution of the workload can be expected to be fairly well balanced.
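The grouping can be made concrete with a textbook preconditioned BiCG (Templates-book formulation; this is our own sketch, with the iteration rearranged only in comments, not ILUPACK's rearranged code): the kernels touching A and M^(-1) form one recurrence, those touching A^T and M^(-T) the other, and the two only meet at the dot products.

```python
import numpy as np

def bicg(A, b, M_inv, MT_inv, tol=1e-10, maxiter=200):
    """Preconditioned BiCG. 'Set A' kernels use A and M^{-1}; 'Set B'
    kernels use A^T and M^{-T}; the dot products are the synchronization
    points that couple the two quasi-independent recurrences."""
    n = b.size
    x = np.zeros(n)
    r = b - A @ x
    rt = r.copy()
    rho_old = 1.0
    p = np.zeros(n)
    pt = np.zeros(n)
    for _ in range(maxiter):
        z = M_inv(r)              # Set A: apply preconditioner
        zt = MT_inv(rt)           # Set B: apply transposed preconditioner
        rho = rt @ z              # sync: dot product couples both sets
        beta = rho / rho_old
        p = z + beta * p          # Set A: axpy
        pt = zt + beta * pt       # Set B: axpy
        q = A @ p                 # Set A: SpMv
        qt = A.T @ pt             # Set B: transposed SpMv
        alpha = rho / (pt @ q)    # sync: dot product
        x += alpha * p
        r -= alpha * q
        rt -= alpha * qt
        rho_old = rho
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 30)) + 30 * np.eye(30)  # non-symmetric test matrix
b = rng.standard_normal(30)
d = A.diagonal()                                     # Jacobi preconditioner
x = bicg(A, b, M_inv=lambda v: v / d, MT_inv=lambda v: v / d)
assert np.linalg.norm(A @ x - b) < 1e-6 * np.linalg.norm(b)
```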

Variant for two GPUs
In systems equipped with two discrete graphics accelerators, the arrangement of the operations of the BiCG proposed in Figure 7 makes it possible to execute each sequence in a different device until reaching the synchronization point. This enables the exploitation of coarse-grain task-level parallelism, using the two GPUs to concurrently execute the operations of the solver that belong to different sets, in conjunction with the data-level parallelism of each operation, which is leveraged inside each accelerator.
In addition to enabling concurrent execution, the presence of two GPUs allows us to avoid the transposed version of the cuSparse SpMv routine. Transposed operations are required because the BiCG method involves both the transposed coefficient matrix A^T and the transposed preconditioner. In particular, the solver performs an SpMv with the transposed matrix in each iteration, and the application of the transposed preconditioner involves two transposed SpMvs per level of the preconditioner in each iteration. In our baseline implementation, these transposed SpMvs are computed by calling the SpMvT kernel of cuSparse on the original non-transposed matrices. The execution times reached with this strategy show that the calls to SpMvT are, on average, 2-3× slower than those to SpMv, with one special case in which they become almost 7× slower. As a consequence, the application of the transposed preconditioner sometimes takes more than twice the time of its non-transposed counterpart.
As each GPU (1 and 2) has its own separate memory, we maintain the non-transposed operands in GPU 1 and the transposed ones in GPU 2, using the faster version of cuSparse SpMv in both devices. It should be noted that we are not wasting memory in this case, since using the transposed SpMv routine in the second GPU would also imply the storage of the original matrix in the device, at the same memory cost.
The copy of the data to both devices is performed concurrently with the construction of the preconditioner. In particular, the asynchronous transfer of one level takes place while the next one is being computed by the incomplete factorization procedure.
Regarding the concurrent execution in both devices, we use two CPU threads that concurrently enqueue work to each accelerator to ensure the correct overlapping of the two execution streams.
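The overlap of per-level transfers with the construction of the preconditioner can be mimicked with a toy host-side pipeline (our own illustration; the sleeps stand in for factorization and copy times, and a worker thread plays the role of the asynchronous copy engine):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def factor_level(l):
    """Stand-in for computing level l of the incomplete factorization (CPU)."""
    time.sleep(0.01)
    return f"level-{l}"

def upload_level(data):
    """Stand-in for the asynchronous host-to-device copy of one level."""
    time.sleep(0.01)
    return f"uploaded-{data}"

# Pipeline: while level l is being "uploaded" by the worker thread, level
# l+1 is being "factored" on the main thread, so the two costs overlap.
uploaded = []
with ThreadPoolExecutor(max_workers=1) as copier:
    pending = None
    for l in range(4):
        data = factor_level(l)                       # compute next level
        if pending is not None:
            uploaded.append(pending.result())        # previous copy done
        pending = copier.submit(upload_level, data)  # start async "transfer"
    uploaded.append(pending.result())

assert uploaded == [f"uploaded-level-{l}" for l in range(4)]
```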

Leveraging task-parallelism in the GPU using streams
In systems equipped with only one GPU, exploiting the task-level parallelism present in the BiCG method demands more elaborate solutions to obtain significant improvements.
Since the GPU offers the possibility of overlapping the execution of several kernels using streams, it is reasonable to offload the Set A and Set B blocks in Figure 7 to different GPU streams in the same device. However, the achieved overlap will depend on the resources that one kernel leaves available to the others [23]. If one kernel suffices to fully occupy the GPU, no overlapping will occur. To study the overlap that can be achieved between the two sequences using streams, we first evaluate the concurrent execution of two triangular system solvers, which is the main component of the application of the preconditioner.

Table 2 shows that, for the tested matrices, almost no overlapping could be achieved. Even when the matrices employed are relatively small (as in the case of matrix A050), the overlap obtained is negligible. This suggests that the use of GPU streams will not deliver any significant improvement in the parallelization of the BiCG solver.

Concurrent CPU-GPU variant
Since the use of GPU streams does not seem to offer any advantage with respect to the baseline data-parallel variant in single-GPU systems, an interesting alternative to enable task-level parallelism is to perform computations on the multicore CPU concurrently with computations on the GPU. In this sense, the Adv_BICG_Hyb_Cusp variant employs the multicore CPU to perform the transposed SpMv of BiCG using the multi-threaded MKL library. The rest of the computations are performed on the GPU using the cuSparse and cuBlas libraries. Figure 9 presents the timeline for the Adv_BICG_Hyb_Cusp version, extracted with the Nvidia Visual Profiler. The timeline consists of four horizontal lines divided into several bars corresponding to the duration of different tasks. The first line corresponds to the execution of CUDA API calls in the CPU thread (kernel launches, parameter setup, memory transfers, etc.); the second line also corresponds to the main CPU thread, and we use it to display the duration of the transposed SpMv on the CPU (yellow bar) and to aggregate the CUDA API calls that correspond to the application of the transposed preconditioner. The two lower lines correspond to each of the two GPU streams.
The figure shows that the transposed SpMv considerably overlaps with the application of the preconditioner in the GPU. Additionally, from the analysis of the orange bars in the first line, it follows that an important part of the CPU time is devoted to the processing of CUDA API calls, which we refer to as "kernel launch overhead". This overhead is mainly due to the solution of four triangular linear systems at each level of the ILUPACK preconditioner.
The solution of these triangular systems relies on the routine cusparseDcsrsv_solve of the cuSparse library, whose implementation is based on the so-called level-set strategy [24] and is described in [25]. This strategy consists of two main stages. In the analysis phase, the sparse triangular matrix is processed to determine sets of independent rows called levels. In the solution phase, the solver launches a GPU kernel for each of these levels, processing the rows that belong to the level in parallel. The number of levels that derive from the analysis can vary widely according to the nonzero pattern of each triangular matrix, and for matrices of considerable size it is usually in the order of hundreds or even a few thousands. In such cases, the overhead due to launching the kernels that correspond to each level can become significant [26]. Figure 9 illustrates how the launching of these kernels delays the start of the multi-threaded transposed SpMv on the CPU.


Self-scheduled sparse triangular solvers

As Figure 2 shows, the solution of sparse triangular linear systems is the operation that concentrates most of the computation time during the application of the preconditioner. Hence, improving the performance of this kernel can have a significant impact on the performance of ILUPACK. The parallel algorithms for this operation can be classified into two main contrasting categories. Apart from two-stage methods based on the level-set paradigm, which was briefly described in Section 5.3, there are one-stage methods that rely on a self-scheduled pool of tasks, where some of the tasks wait until the data necessary to perform their computations is made available by other tasks 3.
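The level-set analysis at the heart of the two-stage approach can be sketched in a few lines (a simplified sequential sketch of ours, not cuSparse's implementation): the level of a row is one more than the maximum level among the rows it depends on, and the number of levels equals the number of kernel launches in the solve phase.

```python
import numpy as np
import scipy.sparse as sp

def level_sets(L):
    """Level-set analysis of a sparse lower triangular matrix in CSR:
    level(i) = 1 + max level over columns j < i with L[i, j] != 0.
    Rows within a level are independent and can be solved in parallel;
    the solve phase launches one kernel per level."""
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        start, end = L.indptr[i], L.indptr[i + 1]
        deps = [level[j] for j in L.indices[start:end] if j < i]
        level[i] = 1 + max(deps, default=0)
    return level

n = 6
# Bidiagonal L: each row depends on the previous one -> n levels (serial).
Lb = sp.diags([np.ones(n), np.ones(n - 1)], [0, -1], format="csr")
assert list(level_sets(Lb)) == [1, 2, 3, 4, 5, 6]

# Diagonal matrix: no dependencies -> a single level (fully parallel).
Ld = sp.identity(n, format="csr")
assert list(level_sets(Ld)) == [1] * n
```

A matrix whose analysis yields hundreds of levels therefore triggers hundreds of kernel launches per triangular solve, which is exactly the launch overhead discussed above.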
In the previous sections, we relied on the solver bundled with the cuSparse library to perform this operation. This implementation, based on the two-stage paradigm, is publicly available, and it is constantly revised and adapted to the latest Nvidia GPU architectures. For several years, this has been the reference implementation for this operation on GPUs.
Recently, Liu et al. [34] proposed a new GPU solver for sparse triangular systems, for matrices stored in the CSC format, based on the self-scheduled strategy. The method relies on a lightweight analysis phase to determine the number of dependencies of each unknown (which is not trivial in the CSC format), and the use of atomic operations to manage the synchronization of the warps inside the GPU kernel. This new synchronization mechanism avoids the constant synchronization with the CPU that cuSparse suffers from.
In [35], we proposed a triangular solver for matrices stored in the CSR format based on Liu's approach. We developed several routines for this operation, outperforming cuSparse for a varied set of sparse matrices. Later, in [36], we applied a similar idea to the level-set preprocessing stage used by cuSparse, showing remarkable performance gains.
Our baseline algorithm for the parallel solution of a sparse triangular linear system advances row-wise: each warp processes one row and each thread is assigned one nonzero element. Each thread actively waits on the dependency represented by its nonzero entry until all the dependencies of the warp are satisfied, instead of relying on a fixed schedule of kernel launches determined by the level-set analysis of the sparse matrix, as in the cuSparse approach. The built-in warp-scheduling mechanism of the GPU prevents the warps from entering a deadlock.
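The self-scheduled behavior can be emulated sequentially: each row is a task that either completes or "spins" until its dependencies are done. This minimal Python sketch over raw CSR arrays mimics the busy-wait on per-row ready flags (on the GPU, a warp spins on these flags concurrently; here deferred tasks are simply retried in later sweeps):

```python
import numpy as np

def sync_free_solve(indptr, indices, data, b):
    """Sequential emulation of a self-scheduled lower-triangular CSR solve."""
    n = len(b)
    x = np.zeros(n)
    done = np.zeros(n, dtype=bool)            # per-row ready flags
    pending = list(range(n - 1, -1, -1))      # deliberately unfavorable order
    while pending:
        deferred = []
        for i in pending:
            lo, hi = indptr[i], indptr[i + 1]
            cols = indices[lo:hi]
            if not done[cols[cols < i]].all():
                deferred.append(i)            # task "spins": retry next sweep
                continue
            s, diag = b[i], 0.0
            for c, v in zip(cols, data[lo:hi]):
                if c == i:
                    diag = v
                else:
                    s -= v * x[c]             # dependency already resolved
            x[i] = s / diag
            done[i] = True                    # release rows waiting on i
        pending = deferred
    return x
```

Note that no analysis phase is needed: the order in which tasks complete emerges from the dependencies themselves.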
Unlike cuSparse and Liu's solution, our solver can run without any previous analysis phase. However, our strategy faces some limitations when the sparse matrix presents only a few nonzero elements per row. Concretely, as each warp (of 32 threads) processes one row, and each thread processes one nonzero, there can be a significant waste of threads (and cores). Moreover, the number of thread-blocks that execute concurrently on the multiprocessors of the GPU at a given time (active thread-blocks) is limited. Inactive thread-blocks have to wait until the active ones finish their execution before starting their own. To avoid deadlocks, the rows need to be processed in ascending index order, which means that a warp can be ready to execute (i.e., have no pending dependencies) but nevertheless belong to an inactive thread-block. This situation can be avoided by processing the rows in one of the possible orders that derive from the level-set analysis. The use of the analysis information to produce a fixed execution schedule can also be beneficial in cases where the sparse matrices present nonzero patterns that imply a high degree of dependence between their rows.
In [37] we developed new routines that, after performing a level-set analysis of the sparse matrix, leverage this information to overcome some of the above difficulties. In addition, the sf_Multirow routine partitions the warp into equally-sized sub-warps, such that each warp processes either 32 rows of 1 element, 16 rows of 2 elements, 8 rows of 3 or 4 elements, 4 rows of 5 to 8 elements, and so on. These rows must correspond to the same level-set, since they are processed in parallel. The rationale of this strategy is to mitigate the waste of resources, and the consequent performance degradation, experienced for rows with few nonzero elements. Moreover, the number of rows that are processed concurrently by active thread-blocks also increases. The procedure is described in Figure 10. The analysis phase produces three vectors: row_order, with the row indices of the matrix partially ordered according to their level-set and number of nonzero elements; base_index, which stores the first index in row_order to be processed by each warp; and part_size, which stores the size of the partitions of each warp. With this information, the warp loads the corresponding coefficients from global memory, waits until the dependencies are ready, performs the multiplication, and then each partition performs a parallel summation to obtain the corresponding unknown.

Figure 11: Speedup of the preconditioner application using the synchronization-free triangular solver, relative to the preconditioner application using cuSparse, for 86 matrices of the SuiteSparse matrix collection.

To assess the impact of our method in the context of ILUPACK, we compare the runtimes of one application of the preconditioner using the cuSparse solvers against one application using the sf_Multirow solver. At this point, we remark that the analysis phase required by the sf_Multirow solver is, in general, much faster than that of cuSparse.
However, we only consider the solution phase in this evaluation, since the analysis is performed only once, during the construction of the preconditioner. Figure 11 shows the speedup obtained by the preconditioner accelerated with the sf_Multirow solver with respect to the cuSparse counterpart. The experiment was performed on all the matrices of the SuiteSparse matrix collection with dimension n such that 100,000 < n < 1,000,000 and number of nonzeros nnz such that 1,000,000 < nnz < 10,000,000. This yields 94 sparse matrices, but 8 of them failed during the construction of the preconditioner (with a drop tolerance of 10^-2 and κ = 5) because they were severely ill-conditioned.
The results for the remaining 86 matrices show that the sf_Multirow solver accelerates the preconditioner in the vast majority of the tested instances. The greatest acceleration (3.62×) was obtained for matrix cage12, while the worst result (0.80×) corresponds to matrix tmt_sym. The harmonic mean of the speedups is 1.85×.
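The sub-warp sizing rule used by sf_Multirow follows directly from the enumeration above: a sub-warp is the smallest power of two that covers the nonzeros of a row, capped at the warp size. A small sketch of that rule:

```python
WARP_SIZE = 32

def subwarp_size(max_nnz_in_group):
    """Smallest power of two >= the nonzero count, capped at the warp size."""
    s = 1
    while s < max_nnz_in_group:
        s *= 2
    return min(s, WARP_SIZE)

def rows_per_warp(max_nnz_in_group):
    """Rows of the same level-set packed into one warp: 32 rows of 1 element,
    16 rows of 2, 8 rows of 3-4, 4 rows of 5-8, and so on."""
    return WARP_SIZE // subwarp_size(max_nnz_in_group)
```

Rows with at most one sub-warp of nonzeros thus share a warp with other rows of the same level-set, which is what recovers the thread utilization lost by the one-row-per-warp baseline.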

Hybrid variant of BiCG with enhanced task-parallelism, Adv_BICG_Hyb_SF
The Adv_BICG_Hyb_SF version replaces the triangular solver of the routine that applies the non-transposed preconditioner with our new synchronization-free routine. It employs the MKL library to perform the transposed SpMv on the CPU, and cuSparse to compute the application of the transposed preconditioner on the GPU. Figure 12 presents the time-line corresponding to the execution of the Adv_BICG_Hyb_SF variant for the test case cage15; the light-blue bars correspond to our synchronization-free routine. It can be observed that the delay in the execution of the transposed SpMv is significantly reduced, and that this operation overlaps almost completely with the execution of Set A on the GPU. Figure 13 compares the speedups achieved by the baseline, Adv_BICG_Hyb_SF and Adv_BICG_2GPU variants with respect to the CPU version based on the multi-threaded MKL, showing that Adv_BICG_Hyb_SF improves on the baseline version, with a performance comparable to that of Adv_BICG_2GPU in some cases.

Task+data parallel variants
Encouraged by the results obtained by the data-parallel variants, we explored the GPU acceleration of the task-parallel version of ILUPACK. We focused on the preconditioner application and developed variants for shared and distributed systems with GPUs. We followed two different approaches. First, we entirely offloaded the leaf tasks of the preconditioner application phase to the GPU. Later, we used a threshold aiming to offload computations to the GPU only when there is enough work to take advantage of the accelerator.
Since the inner tasks in the tree correspond to separator sets with much fewer nodes than leaf sets, only leaf tasks perform an important amount of work. The leaf tasks are also independent from each other.
Each task has its own data structure, which contains the part of the multilevel preconditioner relevant to it and private CPU and GPU buffers. We associate each of these tasks with a different GPU stream, following a round-robin strategy to assign tasks to GPUs on nodes with multiple devices.
The computation on each node of the DAG is based on the data-parallel version presented in [38]. It proceeds as described in Section 2.1, with the difference that, in this case, the forward and backward substitutions are separated and distributed among the levels of the task tree. Therefore, entering or leaving the recursive step in Equation (5) sometimes implies moving to a different level in the task-tree hierarchy. In these cases, the residual r_{k+1} has to be transferred to the GPU at the beginning of the forward-substitution phase, and the intermediate result has to be retrieved back into the CPU buffers before entering the recursive step. Once the inner tasks compute the recursive steps, the backward substitution proceeds from top to bottom until finally reaching the leaf tasks again, where the z_{k+1} vector has to be transferred to the GPU. There, the last steps of the calculation of the preconditioned residual z_{k+1} := M^{-1} r_{k+1} are performed. Upon completion, the preconditioned residual z_{k+1} is retrieved back into the CPU.
The experiments showed that offloading the leaf tasks entirely to the accelerator can degrade the performance, as the SpTrSV of the smaller levels has little parallelism. To deal with this, we propose a threshold-based strategy that computes the algebraic levels on the GPU until a certain granularity is reached, and then moves the computation of the SpTrSV on the remaining levels to the CPU. This allows performing each operation on the most convenient device, as even in the smaller levels some parallelism can still be leveraged for the SpMv.
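The threshold-based strategy amounts to a single cut-over point in the sequence of levels. A hypothetical sketch (the threshold value and level sizes are illustrative, not measured):

```python
def schedule_sptrsv(level_sizes, threshold):
    """Assign the SpTrSV of each algebraic level to a device: levels stay on
    the GPU while they are large enough; once a level falls below the
    threshold, that level and all deeper (smaller) ones run on the CPU."""
    devices, on_gpu = [], True
    for rows in level_sizes:
        if on_gpu and rows < threshold:
            on_gpu = False          # cut over to the CPU from here on
        devices.append('gpu' if on_gpu else 'cpu')
    return devices
```

Note that only the SpTrSV is rerouted; as discussed above, the SpMv of a small level may still be worth keeping on the GPU.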
In this variant we determine the threshold value experimentally. Our on-going work aims to identify the best algorithmic threshold from a model that captures the algorithm's performance. Figure 14 shows the results obtained on 4 nodes equipped with 2 GPUs each. It can be observed that the execution time of the all-CPU variant can be improved by a factor of up to 2× by assigning the tasks to the GPUs. The experiments also reveal scaling properties that suggest that it is possible to address large-scale problems efficiently.

Variant of BiCGStab for low-power devices
As an alternative for reducing the power consumption of large clusters, new systems that include unconventional devices have been proposed. In particular, it is now common to encounter energy-efficient hardware, such as GPUs and low-power ARM processors, in platforms intended for scientific computing.
In [39], we made a first step towards exploiting energy-efficient hardware to run the ILUPACK solvers. Specifically, we adapted the SPD linear system solver of ILUPACK for the NVIDIA Jetson TX1 platform, based on an NVIDIA Tegra X1 processor, and evaluated its performance on problems that are unable to fully leverage the capabilities of high-end GPUs.
Here we extend this work in two aspects. First, we consider our GPU variant of the BiCGStab for ILUPACK [40], instead of the SPD solver, exploring new capabilities of the Jetson board for our adaptation. Second, we evaluate this proposal on a Jetson AGX Xavier, the latest and most powerful member of this family of compute platforms.
The Jetson AGX Xavier integrates a 512-core Volta GPU (2× the cores of the TX1) with an 8-core ARM v8.2 64-bit CPU and 32 GB of unified 256-bit LPDDR4x memory (shared between the CPU and the integrated GPU) with a peak bandwidth of 137 GB/s (over 5× that of the TX1), on a 100 × 87 mm board. The maximum power consumption can be configured to 10, 15 or 30 W; we set it to 30 W for the experiments in this work.
Our original implementation of the BiCGStab solver offloads the entire solver to the GPU. Hence, the right hand side of the system is transferred to the GPU before the iteration commences and the solution vector is sent back to the CPU memory once the iteration is completed. The rest of the computations occur in the GPU. The matrix-vector multiplications and solution of triangular linear systems are performed using the cuSparse library, while vector operations rely on cuBlas and our own kernels.
This implementation is capable of running on the Jetson board without any modification. However, the unified memory of the Jetson provides the opportunity to execute each operation on the most convenient device without the overhead of data transfers through the PCIe bus.
In this sense, Table 3 shows the harmonic mean of the ratio between the performance of the CPU and the GPU for the application of the first four levels of the multilevel preconditioner. The benchmark used is the same as in Section 6. The results clearly show that the GPU is unable to deliver high performance for the higher-numbered levels (which present smaller matrices).

Table 3: Harmonic mean of the ratio between the runtime taken by the CPU and the GPU to compute the first four levels of the multilevel non-symmetric preconditioner, for 86 matrices of the SuiteSparse matrix collection.

Level            1st    2nd    3rd    4th
T_CPU / T_GPU    2.07   0.57   0.26   0.12

This is primarily due to the performance of the sparse triangular solver, which requires a considerable amount of work and parallelism to be able to hide the performance restrictions implied by the data dependencies. It is then reasonable to solve the systems corresponding to the first level of the preconditioner on the integrated GPU, and the rest on the ARM CPU.
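A simple placement rule follows directly from the measured ratios in Table 3: a level's triangular solves go to the GPU only when the GPU is actually faster, i.e., when T_CPU/T_GPU > 1. With these numbers, only the first level stays on the GPU:

```python
# Measured harmonic-mean ratios T_CPU / T_GPU from Table 3, per level.
ratios = {1: 2.07, 2: 0.57, 3: 0.26, 4: 0.12}

# Place each level's SpTrSV on the device that was faster for it.
placement = {lvl: ('gpu' if r > 1.0 else 'cpu') for lvl, r in ratios.items()}
```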
Although the physical memory is shared between the CPU and the GPU, the Jetson presents four types of memory (Device Memory, Pageable Host Memory, Pinned Memory and Unified Memory), which differ in their caching and access policies. Hence, it is important to select the right type of memory for each buffer in order to maximize the performance.
Since we are still interested in leveraging the GPU for the matrix-vector product and other operations of the BiCGStab solver, we keep the principal matrix in Device Memory. We do the same for all the vectors involved in the solver except the residual and preconditioned residual, which are the input and output vectors of the routine that performs the application of the multilevel preconditioner. These vectors are kept in the Unified Memory space, since they are shared between the CPU and the GPU. Some internal buffers of the preconditioner routine must be moved to Unified Memory as well. Regarding the preconditioner, the first level is stored in Device Memory, while the rest of the levels are stored in Pageable Host Memory. Inside the routine that applies the preconditioner, the cudaStreamAttachMemAsync call is used accordingly before changing the device for the solution of a system, to aid the prefetching policy.

Figure 15 compares the acceleration obtained by the baseline GPU variant of the BiCGStab on the integrated GPU of the Jetson and on a discrete GTX 1080 Ti accelerator, with respect to the ARM processor of the Jetson. The acceleration obtained on the Jetson is significantly lower (3× lower on average). However, the problem is highly memory-bound, and the theoretical bandwidth of the GTX 1080 Ti is 484 GB/s (3.5× higher than that of the Jetson), so the results are completely aligned with the theory. Moreover, it is important to note that the total power dissipation of the GTX 1080 Ti is 250 W (8× higher than that of the Jetson).
Regarding the selection of the most convenient memory buffer and device to perform each operation, Figure 16 shows that a noticeable advantage can be obtained in some cases, especially for preconditioners that present many small levels. It is then interesting to develop a model that predicts the most convenient device to perform each operation on this sort of platform.

Concluding Remarks
The undeniable relevance of sparse linear systems in many areas of science and engineering, as well as the rapid growth in the size of the problems we are witnessing nowadays, make it necessary to adapt numerical software to correctly exploit modern computational platforms. In this sense, hardware accelerators such as GPUs have become a ubiquitous and powerful parallel architecture, and making efficient use of these devices is of utmost importance. ILUPACK is a package that bundles several of the most widely used Krylov subspace solvers with a sophisticated multilevel "inverse-based" ILU preconditioner that stands out due to its remarkable numerical properties.
Prior to this dissertation, there were no variants of ILUPACK capable of exploiting data-parallelism. Previous task-parallel implementations of ILUPACK [6,7,5] have shown good performance and scalability in many scenarios, and remain a valuable tool to this day, but also face some limitations.
In this thesis we have made a comprehensive study of this tool, in order to enable its efficient execution on platforms equipped with hardware accelerators. This study involves the development of several GPU-aware implementations of its preconditioner and most important solvers. In this sense, our new data-parallel implementations are able to improve the performance of the original preconditioner without significantly affecting its numerical properties.
In the context of accelerating the solution of triangular linear systems, which are the computational bottleneck of the ILUPACK preconditioner, we proposed a synchronization-free algorithm for this operation [35], based on a recently proposed GPU computation paradigm. We also developed an algorithm for the level-set analysis of the sparse matrices based on the same strategy [36]. Our solvers present performance results that are, in general, better than those of cuSparse. In [41] we show how our new solvers are able to significantly accelerate the application of the ILUPACK preconditioner, while in [42] we show how this paradigm favors the overlapping of the SpTrSV with other operations.
In addition to our data-parallel variants of the sequential version of ILUPACK, we studied the exploitation of GPU-enabled platforms for the two task-parallel implementations available at the moment. Specifically, we enabled GPU computations for the shared-memory and distributed-memory implementations of ILUPACK.
During the development of this thesis we have identified some directions in which this work can be extended. First, the computation of the ILUPACK preconditioner is a complex and time-consuming process, and its parallel execution has not been studied for non-SPD problems. The main reason for this is that the preconditioner is computed only once for each matrix, so it is generally more interesting to accelerate its application, which may be performed hundreds of times for each matrix. However, there are use cases in which the computation time of the preconditioner completely overshadows the reduction in the iteration count of the solvers. The study of this computation process and its parallelization is an interesting line of future work.
Another interesting line of future work is the solution of sparse triangular linear systems. Our recent advances on a synchronization-free approach for this operation motivate the future development of GPU synchronization-free LDU-system solvers specially designed for ILUPACK's data types and characteristic nonzero patterns.
In the context of our acceleration of the task-parallel variants of the ILUPACK preconditioner, we are interested in the development of an automatic mechanism to determine the most convenient device for each operation.
Finally, we plan to study the exploitation of the new features presented by the latest GPU architectures, such as Tensor Cores or the group synchronization mechanisms introduced in CUDA 9.0. These developments are too recent to be covered in this thesis, so they will be addressed as part of future work.