All publications sorted by Books and proceedings |
Through the recent improvements toward exascale supercomputer systems, huge computations can be performed in reasonable times by using massively parallelized operations. Unfortunately, the increase of the computational units in these systems does not lead to a rise in the memory available per core. Therefore, this memory limitation forces the scientists/engineers to not only efficiently parallelize the operations but also minimize the memory used. Many scientific and engineering applications have to solve large sparse linear systems of the type Ax = b. Although the direct methods are the most robust solutions for these systems, they are costly in terms of their memory usage and time-to-solution. In this respect, the low-rank representations have been recently introduced into these solvers to reduce the time and memory footprint. In this work, our goal is to improve the low-rank feature of the block low-rank (BLR) sparse supernodal direct solver PaStiX. For this purpose, we compare some compression methods to determine the fastest kernel, which keeps the representative data with the smallest rank possible. Then, we focus on improving supernodal solver by reducing the number of re-compression during the updates. Firstly, we study the separator reordering strategies to identify the poorly compressible blocks involved in these updates and reduce their occurrences. Secondly, we propose an orthogonal solution to predict thecompressibility of the blocks before the numerical factorization. This last approach relies on the use of the level of fill of a symbolic block incomplete factorization. Thanks to these optimizations, the memory usage has been reduced more effectively compared to the state of the art solvers while also improving the time to solution. This thesis is a requested first step toward a advanced sparse direct solver using hierarchical compression schemes. |
Solving sparse linear systems is a problem that arises in many scientific applications, and sparse direct solvers are a time consuming and key kernel for those applications and for more advanced solvers such as hybrid direct-iterative solvers. For those reasons, optimizing their performance on modern architectures is critical. However, memory requirements and time-to-solution limit the use of direct methods for very large matrices. For other approaches, such as iterative methods, general black-box preconditioners that can ensure fast convergence for a wide range of problems are still missing. In the first part of this thesis, we present two approaches using a Block Low-Rank (BLR) compression technique to reduce the memory footprint and/or the time-to-solution of a supernodal sparse direct solver. This flat, non-hierarchical, compression method allows to take advantage of the low-rank property of the blocks appearing during the factorization of sparse linear systems. The proposed solver can be used either as a direct solver at a lower precision or as a very robust preconditioner. The first approach, called Minimal-Memory, illustrates the maximum memory gain that can be obtained with the BLR compression method, while the second approach, called Just-In-Time, mainly focuses on reducing the computational complexity and thus the time-to-solution. In the second part, we present a reordering strategy that increases the block granularity to better take advantage of the locality for multicores and provide larger tasks to GPUs. This strategy relies on the block-symbolic factorization to refine the ordering produced by tools such as Metis or Scotch, but it does not impact the number of operations required to solve the problem. From this approach, we propose in the third part of this manuscript a new low-rank clustering technique that is designed to cluster unknowns within a separator to obtain the BLR partition, and demonstrate its assets with respect to widely used clustering strategies. Both reordering and clustering where designed for the flat BLR representation but are also a first step to move to hierarchical formats. We investigate in the last part of this thesis a modified nested dissection strategy that aligns separators with respect to their father to obtain more regular data structure. |
In this thesis, we focus on the parallel solving of large sparse linear systems. Our main interest is on direct-iterative hybrid solvers such as HIPS, MAPHYS, PDSLIN or SHYLU, which rely on domain decomposition and Schur complement approaches. Althrough these solvers are not as time and space consuming as direct methods, they still suffer from serious overheads. In a first part, we thus present the existing techniques for reducing the memory consumption, and we present a new method which does not impact the numerical robustness of the preconditioner. This technique reduces the memory peak by doing a special scheduling of computation, allocation, and freeing tasks in particular in the Schur coupling blocks of the matrix. In a second part, we focus on the load balancing of the domain decomposition in a parallel context. This problem consists in partitioning the adjacency graph of the matrix in as many domains as desired. We point out that a good load balancing for the most expensive steps of an hybrid solver such as MAPHYS relies on the balancing of both interior nodes and interface nodes of the domains. Through, until now, graph partitioners such as METIS or SCOTCH used to optimize only the first criteria (i.e. the balancing of interior nodes) in the context of sparse matrix ordering. We propose different variations of the existing algorithms to improve the balancing of interface nodes and interior nodes simultaneously. All our changes are implemented in the SCOTCH partitioner. We present our results on large collection of matrices coming from real industrial cases. |
The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this thesis, we study the benefits and the limits of replacing the highly specialized internal scheduler of the PaStiX solver by two generic runtime systems: PaRSEC and StarPU. Thus, we have to describe the factorization algorithm as a tasks graph that we provide to the runtime system. Then it can decide how to process and optimize the graph traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its original internal scheduler, PaRSEC, and StarPU frameworks is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. In this thesis, we also study the possibilities to build a distributed sparse linear solver on top of task-based runtime systems to target heterogeneous clusters. To permit an efficient and easy usage of these developments in parallel simulations, we also present an optimized distributed interface aiming at hiding the complexity of the construction of a distributed matrix to the user. |
The reactivity computations are an essential component for the simulation of the core of a nuclear plant. These computations lead to generalized eigenvalue problems solved by the inverse power iteration algorithm. At each iteration, an algebraic linear system is solved through an inner/outer process. With the solver Cocagne developed at EDF, it is difficult to take into account very fine discretisation, due to the memory requirement and the computation time. In this thesis, a domain decomposition method based on the Schur dual technique is studied. Several placements in the inner/outer process are possible. Two of them are implemented and the results analyzed. The second one, which uses the specificities of the Raviart Thomas finite elements and of the alternating directions algorithm, leads to very promising results. From these results the industrialization of the method can be considered. |
New supercomputers incorporate many microprocessors which include themselves one or many computational cores. These new architectures induce strongly hierarchical topologies. These are called NUMA architectures. Sparse direct solvers are a basic building block of many numerical simulation algorithms. They need to be adapted to these new architectures with Non Uniform Memory Accesses. We propose to introduce a dynamic scheduling designed for NUMA architectures in the PaStiX, solver. The data structures of the solver, as well as the patterns of communication have been modified to meet the needs of these architectures and dynamic scheduling. We are also interested in the dynamic adaptation of the computation grain to use efficiently multi-core architectures and shared memory. Experiments on several numerical test cases will be presented to prove the efficiency of the approach on different architectures. |
This thesis presents a parallel resolution method for sparse linear systems which combines effectively techniques of direct and iterative solvers using a Schur complement approach. A domain decomposition is built ; the interiors of the subdomains are eliminated by a direct method in order to use an iterative method only on the interface unknowns. The system on the interface (Schur complement) is solved thanks to an iterative method preconditioned by a global incomplete factorization. A special ordering on the Schur complement allows to build a scalable preconditioner. Algorithms minimizing the memory peak that appears during the construction of the preconditioner are presented. The memory is balanced thanks to a multiple domains per processors parallelization scheme. The methods are implemented in the Hips solver and parallel experimental results are presented on large industrial test cases. |
Solving large sparse symmetric positive definite systems of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. In this work, we consider the block partitioning and scheduling problem for sparse parallel factorization without pivoting. We focus on the scalability of the parallel solver, and on the compromise between memory overhead and efficiency. We validate this study with parallel experiments on a large collection of irregular industrial problems. |
This thesis initially concerned the parallelization of the OSSAU software code from CEA/CESTA. The application domain of this software is the vectorized structural mechanics; the code itself is non linear in time and in two or three dimensions. This investigation leads to the conception and the implementation of a parallel high performance software processing chain for the assembly and the resolution of sparse linear systems by direct methods. The final objective is a validation of the OSSAU code for three dimension problems with several millions of unknowns. |
This thesis deals with the high performance computation problems and more specifically with those of scientific parallel computation for irregular real-world applications. In the first part, we describe a method for overlapping communications on parallel computers with distributed memory. This method has resulted in a generic computation scheme for the optimal packet size. We also tackle the problem of finding the optimal computation grain for the Cholesky factorization algorithm for dense matrices. The goal of this study is to exploit the irregularity induced by the matrix symmetry. Based on this work we have developped a portable software library providing an efficient application context for these techniques. The second part of this thesis presents and analyses a general algorithm for the computation of an efficient static scheduling of block computations, developped especially for a parallel direct sparse linear factorization based on a combination of 1D and 2D block distributions. Our solver uses a supernodal Fan-In approach and is fully driven by our static scheduling algorithm. Compared to the existing parallel direct solvers our solver shows very favorable performance results. |
Non-linear MHD simulations of edge localized modes (ELMs) show features in qualitative agreement with the experimental observations such as the formation and speed of filaments, features in the radial profiles and the fine structure observed in the power deposition profiles at the divertor target. The density perturbation predominantly follows the ballooning mode convection cells leading to density filaments. The temperature perturbation, due to the large parallel conduction, follows the magnetic field perturbation. Simulations of pellets injected in the H-mode pedestal show that the high pressure in the high density plasmoid can become large enough to drive ballooning type modes forming a single helical structure located at the pellet (plasmoid) position. |
Among existing preconditioners, the level-of-fill ILU has been quite popular as a general-purpose technique. Experimental observations have shown that, when coupled with block techniques, these methods can be quite effective in solving realistic problems arising from various applications. In this work, we consider an extension of this kind of method which is suitable for parallel environments. Our method is developed from the framework of high performance sparse direct solvers. The main idea we propose is to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. These requirements lead to a robust class of parallel preconditioners based on generalized versions of block ILU techniques. |
Solving large sparse symmetric positive definite systems of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. This paper considers the block partitioning and scheduling problem for sparse parallel factorization without pivoting. There are two major aims to this study: the scalability of the parallel solver, and the compromise between memory overhead and efficiency. Parallel experiments on a large collection of irregular industrial problems validate our approach. |
The framework of this paper is the parallelization of a plasticity algorithm that uses an implicit method and an incremental approach. More precisely, we will focus on some specific parallel sparse linear algebra algorithms which are the most time-consuming steps to solve efficiently such an engineering application. First, we present a general algorithm which computes an efficient static scheduling of block computations for parallel sparse linear factorization. The associated solver, based on a supernodal fan-in approach, is fully driven by this scheduling. Second we describe a scalable parallel assembly algorithm based on a distribution of elements induced by the previous distribution for the blocks of the sparse matrix. We give an overview of these algorithms and present performance results on an IBM SP2 for a collection of grid and irregular problems. |
As the core of a large number of simulation tools, the resolution of large linear systems often represents the dominant part of the computing time. Massively parallel versions are needed to maintain advances in multi-physics and multi-scale simulations, especially when targeting exascale platforms. The aim is therefore to address the major challenge of designing and building numerically robust solvers on runtime systems that can scale up and push back the limits of existing industrial codes of the EoCoE project. In this talk, we will study the recent changes made to the solver with matrices issued from the project such as the block low-rank compression factorization, the capacity to exploit modern GPU accelerators through runtime systems, and the scalability on distributed memory. |
Task-based programming models have been widely studied in the context of dense linear algebra, but remains less studied for the more complex sparse solvers. In this talk, we will present the use of two different programming models: Sequential Task Flow from StarPU, and Parameterized Task Graph from PaRSEC to parallelize the factorization step of the PaStiX sparse direct solver. We will present how those programming models have been used to integrate more complex and finer parallelism to take into account new architectures with many computational units. Efficiency of such solutions on homogeneous and heterogeneous architectures with a spectrum of matrices from different applications will be shown. We will also present how such solutions enable, without extra cost to the programmer, better performance on irregular computations such as in the block low-rank implementation of the solver. |
In this talk, we adress the Block Low-Rank (BLR) clustering problem, to cluster unknowns within separators appearing during the factorization of sparse matrices. We show that methods considering only intra-separators connectivity (i.e., k-way or recursive bissection) as well as methods managing only interaction between separators have some limitations. The new strategy we propose consider interactions between a separator and its children to pre-select some interactions while reducing the number of off-diagonal blocks. We demonstrate how this method enhance the BLR strategies in the sparse direct supernodal solver PaStiX, and discuss how it can be extended to low-rank formats with more than one level of hierarchy. |
In this talk, we present new ordering heuristics to perform block low-rank clustering in supernodes issued from the nested dissection. As kway partitioning within supernodes does not take into account interactions between supernodes, there is room to improve compression rates. We combine kway partitioning with a reordering strategy that aims at minimizing the number of off-diagonal blocks in the symbolic structure and show that both methods are limited. In addition, we propose a selection of some non-compressible vertices to handle the corresponding blocks in full-rank and reduce the burden on managing low-rank blocks with high ranks. |
In this talk, we present the PaStiX sparse supernodal solver, using hierarchical compression to reduce the burden on large blocks appearing during the nested dissection process. We compare the numerical stability, and the performance in terms of memory consumption and time to solution of different approaches by selecting when the compression of the factorized matrix occurs. In order to improve the efficiency of the sparse update kernel for both BLR (block low rank) and HODLR (hierarchically off-diagonal low-rank), we investigate the BDLR (boundary distance low-rank) method to preselect rows and columns in the low-rank approximation algorithm. |
In this talk, we will present recent advances on PaStiX, a supernodal sparse direct solver, which has been enhanced by the introduction of Block Low-Rank compression. We will describe different strategies leading to memory consumption gain and/or time-to-solution reduction. Finally, the implementation on top of runtime systems (Parsec, StarPU), will be compared with the static scheduling used in previous experiments. |
Sparse direct solvers is a time consuming operation required by many scientific applications to simulate physical problems. By its important overall cost, many studies tried to optimize the time to solution of those solvers on multi-core and distributed architectures. More recently, many works have addressed heterogeneous architectures to exploit accelerators such as GPUs or Intel Xeon Phi with interesting speedup. Despite researches towards generic solutions to efficiently exploit those accelerators, their hardware evolution requires continual adaptation of the kernels running on those architectures. The recent Nvidia architectures, as Kepler, present a larger number of parallel units thus requiring more data to feed every computational units. A solution considered to supply enough computation has been to study problems with large number of small computations. The batched BLAS libraries proposed by Intel, Nvidia, or the University of Tennessee are examples of this solution. We discuss in this talk the use of the variable size batched matrix-matrix multiply to improve the performance of a the PaStiX sparse direct solver. Indeed, this kernel suits the supernodal method of the solver, and the multiple updates of variable sizes that occur during the numerical factorization. Performance results on a spectrum of matrices with different properties will be presented. |
Among the preprocessing steps of a sparse direct solver, reordering and block symbolic factorization are two major steps to reach a suitable granularity for BLAS kernels efficiency and runtime management. In this talk, we present a reordering strategy to increase off-diagonal block sizes. It enhances BLAS kernels and allows to handle larger tasks, reducing runtime overhead. Finally, we will comment the resulting gain in the PaStiX solver implemented over StarPU and PaRSEC. |
The complexity of the hardware architectures of modern supercomputers led the community of developers of scientific libraries to adopt new parallel programming paradigms. Among them, task-based programming has certainly become one of the most popular as it allows for high productivity while ensuring high performance and portability by delegating tasks management to a runtime system. In this talk, we will present an overview of sparse solvers that have been designed in the context of the Matrices Over Runtime Systems @ Exascale (MORSE) and Solvers for Heterogeneous Architectures (SOLHAR) projects. We will present the design of new direct solvers implementing supernodal (PaStiX) and multifrontal (qr-mumps) methods, new Krylov solvers ensuring pipelining both at a numerical and software level, new sparse hybrid methods (MaPHyS) as well as data sparse libraries implementing fast multipole methods (ScalFMM) and hierarchical matrices (hmat, in collaboration with Airbus Group Innovations). For all these methods, we will highlight the challenges we have faced in terms of expressivity, granularity, scheduling and scalability and illustrate their performance on large academic and industrial test problems. |
Sparse direct solvers is a time consuming operation required by many scientific applications to simulate physical problems. By its important overall cost, many studies tried to optimize the time to solution of those solvers on multi-core and distributed architectures. More recently, many works have addressed heterogeneous architectures to exploit accelerators such as GPUs or Intel Xeon Phi with interesting speedup. Despite researches towards generic solutions to efficiently exploit those accelerators, their hardware evolution requires continual adaptation of the kernels running on those architectures. The recent Nvidia architectures, as Kepler, present a larger number of parallel units thus requiring more data to feed every computational units. A solution considered to supply enough computation has been to study problems with large number of small computations. The batched BLAS libraries proposed by Intel, Nvidia, or the University of Tennessee are examples of this solution. We discuss in this talk the use of the variable size batched matrix-matrix multiply to improve the performance of a the PaStiX sparse direct solver. Indeed, this kernel suits the supernodal method of the solver, and the multiple updates of variable sizes that occur during the numerical factorization. Performance results on a spectrum of matrices with different properties will be presented. |
In this talk, we describe a preliminary fast direct solver using HODLR library to compress large blocks appearing in the symbolic structure of the PaStiX sparse direct solver. We will present our general strategy before analyzing the practical gains in terms of memory and floating point operations with respect to a theoretical study of the problem. Finally, we will discuss ways to enhance the overall performance of the solver. |
In this talk, we describe a preliminary fast direct solver using HODLR library to compress large blocks appearing in the symbolic structure of the PaStiX sparse direct solver. We will present our general strategy before analyzing the practical gains in terms of memory and floating point operations with respect to a theoretical study of the problem. Finally, we will discuss the impact of reordering technic to enhance the low-rank compression. |
In this talk, we present the PaStiX sparse supernodal solver, using hierarchical compression to reduce the burden on large blocks appearing during the nested dissection process. To improve the efficiency of our sparse update kernel for both BLR (block low rank) and HODLR (hierarchically off-diagonal low-rank), we investigate to BDLR (boundary distance low-rank) method to preselect rows and columns in the low-rank approximation algorithm. We will also discuss ordering strategies to enhance data locality and compressibility. |
Among the preprocessing steps of a sparse direct solver, reordering and block symbolic factorization are two major steps to reach a suitable granularity for BLAS kernels efficiency and runtime management. In this talk, we present a reordering strategy to increase off-diagonal block sizes. It enhances BLAS kernels and allows to handle larger tasks, reducing runtime overhead. Finally, we will comment the resulting gain in the PaStiX solver implemented over StarPU and PaRSEC. |
In the context of hybrid sparse linear solvers based on domain decomposition and Schur complement approaches, getting a domain decomposition tool leading to a good balancing of both the internal node set size and the interface node set size is a critical point for parallel computation. We propose several variations of the existing algorithms in the multilevel Scotch partitioner and we illustrate the improved results on a collection of graphs coming from numerical scientific applications. |
In the context of solving sparse linear systems, an ordering process partitions the matrix graph to minimize both fill-in and computational cost. We found that the ordering strategy used within supernodes might be enhanced to reduce the number of off-diagonal blocks, and then increases block sizes and kernel performance. This turns to be into the same complexity as the factorization algorithm, but allows for more efficient BLAS kernels. On the other side, supernodes that are too large need to be split to create more parallelism. The regular splitting strategy when applied locally impacts significantly the number of off-diagonal blocks and might have negative effect on the efficiency. In this talk, we present both a new strategy to improve supernodes ordering and splitting strategy that both enlarge the off-diagonal block sizes without changing the computational cost of the factorization. Performance improvement gains on the supernodal solver PaStiX are shown on multi-cores and heterogeneous architectures. |
When solving large sparse linear systems, both the amount of memory needed and the computational cost represent a burden to efficiency. In order to solve larger systems, low-rank strategies are used to reduce the overall complexity of a solver. In this talk, we present a preliminary study of the use of H-Matrix arithmetic in a supernodal solver. We also present a new feature in PaStiX, a reordering strategy to reduce the number of off-diagonal blocks in the symbolic factorization. It allows BLAS kernels to be more efficient, and those ideas could be explored in the context of a low-rank strategy. |
In this talk we will discuss our research activities on the design of parallel linear solvers for large scale problems that range from dense linear algebra, to parallel sparse direct solver and hybrid iterative-direct approaches. In particular we will describe the implementations designed on top of runtime systems that should provide both code and performance portabilities. Finally, we will present some preliminary results on the integration of h-matrice kernels in our sparse direct solver framework. |
In the context of hybrid sparse linear solvers based on domain decomposition and Schur complement approaches, getting a domain decomposition tool leading to a good balancing of both the internal node set size and the interface node set size for all the domains is a critical point for load balancing and efficiency issues in a parallel computation context. For this purpose, we revisit the original algorithm introduced by Lipton, Rose and Tarjan which performed the recursion for nested dissection in a particular manner. From this specific recursive strategy, we propose in this paper several variations of the existing algorithms in the multilevel Scotch partitioner that take into account these multiple criteria and we illustrate the improved results on a collection of graphs corresponding to finite element meshes used in numerical scientific applications. |
In recent years there has been a resurgence in direct methods to solve linear systems. These methods can have many advantages compared to iterative solvers; in particular their accuracy and performance is less sensitive to the distribution of eigenvalues. However, they typically have a larger computational cost in cases where iterative solvers converge in few iterations. We will discuss a recent trend of methods that address this cost and can make these direct solvers competitive. Techniques involved include hierarchical matrices, hierarchically semi-separable matrices, fast multipole method, etc. |
The ongoing hardware evolution exhibits an es- calation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application de- velopers to leave the traditional programming paradigms and explore alternative solutions. PASTIX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PASTIX solver with two generic runtime systems: PARSEC and STARPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PASTIX solver on top of its native internal scheduler, PARSEC, and STARPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embed- ded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. |
The current trend in the high performance computing shows a dramatic increase in the number of cores on the shared memory compute nodes. Algorithms, especially those related to linear algebra, need to be adapted to these new computer architectures in order to be efficient. PaStiX is a sparse parallel direct solver, that incorporates a dynamic scheduler for strongly hierarchical modern architectures. In this work, we study the replacement of this internal highly integrated scheduling strategy by two generic runtime frameworks: DAGuE and StarPU. Those runtimes will give the opportunity to execute the factorization tasks graph on emerging computers equipped with accelerators. As for previous work done in dense linear algebra, we will present the kernels used for GPU computations inspired by the MAGMA library and the DAG algorithm used with those two runtimes. A comparative study of the performances of the supernodal solver with the three different schedulers is performed on ma nycore architectures and the improvements obtained with accelerators will be presented with the StarPU runtime. These results demonstrate that these DAG runtimes provide uniform programming interfaces to obtain high performance on different architectures on irregular problems as sparse direct factorizations. |
The emergence of many-cores architectures introduces variations in computation costs, which makes precise cost models hard to realize. Static schedulers based on cost models, like the one used in the sparse direct solver extsc{PaStiX}, are no longer adapted. We describe the dynamic scheduler developed for the super-nodal method of extsc{PaStiX} to correct the imperfections of the static model. The solution presented exploit the elimination tree of the problem to keep the data locality during the execution. |
The simulation of the neutron transport inside a nuclear reactor leads to the computation of the lowest eigen pair of a simplified transport operator. This computation is done by a power inverse algorithm accelerated by a Chebyshev polynomials based process. At each iteration, a large linear system is solved inexactly by a block Gauss-Seidel algorithm. For our applications, one Gauss-Seidel iteration is already sufficient to ensure the right convergence of the inverse power algorithm. For the approximate resolution of the linear system at each inverse power iteration, we propose a non overlapping domain decomposition based on the introduction of Lagrange multipliers in order to: - get a parallel algorithm, which allows to circumvent memory consumption problem and to reduce the computational time; - deal with different numerical approximations in each subdomain; - minimize the code modifications in our industrial solver. When the Chebyshev acceleration process is switched off, the method performs well on up to 100 processors for an industrial test case. It exhibits a good efficiency which allows us to realize some computations beyond the reach of standard workstations. Besides, we study the efficiency of the Chebyshev acceleration process in our domain decomposition method. |
The simulation of the neutron transport inside a nuclear reactor leads to the computation of the lowest eigen pair of a simplified transport operator. Whereas the sequential solution at our disposal today is really efficient, we are not able to run some industrial cases due to the memory consumption and the computational time. This problem brings us to study parallel strategies. In order to re-use an important part of the solver and to bypass some limitations of conforming cartesian meshes, we propose a non overlapping domain decomposition based on the introduction of Lagrange multipliers. The method performs well on up to $100$ processors for an industrial test case. |
The use of scientific computing centers becomes more and more difficult on modern parallel architectures. Users must face a large variety of batch systems (with their own specific syntax) and have to set many parameters to tune their applications (e.g., processors and/or threads mapping, memory resource constraints). Moreover, finding the optimal performance is not the only criteria when a pool of jobs is submitted on the Grid (for numerical parametric analysis for instance) and one must focus on the wall-time completion. In this work we tackle the problem by using the D IET Grid middleware that integrates an adaptable PaStiX service to solve a set of experiments issued from the simulations of the ASTER project. |
Over the past few years, parallel sparse direct solvers made significant progress and are now able to solve efficiently industrial three-dimensional problems with several millions of unknowns. An hybrid MPI-thread implementation of our direct solver PaStiX is already well suited for SMP nodes or new multi-core architectures and drastically reduced the memory overhead and improved scalability. In the context of distributed NUMA architectures, a dynamic scheduler based on a work-stealing algorithm has been developed to fill in communication idle times. On these architectures, it is important to take care of NUMA effects and to preserve memory affinity during the work-stealing. The scheduling of communications also needs to be adapted, especially to ensure the overlap by computations. Experiments on numerical test cases will be presented to prove the efficiency of the approach on NUMA architectures. If memory is not large enough to treat a given problem, disks must be used to store data that cannot fit in memory (out-of-core storage). The idle-times due to disk access have to be managed by our dynamic scheduler to prefetch and save datasets. Thus, we design and study specific scheduling algorithms in this particular context. |
Over the past few years, parallel sparse direct solvers made significant progress and are now able to efficiently work on problems with several millions of equations. This paper presents some improvements on our sparse direct solver PaStiX1 for distributed Non-Uniform Memory Access architectures. We show results on two preliminary works: a memory allocation scheme more adapted to these architectures and a better overlap of communication by computation. We also present a dynamic scheduler that takes care of memory affinity and data locality. |
The purpose of this work is to provide a method which exploits the parallel block-wise algorithmic approach used in the framework of high performance sparse direct solvers in order to develop robust and efficient preconditioners based on a parallel incomplete factorization. |
Since the last decade, most of the supercomputer architectures are based on cluster of SMP nodes. In those architectures the exchanges between processors are made through shared memory when the processor are located on a same SMP node and through the network otherwise. Generally, the MPI implementations provided by the constructor on those machines are adapted to this situation and take advantage of the share memory to treat messages between processors in a same SMP node. Nevertheless, this transparent approach to exploit shared memory do not avoid the storage of buffers needed in asynchronous communications. In the parallel direct solvers the storage of these buffers can become a bottleneck. In this paper, we propose an hybrid thread-MPI implementation of a direct solver and analyse the benefits of this approach in term of memory and run-time performances. |
In this paper, we consider a hyperbolic system with multiple time step characteristics. Such a situation arises for example in combustion problems when the acoustic time is small compared to the characteristic time associated to the flame propagation. The problems investigated in this paper are characterized by a small Mach number. At the asymptotic limit, the initial hyperbolic system degenerates to an elliptic problem. Therefore, numerical methods proposed with the assumption of hyperbolicity of the system becomes hill conditioned at this limit. As a consequence, the iterative methods used in the numerical algorithm implemented in the software FluidBox, have a worse convergence behavior. Some physical preconditioning has been proposed to overcome this difficulty. However, in the context of parallel computing, a global preconditioning is unavoidable for performance efficiency. The parallelization of FluidBox relies on a domain decomposition. A first version of FluidBox was using a block Jacobi or a block Gauss-Seidel preconditioner that are easily implementable in this framework. But to solve 3D problems up to several millions of unknowns on numerous processors, this kind of preconditioner becomes inefficient du to their lack of scalability and robustness. Hence, a collaboration inside the INRIA ScAlApplix project has been setup to use the high performance solver library PaStiX that provides both complete and incomplete factorizations on clusters of SMP nodes to solve large scale computations. The aim of this work is then to investigate the performance of the combination of FluidBox and PaStiX (both developped in the INRIA ScAlApplix project) and also present the parallel assembly algorithm that allows a good load balance in this context. |
The purpose of our work is to provide a method which exploits the parallel blockwise algorithmic approach used in the framework of high performance sparse direct solvers in order to develop robust preconditioners based on a parallel incomplete factorization. The idea is then to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. |
PaStiX is a scientific library that provides a high performance direct supernodal solver for very large sparse linear systems. It relies on a block factorization based on an hybrid ordering (Nested Dissection + Halo Approximate Minimum Degree) obtained using the Scotch library. Efficient static scheduling and memory management are used to solve irregular problems with more of 25 millions unknowns on clusters of SMP nodes. In order to solve larger 3D problems, we apply these blockwise algorithms to compute robust and efficient parallel ILU preconditioners. |
Solving large sparse linear systems by iterative methods has often been quite unsatisfactory when dealing with pratical "industrial" problems. The main difficulty encountered by such methods is their lack of robustness and, generally, the unpredictability and unconsistency of their performance over a wide sample of different problems; certain methods work quite well for certain types of problems but can fail completely on other problems. Over the past few years, direct methods have made significant progress thanks to either the combinatorial analysis of the Gaussian elimination process and the parallel algorithmic of blockwise solvers optimized for modern parallel supercomputers. Its is now possible to solve practical three-dimensional problems in the order of several millions of equations in a very powerful way with the direct solvers that efficiently use the superscalar effects of modern processors. However, direct methods may fail to solve very large three-dimensional problems, due to the large amount of memory needed for these cases. In our work, we consider an approach which, we hope, will bridge the gap between the two classes of methods. The goal is to provide a method which exploits the parallel blockwise algorithmic used in the framework of the high performance sparse direct solvers for developping robust parallel incomplete factorization based preconditioners for iterative solvers. The idea is then to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Our approach consists in computing symbolically the block structure of the factors that would have been obtained with a complete factorization, and then deciding to drop off some blocks of this structure according to relevant criterions. Such incomplete factorization can take advantage of the latest breakthroughts in sparse direct methods and therefore be very competitive in CPU time while avoiding the memory limitation encountered by direct methods. By this way, we expect to be able to solve systems in the order of hundred millions of unknowns. |
In this paper, we focus on the scheduling of Laplace task graph on a general platform where both communication links and processing units are heterogeneous. In this context, it is known that deriving optimal algorithm, in the sense of makespan minimization, is NP-Complete, and several inapproximation results have been proved. Nevertheless, we provide an asymtotically optimal algorithm in this general context. Moreover, we expect that this methodolgy can be extended to more general task graphs, especially for nested loops where the inner-most loop is parallel. |
L'\'emergence des machines parall\`eles t\'eraflopiques permet maintenant de traiter des probl\`emes de diffraction de tr\`es grande taille li\'es \`a des calculs hautes fr\'equences par des m\'ethodes exactes. Une premi\`ere \'etape de ce travail a constist\'e \`a parall\'eliser des codes s\'equentiels existants qui utilisaient des m\'ethodes num\'eriques classiques. Ces codes \'etaient bas\'es sur un couplage fort par compl\'ement de Schur entre une formulation int\'egrale (\'el\'ements-finis surfaciques pour traiter la condition de radiation \`a l'infini) et une formulation \'el\'ements-finis volumiques (pour calculer les champs \`a l'int\'erieur de l'objet). L'adapation au parall\'elisme a consist\'e \`a parall\'eliser la factorisation du syst\`eme lin\'eaire creux (\`a l'aide de la biblioth\'eque EMILIO/PaStiX, la r\'esolution du syst\`eme int\'egral plein (sym\'etrique, complexe) et le couplage par compl\'ement de Schur (consistant en des descentes-remont\'ees sur le syst\`eme creux associ\'e). Les deux premi\`eres \'etapes cit\'ees ont donn\'e pleinement satisfaction, mais le compl\'ement de Schur, \`a cause du tr\'es grand nombre descentes-remont\'ees li\'ees aux inconnues surfaciques, limite l'efficacit\'e globale de cette approche \`a des probl\`emes de taille interm\'ediaire. Par exemple, pour le cas test du ``Sphere-Cone'' comportant $3.10^6$ d'inconnues volumiques et $23.10^3$ inconnues surfaciques, sur 32 n\oe uds ES45 (quadri-processeurs), l'assemblage volumique repr\'esente 6s, la factorisation du syst\`eme creux est obtenue en 376s et le compl\'ement de Schur en 21500s (en complexe double pr\'ecision). Pour d\'epasser ces limites, une \'evolution globale de l'algorithme a \'et\'e n\'ecessaire~: - un couplage it\'eratif entre les r\'esolutions surfacique et volumique a \'et\'e mis en place; - pour augmenter le nombre d'inconnues surfaciques une m\'ethode multip\^ole multi-niveaux parall\'ele a \'et\'e d\'evelopp\'ee; - enfin, pour atteindre des tailles de probl\`eme de plusieurs dizaines de millions d'inconnues volumiques, des \'evolutions ont \'et\'e apport\'ees dans la biblioth\`eque EMILIO/PaStiX. En particulier, une algorithmique sp\'ecifique a \'et\'e mise en place pour utiliser au mieux les clusters de n\oe uds SMP et pour pr\'eserver une scalabilit\'e m\'emoire raisonnable. Enfin, le sch\'ema actuel pr\'esente un couplage direct-it\'eratif \`a plusieurs niveaux. L'\'evolution envisag\'ee sera un couplage it\'eratif-it\'eratif, puis sera bas\'ee sur des m\'ethodes hybrides globales (interm\'ediaire entre le direct et l'it\'eratif)~: une version de PaStiX \'evolu\'ee permettra de calculer un pr\'econdition\-neur par bloc ILU(k) parall\`ele. |
In previous works, we have described an efficient static scheduling based on a mixed 1D/2D block distribution with local aggregation for a parallel supernodal version of sparse $LL^{T}$ factorization. In this paper, we present new algorithms suitable for overcoming architectures based on clusters of SMP nodes and also technics to keep a good memory scalability. This algorithms are implemented in the PaStiX library that achieved high performances (resolution of a $26.10^6$ unknown system on 192 ES45 quadriprocessors with 35 percent of peak performance). |
Cet article d\'ecrit une partie de nos travaux sur la r\'esolution directe parall\`ele de grands syst\`emes lin\'eaires creux. Il pr\'esente sa mise en \oe uvre sur des architectures \`a base de n\oe uds SMP, et plus particuli\`erement des techniques de gestion efficace de la m\'emoire. |
La r\'esolution de grands syst\`emes lin\'eaires creux est un point crucial dans de nombreuses applications industrielles et scientifiques. Notre travail porte sur le partionnement et la distribution de grandes matrices creuses pour la factorisation $LDL^t$ en parall\`ele sur machine de type MIMD. Nous pr\'esentons dans cet article notre technique de factorisation parall\`ele bas\'ee sur un ordonnancement statique des calculs et des communications, et nous la validons sur des syst\`emes de plus d'un million d'inconnues pour des probl\`emes d'\'el\'ements finis 3D. |
Solving large sparse symmetric positive definite systems of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. In this work, we consider the block partitioning and scheduling problem for sparse parallel factorization without pivoting. We focus on the scalability of the parallel solver, and on the compromise between memory overhead and efficiency. We validate this study with parallel experiments on a large collection of irregular industrial problems. |
Cet article d\'ecrit un algorithme d'assemblage parall\`ele pouvant \^etre utilis\'e lors de la phase de r\'esolution de grands syst\'emes lin\'eaires creux. La premi\`ere partie traite des probl\'emes pos\`es par un code de m\'ecanique des structures nomm\'e OSSAU d\'evelopp\'e au CEA. La deuxi\'eme partie est consacr\'ee \`a la description de notre algorithme et \`a son int\'egration dans la cha\^ine logicielle EMILIO, d\'evelopp\'ee au LaBRI sein du th\'eme ALiENor. |
Cet article pr\'esente un tour d'horizon concernant les principes et techniques pouvant \^etre utilis\'ees pour la r\'esolution de grands syst\`emes lin\'eaires creux. Il pr\'esente en particulier les travaux r\'ealis\'es au LaBRI au sein du th\`eme ALiENor consacr\'es \`a la r\'esolution parall\`ele haute performance par m\'ethodes directes. |
This paper describes a distribution of elements for irregular finite element meshes as well as the associated parallel assembly algorithm, in the context of parallel solving of the resulting sparse linear system using a direct block solver. These algorithms are integrated in the software processing chain EMILIO being developped at LaBRI for structural mechanics applications. Some illustrative numerical experiments on IBM SP2 validate this study. |
We present and analyze a general algorithm which computes an efficient static scheduling of block computations for a parallel $L.D.L^{t}$ factorization of sparse symmetric positive definite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and is fully driven by this scheduling. We give an overview of the algorithm and present performance results and comparisons with PSPASES on an IBM-SP2 with 120 MHz Power2SC nodes for a collection of irregular problems. |
We present and analyze a general algorithm which computes efficient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance results on a 16-node IBM-SP2 with 66 MHz Power2 thin nodes for a collection of grid and irregular problems. |
L'utilisation des machines parall\`eles \`a m\'emoire distribu\'ee apporte un gain important en performances et en taille m\'emoire mais am\`ene en contre partie un surco\^ut en communications. Afin d'obtenir des programmes performants et extensibles, il convient de masquer ce surco\^ut. Plusieurs solutions existent. La premi\`ere consiste en un choix judicieux de la distribution des donn\'ees qui r\'eduira au maximum le nombre et la taille des communications. De plus, si les d\'ependances le permettent, on essaiera d'initialiser les communications au plus t\^ot et de mani\`ere asynchrone : pendant l'ex\'ecution de ces communications, on effectuera d'autres calculs, sans rapport avec l'\'echange de donn\'ees. |
In this paper, we present a method for overlapping communications on parallel computers for pipelined algorithms. We first introduce a general theoretical model which leads to a generic computation scheme for the optimal packet size. Then, we use the OPIUM library, which provides an easy-to-use and efficient way to compute, in the general case, this optimal packet size, on the column $LU$ factorization; the implementation and performance measures are made on an Intel Paragon. |
Le cadre g\'en\'eral de ce travail concerne les machines parall\`eles M.I.M.D. \`a m\'emoire distribu\'ee. Dans ce contexte, il est admis que la bonne d\'emarche de gestion du parall\'elisme passe par l'utilisation syst\'ematique de bibliot\`eques performantes de calcul et de communication avec la possibilit\'e de faire du recouvrement pour masquer, dans le temps, les communications par des calculs. Le th\`eme central de ce travail concerne le recouvrement calcul/communication, avec en particulier le calcul, \'eventuellement adaptatif, de la taille optimale des paquets \`a communiquer. |
{O}ver the past few years, parallel sparse direct solvers made significant progress and are now able to solve efficiently industrial three-dimensional problems with several millions of unknowns. {T}o solve efficiently these problems, {P}a{S}ti{X} and {WSMP} solvers for example, provide an hybrid {MPI}-thread implementation well suited for {SMP} nodes or multi-core architectures. {I}t enables to drastically reduce the memory overhead of the factorization and improve the scalability of the algorithms. {H}owever, today's modern architectures introduce new hierarchical memory accesses that are not handle in these solvers. {W}e present in this paper three improvements on {P}a{S}ti{X} solver to improve the performance on modern architectures : memory allocation, communication overlap and dynamic scheduling and some results on numerical test cases will be presented to prove the efficiency of the approach on {NUMA} architectures. |
Sparse direct solvers is a time consuming operation required by many scientific applications to simulate physical problems. By its important overall cost, many studies tried to optimize the time to solution of those solvers on multi-core and distributed architectures. In this talk, we will present recent advances on PaStiX (https://gitlab.inria.fr/solverstack/pastix), a supernodal sparse direct solver, which has been enhanced by the introduction of Block Low-Rank compression. We will compare the numerical stability, and the performance in terms of memory consumption and time to solution of different approaches by selecting when the compression of the factorized matrix occurs. Many works have also addressed heterogeneous architectures to exploit accelerators such as GPUs or Intel Xeon Phi with interesting speedup. The new implementation on top of runtime systems (PaRSEC, StarPU), will be compared with the static scheduling used in previous experiments. Among the preprocessing steps of a sparse direct solver, reordering and block symbolic factorization are two major steps to reach a suitable granularity. In this talk, we will present a reordering strategy to increase off-diagonal block sizes. It enhances BLAS kernels and allows to handle larger tasks, reducing runtime overhead. Finally, in order to improve the efficiency of the sparse update kernel for both BLR (block low rank) and HODLR (hierarchically off-diagonal low-rank), we are currently investigating the BDLR (boundary distance low-rank) method to preselect rows and columns in the low-rank approximation algorithm. |
In the context of solving sparse linear systems, an ordering process partitions the matrix graph to minimize both fill-in and computational cost. We found that the ordering strategy used within supernodes might be enhanced to reduce the number of off-diagonal blocks, and then increases block sizes and kernel performance. This turns to be into the same complexity as the factorization algorithm, but allows for more efficient BLAS kernels. On the other side, supernodes that are too large need to be split to create more parallelism. The regular splitting strategy when applied locally impacts significantly the number of off-diagonal blocks and might have negative effect on the efficiency. In this talk, we present both a new strategy to improve supernodes ordering and splitting strategy that both enlarge the off-diagonal block sizes without changing the computational cost of the factorization. Performance improvement gains on the supernodal solver PaStiX are shown on multi-cores and heterogeneous architectures. |
In this talk we will discuss the current and future research activities on the design of parallel scalable linear systems for large scale problems that range from dense linear algebra, to parallel sparse direct solver and hybrid iterative-direct approaches that attempt to go beyond the best capabilities one can expect from sparse direct solvers. In particular we will describe the current activities on the implementations designed on top of runtime systems that should provide both code and performance portabilities across different parallel platforms. Finally, we will present some preliminary results to address the resilience issues on extreme scale computers; for that purpose we consider numerical alternatives that do not intensively rely on checkpoint restart mechanisms. |
The emergence of many-cores architectures introduces variations in computation costs, which makes precise cost models hard to realize. Static schedulers based on cost models, like the one used in the sparse direct solver PaStiX, are no longer adapted. We describe the dynamic scheduler developed for the super-nodal method of PaStiX to correct the imperfections of the static model. The solution presented exploit the elimination tree of the problem to keep the data locality during the execution. |
The current trend in the high performance computing shows a dramatic increase in the number of cores on the shared memory compute nodes. Algorithms, especially those related to linear algebra, need to be adapted to these new computer architectures in order to be efficient. PaStiX is a sparse parallel direct solver, that incorporates a dynamic scheduler for strongly hierarchical modern architectures. In this work, we study the replacement of this internal highly integrated scheduling strategy by two generic runtime frameworks: DAGuE and StarPU. Those runtimes will give the opportunity to execute the factorization tasks graph on emerging computers equipped with accelerators. As for previous work done in dense linear algebra, we will present the kernels used for GPU computations inspired by the MAGMA library and the DAG algorithm used with those two runtimes. A comparative study of the performances of the supernodal solver with the three different schedulers is performed on manycore architectures and the improvements obtained with accelerators will be presented with the StarPU runtime. These results demonstrate that these DAG runtimes provide uniform programming interfaces to obtain high performance on different architectures on irregular problems as sparse direct factorizations. |
This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All person copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
Les documents contenus dans ces répertoires sont rendus disponibles par les auteurs qui y ont contribué en vue d'assurer la diffusion à temps de travaux savants et techniques sur une base non-commerciale. Les droits de copie et autres droits sont gardés par les auteurs et par les détenteurs du copyright, en dépit du fait qu'ils présentent ici leurs travaux sous forme électronique. Les personnes copiant ces informations doivent adhérer aux termes et contraintes couverts par le copyright de chaque auteur. Ces travaux ne peuvent pas être rendus disponibles ailleurs sans la permission explicite du détenteur du copyright.
This document was translated from BibTEX by bibtex2html