Description

THESE pour obtenir LE TITRE DE DOCTEUR DE L INSTITUT NATIONAL POLYTECHNIQUE DE TOULOUSE Spécialité: Informatique et Télécommunications par Christof VÖMEL CERFACS Contributions à la recherche en calcul

Information

Category:
## Health & Lifestyle

Publish on:

Views: 14 | Pages: 20

Extension: PDF | Download: 0

Share

Transcript

THESE pour obtenir LE TITRE DE DOCTEUR DE L INSTITUT NATIONAL POLYTECHNIQUE DE TOULOUSE Spécialité: Informatique et Télécommunications par Christof VÖMEL CERFACS Contributions à la recherche en calcul scientifique haute performance pour les matrices creuses Contributions to research in high performance scientific computing for sparse matrices Thèse présentée le 20 mars 2003 devant le jury composé de: P. R. Amestoy Maître de conférence, enseeiht Directeur de thèse M. Arioli Directeur de recherche, ral I. S. Duff Directeur de recherche, cerfacs et ral E. Ng Directeur de recherche, nersc, lbnl Rapporteur A. Pothen Professeur, odu J. Roman Professeur, enseirb Rapporteur, Président Résumé Dans cette thèse, nous présentons le résultat de nos recherches dans le domaine du calcul en algèbre linéaire creuse. En particulier, nous nous intéressons au développement d un nouvel algorithme pour estimer la norme d une matrice de manière incrémentielle, à l implantation d un modèlederéférence des Basic Linear Algebra Subprograms for sparse matrices (Sparse BLAS), et àlaréalisation d un nouveau gestionnaire de tâches pour un solver multifrontal creux pour architectures à mémoire répartie. Notre méthode pour estimer la norme d une matrice a l avantage de s appliquer à tout type de matrices, dense et creux, contrairement aux algorithmes incrémentiels existants auparavant. Elle peut s avérer utile par exemple dans le cadre des factorisations QR, Cholesky, ou LU. En ce qui concerne le BLAS creux, notre modèle de référence en Fortran 95 est actuellement la seule implantation existante des interfaces spécifiées par le standard. Afin de laisser assez de libertéà l implantation la plus efficace, le standard définit des interfaces génériques et reste général quant à la structure des données. Nous avons donc été amenés àrépondre aux questions complexes concernant la représentation et la gestion des données. Le séquencement de tâches devient un enjeu important dès que nous travaillons sur un grand nombre de processeurs (entre 100 et 500). L algorithme introduit dans cette thèse permet d améliorer le passage àl échelle du solveur d une façon significative. Une étude des gains en mémoire et en temps de calcul obtenus sur une architecture possédant plus de 500 processeurs montre tout l intérêt de cette nouvelle approche. Mots clés: Algorithmique numérique, calcul à mémoire répartie, systèmes linéaires creux, élimination de Gauss multifrontal, séquencement dynamique de tâches, structures de données creuses, standard de programmation, BLAS creux, norme d une matrice, conditionnement, estimation incrémentielle. Abstract In this thesis, we present our research on methods in high performance scientific computing for sparse matrices. Specifically, we are concerned with the development of a new algorithm for incremental norm estimation, the reference model implementation of the standard for Basic Linear Algebra Subprograms for sparse matrices (Sparse BLAS), and the design of a new task scheduler for MUMPS, an asynchronous distributed memory multifrontal direct solver for sparse linear systems. Our new incremental norm estimator has the advantage of being applicable to both dense and sparse systems, in contrast to previously existing incremental schemes. Applications include for example monitoring of a QR, a Cholesky, or an LU factorization. Our Fortran 95 reference model is currently the only existing implementation of the interfaces in the standard for the Sparse BLAS. We identify many of the complicated issues regarding the representation and the handling of sparse data structures that underlie the high-level description of sparse vector and matrix operations in the standard but are avoided there in order to leave enough freedom for vendors to provide efficient implementations. With our new task scheduler for MUMPS, we address concerns about lack of scalability and performance on large numbers of processors that arose in a comparison with the SuperLU parallel solver. In the new approach, we determine, during the analysis of the matrix, candidate processes for the tasks that will be dynamically scheduled during the subsequent factorization. This approach significantly improves the scalability of the solver in terms of time and memory needed, as we show by comparison with the previous version. Keywords: High performance computing, sparse linear systems, MUMPS, multifrontal Gaussian elimination, distributed memory code, dynamic task scheduling, sparse data structures, programming standard, Sparse BLAS, reference implementation, matrix norm, condition number, incremental estimators. Acknowledgements I am grateful to I. Duff for allowing me to write my thesis at CERFACS, for letting me find my way, and for supporting my ideas and projects. I also thank P. Amestoy for his continuing interest in my work, his constructive criticism, and his guidance. I recognize that without their help, this thesis would not have been possible. I wish to thank Mr J. Roman and Mr E. Ng for their interest in my work and acting as referees on my thesis commitee. Likewise, I acknowledge with gratitude Mr M. Arioli and Mr A. Pothen as members of my jury. It is my great pleasure to thank all members of the Parallel Algorithms Group who, explicitly or implicitly through their friendship, in many ways contributed to this work. Merci, Brigitte, Bruno, Daniel, Dominique, Elisabeth, Emeric, Enric, Hamid, Jean-Christophe, Julien, Luc, Marielba, Martin, Serge, Songklod, Stéphane, Tadas. Contents Introduction 1 Part I 7 1 Incremental Norm Estimation for Dense and Sparse Matrices Introduction to incremental estimation Incremental estimators The original incremental condition estimator (ICE) Incremental norm estimation by approximate left singular vectors Incremental norm estimation by approximate right singular vectors Quality of incremental norm estimates Incremental norm estimation for sparse matrices The relationship between incremental norm and condition estimation Computational costs of incremental norm estimation Triangular factorizations with inverse factors The QR factorization with simultaneous inversion of R Stability issues of triangular matrix inversion Inverses in sparse factored form Sparse storage of triangular inverses Incremental norm estimation for sparse factored inverses Numerical tests Conclusions and future work Part II 35 2 Computational kernels for unstructured sparse matrices Introduction, history, and overview of the development of the Sparse BLAS The Sparse BLAS functionalities Level 1 Sparse BLAS functionalities Level 2 Sparse BLAS functionalities Level 3 Sparse BLAS functionalities 2.2.4 Routines for the creation of sparse matrices Remarks on the Sparse BLAS functionalities Representation of sparse matrices and vectors The representation of sparse vectors Internal data structures for sparse matrix data Sparse BLAS operations on sparse matrices and vectors Sparse vector operations Product of a sparse matrix with one or many dense vectors Solution of a sparse triangular system with one or many dense right-hand sides Releasing matrix handles Some remarks on using Fortran Sample programs A sample program An application: the power method Conclusions and future work Part III 61 3 Task scheduling in a multifrontal sparse linear solver Introduction to MUMPS Multifrontal solution of sparse linear systems General principles of sparse Gaussian elimination Tasks and task dependencies in the multifrontal factorization Parallelism in the multifrontal factorization The different types of parallelism Parallel task scheduling: main principles Geist-Ng mapping and layers in the assembly tree The proportional mapping of Pothen and Sun Dynamic task scheduling for type 2 parallelism Modifications to the assembly tree The benefits and drawbacks of amalgamation The benefits and drawbacks of splitting Delayed pivots and numerical stability Combining the concept of candidates with dynamic task scheduling Issues of dynamic scheduling Candidate processors for type 2 parallel nodes Task mapping and task scheduling in MUMPS Task mapping algorithm during the analysis phase Task scheduling during the factorization phase Details of the improved task mapping and scheduling algorithms The relaxed proportional mapping The Geist-Ng construction of layer L Choosing the number of candidates for a type 2 node 3.7.4 Layer-wise task mapping Post-processing of the assembly tree for an improved memory balance in the LU factorization The dynamic scheduling algorithm used at run time The test environment Regular grid test problems General symmetric and unsymmetric matrices Experimental investigation of algorithmic details The impact of k max on volume of communication and memory The impact of k max on performance Modifying the freedom offered to dynamic scheduling Improved node splitting Improved node amalgamation Post-processing for a better memory balance Performance analysis Nested dissection ordering Hybrid nested dissection with SCOTCH Approximate Minimum Fill (AMF) ordering Analysis of the speedup for regular grid problems Performance analysis on general symmetric and unsymmetric matrices Perspectives and future work Adapting the new scheduling algorithm to include communication costs Future directions of research Summary and conclusions Bibliography 113 Introduction The solution of large linear systems of equations is a central part of many scientific calculations. Whatever physical phenomenon might be modelled, at the end of the discretization process usually a linear system must be solved for unknowns that represent the physical quantities of the underlying problem. We are particularly interested in the class of sparse matrices which arise from diverse fields such as finite element models in mechanics, finite volume discretizations in fluid dynamics, or circuit theory in electrical engineering. We call a matrix sparse if it is advantageous, for example with respect to work and storage, to exploit the zero entries. By focusing only on the nonzero part of the matrix entries, sparse matrix algorithms can considerably reduce computations and memory space, and thus be much more efficient than their dense counterparts. However, the amount of work required for developing the sparse version of a given algorithm can be important and involve complicated issues like the efficient handling of sparse data structures. Because of its prominent role and the complexity of the related issues, research on the efficient solution of sparse linear systems is being given a great deal of attention and is going through an exciting evolution that we describe in the following. In the first part of this introductory chapter, we give an overview of the area of high performance scientific computing for the direct solution of sparse linear systems. In particular, we identify those issues that have inspired our own research. Then, in the second part, we present the results that we have obtained in the framework of this thesis. Current topics of research in high performance scientific computing for the direct solution of sparse linear systems As an introduction to the results of our research, we describe in this section recent work on the efficient solution of sparse linear systems. However, we are mainly concerned with direct methods for unstructured sparse matrices, that is matrices without a special form. These methods are often preferable to iterative solvers when trying to solve an ill-conditioned system. Furthermore, they are commonly exploited to construct preconditioners when because of limited storage, an iterative approach must be used [46]. The iterative solution of sparse linear systems and the calculation of eigenvalues are complicated issues in their own right, and we refer the reader to the recent surveys [69] and [121]. Furthermore, direct techniques for 2 Introduction structured sparse matrices, for example in tridiagonal or general banded form, are reviewed in [47, 56]. The evolution of computer architectures and programming paradigms has greatly influenced research and developments in sparse matrix computations. To make efficient use of modern machines with a multi-layered memory hierarchy, algorithm designers now try to increase the ratio of computational operations to memory access [41, 56]. In order to facilitate this task, but also to provide a level of abstraction for programming and to allow a greater robustness and program portability, the Basic Linear Algebra Subprograms (BLAS) were developed [39, 40, 100]. This initial set of standardized kernels for operations with dense matrices was adopted by vendors who provided optimised implementations, offering to users a high performance API for their programs. Thanks to this support, the gains through using BLAS kernels on modern system architectures are so high that we usually cannot afford not to use them. Another illustration of their huge impact on software development is given by the fact that standard numerical libraries like the Linear Algebra PACKage (LAPACK [13]) make heavy use of the BLAS kernels [42]. With this success of the BLAS, other efforts followed: first, the development of the Basic Linear Algebra Communication Routines (BLACS [43]) used within the framework of the ScaLAPACK [29] project, then the Automatically Tuned Linear Algebra Software (ATLAS [44]), and finally the design of a new, extended BLAS standard [1, 24] including kernels for mixed precision calculations [103] and sparse matrices [50, 57]. The performance of numerical algorithms on modern computer architectures is often limited by the speed of data movement between different layers of the memory hierarchy rather than by the performance of the floating point processing units of the processor executing the numerical operations of the algorithm itself [41]. In consequence, a key strategy of the BLAS is to partition the data into blocks which are subsequently loaded into the cache or the local memory with the goal of maximizing reuse and increasing the ratio of floating point to memory access operations. In this respect, an important application of the BLAS in the framework of sparse matrices is the class of blocked algorithms such as supernodal [16, 54] and multifrontal methods [54, 55]. There, one obtains a significant gain in performance through using the BLAS, that is dense matrix kernels, for the frontal matrices. One example of such a code making use of the Level 3 BLAS is MA41 [7, 12] from HSL [91]. While MA41 was designed for shared memory computers, the development of distributed memory machines and the message passing paradigm for communication in the 90 s required the design of new multifrontal codes and led, for example, to the recent development of MUMPS [9, 10, 11] and WSMP [72, 73]. Furthermore, alternative distributed memory codes, based for example on supernodal techniques, were developed, including PaStiX [79, 80, 81], SPOOLES [15], and SuperLU [35, 36]. For a complete review on the history of these different techniques, we refer to [41, 56, 76]. For any of these direct solvers, a crucial issue when working with sparse matrices is the ordering of the rows and columns so that the factorization preserves sparsity and/or less work needs to be performed. Furthermore, in a parallel environment, the ordering must offer parallelism for the solver. Classic orderings based on a minimum Introduction 3 degree heuristic [126] or the nested dissection approach [65] have been refined and have led to the development of modern algorithms including approximate minimum degree (AMD [6]) and approximate minimum fill (AMF [109, 119]) orderings as well as hybrid techniques [17, 78, 123] which are implemented in ordering packages such as CHACO [77], METIS [98], and SCOTCH [112, 113]. Moreover, in order to facilitate pivoting or to avoid it entirely, one often tries to permute large matrix entries to the diagonal [51, 52]. While the stability of an LU factorization with diagonal pivoting cannot be guaranteed even with this permutation, it can still mean that the number of off-diagonal pivots is substantially reduced in solvers like MA41, MUMPS, and WSMP. In the SuperLU distributed memory solver, static pivoting is used instead of threshold-based partial pivoting, that is off-diagonal pivots are entirely avoided. Small or zero diagonal entries are perturbed when encountered and the computation of a solution proceeds via iterative refinement [47, 85, 86]. Thus, the permutation of large entries to the diagonal is also very beneficial in this context [102]. In the special case of symmetric positive definite systems, the Cholesky factorization is preferable to the LU factorization because it is backward stable [86] and needs only half of the number of operations. Furthermore, since no pivoting is required, the sparse Cholesky factorization can avoid the use of dynamic data structures and work instead with static memory, as do for example PSPASES [74] and PaStiX [79, 80, 81]. Also from the point of view of scheduling, the sparse Cholesky factorization is very attractive. Due to the absence of pivoting, the elimination process depends only on the matrix structure and can be simulated efficiently without taking account of the values of the matrix entries. In particular, in a parallel environment, it is possible to calculate a static schedule prior to factorization that balances the work among the parallel processes based on a careful estimation of the speed of communication and of the computational operations of the underlying computer architecture, as does for example PaStiX [79, 80, 81]. Dynamic load balancing during the actual factorization as used by MUMPS [9, 10, 11] is generally less important in this case. However, the scheduling problems arising in the context of sparse multifrontal factorization, as far as they concern the minimisation of the overall completion time for the jobs to be scheduled, are NP hard in general [62]. Consequently, one usually calculates only an approximate solution based on relatively inexpensive heuristics or so-called approximating algorithms in order not to increase the overall complexity of the linear solver [89, 101]. Once we have solved a linear system, another important step consists of determining the quality of the numerical solution that has been computed in finite precision. As already mentioned in the discussion of SuperLU, iterative refinement can be used to improve the accuracy of the computed solution. While backward error bounds usually are computed from a residual [86], the computation of forward error bounds requires knowledge of the matrix condition number. As the computation of singular values is costly, these are commonly not computed exactly but obtained via relatively inexpensive so-called condition estimators. These are in- 4 Introduction cluded in standard numerical libraries such as LINPACK [30] or LAPACK [75, 88] for dense matrices, for a survey we refer also to [83, 86]. Furthermore, an alternative, so-called incremental, approach was designed specifically to monitor the conditioning of a triangular factorization on the fly [22]. However, the application of this incremental technique to sparse matrices is difficult and requires additional care, see [23, 59]. A related question concerns the determination of the numerical null space of a matrix by a direct factorization and led to the development of the class of so-called rank-revealing methods, see for example [26, 27, 28, 110]. Initially, these were designed for dense matrices but have recently been adopted also for sparse solvers, see [114]. In all these methods, a condition estimator plays a central role for determining the rank-revealing factor. A description of our research and the contents of this manuscript In the first part of this introduction, we have given an overview of recent research

Related Search

Similar documents

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks