
Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

We have developed a parallel eigensolver for very small-size matrices. Unlike conventional solvers, our design policy focuses on the nature of non-blocking computation and reduced communication. A communication-avoiding approach for Householder pivot vectors is used to implement part of the Householder inverse transformation. In addition, we apply several techniques for reducing communication by using non-blocking communications in the tridiagonalization part. Performance of the solver on the full Fujitsu FX10 system (76,800 cores) is also presented.


  1. 1. Extreme-Scale Parallel Symmetric Eigensolver for Very Small-Size Matrices Using A Communication-Avoiding for Pivot Vectors. Takahiro Katagiri (Information Technology Center, The University of Tokyo), Jun'ichi Iwata and Kazuyuki Uchida (Department of Applied Physics, School of Engineering, The University of Tokyo). Thursday, February 20, Room: Salon A, 10:35-10:55, MS34 Auto-tuning Technologies for Extreme-Scale Solvers - Part I of III, SIAM PP14, Feb. 18-21, 2014, Marriott Portland Downtown Waterfront, Portland, OR, USA.
  2. 2. Outline • Target Application: RSDFT • Parallel Algorithm of Symmetric  Eigensolver for Small Matrices  • Performance Evaluation with 76,800  cores of the Fujitsu FX10 • Conclusion
  3. 3. Outline • Target Application: RSDFT • Parallel Algorithm of Symmetric  Eigensolver for Small Matrices  • Performance Evaluation with 76,800  cores of the Fujitsu FX10 • Conclusion
  4. 4. RSDFT (Real Space Density Functional Theory). The Kohn-Sham equation is solved as a finite-difference equation:
[ -(1/2)∇² + v_ion(r) + ∫ ρ(r')/|r - r'| dr' + μ_XC[ρ](r) ] ψ_j(r) = ε_j ψ_j(r).
J.-I. Iwata et al., J. Comp. Phys. 229, 2339 (2010).
[Figures: a 10,648-atom cell of Si crystal and its electron density; volume of Si crystal vs. total energy (energy/atom [eV] against volume/atom, for 10,648 and 21,952 atoms); structural properties of Si crystal.]
  5. 5. Requirements of Mathematical Software from RSDFT • An FFT-free algorithm. • Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix. – A standard eigenproblem. – It is solved O(100) times during the SCF (Self-Consistent Field) process. • Re-orthogonalization of the eigenvectors. • Due to computational complexity, the eigensolver and orthogonalization parts become a bottleneck. – These parts require O(N3) computations, while the others require O(N2) computations. • The matrix and eigenvalues are distributed to obtain parallelism in the parts other than the eigensolver. – It is difficult to gather the whole data on one node, even though it is small.
  6. 6. Requirements of Mathematical Software from RSDFT (Cont'd) • Parts of the application other than the eigensolver are also time-consuming. Source: Y. Hasegawa et al.: First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011). RSDFT process breakdown (execution cost relative to whole time):
Process                  Cost [%]  Order
SCF                      99.6      O(N3)
SD                       47.2      O(N3)
Subspace Diag.           44.2      O(N3)
MatE                     10.0      O(N3) (DGEMM)
Eigensolve               19.6      O(N3)
Rot V                    14.6      O(N3)
CG (Conjugate Gradient)  26.0      O(N2)
GS (Gram-Schmidt Orth.)  25.8      O(N3) (DGEMM)
Others                    0.6      -
The Eigensolve and GS parts will be the bottleneck in large-scale computation, but the other processes also need to be considered. • The required memory space also needs to be considered. – Due to the API of the numerical library (e.g., re-distribution of data), the actual problem size is limited to small sizes with respect to the remaining memory space.
  7. 7. Our Assumption • Target: the eigensolver part in RSDFT. • Exa-scale computing: the total number of nodes is on the order of 1,000,000 (a million). • Since the matrix is two-dimensional (2D), the matrix size required on an exa-scale computer reaches the order of 10,000 * sqrt(1,000,000) = 10,000,000 (ten million) if each node holds a local matrix of N=10,000. • Since most dense solvers require O(N3) computational complexity, the execution time with a matrix of N=10,000,000 (ten million) is unrealistic in actual applications (in the production-run phase).
  8. 8. Our Assumption (Cont'd) • We presume that N=1,000 per node is the maximum size; the size at exa-scale is then on the order of N=1,000,000 (a million). • The memory used for the matrix per node is only on the order of 8 MB. – This is for the eigensolver part only; see the sketch below for the arithmetic. • This is just the cache size of current CPUs. – Next-generation CPUs may have on the order of 100 MB of cache, such as the IBM POWER8 with eDRAM (3D stacked memory) for the L4 cache.
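As a quick sanity check of this size arithmetic, the following minimal C sketch (not part of the talk; the node count and local matrix order are the assumptions stated above) computes the implied process grid, the global matrix order, and the per-node matrix footprint in double precision.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const long long nodes   = 1000000LL;   /* assumed exa-scale node count */
    const long long n_local = 1000LL;      /* assumed local matrix order   */
    const long long grid    = (long long)sqrt((double)nodes);  /* p = q    */

    const long long n_global = n_local * grid;                 /* global N */
    const double mb_per_node =
        (double)(n_local * n_local * sizeof(double)) / 1.0e6;  /* in MB    */

    printf("process grid      : %lld x %lld\n", grid, grid);
    printf("global matrix size: N = %lld\n", n_global);
    printf("local matrix size : %.1f MB per node\n", mb_per_node);
    return 0;
}
```

With these assumptions the sketch reports a 1,000 x 1,000 grid, a global order of N = 1,000,000, and 8.0 MB per node, matching the numbers on the slide.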
  9. 9. Originalities of Our Eigensolver 1. Non-blocking computation algorithm – the data fits in cache under our exa-scale assumption. 2. Communication-reducing and communication-avoiding algorithm – applied to the tridiagonalization and Householder inverse transformation of the symmetric eigensolver, by duplicating the Householder vectors. 3. Hybrid MPI-OpenMP execution – on the full system of a peta-scale supercomputer (the Fujitsu FX10) consisting of 4,800 nodes (76,800 cores); a minimal sketch of this execution model is given below.
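The following is a minimal, self-contained sketch of the hybrid MPI-OpenMP execution model listed as item 3 (it is not the authors' solver; the axpy-style loop is only a stand-in for the node-local kernel): one MPI process per node, OpenMP threads working on the node-local data, and only the master thread issuing MPI calls.

```c
/* Hybrid MPI-OpenMP sketch: one MPI rank per node, OpenMP threads inside.
 * The axpy-style loop stands in for the node-local eigensolver kernel. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;
    /* Funneled threading is enough: only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_local = 1000;                 /* assumed local problem size */
    double *x = malloc(sizeof(double) * n_local);
    double *u = malloc(sizeof(double) * n_local);
    for (int i = 0; i < n_local; ++i) { x[i] = 1.0; u[i] = 0.5; }

    /* Node-local computation threaded with OpenMP. */
    #pragma omp parallel for
    for (int i = 0; i < n_local; ++i)
        x[i] -= 2.0 * u[i];

    /* Inter-node communication issued by one thread per process. */
    double local_sum = 0.0, global_sum;
    for (int i = 0; i < n_local; ++i) local_sum += x[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks x %d threads, sum = %f\n",
               nprocs, omp_get_max_threads(), global_sum);
    free(x); free(u);
    MPI_Finalize();
    return 0;
}
```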
  10. 10. Outline • Target Application: RSDFT • Parallel Algorithm of Symmetric  Eigensolver for Small Matrices  • Performance Evaluation with 76,800  cores of the Fujitsu FX10 • Conclusion
  11. 11. A Classical Householder Algorithm (standard eigenproblem Ax = λx, A: symmetric dense matrix)
1. Householder transformation (tridiagonalization), O(n3): QAQ = T, with Q = H1 H2 ... Hn-2 and T a tridiagonal matrix.
2. Bisection: all eigenvalues Λ of the tridiagonal matrix T.
3. Inverse iteration, O(n2) ~ O(n3) (MRRR: O(n2)): all eigenvectors Y of the tridiagonal matrix T.
4. Householder inverse transformation, O(n3): all eigenvectors X = QY of the dense matrix A.
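For reference, these four steps map onto standard LAPACK routines. The following single-node C sketch (an illustration using LAPACKE, not the authors' distributed-memory solver) runs tridiagonalization, bisection, inverse iteration, and the Householder back-transformation on a small test matrix.

```c
/* Single-node sketch of the classical four-step Householder eigensolver
 * using LAPACKE; the talk's solver is a distributed-memory version of
 * the same pipeline, not these library calls. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const lapack_int n = 4;
    /* Small symmetric test matrix, column-major, lower triangle used. */
    double a[16] = { 4, 1, 2, 0,
                     1, 3, 1, 1,
                     2, 1, 5, 2,
                     0, 1, 2, 6 };
    double d[4], e[3], tau[3], w[4];
    lapack_int m, nsplit, iblock[4], isplit[4], ifail[4];

    /* 1. Householder transformation (tridiagonalization): A -> T.       */
    LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, a, n, d, e, tau);

    /* 2. Bisection: all eigenvalues of T (ordered by block for dstein). */
    LAPACKE_dstebz('A', 'B', n, 0.0, 0.0, 0, 0, 0.0,
                   d, e, &m, &nsplit, w, iblock, isplit);

    /* 3. Inverse iteration: eigenvectors Y of T.                        */
    double *z = malloc(sizeof(double) * (size_t)(n * m));
    LAPACKE_dstein(LAPACK_COL_MAJOR, n, d, e, m, w, iblock, isplit,
                   z, n, ifail);

    /* 4. Householder inverse transformation: X = Q Y.                   */
    LAPACKE_dormtr(LAPACK_COL_MAJOR, 'L', 'L', 'N', n, m, a, n, tau, z, n);

    for (lapack_int i = 0; i < m; ++i)
        printf("lambda[%d] = %f\n", (int)i, w[i]);
    free(z);
    return 0;
}
```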
  12. 12. Whole Parallel Process of the Eigensolver (matrix A stored with a 2D cyclic-cyclic distribution): tridiagonalize A into T; gather all elements of T to every process; compute upper and lower limits for the eigenvalues; obtain the eigenvalues Λ in rising order; compute the eigenvectors of T corresponding to that order; gather all eigenvalues Λ; apply the Householder inverse transformation to obtain the eigenvectors Y.
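As a side note, a minimal sketch of the 2D cyclic (cyclic-cyclic) mapping mentioned above (the p x q grid shape here is an illustrative assumption): global entry (i, j) is owned by process (i mod p, j mod q) and stored at local position (i / p, j / q).

```c
#include <stdio.h>

/* Owner process coordinates and local indices of global entry (i, j)
 * under a 2D cyclic (cyclic-cyclic) distribution on a p x q grid. */
static void cyclic2d(int i, int j, int p, int q,
                     int *prow, int *pcol, int *li, int *lj)
{
    *prow = i % p;   /* process row owning global row i       */
    *pcol = j % q;   /* process column owning global column j */
    *li   = i / p;   /* local row index on that process       */
    *lj   = j / q;   /* local column index on that process    */
}

int main(void) {
    const int p = 2, q = 4;          /* illustrative process grid */
    int prow, pcol, li, lj;
    cyclic2d(5, 7, p, q, &prow, &pcol, &li, &lj);
    printf("global (5,7) -> process (%d,%d), local (%d,%d)\n",
           prow, pcol, li, lj);
    return 0;
}
```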
  13. 13. Data Duplication in Tridiagonalization. [Figure] Matrix A is distributed over a grid of p processes by q processes; the vectors uk and xk (uk: Householder vector) and the vectors yk are held in duplicate across the grid.
  14. 14. Transposed yk in Tridiagonalization (the case p < q). [Figure] yk is duplicated by multi-casting MPI_ALLREDUCE on a rectangular processor grid (example: p = 2, q = 4, with marked root processes) [Katagiri and Itoh, 2010]. Communication avoiding is obtained by using the duplications.
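A minimal sketch of this kind of duplication on a rectangular grid (illustrative only: the communicator layout, grid shape, and dummy vector contents are assumptions, not the authors' code). MPI_COMM_WORLD is split into per-column communicators, and MPI_Allreduce leaves every process of a column with the same assembled yk.

```c
/* Duplicate a vector y_k along the process columns of a p x q grid by
 * summing partial contributions with MPI_Allreduce (sketch only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumed rectangular grid with p = 2 rows (run with an even number
     * of ranks, e.g. mpirun -np 8 for the p = 2, q = 4 example). */
    const int q = (size >= 2) ? size / 2 : 1;
    const int my_row = rank / q, my_col = rank % q;

    /* One communicator per process column: ranks sharing my_col. */
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

    enum { NLOC = 4 };                      /* assumed local slice length */
    double y_partial[NLOC], y_dup[NLOC];
    for (int i = 0; i < NLOC; ++i) y_partial[i] = (double)my_row; /* dummy */

    /* After this call every process in the column holds the same y_k. */
    MPI_Allreduce(y_partial, y_dup, NLOC, MPI_DOUBLE, MPI_SUM, col_comm);

    if (rank == 0) printf("duplicated y_k[0] = %f\n", y_dup[0]);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```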
  15. 15. Parallel Householder Inverse Transformation
<1> do k = n-2, 1, -1
<2>   Gather the vector uk and the scalar αk by using multiple MPI_BCASTs.
<3>   do i = nstart, nend
<4>     βi = αk ukT A(k)(k:n, i)
<5>     A(k)(k:n, i) = A(k)(k:n, i) - βi uk
<6>   enddo
<7> enddo
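A condensed C sketch of the loop above (a sketch under assumptions, not the authors' kernel: the owner_of callback, the U_local/alpha_local storage layout, and the local column range are hypothetical). Each step broadcasts uk and its coefficient from the owning process, then applies the rank-1 update to the locally owned eigenvector columns.

```c
/* Sketch of the parallel Householder inverse transformation:
 * for each k, broadcast u_k and alpha_k from their owner, then apply
 * the rank-1 update to the locally owned eigenvector columns of Y. */
#include <mpi.h>
#include <stdlib.h>

/* Y: locally owned eigenvector columns, column-major, leading dim n.
 * U_local: column k holds the active part of u_k (hypothetical layout).
 * owner_of(k): rank that owns u_k and alpha_k (hypothetical callback). */
void householder_inverse(double *Y, int n, int ncols_local,
                         const double *U_local, const double *alpha_local,
                         int (*owner_of)(int), MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double *uk = malloc(sizeof(double) * (size_t)n);

    for (int k = n - 3; k >= 0; --k) {          /* k = n-2, ..., 1 (1-based) */
        const int root = owner_of(k);
        const int len  = n - k;                 /* active part u_k(k:n-1)    */
        double alpha_k = 0.0;

        if (rank == root) {                     /* owner packs u_k, alpha_k  */
            for (int i = 0; i < len; ++i) uk[i] = U_local[(size_t)k * n + i];
            alpha_k = alpha_local[k];
        }
        MPI_Bcast(uk, len, MPI_DOUBLE, root, comm);
        MPI_Bcast(&alpha_k, 1, MPI_DOUBLE, root, comm);

        for (int j = 0; j < ncols_local; ++j) { /* update local columns only */
            double *yj = Y + (size_t)j * n + k; /* Y(k:n-1, j)               */
            double beta = 0.0;
            for (int i = 0; i < len; ++i) beta += uk[i] * yj[i];
            beta *= alpha_k;
            for (int i = 0; i < len; ++i) yj[i] -= beta * uk[i];
        }
    }
    free(uk);
}
```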
  16. 16. Gathering Vector uk for the Inverse Transformation: non-packing messages for gathering uk. [Figure] uk is duplicated on a p x q grid (example: p = 2, q = 4) via two multi-casting MPI_BCAST steps ① and ②. Communication avoiding by using the duplications.
  17. 17. Gathering Vector uk for the Inverse Transformation: packing messages for gathering uk. [Figure] Same p x q grid (p = 2, q = 4) and two multi-casting MPI_BCAST steps ① and ②, but uk and uk+1 are packed and sent in one communication (communication blocking, blocking length = 2). Communication avoiding & reducing by using packing of messages.
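A minimal sketch of the message packing (communication blocking of length 2) described above; the buffer layout and routine name are assumptions, not the authors' implementation. Two consecutive pivot vectors are copied into one buffer so that a single MPI_Bcast replaces two.

```c
/* Sketch of communication blocking: pack u_k and u_{k+1} into one buffer
 * and broadcast them with a single MPI_Bcast (blocking length = 2). */
#include <mpi.h>
#include <string.h>
#include <stdlib.h>

void bcast_pivots_packed(double *uk, double *uk1, int len,
                         int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *buf = malloc(sizeof(double) * 2 * (size_t)len);
    if (rank == root) {                       /* owner packs both vectors */
        memcpy(buf,       uk,  sizeof(double) * (size_t)len);
        memcpy(buf + len, uk1, sizeof(double) * (size_t)len);
    }

    /* One broadcast instead of two: fewer messages, same data volume. */
    MPI_Bcast(buf, 2 * len, MPI_DOUBLE, root, comm);

    if (rank != root) {                       /* receivers unpack         */
        memcpy(uk,  buf,       sizeof(double) * (size_t)len);
        memcpy(uk1, buf + len, sizeof(double) * (size_t)len);
    }
    free(buf);
}
```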
  18. 18. Outline • Target Application: RSDFT • Parallel Algorithm of Symmetric  Eigensolver for Small Matrices  • Performance Evaluation with 76,800  cores of the Fujitsu FX10 • Conclusion
  19. 19. Oakleaf-FX (ITC, U. Tokyo), the Fujitsu PRIMEHPC FX10: 4,800 nodes (76,800 cores)
Whole system: total performance 1.135 PFLOPS; total memory 150 TB; total #nodes 4,800; interconnect: Tofu (6-dimensional mesh/torus); local file system 1.1 PB; shared file system 2.1 PB.
Node: theoretical peak performance 236.5 GFLOPS; #processors (#cores) 16; main memory 32 GB.
Processor: SPARC64 IXfx; frequency 1.848 GHz; theoretical peak performance per core 14.78 GFLOPS.
  20. 20. COMMUNICATION AVOIDING  EFFECT
  21. 21. Householder Inverse Transformation (4,096 nodes (65,536 cores), 64x64 grid), N=38,400, hybrid execution. [Chart] Time in seconds (breakdown: Other, HIT Ker, Send Piv) for three communication implementations: MPI_BCAST (non-packing sending), binary-tree MPI_Isend (non-blocking MPI), and block MPI_BCAST (packing sending). Best parameter: #processes = 4,096, #threads = 16/node, communication block = 12; packing sending gives a 1.57x improvement.
  22. 22. HYBRID MPI‐OPENMP EFFECT
  23. 23. Pure MPI vs. Hybrid MPI-OpenMP (64 nodes (1,024 cores)), N=4,800, total time. [Chart] Time in seconds by process organization, broken down into Householder inverse transformation, calculating eigenvectors, re-distribution, and tridiagonalization: 16x64 (pure MPI) vs. 8x8 (hybrid, 64 MPI processes with 16 OpenMP threads per MPI process). The hybrid run is 1.61x faster.
  24. 24. Pure MPI vs. Hybrid MPI-OpenMP (64 nodes (1,024 cores)), N=4,800, tridiagonalization. [Chart] Time in seconds by process organization (breakdown: Other, Update, MatVec, MatVec Reduce, Send xt, Send yt, Send Piv). The communication ratio drops from 46.1% (pure MPI, 16x64) to 27.9% (hybrid, 8x8), while computation rises from 53.9% to 72.1%: an 18.2-point reduction in communication.
  25. 25. Pure MPI vs. Hybrid MPI-OpenMP (64 nodes (1,024 cores)), N=4,800, Householder inverse transformation. [Chart] Time in seconds by process organization (breakdown: Other, HIT Ker, Send Piv). The communication ratio drops from 44.6% (pure MPI, 16x64) to 15.6% (hybrid, 8x8), while computation rises from 55.4% to 84.4%: a 29-point reduction in communication.
  26. 26. FX10 76800 CORES (4800 NODES) RESULTS 
  27. 27. Hybrid MPI-OpenMP Execution on 4,800 nodes (76,800 cores), 40x120 process grid. [Chart] Time in seconds for N=41,568, N=83,138, and N=166,276, broken down into Householder inverse transformation, calculating eigenvectors, re-distribution, and tridiagonalization, with HIT communication block = 6, 4, and 2, respectively; reported bar values include 31.8, 83.3, and 429.9 seconds and 34.3, 180.1, and 904.0 seconds, with ratios between problem sizes ranging from 2.61x to 5.24x (annotation: inner L1 cache size). Note: only about a 4x increase in time for a 2x increase in problem size, in an O(N3) algorithm.
  28. 28. Execution Time in Pure MPI: ScaLAPACK PDSYEVD vs. Ours. ScaLAPACK (version 1.8) on the Fujitsu FX10 with Fujitsu optimized BLAS; the best block size is chosen for each ScaLAPACK run from 1, 8, 16, 32, 64, 128, and 256. Time in seconds (lower is better):
N=4800 (8x8 grid, 64 cores): ScaLAPACK 4.26, Ours 1.79
N=9600 (16x16 grid, 256 cores): ScaLAPACK 10.96, Ours 4.61
N=19200 (32x32 grid, 1024 cores): ScaLAPACK 25.76, Ours 15.52
  29. 29. Conclusion • Our eigensolver is effective for very small matrices because it utilizes communication-reducing and communication-avoiding techniques: – by holding duplicates of the Householder vectors in the tridiagonalization and Householder inverse transformation phases; – by using reduced communications for multiple sends with 2D splitting of the process grid; – by packing messages in the Householder inverse transformation part. • Selection among the communication implementations is the target of auto-tuning (AT). – The best implementation depends on the process grid, the number of processors, and the block size for data packing.
  30. 30. Conclusion (Cont'd) • One of the drawbacks is an increase in memory space of O(N2/p), where the process grid is p x q. – Since the memory space for the matrix is within cache size, this increase can be ignored. • Comparison with new blocking algorithms is future work: – two-step methods with block Householder tridiagonalization, such as Eigen-K (RIKEN), ELPA (Technische Universität München), and new implementations of PLASMA and MAGMA.
  31. 31. Acknowledgements • Computational resources on the Fujitsu FX10 were awarded by the "Large-scale HPC Challenge" Project, Information Technology Center, The University of Tokyo. This work has been submitted to Parallel Computing (as of December 2013).
