© 2016 IBM Corporation
Enhanced MPSM3 for applications to quantum biological simulations
Cristiano Malossi, IBM Research - Zurich
A. Pozdneev, V. Weber, T. Laino, C. Bekas, A. Curioni, IBM Research - Zurich
Motivations
The application of quantum Hamiltonians to biological systems is limited by the cost
of performing long calculations on large systems (more than 30k atoms).
Classical force fields and QM/MM are well suited to conformational changes and
localized reactions, respectively. Hence the need to develop scalable algorithms
that allow the application of quantum Hamiltonians to biological systems, for example:
[Figures: NADH:ubiquinone oxidoreductase (large-scale ion motion) and succinate dehydrogenase (large-scale electron transfer)]
Outlook and Goal
Goal: Design an efficient parallel sparse matrix-matrix multiply.
 Introduction: Born-Oppenheimer molecular dynamics.
 Parallelization: midpoint-based parallel sparse matrix-matrix
multiplication for matrices with decay.
 Benchmark: weak and strong scaling, and communication volume on BlueGene/Q¹.
 Summary
¹ IBM and Blue Gene are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
Introduction: Born-Oppenheimer molecular dynamics
The core operation of the SCF iterations is the sparse matrix-matrix multiplication.
Each SCF iteration requires the construction of the density matrix.
Each MD step requires the potential energy U to be evaluated at the relaxed ground-state electronic density.
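As context for why the sparse matrix-matrix multiply dominates, here is a minimal sketch, not the code behind these slides, of a density-matrix build by McWeeny purification; the actual purification scheme, thresholds, and names used for the benchmarks may differ, and the chemical potential mu is assumed known.

```python
# Minimal sketch, NOT the authors' code: density-matrix construction by McWeeny
# purification, shown only to illustrate that each DM build reduces to repeated
# sparse matrix-matrix multiplies (the benchmarks later report 17 multiplies per build).
import numpy as np
import scipy.sparse as sp

def build_density_matrix(H, mu, n_iter=17, drop_tol=1e-6):
    n = H.shape[0]
    I = sp.identity(n, format="csr")
    # Initial guess: map the spectrum of (mu*I - H) into [0, 1] using a cheap
    # Gershgorin-style bound on the spectral radius of (H - mu*I).
    radius = abs(H - mu * I).sum(axis=1).max()
    D = 0.5 * I + (mu * I - H) / (2.0 * radius)
    for _ in range(n_iter):
        D2 = D @ D                      # sparse matrix-matrix multiply: the core kernel
        D = 3.0 * D2 - 2.0 * (D2 @ D)   # McWeeny step: D <- 3*D^2 - 2*D^3
        D.data[np.abs(D.data) < drop_tol] = 0.0
        D.eliminate_zeros()             # keep D sparse (matrices with decay)
    return D
```

The thresholding after each step is what keeps the iterates sparse, which is the property the midpoint-based parallel multiply exploits.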
Parallel sparse matrix-matrix multiplication
[Figure: simulation cell containing atoms]
 Atoms in the simulation cell.
Parallel sparse matrix-matrix multiplication
[Figure: simulation cell divided into boxes]
 Simulation cell divided into boxes. Each box and its atoms are owned by
a process.
Parallel sparse matrix-matrix multiplication
[Figure: atoms i and k located in different boxes]
 These two atoms are owned by different processes.
Parallel sparse matrix-matrix multiplication
[Figure: atoms i and k with their midpoint (+) and the matrix block Aik]
 The matrix block Aik is owned by the process in whose box the midpoint between atoms i and k resides.
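To make the ownership rule concrete, the following is a hedged sketch (function, argument names, and grid layout are illustrative, not the authors' API) of deciding which box, and hence which process, owns a block Aik: the box containing the minimum-image midpoint of atoms i and k in an orthorhombic periodic cell.

```python
# Hypothetical helper: box index that owns block A_ik under the midpoint rule,
# assuming an orthorhombic periodic cell split into grid[0] x grid[1] x grid[2]
# boxes, one box per process.
import numpy as np

def owner_box(r_i, r_k, cell, grid):
    r_i = np.asarray(r_i, dtype=float)
    cell = np.asarray(cell, dtype=float)
    grid = np.asarray(grid, dtype=int)
    delta = np.asarray(r_k, dtype=float) - r_i
    delta -= cell * np.round(delta / cell)        # minimum-image displacement i -> k
    midpoint = (r_i + 0.5 * delta) % cell         # wrap the midpoint back into the cell
    box = np.floor(midpoint / cell * grid).astype(int)
    return tuple(box % grid)                      # guard against floating-point edge cases

# Example: two atoms near opposite faces of a 10x10x10 cell on a 4x4x4 box grid.
print(owner_box([0.5, 5.0, 5.0], [9.5, 5.0, 5.0], cell=[10.0, 10.0, 10.0], grid=[4, 4, 4]))
# -> (0, 2, 2): the midpoint through the periodic boundary lies near x = 0.
```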
Parallel sparse matrix-matrix multiplication
[Figure: atom j added; matrix block Bkj with its midpoint (+)]
 Another matrix block, Bkj.
Parallel sparse matrix-matrix multiplication
[Figure: atoms i and j with their midpoint (+); result block Cij]
 The result Cij of the product Aik Bkj is owned by the process where the midpoint between atoms i and j resides.
Parallel sparse matrix-matrix multiplication
 Blocks Aik and Bkj are sent to the process that owns Cij, where the multiplication takes place. Blocks are sent along x, y and z.
[Figure: blocks Aik and Bkj routed to the midpoint (+) of atoms i and j, where Cij = Cij + Aik Bkj is accumulated]
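The shift pattern along x, y and z can be illustrated with the following hedged mpi4py sketch (not the authors' code; the real algorithm routes each block only as far as its midpoint cell requires): data moves between neighbouring boxes on a periodic 3-D Cartesian process grid, one direction at a time.

```python
# Illustrative only: one nearest-neighbour push per direction on a periodic 3-D
# Cartesian grid of processes, the pattern used to route blocks toward the owner
# of C_ij. Run with e.g.: mpirun -n 8 python shift_demo.py (file name is hypothetical).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 3)        # factor the ranks into a 3-D grid
cart = comm.Create_cart(dims, periods=[True] * 3)  # periodic, like the simulation cell

block = np.full((4, 4), float(cart.Get_rank()))    # toy stand-in for a matrix block
for direction in range(3):                         # x, then y, then z
    src, dst = cart.Shift(direction, 1)            # neighbours along this direction
    recv = np.empty_like(block)
    cart.Sendrecv(block, dest=dst, recvbuf=recv, source=src)
    block = recv                                   # the block has moved one box over
```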
Improved MPSM3
 The process that owns the midcell performs the multiplication.
[Figure: two boxes (x) and their midcell (+)]
Improved MPSM3
 All blocks A and B are sent to the process that owns the midcell, where the multiplication takes place. Blocks are sent along x, y and z.
[Figure: blocks A** and B** gathered at the midcell (+), where C = C + AB is computed]
Improved MPSM3
 The process that performs the multiplication needs to redistribute the results to the neighboring processes. Blocks are sent along x, y and z.
[Figure: result blocks Cij and Ci'j' pushed from the midcell (+) back to the processes that own the midpoints of (i, j) and (i', j')]
Improved MPSM3
[Diagram: the three phases of the improved MPSM3: exchange of local matrices, local products, and redistribution of the computed matrix]
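As a rough sketch of the "local products" phase (block containers and names are hypothetical, not the authors' data structures), the midcell process can accumulate Cij += Aik Bkj over every matching k roughly as follows.

```python
# Hedged sketch of the local-products phase: after the exchange, the midcell
# process holds the relevant blocks of A and B and accumulates the block products.
from collections import defaultdict

def local_products(A_blocks, B_blocks):
    """A_blocks: {(i, k): ndarray}, B_blocks: {(k, j): ndarray} -> {(i, j): ndarray}."""
    B_by_row = defaultdict(list)            # index B blocks by their row index k
    for (k, j), Bkj in B_blocks.items():
        B_by_row[k].append((j, Bkj))
    C_blocks = {}
    for (i, k), Aik in A_blocks.items():
        for j, Bkj in B_by_row.get(k, []):
            prod = Aik @ Bkj                # dense block product (shapes assumed to conform)
            if (i, j) in C_blocks:
                C_blocks[(i, j)] = C_blocks[(i, j)] + prod
            else:
                C_blocks[(i, j)] = prod
    return C_blocks
```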
Benchmark: weak scaling
 Time per density-matrix (DM) build vs. the number of MPI tasks, PM6
 About 19 waters per task
 Parallel efficiency: 92% at 110,592 MPI tasks (2.1M waters)
 Number of non-zero elements: 1.6k/water (O1) and 1.0k/water (O2)
Constant walltime with proportional resources
Benchmark: weak scaling (improved MPSM3)
 Time per density-matrix (DM) build vs. the number of MPI tasks, PM6
 About 19 waters per task
 Number of non-zero elements: 1.6k/water [plot annotation: ~10x]
Improved MPSM3 already competes with libdbcsr (https://dbcsr.cp2k.org/) for small system-size to MPI-task ratios
Benchmark: strong scaling
 Time per density-matrix (DM) build vs. the number of MPI tasks, PM6
 110k (S1), 373k (S2) and 1124k (S3) waters
Largest system:
Matrix dimensions: 6,749,184 x 6,749,184
Non-zero elements: 3.9E-3%
Number of multiplies: 17
Sparsity boost vs. dense: 42,760x
Benchmark: strong scaling (improved MPSM3)
 Time per density-matrix (DM) build vs. the number of MPI tasks, PM6
 32k (S0) and 110k (S1) waters
Benchmark: communication volume
 Total communication volume (Isend/Irecv) per DM build vs. the number of MPI tasks
 110k (S1), 373k (S2) and 1124k (S3) waters
 BlueGene/Q
Summary
MPSM3 and its improved version show (with one push per direction):
 close to perfect weak scaling
 very good strong scaling
 communication volume that decreases as the number of tasks increases
 fewer logistic operations (improved version)
Provided proportional resources, an MD step can be performed in a few dozen seconds regardless of system size.
Parallel sparse matrix-matrix multiplication
[Figure: an atom and its interaction radius]
 Interaction of an atom with its neighbors.
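For completeness, a hedged sketch of the cutoff behind the sparsity: only atom pairs within the interaction radius contribute non-zero blocks. This is a brute-force O(n^2) version for clarity (production codes use cell lists), and the function and argument names are illustrative.

```python
# Hedged sketch: atoms i and k give a non-zero block only if their minimum-image
# distance is below the interaction radius ("matrices with decay").
import numpy as np

def neighbor_pairs(coords, cell, radius):
    coords = np.asarray(coords, dtype=float)   # (n, 3) atom positions
    cell = np.asarray(cell, dtype=float)       # orthorhombic cell edge lengths
    pairs = []
    for i in range(len(coords) - 1):
        delta = coords[i + 1:] - coords[i]
        delta -= cell * np.round(delta / cell)             # minimum-image convention
        dist = np.linalg.norm(delta, axis=1)
        for off in np.nonzero(dist < radius)[0]:
            pairs.append((i, int(i + 1 + off)))            # pair (i, k) within the radius
    return pairs
```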
References
 SEMD I: Midpoint-based parallel sparse matrix-matrix multiplication algorithm for matrices with decay.
Valéry Weber, Teodoro Laino, Alexander Pozdneev, Irina Fedulova, and Alessandro Curioni.
Journal of Chemical Theory and Computation, 2015, 11 (7), 3145-3152.
https://doi.org/10.1021/acs.jctc.5b00382
 Enhanced MPSM3 for Applications to Quantum Biological Simulations.
Alexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro Curioni.
In Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article No. 9.
https://dl.acm.org/citation.cfm?id=3014916
