Gromacs and Kepler GPUs


Published on

Learn about the first multi-node, multi-GPU-enabled release 4.6 of GROMACS from Dr. Erik Lindahl, the project leader for this popular molecular dynamics package. GROMACS 4.6 allows you to run your models up to 3X faster compare to the latest state-of-the-art parallel AVX-accelerated CPU-code in GROMACS. Dr. Lindahl talks about the new features of the latest GROMACS 4.6 release as well as future plans. You will learn how to download the latest accelerated version of GROMACS and which features are GPU supported. Dr. Lindahl covers GROMACS performance on the very latest NVIDIA Kepler hardware and explain how to run GPU-accelerated MD simulations. You will also be invited to try GROMACS on K40 with a free test drive and experience all the new features and enhanced performance for yourself:

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Gromacs and Kepler GPUs

  1. 1. Webinar 20130404 Molecular Simulation with GROMACS on CUDA GPUs Erik Lindahl
  2. 2. GROMACS is used on a wide range of resources We’re comfortably on the single-μs scale today Larger machines often mean larger systems, not necessarily longer simulations
  3. 3. Why use GPUs? Throughput • Sampling • Free energy • Cost efficiency • Power efficiency • Desktop simulation • Upgrade old machines • Low-end clusters Performance • Longer simulations • Parallel GPU simulation • • using Infiniband High-end efficiency by using fewer nodes Reach timescales not possible with CPUs
  4. 4. Many GPU programs today Caveat emperor: It is much easier to get a reference problem/algorithm to scale i.e., you see much better relative scaling before introducing any optimization on the CPU side When comparing programs: What matters is absolute performance (ns/day), not the relative speedup!
  5. 5. Gromacs-4.5 with OpenMM Previous version - what was the limitation? Gromacs running entirely on CPU as a fancy interface Actual simulation running entirely on GPU using OpenMM kernels Only a few select algorithms worked Multi-CPU sometimes beat GPU performance...
  6. 6. Why don’t we use the CPU too? 0.5-1 TFLOP Random memory access OK (not great) Great for complex latency-sensitive stuff (domain decomposition, etc.) ~2 TFLOP Random memory access won’t work Great for throughput
  7. 7. Gromacs-4.6 next-generation GPU implementation: Programming model Domain decomposition dynamic load balancing 1 MPI rank 1 MPI rank 1 MPI rank 1 MPI rank CPU N OpenMP N OpenMP (PME) threads threads N OpenMP N OpenMP threads threads Load balancing GPU 1 GPU context 1 GPU context Load balancing 1 GPU context 1 GPU context
  8. 8. Heterogeneous CPU-GPU acceleration in GROMACS-4.6 Wallclock time for an MD step: ~0.5 ms if we want to simulate 1μs/day We cannot afford to lose all previous acceleration tricks!
  9. 9. CPU trick 1: all-bond constraints • •Δt limited by fast motions - 1fs • SHAKE (iterative, slow) - 2fs Remove bond vibrations • • • Problematic in parallel (won’t work) Compromise: constrain h-bonds only 1.4fs GROMACS (LINCS): • • • • • • LINear Constraint Solver Approximate matrix inversion expansion Fast & stable - much better than SHAKE Non-iterative Enables 2-3 fs timesteps Parallel: P-LINCS (from Gromacs 4.0) LINCS: t=2’ t=1 A) Move w/o constraint t=2’’ t=1 B) Project out motion along bonds t=2 t=1 C) Correct for rotational extension of bond
  10. 10. CPU trick 2: Virtual sites • • Next fastest motions is H-angle and rotations of CH3/NH2 groups Try to remove them: • • • • • Ideal H position from heavy atoms. CH3/NH2 groups are made rigid Calculate forces, then project back onto heavy atoms Integrate only heavy atom positions, reconstruct H’s Enables 5fs timesteps! θ 1-a 2 a a b a |d | 1-a 3 |b | 3fd 3fad |c | 3out 4fd Interactions Degrees of Freedom
  11. 11. dista actio this i intera it to tem. Fo tween 7 C’ 3 B’ 2 1D d C B decom A’ 4 6 A sions rc 5 1 0 allow rc 3 1 detai 0 8th-sphere comm most FIG. 3: The zones to communicate to the proces FIG. 2: The domain decomposition cells (1-7)for details. see the text that communinami cate coordinates to cell 0. Cell 2 is hidden below cell 7. The parts zones that need to be communicated to cell 0 are dashed, rc ensure that all bonded interaction between ch cut-o is the cut-o radius. can be assigned to a processor, it is su⌅cien balan that the charge groups within a sphere of ra muni present on at least one processor for every p ter of the sphere. In Fig. ?? this means we a balan communicate volumes B’ and C’. When no bo are calculated. in cel actions are present between charge groups, th are not communicated. For 2D decomposition bond Bonded interactions are distributed over the processors C’ are the only extra volumes that need to calcu by finding the smallest x, y and z coordinate of the charge pictures be For 3D domain decomposition the be tions groups involved and assigning the ainteraction to thebut the procedure i bit more complicated, pro- CPU trick 3: Non-rectangular cells & decomposition Load balancing works for arbitrary triclinic cells Lysozyme, 25k atoms All these “tricks” now work fine Rhombic dodecahedron (36k atoms in cubic cell) with GPUs in GROMACS-4.6! apart from more extensive book-keeping. All
  12. 12. From neighborlists to cluster pair lists in GROMACS-4.6 x,y grid z sort z bin x,y,z gridding Organize as tiles with Cluster pairlist all-vs-all interactions: X X X X X X X X X X X X X X X X
  13. 13. Tiling circles is difficult Need a lot of cubes to cover a sphere Interactions outside cutoff should be 0.0 Group cutoff • • Verlet cutoff GROMACS-4.6 calculates a “large enough” buffer zone so no interactions are missed Optimize nstlist for performance - no need to worry about missing any interactions with Verlet!
  14. 14. Tixel algorithm work-efficiency 8x8x8 tixels compared to a non performance-optimized Verlet scheme 0.36 rc=1.5, rl=1.6 0.58 0.82 Verlet Tixel Pruned Tixel non-pruned 0.29 rc=1.2, rl=1.3 0.52 0.75 0.21 rc=0.9, rl=1.0 0.42 0.73 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Highly memory-efficient algorithm: Can handle 20-40 million atoms with 2-3GB memory Even cheap consumer cards will get you a long way
  15. 15. PME weak scaling Xeon X5650 3T + C2075 / process 0.35 480 μs/step (1500 atoms) 1xC2075 CUDA F kernel 1xC2075 CPU total 2xC2075 CPU total 4xC2075 CPU total Iteration time per 1000 atoms (ms/step) 0.3 0.25 0.2 Text 0.15 700 μs/step (6000 atoms) 0.1 Complete time step including kernel, h2d, d2h, CPU constraints, CPU PME, CPU integration,OpenMP & MPI 0.05 0 1.5 3 6 12 24 48 96 192 System size/GPU (1000s of atoms) 384 768 1536 3072
  16. 16. Example performance: Systems with ~24,000 atoms, 2 fs time steps, NPT Amber: DHFR CPU, 96 CPU cores GPU, 1xGTX680 GPU, 4xGTX680 0 Gromacs: RNAse 100 ns/day 200 300 200 300 CPU, 6 cores CPU, 2*8 cores 6 CPU cores +1xK20c GPU 6 CPU cores +1xGTX680 GPU dodec+vsites(5fs), 6 CPU cores dodec+vsites(5fs), 2*8 CPU cores dodec+vsites(5fs), 6 cores + 1xK20c dodec+vsites(5fs), 6 cores + 1xGTX680 0 100
  17. 17. The Villin headpiece ~8,000 atoms, 5 fs steps explicit solvent triclinic box PME electrostatics i7 3930K (GMX 4.5) i7 3930K (GMX 4.6) i7 3930K+GTX680 E5-2690+GTX Titan 0 200 400 600 ns/day 800 1000 2,546 FPS (beat that, Battlefield 4) 1200
  18. 18. GLIC: Ion channel membrane protein 150,000 atoms Running on a simple desktop! i7 3930K (GMX4.5) i7 3930K (GMX4.6) i7 3930K+GTX680 E5-2690+GTX Titan 0 10 20 ns/day 30 40
  19. 19. Strong scaling of Reaction-Field and & Scaling of Reaction-fieldPME PME 1.5M atoms waterbox, RF cutoff=0.9nm, PME auto-tuned cutoff Performance (ns/day) 100 10 1 RF RF linear scaling PME PME linear scaling 0.1 1 10 100 #Processes-GPUs Challenge: GROMACS has very short iteration times hard requirements on latency/bandwidth Small systems often work best using only a single GPU!
  20. 20. GROMACS 4.6 extreme scaling Scaling to 130 atoms/core: ADH protein 134k atoms, PME, rc >= 0.9 1000 XK6/X2090 XK7/K20X XK6 CPU only XE6 CPU only ns/day 100 10 1 1 2 4 8 16 #sockets (CPU or CPU+GPU) 32 64
  21. 21. Using GROMACS with GPUs in practice
  22. 22. Compiling GROMACS with CUDA • • • • Make sure CUDA driver is installed Make sure CUDA SDK is in /usr/local/cuda Use the default GROMACS distribution Just run ‘cmake’ and we will detect CUDA automatically and use it • • gcc-4.7 works great as a compiler On Macs, you want to use icc (commercial) Longer Mac story: Clang does not support OpenMP, which gcc does. However, the current gcc versions for Macs do not support AVX on the CPU. icc supports both!
  23. 23. Using GPUs in practice In your mdp file: cutoff-scheme nstlist coulombtype vdw-type nstcalcenergy = = = = = Verlet 10 ; likely 10-50 pme ; or reaction-field cut-off -1 ; only when writing edr • Verlet cutoff-scheme is more accurate • Necessary for GPUs in GROMACS • Use -testverlet mdrun option to force it w. old tpr files • Slower on a single CPU, but scales well on CPUs too! Shift modifier is applied to both coulomb and VdW by default on GPUs - change with coulomb/vdw-modifier
  24. 24. Load balancing rcoulomb fourierspacing = 1.0 = 0.12 • If we increase/decrease the coulomb direct-space • • cutoff and the reciprocal space PME grid spacing by the same amount, we maintain accuracy ... but we move work between CPU & GPU! By default, GROMACS-4.6 does this automatically at the start of each run - you will see diagnostic output GROMACS excels when you combine a fairly fast CPU and GPU. Currently, this means Intel CPUs.
  25. 25. Demo
  26. 26. Acknowledgments • • • • GROMACS: Berk Hess, David v. der Spoel, Per Larsson, Mark Abraham Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov Multi-Threaded PME: Roland Shultz, Berk Hess Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!  
  27. 27. Test Drive K20 GPUs! Questions? Run GROMACS on Tesla K20 GPU today Devang Sachdev - NVIDIA @DevangSachdev Contact us Experience The Acceleration Sign up for FREE GPU Test Drive on remotely hosted clusters GROMACS questions Check   mailing  list Stream other webinars from GTC Express: page/gtc-express-webinar.html
  28. 28. Register for the Next GTC Express Webinar Molecular Shape Searching on GPUs Paul Hawkins, Applications Science Group Leader, OpenEye Wednesday, May 22, 2013, 9:00 AM PDT Register at