GPU Accelerated Computational Chemistry Applications


GPU computing accelerates several computational chemistry applications. Users need no code changes to run GPU-enabled applications such as AMBER, NAMD, GROMACS, or LAMMPS: they run their models exactly as they would without GPUs, and their simulations speed up from days to hours. For a full list of GPU-accelerated applications, see http://goo.gl/IKmYs
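To illustrate the drop-in workflow with AMBER, here is a minimal launcher sketch (Python used purely for illustration; it assumes AMBER 12's standard pmemd and pmemd.cuda binaries are on the PATH and that the conventional input files mdin, prmtop, and inpcrd are present):

    import subprocess

    # The CPU and GPU engines take identical inputs; only the binary differs.
    # File names are AMBER's conventional defaults, assumed to exist here.
    amber_args = ["-O", "-i", "mdin", "-o", "mdout", "-p", "prmtop", "-c", "inpcrd"]

    subprocess.run(["pmemd"] + amber_args, check=True)       # CPU-only run
    subprocess.run(["pmemd.cuda"] + amber_args, check=True)  # same model on the GPU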


Notes for slides:
  • Note the rise of GPU only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
  • Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it can assist their development on GPUs. These developers either have active GPU development projects or have already released applications.
  • Underlying data (ns/day by node configuration: CPU only, +K10, +K20, +K20X, +2x K10, +2x K20, +2x K20X; LAMMPS rows are speedups relative to CPU = 1):
    AMBER Cellulose: 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5
    AMBER Factor IX NVE: 3.42, 18.9, 22.4, 25.4, 29.2, 28.1, 31.4
    AMBER JAC NVE: 12.47, 68.6, 81.1, 89.1, 98, 95.6, 102.1
    AMBER TRPcage: 210, 420, 559, 585, 418, 451, 475
    NAMD ApoA1: 1.37, 3.45, 3.57, 4, 6.25, 6.67, 7.14
    NAMD F1-ATPase: 0.46, 0.96, 1.12, 1.25, 1.78, 2.04, 2.22
    NAMD STMV: 0.115, 0.29, 0.31, 0.35, 0.52, 0.56, 0.61
    LAMMPS fluid LJ: 1, 1.95, 1.11, 2.82, 4.38, 2.22, 3.62
    LAMMPS EAM: 1, 1.7, 2.47, 2.92, 3.3, 4.5, 5.5
    LAMMPS Rhodopsin: 1, 1.33, 0.77, 1.6, 2.35, 1.48, 2.28
    GROMACS RNase: 46.7, 109, 120
  • ns/day: Dual E5-2687W CPUs: 3.4; + M2090: 11.9; + K10: 18.9; + K20: 22.4; + K20X: 25.39
  • CPU vs. GPU ns/day: TRPcage 210 / 420; JAC NVE 12.47 / 68.6; Factor IX 3.42 / 18.9; Cellulose 0.74 / 3.73; Myoglobin 6.12 / 122.3; Nucleosome 0.1 / 2.4
  • CPU vs. GPU ns/day (SPFP, ECC off): TRPcage GB 210.32 / 559.32; JAC NVE PME 12.47 / 81.09; Factor IX NVE PME 3.42 / 22.44; Cellulose NVE PME 0.74 / 5.39; Myoglobin GB 6.12 / 156.45; Nucleosome GB 0.10 / 2.80
  • CPU vs. GPU ns/day: TRPcage 210 / 585; JAC NVE 12.47 / 89.13; Factor IX 3.42 / 25.4; Cellulose 0.74 / 6.14; Myoglobin 6.12 / 175.77; Nucleosome 0.1 / 3.13
  • Nodes (CPU ns/day / GPU ns/day): 1: 0.65 / 3.31; 2: 1.14 / 4.13; 4: 2.01 / 4.8
  • Energy (TDP in watts, seconds/ns, energy in kJ): 2x E5-2687W: 300, 6928, 2078; + K10: 535, 1259, 673; + K20: 535, 1065, 569; + K20X: 535, 969, 518
  • 1 CPU node (dual CPUs) = 12.47 ns/day; 1 CPU + GPU node (dual CPUs and GPUs) = 95.59 ns/day
  • Perf lab (ns/day; columns: no GPU, K10, K20, K20X, 2x K10, 2x K20, 2x K20X): Cellulose, 1 CPU: 0.37, 4.44, 5.4, 6.16, 6.37, 6.93, 7.67; Cellulose, 2 CPUs: 0.74, 4.34, 5.39, 6.14, 6.4, 6.78, 7.5
  • ns/day: Dual E5-2687W CPUs: 1.370; + M2090: 2.632; + K10: 3.448; + K20: 3.571; + K20X: 4.000
  • CPU vs. GPU ns/day (ECC off): ApoA1 1.370 / 3.571; F1-ATPase 0.461 / 1.124; STMV 0.116 / 0.314
  • All numbers are days/ns (columns: ApoA1, F1-ATPase, STMV): CPU only: 0.73, 2.17, 8.64; 1x K10: 0.29, 1.04, 3.5; 1x K20: 0.28, 0.89, 3.18; 1x K20X: 0.25, 0.8, 2.87; 2x K10: 0.16, 0.56, 1.93; 2x K20: 0.15, 0.49, 1.77; 2x K20X: 0.14, 0.45, 1.63
  • Nodes: 32, 64, 128, 256, 512, 640, 768
    s/step GPU XK6: 1.2414, 0.660887, 0.342743, 0.199465, 0.10837, 0.089752, 0.0774948
    s/step CPU XK6: 4.62633, 2.36707, 1.19722, 0.609124, 0.314745, 0.255016, 0.209511
    ns/day Fermi XK6: 0.069599, 0.13073339, 0.252084, 0.433159, 0.797269, 0.962655, 1.114913517
    ns/day CPU XK6: 0.018676, 0.03650082, 0.072167, 0.141843, 0.274508, 0.338802, 0.412388848
  • Config (TDP in watts, sec/ns, energy in J): 2x E5-2687W: 150, 63,072.0, 9,460,800.0; 2x E5-2687W + 2x K20: 600, 24,192.0, 14,515,200. TDP = Thermal Design Power
  • (ns/day, TDP in watts, energy in kJ): CPU: 0.115, 300, 223k; + K10s: 0.518, 770, 128k; + K20s: 0.565, 770, 117k; + K20Xs: 0.613, 770, 108k
  • Loop time (s) by configuration (CPU only, CPU + K10, CPU + 2x K10, 1x K20, 2x K20, 1x K20X, 2x K20X): 382.13, 225, 115.4, 154.6, 84.2, 130.5, 69.9
  • Config: loop time (s): 2x X5670 (HP Z800): 2717.63; 1x M2090 (2x X5570): 511.75; 2x M2090 (2x X5570): 274.97; 3x M2090 (2x X5570): 210.43; 4x M2090 (2x X5570): 148.88
  • Nodes: 300, 400, 500, 600, 700, 800, 900
    CPU-only time (s): 563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13
    CPU+GPU time (s): 159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96
    GPU speedup ratio: 3.55, 3.57, 3.52, 3.48, 3.65, 3.46, 3.45
  • Nodes (box size): atoms, CPU time, CPU+GPU time, GPU speedup:
    1 (1x1x1): 32768, 42.2, 6.33, 6.67x
    8 (2x2x2): 262144, 41.8, 6.73, 6.21x
    27 (3x3x3): 884736, 41.5, 6.86, 6.05x
    64 (4x4x4): 2097152, 41.5, 7.18, 5.78x
    125 (5x5x5): 4096000, 41.4, 7.18, 5.77x
    216 (6x6x6): 7077888, 42, 7.66, 5.48x
    343 (7x7x7): 11239424, 41.9, 8.34, 5.02x
    512 (8x8x8): 16777216, 42.3, 8.41, 5.03x
    729 (9x9x9): 23887872, 42.5, 8.92, 4.76x
  • (Power in W, time, energy spent): CPU: 300, 382, 114; CPU + 1x K20X: 535, 130, 69; CPU + 2x K20X: 770, 70, 54
  • ns/day: Single E5-2687W CPU: 4.35 (1.0x); Dual E5-2687W CPUs: 7.32 (1.7x); Single CPU + M2090: 7.33 (1.7x); Dual CPUs + M2090: 7.54 (1.7x); Single CPU + K10: 13.24 (3.0x); Dual CPUs + K10: 13.24 (3.0x); Single CPU + K20: 11.6 (2.7x); Dual CPUs + K20: 12.26 (2.8x); Single CPU + K20X: 11.99 (2.7x); Dual CPUs + K20X: 12.27 (2.8x)
  • Nodes (CPU only / GPU): 1: 2.26 / 8.36; 2: 3.58 / 13.01; 4: 6.7 / 21.68
  • Nodes (CPU / GPU): 8: 6.613 / 20.335; 16: 11.282 / 37.016; 32: 23.067 / 63.876; 64: 42.284 / 96.628; 128: 72.694 / 144.424
  • Nanoseconds/day: 8x X5550: 6.7; 2x M2090 + 2x X5550: 8.36. CPU node cost: 4 x 2 x $1000 = $8000; CPU + GPU node cost: 1 x 2 x $1000 + 2 x $2000 = $6000
  • GPU: 640 watts x 10,334 seconds/nanosecond = 6.6 megajoules; CPU: 760 watts x 12,895 seconds/nanosecond = 9.8 megajoules
  • (44 CPUs vs. 2 CPUs + 1 GPU vs. 2 CPUs + 2 GPUs): ns/day: 60, 25.3, 42.4; price: $60,000, $3,000, $4,000; scaled price: 1, 0.05, 0.067; perf/price: 1, 8.43, 10.6
  • (64 CPUs vs. 2 CPUs + 1 GPU vs. 2 CPUs + 2 GPUs): ns/day: 31, 8.9, 15.1; TDP (W): 6080, 428, 666; sec/ns: 2787.1, 9707.9, 5721.9; energy/ns (kJ): 16945.5, 4155.0, 3810.8
  • Test case not specified in perf lab run
  • I am here today to talk about the value of seamlessly adding GPUs to the computer you use to run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment yields a significant performance payback.
    What Quantum Espresso/PWscf is: a set of programs used to calculate the electron configuration of atoms or molecules; uses plane-wave basis sets and quantum mechanical principles; highly compute-intensive.
    Benefits of GPU-accelerated computing: faster than CPU-only systems in all tests; performance boost much larger than the marginal price increase; power consumption more than halved in all simulations; GPUs scale very well on clusters with dozens of nodes, and beyond.
    Price assumes FERMI workstation ~$4000 and C2050 $1000.
    Shilu 3-water on calcite: 6 OpenMP CPU nodes: 1025, 1560; 6 OpenMP CPU nodes + 1 GPU: 275, 480.
    FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers
  • (AUSURF k-point / AUSURF gamma): 6 OpenMP CPU nodes: 7100 s / 7000 s; 6 OpenMP CPU nodes + 1 GPU: 2350 s / 2000 s.
    FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers
  • CPU: Intel X5550, TDP of 95 W, priced at $350; GPU: NVIDIA M2070, TDP of 225 W, priced at $2000.
    PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere X5550 (6-core), 48 GB RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x)
  • National average 9.83 cents/kWh. (kWh/sim, tests/year, $/test, yearly energy bill): CPU: 42.37, 2357, $4.16, $9816; GPU/CPU: 23.21, 2400, $2.28, $5476.
    CPU: Intel X5550, TDP of 95 W, $350; GPU: NVIDIA M2070, TDP of 225 W, $2000. PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere X5550 (6-core), 48 GB RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x)
  • FERMI (ICHEC): assembled workstation. CPU: 2x Intel Xeon X5650 (6-core), 24 GB RAM; GPU: 2x C2050, GTX480, C2075; SW: CUDA 4.1, Intel compilers
  • Total # of cores: 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), 14 (112)
    Time (s) CPU: 31000, 16500, 11000, 9500, 7500, 6000, 5500
    Time (s) GPU+CPU: 12500, 7000, 5000, 4500, 3500, 3000, 2500
    Speedup: 2.48, 2.36, 2.2, 2.11, 2.14, 2, 2.2
    STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodes. CPU: 2x Intel (Nehalem EP) Xeon X5560, 48 GB RAM; GPU: 2x M2090; SW: CUDA 4.0, Intel compilers
  • # of cores: 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), 44 (528)
    Time CPU: 3925, 2650, 2525, 2450, 1740, 1290, 1337
    Time GPU+CPU: 2425, 1437, 1075, 900, 737, 675, 637
    Speedup: 1.62, 1.84, 2.35, 2.72, 2.36, 1.91, 2.10
    PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes. CPU: 2x Intel Westmere (6-core), 48 GB RAM; GPU: 2x M2070; SW: CUDA 4.0, Intel compilers, PGI (11.x)
  • I am here today to talk about the value of seamlessly adding GPUs to the computer you use to run TeraChem and achieving phenomenal performance improvements. This small incremental investment yields a significant performance payback. Benefits of GPU acceleration with TeraChem: compete with supercomputers; more powerful hardware; significantly lower energy usage.
  • Before we end this session I would like to tell you about the GPU Test Drive. It is an excellent resource for computational chemistry researchers to evaluate the benefits of GPU computing in speeding up simulations, and most importantly it is free. NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up, and you can also try code you have developed for GPUs and see how it scales on an 8-GPU cluster. All you need to do is sign up and log in; it is really that easy. Several partners are demonstrating the GPU Test Drive on the GTC show floor; please plan on visiting them. Sign-up forms have been handed out; if you are interested, please fill them out and return them to me.

    1. 1. Updated: February 4, 2013
    2. 2. Molecular Dynamics (MD) Applications (Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks)
        AMBER | PMEMD Explicit Solvent & GB Implicit Solvent | >100 ns/day JAC NVE on 2x K20s | Released: AMBER 12, GPU Support Revision 12.2; multi-GPU, multi-node | http://ambermd.org/gpus/benchmarks.htm#Benchmarks
        CHARMM | Implicit (5x) & Explicit (2x) solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released: Release C37b1; single & multi-GPU in single node | http://www.charmm.org/news/c37b1.html#postjump
        DL_POLY | Two-body Forces, Link-cell Pairs, Ewald SPME forces, Shake VV | 4x | Released: Release V 4.03, source only; multi-GPU, multi-node | Results published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
        GROMACS | Implicit (5x), Explicit (2x) | 165 ns/day DHFR on 4x C2075s | Released: Release 4.6, first multi-GPU support; multi-GPU, multi-node |
        LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released; multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
        NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day F1-ATPase on 1x K20X | Released: NAMD 2.9, 100M-atom capable; multi-GPU, multi-node |
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    3. 3. New/Additional MD Applications Ramping (Application | Features Supported | GPU Perf | Release Status | Notes)
        Abalone | Simulations | 4-29x (on 1060 GPU) | Released, Version 1.8.51; single GPU | Agile Molecule, Inc.
        Ascalaph | Computation of non-valent interactions | 4-29x (on 1060 GPU) | Released, Version 1.1.4; single GPU | Agile Molecule, Inc.
        ACEMD | Written for use only on GPUs | 150 ns/day DHFR on 1x K20 | Released; single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs
        Folding@Home | Powerful distributed-computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released; GPUs and CPUs | http://folding.stanford.edu; GPUs get 4x the points of CPUs
        GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released; NVIDIA GPUs only | http://www.gpugrid.net/
        HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0; single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
        HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.2; single and multi-GPU on 1 node; multi-GPU w/ MPI in March 2013 | http://codeblue.umich.edu/hoomd-blue/
        OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day; Explicit: 18-55 ns/day (DHFR) | Released, Version 4.1.1; multi-GPU | High-performance library and application for molecular dynamics
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    4. 4. Quantum Chemistry Applications (Application | Features Supported | GPU Perf | Release Status | Notes)
        Abinit | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization | 1.3-2.7x | Released, Version 7.0.5; multi-GPU support | www.abinit.org
        ACES III | Integrating scheduling GPU into SIAL programming language and SIP runtime environment | 10x on kernels | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
        ADF | Fock Matrix, Hessians | TBD | Under development, pilot project completed; multi-GPU support | www.scm.com
        BigDFT | DFT; Daubechies wavelets, part of Abinit | 5-25x (1 CPU core to GPU kernel) | Released June 2009, current release 1.6.0; multi-GPU support | http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
        Casino | TBD | TBD | Under development, Spring 2013 release; multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
        CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
        GAMESS-US | Libqc with Rys Quadrature algorithm; Hartree-Fock; MP2 and CCSD in Q4 2012 | 1.3-1.6x; 2.3-2.9x HF | Released; multi-GPU support; next release Q4 2012 | http://www.msg.ameslab.gov/gamess/index.html
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    5. 5. Quantum Chemistry Applications
        GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Release in 2012; multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963
        Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; multi-GPU support | Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm
        GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released; multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
        Jaguar | Investigating GPU acceleration | TBD | Under development; multi-GPU support | Schrodinger, Inc.; http://www.schrodinger.com/kb/278
        MOLCAS | CU_BLAS support | 1.1x | Released, Version 7.8; single GPU; additional GPU support coming in Version 8 | www.molcas.org
        MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPU | www.molpro.net; Hans-Joachim Werner
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    6. 6. Quantum Chemistry Applications
        MOPAC2009 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14x | Under development; single GPU | Academic port; http://openmopac.net
        NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10x projected | Development; release targeting March 2013; multiple GPUs | www.nwchem-sw.org; GPGPU benchmarks: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
        Octopus | DFT and TDDFT | TBD | Released | http://www.tddft.org/programs/octopus/
        PEtot | Density functional theory (DFT) plane-wave pseudopotential calculations | 6-10x | Released; multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials
        Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0 | http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    7. 7. Quantum Chemistry Applications
        QMCPACK | Main features | 3-4x | Released; multiple GPUs | NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
        Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0; multiple GPUs | Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
        TeraChem | "Full GPU-based solution"; completely redesigned to exploit GPU parallelism | 44-650x vs. GAMESS CPU version | Released, Version 1.5; multi-GPU/single node | YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
        VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2 GPUs comparable to 128 CPU cores | Available on request; multiple GPUs | By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf
        WL-LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; multi-GPU support | Electronic Structure Determination Workshop 2012, NICS: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
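The PWscf kernels called out above (dense linear algebra and 3D FFTs) are exactly the operations that map well onto GPUs. A minimal sketch of that offload pattern, using the CuPy library purely for illustration (Quantum Espresso's actual GPU port lives in its own compiled kernels, not in Python):

    import numpy as np
    import cupy as cp  # assumes a CUDA-capable GPU with CuPy installed

    # Host-side arrays standing in for the kind of data a plane-wave code handles.
    a = np.random.rand(2048, 2048)
    b = np.random.rand(2048, 2048)
    grid = np.random.rand(64, 64, 64)

    # Offload the two hot kernels: a dense matrix multiply and a 3D FFT.
    c_gpu = cp.matmul(cp.asarray(a), cp.asarray(b))
    g_gpu = cp.fft.fftn(cp.asarray(grid))

    # Copy results back to the host only when needed.
    c = cp.asnumpy(c_gpu)
    g = cp.asnumpy(g_gpu)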
    8. 8. Viz, "Docking" and Related Applications Growing (Application | Features Supported | GPU Perf | Release Status | Notes)
        Amira 5® | 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.3.3; single GPU | Visualization from Visage Imaging; next release, 5.4, will use the GPU for general-purpose processing in some functions; http://www.visageimaging.com/overview.html
        BINDSURF | High-throughput parallel blind virtual screening; allows fast processing of large ligand databases | 100x | Available upon request to authors; single GPU | http://www.biomedcentral.com/1471-2105/13/S14/S13
        BUDE | Empirical Free Energy Forcefield | 6.5-13.4x | Released; single GPU | University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
        Core Hopping | GPU-accelerated application | 3.75-5000x | Released, Suite 2011; single and multi-GPUs | Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/
        FastROCS | Real-time shape similarity searching/comparison | 800-3000x | Released; single and multi-GPUs | OpenEye Scientific Software; http://www.eyesopen.com/fastrocs
        PyMol | Lines: 460% increase; Cartoons: 1246%; Surface: 1746%; Spheres: 753%; Ribbon: 426% | 1700x | Released, Version 1.5; single GPU | http://pymol.org/
        VMD | High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular data | 100-125x or greater on kernels | Released, Version 1.9 | Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/
        GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.
    9. 9. Bioinformatics Applications (Application | Features Supported | GPU Speedup | Release Status | Website)
        BarraCUDA | Alignment of short sequencing reads | 6-10x | Version 0.6.2, 3/2012; multi-GPU, multi-node | http://seqbarracuda.sourceforge.net/
        CUDASW++ | Parallel search of Smith-Waterman database | 10-50x | Version 2.0.8, Q1/2012; multi-GPU, multi-node | http://sourceforge.net/projects/cudasw/
        CUSHAW | Parallel, accurate long-read aligner for large genomes | 10x | Version 1.0.40, 6/2012; multiple GPU | http://cushaw.sourceforge.net/
        GPU-BLAST | Protein alignment according to BLASTP | 3-4x | Version 2.2.26, 3/2012; single GPU | http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html
        GPU-HMMER | Parallel local and global search of Hidden Markov Models | 60-100x | Version 2.3.2, Q1/2012; multi-GPU, multi-node | http://www.mpihmmer.org/installguideGPUHMMER.htm
        mCUDA-MEME | Scalable motif discovery algorithm based on MEME | 4-10x | Version 3.0.12; multi-GPU, multi-node | https://sites.google.com/site/yongchaosoftware/mcuda-meme
        SeqNFind | Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly | 400x | Released; multi-GPU, multi-node | http://www.seqnfind.com/
        UGENE | Fast short-read alignment | 6-8x | Version 1.11, 5/2012; multi-GPU, multi-node | http://ugene.unipro.ru/
        GPU speedup compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.
    10. 10. MD Average Speedups. Performance relative to a CPU-only node for CPU + 1x or 2x K10, K20, and K20X configurations. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) plus 1 or 2 NVIDIA K10, K20, or K20X GPUs. Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases; error bars show the maximum and minimum speedup for each hardware configuration.
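The aggregation behind this chart is straightforward: per-test speedups relative to the CPU-only node are averaged per hardware configuration, and the error bars span the minimum and maximum. A sketch of that calculation for the CPU + 1x K20 configuration, using the AMBER and NAMD ns/day figures recoverable from the slide notes (the LAMMPS and GROMACS cases are omitted here for brevity):

    # ns/day on the CPU-only node and on a CPU + 1x K20 node (from slide notes).
    cpu = {"JAC": 12.47, "Factor IX": 3.42, "Cellulose": 0.74, "TRPcage": 210.0,
           "ApoA1": 1.37, "F1-ATPase": 0.46, "STMV": 0.115}
    k20 = {"JAC": 81.1, "Factor IX": 22.4, "Cellulose": 5.39, "TRPcage": 559.0,
           "ApoA1": 3.57, "F1-ATPase": 1.12, "STMV": 0.31}

    speedups = [k20[t] / cpu[t] for t in cpu]
    mean = sum(speedups) / len(speedups)
    print(f"CPU + K20: mean {mean:.1f}x, "
          f"min {min(speedups):.1f}x, max {max(speedups):.1f}x")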
    11. 11. Molecular Dynamics (MD) Applications (table repeated from slide 2)
    12. 12. New/Additional MD Applications Ramping (table repeated from slide 3)
    13. 13. Built from the Ground Up for GPUs: Computational Chemistry
        What: study disease and discover drugs; predict drug and protein interactions.
        Why: speed of simulations is critical; enables study of longer timeframes, larger systems, and more simulations.
        How: GPUs increase throughput and accelerate simulations.
        GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem.
        AMBER 11 application: 4.6x performance increase with 2 GPUs at only a 54% added cost. (AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s per node vs. 2x E5670 CPUs per node; cost of a CPU node assumed to be $9333, cost of adding two C2090s to a single node assumed to be $5333.)
    14. 14. AMBER 12 GPU Support Revision 12.2 (1/22/2013)
    15. 15. Kepler - Our Fastest Family of GPUs Yet. Factor IX, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus one NVIDIA M2090, K10, K20, or K20X. ns/day: CPU node 3.42; + M2090 11.85 (3.5x); + K10 18.90 (5.6x); + K20 22.44 (6.6x); + K20X 25.39 (7.4x). GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
    16. 16. K10 Accelerates Simulations of All Sizes. Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K10 GPU. Speedup vs. CPU only: TRPcage GB 2.00; JAC NVE PME 5.50; Factor IX NVE PME 5.53; Cellulose NVE PME 5.04; Myoglobin GB 19.98; Nucleosome GB 24.00. Gain 24x performance (Nucleosome) by adding just 1 GPU to a dual-CPU node.
    17. 17. K20 Accelerates Simulations of All Sizes. Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU. Speedup vs. CPU only: TRPcage GB 2.66; JAC NVE PME 6.50; Factor IX NVE PME 6.56; Cellulose NVE PME 7.28; Myoglobin GB 25.56; Nucleosome GB 28.00. Gain 28x throughput/performance (Nucleosome) by adding just one K20 GPU to a dual-CPU node. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    18. 18. K20X Accelerates Simulations of All Sizes. Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU. Speedup vs. CPU only: TRPcage GB 2.79; JAC NVE PME 7.15; Factor IX NVE PME 7.43; Cellulose NVE PME 8.30; Myoglobin GB 28.59; Nucleosome GB 31.30. Gain 31x performance (Nucleosome) by adding just one K20X GPU to a dual-CPU node. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    19. 19. K10 Strong Scaling over Nodes. Cellulose 408K atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes add 2x NVIDIA K10 GPUs. ns/day (CPU only / with GPUs): 1 node: 0.65 / 3.31 (5.1x); 2 nodes: 1.14 / 4.13 (3.6x); 4 nodes: 2.01 / 4.8 (2.4x). GPUs significantly outperform CPUs while scaling over multiple nodes.
    20. 20. Kepler - Universally Faster. Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs plus 1x NVIDIA K10, K20, or K20X GPU. Speedups vs. CPU only for JAC, Factor IX, and Cellulose: the Kepler GPUs accelerated all simulations, up to 8x.
    21. 21. K10 Extreme Performance. DHFR JAC 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K10 GPUs. ns/day: 12.47 (CPU node) vs. 97.99 (GPU node). Gain 7.8x performance by adding just 2 GPUs to a dual-CPU node.
    22. 22. K20 Extreme Performance. DHFR JAC 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 2x NVIDIA K20 GPUs. ns/day: 12.47 (CPU node) vs. 95.59 (GPU node). Gain more than 7.5x throughput/performance by adding just 2 K20 GPUs to a dual-CPU node. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    23. 23. Replace 8 Nodes with 1 K20 GPU. DHFR, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. Eight blue nodes, each with 2x Intel E5-2687W CPUs (8 cores per CPU): 65.00 ns/day at $32,000. One green node with 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU: 81.09 ns/day at $6,500. Typical CPU and GPU node pricing used; pricing may vary depending on node configuration (contact your preferred HW vendor for actual pricing). Cut down simulation costs to roughly a quarter and gain higher performance. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
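The comparison reduces to two ratios, speed and price; a quick sketch checking the slide's own figures:

    # From the slide: eight dual-CPU nodes vs. one dual-CPU node with 1x K20.
    cpu_ns_day, cpu_cost = 65.00, 32000   # 8-node CPU cluster, DHFR
    gpu_ns_day, gpu_cost = 81.09, 6500    # single K20-equipped node

    print(f"speed: {gpu_ns_day / cpu_ns_day:.2f}x")            # about 1.25x faster
    print(f"price: {gpu_cost / cpu_cost:.1%} of the cluster")  # about 20%
    perf_per_dollar = (gpu_ns_day / gpu_cost) / (cpu_ns_day / cpu_cost)
    print(f"throughput per dollar: {perf_per_dollar:.1f}x")    # about 6x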
    24. 24. Replace 7 Nodes with 1 K10 GPU. Performance on JAC NVE vs. cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. Eight blue nodes, each with 2x Intel E5-2687W CPUs (8 cores per CPU): $32,000. One green node with 2x Intel E5-2687W CPUs plus 1x NVIDIA K10 GPU: $7,000. Typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Cut down simulation costs to roughly a quarter and increase performance by 70%. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    25. 25. Extra CPUs Decrease Performance. Cellulose NVE, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores); the blue bars use dual E5-2687W CPUs; both configurations use dual K20 GPUs. When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
    26. 26. Kepler - Greener Science. Energy used in simulating 1 ns of DHFR JAC, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA K10, K20, or K20X GPU (235 W each). Energy Expended = Power x Time (lower is better). The GPU-accelerated systems use 65-75% less energy.
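The bars follow directly from the stated formula, Energy Expended = Power x Time. A sketch reproducing the DHFR JAC figures given in the slide notes (TDP in watts, seconds per simulated nanosecond):

    # (TDP in watts, seconds per simulated ns) for DHFR JAC, from the slide notes.
    configs = {
        "2x E5-2687W (CPU only)": (300, 6928),
        "CPU + K10": (535, 1259),
        "CPU + K20": (535, 1065),
        "CPU + K20X": (535, 969),
    }

    for name, (watts, sec_per_ns) in configs.items():
        kj = watts * sec_per_ns / 1000  # joules -> kilojoules
        print(f"{name}: {kj:.0f} kJ per simulated ns")
    # CPU only: ~2078 kJ; K20X node: ~518 kJ, i.e. roughly 75% less energy.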
    27. 27. Recommended GPU Node Configuration for AMBER Computational Chemistry (workstation or single-node configuration):
        # of CPU sockets: 2
        Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
        CPU speed (GHz): 2.66+
        System memory per node (GB): 16
        GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
        # of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good for 4 fast serial GPU runs)
        GPU memory preference (GB): 6
        GPU-to-CPU connection: PCIe 2.0 16x or higher
        Server storage: 2 TB
        Scale to multiple nodes with the same single-node configuration. (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    28. 28. Benefits of GPU-Accelerated AMBER Computing:
        Faster than CPU-only systems in all tests
        Most major compute-intensive aspects of classical MD ported
        Large performance boost with marginal price increase
        Energy usage cut by more than half
        GPUs scale well within a node and over multiple nodes
        The K20 GPU is our fastest and lowest-power high-performance GPU yet
        Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive
        (AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    29. 29. NAMD 2.9
    30. 30. Kepler - Our Fastest Family of GPUs Yet. ApoA1 (Apolipoprotein A1), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add one NVIDIA M2090, K10, K20, or K20X. ns/day: CPU node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x). GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    31. 31. Accelerates Simulations of All Sizes. Running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU. Speedup vs. CPU only: ApoA1 2.6; F1-ATPase 2.4; STMV 2.7. Gain about 2.5x throughput/performance by adding just 1 GPU to a dual-CPU node. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    32. 32. Kepler - Universally Faster. Running NAMD version 2.9. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs. Average speedups across F1-ATPase, ApoA1, and STMV: 1x K10 2.4x; 1x K20 2.6x; 1x K20X 2.9x; 2x K10 4.3x; 2x K20 4.7x; 2x K20X 5.1x. The Kepler GPUs accelerate all simulations, up to 5x.
    33. 33. Outstanding Strong Scaling with Multi-STMV. 100-STMV (a concatenation of 100 Satellite Tobacco Mosaic Virus systems) on hundreds of nodes, running NAMD version 2.9. Each blue XE6 CPU node contains 1x AMD 1600 Opteron (16 cores per CPU); each green XK6 CPU+GPU node contains 1x AMD 1600 Opteron plus 1x NVIDIA X2090 GPU. From 32 to 768 nodes, the GPU nodes deliver 2.7x-3.8x the CPU-only performance. Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.
    34. 34. Replace 3 Nodes with 1 M2090 GPU. F1-ATPase, running NAMD version 2.9. Four blue nodes, each with 2x Intel Xeon X5550 CPUs (4 cores per CPU): 0.63 ns/day at $8,000. One green node with 2x Intel Xeon X5550 CPUs plus 1x NVIDIA M2090 GPU: 0.74 ns/day at $4,000. Typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Speedup of 1.2x for 50% of the cost.
    35. 35. K20 - Greener: Twice the Science per Watt. Energy used in simulating 1 ns of ApoA1, running NAMD version 2.9. Each blue node contains dual E5-2687W CPUs (95 W, 4 cores per CPU); each green node contains 2x Intel Xeon X5550 CPUs (95 W, 4 cores per CPU) plus 2x NVIDIA K20 GPUs (225 W per GPU). Energy Expended = Power x Time (lower is better). Cut down energy usage by half with GPUs. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    36. 36. Kepler - Greener: Twice the Science per Joule. Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes add 2x NVIDIA K10, K20, or K20X GPUs (235 W each). Energy Expended = Power x Time (lower is better). Cut down energy usage by half with GPUs.
    37. 37. Recommended GPU Node Configuration for NAMD Computational Chemistry (workstation or single-node configuration):
        # of CPU sockets: 2
        Cores per CPU socket: 6+
        CPU speed (GHz): 2.66+
        System memory per socket (GB): 32
        GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
        # of GPUs per CPU socket: 1-2
        GPU memory preference (GB): 6
        GPU-to-CPU connection: PCIe 2.0 or higher
        Server storage: 500 GB or higher
        Network configuration: Gemini, InfiniBand
        Scale to multiple nodes with the same single-node configuration. (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    38. 38. Summary/Conclusions. Benefits of GPU-accelerated computing:
        Faster than CPU-only systems in all tests
        Large performance boost with small marginal price increase
        Energy usage cut in half
        GPUs scale very well within a node and over multiple nodes
        The Tesla K20 GPU is our fastest and lowest-power high-performance GPU to date
        Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive
        (NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012)
    39. 39. LAMMPS, Jan. 2013 or later
    40. 40. More Science for Your Money. Embedded Atom Model (EAM). The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235 W). Speedup vs. CPU only: + 1x K10 1.7; + 1x K20 2.47; + 1x K20X 2.92; + 2x K10 3.3; + 2x K20 4.5; + 2x K20X 5.5. Experience performance increases of up to 5.5x with Kepler GPU nodes.
    41. 41. K20X, the Fastest GPU Yet. The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU); the green nodes add 2x NVIDIA M2090s, 1x K20X, or 2x K20X GPUs (235 W). Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.
    42. 42. Get a CPU Rebate to Fund Part of Your GPU Budget. Acceleration in loop-time computation by additional GPUs, running NAMD version 2.9. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs. Speedup normalized to CPU only: + 1x M2090 5.31; + 2x M2090 9.88; + 3x M2090 12.9; + 4x M2090 18.2. Increase performance 18x when compared to CPU-only nodes. Cheaper CPUs used with GPUs still give faster overall performance than more expensive CPUs.
    43. 43. Excellent Strong Scaling on Large Clusters. LAMMPS Gay-Berne, 134M atoms: loop time (seconds) from 300 to 900 nodes for GPU-accelerated XK6 vs. CPU-only XE6. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090. From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained about 3.5x the performance of the XE6 CPU nodes.
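The sustained 3.5x figure is simply the ratio of XE6 to XK6 loop times at each node count; a sketch using the loop times from the slide notes:

    nodes    = [300, 400, 500, 600, 700, 800, 900]
    xe6_time = [563.96, 423.83, 339.62, 281.58, 260.98, 220.83, 203.13]  # CPU only, s
    xk6_time = [159.06, 118.62, 96.44, 81.03, 71.57, 63.76, 58.96]       # CPU + GPU, s

    for n, cpu_s, gpu_s in zip(nodes, xe6_time, xk6_time):
        print(f"{n} nodes: {cpu_s / gpu_s:.2f}x")  # stays near 3.5x across the range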
    44. 44. GPUs Sustain 5x Performance for Weak Scaling. Weak scaling with 32K atoms per node, from 1 to 729 nodes. Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone. Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.
    45. 45. Faster, Greener - Worth It! Energy consumed in one loop of EAM (lower is better). GPU-accelerated computing uses 53% less energy than CPU only. Energy Expended = Power x Time; power calculated by combining the components' TDPs. The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU) and CUDA 4.2.9; the green nodes have 2x E5-2687W CPUs and 1 or 2 NVIDIA K20X GPUs (235 W) running CUDA 5.0.36. Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive
    46. 46. Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer. W. Michael Brown, National Center for Computational Sciences, Oak Ridge National Laboratory. NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012.
    47. 47. Early Kepler Benchmarks on Titan. Two strong-scaling panels of loop time (seconds) vs. node count, comparing XK7+GPU, XK6+GPU, and CPU-only XK6: an atomic fluid benchmark (1 to 16,384 nodes) and bulk copper (1 to 128 nodes).
