GPU Accelerated Computational Chemistry Applications

Updated: February 4, 2013

Molecular Dynamics (MD) Applications

AMBER
  Features supported: PMEMD explicit solvent and GB implicit solvent
  GPU perf: > 100 ns/day, JAC NVE on 2x K20s
  Release status: Released; multi-GPU, multi-node
  Notes/benchmarks: AMBER 12, GPU Support Revision 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
  Features supported: Implicit (5x) and explicit (2x) solvent via OpenMM
  GPU perf: 2x C2070 equals 32-35x X5667 CPUs
  Release status: Released; single and multi-GPU in a single node
  Notes/benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
  Features supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU perf: 4x
  Release status: Release V 4.03; source only, results published
  Notes/benchmarks: http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
  Features supported: Implicit (5x), explicit (2x)
  GPU perf: 165 ns/day, DHFR on 4x C2075s
  Release status: Released
  Notes/benchmarks: Release 4.6; first multi-GPU support

LAMMPS
  Features supported: Lennard-Jones, Gay-Berne, Tersoff and many more potentials
  GPU perf: 3.5-18x on Titan
  Release status: Released
  Notes/benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
  Features supported: Full electrostatics with PME and most simulation features
  GPU perf: 4.0 ns/day, F1-ATPase on 1x K20X
  Release status: Released; 100M atom capable; multi-GPU, multi-node
  Notes/benchmarks: NAMD 2.9

GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

New/Additional MD Applications Ramping

Abalone
  Features supported: Simulations
  GPU perf: 4-29X (on 1060 GPU)
  Release status: Released, Version 1.8.51; single GPU
  Notes: Agile Molecule, Inc.

Ascalaph
  Features supported: Computation of non-valent interactions
  GPU perf: 4-29X (on 1060 GPU)
  Release status: Released, Version 1.1.4; single GPU
  Notes: Agile Molecule, Inc.

ACEMD
  Features supported: Written for use only on GPUs
  GPU perf: 150 ns/day, DHFR on 1x K20
  Release status: Released; single and multi-GPUs
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home
  Features supported: Powerful distributed computing molecular dynamics system; implicit solvent and folding
  GPU perf: Depends upon number of GPUs
  Release status: Released; GPUs and CPUs
  Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs

GPUGrid.net
  Features supported: High-performance all-atom biomolecular simulations; explicit solvent and binding
  GPU perf: Depends upon number of GPUs
  Release status: Released; NVIDIA GPUs only
  Notes: http://www.gpugrid.net/

HALMD
  Features supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
  GPU perf: Up to 66x on 2090 vs. 1 CPU core
  Release status: Released, Version 0.2.0; single GPU
  Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
  Features supported: Written for use only on GPUs
  GPU perf: Kepler 2X faster than Fermi
  Release status: Released, Version 0.11.2; single and multi-GPU on 1 node
  Notes: http://codeblue.umich.edu/hoomd-blue/; multi-GPU w/ MPI in March 2013

OpenMM
  Features supported: Implicit and explicit solvent, custom forces
  GPU perf: Implicit: 127-213 ns/day; explicit: 18-55 ns/day (DHFR)
  Release status: Released, Version 4.1.1; multi-GPU
  Notes: Library and application for molecular dynamics on high-performance [...]

Quantum Chemistry Applications

Abinit
  Features supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization / orthogonalization
  GPU perf: 1.3-2.7X
  Release status: Released, Version 7.0.5; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features supported: Integrating scheduling GPU into SIAL programming language and SIP runtime environment
  GPU perf: 10X on kernels
  Release status: Under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features supported: Fock matrix, Hessians
  GPU perf: TBD
  Release status: Pilot project completed, under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features supported: DFT; Daubechies wavelets, part of Abinit
  GPU perf: 5-25X (1 CPU core to GPU kernel)
  Release status: Released June 2009, current release 1.6.0; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-[...]2012/BigDFT-HPC-tues.pdf

Casino
  Features supported: TBD
  GPU perf: TBD
  Release status: Under development, Spring 2013 release; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
  Features supported: DBCSR (sparse matrix multiply library)
  GPU perf: 2-7X
  Release status: Under development; multi-GPU support
  Notes: [...]content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
  Features supported: Libqc with Rys Quadrature Algorithm, Hartree-Fock, MP2 and CCSD in Q4 2012
  GPU perf: 1.3-1.6X, 2.3-2.9x HF
  Release status: Released; multi-GPU support
  Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html

GAMESS-UK
  Features supported: (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory. Supports organics & inorganics.
  GPU perf: 8x
  Release status: Release in 2012; multi-GPU support
  Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
  Features supported: Joint PGI, NVIDIA & Gaussian collaboration
  GPU perf: TBD
  Release status: Under development; multi-GPU support
  Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
  Features supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (rmm-diis)
  GPU perf: 8x
  Release status: Released; multi-GPU support
  Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC)

Jaguar
  Features supported: Investigating GPU acceleration
  GPU perf: TBD
  Release status: Under development; multi-GPU support
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
  Features supported: CU_BLAS support
  GPU perf: 1.1x
  Release status: Released, Version 7.8; single GPU. Additional GPU support coming in Version 8
  Notes: www.molcas.org

MOLPRO
  Features supported: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
  GPU perf: 1.7-2.3X projected
  Release status: Under development; multiple GPU
  Notes: www.molpro.net; Hans-Joachim Werner

MOPAC2009
  Features supported: Pseudodiagonalization, full diagonalization, and density matrix assembling
  GPU perf: 3.8-14X
  Release status: Under development; single GPU
  Notes: Academic port. http://openmopac.net

NWChem
  Features supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU perf: 3-10X projected
  Release status: Release targeting March 2013; multiple GPUs
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features supported: DFT and TDDFT
  GPU perf: TBD
  Release status: Released
  Notes: http://www.tddft.org/programs/octopus/

PEtot
  Features supported: Density functional theory (DFT) plane wave pseudopotential calculations
  GPU perf: 6-10X
  Release status: Released; multi-GPU
  Notes: First principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features supported: RI-MP2
  GPU perf: 8x-14x
  Release status: Released, Version 4.0
  Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

QMCPACK
  Features supported: Main features
  GPU perf: 3-4x
  Release status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU perf: 2.5-3.5x
  Release status: Released, Version 5.0; multiple GPUs
  Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features supported: “Full GPU-based solution”
  GPU perf: 44-650X vs. GAMESS CPU version
  Release status: Released, Version 1.5; multi-GPU/single node
  Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and [...]content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features supported: Hybrid Hartree-Fock DFT functionals including exact exchange
  GPU perf: 2x; 2 GPUs comparable to 128 CPU cores
  Release status: Available on request; multiple GPUs
  Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
  Features supported: Generalized Wang-Landau method
  GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
  Release status: Under development; multi-GPU support
  Notes: NICS Electronic Structure Determination Workshop 2012: [...]2012/Eisenbach_OakRidge_February.pdf

GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Viz, "Docking" and Related Applications Growing

Amira 5®
  Features supported: 3D visualization of volumetric data and surfaces
  GPU perf: 70x
  Release status: Released, Version 5.3.3; single GPU
  Notes: Visualization from Visage Imaging. Next release, 5.4, will use GPU for general purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
  Features supported: Allows fast processing of large ligand databases
  GPU perf: 100X
  Release status: Available upon request to authors; single GPU
  Notes: High-throughput parallel blind virtual screening; http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features supported: Empirical Free Energy Forcefield
  GPU perf: 6.5-13.4X
  Release status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features supported: GPU accelerated application
  GPU perf: 3.75-5000X
  Release status: Released, Suite 2011; single and multi-GPUs
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features supported: Real-time shape similarity searching/comparison
  GPU perf: 800-3000X
  Release status: Released; single and multi-GPUs
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMol
  Features supported: Lines: 460% increase; Cartoons: 1246% increase; Surface: 1746% increase; Spheres: 753% increase; Ribbon: 426% increase
  GPU perf: 1700x
  Release status: Single GPUs
  Notes: http://pymol.org/

VMD
  Features supported: High quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular [...]
  GPU perf: 100-125X or greater on kernels
  Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Bioinformatics Applications

BarraCUDA
  Features supported: Alignment of short sequencing reads
  GPU speedup: 6-10x
  Release status: Version 0.6.2 – 3/2012
  Website: http://seqbarracuda.sourceforge.net/

CUDASW++
  Features supported: Parallel search of Smith-Waterman database
  GPU speedup: 10-50x
  Release status: Version 2.0.8 – Q1/2012
  Website: http://sourceforge.net/projects/cudasw/

CUSHAW
  Features supported: Parallel, accurate long read aligner for large genomes
  GPU speedup: 10x
  Release status: Version 1.0.40 – 6/2012; multiple-GPU
  Website: http://cushaw.sourceforge.net/

GPU-BLAST
  Features supported: Protein alignment according to BLASTP
  GPU speedup: 3-4x
  Release status: Version 2.2.26 – 3/2012; single GPU
  Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
  Features supported: Parallel local and global search of Hidden Markov Models
  GPU speedup: 60-100x
  Release status: Version 2.3.2 – Q1/2012; multi-GPU, multi-node
  Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
  Features supported: Scalable motif discovery algorithm based on MEME
  GPU speedup: 4-10x
  Release status: Version 3.0.12; multi-GPU, multi-node
  Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
  Features supported: Hardware and software for reference assembly, blast, SW, HMM, de novo assembly
  GPU speedup: 400x
  Release status: Released
  Website: http://www.seqnfind.com/

UGENE
  Features supported: Fast short read alignment
  GPU speedup: 6-8x
  Release status: Version 1.11 – 5/2012
  Website: http://ugene.unipro.ru/

Parallel linear regression on [...]

GPU perf is compared against the same or similar code running on a single CPU machine. Performance measured internally or independently.

MD Average Speedups
[Bar chart: performance relative to CPU only for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20 and CPU + 2x K20X]
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
Error bars show the maximum and minimum speedup for each hardware configuration.

Built from Ground Up for GPUs
Computational Chemistry

What: Study disease & discover drugs; predict drug and protein interactions
Why: Speed of simulations is critical. Enables study of longer timeframes, larger systems and more simulations
How: GPUs increase throughput & accelerate simulations

GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem

AMBER 11 Application
4.6x performance increase with 2 GPUs with only a 54% added cost*
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
• Cost of CPU node assumed to be $9333. Cost of adding two (2) 2090s to single node is assumed to be $5333

AMBER 12
GPU Support Revision 12.2
1/22/2013


Kepler - Our Fastest Family of GPUs Yet
Factor IX, running AMBER 12 GPU Support Revision 12.1
[Bar chart: nanoseconds/day on Factor IX, with speedup over the CPU-only node]
  1 CPU Node: 3.42
  1 CPU Node + M2090: 11.85 (3.5x)
  1 CPU Node + K10: 18.90 (5.6x)
  1 CPU Node + K20: 22.44 (6.6x)
  1 CPU Node + K20X: 25.39 (7.4x)
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU.

GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X)
when compared to a CPU only node
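
The speedup labels on this slide can be reproduced from the throughput figures: each one is the accelerated node's ns/day divided by the CPU-only node's ns/day. A minimal Python sketch follows, assuming the bar values pair with the hardware labels in the order they appear on the slide; the function and variable names are illustrative and do not come from the benchmark report.

```python
# Factor IX throughput in ns/day as read from the slide. Pairing each value
# with a hardware label is an assumption based on their order on the slide.
FACTOR_IX_NS_PER_DAY = {
    "CPU only": 3.42,
    "CPU + M2090": 11.85,
    "CPU + K10": 18.90,
    "CPU + K20": 22.44,
    "CPU + K20X": 25.39,
}


def speedup(ns_per_day: float, baseline_ns_per_day: float) -> float:
    """Return throughput relative to the baseline (higher is faster)."""
    if baseline_ns_per_day <= 0:
        raise ValueError("baseline throughput must be positive")
    return ns_per_day / baseline_ns_per_day


def speedups_vs_cpu(results: dict[str, float], baseline: str = "CPU only") -> dict[str, float]:
    """Speedup of every configuration over the named baseline entry."""
    base = results[baseline]
    return {name: speedup(value, base) for name, value in results.items()}


if __name__ == "__main__":
    for name, factor in speedups_vs_cpu(FACTOR_IX_NS_PER_DAY).items():
        print(f"{name}: {factor:.2f}x")
```

With these inputs the M2090 and K20X ratios round to the 3.5x and 7.4x quoted on the slide.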

K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1
[Bar chart: speedup compared to CPU only]
  CPU (all molecules): baseline
  TRPcage (GB): 2.00
  JAC NVE (PME): 5.50
  Factor IX NVE (PME): 5.53
  Cellulose NVE (PME): 5.04
  Myoglobin (GB): 19.98
  Nucleosome (GB): 24.00
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10 GPU.

Gain 24x performance on Nucleosome by adding just 1 GPU when compared to dual CPU performance

K20 Accelerates Simulations of All Sizes
SPFP with CUDA 4.2.9, ECC off
[Bar chart: performance relative to the CPU-only node]
  CPU (all molecules): 1.00
  TRPcage (GB): 2.66
  JAC NVE (PME): 6.50
  Factor IX NVE (PME): 6.56
  Cellulose NVE (PME): 7.28
  Myoglobin (GB): 25.56
  Nucleosome (GB): 28.00
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.

Gain 28x throughput/performance on Nucleosome by adding just one K20 GPU

AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012

K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1
[Bar chart: speedup compared to CPU only]
  CPU (all molecules): baseline
  TRPcage (GB): 2.79
  JAC NVE (PME): 7.15
  Factor IX NVE (PME): 7.43
  Cellulose NVE (PME): 8.30
  Myoglobin (GB): 28.59
  Nucleosome (GB): 31.30
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K20X GPU.

Gain 31x performance on Nucleosome by adding just one K20X GPU


K10 Strong Scaling over Nodes
Cellulose 408K Atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off
[Chart: nanoseconds/day on Cellulose vs. number of nodes (1, 2, 4), CPU only vs. with GPU; GPU speedup labels: 5.1x, 3.6x, 2.4x]
The blue nodes contain 2x Intel X5670 CPUs (6 Cores per CPU).
The green nodes contain 2x Intel X5670 CPUs (6 Cores per CPU) plus 2x NVIDIA K10 GPUs.

GPUs significantly outperform CPUs while scaling over multiple nodes

Kepler – Universally Faster
[Bar chart: speedups compared to CPU only for JAC, Factor IX and Cellulose on CPU Only, CPU + K10, CPU + K20 and CPU + K20X]
The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs.

The Kepler GPUs accelerated all simulations, up to 8x

K10 Extreme Performance
DHFR JAC 23K Atoms (NVE)
[Bar chart: nanoseconds/day on DHFR]
  1 Node (CPU only): 12.47
  1 Node + 2x K10: 97.99
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green node contains Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10 GPUs.

Gain 7.8X performance by adding just 2 GPUs

K20 Extreme Performance
DHFR JAC 23K Atoms (NVE), running AMBER 12 GPU Support Revision 12.1
SPFP with CUDA 4.2.9, ECC off
[Bar chart: nanoseconds/day on DHFR]
  1 Node (CPU only): 12.47
  1 Node + 2x K20: 95.59
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU).
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPUs.

Gain > 7.5X throughput/performance by adding just 2 K20 GPUs


Replace 8 Nodes with 1 K20 GPU
SPFP with CUDA 4.2.9, ECC off
[Bar chart: DHFR nanoseconds/day and cost]
  8 CPU nodes: 65.00 ns/day, $32,000.00
  1 node + 1x K20: 81.09 ns/day, $6,500.00
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU).
The green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and gain higher performance


Replace 7 Nodes with 1 K10 GPU
Performance on JAC NVE and cost, running AMBER 12 GPU Support Revision 12.1
SPFP with CUDA 4.2.9, ECC off
[Bar charts: DHFR nanoseconds/day and cost, CPU Only vs. GPU Enabled]
  Cost, CPU Only: $32,000
  Cost, GPU Enabled: $7,000
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU).
The green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K10 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Cut down simulation costs to ¼ and increase performance by 70%


Extra CPUs decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1
[Bar chart: nanoseconds/day on Cellulose for CPU Only and CPU with dual K20s, comparing 1 E5-2687W with 2 E5-2687W (1 CPU + 2 GPUs vs. 2 CPUs + 2 GPUs)]
The orange bars contain one E5-2687W CPU (8 Cores per CPU).
The blue bars contain Dual E5-2687W CPUs (8 Cores per CPU).

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.

Kepler - Greener Science
Energy used in simulating 1 ns of DHFR JAC
[Bar chart: energy expended (kJ), lower is better, for CPU Only, CPU + K10, CPU + K20 and CPU + K20X]
Energy Expended = Power x Time
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs (235W each).

The GPU Accelerated systems use 65-75% less energy
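
The energy figures follow from the slide's relation Energy Expended = Power x Time. The rough Python sketch below assumes node power is the sum of the component TDPs quoted on the slide, and takes the CPU-only DHFR JAC throughput of 12.47 ns/day from the "Extreme Performance" slides. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def node_power_watts(cpu_tdp_w: float, n_cpus: int,
                     gpu_tdp_w: float = 0.0, n_gpus: int = 0) -> float:
    """Approximate node power as the sum of component TDPs. This is an
    assumption: real draw depends on load and on parts not counted here."""
    return cpu_tdp_w * n_cpus + gpu_tdp_w * n_gpus


def energy_kj_per_ns(power_w: float, ns_per_day: float) -> float:
    """Energy (kJ) to simulate 1 ns: power times wall-clock time per ns."""
    if ns_per_day <= 0:
        raise ValueError("throughput must be positive")
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return power_w * seconds_per_ns / 1000.0


def energy_saving(baseline_kj: float, accelerated_kj: float) -> float:
    """Fraction of energy saved relative to the baseline run."""
    return 1.0 - accelerated_kj / baseline_kj


if __name__ == "__main__":
    cpu_power = node_power_watts(cpu_tdp_w=150, n_cpus=2)  # dual E5-2687W
    cpu_kj = energy_kj_per_ns(cpu_power, ns_per_day=12.47)
    print(f"CPU only: {cpu_kj:.0f} kJ per ns of DHFR JAC")
```

A GPU node's figure follows the same way: add 235 W per GPU to the node power and substitute that node's measured ns/day.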

Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or Single Node Configuration
  # of CPU sockets: 2
  Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
  CPU speed (GHz): 2.66+
  System memory per node (GB): 16
  GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good to do 4 fast serial GPU runs)
  GPU memory preference (GB): 6
  GPU to CPU connection: PCIe 2.0 16x or higher
  Server storage: 2 TB
Scale to multiple nodes with same single node configuration

Benefits of GPU AMBER Accelerated Computing
Faster than CPU only systems in all tests

Most major compute intensive aspects of classical MD ported

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and over multiple nodes

K20 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive

Kepler - Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1), running NAMD version 2.9
[Bar chart: nanoseconds/day on ApoA1, with speedup over the CPU-only node]
  1 CPU Node: 1.37
  1 CPU Node + M2090: 2.63 (1.9x)
  1 CPU Node + K10: 3.45 (2.5x)
  1 CPU Node + K20: 3.57 (2.6x)
  1 CPU Node + K20X: 4.00 (2.9x)
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU.

GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X)
when compared to a CPU only node
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012

Accelerates Simulations of All Sizes
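Running NAMD 2.9 with CUDA 4.0, ECC off
[Bar chart: performance relative to the CPU-only node (all molecules) for ApoA1 (Apolipoprotein A1), F1-ATPase and STMV; bar labels: 2.4, 2.6, 2.7]
The CPU-only node's CPUs have 8 cores each. Each green node contains CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU.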

Gain 2.5x throughput/performance by adding just 1 GPU


Kepler – Universally Faster
Running NAMD version 2.9
[Bar chart: speedup over CPU Only for F1-ATPase, ApoA1 and STMV on 1x K10, 1x K20, 1x K20X, 2x K10, 2x K20 and 2x K20X; average acceleration labels: 2.4x, 2.6x, 2.9x, 4.3x, 4.7x, 5.1x]
The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.

The Kepler GPUs accelerate all simulations, up to 5x
Average acceleration printed in bars

Outstanding Strong Scaling with Multi-STMV
100 STMV on Hundreds of Nodes (concatenation of 100 Satellite Tobacco Mosaic Virus systems)
[Chart: nanoseconds/day vs. # of nodes (32, 64, 128, 256, 512, 640, 768) for Fermi XK6 and CPU XK6; speedup labels: 3.8x, 3.6x, 2.9x, 2.7x]
Each blue XE6 CPU node contains 1x AMD 1600 Opteron (16 Cores per CPU).
Each green XK6 CPU+GPU node contains 1x AMD 1600 Opteron (16 Cores per CPU) and an additional 1x NVIDIA X2090 GPU.

Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers

Replace 3 Nodes with 1 2090 GPU
F1-ATPase
[Bar chart: nanoseconds/day and cost]
  4 CPU Nodes: 0.63 ns/day, $8,000
  1 CPU Node + 1x M2090 GPU: 0.74 ns/day, $4,000
Each blue node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU).
The green node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU) and 1x NVIDIA M2090 GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Speedup of 1.2x on F1-ATPase for 50% of the cost
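
The slide's conclusion combines two ratios: throughput of the GPU node over the CPU cluster, and purchase cost of one over the other. A small Python sketch of that arithmetic is given here. It assumes 0.63 ns/day and $8,000 belong to the four CPU nodes and 0.74 ns/day and $4,000 to the GPU node, a pairing inferred from the conclusion rather than stated on the slide. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One hardware option: throughput in ns/day and purchase cost in USD."""
    name: str
    ns_per_day: float
    cost_usd: float

    @property
    def ns_per_day_per_kusd(self) -> float:
        """Throughput per $1,000 spent."""
        return self.ns_per_day / (self.cost_usd / 1000.0)


def compare(gpu: Config, cpu: Config) -> tuple[float, float]:
    """Return (speedup, cost ratio) of the GPU option relative to the CPU option."""
    return gpu.ns_per_day / cpu.ns_per_day, gpu.cost_usd / cpu.cost_usd


# Pairing of values to configurations is assumed from the slide's conclusion.
CPU_NODES = Config("4 CPU nodes", ns_per_day=0.63, cost_usd=8000)
GPU_NODE = Config("1 CPU node + 1x M2090", ns_per_day=0.74, cost_usd=4000)

if __name__ == "__main__":
    speedup, cost_ratio = compare(GPU_NODE, CPU_NODES)
    print(f"speedup {speedup:.2f}x at {cost_ratio:.0%} of the cost")
```

Throughput per $1,000 is included because it folds both ratios into one figure, which makes options with different prices and speeds directly comparable.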

K20 - Greener: Twice The Science Per Watt
Energy Used in Simulating 1 Nanosecond of ApoA1
[Bar chart: energy expended, lower is better, for 1 Node and 1 Node + 2x K20]
Energy Expended = Power x Time
Each blue node contains Dual E5-2687W CPUs (95W, 4 Cores per CPU).
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 Cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU).

Cut down energy usage by ½ with GPUs


Kepler - Greener: Twice The Science/Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus)
[Bar chart: energy expended, lower is better, for CPU Only, CPU + 2 K10s, CPU + 2 K20s and CPU + 2 K20Xs]
Energy Expended = Power x Time
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).

Cut down energy usage by ½ with GPUs

Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration
  # of CPU sockets: 2
  Cores per CPU socket: 6+
  CPU speed (GHz): 2.66+
  System memory per socket (GB): 32
  GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
  # of GPUs per CPU socket: 1-2
  GPU memory preference (GB): 6
  GPU to CPU connection: PCIe 2.0 or higher
  Server storage: 500 GB or higher
  Network configuration: Gemini, InfiniBand
Scale to multiple nodes with same single node configuration

Summary/Conclusions
Benefits of GPU Accelerated Computing
Faster than CPU only systems in all tests

Large performance boost with small marginal price increase

Energy usage cut in half

GPUs scale very well within a node and over multiple nodes

Tesla K20 GPU is our fastest and lowest power high performance GPU to date

Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive

More Science for Your Money
Embedded Atom Model
[Bar chart: speedup over CPU Only for CPU + 1x K10, CPU + 1x K20, CPU + 1x K20X, CPU + 2x K10, CPU + 2x K20 and CPU + 2x K20X; bar labels: 1.7, 2.47, 2.92, 3.3, 4.5, 5.5]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).

Experience performance increases of up to 5.5x with Kepler GPU nodes.

K20X, the Fastest GPU Yet
[Bar chart: speedup relative to CPU alone for CPU Only, CPU + 2x M2090, CPU + K20X and CPU + 2x K20X]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 2 NVIDIA M2090s or K20X GPUs (235W).

Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s

Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in Loop Time Computation by Additional GPUs
[Bar chart: loop-time performance normalized to CPU only]
  1 Node (CPU only): baseline
  1 Node + 1x M2090: 5.31
  1 Node + 2x M2090: 9.88
  1 Node + 3x M2090: 12.9
  1 Node + 4x M2090: 18.2
The blue node contains Dual X5670 CPUs (6 Cores per CPU).
The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 GPUs.

Increase performance 18x when compared to CPU-only nodes

Cheaper CPUs used with GPUs AND still faster overall performance when
compared to more expensive CPUs!
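
How well the speedup keeps pace with the number of GPUs can be expressed as a scaling efficiency: the measured speedup with n GPUs divided by n times the single-GPU speedup. The sketch below is illustrative only and assumes the chart values pair with GPU counts in ascending order (5.31 for one M2090 up to 18.2 for four). The function name is hypothetical.

```python
# Loop-time speedups normalized to the CPU-only node, keyed by GPU count.
# Pairing values with GPU counts in ascending order is an assumption.
SPEEDUP_BY_GPU_COUNT = {1: 5.31, 2: 9.88, 3: 12.9, 4: 18.2}


def scaling_efficiency(speedups: dict[int, float]) -> dict[int, float]:
    """Efficiency of n GPUs relative to ideal linear scaling from 1 GPU.

    A value of 1.0 means the n-GPU run is exactly n times faster than the
    single-GPU run; lower values indicate scaling losses.
    """
    if 1 not in speedups:
        raise KeyError("a single-GPU measurement is required as reference")
    single = speedups[1]
    return {n: s / (n * single) for n, s in sorted(speedups.items())}


if __name__ == "__main__":
    for n, eff in scaling_efficiency(SPEEDUP_BY_GPU_COUNT).items():
        print(f"{n} GPU(s): {eff:.0%} of ideal")
```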

Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne 134M Atoms
[Chart: loop time (seconds) vs. nodes (300 to 900) for GPU-accelerated XK6 and CPU-only XE6; GPU speedup labels: 3.55x, 3.48x, 3.45x]

From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090.

GPUs Sustain 5x Performance for Weak Scaling
Weak Scaling with 32K Atoms per Node
[Chart: loop time (seconds) vs. nodes (1, 8, 27, 64, 125, 216, 343, 512, 729); GPU speedup labels: 6.7x, 5.8x, 4.8x]

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU).
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090.

Faster, Greener — Worth It!
Energy Consumed in one loop of EAM
[Bar chart: energy consumed, lower is better, for 1 Node, 1 Node + 1 K20X and 1 Node + 2x K20X]
GPU-accelerated computing uses 53% less energy than CPU only.
Energy Expended = Power x Time
Power calculated by combining the components’ TDPs.

Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9.
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.

Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive

Molecular Dynamics with LAMMPS
on a Hybrid Cray Supercomputer
W. Michael Brown
National Center for Computational Sciences
Oak Ridge National Laboratory

NVIDIA Technology Theater, Supercomputing 2012
November 14, 2012

Early Kepler Benchmarks on Titan
[Charts: time (s) vs. number of nodes for the Atomic Fluid and Bulk Copper benchmarks, comparing XK6, XK6+GPU and XK7+GPU]
