This document provides information on molecular dynamics (MD) and quantum chemistry applications that have been optimized to take advantage of GPU acceleration. It lists the key applications in each category along with notes on supported features, reported GPU speedups compared to CPU performance, and release status including support for single or multiple GPUs. The applications cover a wide range of computational chemistry and bioinformatics domains and demonstrate the growing use of GPUs for scientific computing.
GPU Accelerated Computational Chemistry Applications
2. Molecular Dynamics (MD) Applications

AMBER
Features Supported: PMEMD Explicit Solvent & GB Implicit Solvent
GPU Perf: > 100 ns/day JAC NVE on 2X K20s
Release Status: Released; Multi-GPU, multi-node
Notes/Benchmarks: AMBER 12, GPU Revision Support 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
Features Supported: Implicit (5x), Explicit (2x) Solvent via OpenMM
GPU Perf: 2x C2070 equals 32-35x X5667 CPUs
Release Status: Released; Single & multi-GPU in single node
Notes/Benchmarks: Release C37b1; http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
Features Supported: Two-body Forces, Link-cell Pairs, Ewald SPME forces, Shake VV
GPU Perf: 4x
Release Status: Release V 4.03; Multi-GPU, multi-node
Notes/Benchmarks: Source only, Results Published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
Features Supported: Implicit (5x), Explicit (2x)
GPU Perf: 165 ns/Day DHFR on 4X C2075s
Release Status: Released; Multi-GPU, multi-node
Notes/Benchmarks: Release 4.6; 1st Multi-GPU support

LAMMPS
Features Supported: Lennard-Jones, Gay-Berne, Tersoff & many more potentials
GPU Perf: 3.5-18x on Titan
Release Status: Released; Multi-GPU, multi-node
Notes/Benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
Features Supported: Full electrostatics with PME and most simulation features
GPU Perf: 4.0 ns/days F1-ATPase on 1x K20X
Release Status: Released; 100M atom capable; Multi-GPU, multi-node
Notes/Benchmarks: NAMD 2.9

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
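The caveat above matters because a kernel-to-kernel speedup does not carry over directly to whole-application runtime. As a rough illustration that is not taken from the slides, Amdahl's law relates the two: if a fraction p of the CPU runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The function name and the sample fractions below are hypothetical, chosen only to show the arithmetic.

```python
def overall_speedup(kernel_speedup: float, accelerated_fraction: float) -> float:
    """Whole-application speedup from Amdahl's law.

    kernel_speedup: how much faster the accelerated part runs (e.g. 10.0 for 10x).
    accelerated_fraction: share of the original runtime spent in that part (0..1).
    """
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    # Runtime after acceleration, as a fraction of the original runtime.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical fractions; real values depend on the application and input.
    for fraction in (0.5, 0.9, 0.99):
        print(f"10x kernel, {fraction:.0%} of runtime accelerated: "
              f"{overall_speedup(10.0, fraction):.2f}x overall")
```

Under these assumptions, a 10x kernel figure only approaches 10x for the whole run when nearly all of the runtime sits in the accelerated kernels.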
3. New/Additional MD Applications Ramping

Abalone
Features Supported: Simulations (on 1060 GPU)
GPU Perf: 4-29X (on 1060 GPU)
Release Status: Released, Version 1.8.51; Single GPU
Notes: Agile Molecule, Inc.

Ascalaph
Features Supported: Computation of non-valent interactions
GPU Perf: 4-29X (on 1060 GPU)
Release Status: Released, Version 1.1.4; Single GPU
Notes: Agile Molecule, Inc.

ACEMD
Features Supported: Written for use only on GPUs
GPU Perf: 150 ns/day DHFR on 1x K20
Release Status: Released; Single and multi-GPUs
Notes: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs

Folding@Home
Features Supported: Powerful distributed computing molecular dynamics system; implicit solvent and folding
GPU Perf: Depends upon number of GPUs
Release Status: Released; GPUs and CPUs
Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs

GPUGrid.net
Features Supported: High-performance all-atom biomolecular simulations; explicit solvent and binding
GPU Perf: Depends upon number of GPUs
Release Status: Released; NVIDIA GPUs only
Notes: http://www.gpugrid.net/

HALMD
Features Supported: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
GPU Perf: Up to 66x on 2090 vs. 1 CPU core
Release Status: Released, Version 0.2.0; Single GPU
Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
Features Supported: Written for use only on GPUs
GPU Perf: Kepler 2X faster than Fermi
Release Status: Released, Version 0.11.2; Single and multi-GPU on 1 node
Notes: http://codeblue.umich.edu/hoomd-blue/; Multi-GPU w/ MPI in March 2013

OpenMM
Features Supported: Implicit and explicit solvent, custom forces
GPU Perf: Implicit: 127-213 ns/day; Explicit: 18-55 ns/day DHFR
Release Status: Released, Version 4.1.1; Multi-GPU
Notes: Library and application for molecular dynamics on high-performance

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
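Several entries above report throughput in ns/day, meaning nanoseconds of simulated time produced per day of wall-clock time. As a small sketch of how that figure relates to the integration timestep and the number of MD steps completed per second, the conversion can be written as below. The function names and the 2 fs timestep are illustrative assumptions; the slides do not state the timestep used in any benchmark.

```python
SECONDS_PER_DAY = 86_400
FS_PER_NS = 1_000_000  # 1 ns = 10^6 fs


def ns_per_day(timestep_fs: float, steps_per_second: float) -> float:
    """Simulated nanoseconds per wall-clock day for a given MD step rate."""
    return timestep_fs * steps_per_second * SECONDS_PER_DAY / FS_PER_NS


def steps_per_second_for(target_ns_per_day: float, timestep_fs: float) -> float:
    """MD steps per second needed to reach a target ns/day at a given timestep."""
    return target_ns_per_day * FS_PER_NS / (timestep_fs * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Assumed 2 fs timestep, for illustration only.
    rate = steps_per_second_for(100.0, timestep_fs=2.0)
    print(f"100 ns/day at 2 fs needs about {rate:.0f} steps per second")
    print(f"1000 steps/s at 2 fs gives {ns_per_day(2.0, 1000.0):.1f} ns/day")
```

Because ns/day depends on the timestep as well as the hardware, figures from different benchmarks are only comparable when the simulation settings match.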
4. Quantum Chemistry Applications

Abinit
Features Supported: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization / orthogonalization
GPU Perf: 1.3-2.7X
Release Status: Released; Version 7.0.5; Multi-GPU support
Notes: www.abinit.org

ACES III
Features Supported: Integrating scheduling GPU into SIAL programming language and SIP runtime environment
GPU Perf: 10X on kernels
Release Status: Under development; Multi-GPU support
Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
Features Supported: Fock Matrix, Hessians
GPU Perf: TBD
Release Status: Pilot project completed, Under development; Multi-GPU support
Notes: www.scm.com

BigDFT
Features Supported: DFT; Daubechies wavelets, part of Abinit
GPU Perf: 5-25X (1 CPU core to GPU kernel)
Release Status: Released June 2009, current release 1.6.0; Multi-GPU support
Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
Features Supported: TBD
GPU Perf: TBD
Release Status: Under development, Spring 2013 release; Multi-GPU support
Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K
Features Supported: DBCSR (sparse matrix multiply library)
GPU Perf: 2-7X
Release Status: Under development; Multi-GPU support
Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
Features Supported: Libqc with Rys Quadrature Algorithm, Hartree-Fock, MP2 and CCSD in Q4 2012
GPU Perf: 1.3-1.6X, 2.3-2.9x HF
Release Status: Released; Multi-GPU support
Notes: Next release Q4 2012. http://www.msg.ameslab.gov/gamess/index.html

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
5. Quantum Chemistry Applications

GAMESS-UK
Features Supported: (ss|ss) type integrals within calculations using Hartree Fock ab initio methods and density functional theory. Supports organics & inorganics.
GPU Perf: 8x
Release Status: Release in 2012; Multi-GPU support
Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
Features Supported: Joint PGI, NVIDIA & Gaussian Collaboration
GPU Perf: TBD
Release Status: Under development; Multi-GPU support
Notes: Announced Aug. 29, 2011; http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
Features Supported: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (rmm-diis)
GPU Perf: 8x
Release Status: Released; Multi-GPU support
Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O’Grady (SLAC)

Jaguar
Features Supported: Investigating GPU acceleration
GPU Perf: TBD
Release Status: Under development; Multi-GPU support
Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278

MOLCAS
Features Supported: CU_BLAS support
GPU Perf: 1.1x
Release Status: Released, Version 7.8; Single GPU. Additional GPU support coming in Version 8
Notes: www.molcas.org

MOLPRO
Features Supported: Density-fitted MP2 (DF-MP2), density fitted local correlation methods (DF-RHF, DF-KS), DFT
GPU Perf: 1.7-2.3X projected
Release Status: Under development; Multiple GPU
Notes: www.molpro.net; Hans-Joachim Werner

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
6. Quantum Chemistry Applications

MOPAC2009
Features Supported: pseudodiagonalization, full diagonalization, and density matrix assembling
GPU Perf: 3.8-14X
Release Status: Under Development; Single GPU
Notes: Academic port. http://openmopac.net

NWChem
Features Supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
GPU Perf: 3-10X projected
Release Status: Release targeting March 2013; Multiple GPUs
Notes: Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
Features Supported: DFT and TDDFT
GPU Perf: TBD
Release Status: Released
Notes: http://www.tddft.org/programs/octopus/

PEtot
Features Supported: Density functional theory (DFT) plane wave pseudopotential calculations
GPU Perf: 6-10X
Release Status: Released; Multi-GPU
Notes: First principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
Features Supported: RI-MP2
GPU Perf: 8x-14x
Release Status: Released, Version 4.0
Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
7. Quantum Chemistry Applications

QMCPACK
Features Supported: Main features
GPU Perf: 3-4x
Release Status: Released; Multiple GPUs
Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
Features Supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
GPU Perf: 2.5-3.5x
Release Status: Released, Version 5.0; Multiple GPUs
Notes: Created by Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
Features Supported: “Full GPU-based solution”
GPU Perf: 44-650X vs. GAMESS CPU version
Release Status: Released, Version 1.5; Multi-GPU/single node
Notes: Completely redesigned to exploit GPU parallelism. YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
Features Supported: Hybrid Hartree-Fock DFT functionals including exact exchange
GPU Perf: 2x; 2 GPUs comparable to 128 CPU cores
Release Status: Available on request; Multiple GPUs
Notes: By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
Features Supported: Generalized Wang-Landau method
GPU Perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
Release Status: Under development; Multi-GPU support
Notes: NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
8. Viz, "Docking" and Related Applications Growing

Amira 5®
Features Supported: 3D visualization of volumetric data and surfaces
GPU Perf: 70x
Release Status: Released, Version 5.3.3; Single GPU
Notes: Visualization from Visage Imaging. Next release, 5.4, will use GPU for general purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF
Features Supported: Allows fast processing of large ligand databases
GPU Perf: 100X
Release Status: Available upon request to authors; single GPU
Notes: High-Throughput parallel blind Virtual Screening, http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
Features Supported: Empirical Free Energy Forcefield
GPU Perf: 6.5-13.4X
Release Status: Released; Single GPU
Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
Features Supported: GPU accelerated application
GPU Perf: 3.75-5000X
Release Status: Released, Suite 2011; Single and multi-GPUs
Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
Features Supported: Real-time shape similarity searching/comparison
GPU Perf: 800-3000X
Release Status: Released; Single and multi-GPUs
Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs

PyMol
Features Supported: Lines: 460% increase; Cartoons: 1246% increase; Surface: 1746% increase; Spheres: 753% increase; Ribbon: 426% increase
GPU Perf: 1700x
Release Status: Released, Version 1.5; Single GPUs
Notes: http://pymol.org/

VMD
Features Supported: High quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular
GPU Perf: 100-125X or greater on kernels
Release Status: Released, Version 1.9
Notes: Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against Multi-core x86 CPU socket.
GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
9. Bioinformatics Applications

BarraCUDA
Features Supported: Alignment of short sequencing reads
GPU Speedup: 6-10x
Release Status: Version 0.6.2 – 3/2012; Multi-GPU, multi-node
Website: http://seqbarracuda.sourceforge.net/

CUDASW++
Features Supported: Parallel search of Smith-Waterman database
GPU Speedup: 10-50x
Release Status: Version 2.0.8 – Q1/2012; Multi-GPU, multi-node
Website: http://sourceforge.net/projects/cudasw/

CUSHAW
Features Supported: Parallel, accurate long read aligner for large genomes
GPU Speedup: 10x
Release Status: Version 1.0.40 – 6/2012; Multiple-GPU
Website: http://cushaw.sourceforge.net/

GPU-BLAST
Features Supported: Protein alignment according to BLASTP
GPU Speedup: 3-4x
Release Status: Version 2.2.26 – 3/2012; Single GPU
Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER
Features Supported: Parallel local and global search of Hidden Markov Models
GPU Speedup: 60-100x
Release Status: Version 2.3.2 – Q1/2012; Multi-GPU, multi-node
Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME
Features Supported: Scalable motif discovery algorithm based on MEME
GPU Speedup: 4-10x
Release Status: Version 3.0.12; Multi-GPU, multi-node
Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind
Features Supported: Hardware and software for reference assembly, blast, SW, HMM, de novo assembly
GPU Speedup: 400x
Release Status: Released; Multi-GPU, multi-node
Website: http://www.seqnfind.com/

UGENE
Features Supported: Fast short read alignment
GPU Speedup: 6-8x
Release Status: Version 1.11 – 5/2012; Multi-GPU, multi-node
Website: http://ugene.unipro.ru/

Parallel linear regression on

GPU Perf compared against same or similar code running on single CPU machine.
Performance measured internally or independently.
10.
11. MD Average Speedups
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
[Chart: Performance Relative to CPU Only; configurations: CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X]
Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases.
Error bars show the maximum and minimum speedup for each hardware configuration.
12. Molecular Dynamics (MD) Applications
Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks
AMBER | PMEMD Explicit Solvent & GB Implicit Solvent | > 100 ns/day JAC NVE on 2X K20s | Released; Multi-GPU, multi-node | AMBER 12, GPU Revision Support 12.2; http://ambermd.org/gpus/benchmarks.htm#Benchmarks
CHARMM | Implicit (5x), Explicit (2x) Solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released; Single & multi-GPU in single node | Release C37b1; http://www.charmm.org/news/c37b1.html#postjump
DL_POLY | Two-body Forces, Link-cell Pairs, Ewald SPME forces, Shake VV | 4x | Release V 4.03; Multi-GPU, multi-node | Source only, Results Published; http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
GROMACS | Implicit (5x), Explicit (2x) | 165 ns/day DHFR on 4X C2075s | Released; Multi-GPU, multi-node | Release 4.6; 1st Multi-GPU support
LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released; Multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day F1-ATPase on 1x K20X | Released; 100M atom capable; Multi-GPU, multi-node | NAMD 2.9
GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
13. New/Additional MD Applications Ramping
Application | Features Supported | GPU Perf | Release Status | Notes
Abalone | Simulations (on 1060 GPU) | 4-29X (on 1060 GPU) | Released, Version 1.8.51; Single GPU | Agile Molecule, Inc.
Ascalaph | Computation of non-valent interactions | 4-29X (on 1060 GPU) | Released, Version 1.1.4; Single GPU | Agile Molecule, Inc.
ACEMD | Written for use only on GPUs | 150 ns/day DHFR on 1x K20 | Released; Single and multi-GPUs | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs
Folding@Home | Powerful distributed computing molecular dynamics system; implicit solvent and folding | Depends upon number of GPUs | Released; GPUs and CPUs | http://folding.stanford.edu; GPUs get 4X the points of CPUs
GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released; NVIDIA GPUs only | http://www.gpugrid.net/
HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on 2090 vs. 1 CPU core | Released, Version 0.2.0; Single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
HOOMD-Blue | Written for use only on GPUs | Kepler 2X faster than Fermi | Released, Version 0.11.2; Single and multi-GPU on 1 node | http://codeblue.umich.edu/hoomd-blue/; Multi-GPU w/ MPI in March 2013
OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day; Explicit: 18-55 ns/day DHFR | Released, Version 4.1.1; Multi-GPU | Library and application for molecular dynamics on high-performance
GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison.
14. Built from Ground Up for GPUs
Computational Chemistry
What: Study disease & discover drugs; predict drug and protein interactions
Why: Speed of simulations is critical. Enables study of:
Longer timeframes
Larger systems
More simulations
How: GPUs increase throughput & accelerate simulations
AMBER 11 Application: 4.6x performance increase with 2 GPUs with only a 54% added cost*
GPU READY APPLICATIONS: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node)
• Cost of CPU node assumed to be $9333. Cost of adding two (2) 2090s to single node is assumed to be $5333
15. AMBER 12
GPU Support Revision 12.2
1/22/2013
16. Kepler - Our Fastest Family of GPUs Yet
Factor IX, running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU
[Chart: Factor IX, Nanoseconds / Day]
1 CPU Node: 3.42
1 CPU Node + M2090: 11.85 (3.5x)
1 CPU Node + K10: 18.90 (5.6x)
1 CPU Node + K20: 22.44 (6.6x)
1 CPU Node + K20X: 25.39 (7.4x)
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU only node
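The speedup labels on the Factor IX chart can be reproduced by dividing each configuration's throughput by the CPU-only throughput. The following is a minimal sketch, not NVIDIA's benchmark tooling: the function name `speedups` and the dictionary are illustrative, and the pairing of each ns/day value with a configuration is inferred from the chart labels. Small differences from the printed labels (for example 5.5x versus 5.6x for K10) are rounding.

```python
# ns/day values read from the Factor IX chart (AMBER 12, GPU Support Revision 12.1).
# The value-to-configuration pairing is an inference from the slide layout.
FACTOR_IX_NS_PER_DAY = {
    "1 CPU Node": 3.42,
    "1 CPU Node + M2090": 11.85,
    "1 CPU Node + K10": 18.90,
    "1 CPU Node + K20": 22.44,
    "1 CPU Node + K20X": 25.39,
}


def speedups(ns_per_day, baseline="1 CPU Node"):
    """Return each configuration's throughput relative to the baseline."""
    base = ns_per_day[baseline]
    return {name: value / base for name, value in ns_per_day.items()}


if __name__ == "__main__":
    for name, factor in speedups(FACTOR_IX_NS_PER_DAY).items():
        print(f"{name:20s} {factor:4.1f}x")
```

Because throughput in ns/day is inversely proportional to wall-clock time per nanosecond, the same ratio also gives the reduction in time to solution.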
17. K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10 GPU
[Chart: Speedup Compared to CPU Only; baseline bar: CPU, All Molecules]
TRPcage (GB): 2.00
JAC NVE (PME): 5.50
Factor IX NVE (PME): 5.53
Cellulose NVE (PME): 5.04
Myoglobin (GB): 19.98
Nucleosome (GB): 24.00
Gain 24x performance by adding just 1 GPU when compared to dual CPU performance (Nucleosome)
18. K20 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU)
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU
[Chart: Speedup Compared to CPU Only]
CPU (All Molecules): 1.00
TRPcage (GB): 2.66
JAC NVE (PME): 6.50
Factor IX NVE (PME): 6.56
Cellulose NVE (PME): 7.28
Myoglobin (GB): 25.56
Nucleosome (GB): 28.00
Gain 28x throughput/performance by adding just one K20 GPU when compared to dual CPU performance (Nucleosome)
AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012
19. K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K20X GPU
[Chart: Speedup Compared to CPU Only; baseline bar: CPU, All Molecules]
TRPcage (GB): 2.79
JAC NVE (PME): 7.15
Factor IX NVE (PME): 7.43
Cellulose NVE (PME): 8.30
Myoglobin (GB): 28.59
Nucleosome (GB): 31.30
Gain 31x performance by adding just one K20X GPU when compared to dual CPU performance (Nucleosome)
20. K10 Strong Scaling over Nodes
Cellulose 408K Atoms (NPT), running AMBER 12 with CUDA 4.2 ECC Off
The blue nodes contain 2x Intel X5670 CPUs (6 Cores per CPU)
The green nodes contain 2x Intel X5670 CPUs (6 Cores per CPU) plus 2x NVIDIA K10 GPUs
[Chart: Cellulose, Nanoseconds / Day vs. Number of Nodes (1, 2, 4); series: CPU Only, With GPU; speedup labels: 2.4x, 3.6x, 5.1x]
GPUs significantly outperform CPUs while scaling over multiple nodes
21. Kepler – Universally Faster
Running AMBER 12 GPU Support Revision 12.1
The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs
[Chart: Speedups Compared to CPU Only; series: JAC, Factor IX, Cellulose; configurations: CPU Only, CPU + K10, CPU + K20, CPU + K20X]
The Kepler GPUs accelerated all simulations, up to 8x
22. K10 Extreme Performance
JAC 23K Atoms (NVE), running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green node contains Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10 GPUs
[Chart: DHFR, Nanoseconds / Day]
Blue node (CPU only): 12.47
Green node (2x K10): 97.99
Gain 7.8X performance by adding just 2 GPUs when compared to dual CPU performance
23. K20 Extreme Performance
DHFR JAC 23K Atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU)
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPUs
[Chart: DHFR, Nanoseconds / Day]
Blue node (CPU only): 12.47
Green node (2x K20): 95.59
Gain > 7.5X throughput/performance by adding just 2 K20 GPUs when compared to dual CPU performance
24. Replace 8 Nodes with 1 K20 GPU
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU)
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
[Chart: DHFR, Nanoseconds/Day and Cost]
Eight CPU nodes: 65.00 ns/day, $32,000.00
One node + 1x K20: 81.09 ns/day, $6,500.00
Cut down simulation costs to ¼ and gain higher performance
25. Replace 7 Nodes with 1 K10 GPU
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9 ECC Off
The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 Cores per CPU)
The green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K10 GPU
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
[Chart: DHFR, Performance on JAC NVE (Nanoseconds / Day) and Cost; CPU Only vs. GPU Enabled]
Cost: CPU Only $32,000; GPU Enabled $7,000
Cut down simulation costs to ¼ and increase performance by 70%
26. Extra CPUs decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1
The orange bars contain one E5-2687W CPU (8 Cores per CPU).
The blue bars contain Dual E5-2687W CPUs (8 Cores per CPU)
[Chart: Cellulose, Nanoseconds / Day; groups: CPU Only, CPU with dual K20s; series: 1 E5-2687W, 2 E5-2687W; annotations: 2 CPUs 2 GPUs, 1 CPU 2 GPUs]
When used with GPUs, dual CPU sockets perform worse than single CPU sockets.
27. Kepler - Greener Science
Energy used in simulating 1 ns of DHFR JAC, running AMBER 12 GPU Support Revision 12.1
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs (235W each).
Energy Expended = Power x Time
[Chart: Energy Expended (kJ), lower is better; configurations: CPU Only, CPU + K10, CPU + K20, CPU + K20X]
The GPU Accelerated systems use 65-75% less energy
28. Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
CPU speed (GHz): 2.66+
System memory per node (GB): 16
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket is good to do 4 fast serial GPU runs)
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 16x or higher
Server storage: 2 TB
Scale to multiple nodes with same single node configuration
29. Benefits of GPU AMBER Accelerated Computing
Faster than CPU only systems in all tests
Most major compute intensive aspects of classical MD ported
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
K20 GPU is our fastest and lowest power high performance GPU yet
Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive
31. Kepler - Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1), running NAMD version 2.9
The blue node contains Dual E5-2687W CPUs (8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU
[Chart: Apolipoprotein A1, Nanoseconds/Day]
1 CPU Node: 1.37
1 CPU Node + M2090: 2.63 (1.9x)
1 CPU Node + K10: 3.45 (2.5x)
1 CPU Node + K20: 3.57 (2.6x)
1 CPU Node + K20X: 4.00 (2.9x)
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU only node
NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012
32. Accelerates Simulations of All Sizes
Running NAMD 2.9 with CUDA 4.0 ECC Off
The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU)
Each green node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPU
[Chart: Speedup Compared to CPU Only; bars: CPU (All Molecules), ApoA1 (Apolipoprotein A1), F1-ATPase, STMV; bar labels as extracted: 2.7, 2.6, 2.4]
Gain 2.5x throughput/performance by adding just 1 GPU when compared to dual CPU performance
33. Kepler – Universally Faster
Running NAMD version 2.9
The CPU Only node contains Dual E5-2687W CPUs (8 Cores per CPU).
The Kepler nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Kepler nodes use Dual CPUs
[Chart: Speedup Compared to CPU Only; series: F1-ATPase, ApoA1, STMV; configurations: CPU Only, 1x K10, 1x K20, 1x K20X, 2x K10, 2x K20, 2x K20X; average acceleration printed in bars: 2.4x, 2.6x, 2.9x, 4.3x, 4.7x, 5.1x]
The Kepler GPUs accelerate all simulations, up to 5x
34. Outstanding Strong Scaling with Multi-STMV
Running NAMD version 2.9
100 STMV on Hundreds of Nodes (concatenation of 100 Satellite Tobacco Mosaic Virus)
Each blue XE6 CPU node contains 1x AMD 1600 Opteron (16 Cores per CPU).
Each green XK6 CPU+GPU node contains 1x AMD 1600 Opteron (16 Cores per CPU) and an additional 1x NVIDIA X2090 GPU.
[Chart: Nanoseconds / Day vs. # of Nodes (32, 64, 128, 256, 512, 640, 768); series: Fermi XK6, CPU XK6; speedup labels: 2.7x, 2.9x, 3.6x, 3.8x]
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers
35. Replace 3 Nodes with 1 2090 GPU
F1-ATPase, running NAMD version 2.9
Each blue node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU).
The green node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU) and 1x NVIDIA M2090 GPU
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
[Chart: F1-ATPase, Nanoseconds/Day and Cost]
4 CPU Nodes: 0.63 ns/day, $8,000
1 CPU Node + 1x M2090 GPU: 0.74 ns/day, $4,000
Speedup of 1.2x for 50% the cost
36. K20 - Greener: Twice The Science Per Watt
Energy Used in Simulating 1 Nanosecond of ApoA1, running NAMD version 2.9
Each blue node contains Dual E5-2687W CPUs (95W, 4 Cores per CPU).
Each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 Cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU)
Energy Expended = Power x Time
[Chart: Energy Expended (kJ), lower is better; configurations: 1 Node, 1 Node + 2x K20]
Cut down energy usage by ½ with GPUs
37. Kepler - Greener: Twice The Science/Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9
The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).
The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).
Energy Expended = Power x Time
[Chart: Energy Expended (kJ), lower is better; configurations: CPU Only, CPU + 2 K10s, CPU + 2 K20s, CPU + 2 K20Xs]
Cut down energy usage by ½ with GPUs
38. Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to multiple nodes with same single node configuration
39. Summary/Conclusions
Benefits of GPU Accelerated Computing
Faster than CPU only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
Tesla K20 GPU is our fastest and lowest power high performance GPU to date
Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive
41. More Science for Your Money
Embedded Atom Model
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W).
[Chart: Speedup Compared to CPU Only; configurations: CPU Only, CPU + 1x K10, CPU + 1x K20, CPU + 1x K20X, CPU + 2x K10, CPU + 2x K20, CPU + 2x K20X; bar labels as extracted: 1.7, 2.47, 2.92, 3.3, 5.5]
Experience performance increases of up to 5.5x with Kepler GPU nodes.
42. K20X, the Fastest GPU Yet
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).
Green nodes have 2x E5-2687W and 2 NVIDIA M2090s or K20X GPUs (235W).
[Chart: Speedup Relative to CPU Alone; configurations: CPU Only, CPU + 2x M2090, CPU + K20X, CPU + 2x K20X]
Experience performance increases of up to 6.2x with Kepler GPU nodes.
One K20X performs as well as two M2090s
43. Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in Loop Time Computation by Additional GPUs
Running NAMD version 2.9
The blue node contains Dual X5670 CPUs (6 Cores per CPU).
The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 GPUs.
[Chart: loop time acceleration, Normalized to CPU Only]
1 Node + 1x M2090: 5.31
1 Node + 2x M2090: 9.88
1 Node + 3x M2090: 12.9
1 Node + 4x M2090: 18.2
Increase performance 18x when compared to CPU-only nodes
Cheaper CPUs used with GPUs AND still faster overall performance when compared to more expensive CPUs!
44. Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne 134M Atoms
[Chart: Loop Time (seconds) vs. Nodes (300 to 900); series: GPU Accelerated XK6, CPU only XE6; speedup labels: 3.55x, 3.48x, 3.45x]
From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090
45. GPUs Sustain 5x Performance for Weak Scaling
Weak Scaling with 32K Atoms per Node
[Chart: Loop Time (seconds) vs. Nodes (1, 8, 27, 64, 125, 216, 343, 512, 729); speedup labels: 6.7x, 5.8x, 4.8x]
Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 Cores per CPU)
Each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090
46. Faster, Greener — Worth It!
Energy Consumed in one loop of EAM
GPU-accelerated computing uses 53% less energy than CPU only
Energy Expended = Power x Time
Power calculated by combining the components' TDPs
[Chart: Energy Expended (kJ), lower is better; configurations: 1 Node, 1 Node + 1 K20X, 1 Node + 2x K20X]
Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9.
Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.
Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive
47. Molecular Dynamics with LAMMPS
on a Hybrid Cray Supercomputer
W. Michael Brown
National Center for Computational Sciences
Oak Ridge National Laboratory
NVIDIA Technology Theater, Supercomputing 2012
November 14, 2012
Note the rise of GPU only applications and GPU-grid applications. This indicates that GPUs are a sweet spot for MD.
Due to the great interest in speeding up quantum chemistry applications, NVIDIA has begun reaching out to these developers to see how it may assist their development on GPUs. These developers either have active GPU development projects or have released applications.
Nodes, box size, atoms, cpu time, cpu+gpu time, gpu speedup
1, 1x1x1, 32768, 42.2, 6.33, 6.67 x
8, 2x2x2, 262144, 41.8, 6.73, 6.21 x
27, 3x3x3, 884736, 41.5, 6.86, 6.05 x
64, 4x4x4, 2097152, 41.5, 7.18, 5.78 x
125, 5x5x5, 4096000, 41.4, 7.18, 5.77 x
216, 6x6x6, 7077888, 42, 7.66, 5.48 x
343, 7x7x7, 11239424, 41.9, 8.34, 5.02 x
512, 8x8x8, 16777216, 42.3, 8.41, 5.03 x
729, 9x9x9, 23887872, 42.5, 8.92, 4.76 x
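The speedup column of the weak-scaling data follows directly from the two timing columns, and the same timings give a weak-scaling efficiency (time on one node divided by time on N nodes, with the work per node held constant). The sketch below is illustrative only: the function names are hypothetical, and the timings are assumed to be loop times in seconds, as on the "Loop Time (seconds)" axis of the weak-scaling slide.

```python
# (nodes, cpu_time, cpu_gpu_time) transcribed from the weak-scaling notes.
# Units are assumed to be seconds of loop time.
ROWS = [
    (1, 42.2, 6.33),
    (8, 41.8, 6.73),
    (27, 41.5, 6.86),
    (64, 41.5, 7.18),
    (125, 41.4, 7.18),
    (216, 42.0, 7.66),
    (343, 41.9, 8.34),
    (512, 42.3, 8.41),
    (729, 42.5, 8.92),
]


def gpu_speedup(cpu_time, cpu_gpu_time):
    """Speedup of the GPU-accelerated run over the CPU-only run."""
    return cpu_time / cpu_gpu_time


def weak_scaling_efficiency(time_one_node, time_n_nodes):
    """1.0 means the loop time did not grow as nodes (and atoms) were added."""
    return time_one_node / time_n_nodes


if __name__ == "__main__":
    t1_cpu, t1_gpu = ROWS[0][1], ROWS[0][2]
    for nodes, cpu, gpu in ROWS:
        print(
            f"{nodes:4d} nodes  speedup {gpu_speedup(cpu, gpu):.2f}x  "
            f"CPU eff {weak_scaling_efficiency(t1_cpu, cpu):.2f}  "
            f"GPU eff {weak_scaling_efficiency(t1_gpu, gpu):.2f}"
        )
```

Read this way, the falling speedup from 6.67x to 4.76x comes from the GPU loop time growing with node count while the CPU loop time stays roughly flat.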
Configuration: 44 cpus, 2 cpu 1 gpu, 2 cpu 2 gpu
Ns/day: 60, 25.3, 42.4
price: 60000, 3000, 4000
scaled price: 1, 0.05, 0.066666667
perf/price: 1, 8.433333333, 10.6
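The perf/price row in these notes is throughput normalized to the reference system, divided by price normalized to the same reference. A minimal sketch of that arithmetic follows; the function name `perf_per_price` and the dictionary layout are illustrative, and the 44-CPU system is taken as the reference because its scaled price is 1.

```python
# (ns/day, price in dollars) per configuration, transcribed from the notes.
SYSTEMS = {
    "44 cpus": (60.0, 60000.0),
    "2 cpu 1 gpu": (25.3, 3000.0),
    "2 cpu 2 gpu": (42.4, 4000.0),
}


def perf_per_price(ns_per_day, price, ref_ns_per_day, ref_price):
    """Relative performance divided by relative price (reference system = 1)."""
    return (ns_per_day / ref_ns_per_day) / (price / ref_price)


if __name__ == "__main__":
    ref_perf, ref_price = SYSTEMS["44 cpus"]
    for name, (perf, price) in SYSTEMS.items():
        value = perf_per_price(perf, price, ref_perf, ref_price)
        print(f"{name:12s} scaled price {price / ref_price:.3f}  perf/price {value:.2f}")
```

The metric rewards the GPU nodes even though their absolute throughput is lower than the 44-CPU system, because the price drops by a much larger factor than the performance.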
Configuration: 64 cpu, 2 cpu 1 gpu, 2 cpu 2 gpu
Ns/day: 31, 8.9, 15.1
tdp: 6080, 428, 666
sec/ns: 2787.0967, 9707.8651, 5721.8543
energy/ns: 16945.548, 4154.9662, 3810.7549
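The energy figures in these notes follow the "Energy Expended = Power x Time" rule used on the energy slides: seconds per nanosecond come from the ns/day throughput, and multiplying by the combined TDP gives energy per simulated nanosecond. The sketch below assumes the tdp row is in watts and the energy/ns row is in kJ, which is what makes the numbers agree; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# (ns/day, combined TDP in watts) per configuration, transcribed from the notes.
SYSTEMS = {
    "64 cpu": (31.0, 6080.0),
    "2 cpu 1 gpu": (8.9, 428.0),
    "2 cpu 2 gpu": (15.1, 666.0),
}


def seconds_per_ns(ns_per_day):
    """Wall-clock seconds needed to simulate one nanosecond."""
    return SECONDS_PER_DAY / ns_per_day


def energy_kj_per_ns(tdp_watts, ns_per_day):
    """Energy = power x time, converted from joules to kilojoules."""
    return tdp_watts * seconds_per_ns(ns_per_day) / 1000.0


if __name__ == "__main__":
    for name, (perf, tdp) in SYSTEMS.items():
        print(
            f"{name:12s} {seconds_per_ns(perf):10.2f} s/ns  "
            f"{energy_kj_per_ns(tdp, perf):10.2f} kJ/ns"
        )
```

TDP is an upper bound on sustained power draw, so figures computed this way are estimates rather than measured energy.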
Test case not specified in perf lab run
I am here today to talk to you about the value of seamlessly adding GPUs to the computer which you use to run Quantum Espresso/PWscf and achieving phenomenal performance improvements. This small incremental investment will yield significant performance payback.
What is Quantum Espresso/PWscf:
- A set of programs used to calculate the electron configuration of atoms or molecules
- Uses plane wave basis sets and quantum mechanical principles
- Highly compute intensive
Benefits of GPU-accelerated Computing:
- Faster than CPU only systems in all tests
- Performance boost much larger than marginal price increase
- Power consumption more than halved in all simulations
- GPUs scale very well on clusters with dozens of nodes, and beyond
===============
Price assumes FERMI workstation ~$4000 and C2050 $1000
Configuration, shilu 3, water on calcite
6 OpenMP CPU nodes, 1025, 1560
6 OpenMP CPU nodes 1 gpu, 275, 480
FERMI (ICHEC): assembled workstation
CPU: 2 * Intel Xeon X5650 (6-core), 24 GByte RAM
GPU: 2 x C2050, GTX480, C2075
SW: CUDA 4.1, Intel compilers
Configuration, ausurf k pt, ausurf gamma
6 OpenMP CPU nodes, 7100 s, 7000 s
6 OpenMP CPU nodes 1 gpu, 2350 s, 2000 s
FERMI (ICHEC): assembled workstation
CPU: 2 * Intel Xeon X5650 (6-core), 24 GByte RAM
GPU: 2 x C2050, GTX480, C2075
SW: CUDA 4.1, Intel compilers
CPU: Intel X5550, TDP of 95W, priced at $350
GPU: NVIDIA M2070, TDP of 225W, priced at $2000
PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes
CPU: 2 Intel Westmere X5550 (6-core), 48 GByte RAM
GPU: 2 x M2070
SW: CUDA4.0, Intel compilers, PGI (11.x)
National average 9.83 cents/kWh
System, kWh/sim, tests/year, $/test, yearly energy bill
CPU, 42.37, 2357, 4.16, $9816
GPU/CPU, 23.214, 2400, 2.28, $5476
CPU: Intel X5550, TDP of 95W, priced at $350
GPU: NVIDIA M2070, TDP of 225W, priced at $2000
PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes
CPU: 2 Intel Westmere X5550 (6-core), 48 GByte RAM
GPU: 2 x M2070
SW: CUDA4.0, Intel compilers, PGI (11.x)
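The yearly energy bill in these notes is the energy per simulation multiplied by the electricity rate and by the number of tests run per year. The sketch below reproduces that arithmetic under the assumption that the fused figure "42.372357" splits into 42.37 kWh/sim and 2357 tests/year, which is the split that matches the stated $/test and yearly totals; the function names are hypothetical.

```python
RATE_DOLLARS_PER_KWH = 0.0983  # national average quoted in the notes


def cost_per_test(kwh_per_sim, rate=RATE_DOLLARS_PER_KWH):
    """Electricity cost in dollars of a single simulation."""
    return kwh_per_sim * rate


def yearly_energy_bill(kwh_per_sim, tests_per_year, rate=RATE_DOLLARS_PER_KWH):
    """Electricity cost in dollars of a full year of simulations."""
    return cost_per_test(kwh_per_sim, rate) * tests_per_year


if __name__ == "__main__":
    for name, kwh, tests in [("CPU", 42.37, 2357), ("GPU/CPU", 23.214, 2400)]:
        print(
            f"{name:8s} ${cost_per_test(kwh):.2f}/test  "
            f"${yearly_energy_bill(kwh, tests):,.0f}/year"
        )
```

Note that the GPU row runs slightly more tests per year than the CPU row, so the saving per test is a cleaner comparison than the yearly totals.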
total # of cores: 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), 14 (112)
time (s) cpu: 31000, 16500, 11000, 9500, 7500, 6000, 5500
time gpu+cpu: 12500, 7000, 5000, 4500, 3500, 3000, 2500
SPEEDUP: 2.48, 2.357143, 2.2, 2.111111, 2.142857, 2, 2.2
STONEY (ICHEC): Bull Novascale R422-E2, 24 GPU nodes
CPU: 2 Intel (Nehalem EP) Xeon X5560, 48 GByte RAM
GPU: 2 x M2090
SW: CUDA 4.0, Intel compilers
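The SPEEDUP row for the STONEY runs is the CPU time divided by the GPU+CPU time at each size, and the same timings also show how well each series scales as nodes are added. The following sketch is illustrative: it assumes the first number in each column heading is a node count (with total cores in parentheses), and the strong-scaling efficiency formula is a standard definition, not something stated in the notes.

```python
# Timings transcribed from the STONEY (ICHEC) speaker notes, in seconds.
# Node counts are assumed from column headings such as "2 (16)".
NODES = [2, 4, 6, 8, 10, 12, 14]
CPU_TIME_S = [31000, 16500, 11000, 9500, 7500, 6000, 5500]
GPU_CPU_TIME_S = [12500, 7000, 5000, 4500, 3500, 3000, 2500]


def speedup(cpu_time, accelerated_time):
    """Speedup of the GPU+CPU run over the CPU-only run at the same size."""
    return cpu_time / accelerated_time


def strong_scaling_efficiency(nodes, times):
    """Efficiency relative to the smallest run: (n0 * t0) / (n * t)."""
    n0, t0 = nodes[0], times[0]
    return [(n0 * t0) / (n * t) for n, t in zip(nodes, times)]


if __name__ == "__main__":
    gpu_eff = strong_scaling_efficiency(NODES, GPU_CPU_TIME_S)
    for n, cpu, gpu, eff in zip(NODES, CPU_TIME_S, GPU_CPU_TIME_S, gpu_eff):
        print(f"{n:2d} nodes  speedup {speedup(cpu, gpu):.2f}x  GPU+CPU efficiency {eff:.2f}")
```

An efficiency near 1.0 means doubling the nodes halves the run time; values below 1.0 show where communication or serial work starts to dominate.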
# of cores: 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), 44 (528)
time cpu: 3925, 2650, 2525, 2450, 1740, 1290, 1337
time gpu+cpu: 2425, 1437, 1075, 900, 737, 675, 637
SPEEDUP: 1.618557, 1.84412, 2.348837, 2.722222, 2.360923, 1.911111, 2.098901
PLX (CINECA): IBM iDataPlex DX360M3, 264 GPU nodes
CPU: 2 Intel Westmere (6-core), 48 GByte RAM
GPU: 2 x M2070
SW: CUDA4.0, Intel compilers, PGI (11.x)
I am here today to talk to you about the value of seamlessly adding GPUs to the computer which you use to run TeraChem and achieving phenomenal performance improvements. This small incremental investment will yield significant performance payback.
Benefits of GPU Acceleration with TeraChem:
- Compete with Supercomputers
- More powerful hardware
- Significantly lower energy usage
Before we end this session I would like to tell you about GPU Test Drive. It is an excellent resource for computational chemistry researchers such as yourself to evaluate the benefits of GPU computing in speeding up your simulations. Most importantly, it is free.
NVIDIA, along with its partners, is offering access to a remotely hosted GPU cluster. You can run applications such as AMBER and NAMD to find out how your models speed up. You can also try code that you have developed to run on GPUs and see how it scales on an 8 GPU cluster. All you need to do is sign up and log in. It is really that easy!
We have several partners who are demonstrating the GPU Test Drive on the GTC show floor. Please plan on visiting them.
Sign up forms have been given out. If you are interested, please fill them out and return them to me.