Nvidia GTC 2014 Talk

Penn State RCC has been a CUDA research center for the last year; this talk provides success stories and challenges. GPU case studies are given, including algorithm details and performance results.

Published in: Technology

  1. Establishing a CUDA Research Center at Penn State: Perspectives on GPU-Enabled Teaching and Research
     William J. Brouwer (wjb19@psu.edu), Pierre-Yves Taunay (py.taunay@psu.edu)
     Research Computing and Cyberinfrastructure, The Pennsylvania State University
     Nvidia GTC 2014, 04/01/14
  2. Outline
     ● Center Overview (RCC @ PSU)
     ● GPU accelerated research
     ● IceCube
     ● Metabolic Networks (Fsolve/cuSolve)
     ● MD + Simulated Annealing
     ● FQHE (LU Decomposition)
     ● Smart Proppants (QR Decomposition)
     ● GPU cluster scaling
     ● Amber
     ● PetaChem
     ● Quantum Espresso – Lanczos Diagonalization
     ● CUDA, needs + wants
     ● Summary
  3. Center Overview
     ● Research Computing and Cyberinfrastructure (RCC) at PSU provides high performance computing services:
        ● Hardware, proprietary/open source software
        ● Consultation (numerical/algorithmic, software development, etc.)
     ● PhDs, system admins and programmers work together to provide these services to academics while performing independent research
     ● Many users are interested in using GPUs for science and engineering research applications; we are a CUDA research center: https://research.nvidia.com/content/penn-state-crc-summary
     ● Formerly under ITS, currently being incorporated into the Office of the Vice President for Research (OVPR)
  4. Center Overview
     ● Hardware is ~12K CPU cores, 64 GPUs (Fermi), several Kepler
     ● Red Hat Linux, scheduling via PBS/Moab/Torque
     ● Usual monitoring/management tools, e.g. Puppet, Jenkins, Nagios, Ganglia, and some custom solution(s) (e.g. CLPR)
     ● Serve ~7k users across all campuses in the commonwealth
     ● Use CUDA predominantly, although growing numbers of users are trying OpenACC, OpenCL, libraries, etc.
     ● Environment modules system
  5. Center Overview
     ● Support many GPU accelerated applications
  7. IceCube
  8. Metabolic Networks
     ● Optimal models for the metabolic networks of microbial organisms are important in the pharma and energy industries
     ● Ensemble Modeling (EM) is used to construct chemical kinetics of microbial organisms → decompose metabolic reactions into the elementary mechanisms, which are ODE systems f(k_i, y_j) = dy_j/dt
     ● Overall approach maximizes correlation between model predictions and experimental measurements, performed in steady state → solve f(k,y) = 0
  9. Metabolic Networks
     ● [CPU] parse equations f(k,y)
     ● [CPU] differentiate f(k,y), create analytic J(k,y)
     ● [CPU] populate data structures representing f(k,y), J(k,y), copy to GPU
     ● [GPU] Iterate (Newton-Raphson) →
        ● Numerically evaluate f(k,y) and J(k,y) by parallel reduction
        ● Solve J(k,y) . delta = -f(k,y) for delta using GMRES
        ● Update y += delta and repeat until ||f(k,y)|| < tol
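The Newton-Raphson loop on this slide can be sketched on the CPU as follows. This is a minimal illustration, not the authors' GPU code: the two-species kinetics in `f`, the rate constants, and all function names are hypothetical, and a small dense Gaussian elimination stands in for the GMRES solve the slide mentions.

```python
import math

def f(k, y):
    # Hypothetical toy kinetics (an assumption, not from the talk):
    # dy1/dt = k1 - k2*y1, dy2/dt = k2*y1 - k3*y2^2
    k1, k2, k3 = k
    y1, y2 = y
    return [k1 - k2 * y1, k2 * y1 - k3 * y2 * y2]

def jacobian(k, y):
    # Analytic Jacobian J[i][j] = d f_i / d y_j of the toy system above.
    k1, k2, k3 = k
    y1, y2 = y
    return [[-k2, 0.0], [k2, -2.0 * k3 * y2]]

def solve_dense(a, b):
    # Gaussian elimination with partial pivoting; stands in for GMRES here.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def newton_steady_state(k, y0, tol=1e-10, max_iter=50):
    # Iterate: solve J . delta = -f, update y += delta, stop when ||f|| < tol.
    y = list(y0)
    for _ in range(max_iter):
        fy = f(k, y)
        if math.sqrt(sum(v * v for v in fy)) < tol:
            return y
        delta = solve_dense(jacobian(k, y), [-v for v in fy])
        y = [yi + di for yi, di in zip(y, delta)]
    raise RuntimeError("Newton-Raphson did not converge")
```

For example, `newton_steady_state((2.0, 1.0, 0.5), (1.0, 1.0))` should approach the steady state y1 = k1/k2 and y2 = sqrt(k1/k3) of this toy system.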
  10. Metabolic Networks
      ● Solution uses various libraries including Boost, Thrust, CUSP and CUDA
      ● Matrices are sparse and poorly conditioned, but the solution works well for O(10^2) equations
      ● Currently working to scale to larger, more interesting networks and microbial organisms
      ● cuSolve is a work in progress, a GPU-only ODE solver for stiff equations
  11. Molecular Dynamics + Sim Anneal
      ● Solve for MD potentials by fitting experimental data for structure factor
      ● Optimization surface (plotted on the slide) is highly non-convex → use simulated annealing, each GPU performs independent MD run
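As a rough illustration of why simulated annealing suits a non-convex surface, here is a minimal single-variable sketch. Everything in it is an assumption made for illustration: `toy_objective` is an invented function with many local minima, not the structure-factor fit from the talk, and the step size, cooling rate, and seed are arbitrary.

```python
import math
import random

def toy_objective(x):
    # Hypothetical non-convex surface with several local minima.
    return x * x + 3.0 * math.sin(5.0 * x)

def simulated_annealing(objective, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_steps=5000, seed=0):
    rng = random.Random(seed)
    x, fx = x0, objective(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(n_steps):
        cand = x + rng.uniform(-step, step)
        fc = objective(cand)
        # Metropolis rule: always accept downhill moves, and accept uphill
        # moves with probability exp(-(fc - fx) / t).
        if fc <= fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # geometric cooling schedule
    return best_x, best_f
```

In the setting described on the slide, each objective evaluation would be a full MD run, which is presumably why independent runs are farmed out one per GPU.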
  12. LU Decomposition
      ● Batch LU decomposition developed for fractional quantum Hall effect, fundamental physics that has implications in quantum computation and material science
      ● O(N!) determinants need to be evaluated in constructing the wavefunction, and the process is repeated many times in a Monte Carlo calculation
      ● Small, dense matrices of side <= 512
      ● Implementation exploits SIMD architecture, parallel reduction
      ● Example: N=11, computation time using 8 GPU devices (w/ MPI) for 1024 Monte Carlo iterations is ~246 seconds, down from ~31488 seconds on a single CPU
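The link between LU decomposition and determinants can be sketched serially as below: after factoring with partial pivoting, the determinant is the product of the diagonal of U, with a sign flip per row swap. This is a plain CPU illustration with hypothetical function names; the batched GPU kernel from the talk is not shown in the source.

```python
def lu_determinant(a):
    # In-place LU factorization (Doolittle form) with partial pivoting on a copy.
    n = len(a)
    lu = [row[:] for row in a]
    sign = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(lu[r][col]))
        if lu[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            lu[col], lu[pivot] = lu[pivot], lu[col]
            sign = -sign  # each row swap flips the determinant's sign
        for r in range(col + 1, n):
            lu[r][col] /= lu[col][col]
            for c in range(col + 1, n):
                lu[r][c] -= lu[r][col] * lu[col][c]
    det = sign
    for i in range(n):
        det *= lu[i][i]
    return det

def batch_determinants(matrices):
    # Each matrix is independent, which is what makes batching attractive.
    return [lu_determinant(m) for m in matrices]
```

Because every matrix in the batch is factored independently, the outer loop in `batch_determinants` is the natural place to parallelize.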
  13. LU Decomposition
  14. QR Decomposition
      ● Proppant materials are used to stabilize fissures created during hydraulic fracturing
      ● 'Smart proppants' are essentially electrical dipoles which may absorb and re-emit EM energy, irradiated and recorded by downhole instrumentation
      ● This work considers an iteration-free solution to this EM scattering problem, using linear algebra including LU decomposition and SVD
      ● SVD can be performed using the QR algorithm, in turn a function of QR decomposition
      ● Devised a unique approach for large batches of dense small matrices using Givens rotations; largely independent ops, maps well to GPU
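A textbook Givens-rotation QR can be sketched as follows. This is the standard serial formulation, assumed here for illustration only; the batched scheme the slide calls "unique" is not described in the source, and the function name is hypothetical.

```python
import math

def givens_qr(a):
    # Returns (q, r) with a = q * r, q orthogonal (m x m), r upper triangular (m x n).
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for col in range(n):
        # Zero the entries below the diagonal, working from the bottom row up.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            if y == 0.0:
                continue
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Apply the rotation G to rows (row-1, row) of R.
            for j in range(n):
                top, bot = r[row - 1][j], r[row][j]
                r[row - 1][j] = c * top + s * bot
                r[row][j] = -s * top + c * bot
            r[row][col] = 0.0  # exact zero by construction
            # Accumulate Q <- Q * G^T on columns (row-1, row).
            for i in range(m):
                left, right = q[i][row - 1], q[i][row]
                q[i][row - 1] = c * left + s * right
                q[i][row] = -s * left + c * right
    return q, r
```

Each rotation touches only two rows, so rotations acting on disjoint row pairs do not interfere with one another, which is one reason this formulation is often considered a good fit for parallel hardware.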
  15. QR Decomposition
  17. GPU Cluster Scaling
      ● Several key GPU accelerated software suites were tested using multiple GPUs across two clusters

      Cluster                    Lion-GA                          Stampede
      CPU                        12 X5675 @ 3.07 GHz              16 E5-2680 @ 2.70 GHz
      GPU                        8 M2070 or 8 M2090               1 K20c
      Nodes equipped with GPUs   8                                120
      Interconnect               40 Gb/s Mellanox QDR Infiniband  56 Gb/s Mellanox FDR Infiniband
  18. GPU Cluster Scaling
      ● Lion-GA cluster has 3 GPUs per PCIe switch, 3 to 5 GPUs per IOH chip
      ● IOH doesn't support peer to peer transfers between GPU devices on different chipsets
      ● Difficult to achieve peak transfer rates between GPUs on different sockets
  19. Amber
      ● Molecular Dynamics is widely used for simulation of solvated proteins or molecules and makes use of various force fields (AMBER, ReaxFF, etc.)
      ● AMBER force field is implemented in the eponymous software suite
      ● The software PMEMD in AMBER is used for both explicit solvent Particle Mesh Ewald (PME) and implicit solvent Generalized Born (GB) simulations
      ● AMBER does not require extensive communication between GPUs or between CPU and GPU, and does not take advantage of the CPU if GPUs are used
      ● GPU acceleration allows for longer simulation times, ~ nanosecond or more
  20. Amber
      [Figure: PME simulation of DHFR protein in water (NPT ensemble, 23,558 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]
  21. Amber
      [Figure: PME simulation of FactorIX molecule in water (NPT ensemble, 90,906 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]
  22. Amber
      [Figure: PME simulation of Cellulose molecule in water (NPT ensemble, 408,609 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]
  23. Amber
      [Figure: Implicit solvent GB simulation of Myoglobin (2,492 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]
  24. Amber
      [Figure: Implicit solvent GB simulation of Nucleosome (25,095 atoms). Achieved performance on Lion-GA in ns/day for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090.]
  25. PetaChem
      ● Quantum Chemistry designed to run on NVIDIA series hardware
      ● Features restricted Hartree-Fock and grid-based Kohn-Sham single point energy and gradient calculations
      ● Various functions supported, geometry optimization, ab-initio molecular dynamics, support for multi-GPU
      ● Benchmark: single point energy, using basis 6-31g for Olestra
  26. PetaChem
      [Figure: PetaChem Olestra SCF calculation. Total walltime (in s) on Lion-GA for 1 M2070, 3 M2070, 5 M2070, 7 M2070.]
  27. Quantum Espresso
      ● Density Functional Theory (DFT) has enjoyed huge growth in popularity owing to computational and numerical advancements; used widely in material science
      ● Quantum Espresso (QE) is an open source DFT package that has recently added GPU acceleration, largely through BLAS and FFT routines
      ● When building QE with MAGMA (UT/ORNL) or phiGEMM, one introduces heterogeneous CPU/GPU linear algebra routines
      ● Benchmark:
         ● Self-consistent field calculation, using PBE pseudopotentials, 168 atoms (cellulose)
         ● Periodic boundary conditions, kinetic energy cutoff for charge density of 80 Ry, Davidson diagonalization
  28. Quantum Espresso
      [Figure: SCF calculation for cellulose. Total walltime (in hrs) on Stampede@TACC for 1 K20, 2 K20, 4 K20, 8 K20, 16 K20, 32 K20.]
  29. Lanczos Diagonalization
      ● Key task in many applications, especially quantum chemistry & DFT, is diagonalization, i.e. matrix eigen-decomposition
      ● Lanczos is a power method that produces a tri-diagonal matrix, which is more readily solvable; it consists of many matrix-vector operations, very amenable to GPU; currently using cuBLAS & MKL in a heterogeneous solution
      ● Originally devised for a fundamental physics project at PSU, now intended for incorporation into the GPU-Quantum Espresso project being led by Filippo Spiga
      ● Attempting to scale to multiple devices using MPI + GPUDirect, still beset by some numerical/convergence problems with increasing matrix size
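The tridiagonalization step can be sketched in plain Python as below. This is the basic Lanczos three-term recurrence for a symmetric matrix, without reorthogonalization; the function names are hypothetical, and the cuBLAS/MKL implementation from the talk is not shown in the source. The dominant cost per iteration is the matrix-vector product in `matvec`.

```python
import math

def matvec(a, v):
    # Dense matrix-vector product; the dominant cost of each Lanczos step.
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def lanczos(a, v0, m):
    # Returns the diagonal (alphas) and off-diagonal (betas) of the
    # tridiagonal matrix T built from m Lanczos steps on symmetric a.
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(m):
        w = matvec(a, v)
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        # Three-term recurrence: remove components along v and v_prev.
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(sum(x * x for x in w))
        if len(alphas) == m or beta < 1e-12:
            break  # requested size reached, or an invariant subspace was found
        betas.append(beta)
        v_prev, v = v, [x / beta for x in w]
    return alphas, betas
```

In finite precision the Lanczos vectors are known to lose orthogonality as the iteration count grows, so practical codes usually add some form of reorthogonalization; whether that relates to the convergence problems mentioned on the slide is not stated in the source.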
  30. Lanczos Diagonalization
  31. Lanczos Diagonalization
      ● CUDA 5.5/Kepler overall yields pleasing communication results (CUDA-enabled Open MPI 1.7.3, MPI send/recv), collectives less impressive
      ● Bandwidths for one-sided comms have some message size dependency & jitter, but effective bandwidth is much improved over previous generations
      [Figure: bandwidth (GB/s) vs. increasing message size (MB) within a single application]
         ● Results of 4 tests
         ● RHEL 6, Intel x86_64, Nvidia driver 331.38
         ● Communication between K20 & K40
  33. CUDA needs + wants
      ● ODE and Function Solver(s), metabolic networks, chemically reactive flows w/ OpenFOAM → support for more C++11 language features?
      ● Lanczos Diagonalization, DFT/quantum chemistry, incorporation into Quantum Espresso → further improvements to GPUDirect (or use new multi-GPU interfaces instead)?
      ● Batch LU/QR → increased warp size?
  34. Summary
      ● Early adopters (astrophysics, quantum chem/condensed matter) are still active; most growth is seen in strands of computational biology/life science and 'big data'
      ● Teaching seminars generally well received/attended, but...
      ● Most success comes from working to identify users/codes that can benefit from GPU by monitoring clusters, and on a related note...
      ● The harvest is plentiful in academia but the workers are few; generally if a code 'works' there is little pressure to make it better
      ● However, changes even in traditional CPU architecture are forcing workers to reevaluate their computational models (thanks Ken Esler for this perspective); we live more and more in a parallel world
  35. Acknowledgements
      ● Mark Berger, Chandra Cheij & Nvidia for generous donations
      ● {Ryan Eagen/Cowen group, Ali Khodayari/Maranas group, Sreejith Jaya Ganesh, Jim Kubicki, Dan Haworth, Adri Van Duin} PSU
      ● {Chuck Gilbert, Jason Holmes} long-suffering sys admins
      ● HP for donation of 50 M2070
      ● XSEDE/TACC for Stampede cycles