Nvidia GTC 2014 Talk

Penn State RCC has been a CUDA research center for the last year; this talk provides success stories and challenges. GPU case studies are given, including algorithm details and performance results.

Transcript

  • 1. 04/01/14 1 Establishing a CUDA Research Center at Penn State: Perspectives on GPU-Enabled Teaching and Research William J. Brouwer (wjb19@psu.edu) Pierre-Yves Taunay (py.taunay@psu.edu) Research Computing and Cyberinfrastructure The Pennsylvania State University Nvidia GTC 2014
  • 2. Outline
    ● Center Overview (RCC @ PSU)
    ● GPU accelerated research
        ● IceCube
        ● Metabolic Networks (Fsolve/cuSolve)
        ● MD + Simulated Annealing
        ● FQHE (LU Decomposition)
        ● Smart Proppants (QR Decomposition)
    ● GPU cluster scaling
        ● Amber
        ● PetaChem
        ● Quantum Espresso – Lanczos Diagonalization
    ● CUDA, needs + wants
    ● Summary
  • 3. Center Overview
    ● Research Computing and Cyberinfrastructure (RCC) at PSU provides high performance computing services:
        ● Hardware, proprietary/open source software
        ● Consultation (numerical/algorithmic, software development, etc.)
    ● PhDs, system admins and programmers work together to provide these services to academics while performing independent research
    ● Many users are interested in using GPUs for science and engineering research applications; we are a CUDA research center: https://research.nvidia.com/content/penn-state-crc-summary
    ● Formerly under ITS, currently being incorporated into the Office of the Vice President for Research (OVPR)
  • 4. Center Overview
    ● Hardware: ~12K CPU cores, 64 GPUs (Fermi), several Kepler
    ● Red Hat Linux, scheduling via PBS/Moab/Torque
    ● Usual monitoring/management tools, e.g., Puppet, Jenkins, Nagios, Ganglia, and some custom solution(s) (e.g., CLPR)
    ● Serve ~7k users across all campuses in the commonwealth
    ● Use CUDA predominantly, although growing numbers of users are trying OpenACC, OpenCL, libraries, etc.
    ● Environment modules system
  • 5. Center Overview
    ● Support many GPU accelerated applications
  • 7. IceCube
  • 8. Metabolic Networks
    ● Optimal models for the metabolic networks of microbial organisms are important in the pharmaceutical and energy industries
    ● Ensemble Modeling (EM) is used to construct the chemical kinetics of microbial organisms → decompose metabolic reactions into elementary mechanisms, which are ODE systems f(k_i, y_j) = dy_j/dt
    ● The overall approach maximizes correlation between model predictions and experimental measurements, performed in steady state → solve f(k,y) = 0
  • 9. Metabolic Networks
    ● [CPU] Parse equations f(k,y)
    ● [CPU] Differentiate f(k,y), create analytic Jacobian J(k,y)
    ● [CPU] Populate data structures representing f(k,y) and J(k,y), copy to GPU
    ● [GPU] Iterate (Newton-Raphson) →
        ● Numerically evaluate f(k,y) and J(k,y) by parallel reduction
        ● Solve for delta in J(k,y) . delta = -f(k,y) using GMRES
        ● Update y += delta and repeat until ||f(k,y)|| < tol
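The Newton-Raphson loop on this slide can be sketched on the CPU in a few lines of Python. This is only an illustration under stated assumptions: the two-species system, the rate constants `K`, and the names `solve_linear` and `newton_steady_state` are hypothetical and not taken from the talk, and a small dense Gaussian elimination stands in for the GMRES solve that the slide describes for sparse systems on the GPU.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Stand-in for GMRES; assumes a is square and non-singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def newton_steady_state(f, jac, y, tol=1e-10, max_iter=50):
    """Iterate y += delta, with J(y) delta = -f(y), until max|f(y)| < tol."""
    for _ in range(max_iter):
        fy = f(y)
        if max(abs(v) for v in fy) < tol:
            return y
        delta = solve_linear(jac(y), [-v for v in fy])
        y = [yi + di for yi, di in zip(y, delta)]
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical toy network (not from the talk): constant inflow K[0],
# first-order conversion K[1], second-order consumption K[2].
K = (2.0, 4.0, 0.5)


def f(y):
    return [K[0] - K[1] * y[0], K[1] * y[0] - K[2] * y[1] ** 2]


def jac(y):
    return [[-K[1], 0.0], [K[1], -2.0 * K[2] * y[1]]]


steady = newton_steady_state(f, jac, [1.0, 1.0])
```

For this toy system the steady state can be checked by hand: y0 = K[0]/K[1] and y1 = sqrt(K[0]/K[2]).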
  • 10. Metabolic Networks
    ● Solution uses various libraries including Boost, Thrust, CUSP and CUDA
    ● Matrices are sparse and poorly conditioned, but the solution works well for O(10^2) equations
    ● Currently working to scale to larger, more interesting networks and microbial organisms
    ● cuSolve is a work in progress, a GPU-only ODE solver for stiff equations
  • 11. Molecular Dynamics + Simulated Annealing
    ● Solve for MD potentials by fitting experimental data for the structure factor
    ● Optimization surface (shown on the slide) is highly non-convex → use simulated annealing, each GPU performs an independent MD run
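A minimal sketch of simulated annealing on a non-convex surface follows. Everything here is an assumption made for illustration: the one-dimensional `misfit` function merely stands in for "run MD with a candidate potential and compare against the measured structure factor", and the step size, cooling schedule and seeds are arbitrary. The independent runs are mimicked by looping over seeds rather than by using separate GPUs.

```python
import math
import random


def simulated_annealing(cost, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_iter=5000, seed=0):
    """Metropolis-style annealing over a single real parameter."""
    rng = random.Random(seed)
    x, fx = x0, cost(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(n_iter):
        cand = x + rng.uniform(-step, step)
        fc = cost(cand)
        # Always accept improvements; accept uphill moves with
        # probability exp(-(fc - fx) / t), which shrinks as t cools.
        if fc <= fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f


def misfit(x):
    """Hypothetical non-convex objective with several local minima."""
    return (x - 2.0) ** 2 + 1.5 * math.sin(5.0 * x)


# Independent runs (one per seed here) and the best result among them.
start = -3.0
results = [simulated_annealing(misfit, start, seed=s) for s in range(4)]
best_x, best_f = min(results, key=lambda r: r[1])
```

Because the runs share no state, they can be farmed out to separate devices and only the final results need to be compared.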
  • 12. LU Decomposition
    ● Batch LU decomposition developed for the fractional quantum Hall effect (FQHE), fundamental physics that has implications for quantum computation and materials science
    ● O(N!) determinants need to be evaluated in constructing the wavefunction, and the process is repeated many times in a Monte Carlo calculation
    ● Small, dense matrices of side <= 512
    ● Implementation exploits SIMD architecture and parallel reduction
    ● Example: for N=11, computation time using 8 GPU devices (w/ MPI) and 1024 Monte Carlo iterations is ~246 seconds, versus ~31488 on a single CPU
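As a CPU-side illustration of why LU decomposition is the workhorse here, the sketch below computes determinants for a batch of small dense matrices by LU-style elimination with partial pivoting. The function names `det_lu` and `batch_det` are hypothetical, and the plain Python loop over the batch only stands in for the batched GPU implementation described on the slide.

```python
def det_lu(a):
    """Determinant via in-place LU elimination with partial pivoting.

    The determinant is the product of the pivots, with a sign flip
    for every row swap.
    """
    n = len(a)
    m = [row[:] for row in a]  # work on a copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col + 1, n):
                m[r][c] -= factor * m[col][c]
    return det


def batch_det(batch):
    """Each matrix is independent, so this loop is trivially parallel."""
    return [det_lu(a) for a in batch]


dets = batch_det([
    [[4.0, 3.0], [6.0, 3.0]],
    [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]],
    [[1.0, 2.0], [2.0, 4.0]],
])
```

The three example matrices have determinants -6, 24 and 0 respectively, which can be confirmed by hand.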
  • 13. LU Decomposition
  • 14. QR Decomposition
    ● Proppant materials are used to stabilize fissures created during hydraulic fracturing
    ● 'Smart proppants' are essentially electrical dipoles which may absorb and re-emit EM energy, irradiated and recorded by downhole instrumentation
    ● This work considers an iteration-free solution to this EM scattering problem, using linear algebra including LU decomposition and SVD
    ● SVD can be performed using the QR algorithm, which in turn relies on QR decomposition
    ● Devised a unique approach for large batches of small dense matrices using Givens rotations; largely independent ops, maps well to GPU
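The textbook form of QR decomposition by Givens rotations can be sketched as follows. This is not the batched GPU scheme from the talk, whose details the slide does not give; `givens_qr` is a hypothetical name, and the code processes one small matrix sequentially. Each rotation touches only two rows, which is the property that lets rotations on disjoint row pairs proceed independently.

```python
import math


def givens_qr(a):
    """Return (q, r) with a = q r, q orthogonal and r upper triangular."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for col in range(n):
        # Zero the sub-diagonal entries of this column from the bottom up.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            # Apply the rotation G to rows (row-1, row) of r.
            for k in range(n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v
            # Accumulate q = q G^T on columns (row-1, row).
            for k in range(m):
                u, v = q[k][row - 1], q[k][row]
                q[k][row - 1] = c * u + s * v
                q[k][row] = -s * u + c * v
    return q, r


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q, r = givens_qr(a)
# Rebuild a from the factors to check the decomposition.
qr = [[sum(q[i][k] * r[k][j] for k in range(3)) for j in range(2)]
      for i in range(3)]
```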
  • 15. QR Decomposition
  • 17. GPU Cluster Scaling
    ● Several key GPU accelerated software suites were tested using multiple GPUs across two clusters

    Cluster                     Lion-GA                            Stampede
    CPU                         12 X5675 @ 3.07 GHz                16 E5-2680 @ 2.70 GHz
    GPU                         8 M2070 or 8 M2090                 1 K20c
    Nodes equipped with GPUs    8                                  120
    Interconnect                40 Gb/s Mellanox QDR Infiniband    56 Gb/s Mellanox FDR Infiniband
  • 18. GPU Cluster Scaling
    ● Lion-GA cluster has 3 GPUs per PCIe switch, 3 to 5 GPUs per IOH chip
    ● IOH doesn't support peer-to-peer transfers between GPU devices on different chipsets
    ● Difficult to achieve peak transfer rates between GPUs on different sockets
  • 19. Amber
    ● Molecular Dynamics is widely used for simulation of solvated proteins or molecules and makes use of various force fields (AMBER, ReaxFF, etc.)
    ● The AMBER force field is implemented in the eponymous software suite
    ● The PMEMD program in AMBER is used for both explicit solvent Particle Mesh Ewald (PME) and implicit solvent Generalized Born (GB) simulations
    ● AMBER does not require extensive communication between GPUs or between CPU and GPU, and does not take advantage of the CPU if GPUs are used
    ● GPU acceleration allows for longer simulation times, on the order of nanoseconds or more
  • 20. Amber
    [Chart: PME simulation of DHFR protein in water (NPT ensemble, 23,558 atoms). Achieved performance on Lion-GA in ns/day (axis 0 to 80) for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090]
  • 21. Amber
    [Chart: PME simulation of FactorIX molecule in water (NPT ensemble, 90,906 atoms). Achieved performance on Lion-GA in ns/day (axis 0 to 18) for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090]
  • 22. Amber
    [Chart: PME simulation of Cellulose molecule in water (NPT ensemble, 408,609 atoms). Achieved performance on Lion-GA in ns/day (axis 0 to 4.5) for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090]
  • 23. Amber
    [Chart: Implicit solvent GB simulation of Myoglobin (2,492 atoms). Achieved performance on Lion-GA in ns/day (axis 0 to 200) for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090]
  • 24. Amber
    [Chart: Implicit solvent GB simulation of Nucleosome (25,095 atoms). Achieved performance on Lion-GA in ns/day (axis 0 to 7) for 12 X5675, 2 M2090, 4 M2090, 6 M2090, 8 M2090]
  • 25. PetaChem
    ● Quantum chemistry package designed to run on NVIDIA hardware
    ● Features restricted Hartree-Fock and grid-based Kohn-Sham single point energy and gradient calculations
    ● Various functions supported: geometry optimization, ab-initio molecular dynamics, support for multi-GPU
    ● Benchmark: single point energy for Olestra, using the 6-31g basis
  • 26. PetaChem
    [Chart: PetaChem Olestra SCF calculation. Total walltime in s on Lion-GA (axis 0 to 600) for 1 M2070, 3 M2070, 5 M2070, 7 M2070]
  • 27. Quantum Espresso
    ● Density Functional Theory (DFT) has enjoyed huge growth in popularity owing to computational and numerical advancements; used widely in materials science
    ● Quantum Espresso (QE) is an open source DFT package that has recently added GPU acceleration, largely through BLAS and FFT routines
    ● When building QE with MAGMA (UT/ORNL) or phiGEMM, one introduces heterogeneous CPU/GPU linear algebra routines
    ● Benchmark:
        ● Self-consistent field calculation, using PBE pseudopotentials, 168 atoms (cellulose)
        ● Periodic boundary conditions, kinetic energy cutoff for charge density of 80 Ry, Davidson diagonalization
  • 28. Quantum Espresso
    [Chart: SCF calculation for cellulose. Total walltime in hrs on Stampede@TACC (axis 0 to 7) for 1 K20, 2 K20, 4 K20, 8 K20, 16 K20, 32 K20]
  • 29. Lanczos Diagonalization
    ● A key task in many applications, especially quantum chemistry & DFT, is diagonalization, i.e., matrix eigen-decomposition
    ● Lanczos is an adaptation of the power method that produces a tridiagonal matrix, which is more readily solvable; it consists of many matrix-vector operations, very amenable to GPU; currently using cuBLAS & MKL in a heterogeneous solution
    ● Originally devised for a fundamental physics project at PSU, now intended for incorporation into the GPU-Quantum Espresso project being led by Filippo Spiga
    ● Attempting to scale to multiple devices using MPI + GPUdirect; still beset by some numerical/convergence problems with increasing matrix size
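The basic Lanczos recurrence for a real symmetric matrix can be sketched as below. It is a plain CPU illustration with hypothetical names (`lanczos`, `dense_matvec`), not the cuBLAS/MKL implementation from the talk, and it performs no reorthogonalization. The only operation that touches the matrix is the matrix-vector product, which is the part that would be offloaded to the GPU.

```python
import math


def dense_matvec(a):
    """Wrap a dense matrix (list of rows) as a matrix-vector product."""
    def matvec(v):
        return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]
    return matvec


def lanczos(matvec, v0, m):
    """Run up to m Lanczos steps on a symmetric operator.

    Returns the diagonal (alphas) and off-diagonal (betas) of the
    tridiagonal matrix T. Stops early if the Krylov space is exhausted.
    """
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v0)
    alphas, betas = [], []
    beta = 0.0
    for j in range(m):
        w = matvec(v)
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        # Three-term recurrence: remove components along v and v_prev.
        w = [wi - alpha * vi - beta * pi
             for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(sum(x * x for x in w))
        if j == m - 1 or beta < 1e-12:
            break
        betas.append(beta)
        v_prev, v = v, [x / beta for x in w]
    return alphas, betas


# Small symmetric test matrix that is already tridiagonal, so starting
# from the first unit vector the recurrence reproduces it.
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
alphas, betas = lanczos(dense_matvec(A), [1.0, 0.0, 0.0], 3)
```

The eigenvalues of the small tridiagonal matrix built from `alphas` and `betas` then approximate those of the original operator, and can be found with any standard tridiagonal eigensolver.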
  • 30. Lanczos Diagonalization
  • 31. Lanczos Diagonalization
    ● CUDA 5.5/Kepler overall yields pleasing communication results (CUDA-enabled Open MPI 1.7.3, MPI send/recv); collectives are less impressive
    ● Bandwidths for one-sided comms have some message size dependency and jitter, but effective bandwidth is much improved over previous generations
    [Plot: bandwidth (GB/s) vs. increasing message size (MB), within a single application]
    ● Results of 4 tests
    ● RHEL 6, Intel x86_64, Nvidia driver 331.38
    ● Communication between K20 & K40
  • 33. CUDA needs + wants
    ● ODE and function solver(s), metabolic networks, chemically reactive flows w/ OpenFOAM → support for more C++11 language features?
    ● Lanczos diagonalization, DFT/quantum chemistry, incorporation into Quantum Espresso → further improvements to GPUdirect (or use new multi-GPU interfaces instead)?
    ● Batch LU/QR → increased warp size?
  • 34. Summary
    ● Early adopters (astrophysics, quantum chemistry/condensed matter) are still active; most growth is seen in strands of computational biology/life science and 'big data'
    ● Teaching seminars are generally well received/attended, but...
    ● Most success comes from working to identify users/codes that can benefit from GPUs by monitoring clusters, and on a related note...
    ● The harvest is plentiful in academia but the workers are few; generally, if a code 'works' there is little pressure to make it better
    ● However, changes even in traditional CPU architecture are forcing workers to reevaluate their computational models (thanks to Ken Esler for this perspective); we live more and more in a parallel world
  • 35. Acknowledgements
    ● Mark Berger, Chandra Cheij & Nvidia for generous donations
    ● {Ryan Eagen/Cowen group, Ali Khodayari/Maranas group, Sreejith Jaya Ganesh, Jim Kubicki, Dan Haworth, Adri Van Duin} PSU
    ● {Chuck Gilbert, Jason Holmes} long-suffering sys admins
    ● HP for donation of 50 M2070
    ● XSEDE/TACC for Stampede cycles