SpeedIT : GPU-based acceleration of sparse linear algebra

SpeedIT provides partial acceleration of sparse linear solvers. The acceleration is achieved with a single, reasonably priced NVIDIA Graphics Processing Unit (GPU) that supports CUDA, combined with proprietary advanced optimisation techniques.

Check also SpeedIT FLOW, our RANS single-phase flow solver that runs fully on the GPU: vratis.com/blog

Transcript

  • 1. Multi-GPU simulations in OpenFOAM with SpeedIT technology.
  • 2. Attempt I: SpeedIT
    - GPU-based library of iterative solvers for Sparse Linear Algebra and CFD.
    - Current version: 2.2. Version 1.0 released in 2008.
    - CMRS format for fast SpMV, plus other storage formats (see the CSR sketch after this list).
    - BiCGSTAB and CG iterative solvers.
    - Preconditioners: Jacobi, AMG.
    - Runs on one or more GPU cards.
    - OpenFOAM compatible.
    - Possible applications: quantum chemistry, semiconductor and power network design, CFD (OpenFOAM), etc.
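    For reference, a minimal sketch of plain CSR storage, the interchange format the OpenFOAM plugin on the next slide converts to and from (CMRS, SpeedIT's compressed multi-row variant, is not detailed on the slides); the matrix and its values are illustrative only:

        // Minimal CSR representation of the 3x3 matrix
        //   [ 4 1 0 ]
        //   [ 0 3 0 ]
        //   [ 2 0 5 ]
        // rowPtr[i] .. rowPtr[i+1] delimit the nonzeros of row i, so the
        // mean row length mu discussed later is simply nnz / nRows.
        #include <vector>

        struct CsrMatrix {
            int nRows;
            std::vector<int>   rowPtr;  // size nRows + 1
            std::vector<int>   colIdx;  // column index of each nonzero
            std::vector<float> val;     // value of each nonzero
        };

        int main() {
            CsrMatrix A;
            A.nRows  = 3;
            A.rowPtr = {0, 2, 3, 5};
            A.colIdx = {0, 1, 1, 0, 2};
            A.val    = {4.f, 1.f, 3.f, 2.f, 5.f};
            return 0;
        }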
  • 3. SpeedIT and OpenFOAM
    Plugin to OpenFOAM:
    - GPL-based generic plugin to OpenFOAM; any acceleration toolkit can now be integrated.
    - OpenFOAM source code is unchanged thanks to the plugin concept; no recompilation of OpenFOAM is necessary.
    SpeedIT integration:
    - Provides conversion between the CSR and OpenFOAM LDU data formats.
    - Easy substitution of CG/BCG with their GPU versions (see the dictionary sketch below):
    1. Edit the system/controlDict file and add libSpeedIT.so.
    2. Edit system/fvSolution and replace CG with CG_accel.
    3. Compile the plugin to OpenFOAM (wmake).
    4. Complete the library dependencies: $FOAM_USER_LIBBIN should contain the GPU-based libraries.
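    The four numbered steps reduce to two small dictionary edits. A minimal sketch, with libSpeedIT.so and CG_accel taken from the slide; the preconditioner and tolerance entries are illustrative assumptions, not the plugin's documented keywords:

        // system/controlDict -- load the plugin at run time
        libs ("libSpeedIT.so");

        // system/fvSolution -- switch the pressure solver to the GPU version
        solvers
        {
            p
            {
                solver          CG_accel;   // GPU CG provided by the plugin
                preconditioner  AMG;        // illustrative
                tolerance       1e-06;      // illustrative
                relTol          0;
            }
        }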
  • 4. SpeedIT vs. CUDA 5.0
    SpMV is a key operation in scientific software.
    - The HYB (hybrid) kernel gives the shortest SpMV times for small µ, but its efficiency decreases as µ is increased, and the scalar kernel is very inefficient for large µ: its inefficiency is related to its inability to coalesce data transfers when µ is large.
    - The vector kernel behaves the opposite way: its efficiency relative to CMRS is very good for large µ, but it decreases as µ drops below ≈ 100. This is related to the fact that the vector kernel processes each row with a warp of 32 threads, so it is most efficient for sparse matrices with µ well above the warp size. Interestingly, for 10 ≲ µ ≲ 120 the speedup of CMRS over the vector kernel can be approximated by a power law.
    - CMRS tends to be systematically better than the HYB format for µ ≳ 20; hence the advantages of CMRS are most pronounced there.
    Fig. Speedup of our CMRS SpMV implementation against the scalar, vector (CSR) and HYB kernels as a function of the mean number of nonzero elements per row (µ); results for lp1-like matrices are omitted for clarity.
    Fig. Speedup of our CMRS implementation against the best of all five alternative SpMV implementations for a given matrix.
    Fig. Speedup of our CMRS implementation as a function of σ, which characterises the spread of the number of nonzero elements (ni) per row; only matrices such that […] are taken into account. Inset: enlargement of the plot for σ ≤ 150.
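    To make the scalar-kernel bottleneck concrete, here is the classic one-thread-per-row CSR kernel (in the style of the standard CUDA SpMV kernels, not SpeedIT's CMRS code): each thread walks its whole row serially, and for large µ neighbouring threads load val and colIdx from far-apart addresses, so transfers cannot be coalesced.

        // Scalar CSR SpMV, y = A*x: one thread per row.
        // Fast for short rows (small mu); for long rows the serial inner
        // loop and uncoalesced loads dominate the run time.
        __global__ void spmv_csr_scalar(int nRows,
                                        const int   *rowPtr,
                                        const int   *colIdx,
                                        const float *val,
                                        const float *x,
                                        float       *y)
        {
            int row = blockIdx.x * blockDim.x + threadIdx.x;
            if (row < nRows) {
                float sum = 0.0f;
                for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                    sum += val[j] * x[colIdx[j]];
                y[row] = sum;
            }
        }

    The vector kernel instead assigns a 32-thread warp to each row and finishes with a warp-level reduction, which is why it only pays off once µ approaches or exceeds the warp width.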
  • 5. OpenFOAM & single GPU
    Low-cost hardware (GPU cost: $200):
    - CPU: Intel Core 2 Duo E8400, 3 GHz, 8 GB RAM @ 800 MHz.
    - GPU: NVIDIA GTX 460, 1 GB VRAM.
    - Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1.
    Test cases:
    - Cavity 3D, 512K cells, icoFoam.
    - Human aorta, 500K cells, simpleFoam.
    - Pressure equation solved on the CPU with DIC-preconditioned CG or with GAMG, and on the GPU with AMG-preconditioned CG (the CPU dictionary entries are sketched below).
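    The two CPU baselines correspond to standard OpenFOAM fvSolution entries; a sketch with illustrative tolerances (a production GAMG entry would also set agglomeration parameters):

        // CPU baseline 1: CG preconditioned with DIC
        p
        {
            solver          PCG;
            preconditioner  DIC;
            tolerance       1e-06;
            relTol          0;
        }

        // CPU baseline 2 (replaces the entry above): multigrid
        p
        {
            solver          GAMG;
            smoother        GaussSeidel;
            tolerance       1e-06;
            relTol          0;
        }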
  • 6. OpenFOAM & single GPU
    Validation
    Fig. Velocity and pressure profiles along the x axis for the aorta (top left) and the pressure profile along the x axis for cavity3D (top right); the solution agrees for all preconditioners. Cross-sections of velocity in the X and Y directions for cavity3D run with OpenFOAM and SpeedIT (lines), compared with literature data for Re=100 (dotted line) and Re=400 (solid line), without a turbulence model (icoFoam).
  • 7. OpenFOAM & single GPU
    Performance
    [Bar charts: execution times for the four solver configurations and the resulting acceleration factors; SpeedIT speedups range from 0.76x to 3.77x depending on solver, case and core count.]
    Fig. Execution times of 50 time steps (top) and acceleration factors for simpleFoam (left) and pisoFoam (right), measured on an Intel E8400 @ 3 GHz with a GTX 460. 1C and 2C stand for one and two cores, respectively.
  • 8. OpenFOAM multi-GPU
    - Motivation: large geometries (over 8M cells) do not fit on a single GPU card (max. 6 GB RAM).
    - OpenFOAM performed the domain decomposition; MPI was used for communication between the GPU cards, with one thread managing one GPU card (see the sketch below).
    - Tests were performed on the IBM PLX cluster at CINECA, with two six-core Intel Westmere 2.40 GHz CPUs and two Tesla M2070 cards per node (supported by NVIDIA).
    - Tests focused on multi-GPU preconditioned CG with AMG to solve the pressure equation in steady-state flows.
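    A minimal sketch of the one-thread-one-GPU binding described above, assuming an MPI-3 library for the per-node rank split (the slide does not show SpeedIT's actual launcher code):

        // Bind each MPI rank to one GPU card on its node.
        #include <mpi.h>
        #include <cuda_runtime.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int worldRank, localRank, nDevices;
            MPI_Comm nodeComm;
            MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
            // Group the ranks that share a node, then number them locally.
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                                worldRank, MPI_INFO_NULL, &nodeComm);
            MPI_Comm_rank(nodeComm, &localRank);

            cudaGetDeviceCount(&nDevices);        // e.g. 2 Tesla M2070 per node
            cudaSetDevice(localRank % nDevices);  // one rank drives one card

            // ... per-subdomain CG+AMG solve; halo values exchanged via MPI ...

            MPI_Comm_free(&nodeComm);
            MPI_Finalize();
            return 0;
        }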
  • 9. OpenFOAM multi-GPU
    [Bar chart, CPU vs. GPU: times for CPU PCG with GAMG, CPU PCG with a diagonal preconditioner, and SpeedIT at 6, 8, 16 and 32 GPUs/cores; SpeedIT speedups over the CPU runs range from x1.15 to x1.63.]
    Fig. Time in seconds of the first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal (blue) and GAMG (green) preconditioners; SpeedIT is marked in red. The geometry is 32 million cells. Tests were performed at PLX CINECA.
  • 10. Scaling of OpenFOAM multi-GPU
    - Complex car geometry, external steady-state flow, simpleFoam, 32M cells; pressure: GAMG, velocity: smoothGS.
    [Charts: execution times fall with GPU count; scaling relative to 8 GPUs reaches 1.93, 3.63, 7.11 and 9.23 at 16, 32, 64 and 88 GPUs, against the ideal 2, 4, 8 and 11.]
    Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPUs (right).
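    Read against the ideal-scaling line, these figures correspond to a parallel efficiency E(N) = S(N) / (N/8): for example E(88) = 9.23 / 11 ≈ 0.84, i.e. the solver still runs at roughly 84% efficiency on 88 GPUs.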
  • 11. OpenFOAM GPU for industries
    Authors: Saeed Iqbal and Kevin Tubbs. The full report can be obtained at the Dell Tech Center.
    Figure 1: OpenFOAM performance of the 3D cavity case using 4 million cells on a single node.
  • 12. OpenFOAM GPU for industries
    Authors: Saeed Iqbal and Kevin Tubbs.
    Figure 2: Total power and power efficiency of the 3D cavity case on 4 million cells on a single node.
  • 13. OpenFOAM GPU for industries
    Authors: Saeed Iqbal and Kevin Tubbs.
    Figure 1: OpenFOAM performance of the 3D cavity case using 8 million cells on a single node.
  • 14. OpenFOAM GPU for industries
    Authors: Saeed Iqbal and Kevin Tubbs.
    Figure 2: Total power and power efficiency of the 3D cavity case on 8 million cells on a single node.
  • 15. SpeedIT 3.0
    What's new:
    - New ILU-based preconditioner.
    - Support for Kepler NVIDIA cards:
      - much higher bandwidth (300 GB/s),
      - 5x faster inter-GPU communication thanks to GPU Direct 2.0 (see the sketch below).
    - Release at the beginning of 2013.
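    The GPU Direct 2.0 item refers to CUDA peer-to-peer transfers between cards; a minimal sketch using the standard CUDA runtime calls (device IDs 0 and 1 are illustrative):

        // Let GPU 0 access GPU 1's memory directly, so halo data moves
        // card-to-card without staging buffers in host RAM.
        #include <cuda_runtime.h>

        int main()
        {
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, 0, 1);
            if (canAccess) {
                cudaSetDevice(0);
                cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
                // cudaMemcpyPeer(dst, 1, src, 0, nBytes) now performs a
                // direct GPU-to-GPU copy over the PCIe bus.
            }
            return 0;
        }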
  • 16. Acknowledgments
    - Saeed Iqbal and Kevin Tubbs (DELL)
    - Stan Posey, Edmondo Orlotti (NVIDIA)
    - CINECA, PRACE
    Vratis Ltd., Muchoborska 18, Wroclaw, Poland
    Email: lukasz.miroslaw@vratis.com
    More information: speed-it.vratis.com, vratis.com/blog