Using GPUs to accelerate nonstiff and stiff chemical kinetics in combustion simulations
Kyle Niemeyer
School of Mechanical, Industrial, and Manufacturing Eng.
Oregon State University
20 April 2015
Benefit of GPU computing (for combustion)
Two avenues:
• Exascale science on supercomputers (image: https://www.olcf.ornl.gov/titan/)
• High-fidelity engineering on workstations
Challenges of GPU computing (for combustion)
Two challenges:
• Design algorithms/strategies to reduce computational expense
• Identify appropriate algorithms for equal or better performance
GPU
• Graphics Processing Unit
• Developed to process & display thousands of pixels
• Throughput over latency: massive parallelism
[Image: GPU card specification sheet listing 448 CUDA cores and a 9.75” PCIe x16 form factor]
Modern GPU hardware architecture
[Figure: Fermi-class GPU hardware, consisting of up to 16 streaming multiprocessors (SMs)]
Source: Brodtkorb AR, Hagen TR, Sætra ML. J Parallel Distrib Comput 2013;73:4–13.
Using GPUs
• Parallel function: “kernel”
• Hundreds to millions of concurrent threads
• Executed in 32-thread “warps”
• Challenge: thread divergence
[Figure: threads within a warp taking different if/else branches (thread divergence)]
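To make the cost of thread divergence concrete, here is a minimal sketch in Python of a toy cost model. It assumes only that the 32 threads of a warp run in lockstep, so a warp stays busy until its slowest thread finishes. The function name `warp_cost` and the workload numbers are hypothetical and are not taken from the talk.

```python
# Toy cost model (an assumption, not a hardware simulator): threads in a warp
# run in lockstep, so a warp is busy until its slowest thread finishes.
WARP_SIZE = 32  # threads per warp, as stated on the slide


def warp_cost(steps_per_thread, warp_size=WARP_SIZE):
    """Total lockstep iterations summed over all warps."""
    total = 0
    for start in range(0, len(steps_per_thread), warp_size):
        warp = steps_per_thread[start:start + warp_size]
        total += max(warp)  # every thread waits for the slowest one
    return total


# Hypothetical workload: 64 ODEs, half need 10 integrator substeps, half need 100.
grouped = [10] * 32 + [100] * 32   # similar work placed in the same warp
interleaved = [10, 100] * 32       # dissimilar work mixed within each warp

print(warp_cost(grouped))      # 110
print(warp_cost(interleaved))  # 200
```

In this model the same total work costs nearly twice as much when cheap and expensive threads share a warp, which is why adaptive integrators with data-dependent step counts are sensitive to how ODEs are assigned to threads.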
Problem Description
[Figure: grid of independent ODE systems]
• Large number of independent ODEs to solve
• Can be even more for turbulent combustion!

$$\frac{dY_i}{dt} = \frac{W_i}{\rho}\,\dot{\omega}_i$$

$$\begin{pmatrix} \dfrac{dY_1}{dt} \\ \dfrac{dY_2}{dt} \\ \vdots \\ \dfrac{dY_k}{dt} \end{pmatrix}$$
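As a worked instance of the species equation dY_i/dt = (W_i/ρ) ω̇_i, the sketch below evaluates the right-hand side for one hypothetical reaction, 2 H2 + O2 → 2 H2O. The function name `species_rhs`, the reaction rate `q`, and the density are illustrative assumptions and do not come from any mechanism used in the talk.

```python
def species_rhs(molar_masses, molar_production_rates, density):
    """Return dY_i/dt = W_i * omega_dot_i / rho for every species i."""
    return [W * w_dot / density
            for W, w_dot in zip(molar_masses, molar_production_rates)]


# Hypothetical single reaction 2 H2 + O2 -> 2 H2O proceeding at rate q.
q = 1.5                                    # kmol/(m^3 s), illustrative value
W = [2.016, 31.998, 18.015]                # kg/kmol for H2, O2, H2O
omega_dot = [-2.0 * q, -1.0 * q, 2.0 * q]  # kmol/(m^3 s), from stoichiometry
rho = 0.8                                  # kg/m^3, illustrative value

dYdt = species_rhs(W, omega_dot, rho)      # units: 1/s
print(dYdt)
print(sum(dYdt))  # close to zero: mass is conserved
```

Because reactions conserve mass, the mass-fraction rates sum to zero up to round-off, which makes a handy sanity check on any right-hand-side routine.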
GPU Algorithms
• Traditionally solved with implicit algorithms: complex logical flow → significant thread divergence
Prior Efforts: Implicit
Implicit algorithms*: not well-suited for GPU acceleration

Study                       Algorithm                        Speedup vs. single-core CPU
Stone & Davis 2013          DVODE                            7.7×
Sewerin & Rigopoulos 2015   Implicit Runge–Kutta (Radau5)    4.8×

*Thus far
GPU Algorithms
• Traditionally solved with implicit algorithms: complex logical flow → significant thread divergence
• Instead, what about explicit algorithms?
Prior Efforts: Explicit

Study                   Algorithm                         Speedup vs. single-core CPU   # Species
Spafford et al. 2010    4th-order Runge–Kutta*            9×                            22
Niemeyer et al. 2011    4th-order Runge–Kutta             75×                           9
Shi et al. 2012         CHEMEQ2                           3–13×                         39 & 117
Stone & Davis 2013      4th-order Runge–Kutta–Fehlberg    28.6×                         19 (reduced)

*Species production terms only
GPU Algorithms
• For nonstiff chemistry: explicit Runge–Kutta–Cash–Karp
- H2: 9 species & 38 reactions
• Moderately stiff* chemistry: stabilized explicit Runge–Kutta–Chebyshev
- H2/CO: 13 species & 27 reactions
- CH4: 53 species & 325 reactions
- C2H4: 111 species & 784 reactions
• Custom CPU & GPU source code
Initial Condition Sampling
Runge–Kutta–Cash–Karp
• Fifth-order accuracy
• Adaptive time stepping
• Global time step: 1×10⁻⁸ s
• “Nonstiff” hydrogen mechanism¹
• Number of ODEs ranges from 10 to 10⁷
¹ Yetter, Dryer, and Rabitz, CST 79 (1991) 97–128
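For readers who want to see what an adaptive Runge–Kutta–Cash–Karp integrator looks like, here is a minimal serial sketch in Python. It uses the standard published Cash–Karp coefficients. The step-size controller (safety factor 0.9, growth limits 0.1 and 5), the tolerance, and the names `rkck_step` and `integrate` are assumptions chosen for illustration, not the custom CPU/GPU code described in the talk.

```python
import math

# Cash-Karp tableau (standard published coefficients).
C = [0.0, 1/5, 3/10, 3/5, 1.0, 7/8]
A = [
    [],
    [1/5],
    [3/40, 9/40],
    [3/10, -9/10, 6/5],
    [-11/54, 5/2, -70/27, 35/27],
    [1631/55296, 175/512, 575/13824, 44275/110592, 253/4096],
]
B5 = [37/378, 0.0, 250/621, 125/594, 0.0, 512/1771]              # 5th-order weights
B4 = [2825/27648, 0.0, 18575/48384, 13525/55296, 277/14336, 1/4]  # embedded 4th-order


def rkck_step(f, t, y, h):
    """One Cash-Karp step: returns the 5th-order solution and an error estimate."""
    n = len(y)
    k = []
    for i in range(6):
        yi = [y[m] + h * sum(A[i][j] * k[j][m] for j in range(i)) for m in range(n)]
        k.append(f(t + C[i] * h, yi))
    y5 = [y[m] + h * sum(B5[i] * k[i][m] for i in range(6)) for m in range(n)]
    y4 = [y[m] + h * sum(B4[i] * k[i][m] for i in range(6)) for m in range(n)]
    err = max(abs(a - b) for a, b in zip(y5, y4))
    return y5, err


def integrate(f, t0, y0, t_end, h=1e-3, tol=1e-8):
    """Adaptive sub-stepping across one global time step [t0, t_end]."""
    t, y = t0, list(y0)
    while t < t_end:
        h = min(h, t_end - t)
        y_new, err = rkck_step(f, t, y, h)
        if err <= tol:  # accept the step, otherwise retry with a smaller h
            t, y = t + h, y_new
        # Rescale h from the error estimate (controller constants are assumptions).
        h *= min(5.0, max(0.1, 0.9 * (tol / max(err, 1e-300)) ** 0.2))
    return y


# Usage on a simple test problem, y' = -y with y(0) = 1, integrated to t = 1.
y_end = integrate(lambda t, y: [-y[0]], 0.0, [1.0], 1.0)
print(y_end[0], math.exp(-1.0))
```

The accept/reject branch and the data-dependent number of sub-steps are where threads in a warp can diverge when each thread integrates its own ODE system.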
[Figure: computing time per global time step (s) vs. number of independent ODEs for RKCK-CPU, RKCK-CPU × 6, and RKCK-GPU; annotated speedups: 25× and 126×]
Dealing with Stiffness
• Standard explicit algorithms fail
• “Stabilized” explicit methods
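A quick way to see why standard explicit algorithms fail on stiff problems is the linear test equation y' = −λy. Forward Euler multiplies the solution by (1 − hλ) each step, so it is stable only when h ≤ 2/λ. The sketch below uses an illustrative λ and hypothetical step sizes that are not taken from the talk.

```python
def forward_euler(lam, h, n_steps, y0=1.0):
    """Integrate y' = -lam * y with forward Euler; the exact solution decays."""
    y = y0
    for _ in range(n_steps):
        y = y + h * (-lam * y)  # amplification factor per step: (1 - h*lam)
    return y


lam = 1.0e6  # fast time scale in 1/s, illustrative value

stable = forward_euler(lam, h=5.0e-7, n_steps=100)    # h*lam = 0.5, factor 0.5
unstable = forward_euler(lam, h=1.0e-5, n_steps=100)  # h*lam = 10, factor -9

print(stable)    # decays toward zero
print(unstable)  # grows without bound although the exact solution decays
```

When the fastest chemical time scale is far shorter than the time step of interest, this restriction forces many tiny steps, which is the motivation for stabilized explicit methods.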
Runge–Kutta–Chebyshev
• AKA “stabilized” Runge–Kutta
• Explicit, but capable of handling mild stiffness
- Extended stability domain along the negative real axis
• Second-order accuracy
• Jacobian free
• Time step: 1×10⁻⁶ s
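The idea behind a stabilized explicit method can be sketched with the simplest member of the family: an undamped, first-order Chebyshev scheme. Note that this is a simplification, not the second-order RKC method listed above. For the test equation y' = −λy, its s stages give a stability interval of roughly hλ ≤ 2s², using only right-hand-side evaluations and no Jacobian. The function name `chebyshev_step` and the test values are hypothetical, and the right-hand side is assumed to be autonomous.

```python
def chebyshev_step(f, y, h, s):
    """One s-stage, first-order Chebyshev step (undamped) for autonomous y' = f(y)."""
    mu = h / (s * s)
    y_prev = list(y)                                    # Y_0
    y_curr = [yi + mu * fi for yi, fi in zip(y, f(y))]  # Y_1
    for _ in range(2, s + 1):
        # Chebyshev three-term recurrence:
        # Y_j = 2*Y_{j-1} - Y_{j-2} + 2*mu*f(Y_{j-1})
        f_curr = f(y_curr)
        y_next = [2.0 * c - p + 2.0 * mu * fc
                  for c, p, fc in zip(y_curr, y_prev, f_curr)]
        y_prev, y_curr = y_curr, y_next
    return y_curr


def decay(y):
    """Stiff linear test problem y' = -1000 * y (illustrative)."""
    return [-1000.0 * y[0]]


h = 0.01  # h*lambda = 10, beyond forward Euler's limit of 2

one_stage = chebyshev_step(decay, [1.0], h, s=1)    # reduces to forward Euler
three_stage = chebyshev_step(decay, [1.0], h, s=3)  # stable, since 10 < 2*3**2

print(one_stage[0])    # magnitude 9: the step amplifies the solution
print(three_stage[0])  # magnitude below 1: the step damps the solution
```

Three right-hand-side evaluations buy a step that forward Euler would need at least five stable steps to cover, which is the trade that makes stabilized methods attractive for moderately stiff chemistry.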
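H2/CO – RKC
[Figure: performance plot; annotated speedups: 59× and 10×]
CH4 – RKC
[Figure: performance plot; annotated speedups: 69× and 13×]
C2H4 – RKC
[Figure: performance plot; annotated speedups: 18× and 4.5×]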
CH4 – RKC vs. VODE
[Figure: performance plot; annotated speedup: 57×]
C2H4 – RKC vs. VODE (stiff)
[Figure: performance plot; annotated speedup: 2.5×]
Time step: 1×10⁻⁴ s
Takeaways
• For exascale (i.e., DNS):
- Explicit algorithms significantly faster on GPUs
• For high-fidelity engineering (i.e., LES):
- Implicit algorithms perform comparably to CPU, so far… (but not much better)
- Stabilized explicit algorithms offer attractive alternative
- Greater stiffness still a problem
More details: see Niemeyer & Sung, J Comput Phys 256 (2014):854–871
Acknowledgements: Chih-Jen Sung & Nick Curtis @ UConn
Thank you!
Questions?
CH4 – Thread divergence
[Figure: computing time per global time step (s) vs. number of independent ODEs for similar and randomized conditions; annotated difference: 4×]
