CFD on Power
Ganesan Narayanasamy
Computational Fluid Dynamics on Power
1.
Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms
OpenPOWER and AI ADG Workshop – BSC, Barcelona, Spain, June 2018
Samuel Antao, IBM Research, Daresbury, UK
© 2018 IBM Corporation
2.
IBM Research @ Daresbury, UK
3.
IBM Research @ Daresbury, UK – STFC Partnership
Mission: help UK industries and institutions bring cutting-edge computational science, engineering and applicable technologies, such as data-centric cognitive computing, to boost growth and development of the UK economy.
• 2015: £313 million investment over the next 5 years
• Agreement for IBM Collaborative Research and Development (R&D) that established an IBM Research presence in the UK
• Product and Services Agreement with IBM UK and Ireland
• Access to the latest data-centric and cognitive computing technologies, including IBM's world-class Watson cognitive computing platform
• Joint commercialization of intellectual property assets produced in the partnership
4.
IBM Research @ Daresbury, UK – People
Over 26 computational scientists and engineers
5.
IBM Research @ Daresbury, UK – Research areas
• Case studies:
  – Smart Crop Protection - Precision Agriculture
    • Data science + Life sciences
  – Improving disease diagnostics and personalised treatments
    • Life sciences + Machine learning
  – Cognitive treatment plant
    • Engineering + Machine learning
  – Parameterisation of engineering models
    • Engineering + Machine learning
6.
Task-based GPU acceleration in Computational Fluid Dynamics with OpenMP 4.5 and CUDA in OpenPOWER platforms
7.
CFD and Algebraic-Multigrid
• Solve a set of partial differential equations over several time steps
  – Discretization:
    • Unstructured vs structured
  – Equations:
    • Velocity
    • Pressure
    • Turbulence
• Iterative solvers
  – Jacobi
  – Gauss-Seidel
  – Conjugate Gradient
• Multigrid approaches
  – Solve the problem at different resolutions
    • Coarse and fine grids/meshes
  – Fewer iterations for fine grids
• Algebraic multigrid (AMG): encode mesh information in algebraic format
  – Sparse matrices
Figure source: http://web.utk.edu/~wfeng1/research.html
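As a small illustration of the Jacobi solver named in the list above, the following sketch applies Jacobi sweeps to a 1D Poisson problem. The problem, the function names (`jacobi_sweep`, `jacobi_solve`) and the stopping rule are assumptions chosen for brevity; this is not code from Code Saturne.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper (avoids depending on the math library). */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* One Jacobi sweep for the 1D Poisson system
 *   -u[i-1] + 2*u[i] - u[i+1] = f[i],  with zero boundary values.
 * Reads u_old, writes u_new, and returns the largest change made to
 * any unknown during the sweep. */
static double jacobi_sweep(size_t n, const double *f,
                           const double *u_old, double *u_new)
{
    double max_change = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double left  = (i > 0)     ? u_old[i - 1] : 0.0;
        const double right = (i + 1 < n) ? u_old[i + 1] : 0.0;
        u_new[i] = 0.5 * (f[i] + left + right);
        const double change = abs_d(u_new[i] - u_old[i]);
        if (change > max_change) max_change = change;
    }
    return max_change;
}

/* Repeat sweeps until the largest update drops below tol or max_iter
 * sweeps have run. u holds the initial guess on entry and the result on
 * exit; work is scratch space of n doubles. Returns the sweep count. */
static int jacobi_solve(size_t n, const double *f, double *u, double *work,
                        double tol, int max_iter)
{
    assert(n > 0 && max_iter > 0);
    int it = 0;
    while (it < max_iter) {
        const double change = jacobi_sweep(n, f, u, work);
        for (size_t i = 0; i < n; ++i) u[i] = work[i];
        ++it;
        if (change < tol) break;
    }
    return it;
}
```

Every unknown in a sweep depends only on the previous iterate, which is why Jacobi-style updates parallelise easily; multigrid uses a few such sweeps per level rather than iterating to convergence on the fine grid.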
8.
CFD and Algebraic-Multigrid
• Grid partitioned by MPI ranks
• Ranks distributed across nodes
• More than one rank executing in one node
• Challenges:
  – Different grids have different compute needs
  – Access strides vary; unstructured data accesses
  – CPU-GPU data movements
  – Regular communication between ranks
    • Halo elements
    • Residuals
    • Synchronizations
[Figure: grid partitions assigned to MPI ranks. Diagram labels: NVLink™, InfiniBand™, MPI rank. Source: http://web.utk.edu/~wfeng1/research.html]
9.
CFD and Algebraic-Multigrid – Code Saturne
• Open-source, developed and maintained by EDF
• 350K lines of code:
  – 50% C
  – 37% Fortran
  – 13% Python
• Rich ecosystem to configure/parameterise simulations and generate meshes
• History of good scalability

105B Cell Mesh (MIRA, BGQ)
Cores        Time in Solver   Efficiency
262,144      789.79 s         -
524,288      403.18 s         97%

13B Cell Mesh (MIRA, BGQ)
MPI Tasks    Time in Solver   Efficiency
524,288      70.114 s         -
1,048,576    52.574 s         66%
1,572,864    45.731 s         76%
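The slide does not say how the Efficiency column is computed. The sketch below assumes it is the strong-scaling efficiency of each row relative to the row before it; under that assumption the scalability figures above are reproduced if the percentages are truncated rather than rounded. The function name `scaling_efficiency` is hypothetical.

```c
#include <assert.h>

/* Relative strong-scaling efficiency between two runs of the same problem:
 *   (t_ref * n_ref) / (t_new * n_new)
 * where n is the core (or MPI task) count and t the time in the solver.
 * A value of 1.0 means perfect scaling from the reference run. */
static double scaling_efficiency(double n_ref, double t_ref,
                                 double n_new, double t_new)
{
    assert(n_ref > 0 && n_new > 0 && t_ref > 0 && t_new > 0);
    return (t_ref * n_ref) / (t_new * n_new);
}
```

With the numbers from the slide, `scaling_efficiency(262144, 789.79, 524288, 403.18)` is about 0.979, `scaling_efficiency(524288, 70.114, 1048576, 52.574)` is about 0.667, and `scaling_efficiency(1048576, 52.574, 1572864, 45.731)` is about 0.766.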
10.
CFD and Algebraic-Multigrid – Execution time distribution
• Many components (kernels) contribute to the total execution time
• There are data dependencies between consecutive kernels
• There are opportunities to keep data in the device between kernels
• Some kernels have low compute intensity, but it can still be worthwhile to run them on the GPU if the data is already there
[Figure: single-thread profiling of Code Saturne 5.0+. Chart labels: Gauss-Seidel solver (Velocity); Other; Matrix-vector mult. MSR; Matrix-vector mult. CSR; Dot products; Multigrid setup; Compute coarse cells from fine cells; Other AMG-related; Pressure (AMG).]
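The CSR matrix-vector product listed in the profile can be sketched as below. This is a generic textbook CSR kernel written for illustration, not the Code Saturne implementation; the name `csr_spmv` and the argument layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a sparse matrix stored in CSR format.
 * Row i owns entries row_ptr[i] .. row_ptr[i+1]-1 of col_idx and val. */
static void csr_spmv(size_t n_rows, const size_t *row_ptr,
                     const size_t *col_idx, const double *val,
                     const double *x, double *y)
{
    assert(row_ptr && col_idx && val && x && y);
    /* Rows are independent, so the outer loop can run in parallel.
     * The pragma is ignored when the compiler is not run with OpenMP. */
    #pragma omp parallel for
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];  /* indirect, unstructured access */
        y[i] = sum;
    }
}
```

The indirect read `x[col_idx[k]]` is the kind of unstructured access that makes such kernels memory-bound, which is one reason keeping the vectors resident on the device between kernels matters.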
11.
Directive-based programming models
• Porting existing code to accelerators is time consuming…
• The sooner you have code running on the GPU, the sooner you can start…
  – … learning where the overheads are
  – … identifying what data patterns are being used
  – … spotting kernels that perform poorly
  – … deciding which strategies can be used to improve performance
• Directive-based programming models can get you started much quicker
  – No need to handle device memory allocation and data pointers yourself
  – Implementation defaults already exploit device features
  – Easily create data environments where data resides in the GPU
  – Improve your code portability
• The Clang C/C++ and IBM XL C/C++/Fortran compilers provide OpenMP 4.5 support
• The PGI C/C++/Fortran compiler provides OpenACC support
• Can be complemented with existing GPU-accelerated libraries
  – cuSPARSE
  – AmgX
12.
Directive-based programming models
• OpenMP 4.5 data environments – Code Saturne 5.0+ snippet – Conjugate Gradient

    static cs_sles_convergence_state_t _conjugate_gradient (/* ... */)
    {
    # pragma omp target data if (n_rows > GPU_THRESHOLD) \
        /* Move the result vector to the device and copy it back at the end of the scope */ \
        map(tofrom:vx[:vx_size]) \
        /* Move the right-hand side vector to the device */ \
        map(to:rhs[:n_rows]) \
        /* Allocate all auxiliary vectors in the device */ \
        map(alloc:_aux_vectors[:tmp_size])
      {

        /* Solver code */

      }
    }

Listing 2: OpenMP 4.5 data environment for a level of the AMG solver.

Callouts: All arrays reside in the device in this scope! The programming model manages the host/device pointer mapping for you!

Surrounding text visible on the slide (cropped at both ends): "[...] during the computation of a level so it can be copied to the device at the beginning of the level. The result vector can also be kept in the device for a significant part of the execution, and only has to be copied to the host during halo exchange. OpenMP 4.5 makes managing the data according to the aforementioned observations almost trivial: a single directive suffices to set the scope - see Listing 2. Each time halos [...]"
13.
Directive-based programming models
• OpenMP 4.5 target regions – Code Saturne 5.0+ snippet – Dot products

    /* ... (start of the vector multiply-and-add kernel is cropped on the slide) */
        vx[ii] += (alpha * dk[ii]);
        rk[ii] += (alpha * zk[ii]);
      }

      /* ... */
    }

    /* ... */

    static void _cs_dot_xx_xy_superblock (cs_lnum_t n,
                                          const cs_real_t *restrict x,
                                          const cs_real_t *restrict y,
                                          double *xx,
                                          double *xy)
    {
      double dot_xx = 0.0, dot_xy = 0.0;

      /* Offload the loop when the vector is large enough; both sums are reduced across teams and threads */
    # pragma omp target teams distribute parallel for reduction(+:dot_xx, dot_xy) \
        if (n > GPU_THRESHOLD) \
        map(to:x[:n],y[:n]) \
        map(tofrom:dot_xx, dot_xy)
      for (cs_lnum_t i = 0; i < n; ++i) {
        const double tx = x[i];
        const double ty = y[i];
        dot_xx += tx*tx;
        dot_xy += tx*ty;
      }

      /* ... */

      *xx = dot_xx;
      *xy = dot_xy;
    }

Listing 3: Example of GPU port for two stream kernels: vector multiply-and-add and dot product.

[Figure: execution flow of a target region. Labels: Host; OpenMP runtime library (allocate data in the device); OpenMP team, CUDA blocks; OpenMP runtime library (release data in the device); Host.]
14.
Directive-based programming models
• OpenMP 4.5 – Code Saturne 5.0+ – AMG NVPROF timeline
[Figure: NVPROF timeline of an AMG cycle, with AMG coarse grid detail. Annotations: allocations of small variables; high kernel launch latency; back-to-back kernels.]
15.
CUDA-based tuning
• Avoid expensive GPU memory allocation/deallocation:
  – Allocate a memory chunk once and reuse it
• Use pinned memory for data copied frequently to the GPU
  – Avoids the pageable-to-pinned staging copies made by the CUDA implementation
• Explore asynchronous execution of CUDA API calls
  – Start copying data to/from the device while the host is preparing the next set of data or the next kernel
• Use CUDA constant memory to copy arguments for multiple kernels at once
  – The latency of copying tens of KB to the GPU is similar to that of copying 1 B
  – Double buffering enables copies to happen asynchronously
• Produce specialized kernels instead of relying on runtime checks
  – CUDA is a C++ extension, so kernels and device functions can be templated
  – Leverage compile-time optimizations for the relevant sequences of kernels
  – The NVCC toolchain does very aggressive inlining
  – Lower register pressure = higher occupancy

    template <KernelKinds Kind>
    __device__ int any_kernel (KernelArgsBase &Arg, unsigned n_rows_per_block) {
      // Kind is a compile-time constant, so the switch folds to a single case.
      switch (Kind) {
      /* ... */
      // Dot product:
      case DP_xx:
        dot_product<Kind>(
          /* version */ Arg.getArg<cs_lnum_t>(0),
          /* n_rows */ Arg.getArg<cs_lnum_t>(1),
          /* x */ Arg.getArg<cs_real_t *>(2),
          /* y */ nullptr,
          /* z */ nullptr,
          /* res */ Arg.getArg<cs_real_t *>(3),
          /* n_rows_per_block */ n_rows_per_block);
        break;
      /* ... */
      }
      __syncthreads ();
      return 0;
    }

    template <KernelKinds... Kinds>
    __global__ void any_kernels (void) {

      auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
      const unsigned n_rows_per_block = KA->RowsPerBlock;
      unsigned idx = 0;

      // Pack expansion inside a braced initializer list: elements are
      // evaluated left to right, so the kernels run in the listed order.
      int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
      (void) dummy;
    }

Listing 10: Device entry-point function for kernel execution.
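The first tuning point, allocating one memory chunk and reusing it, can be sketched with a simple bump allocator. This is a host-only illustration: `malloc` stands in for the single device allocation, and the type and function names (`arena_t`, `arena_alloc`, and so on) and the 256-byte alignment are assumptions, not part of the presented code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A chunk obtained once and carved into sub-buffers on demand. */
typedef struct {
    unsigned char *base;  /* start of the chunk */
    size_t capacity;      /* chunk size in bytes */
    size_t used;          /* bytes handed out so far */
} arena_t;

/* Acquire the chunk. malloc is a stand-in for one device allocation. */
static int arena_init(arena_t *a, size_t capacity)
{
    assert(a != NULL);
    a->base = malloc(capacity);
    a->capacity = a->base ? capacity : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Hand out bytes from the chunk without calling the allocator again.
 * Offsets are rounded up to multiples of 256 bytes relative to base.
 * Returns NULL when the chunk is exhausted. */
static void *arena_alloc(arena_t *a, size_t bytes)
{
    const size_t align = 256;
    const size_t start = (a->used + align - 1) / align * align;
    if (start > a->capacity || bytes > a->capacity - start)
        return NULL;
    a->used = start + bytes;
    return a->base + start;
}

/* Make the whole chunk available again, e.g. at the start of the next
 * solver call; no memory is returned to the system. */
static void arena_reset(arena_t *a) { a->used = 0; }

/* Release the chunk once, at shutdown. */
static void arena_destroy(arena_t *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

Temporary buffers then cost only a pointer increment per request, and `arena_reset` replaces a sequence of deallocations between solver invocations.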
16.
CUDA-based tuning
• CUDA – Code Saturne 5.0+ – Results for a single rank
  – IBM Minsky server - Lid-driven cavity flow – 1.5M-cell grid

Execution time (seconds)
OpenMP threads            1       2       4       8
Wall time CPU           57.21   43.83   34.77   30.28
Solver time CPU         49.86   37.37   29.67   25.63
Wall time CPU+GPU       11.87   10.83    9.55    9.32
Solver time CPU+GPU      4.41    4.34    4.40    4.63

Speedup: GPU speedup (1x)
OpenMP threads            1       2       4       8
Wall time                4.82    4.05    3.64    3.25
Solvers time            11.29    8.60    6.74    5.53
17.
CUDA-based tuning
• CUDA – Code Saturne 5.0+
  – NVPROF timeline for a single rank
  – IBM Minsky server - Lid driven cavity flow
  – 1.5M-cell grid
[Figure: NVPROF timeline. Labelled regions: Gauss-Seidel, AMG fine grid.]
18.
CUDA-based tuning
• CUDA – Code Saturne 5.0+
  – NVPROF timeline for a single rank (cont.)
  – IBM Minsky server - Lid driven cavity flow
  – 1.5M-cell grid
[Figure: NVPROF timeline. Labelled region: AMG coarse grid.]
19.
MPI and GPU acceleration
• Different processes (MPI ranks) will use different CUDA contexts.
• CUDA implementation serializes CUDA contexts by default.
• NVIDIA Multi-Process Service (MPS) provides context switching capabilities so that multiple processes can use the same GPU.
[Figure: Rank 0, Rank 1 and Rank 2 reach the GPU driver through one MPS server instance. One rank performs: define visible GPU, start MPS server, execute application, terminate MPS server. The other two ranks perform: define visible GPU, execute application.]
20.
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+
  – NVPROF timeline for multiple ranks (5 ranks per GPU)
  – IBM Minsky server - Lid driven cavity flow
  – 111M-cell grid
[Figure: NVPROF timeline. Labelled regions: Gauss-Seidel, hiding data movement latencies.]
21.
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+
  – NVPROF timeline for multiple ranks (5 ranks per GPU – cont.)
  – IBM Minsky server - Lid driven cavity flow
  – 111M-cell grid
[Figure: NVPROF timeline. Labelled region: AMG coarse grid.]
22.
MPI and GPU acceleration
• CUDA – Code Saturne 5.0+
  – Results for multiple ranks (5 ranks per GPU – cont.)
  – IBM Minsky server - Lid driven cavity flow
  – 111M-cell grid
  – CPU+GPU efficiency 65% @32 nodes

Execution time (seconds) by number of nodes:

  Nodes                       1       2       4       8      16      32
  CPU wall time             717.6   369.9   187.4   100.2    54.9    28.9
  CPU solvers time          693.9   358.0   181.5    97.2    53.4    28.1
  CPU+GPU wall time         300.4   153.1    80.8    45.2    26.4    14.4
  CPU+GPU solvers time      274.7   139.6    74.0    41.1    24.3    13.4

Speedup over CPU-only (1x) by number of nodes:

  Nodes                       1       2       4       8      16      32
  Wall time                  2.39    2.42    2.32    2.22    2.08    2.00
  Solvers time               2.53    2.57    2.45    2.37    2.20    2.10
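The 65% efficiency quoted for 32 nodes is consistent with the usual strong-scaling definition, E(N) = T(1) / (N · T(N)), applied to the CPU+GPU wall times: 300.4 / (32 × 14.4) ≈ 0.65. The sketch below assumes that this is the definition behind the slide; the function name is hypothetical.

```cpp
#include <cassert>

// Strong-scaling efficiency of a run on n nodes relative to a reference run
// on n_ref nodes, for a fixed problem size: (t_ref * n_ref) / (t_n * n).
double strong_scaling_efficiency(double t_ref_s, int n_ref, double t_n_s, int n) {
  assert(t_ref_s > 0.0 && t_n_s > 0.0 && n_ref > 0 && n > 0);
  return (t_ref_s * n_ref) / (t_n_s * n);
}
```

Under the same assumption, the CPU-only wall times give 717.6 / (32 × 28.9) ≈ 0.78.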
23.
POWER8 to POWER9
• A code performing well on POWER8 + P100 GPUs should perform well on POWER9 + V100 GPUs
  – No major code refactoring needed.
  – More powerful CPUs, GPUs and interconnect.
• Some differences to consider:
  – Core vs pairs of cores
    • POWER9 L3 cache and store queue are shared by each pair of cores
    • SMT4 per core or SMT8 per pair of cores
  – V100 (Volta) drops lock-step execution of the threads in a warp
    • One program counter per thread
    • If code assumes lock-step execution, explicit barriers have to be inserted
    • No guarantee threads will converge after divergence within a warp
    • One has to leverage cooperative groups and thread activity masks
[Figure: ORNL Summit socket (2 sockets per node), with NVLink connections.]

    for (cs_lnum_t ii = StartRow; ii < EndRow; ii += bdimy) {
      // Depending on the number of rows - warps may diverge here
      unsigned AM = __activemask();
      …
      for (cs_lnum_t kk = 1; kk < bdimx; kk *= 2)
        sii += __shfl_down_sync(AM, sii, kk, bdimx);
      …
    }
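The reduction loop on this slide walks the shuffle distance upwards (1, 2, 4, ...) rather than downwards. A host-side model makes it easy to confirm that lane 0 still ends up with the full sum. The sketch below is plain C++, not CUDA. It assumes that every lane of the segment is active and that the segment width is a power of two, and it models an out-of-range shuffle source as the lane reading back its own value. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models `v += shfl_down(v, delta)` across one segment of lanes.
// A lane whose source (i + delta) falls outside the segment reads its own value.
std::vector<double> shfl_down_add(const std::vector<double> &lanes, std::size_t delta) {
  std::vector<double> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i) {
    const std::size_t src = i + delta;
    out[i] = lanes[i] + (src < lanes.size() ? lanes[src] : lanes[i]);
  }
  return out;
}

// Mirrors the slide's inner loop (kk = 1, 2, 4, ... < width) and returns the
// value held by lane 0, which is the only lane guaranteed to hold the full sum.
double reduce_lane0(std::vector<double> lanes) {
  const std::size_t width = lanes.size();
  assert(width > 0 && (width & (width - 1)) == 0);  // power of two
  for (std::size_t kk = 1; kk < width; kk *= 2)
    lanes = shfl_down_add(lanes, kk);
  return lanes[0];
}
```

The model does not capture partially active warps, which is exactly the case the `__activemask()` call on the slide is there to handle on Volta.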
24.
POWER8 to POWER9
• CUDA – Code Saturne 5.0+
  – Results for multiple ranks (3 ranks per GPU)
  – 6 GPUs per node
  – IBM Power 9 and NVLINK 2.0 (Summit) - Lid driven cavity flow
  – 889M-cell grid
  – CPU+GPU efficiency 76% @512 nodes

Execution time (seconds) by number of nodes:

  Nodes                      64     256     512
  CPU wall time            74.73   21.04   11.16
  CPU+GPU wall time        32.3     7.25    4.76

Speedup over CPU-only (1x) by number of nodes:

  Nodes                      64     256     512
  Wall time                 2.31    2.90    2.34

POWER9 vs POWER8: better efficiency when scaling to 16x more nodes for an 8x larger problem.
25.
CFD and AI
• Cognitive Enhanced Design
  – Designing and prototyping new pieces of equipment is expensive (time and finance)
    • Parameter sweeps need several expensive simulations
    • Want to make decisions faster
    • Make decisions on more complex problems
  – Use cognitive techniques (e.g. Bayesian neural networks) to generate a model based on a parameterized space to relate design parameters to performance. Use this in Bayesian optimization to improve the design
    • Converge to optimal parameters more quickly
  – Example: airfoil optimization (lift/drag maximization)
    • Adaptive Expected Improvement (EI) converges faster and with less variance.

Work package ML1: Cognitive Enhanced Design
§ Problem: Design / prototyping of new pieces of equipment can be expensive (time and finance). Want to do more work in silico, and also use an ‘intelligent’ design process.
§ Solution: Use cognitive techniques (e.g. Bayesian neural networks) to generate a model based on a parameterized space to relate design parameters to performance. Use this in Bayesian optimization to improve design.
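For readers unfamiliar with the acquisition function named on this slide, the standard (non-adaptive) Expected Improvement for a maximization problem has a closed form when the surrogate model predicts a Gaussian with mean μ and standard deviation σ: EI = (μ − f_best) Φ(z) + σ φ(z), with z = (μ − f_best) / σ. The sketch below implements only this textbook formula. It is not the adaptive variant used in the work package, whose details the slides do not give, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Standard normal probability density function.
double normal_pdf(double z) {
  const double two_pi = 6.283185307179586;
  return std::exp(-0.5 * z * z) / std::sqrt(two_pi);
}

// Standard normal cumulative distribution function.
double normal_cdf(double z) {
  return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Expected improvement over the best observed value f_best (maximization),
// given the surrogate's predicted mean mu and standard deviation sigma.
double expected_improvement(double mu, double sigma, double f_best) {
  assert(sigma >= 0.0);
  const double gain = mu - f_best;
  if (sigma == 0.0)  // no uncertainty: improvement is known exactly
    return gain > 0.0 ? gain : 0.0;
  const double z = gain / sigma;
  return gain * normal_cdf(z) + sigma * normal_pdf(z);
}
```

In a Bayesian optimization loop, the next design point to simulate (for instance, a candidate airfoil with a predicted lift/drag ratio) would be the one that maximizes this quantity over the parameter space.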
26.
CFD and AI
• Enhanced 3D-Feature Detection
  – A typical bottleneck of the design process is the analysis step of the simulation-led workflow
    • Extract features such as:
      – flow
      – separation
      – swirl
      – layering
  – Extend AI techniques to automatically extract features in 3D
    • Remove analysis bottlenecks
    • Semantic querying of simulation data
    • Contextual event classification
    • Computational steering for rare-event simulation
  – Example: racing car vortex detection
    • AI-enabled feature detection

Work package ML3: Enhanced 3D-Feature Detection
§ Problem: One typical bottleneck in the simulation-led workflow is the analysis of the output produced by the simulation itself, especially the identification of features (e.g. for flow: separation, swirl, layering).
§ Solution: Extend deep-feature detection to 3-dimensional problems to remove this bottleneck from the design workflow.
§ Additional extensions planned for the semantic querying of simulation data, contextual event classification, and computational steering for rare-event simulation.
27.
CFD and AI
• Enhanced 3D-Feature Detection
  – A typical bottleneck of the design process is the analysis step of the simulation-led workflow
    • Extract features such as:
      – flow
      – separation
      – swirl
      – layering
  – Extend AI techniques to automatically extract features in 3D
    • Remove analysis bottlenecks
    • Semantic querying of simulation data
    • Contextual event classification
    • Computational steering for rare-event simulation
  – Example: racing car vortex detection
    • AI-enabled feature detection
28.
Questions? samuel.antao@ibm.com