SlideShare a Scribd company logo
Robert Searles, Sunita Chandrasekaran
(rsearles, schandra)@udel.edu
Wayne Joubert, Oscar Hernandez
(joubert,oscar)@ornl.gov
Abstractions and Directives for
Adapting Wavefront Algorithms to
Future Architectures
PASC 2018 June 3, 2018
1
Motivation
• Parallel programming is software’s future
– Acceleration
• State-of-the-art abstractions handle simple
parallel patterns well
• Complex patterns are hard!
2
Our Contributions
• An abstract representation for wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using directives: OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
3
Several ways to accelerate
Directives
Programming
Languages
Libraries
Applications
Drop in
acceleration
Maximum
Flexibility
Used for easier
acceleration
4
Directive-Based Programming Models
• OpenMP (current version 4.5)
– Multi-platform shared multiprocessing API
– Since 2013, supporting device offloading
• OpenACC (current version 2.6)
– Directive-based model for heterogeneous
computing
5
Serial Example
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
6
OpenACC Example
#pragma acc parallel loop independent
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
7
Host Code
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
// Allocate GPU buffers for three vectors (two input, one output)
cudaStatus = cudaMalloc((void**)&dev_c, N* sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_a, N* sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_b, N* sizeof(int));
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, N* sizeof(int), cudaMemcpyHostToDevice);
cudaStatus = cudaMemcpy(dev_b, b, N* sizeof(int), cudaMemcpyHostToDevice);
// Launch a kernel on the GPU with one thread for each element.
addKernel<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(dev_c, dev_a, dev_b);
// cudaThreadSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaThreadSynchronize();
// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, N* sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
8
Kernel
__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
c[i] = a[i] + b[i];
}
CUDA Example
Pattern-Based Approach in Parallel
Computing
• Several parallel patterns
– Existing high-level languages provide abstractions for many
simple patterns
• However there are complex patterns often found in
scientific applications that are a challenge to be
represented with software abstractions
– Require manual code rewrite
• Need additional features/extensions!
– How do we approach this? (Our paper’s contribution)
9
Application Motivation:
Minisweep
• A miniapp modeling wavefront sweep component of
Denovo Radiation transport code from ORNL
– Minisweep, a miniapp, represents 80-90% of Denovo
• Denovo - part of DOE INCITE project, is used to model
fusion reactor – CASL, ITER
• Run many times with different parameters
• The faster it runs, the more configurations we can explore
• Poses a six dimensional problem
• 3D in space, 2D in angular particle direction and 1D in
particle energy
10
Minisweep code status
• Github: https://github.com/wdj/minisweep
• Early application readiness on ORNL Titan
• Being used for #1 TOP500 -> Summit
acceptance testing
• Has been ported to Beacon and Titan (ORNL
machines) using OpenMP and CUDA
11
Minisweep: The Basics
12
Parallelizing Sweep Algorithm
13
Complex Parallel Pattern Identified:
Wavefront
14
Complex Parallel Pattern Identified:
Wavefront
15
Overview of Sweep Algorithm
• 5 nested loops
– X, Y, Z dimensions, Energy Groups, Angles
– OpenACC/PGI only offers 2 levels of parallelism:
gang and vector (worker clause not working
properly)
– Upstream data dependency
16
17
18
Parallelizing Sweep Algorithm: KBA
• Koch-Baker-Alcouffe (KBA)
• Algorithm developed in 1992 at
Los Alamos
• Parallel sweep algorithm that
overcomes some of the
dependencies using a wavefront.
19
Image credit: High Performance
Radiation Transport Simulations:
Preparing for TITAN
C. Baker, G. Davidson, T. M. Evans,
S. Hamilton, J. Jarrell and W.
Joubert, ORNL, USA
Expressing Wavefront via Software
Abstractions – A Challenge
• Existing solutions involve manual rewrites, or
compiler-based loop transformations
– Michael Wolfe. 1986. Loop skewing: the wavefront method
revisited. Int. J. Parallel Program. 15, 4 (October 1986),
279-293. DOI=http://dx.doi.org/10.1007/BF01407876
– Polyhedral frameworks, only support affine loops, ChiLL
and Pluto
• No solution in high-level languages like
OpenMP/OpenACC; no software abstractions
20
Our Contribution:
Create Software Abstractions for
Wavefront pattern
• Analyzing flow of data and computation in
wavefront codes
• Memory and threading challenges
• Wavefront loop transformation algorithm
21
Abstract Parallelism Model
for( iz=izbeg; iz!=izend; iz+=izinc )
for( iy=iybeg; iy!=iyend; iy+=iyinc )
for( ix=ixbeg; ix!=ixend; ix+=ixinc ) { // space
for( ie=0; ie<dim_ne; ie++ ) { // energy
for( ia=0; ia<dim_na; ia++ ) { // angles
// in-gridcell computation
}
}
}
22
Abstract Parallelism Model
• Spatial decomposition = outer layer (KBA)
– No existing abstraction for this
• In-gridcell computations = inner layer
– Application specific
• Upstream data dependencies
– Slight variation between wavefront applications
23
Data Model
• Storing all previous wavefronts is unnecessary
– How many neighbors and prior wavefronts are
accessed?
• Face arrays make indexing easy
– Smaller data footprint
• Limiting memory to the size of the largest
wavefront is optimal, but not practical
24
Parallelizing Sweep Algorithm: KBA
25
Programming Model Limitations
• No abstraction for wavefront loop
transformation
– Manual loop restructuring
• Limited layers of parallelism
– 2 isn’t enough (worker is broken)
– Asynchronous execution?
26
Experimental Setup
• NVIDIA PSG Cluster
– CPU: Intel Xeon E5-2698 v3 (16-core) and Xeon E5-2690 v2 (10-core)
– GPU: NVIDIA Tesla P100, Tesla V100, and Tesla K40 (4 GPUs per node)
• ORNL Titan
– CPU: AMD Opteron 6274 (16-core)
– GPU: NVIDIA Tesla K20x
• ORNL SummitDev
– CPU: IBM Power8 (10-core)
– GPU: NVIDIA Tesla P100
• PGI OpenACC Compiler 17.10
• OpenMP – GCC 6.2.0
– Issues running OpenMP minisweep code on Titan but works OK on PSG.
27
Input Parameters
• Scientifically
– X/Y/Z dimensions = 64
– # Energy Groups = 64
– # Angles = 32
• Goal is to explore larger spatial dimensions
28
29
Contributions
• An abstract representation of wavefront
algorithms
• A performance portable proof-of-concept of
this abstraction using OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing
high-level programming models
30
Next Steps
• Asynchronous execution
• MPI - multi-node/multi-GPU
• Develop a generalization/extension to existing
high-level programming models
– Prototype
31
Preliminary Results/Ongoing Work
• MPI + OpenACC
– 1 node x 1 P100 GPU = 66.79x speedup
– 4 nodes x 4 P100 GPUs/node = 565.81x speedup
– 4 nodes x 4 V100 GPUs/node = 624.88x speedup
• Distributing the workload lets us examine larger
spatial dimensions
– Future: Use blocking to allow for this on a single GPU
32
Takeaway(s)
• Using directives is not magical! Compilers are already doing a lot for
us! 
• Code benefits from incremental improvement – so let’s not give up! 
• *Profiling and Re-profiling is highly critical*
• Look for any serial code refactoring, if need be
– Make the code parallel and accelerator-friendly
• Watch out for compiler bugs and *report them*
– The programmer is not ‘always’ wrong
• Watch out for *novel language extensions and propose to the
committee* - User feedback
– Did you completely change the loop structure? Did you notice a
parallel pattern for which we don’t have a high-level directive yet?
33
Contributions
• An abstract representation of wavefront algorithms
• A performance portable proof-of-concept of this
abstraction using directives, OpenACC
– Evaluation on multiple state-of-the-art platforms
• A description of the limitations of existing high-level
programming models
• Contact: rsearles@udel.edu
• Github: https://github.com/rsearles35/minisweep
34
Additional Material
35
Additional Material
36
37
silica IFPEN, RMM-DIIS on P100
OPENACC GROWING MOMENTUM
Wide Adoption Across Key HPC Codes
3 of Top 5 HPC Applications Use
OpenACC
ANSYS Fluent, Gaussian, VASP
40 core Broadwell 1 P100 2 P100 4 P100
0
1
2
3
4
5
Speed-up
vasp_std (5.4.4)
VASP, silica IFPEN, RMM-DIIS on P100
* OpenACC port covers more VASP routines than CUDA, OpenACC port planned top
down, with complete analysis of the call tree, OpenACC port leverages improvements in
latest VASP Fortran source base
silica IFPEN, RMM-DIIS on
P100
CAAR Codes Use
OpenACC
GTC
XGC
ACME
FLASH
LSDalton
OpenACC Dominates in
Climate & Weather
Key Codes Globally
COSMO, IFS(ESCAPE),
NICAM, ICON, MPAS
Gordon Bell Finalist
CAM-SE on Sunway
Taihulight
38
NUCLEAR REACTOR
MODELING PROXY CODE :
MINISWEEP
 Minisweep, a miniapp, represents 80-90% of
Denovo Sn code
 Denovo Sn (discrete ordinate), part of DOE
INCITE project, is used to model fusion reactor
– CASL, ITER
 Impact: By running Minisweep faster,
experiments with more configurations can be
performed directly impacting the determination
of accuracy of radiation shielding
 Poses a six dimensional problem
 3D in space, 2D in angular particle direction and
1D in particle energy
38

More Related Content

What's hot

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
LEGATO project
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
Nidhin Pattaniyil
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian Wimmer
Thomas Wuerthinger
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
Thomas Wuerthinger
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...
Fwdays
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Thomas Wuerthinger
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution Platform
Thomas Wuerthinger
 
Advancing OpenFabrics Interfaces
Advancing OpenFabrics InterfacesAdvancing OpenFabrics Interfaces
Advancing OpenFabrics Interfaces
inside-BigData.com
 
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
MLconf
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
Rogue Wave Software
 
Modeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice SimulationModeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice Simulation
EMA Design Automation
 
Requirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsRequirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and Applications
Lionel Briand
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
Yamagishi Laboratory, National Institute of Informatics, Japan
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Intel® Software
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)
Dimos Raptis
 
ScilabTEC 2015 - Silkan
ScilabTEC 2015 - SilkanScilabTEC 2015 - Silkan
ScilabTEC 2015 - Silkan
Scilab
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF Models
Filip Krikava
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrell
zukun
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
Mengmeng Xu
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Data Science Milan
 

What's hot (20)

LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian WimmerGraal Tutorial at CGO 2015 by Christian Wimmer
Graal Tutorial at CGO 2015 by Christian Wimmer
 
Micro-Benchmarking Considered Harmful
Micro-Benchmarking Considered HarmfulMicro-Benchmarking Considered Harmful
Micro-Benchmarking Considered Harmful
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
 
Graal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution PlatformGraal VM: Multi-Language Execution Platform
Graal VM: Multi-Language Execution Platform
 
Advancing OpenFabrics Interfaces
Advancing OpenFabrics InterfacesAdvancing OpenFabrics Interfaces
Advancing OpenFabrics Interfaces
 
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Modeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice SimulationModeling an Embedded Device for PSpice Simulation
Modeling an Embedded Device for PSpice Simulation
 
Requirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and ApplicationsRequirements in Cyber-Physical Systems: Specifications and Applications
Requirements in Cyber-Physical Systems: Specifications and Applications
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 2)~
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)
 
ScilabTEC 2015 - Silkan
ScilabTEC 2015 - SilkanScilabTEC 2015 - Silkan
ScilabTEC 2015 - Silkan
 
On the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF ModelsOn the Use of an Internal DSL for Enriching EMF Models
On the Use of an Internal DSL for Enriching EMF Models
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrell
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
 
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo LomonacoContinual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco
 

Similar to Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
Intel® Software
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
HPCC Systems
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
inside-BigData.com
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
openseesdays
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Intel® Software
 
Current & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylightCurrent & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylight
abhijit2511
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENEWorkshop
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
Databricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Databricks
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
Facultad de Informática UCM
 
Scilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project BriefingScilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project Briefing
TBSS Group
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
Kirill Tsym
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
Kernel TLV
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
jsvetter
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
CloudLightning
 
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
Daniel Varro
 
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen..."Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
Edge AI and Vision Alliance
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
Sri Ambati
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
Unai Lopez-Novoa
 

Similar to Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures (20)

A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Current & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylightCurrent & Future Use-Cases of OpenDaylight
Current & Future Use-Cases of OpenDaylight
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the Cloud
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Scilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project BriefingScilab Challenge@NTU 2014/2015 Project Briefing
Scilab Challenge@NTU 2014/2015 Project Briefing
 
FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
 
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
IncQuery-D: Distributed Incremental Model Queries over the Cloud: Engineerin...
 
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen..."Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presen...
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 

Recently uploaded (20)

Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 

Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures

  • 1. Robert Searles, Sunita Chandrasekaran (rsearles, schandra)@udel.edu Wayne Joubert, Oscar Hernandez (joubert,oscar)@ornl.gov Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures PASC 2018 June 3, 2018 1
  • 2. Motivation • Parallel programming is software’s future – Acceleration • State-of-the-art abstractions handle simple parallel patterns well • Complex patterns are hard! 2
  • 3. Our Contributions • An abstract representation for wavefront algorithms • A performance portable proof-of-concept of this abstraction using directives: OpenACC – Evaluation on multiple state-of-the-art platforms • A description of the limitations of existing high-level programming models 3
  • 4. Several ways to accelerate Directives Programming Languages Libraries Applications Drop in acceleration Maximum Flexibility Used for easier acceleration 4
  • 5. Directive-Based Programming Models • OpenMP (current version 4.5) – Multi-platform shared multiprocessing API – Since 2013, supporting device offloading • OpenACC (current version 2.6) – Directive-based model for heterogeneous computing 5
  • 6. Serial Example for (int i = 0; i < N; i++) { c[i] = a[i] + b[i]; } 6
  • 7. OpenACC Example #pragma acc parallel loop independent for (int i = 0; i < N; i++) { c[i] = a[i] + b[i]; } 7
  • 8. Host Code cudaError_t cudaStatus; // Choose which GPU to run on, change this on a multi-GPU system. cudaStatus = cudaSetDevice(0); // Allocate GPU buffers for three vectors (two input, one output) cudaStatus = cudaMalloc((void**)&dev_c, N* sizeof(int)); cudaStatus = cudaMalloc((void**)&dev_a, N* sizeof(int)); cudaStatus = cudaMalloc((void**)&dev_b, N* sizeof(int)); // Copy input vectors from host memory to GPU buffers. cudaStatus = cudaMemcpy(dev_a, a, N* sizeof(int), cudaMemcpyHostToDevice); cudaStatus = cudaMemcpy(dev_b, b, N* sizeof(int), cudaMemcpyHostToDevice); // Launch a kernel on the GPU with one thread for each element. addKernel<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(dev_c, dev_a, dev_b); // cudaThreadSynchronize waits for the kernel to finish, and returns // any errors encountered during the launch. cudaStatus = cudaThreadSynchronize(); // Copy output vector from GPU buffer to host memory. cudaStatus = cudaMemcpy(c, dev_c, N* sizeof(int), cudaMemcpyDeviceToHost); cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b); return cudaStatus; 8 Kernel __global__ void addKernel(int *c, const int *a, const int *b) { int i = threadIdx.x + blockIdx.x * blockDim.x; c[i] = a[i] + b[i]; } CUDA Example
  • 9. Pattern-Based Approach in Parallel Computing • Several parallel patterns – Existing high-level languages provide abstractions for many simple patterns • However there are complex patterns often found in scientific applications that are a challenge to be represented with software abstractions – Require manual code rewrite • Need additional features/extensions! – How do we approach this? (Our paper’s contribution) 9
  • 10. Application Motivation: Minisweep • A miniapp modeling wavefront sweep component of Denovo Radiation transport code from ORNL – Minisweep, a miniapp, represents 80-90% of Denovo • Denovo - part of DOE INCITE project, is used to model fusion reactor – CASL, ITER • Run many times with different parameters • The faster it runs, the more configurations we can explore • Poses a six dimensional problem • 3D in space, 2D in angular particle direction and 1D in particle energy 10
  • 11. Minisweep code status • Github: https://github.com/wdj/minisweep • Early application readiness on ORNL Titan • Being used for #1 TOP500 -> Summit acceptance testing • Has been ported to Beacon and Titan (ORNL machines) using OpenMP and CUDA 11
  • 14. Complex Parallel Pattern Identified: Wavefront 14
  • 15. Complex Parallel Pattern Identified: Wavefront 15
  • 16. Overview of Sweep Algorithm • 5 nested loops – X, Y, Z dimensions, Energy Groups, Angles – OpenACC/PGI only offers 2 levels of parallelism: gang and vector (worker clause not working properly) – Upstream data dependency 16
  • 17. 17
  • 18. 18
  • 19. Parallelizing Sweep Algorithm: KBA • Koch-Baker-Alcouffe (KBA) • Algorithm developed in 1992 at Los Alamos • Parallel sweep algorithm that overcomes some of the dependencies using a wavefront. 19 Image credit: High Performance Radiation Transport Simulations: Preparing for TITAN C. Baker, G. Davidson, T. M. Evans, S. Hamilton, J. Jarrell and W. Joubert, ORNL, USA
  • 20. Expressing Wavefront via Software Abstractions – A Challenge • Existing solutions involve manual rewrites, or compiler-based loop transformations – Michael Wolfe. 1986. Loop skewing: the wavefront method revisited. Int. J. Parallel Program. 15, 4 (October 1986), 279-293. DOI=http://dx.doi.org/10.1007/BF01407876 – Polyhedral frameworks, only support affine loops, ChiLL and Pluto • No solution in high-level languages like OpenMP/OpenACC; no software abstractions 20
  • 21. Our Contribution: Create Software Abstractions for Wavefront pattern • Analyzing flow of data and computation in wavefront codes • Memory and threading challenges • Wavefront loop transformation algorithm 21
  • 22. Abstract Parallelism Model for( iz=izbeg; iz!=izend; iz+=izinc ) for( iy=iybeg; iy!=iyend; iy+=iyinc ) for( ix=ixbeg; ix!=ixend; ix+=ixinc ) { // space for( ie=0; ie<dim_ne; ie++ ) { // energy for( ia=0; ia<dim_na; ia++ ) { // angles // in-gridcell computation } } } 22
  • 23. Abstract Parallelism Model • Spatial decomposition = outer layer (KBA) – No existing abstraction for this • In-gridcell computations = inner layer – Application specific • Upstream data dependencies – Slight variation between wavefront applications 23
  • 24. Data Model • Storing all previous wavefronts is unnecessary – How many neighbors and prior wavefronts are accessed? • Face arrays make indexing easy – Smaller data footprint • Limiting memory to the size of the largest wavefront is optimal, but not practical 24
  • 26. Programming Model Limitations • No abstraction for wavefront loop transformation – Manual loop restructuring • Limited layers of parallelism – 2 isn’t enough (worker is broken) – Asynchronous execution? 26
  • 27. Experimental Setup • NVIDIA PSG Cluster – CPU: Intel Xeon E5-2698 v3 (16-core) and Xeon E5-2690 v2 (10-core) – GPU: NVIDIA Tesla P100, Tesla V100, and Tesla K40 (4 GPUs per node) • ORNL Titan – CPU: AMD Opteron 6274 (16-core) – GPU: NVIDIA Tesla K20x • ORNL SummitDev – CPU: IBM Power8 (10-core) – GPU: NVIDIA Tesla P100 • PGI OpenACC Compiler 17.10 • OpenMP – GCC 6.2.0 – Issues running OpenMP minisweep code on Titan but works OK on PSG. 27
  • 28. Input Parameters • Scientifically – X/Y/Z dimensions = 64 – # Energy Groups = 64 – # Angles = 32 • Goal is to explore larger spatial dimensions 28
  • 29. 29
  • 30. Contributions • An abstract representation of wavefront algorithms • A performance portable proof-of-concept of this abstraction using OpenACC – Evaluation on multiple state-of-the-art platforms • A description of the limitations of existing high-level programming models 30
  • 31. Next Steps • Asynchronous execution • MPI - multi-node/multi-GPU • Develop a generalization/extension to existing high-level programming models – Prototype 31
  • 32. Preliminary Results/Ongoing Work • MPI + OpenACC – 1 node x 1 P100 GPU = 66.79x speedup – 4 nodes x 4 P100 GPUs/node = 565.81x speedup – 4 nodes x 4 V100 GPUs/node = 624.88x speedup • Distributing the workload lets us examine larger spatial dimensions – Future: Use blocking to allow for this on a single GPU 32
  • 33. Takeaway(s) • Using directives is not magical! Compilers are already doing a lot for us!  • Code benefits from incremental improvement – so let’s not give up!  • *Profiling and Re-profiling is highly critical* • Look for any serial code refactoring, if need be – Make the code parallel and accelerator-friendly • Watch out for compiler bugs and *report them* – The programmer is not ‘always’ wrong • Watch out for *novel language extensions and propose to the committee* - User feedback – Did you completely change the loop structure? Did you notice a parallel pattern for which we don’t have a high-level directive yet? 33
  • 34. Contributions • An abstract representation of wavefront algorithms • A performance portable proof-of-concept of this abstraction using directives, OpenACC – Evaluation on multiple state-of-the-art platforms • A description of the limitations of existing high-level programming models • Contact: rsearles@udel.edu • Github: https://github.com/rsearles35/minisweep 34
  • 37. 37 silica IFPEN, RMM-DIIS on P100 OPENACC GROWING MOMENTUM Wide Adoption Across Key HPC Codes 3 of Top 5 HPC Applications Use OpenACC ANSYS Fluent, Gaussian, VASP 40 core Broadwell 1 P100 2 P100 4 P100 0 1 2 3 4 5 Speed-up vasp_std (5.4.4) VASP, silica IFPEN, RMM-DIIS on P100 * OpenACC port covers more VASP routines than CUDA, OpenACC port planned top down, with complete analysis of the call tree, OpenACC port leverages improvements in latest VASP Fortran source base silica IFPEN, RMM-DIIS on P100 CAAR Codes Use OpenACC GTC XGC ACME FLASH LSDalton OpenACC Dominates in Climate & Weather Key Codes Globally COSMO, IFS(ESCAPE), NICAM, ICON, MPAS Gordon Bell Finalist CAM-SE on Sunway Taihulight
  • 38. 38 NUCLEAR REACTOR MODELING PROXY CODE : MINISWEEP  Minisweep, a miniapp, represents 80-90% of Denovo Sn code  Denovo Sn (discrete ordinate), part of DOE INCITE project, is used to model fusion reactor – CASL, ITER  Impact: By running Minisweep faster, experiments with more configurations can be performed directly impacting the determination of accuracy of radiation shielding  Poses a six dimensional problem  3D in space, 2D in angular particle direction and 1D in particle energy 38

Editor's Notes

  1. Simple Parallel: loop parallelism, recursion, task, master/worker, pipeline
  2. Robbie – I changed implementation to ->proof-of-concept Third bullet -> currently only slide 26 is for this bullet isn’t it? That is not enough. Think about adding CUDA and OpenMP limitations as 2 slides or in 1 slide, then this bullet will be covered.
  3. Mention that OpenMP 4.5 is now starting to support offloading since 2013, but that is not what it was originally intended for.
  4. Robbie –I removed SummitDev, you can mention it, our focus on this slide is Summit and Top500 Since you are presenting in Europe, I added ORNL to the machine names, coz not everyone is going to put 2 and 2 together that Titan, Beacon where ORNL machines.
  5. EXPLAIN GANG & VECTOR Talk about how to map wavefront using gang/vector
  6. EXPLAIN GANG & VECTOR
  7. EXPLAIN GANG & VECTOR
  8. Go through the motions of analyzing a complex parallel pattern Come up with an abstraction for the pattern in question Robbie – edited the bullet on polyhedral -> Polyhedral transformation tools require code refactoring; only supports affine loops For your notes-> Typical limitations of the polyhedral model is essentially anything that makes the code non-affine. These include array indirections, data-dependent conditionals, linearized arrays and use of pointers.  These limitations need to be addressed before broadening the applicability of polyhedral transformations.
  9. Go through the motions of analyzing a complex parallel pattern Come up with an abstraction for the pattern in question
  10. EXPLAIN FACE ARRAYS! Optimal memory size implies brutal index calculations.
  11. Note that minisweep actually runs 8 sweeps (one per direction) Need a 3rd layer of parallelism for this!
  12. Robbie – didn’t we have results on SummitDev? It is in the paper, so add that to the second bullet? Its OK if you are showing titan results in your chart. Keep the absolute runtime numbers on summitdev as a backup slide. Someone might ask about absolute runtime as people don’t like speedup chart
  13. .
  14. SC: After this slide, bring back the contribution slide, and we have to discuss what we have contributed to these proposed bullets. Because contribution slides means we have contributed something. So how about? Creating standardized high-level abstraction for complex parallel patterns Evaluate abstraction on multiple state-of-the-art platforms to study performance/portability Apply abstraction to real-world scientific applications of interest to Exascale projects
  15. Robbie – this can be your last slide and then switch to the contribution slide and then put the next slide up while taking questions.
  16. Center for Accelerated Application Readiness (CAAR) VASP material science Gaussian Quantum Chemistry ANSYS Fluent, CFD Just keep this slide, in case someone goes like who is using OpenACC, you can bring this up pretty quickly.
  17. current DOE INCITE project to model the International Thermonuclear Experimental Reactor (ITER) fusion reactor and is the primary radiation transport code used in the multi-industry Consortium for Advanced Simulation of Light Water Reactors (CASL) [2] headquartered at Oak Ridge National Laboratory (ORNL). impact of the Fukushima nuclear accident in Japan, highlights the need to increase the production of safe, abundant quantities of energy A transport solver that incorporates an adequately resolved discretization for all scales using current discrete models would require 1017–1021 degrees of freedom per step, which is well beyond even exascale resources. Space: in the x , y and z dimensions, each gridcell depends on results from three upstream cells, based on octant direction, resulting in a wavefront problem. Octant: results for different octants can be calculated indepen- dently, the only dependency being that results from different octants at the same entry of vo′ut are added together and thus are a race hazard, depending on how the computation is done. Energy: computations for different energy groups have no coupling and can be considered separate problem instances, enabling flexible parallelization. One of the six applications selected for ORNL CAAR project Similar to Sn wavefront codes including Kripke, Sn (Discrete Ordinates) Application Proxy (SNAP) and PARTISN