SlideShare a Scribd company logo
1 of 37
LLVM-based Communication Optimizations
for PGAS Programs
Akihiro Hayashi (Rice University)
Jisheng Zhao (Rice University)
Michael Ferguson (Cray Inc.)
Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15
1
A Big Picture
2
Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
© Argonne National Lab.
© RIKEN AICS
©Berkeley Lab.
X10,
Habanero-UPC++,…
PGAS Languages
High-productivity features:
 Global-View
 Task parallelism
 Data Distribution
 Synchronization
3
X10
Photo Credits : http://chapel.cray.com/logo.html, http://upc.lbl.gov/
CAFHabanero-UPC++
Communication is implicit
in some PGAS Programming Models
Global Address Space
 Compiler and Runtime is responsible for
performing communications across nodes
4
1: var x = 1; // on Node 0
2: on Locales[1] {// on Node 1
3: … = x; // DATA ACCESS
4: }
Remote Data Access in Chapel
Communication is Implicit
in some PGAS Programming Models (Cont’d)
5
1: var x = 1; // on Node 0
2: on Locales[1] {// on Node 1
3: … = x; // DATA ACCESS
Remote Data Access
if (x.locale == MYLOCALE) {
*(x.addr) = 1;
} else {
gasnet_get(…);
}
Runtime affinity handling
1: var x = 1;
2: on Locales[1] {
3: … = 1;
Compiler Optimization
OR
Communication Optimization
is Important
6
A synthetic Chapel program
on Intel Xeon CPU X5660 Clusters with QDR Inifiniband
1
10
100
1000
10000
100000
1000000
10000000
Latency(ms)
Transferred Byte
Optimized (Bulk Transfer) Unoptimized
1,500x
59x
Lower is better
PGAS Optimizations are
language-specific
7
Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
X10,
Habanero-UPC++,…
© Argonne National Lab.
©Berkeley Lab.
© RIKEN AICS
Chapel Compiler
UPC Compiler
X10 Compiler
Habanero-C Compiler
Our goal
8
Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
© Argonne National Lab.
© RIKEN AICS
©Berkeley Lab.
X10,
Habanero-UPC++,…
Why LLVM?
Widely used language-agnostic compiler
9
LLVM Intermediate Representation (LLVM IR)
C/C++
Frontend
Clang
C/C++, Fortran, Ada, Objective-C
Frontend
dragonegg
Chapel
Frontend
x86
backend
Power PC
backend
ARM
backend
PTX
backend
Analysis & Optimizations
UPC++
Frontend
x86 Binary PPC Binary ARM Binary GPU Binary
Summary & Contributions
 Our Observations :
 Many PGAS languages share semantically similar
constructs
 PGAS Optimizations are language-specific
 Contributions:
 Built a compilation framework that can uniformly optimize
PGAS programs (Initial Focus : Communication)
 Enabling existing LLVM passes for communication optimizations
 PGAS-aware communication optimizations 10Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
Overview of our framework
11
Chapel
Programs
UPC++
Programs
X10
Programs
Chapel-
LLVM
frontend
UPC++-
LLVM
frontend
X10-LLVM
frontend
LLVM IR
LLVM-based
Communication
Optimization
passes
Lowering
Pass
CAF
Programs
CAF-LLVM
frontend
1. Vanilla LLVM IR
2. use address space feature
to express communications
Need to be implemented
when supporting a new language/runtime
Generally language-agnostic
How optimizations work
12
store i64 1, i64 addrspace(100)* %x, …
// x is possibly remote
x = 1;
Chapel
shared_var<int> x;
x = 1;
UPC++
1.Existing LLVM
Optimizations
2.PGAS-aware
Optimizations
treat remote
access as if it
were local
access
Runtime-Specific Lowering
Address
space-aware
Optimizations
Communication API Calls
LLVM-based Communication
Optimizations for Chapel
1. Enabling Existing LLVM passes
 Loop invariant code motion (LICM)
 Scalar replacement, …
2. Aggregation
 Combine sequences of loads/stores on
adjacent memory location into a single
memcpy
13These are already implemented in the standard Chapel compiler
An optimization example:
LICM for Communication Optimizations
14LICM = Loop Invariant Code Motion
for i in 1..100 {
%x = load i64 addrspace(100)* %xptr
A(i) = %x;
}
LICM by LLVM
An optimization example:
Aggregation
15
// p is possibly remote
sum = p.x + p.y;
llvm.memcpy(…);
load i64 addrspace(100)* %pptr+0
load i64 addrspace(100)* %pptr+4
x y
GET GET
GET
LLVM-based Communication
Optimizations for Chapel
3. Locality Optimization
 Infer the locality of data and convert
possibly-remote access to definitely-local
access at compile-time if possible
4. Coalescing
 Remote array access vectorization
16These are implemented, but not in the standard Chapel compiler
An Optimization example:
Locality Optimization
17
1: proc habanero(ref x, ref y, ref z) {
2: var p: int = 0;
3: var A:[1..N] int;
4: local { p = z; }
5: z = A(0) + z;
6:}
2.p and z are
definitely local
3.Definitely-local access!
(avoid runtime affinity checking)
1.A is definitely-
local
An Optimization example:
Coalescing
18
1:for i in 1..N {
2: … = A(i);
3:}
1:localA = A;
2:for i in 1..N {
3: … = localA(i);
4:}
Perform bulk
transfer
Converted to
definitely-local
access
Before
After
Performance Evaluations:
Benchmarks
19
Application Size
Smith-Waterman 185,600 x 192,000
Cholesky Decomp 10,000 x 10,000
NPB EP CLASS = D
Sobel 48,000 x 48,000
SSCA2 Kernel 4 SCALE = 16
Stream EP 2^30
Performance Evaluations:
Platforms
 Cray XC30™ Supercomputer @ NERSC
 Node
 Intel Xeon E5-2695 @ 2.40GHz x 24 cores
 64GB of RAM
 Interconnect
 Cray Aries interconnect with Dragonfly topology
 Westmere Cluster @ Rice
 Node
 Intel Xeon CPU X5660 @ 2.80GHz x 12 cores
 48 GB of RAM
 Interconnect
 Quad-data rated infiniband
20
Performance Evaluations:
Details of Compiler & Runtime
 Compiler
 Chapel Compiler version 1.9.0
 LLVM 3.3
 Runtime :
 GASNet-1.22.0
 Cray XC : aries
 Westmere Cluster : ibv-conduit
 Qthreads-1.10
 Cray XC: 2 shepherds, 24 workers / shepherd
 Westmere Cluster : 2 shepherds, 6 workers / shepherd
21
BRIEF SUMMARY OF
PERFORMANCE EVALUATIONS
Performance Evaluation
22
0.0
1.0
2.0
3.0
4.0
5.0
SW Cholesky Sobel StreamEP EP SSCA2
PerformanceImprovement
overLLVM-unopt
Coalescing Locality Opt
Aggregation Existing
Results on the Cray XC
(LLVM-unopt vs. LLVM-allopt)
23
 4.6x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
2.1x
19.5x
1.1x
2.4x
1.4x 1.3x
Higher is better
0.0
1.0
2.0
3.0
4.0
5.0
SW Cholesky Sobel StreamEP EP SSCA2
PerformanceImprovement
overLLVM-unopt
Coalescing Locality Opt
Aggregation Existing
Results on Westmere Cluster
(LLVM-unopt vs. LLVM-allopt)
24
2.3x
16.9x
1.1x
2.5x
1.3x
2.3x
 4.4x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
DETAILED RESULTS & ANALYSIS
OF CHOLESKY DECOMPOSITION
Performance Evaluation
25
Cholesky Decomposition
26
dependencies0
0
0
1
1
2
2
2
3
3
0
0
1
1
2
2
2
3
3
0
1
1
2
2
2
3
3
1
1
2
2
2
3
3
1
2
2
2
3
3
2
2
2
3
3
2
2
3
3
2
3
3
3
3 3
Node0
Node1
Node2
Node3
Metrics
1. Performance & Scalability
 Baseline (LLVM-unopt)
 LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API
calls
3. Analysis of optimized code
4. Performance comparison
 Conventional C-backend vs. LLVM-backend
27
Performance Improvement
by LLVM (Cholesky on the Cray XC)
28
1
0.1 0.1 0.2 0.2 0.3
2.6 2.7
3.7
4.1 4.3 4.5
0
1
2
3
4
5
1 locale 2 locales 4 locales 8 locales 16 locales 32 locales
SpeedupoverLLVM-unopt
1locale
LLVM-unopt LLVM-allopt
 LLVM-based communication optimizations show scalability
Communication API calls elimination
by LLVM (Cholesky on the Cray XC)
29
100.0% 100.0% 100.0% 100.0%
12.1%
0.2%
89.2%
100.0%
0
0.2
0.4
0.6
0.8
1
1.2
LOCAL_GET REMOTE_GET LOCAL_PUT REMOTE_PUT
Dynamicnumberof
communicationAPIcalls
(normalizedtoLLVM-unopt)
LLVM-unopt LLVM-allopt
8.3x
improvement
500x
improvement
1.1x
improvement
Analysis of optimized code
30
for jB in zero..tileSize-1 {
for kB in zero..tileSize-1 {
4GETS
for iB in zero..tileSize-1 {
9GETS + 1PUT
}}}
1.ALLOCATE LOCAL BUFFER
2.PERFORM BULK TRANSFER
for jB in zero..tileSize-1 {
for kB in zero..tileSize-1 {
1GET
for iB in zero..tileSize-1 {
1GET + 1PUT
}}}
LLVM-unopt LLVM-allopt
Performance comparison with
C-backend
31
2.8
0.1 0.1 0.1 0.3
0.7
0.4
1
0.1 0.1 0.2 0.2 0.3 0.4
2.6 2.7
3.7
4.1 4.3 4.5 4.5
0
1
2
3
4
5
1 locale 2 locales 4 locales 8 locales 16 locales 32 locales 64 locales
SpeedupoverLLVM-unopt
1locale
C-backend LLVM-unopt LLVM-allopt
C-backend is faster!
Current limitation
 In LLVM 3.3, many optimizations assume that the pointer
size is the same across all address spaces
32
Locale
(16bit)
addr
(48bit)
For LLVM Code Generation :
64bit packed pointer
For C Code Generation :
128bit struct pointer
ptr.locale;
ptr.addr;
ptr >> 48
ptr | 48BITS_MASK;
1. Needs more instructions
2. Lose opportunities for Alias
analysis
Conclusions
LLVM-based Communication optimizations
for PGAS Programs
 Promising way to optimize PGAS programs in a
language-agnostic manner
 Preliminary Evaluation with 6 Chapel applications
Cray XC30 Supercomputer
– 4.6x average performance improvement
Westmere Cluster
– 4.4x average performance improvement
33
Future work
Extend LLVM IR to support parallel programs
with PGAS and explicit task parallelism
 Higher-level IR
34
Parallel
Programs
(Chapel, X10,
CAF, HC, …)
1.RI-PIR Gen
2.Analysis
3.Transformation
1.RS-PIR Gen
2.Analysis
3.Transformation
LLVM
Runtime-Independent
Optimizations
e.g. Task Parallel Construct
LLVM
Runtime-Specific
Optimizations
e.g. GASNet API
Binary
Acknowledgements
Special thanks to
 Brad Chamberlain (Cray)
 Rafael Larrosa Jimenez (UMA)
 Rafael Asenjo Plaza (UMA)
 Habanero Group at Rice
35
Backup slides
36
Compilation Flow
37
Chapel
Programs
AST Generation and Optimizations
C-code Generation
LLVM Optimizations
Backend Compiler’s Optimizations
(e.g. gcc –O3)
LLVM IRC Programs
LLVM IR Generation
Binary Binary

More Related Content

What's hot

Course on TCP Dynamic Performance
Course on TCP Dynamic PerformanceCourse on TCP Dynamic Performance
Course on TCP Dynamic PerformanceJavier Arauz
 
CILK/CILK++ and Reducers
CILK/CILK++ and ReducersCILK/CILK++ and Reducers
CILK/CILK++ and ReducersYunming Zhang
 
eBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniqueseBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniquesNetronome
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
BPF - All your packets belong to me
BPF - All your packets belong to meBPF - All your packets belong to me
BPF - All your packets belong to me_xhr_
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityLinaro
 
Plane Spotting
Plane SpottingPlane Spotting
Plane SpottingTed Coyle
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureNetronome
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18Yaoqing Gao
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVELinaro
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGATO project
 
[Webinar Slides] Programming the Network Dataplane in P4
[Webinar Slides] Programming the Network Dataplane in P4[Webinar Slides] Programming the Network Dataplane in P4
[Webinar Slides] Programming the Network Dataplane in P4Open Networking Summits
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportLinaro
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVELinaro
 

What's hot (20)

Course on TCP Dynamic Performance
Course on TCP Dynamic PerformanceCourse on TCP Dynamic Performance
Course on TCP Dynamic Performance
 
CILK/CILK++ and Reducers
CILK/CILK++ and ReducersCILK/CILK++ and Reducers
CILK/CILK++ and Reducers
 
eBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniqueseBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current Techniques
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
BPF - All your packets belong to me
BPF - All your packets belong to meBPF - All your packets belong to me
BPF - All your packets belong to me
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
 
Plane Spotting
Plane SpottingPlane Spotting
Plane Spotting
 
eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging Infrastructure
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
 
[Webinar Slides] Programming the Network Dataplane in P4
[Webinar Slides] Programming the Network Dataplane in P4[Webinar Slides] Programming the Network Dataplane in P4
[Webinar Slides] Programming the Network Dataplane in P4
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
DPDK KNI interface
DPDK KNI interfaceDPDK KNI interface
DPDK KNI interface
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 

Viewers also liked

Compiler optimization
Compiler optimizationCompiler optimization
Compiler optimizationZongYing Lyu
 
UPC and OpenMP Parallel Programming and Analysis in PTP with CDT
UPC and OpenMP Parallel Programming and Analysis in PTP with CDTUPC and OpenMP Parallel Programming and Analysis in PTP with CDT
UPC and OpenMP Parallel Programming and Analysis in PTP with CDTbethtib
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Open mp library functions and environment variables
Open mp library functions and environment variablesOpen mp library functions and environment variables
Open mp library functions and environment variablesSuveeksha
 
Multi-Processor computing with OpenMP
Multi-Processor computing with OpenMPMulti-Processor computing with OpenMP
Multi-Processor computing with OpenMPStefan Coetzee
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open MpAnshul Sharma
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionCherryBerry2
 
Intel parallel programming
Intel parallel programmingIntel parallel programming
Intel parallel programmingNirma University
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
Early Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelChunhua Liao
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMPjbp4444
 

Viewers also liked (20)

Compiler optimization
Compiler optimizationCompiler optimization
Compiler optimization
 
UPC and OpenMP Parallel Programming and Analysis in PTP with CDT
UPC and OpenMP Parallel Programming and Analysis in PTP with CDTUPC and OpenMP Parallel Programming and Analysis in PTP with CDT
UPC and OpenMP Parallel Programming and Analysis in PTP with CDT
 
Open mp intro_01
Open mp intro_01Open mp intro_01
Open mp intro_01
 
Open mp 1i
Open mp 1iOpen mp 1i
Open mp 1i
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016Agathos-PHD-uoi-2016
Agathos-PHD-uoi-2016
 
Openmp combined
Openmp combinedOpenmp combined
Openmp combined
 
Open mp library functions and environment variables
Open mp library functions and environment variablesOpen mp library functions and environment variables
Open mp library functions and environment variables
 
Multi-Processor computing with OpenMP
Multi-Processor computing with OpenMPMulti-Processor computing with OpenMP
Multi-Processor computing with OpenMP
 
Open MP cheet sheet
Open MP cheet sheetOpen MP cheet sheet
Open MP cheet sheet
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open Mp
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
 
Intel parallel programming
Intel parallel programmingIntel parallel programming
Intel parallel programming
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
OpenMP
OpenMPOpenMP
OpenMP
 
How to Use OpenMP on Native Activity
How to Use OpenMP on Native ActivityHow to Use OpenMP on Native Activity
How to Use OpenMP on Native Activity
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
 
Early Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator Model
 
Open MP
Open MPOpen MP
Open MP
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 

Similar to LLVM-based Communication Optimizations for PGAS Programs

Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesAkihiro Hayashi
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Fwdays
 
Tungsten Fabric Overview
Tungsten Fabric OverviewTungsten Fabric Overview
Tungsten Fabric OverviewMichelle Holley
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogJoe Stein
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
State of Linux Containers for HPC
State of Linux Containers for HPCState of Linux Containers for HPC
State of Linux Containers for HPCinside-BigData.com
 
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSPutting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSLightbend
 
StrongLoop Overview
StrongLoop OverviewStrongLoop Overview
StrongLoop OverviewShubhra Kar
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Dragonflow 01 2016 TLV meetup
Dragonflow 01 2016 TLV meetup  Dragonflow 01 2016 TLV meetup
Dragonflow 01 2016 TLV meetup Eran Gampel
 
NoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackNoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackSadayuki Furuhashi
 
NoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackNoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackSadayuki Furuhashi
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane Michelle Holley
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0ScyllaDB
 
NYC Docker Meetup: Contiv networking on Docker
NYC Docker Meetup: Contiv networking on DockerNYC Docker Meetup: Contiv networking on Docker
NYC Docker Meetup: Contiv networking on DockerSanjeev Rampal
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...Edge AI and Vision Alliance
 

Similar to LLVM-based Communication Optimizations for PGAS Programs (20)

Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
 
Tungsten Fabric Overview
Tungsten Fabric OverviewTungsten Fabric Overview
Tungsten Fabric Overview
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
State of Linux Containers for HPC
State of Linux Containers for HPCState of Linux Containers for HPC
State of Linux Containers for HPC
 
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSPutting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS
 
StrongLoop Overview
StrongLoop OverviewStrongLoop Overview
StrongLoop Overview
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Introduction to AirWave 10
Introduction to AirWave 10Introduction to AirWave 10
Introduction to AirWave 10
 
Dragonflow 01 2016 TLV meetup
Dragonflow 01 2016 TLV meetup  Dragonflow 01 2016 TLV meetup
Dragonflow 01 2016 TLV meetup
 
NoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackNoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePack
 
NoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackNoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePack
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0
 
NYC Docker Meetup: Contiv networking on Docker
NYC Docker Meetup: Contiv networking on DockerNYC Docker Meetup: Contiv networking on Docker
NYC Docker Meetup: Contiv networking on Docker
 
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre..."APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
"APIs for Accelerating Vision and Inferencing: Options and Trade-offs," a Pre...
 

More from Akihiro Hayashi

GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsAkihiro Hayashi
 
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Akihiro Hayashi
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral CompilationAkihiro Hayashi
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Akihiro Hayashi
 
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Akihiro Hayashi
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAkihiro Hayashi
 

More from Akihiro Hayashi (9)

GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral Compilation
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
 
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
 

Recently uploaded

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

LLVM-based Communication Optimizations for PGAS Programs

  • 1. LLVM-based Communication Optimizations for PGAS Programs Akihiro Hayashi (Rice University) Jisheng Zhao (Rice University) Michael Ferguson (Cray Inc.) Vivek Sarkar (Rice University) 2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15 1
  • 2. A Big Picture 2 Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/ © Argonne National Lab. © RIKEN AICS ©Berkeley Lab. X10, Habanero-UPC++,…
  • 3. PGAS Languages High-productivity features:  Global-View  Task parallelism  Data Distribution  Synchronization 3 X10 Photo Credits : http://chapel.cray.com/logo.html, http://upc.lbl.gov/ CAFHabanero-UPC++
  • 4. Communication is implicit in some PGAS Programming Models Global Address Space  Compiler and Runtime is responsible for performing communications across nodes 4 1: var x = 1; // on Node 0 2: on Locales[1] {// on Node 1 3: … = x; // DATA ACCESS 4: } Remote Data Access in Chapel
  • 5. Communication is Implicit in some PGAS Programming Models (Cont’d) 5 1: var x = 1; // on Node 0 2: on Locales[1] {// on Node 1 3: … = x; // DATA ACCESS Remote Data Access if (x.locale == MYLOCALE) { *(x.addr) = 1; } else { gasnet_get(…); } Runtime affinity handling 1: var x = 1; 2: on Locales[1] { 3: … = 1; Compiler Optimization OR
  • 6. Communication Optimization is Important 6 A synthetic Chapel program on Intel Xeon CPU X5660 Clusters with QDR Inifiniband 1 10 100 1000 10000 100000 1000000 10000000 Latency(ms) Transferred Byte Optimized (Bulk Transfer) Unoptimized 1,500x 59x Lower is better
  • 7. PGAS Optimizations are language-specific 7 Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/ X10, Habanero-UPC++,… © Argonne National Lab. ©Berkeley Lab. © RIKEN AICS Chapel Compiler UPC Compiler X10 Compiler Habanero-C Compiler
  • 8. Our goal 8 Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/ © Argonne National Lab. © RIKEN AICS ©Berkeley Lab. X10, Habanero-UPC++,…
  • 9. Why LLVM? Widely used language-agnostic compiler 9 LLVM Intermediate Representation (LLVM IR) C/C++ Frontend Clang C/C++, Fortran, Ada, Objective-C Frontend dragonegg Chapel Frontend x86 backend Power PC backend ARM backend PTX backend Analysis & Optimizations UPC++ Frontend x86 Binary PPC Binary ARM Binary GPU Binary
  • 10. Summary & Contributions  Our Observations :  Many PGAS languages share semantically similar constructs  PGAS Optimizations are language-specific  Contributions:  Built a compilation framework that can uniformly optimize PGAS programs (Initial Focus : Communication)  Enabling existing LLVM passes for communication optimizations  PGAS-aware communication optimizations 10Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
  • 11. Overview of our framework 11 Chapel Programs UPC++ Programs X10 Programs Chapel- LLVM frontend UPC++- LLVM frontend X10-LLVM frontend LLVM IR LLVM-based Communication Optimization passes Lowering Pass CAF Programs CAF-LLVM frontend 1. Vanilla LLVM IR 2. use address space feature to express communications Need to be implemented when supporting a new language/runtime Generally language-agnostic
  • 12. How optimizations work 12 store i64 1, i64 addrspace(100)* %x, … // x is possibly remote x = 1; Chapel shared_var<int> x; x = 1; UPC++ 1.Existing LLVM Optimizations 2.PGAS-aware Optimizations treat remote access as if it were local access Runtime-Specific Lowering Address space-aware Optimizations Communication API Calls
  • 13. LLVM-based Communication Optimizations for Chapel 1. Enabling Existing LLVM passes  Loop invariant code motion (LICM)  Scalar replacement, … 2. Aggregation  Combine sequences of loads/stores on adjacent memory location into a single memcpy 13These are already implemented in the standard Chapel compiler
  • 14. An optimization example: LICM for Communication Optimizations 14LICM = Loop Invariant Code Motion for i in 1..100 { %x = load i64 addrspace(100)* %xptr A(i) = %x; } LICM by LLVM
  • 15. An optimization example: Aggregation 15 // p is possibly remote sum = p.x + p.y; llvm.memcpy(…); load i64 addrspace(100)* %pptr+0 load i64 addrspace(100)* %pptr+4 x y GET GET GET
  • 16. LLVM-based Communication Optimizations for Chapel 3. Locality Optimization  Infer the locality of data and convert possibly-remote access to definitely-local access at compile-time if possible 4. Coalescing  Remote array access vectorization 16These are implemented, but not in the standard Chapel compiler
  • 17. An Optimization example: Locality Optimization 17 1: proc habanero(ref x, ref y, ref z) { 2: var p: int = 0; 3: var A:[1..N] int; 4: local { p = z; } 5: z = A(0) + z; 6:} 2.p and z are definitely local 3.Definitely-local access! (avoid runtime affinity checking) 1.A is definitely- local
  • 18. An Optimization example: Coalescing 18 1:for i in 1..N { 2: … = A(i); 3:} 1:localA = A; 2:for i in 1..N { 3: … = localA(i); 4:} Perform bulk transfer Converted to definitely-local access Before After
  • 19. Performance Evaluations: Benchmarks 19 Application Size Smith-Waterman 185,600 x 192,000 Cholesky Decomp 10,000 x 10,000 NPB EP CLASS = D Sobel 48,000 x 48,000 SSCA2 Kernel 4 SCALE = 16 Stream EP 2^30
  • 20. Performance Evaluations: Platforms  Cray XC30™ Supercomputer @ NERSC  Node  Intel Xeon E5-2695 @ 2.40GHz x 24 cores  64GB of RAM  Interconnect  Cray Aries interconnect with Dragonfly topology  Westmere Cluster @ Rice  Node  Intel Xeon CPU X5660 @ 2.80GHz x 12 cores  48 GB of RAM  Interconnect  Quad-data rated infiniband 20
  • 21. Performance Evaluations: Details of Compiler & Runtime  Compiler  Chapel Compiler version 1.9.0  LLVM 3.3  Runtime :  GASNet-1.22.0  Cray XC : aries  Westmere Cluster : ibv-conduit  Qthreads-1.10  Cray XC: 2 shepherds, 24 workers / shepherd  Westmere Cluster : 2 shepherds, 6 workers / shepherd 21
  • 22. BRIEF SUMMARY OF PERFORMANCE EVALUATIONS Performance Evaluation 22
  • 23. 0.0 1.0 2.0 3.0 4.0 5.0 SW Cholesky Sobel StreamEP EP SSCA2 PerformanceImprovement overLLVM-unopt Coalescing Locality Opt Aggregation Existing Results on the Cray XC (LLVM-unopt vs. LLVM-allopt) 23  4.6x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales) 2.1x 19.5x 1.1x 2.4x 1.4x 1.3x Higher is better
  • 24. 0.0 1.0 2.0 3.0 4.0 5.0 SW Cholesky Sobel StreamEP EP SSCA2 PerformanceImprovement overLLVM-unopt Coalescing Locality Opt Aggregation Existing Results on Westmere Cluster (LLVM-unopt vs. LLVM-allopt) 24 2.3x 16.9x 1.1x 2.5x 1.3x 2.3x  4.4x performance improvement relative to LLVM-unopt on the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
  • 25. DETAILED RESULTS & ANALYSIS OF CHOLESKY DECOMPOSITION Performance Evaluation 25
  • 27. Metrics 1. Performance & Scalability  Baseline (LLVM-unopt)  LLVM-based Optimizations (LLVM-allopt) 2. The dynamic number of communication API calls 3. Analysis of optimized code 4. Performance comparison  Conventional C-backend vs. LLVM-backend 27
  • 28. Performance Improvement by LLVM (Cholesky on the Cray XC) 28 1 0.1 0.1 0.2 0.2 0.3 2.6 2.7 3.7 4.1 4.3 4.5 0 1 2 3 4 5 1 locale 2 locales 4 locales 8 locales 16 locales 32 locales SpeedupoverLLVM-unopt 1locale LLVM-unopt LLVM-allopt  LLVM-based communication optimizations show scalability
  • 29. Communication API calls elimination by LLVM (Cholesky on the Cray XC) 29 100.0% 100.0% 100.0% 100.0% 12.1% 0.2% 89.2% 100.0% 0 0.2 0.4 0.6 0.8 1 1.2 LOCAL_GET REMOTE_GET LOCAL_PUT REMOTE_PUT Dynamicnumberof communicationAPIcalls (normalizedtoLLVM-unopt) LLVM-unopt LLVM-allopt 8.3x improvement 500x improvement 1.1x improvement
  • 30. Analysis of optimized code 30 for jB in zero..tileSize-1 { for kB in zero..tileSize-1 { 4GETS for iB in zero..tileSize-1 { 9GETS + 1PUT }}} 1.ALLOCATE LOCAL BUFFER 2.PERFORM BULK TRANSFER for jB in zero..tileSize-1 { for kB in zero..tileSize-1 { 1GET for iB in zero..tileSize-1 { 1GET + 1PUT }}} LLVM-unopt LLVM-allopt
  • 31. Performance comparison with C-backend 31 2.8 0.1 0.1 0.1 0.3 0.7 0.4 1 0.1 0.1 0.2 0.2 0.3 0.4 2.6 2.7 3.7 4.1 4.3 4.5 4.5 0 1 2 3 4 5 1 locale 2 locales 4 locales 8 locales 16 locales 32 locales 64 locales SpeedupoverLLVM-unopt 1locale C-backend LLVM-unopt LLVM-allopt C-backend is faster!
  • 32. Current limitation  In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces 32 Locale (16bit) addr (48bit) For LLVM Code Generation : 64bit packed pointer For C Code Generation : 128bit struct pointer ptr.locale; ptr.addr; ptr >> 48 ptr | 48BITS_MASK; 1. Needs more instructions 2. Lose opportunities for Alias analysis
  • 33. Conclusions LLVM-based Communication optimizations for PGAS Programs  Promising way to optimize PGAS programs in a language-agnostic manner  Preliminary Evaluation with 6 Chapel applications Cray XC30 Supercomputer – 4.6x average performance improvement Westmere Cluster – 4.4x average performance improvement 33
  • 34. Future work Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism  Higher-level IR 34 Parallel Programs (Chapel, X10, CAF, HC, …) 1.RI-PIR Gen 2.Analysis 3.Transformation 1.RS-PIR Gen 2.Analysis 3.Transformation LLVM Runtime-Independent Optimizations e.g. Task Parallel Construct LLVM Runtime-Specific Optimizations e.g. GASNet API Binary
  • 35. Acknowledgements Special thanks to  Brad Chamberlain (Cray)  Rafael Larrosa Jimenez (UMA)  Rafael Asenjo Plaza (UMA)  Habanero Group at Rice 35
  • 37. Compilation Flow 37 Chapel Programs AST Generation and Optimizations C-code Generation LLVM Optimizations Backend Compiler’s Optimizations (e.g. gcc –O3) LLVM IRC Programs LLVM IR Generation Binary Binary

Editor's Notes

  1. I choose cholesky decomposition to give you more detailed information about our performance evaluation & analysis. If you are interested in other applications please see the paper.
  2. The rightmost one Second one from the left
  3. But using LLVM has drawback. Chapel uses wide pointer to associate data with node. Wide pointer is as C-struct and you can extract nodeID and address by dot operator.