LLVM Optimizations for PGAS Programs
-Case Study: LLVM Wide Pointer Optimizations in Chapel-
CHIUW2014 (co-located with IPDPS 2014),
Phoenix, Arizona
Akihiro Hayashi, Rishi Surendran,
Jisheng Zhao, Vivek Sarkar
(Rice University),
Michael Ferguson
(Laboratory for Telecommunication Sciences)
1
Background: Programming Model
for Large-scale Systems
Message Passing Interface (MPI) is a
ubiquitous programming model
but introduces non-trivial complexity due to
message passing semantics
PGAS languages such as Chapel, X10,
Habanero-C and Co-array Fortran
provide high-productivity features:
Task parallelism
Data Distribution
Synchronization
2
Motivation:
Chapel Support for LLVM
Widely used and easy to extend
3
A Big Picture
[Diagram] C/C++ enters through the Clang frontend; C/C++, Fortran, Ada, and Objective-C through the dragonegg frontend; Chapel through the chpl compiler; UPC, Habanero-C, and others through their own compilers. All emit LLVM Intermediate Representation (IR), which is shared by LLVM's analyses & optimizations and then lowered by the x86, PowerPC, ARM, and PTX backends into x86, PPC, ARM, and GPU binaries.
4
Pictures borrowed from http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, https://www.olcf.ornl.gov/titan/
Our ultimate goal: A compiler that can
uniformly optimize PGAS Programs
Extend LLVM IR to support parallel programs
with PGAS and explicit task parallelism
 Two parallel intermediate representations (PIR) as
extensions to LLVM IR
(Runtime-Independent, Runtime-Specific)
5
[Diagram] Parallel programs (Chapel, X10, CAF, HC, …) flow through two stages, both in LLVM: 1. RI-PIR generation, analysis, and transformation (runtime-independent optimizations, e.g. task-parallel constructs), then 2. RS-PIR generation, analysis, and transformation (runtime-specific optimizations, e.g. GASNet API), before binary generation.
The first step:
LLVM-based Chapel compiler
6
 Chapel compiler supports LLVM IR generation
 This talk discusses the pros and cons of LLVM-based
communication optimizations for Chapel
 Wide pointer optimization
 Preliminary performance evaluation & analysis using
three regular applications
Chapel language
An object-oriented PGAS language
developed by Cray Inc.
Part of DARPA HPCS program
Key features
Array Operators: zip, replicate, remap,...
Explicit Task Parallelism: begin, cobegin
Locality Control: Locales
Data-Distribution: domain maps
Synchronizations: sync
7
Compilation Flow
8
[Diagram] Chapel Programs
→ AST Generation and Optimizations
→ then one of two paths:
C-code Generation → C Programs → Backend Compiler's Optimizations (e.g. gcc -O3) → Binary
LLVM IR Generation → LLVM IR → LLVM Optimizations → Binary
The Pros and Cons of using
LLVM for Chapel
Pro: Using address space feature of LLVM
offers more opportunities for
communication optimization than C gen
9
// Chapel
x = remoteData;

With C-code generation:
chpl_comm_get(&x, …);
→ Backend compiler's optimizations (e.g. gcc -O3). Few chances of optimization because remote accesses are already lowered to Chapel comm APIs.

With LLVM IR generation:
%x = load i64 addrspace(100)* %xptr
→ LLVM optimizations (e.g. LICM, scalar replacement).
1. The existing LLVM passes can be used for communication optimizations.
2. Lowered to Chapel comm APIs after optimizations.
Address Space 100 generation
in Chapel
 Address space 100 = possibly-remote
(our convention)
 Constructs which generate address space 100
 Array Load/Store (Except Local constructs)
 Distributed Array
 var d = {1..128} dmapped Block(boundingBox={1..128});
 var A: [d] int;
 Object and Field Load/ Store
 class circle { var radius: real; … }
 var c1 = new circle(radius=1.0);
 On statement
 var loc0: int;
 on Locales[1] { loc0 = …; }
 Ref intent
 proc habanero(ref v: int): void { v = …; }
10
Except remote value
forwarding optimization
Motivating Example of address
space 100
11
(Pseudo-Code: Before LICM)
for i in 1..N {
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
A(i) = %x;
}
(Pseudo-Code: After LICM)
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
for i in 1..N {
A(i) = %x;
}
LICM by
LLVM
LICM = Loop Invariant Code Motion
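The communication saving from LICM can be modeled with a toy counter; remote_get below is a purely hypothetical stand-in for one GET round-trip, not a Chapel or LLVM API:

```python
# Toy model of remote GETs in a PGAS loop, before and after
# Loop Invariant Code Motion (LICM).

def run(n, hoist):
    gets = 0
    remote_x = 7          # value living on a remote locale

    def remote_get():     # stands in for one communication round-trip
        nonlocal gets
        gets += 1
        return remote_x

    A = [0] * n
    if hoist:             # after LICM: one GET, hoisted out of the loop
        x = remote_get()
        for i in range(n):
            A[i] = x
    else:                 # before LICM: one GET per iteration
        for i in range(n):
            A[i] = remote_get()
    return A, gets

A1, g1 = run(100, hoist=False)   # 100 GETs
A2, g2 = run(100, hoist=True)    # 1 GET, same result
```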
The Pros and Cons of using
LLVM for Chapel (Cont’d)
 Drawback: Using LLVM may lose opportunity for
optimizations and may add overhead at runtime
 In LLVM 3.3, many optimizations assume that the
pointer size is the same across all address spaces
12
typedef struct wide_ptr_s {
chpl_localeID_t locale;
void* addr;
} wide_ptr_t;
Layout: locale | addr
For C code generation: 128-bit struct pointer (CHPL_WIDE_POINTERS=struct); fields are accessed directly as wide.locale and wide.addr.
For LLVM code generation: 64-bit packed pointer (CHPL_WIDE_POINTERS=node16); a 16-bit locale ID in the upper bits and a 48-bit address in the lower bits, extracted as wide >> 48 and wide & 48BITS_MASK.
1. Needs more instructions
2. Loses opportunities for alias analysis
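The packed layout above can be sketched with plain integer arithmetic; this is a minimal illustration, and the helper names (pack, locale_of, addr_of) are hypothetical, not Chapel runtime identifiers:

```python
# Sketch of the 64-bit "node16" packed wide pointer: a 16-bit locale
# ID in the upper bits and a 48-bit local address in the lower bits.

ADDR_BITS = 48
ADDR_MASK = (1 << ADDR_BITS) - 1   # mask for the low 48 bits

def pack(locale, addr):
    """Combine a locale ID and a local address into one 64-bit word."""
    assert 0 <= locale < (1 << 16) and 0 <= addr <= ADDR_MASK
    return (locale << ADDR_BITS) | addr

def locale_of(wide):
    return wide >> ADDR_BITS       # wide >> 48

def addr_of(wide):
    return wide & ADDR_MASK        # wide & 48BITS_MASK

wide = pack(3, 0x7f_dead_beef_00)  # locale 3, some 48-bit address
```

Extracting either field costs extra shift/mask instructions on every dereference, which is the first drawback listed above.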
Performance Evaluations:
Experimental Methodologies
We tested execution in the following modes
 1.C-Struct (--fast)
C code generation + struct pointer + gcc
Conventional Code generation in Chapel
 2.LLVM without wide optimization (--fast --llvm)
LLVM IR generation + packed pointer
Does not use address space feature
 3.LLVM with wide optimization
(--fast --llvm --llvm-wide-opt)
LLVM IR generation + packed pointer
Use address space feature and apply the existing LLVM
optimizations
13
Performance Evaluations:
Platform
Intel Xeon-based Cluster
Per Node information
Intel Xeon CPU X5660@2.80GHz x 12 cores
48GB of RAM
Interconnect
Quad data rate (QDR) InfiniBand
Mellanox FCA support
14
Performance Evaluations:
Details of Compiler & Runtime
Compiler:
Chapel version 1.9.0.23154 (Apr. 2014)
 Built with
CHPL_LLVM=llvm
CHPL_WIDE_POINTERS=node16 or struct
CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
CHPL_TASK=qthread
Backend compiler: gcc-4.4.7, LLVM 3.3
Runtime:
 GASNet-1.22.0 (ibv-conduit, mpi-spawner)
 qthreads-1.10
(2 shepherds, 6 workers per shepherd) 15
Stream-EP
From HPCC benchmark
Array Size: 2^30
16
coforall loc in Locales do on loc {
// per each locale
var A, B, C: [D] real(64);
forall (a, b, c) in zip(A, B, C) do
a = b + alpha * c;
}
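For reference, the per-locale kernel the forall above computes is the elementwise triad a = b + alpha * c; a minimal sequential sketch (tiny sizes for illustration, not the 2^30 elements used in the evaluation):

```python
# Sequential sketch of the per-locale Stream triad computed by the
# Chapel forall above: a = b + alpha * c, elementwise.

def triad(B, C, alpha):
    return [b + alpha * c for b, c in zip(B, C)]

A = triad([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], alpha=0.5)  # [6.0, 12.0, 18.0]
```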
Stream-EP Result
17
Execution time (sec), lower is better:

Locales        1     2     4     8     16    32
C-Struct       2.56  1.33  0.72  0.41  0.24  0.11
LLVM w/o wopt  6.62  3.22  1.73  1.01  0.62  0.26
LLVM w/ wopt   2.45  1.28  0.72  0.40  0.25  0.10

1 vs. 2: overhead of introducing LLVM + packed pointer (2.6x slower)
2 vs. 3: performance improvement by LLVM opt (2.7x faster)
1 vs. 3: LLVM + wide opt is faster than the conventional C-Struct (1.1x)
Stream-EP Analysis
18
Dynamic number of Chapel PUT/GET APIs actually executed (16 locales):
C-Struct: 1.39E+11; LLVM w/o wopt: 1.40E+11; LLVM w/ wopt: 5.46E+10

// C-Struct, LLVM w/o wopt
forall (a, b, c) in zip(A, B, C) do
  8 GETs / 1 PUT per iteration

// LLVM w/ wopt (LICM by LLVM)
6 GETs hoisted out of the loop (get array head, offsets)
forall (a, b, c) in zip(A, B, C) do
  2 GETs / 1 PUT per iteration
Cholesky Decomposition
 Use Futures & Distributed Array
 Input Size: 10,000x10,000
 Tile Size: 500x500
19
[Diagram: a 20x20 tile grid with a user-defined distribution mapping tiles to locales 0-3; arrows show inter-tile dependencies]
Cholesky Result
20
Execution time (sec), lower is better:

Locales        8        16       32
C-Struct       2401.32  941.70   730.94
LLVM w/o wopt  2781.12  1105.38  902.86
LLVM w/ wopt   858.77   283.32   216.48

1 vs. 2: overhead of introducing LLVM + packed pointer (1.2x slower)
2 vs. 3: performance improvement by LLVM opt (4.2x faster)
1 vs. 3: LLVM + wide opt is faster than the conventional C-Struct (3.4x)
Cholesky Analysis
21
Dynamic number of Chapel PUT/GET APIs actually executed (2 locales), obtained with 1,000 x 1,000 input (100x100 tile size):
C-Struct: 1.78E+09; LLVM w/o wopt: 1.97E+09; LLVM w/ wopt: 5.89E+08
// C-Struct, LLVM w/o wopt
for jB in zero..tileSize-1 do {
for kB in zero..tileSize-1 do {
4GETS
for iB in zero..tileSize-1 do {
8GETS (+1 GETS w/ LLVM)
1PUT
}}}
// LLVM w/ wopt
for jB in zero..tileSize-1 do {
1GET
for kB in zero..tileSize-1 do {
3GETS
for iB in zero..tileSize-1 do {
2GETS
1PUT
}}}
Smithwaterman
 Use Futures & Distributed Array
 Input Size: 185,500x192,000
 Tile Size: 11,600x12,000
22
[Diagram: a 16x16 tile grid with a cyclic distribution mapping tiles to locales 0-3; arrows show inter-tile dependencies]
Smithwaterman Result
23
Execution time (sec), lower is better:

Locales        8        16
C-Struct       381.23   379.01
LLVM w/o wopt  1260.31  1263.76
LLVM w/ wopt   626.38   635.45

1 vs. 2: overhead of introducing LLVM + packed pointer (3.3x slower)
2 vs. 3: performance improvement by LLVM opt (2.0x faster)
1 vs. 3: LLVM + wide opt is slower than the conventional C-Struct (0.6x)
Smithwaterman Analysis
24
Dynamic number of Chapel PUT/GET APIs actually executed (1 locale), obtained with 1,856 x 1,920 input (232x240 tile size):
C-Struct: 1.41E+08; LLVM w/o wopt: 1.41E+08; LLVM w/ wopt: 5.26E+07
// C-Struct, LLVM w/o wopt
for (ii, jj) in tile_1_2d_domain
{
33 GETS
1 PUTS
}
// LLVM w/ wopt
for (ii, jj) in tile_1_2d_domain
{
12 GETS
1 PUTS
}
No LICM though there are opportunities
Key Insights
Using address space 100 offers finer-grain
optimization opportunities (e.g. Chapel Array)
25
// Chapel source
for i in {1..N} {
  data = A(i);
}
// Lowered form: remote GETs for each access
for i in 1..N {
  head = GET(pointer to array head)
  offset1 = GET(offset)
  data = GET(head + i*offset1)
}
Opportunities for
1.LICM
2.Aggregation
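A toy model of the GET counts implied above, simplified to three GETs per iteration in the naive lowering versus hoisting the two loop-invariant GETs (array head and offset); GET here is a hypothetical stand-in counter, not a Chapel API:

```python
# Model of how LICM over addrspace(100) loads reduces GET counts for
# a Chapel array access: 3 GETs per iteration naively, versus
# 2 hoisted GETs + 1 GET per element after hoisting.

def access_array(n, hoist_invariants):
    gets = 0
    def GET(value):            # stands in for one remote GET
        nonlocal gets
        gets += 1
        return value
    head, offset, elems = 1000, 8, list(range(n))
    data = []
    if hoist_invariants:       # after LICM: invariant GETs outside the loop
        h, o = GET(head), GET(offset)
        for i in range(n):
            data.append(GET(elems[i]))
    else:                      # naive: re-fetch head and offset every time
        for i in range(n):
            h, o = GET(head), GET(offset)
            data.append(GET(elems[i]))
    return data, gets

_, naive = access_array(50, hoist_invariants=False)   # 3 * 50 = 150 GETs
_, opt = access_array(50, hoist_invariants=True)      # 2 + 50 = 52 GETs
```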
Conclusions
 The first performance evaluation and analysis of
LLVM-based Chapel compiler
 Capable of utilizing the existing optimization passes even for remote data (e.g. LICM)
Removes a significant number of comm APIs
 LLVM w/ opt is always better than LLVM w/o opt
 Stream-EP, Cholesky
LLVM-based code generation is faster than C-based code
generation (1.04x, 3.4x)
 Smithwaterman
LLVM-based code generation is slower than C-based code
generation due to constraints of address space feature in
LLVM
No LICM though there are opportunities
Significant overhead of Packed Wide pointer
26
Future Work
Evaluate other applications
 Regular applications
 Irregular applications
Possibly-Remote to Definitely-Local
transformation by compiler
PIR in LLVM
27
local { A(i) = … } // hint by programmer
… = A(i); // Definitely Local

on Locales[1] { // hint by programmer
  var A: [D] int; // Definitely Local
}
Acknowledgements
Special thanks to
Brad Chamberlain (Cray)
Rafael Larrosa Jimenez (UMA)
Rafael Asenjo Plaza (UMA)
Shams Imam (Rice)
Sagnak Tasirlar (Rice)
Jun Shirako (Rice)
28
Backup
29
// modules/internal/DefaultRectangular.chpl
class DefaultRectangularArr: BaseArr {
...
var dom : DefaultRectangularDom(rank=rank, idxType=idxType,
stridable=stridable); /* domain */
var off: rank*idxType; /* per-dimension offset (n-based-> 0-based) */
var blk: rank*idxType; /* per-dimension multiplier */
var str: rank*chpl__signedType(idxType); /* per-dimension stride */
var origin: idxType; /* used for optimization */
var factoredOffs: idxType; /* used for calculating shiftedData */
var data : _ddata(eltType); /* pointer to an actual data */
var shiftedData : _ddata(eltType); /* shifted pointer to an actual data */
var noinit: bool = false;
...
Chapel Array Structure
30
// chpl_module.bc (with LLVM code generation)
%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object = type
{ %chpl_BaseArr_object, %chpl_DefaultRectangularDom_1_int64_t_F_object*, [1 x i64], [1 x i64], [1 x
i64], i64, i64, i64*, i64*, i8 }
Example1: Array Store
(very simple)
proc habanero (A) {
A(0) = 1;
}
31
 Chapel version: 1.8.0.22047
 Compiler option: --llvm --llvm-wide-opt --fast
 Add “noinline” attribute to the function to avoid dead code
elimination
Example1: Generated LLVM IR
32
define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A) #9 {
entry:
  ; possibly remote access: get the 8th member, %0 = &A->shiftedData
  %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0, i32 8
  %1 = load i64 addrspace(100)* addrspace(100)* %0, align 1
  ; store 1 into A(0)
  store i64 1, i64 addrspace(100)* %1, align 8, !tbaa !0
}
Example2: Array Store
proc habanero (A) {
A(1) = 0;
}
33
 Chapel version: 1.8.0.22047
 Compiler option: --llvm --llvm-wide-opt --fast
 Add “noinline” attribute to the function to avoid dead code
elimination
define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A) #9 {
entry:
// possibly remote access
1: %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0,
i32 3, i64 0
2: %agg.tmp = alloca i8, i32 48, align 1
3: %agg.cast = bitcast i64 addrspace(100)* %0 to i8 addrspace(100)*
4: call void @llvm.memcpy.p0i8.p100i8.i64(i8* %agg.tmp, i8 addrspace(100)* %agg.cast, i64 48, i32 0, i1 false)
5: %agg.tmp.cast = bitcast i8* %agg.tmp to i64*
6: %1 = load i64* %agg.tmp.cast, align 1
7: %2 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0,
i32 8
8: %agg.tmp.ptr.i = ptrtoint i64 addrspace(100)* addrspace(100)* %2 to i64
9: %agg.tmp.oldb.i = ptrtoint i64 addrspace(100)* %0 to i64
10:%agg.tmp.newb.i = ptrtoint i8* %agg.tmp to i64
11:%agg.tmp.diff = sub i64 %agg.tmp.ptr.i, %agg.tmp.oldb.i
12:%agg.tmp.sum = add i64 %agg.tmp.newb.i, %agg.tmp.diff
13:%agg.tmp.cast10 = inttoptr i64 %agg.tmp.sum to i64 addrspace(100)**
14:%3 = load i64 addrspace(100)** %agg.tmp.cast10, align 1
15:%4 = getelementptr inbounds i64 addrspace(100)* %3, i64 %1
16:store i64 0, i64 addrspace(100)* %4, align 8, !tbaa !0
…
Example2: Generated LLVM IR
34
Annotations: line 1 computes %0 = &A->blk[0]; lines 2-6 allocate a 48-byte buffer for aggregation, memcpy the possibly-remote region into it, and load %1 = A->blk[0]; line 7 computes %2 = &A->shiftedData; lines 8-13 compute the offset of shiftedData within the buffer; line 14 loads the shiftedData pointer from the buffer; line 15 gets the pointer to A(1); line 16 stores 0. The sequence of loads has been merged by the aggregation pass.
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
 
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Speculative Execution of Parallel Programs with Precise Exception Semantics ...
Speculative Execution of Parallel Programs with Precise Exception Semantics ...
 
Accelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL GenerationAccelerating Habanero-Java Program with OpenCL Generation
Accelerating Habanero-Java Program with OpenCL Generation
 

Recently uploaded

Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
ijseajournal
 
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENTNATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
Addu25809
 
Mechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineeringMechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineering
sachin chaurasia
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
AnasAhmadNoor
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
cannyengineerings
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
q30122000
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
mahaffeycheryld
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
b0754201
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf
AlvianRamadhani5
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
PreethaV16
 
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
sydezfe
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
Dwarkadas J Sanghvi College of Engineering
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
shadow0702a
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
Software Engineering and Project Management - Software Testing + Agile Method...
Software Engineering and Project Management - Software Testing + Agile Method...Software Engineering and Project Management - Software Testing + Agile Method...
Software Engineering and Project Management - Software Testing + Agile Method...
Prakhyath Rai
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
PriyankaKilaniya
 

Recently uploaded (20)

Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...Call For Paper -3rd International Conference on Artificial Intelligence Advan...
Call For Paper -3rd International Conference on Artificial Intelligence Advan...
 
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENTNATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
NATURAL DEEP EUTECTIC SOLVENTS AS ANTI-FREEZING AGENT
 
Mechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineeringMechatronics material . Mechanical engineering
Mechatronics material . Mechanical engineering
 
P5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civilP5 Working Drawings.pdf floor plan, civil
P5 Working Drawings.pdf floor plan, civil
 
Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...Pressure Relief valve used in flow line to release the over pressure at our d...
Pressure Relief valve used in flow line to release the over pressure at our d...
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
Height and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdfHeight and depth gauge linear metrology.pdf
Height and depth gauge linear metrology.pdf
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptxSENTIMENT ANALYSIS ON PPT AND Project template_.pptx
SENTIMENT ANALYSIS ON PPT AND Project template_.pptx
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf5G Radio Network Througput Problem Analysis HCIA.pdf
5G Radio Network Througput Problem Analysis HCIA.pdf
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
一比一原版(uoft毕业证书)加拿大多伦多大学毕业证如何办理
 
Introduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.pptIntroduction to Computer Networks & OSI MODEL.ppt
Introduction to Computer Networks & OSI MODEL.ppt
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
Software Engineering and Project Management - Software Testing + Agile Method...
Software Engineering and Project Management - Software Testing + Agile Method...Software Engineering and Project Management - Software Testing + Agile Method...
Software Engineering and Project Management - Software Testing + Agile Method...
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 

LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Pointer Optimizations in Chapel-

  • 1. LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Pointer Optimizations in Chapel- CHIUW2014 (co-located with IPDPS 2014), Phoenix, Arizona Akihiro Hayashi, Rishi Surendran, Jisheng Zhao, Vivek Sarkar (Rice University), Michael Ferguson (Laboratory for Telecommunication Sciences) 1
  • 2. Background: Programming Model for Large-scale Systems Message Passing Interface (MPI) is a ubiquitous programming model but introduces non-trivial complexity due to message passing semantics PGAS languages such as Chapel, X10, Habanero-C and Co-array Fortran provide high-productivity features: Task parallelism Data Distribution Synchronization 2
  • 3. Motivation: Chapel Support for LLVM Widely used and easy to extend 3 LLVM Intermediate Representation (IR) x86 Binary C/C++ Frontend Clang C/C++, Fortran, Ada, Objective-C Frontend dragonegg Chapel Compiler chpl PPC Binary ARM Binary x86 backend PowerPC backend ARM backend PTX backend Analysis & Optimizations GPU Binary UPC Compiler
  • 4. A Big Picture 4 Pictures borrowed from http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, https://www.olcf.ornl.gov/titan/ Habanero-C, ... © Argonne National Lab. ©Oak Ridge National Lab. © RIKEN AICS
  • 5. Our ultimate goal: A compiler that can uniformly optimize PGAS Programs Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism  Two parallel intermediate representations(PIR) as extensions to LLVM IR (Runtime-Independent, Runtime-Specific) 5 Parallel Programs (Chapel, X10, CAF, HC, …) 1.RI-PIR Gen 2.Analysis 3.Transformation 1.RS-PIR Gen 2.Analysis 3.Transformation LLVM Runtime-Independent Optimizations e.g. Task Parallel Construct LLVM Runtime-Specific Optimizations e.g. GASNet API Binary
  • 6. The first step: LLVM-based Chapel compiler 6 Pictures borrowed from 1) http://chapel.cray.com/logo.html 2) http://llvm.org/Logo.html  Chapel compiler supports LLVM IR generation  This talk discusses the pros and cons of LLVM-based communication optimizations for Chapel  Wide pointer optimization  Preliminary Performance evaluation & analysis using three regular applications
  • 7. Chapel language An object-oriented PGAS language developed by Cray Inc. Part of DARPA HPCS program Key features Array Operators: zip, replicate, remap,... Explicit Task Parallelism: begin, cobegin Locality Control: Locales Data-Distribution: domain maps Synchronizations: sync 7
  • 8. Compilation Flow 8 Chapel Programs AST Generation and Optimizations C-code Generation LLVM Optimizations Backend Compiler’s Optimizations (e.g. gcc –O3) LLVM IRC Programs LLVM IR Generation Binary Binary
  • 9. The Pros and Cons of using LLVM for Chapel Pro: Using address space feature of LLVM offers more opportunities for communication optimization than C gen 9 // LLVM IR %x = load i64 addrspace(100)* %xptr // C-Code generation chpl_comm_get(&x, …); LLVM Optimizations (e.g. LICM, scalar replacement) Backend Compiler’s Optimizations (e.g. gcc –O3) Few chances of optimization because remote accesses are lowered to chapel Comm APIs 1. the existing LLVM passes can be used for communication optimizations 2. Lowered to chapel Comm APIs after optimizations // Chapel x = remoteData;
  • 10. Address Space 100 generation in Chapel  Address space 100 = possibly-remote (our convention)  Constructs which generate address space 100  Array Load/Store (Except Local constructs)  Distributed Array  var d = {1..128} dmapped Block(boundingBox={1..128});  var A: [d] int;  Object and Field Load/Store  class circle { var radius: real; … }  var c1 = new circle(radius=1.0);  On statement  var loc0: int;  on Locales[1] { loc0 = …; }  Ref intent  proc habanero(ref v: int): void { v = …; } 10 Except remote value forwarding optimization
  • 11. Motivating Example of address space 100 11 (Pseudo-Code: Before LICM) for i in 1..N { // REMOTE GET %x = load i64 addrspace(100)* %xptr A(i) = %x; } (Pseudo-Code: After LICM) // REMOTE GET %x = load i64 addrspace(100)* %xptr for i in 1..N { A(i) = %x; } LICM by LLVM LICM = Loop Invariant Code Motion
  • 12. The Pros and Cons of using LLVM for Chapel (Cont’d)  Drawback: Using LLVM may lose optimization opportunities and may add overhead at runtime  In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces 12 typedef struct wide_ptr_s { chpl_localeID_t locale; void* addr; } wide_ptr_t; locale addr For LLVM code generation: 64-bit packed pointer (CHPL_WIDE_POINTERS=node16), fields extracted as wide >> 48 and wide & 48BITS_MASK (16-bit locale, 48-bit addr). For C code generation: 128-bit struct pointer (CHPL_WIDE_POINTERS=struct), fields accessed as wide.locale and wide.addr. The packed form 1. needs more instructions and 2. loses opportunities for alias analysis
  • 13. Performance Evaluations: Experimental Methodologies We tested execution in the following modes  1.C-Struct (--fast) C code generation + struct pointer + gcc Conventional Code generation in Chapel  2.LLVM without wide optimization (--fast --llvm) LLVM IR generation + packed pointer Does not use address space feature  3.LLVM with wide optimization (--fast --llvm --llvm-wide-opt) LLVM IR generation + packed pointer Use address space feature and apply the existing LLVM optimizations 13
  • 14. Performance Evaluations: Platform Intel Xeon-based Cluster Per Node information Intel Xeon CPU X5660@2.80GHz x 12 cores 48GB of RAM Interconnect Quad data rate (QDR) InfiniBand with Mellanox FCA support 14
  • 15. Performance Evaluations: Details of Compiler & Runtime Compiler: Chapel version 1.9.0.23154 (Apr. 2014)  Built with CHPL_LLVM=llvm CHPL_WIDE_POINTERS=node16 or struct CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv CHPL_TASK=qthread Backend compiler: gcc-4.4.7, LLVM 3.3 Runtime:  GASNet-1.22.0 (ibv-conduit, mpi-spawner)  qthreads-1.10 (2 shepherds, 6 workers per shepherd) 15
  • 16. Stream-EP From HPCC benchmark Array Size: 2^30 16 coforall loc in Locales do on loc { // per each locale var A, B, C: [D] real(64); forall (a, b, c) in zip(A, B, C) do a = b + alpha * c; }
  • 17. Stream-EP Result 17 Execution time (sec), lower is better, on 1 / 2 / 4 / 8 / 16 / 32 locales: C-Struct: 2.56 / 1.33 / 0.72 / 0.41 / 0.24 / 0.11; LLVM w/o wopt: 6.62 / 3.22 / 1.73 / 1.01 / 0.62 / 0.26; LLVM w/ wopt: 2.45 / 1.28 / 0.72 / 0.40 / 0.25 / 0.10. 2 vs. 1: overhead of introducing LLVM + packed pointer (2.6x slower). 3 vs. 2: performance improvement by LLVM opt (2.7x faster). 3 vs. 1: LLVM + wide opt is faster than the conventional C-Struct (1.1x).
  • 18. Stream-EP Analysis 18 Dynamic number of Chapel PUT/GET APIs actually executed (16 locales): C-Struct: 1.39E+11, LLVM w/o wopt: 1.40E+11, LLVM w/ wopt: 5.46E+10. // C-Struct, LLVM w/o wopt: forall (a, b, c) in zip(A, B, C) do 8 GETs / 1 PUT. // LLVM w/ wopt: 6 GETs hoisted by LLVM's LICM (get array head, offsets), then forall (a, b, c) in zip(A, B, C) do 2 GETs / 1 PUT.
  • 19. Cholesky Decomposition  Use futures & distributed array  Input size: 10,000 x 10,000  Tile size: 500 x 500 19 (Figure: a 20 x 20 tile grid with a user-defined distribution over locales and inter-tile dependencies)
  • 20. Cholesky Result 20 Execution time (sec), lower is better, on 8 / 16 / 32 locales: C-Struct: 2401.32 / 941.70 / 730.94; LLVM w/o wopt: 2781.12 / 1105.38 / 902.86; LLVM w/ wopt: 858.77 / 283.32 / 216.48. 2 vs. 1: overhead of introducing LLVM + packed pointer (1.2x slower). 3 vs. 2: performance improvement by LLVM opt (4.2x faster). 3 vs. 1: LLVM + wide opt is faster than the conventional C-Struct (3.4x).
  • 21. Cholesky Analysis 21 Dynamic number of Chapel PUT/GET APIs actually executed (2 locales), obtained with 1,000 x 1,000 input (100 x 100 tile size): C-Struct: 1.78E+09, LLVM w/o wopt: 1.97E+09, LLVM w/ wopt: 5.89E+08. // C-Struct, LLVM w/o wopt: for jB in zero..tileSize-1 do { for kB in zero..tileSize-1 do { 4 GETs; for iB in zero..tileSize-1 do { 8 GETs (+1 GET w/ LLVM), 1 PUT }}}. // LLVM w/ wopt: for jB in zero..tileSize-1 do { 1 GET; for kB in zero..tileSize-1 do { 3 GETs; for iB in zero..tileSize-1 do { 2 GETs, 1 PUT }}}.
  • 22. Smithwaterman  Use futures & distributed array  Input size: 185,500 x 192,000  Tile size: 11,600 x 12,000 22 (Figure: a 16 x 16 tile grid with a cyclic distribution over locales 0-3 and inter-tile dependencies)
  • 23. Smithwaterman Result 23 Execution time (sec), lower is better, on 8 / 16 locales: C-Struct: 381.23 / 379.01; LLVM w/o wopt: 1260.31 / 1263.76; LLVM w/ wopt: 626.38 / 635.45. 2 vs. 1: overhead of introducing LLVM + packed pointer (3.3x slower). 3 vs. 2: performance improvement by LLVM opt (2.0x faster). 3 vs. 1: LLVM + wide opt is slower than the conventional C-Struct (0.6x).
  • 24. Smithwaterman Analysis 24 Dynamic number of Chapel PUT/GET APIs actually executed (1 locale), obtained with 1,856 x 1,920 input (232 x 240 tile size): C-Struct: 1.41E+08, LLVM w/o wopt: 1.41E+08, LLVM w/ wopt: 5.26E+07. // C-Struct, LLVM w/o wopt: for (ii, jj) in tile_1_2d_domain { 33 GETs, 1 PUT }. // LLVM w/ wopt: for (ii, jj) in tile_1_2d_domain { 12 GETs, 1 PUT }. No LICM though there are opportunities.
  • 25. Key Insights Using address space 100 offers finer-grain optimization opportunities (e.g. Chapel Array) 25 for i in {1..N} { data = A(i); } for i in 1..N { head = GET(pointer to array head) offset1 = GET(offset) data = GET(head+i*offset1) } Opportunities for 1.LICM 2.Aggregation
  • 26. Conclusions  The first performance evaluation and analysis of an LLVM-based Chapel compiler  Capable of utilizing the existing optimization passes even for remote data (e.g. LICM) Removes a significant number of comm APIs  LLVM w/ opt is always better than LLVM w/o opt  Stream-EP, Cholesky: LLVM-based code generation is faster than C-based code generation (1.04x, 3.4x)  Smithwaterman: LLVM-based code generation is slower than C-based code generation due to constraints of the address space feature in LLVM: no LICM though there are opportunities, and significant overhead of the packed wide pointer 26
  • 27. Future Work Evaluate other applications  Regular applications  Irregular applications Possibly-remote to definitely-local transformation by the compiler PIR in LLVM 27 local { A(i) = … } // hint by programmer … = A(i); // Definitely Local on Locales[1] { // hint by programmer var A: [D] int; // Definitely Local
  • 28. Acknowledgements Special thanks to Brad Chamberlain (Cray) Rafael Larrosa Jimenez (UMA) Rafael Asenjo Plaza (UMA) Shams Imam (Rice) Sagnak Tasirlar (Rice) Jun Shirako (Rice) 28
  • 30. // modules/internal/DefaultRectangular.chpl class DefaultRectangularArr: BaseArr { ... var dom : DefaultRectangularDom(rank=rank, idxType=idxType, stridable=stridable); /* domain */ var off: rank*idxType; /* per-dimension offset (n-based -> 0-based) */ var blk: rank*idxType; /* per-dimension multiplier */ var str: rank*chpl__signedType(idxType); /* per-dimension stride */ var origin: idxType; /* used for optimization */ var factoredOffs: idxType; /* used for calculating shiftedData */ var data : _ddata(eltType); /* pointer to an actual data */ var shiftedData : _ddata(eltType); /* shifted pointer to an actual data */ var noinit: bool = false; ... Chapel Array Structure 30 // chpl_module.bc (with LLVM code generation) %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object = type { %chpl_BaseArr_object, %chpl_DefaultRectangularDom_1_int64_t_F_object*, [1 x i64], [1 x i64], [1 x i64], i64, i64, i64*, i64*, i8 }
  • 31. Example1: Array Store (very simple) proc habanero (A) { A(0) = 1; } 31  Chapel version: 1.8.0.22047  Compiler option: --llvm --llvm-wide-opt --fast  Add “noinline” attribute to the function to avoid dead code elimination
  • 32. define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A) #9 { entry: // possibly remote access 1: %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0, i32 8 2: %1 = load i64 addrspace(100)* addrspace(100)* %0, align 1 3: store i64 1, i64 addrspace(100)* %1, align 8, !tbaa !0 } Example1: Generated LLVM IR 32 Get 8th member %0 = A->shiftedData store 1
  • 33. Example2: Array Store proc habanero (A) { A(1) = 0; } 33  Chapel version: 1.8.0.22047  Compiler option: --llvm --llvm-wide-opt --fast  Add “noinline” attribute to the function to avoid dead code elimination
  • 34. define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A) #9 { entry: // possibly remote access 1: %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0, i32 3, i64 0 2: %agg.tmp = alloca i8, i32 48, align 1 3: %agg.cast = bitcast i64 addrspace(100)* %0 to i8 addrspace(100)* 4: call void @llvm.memcpy.p0i8.p100i8.i64(i8* %agg.tmp, i8 addrspace(100)* %agg.cast, i64 48, i32 0, i1 false) 5: %agg.tmp.cast = bitcast i8* %agg.tmp to i64* 6: %1 = load i64* %agg.tmp.cast, align 1 7: %2 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0, i32 8 8: %agg.tmp.ptr.i = ptrtoint i64 addrspace(100)* addrspace(100)* %2 to i64 9: %agg.tmp.oldb.i = ptrtoint i64 addrspace(100)* %0 to i64 10:%agg.tmp.newb.i = ptrtoint i8* %agg.tmp to i64 11:%agg.tmp.diff = sub i64 %agg.tmp.ptr.i, %agg.tmp.oldb.i 12:%agg.tmp.sum = add i64 %agg.tmp.newb.i, %agg.tmp.diff 13:%agg.tmp.cast10 = inttoptr i64 %agg.tmp.sum to i64 addrspace(100)** 14:%3 = load i64 addrspace(100)** %agg.tmp.cast10, align 1 15:%4 = getelementptr inbounds i64 addrspace(100)* %3, i64 %1 16:store i64 0, i64 addrspace(100)* %4, align 8, !tbaa !0 … Example2: Generated LLVM IR 34 %0 = A->blk %2 = A->shiftedData  Sequence of load are merged by aggregation pass %1 = A- >blk[0] Buffer for aggregation Offset for getting A->shifted Data in buffer Get pointer of A(1) Store 0 memcpy

Editor's Notes

  1. Good afternoon everyone. My name is Akihiro Hayashi. I’m a postdoc at Rice University. Today, I’ll be talking about LLVM-based optimizations for PGAS programs. In particular, I focus on the Chapel language and its optimization in this talk.
  2. Let me first talk about programming models for large-scale systems. The Message Passing Interface is a very common programming model for large-scale systems, but it is well known that using MPI introduces non-trivial complexity due to message passing semantics. PGAS languages such as Chapel, X10, Habanero-C and CAF are designed to facilitate programming for large-scale systems by providing high-productivity language features such as task parallelism, data distribution and synchronization.
  3. When it comes to compiler optimization, LLVM is an emerging compiler infrastructure that aims to replace conventional compilers like GCC. Here is an overview of LLVM. LLVM defines a machine-independent intermediate representation, and it also provides a powerful analyzer and optimizer for LLVM IR. If you prepare a frontend that generates LLVM IR, you can analyze and optimize code in a language-independent manner. I think the most famous one is “Clang”, which takes C/C++ and generates LLVM IR. You finally get target-specific binaries by using target-specific backends. The most important thing in this slide is that the Chapel compiler is now capable of generating LLVM IR.
  4. Here is the big picture. We think it’s feasible to build an LLVM-based compiler that can uniformly analyze and optimize PGAS languages because PGAS languages have a similar philosophy and language design. That means one sophisticated compiler can optimize several kinds of PGAS languages and generate binaries for several kinds of supercomputers.
  5. This slide shows the details of the universal PGAS compiler. Our plan is to extend LLVM IR to support parallel programs with PGAS and explicit task parallelism. We define two kinds of parallel intermediate representations as extensions to LLVM IR: a runtime-independent IR and a runtime-specific IR. For example, you may want to detect a task parallel construct and apply some sort of optimization with the runtime-independent IR.
  6. In this talk, we focus on the LLVM-based Chapel compiler as the first step toward our ultimate goal.
  7. Just read
  8. Let’s talk about the pros and cons of using LLVM for Chapel. We believe the good thing about using LLVM is that we can use the address space feature of LLVM. This offers more opportunities for communication optimization than C code generation. Here are examples of remote get code. If you use the C code generator, a remote get is expressed as the chpl_comm_get API, but there are few chances for optimization because remote accesses are already lowered to Chapel comm APIs. On the other hand, if we use LLVM and its address space feature, we can express a remote get as one load instruction that involves address space 100.
  9. Suppose xptr is loop invariant. We can then remove the redundant comm API calls with LICM.
  10. But using LLVM has a drawback. Chapel uses wide pointers to associate data with a node. A wide pointer is expressed as a C struct, and you can extract the node ID and address with the dot operator.