LLVM Optimizations for PGAS Programs
-Case Study: LLVM Wide Pointer Optimizations in Chapel-
CHIUW2014 (co-located with IPDPS 2014),
Phoenix, Arizona
Akihiro Hayashi, Rishi Surendran,
Jisheng Zhao, Vivek Sarkar
(Rice University),
Michael Ferguson
(Laboratory for Telecommunication Sciences)
1
Background: Programming Model
for Large-scale Systems
Message Passing Interface (MPI) is a
ubiquitous programming model
but introduces non-trivial complexity due to
message passing semantics
PGAS languages such as Chapel, X10,
Habanero-C and Co-array Fortran
provide high-productivity features:
Task parallelism
Data Distribution
Synchronization
2
Motivation:
Chapel Support for LLVM
LLVM is widely used and easy to extend
3
[Figure: overview of LLVM. Frontends (Clang for C/C++; dragonegg for
C/C++, Fortran, Ada, Objective-C; the chpl compiler for Chapel; UPC and
Habanero-C compilers) emit LLVM Intermediate Representation (IR);
shared Analysis & Optimizations run on the IR; backends (x86, PowerPC,
ARM, PTX) produce x86, PPC, ARM, and GPU binaries.]
A Big Picture
4
Pictures borrowed from http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, https://www.olcf.ornl.gov/titan/
© Argonne National Lab.
©Oak Ridge National Lab.
© RIKEN AICS
Our ultimate goal: A compiler that can
uniformly optimize PGAS Programs
Extend LLVM IR to support parallel programs
with PGAS and explicit task parallelism:
- Two parallel intermediate representations (PIRs) as
  extensions to LLVM IR
  (Runtime-Independent and Runtime-Specific)
5
[Figure: parallel programs (Chapel, X10, CAF, HC, ...) flow through two
optimization stages in LLVM: (1) Runtime-Independent Optimizations
(RI-PIR generation, analysis, transformation; e.g. on task-parallel
constructs), then (2) Runtime-Specific Optimizations (RS-PIR
generation, analysis, transformation; e.g. on GASNet APIs), and finally
to a binary.]
The first step:
LLVM-based Chapel compiler
6
Pictures borrowed from 1) http://chapel.cray.com/logo.html
2) http://llvm.org/Logo.html
- The Chapel compiler supports LLVM IR generation
- This talk discusses the pros and cons of LLVM-based
  communication optimizations for Chapel:
  - Wide pointer optimization
  - Preliminary performance evaluation and analysis using
    three regular applications
Chapel language
An object-oriented PGAS language
developed by Cray Inc.
Part of DARPA HPCS program
Key features
Array Operators: zip, replicate, remap,...
Explicit Task Parallelism: begin, cobegin
Locality Control: Locales
Data-Distribution: domain maps
Synchronizations: sync
7
Compilation Flow
8
Chapel Programs
-> AST Generation and Optimizations
-> one of two code-generation paths:
   1. C-code Generation -> C Programs
      -> Backend Compiler's Optimizations (e.g. gcc -O3) -> Binary
   2. LLVM IR Generation -> LLVM IR
      -> LLVM Optimizations -> Binary
The Pros and Cons of using
LLVM for Chapel
Pro: Using address space feature of LLVM
offers more opportunities for
communication optimization than C gen
9
// Chapel
x = remoteData;

C code generation path:
  chpl_comm_get(&x, ...);
  then the backend compiler's optimizations (e.g. gcc -O3).
  Few chances of optimization, because remote accesses are already
  lowered to Chapel comm APIs.

LLVM code generation path:
  %x = load i64 addrspace(100)* %xptr
  then LLVM optimizations (e.g. LICM, scalar replacement).
  1. The existing LLVM passes can be used for communication
     optimizations.
  2. Lowered to Chapel comm APIs after optimization.
Address Space 100 generation
in Chapel
- Address space 100 = possibly-remote (our convention)
- Constructs which generate address space 100:
  - Array load/store (except local constructs)
  - Distributed array:
      var d = {1..128} dmapped Block(boundingBox={1..128});
      var A: [d] int;
  - Object and field load/store:
      class circle { var radius: real; ... }
      var c1 = new circle(radius=1.0);
  - On statement:
      var loc0: int;
      on Locales[1] { loc0 = ...; }
  - Ref intent:
      proc habanero(ref v: int): void { v = ...; }
10
(Except accesses removed by the remote value forwarding optimization)
Motivating Example of address
space 100
11
(Pseudo-Code: Before LICM)
for i in 1..N {
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
A(i) = %x;
}
(Pseudo-Code: After LICM)
// REMOTE GET
%x = load i64 addrspace(100)* %xptr
for i in 1..N {
A(i) = %x;
}
LICM by
LLVM
LICM = Loop Invariant Code Motion
The Pros and Cons of using
LLVM for Chapel (Cont’d)
- Drawback: using LLVM may lose optimization opportunities
  and may add overhead at runtime
- In LLVM 3.3, many optimizations assume that the pointer size
  is the same across all address spaces
12
typedef struct wide_ptr_s {
  chpl_localeID_t locale;
  void* addr;
} wide_ptr_t;

For C code generation (CHPL_WIDE_POINTERS=struct):
  128-bit struct pointer; fields are read as wide.locale and wide.addr.

For LLVM code generation (CHPL_WIDE_POINTERS=node16):
  64-bit packed pointer: 16-bit locale | 48-bit address;
  fields are read as wide >> 48 and wide & MASK_48BITS.

Drawbacks of the packed form:
1. Needs more instructions.
2. Loses opportunities for alias analysis.
Performance Evaluations:
Experimental Methodologies
We tested execution in the following three modes:
1. C-Struct (--fast):
   C code generation + struct pointer + gcc
   (the conventional code generation in Chapel)
2. LLVM without wide optimization (--fast --llvm):
   LLVM IR generation + packed pointer;
   does not use the address space feature
3. LLVM with wide optimization (--fast --llvm --llvm-wide-opt):
   LLVM IR generation + packed pointer;
   uses the address space feature and applies the existing LLVM
   optimizations
13
Performance Evaluations:
Platform
Intel Xeon-based Cluster
Per Node information
Intel Xeon CPU X5660@2.80GHz x 12 cores
48GB of RAM
Interconnect:
  QDR (quad data rate) InfiniBand
  with Mellanox FCA support
14
Performance Evaluations:
Details of Compiler & Runtime
Compiler:
- Chapel version 1.9.0.23154 (Apr. 2014)
- Built with:
    CHPL_LLVM=llvm
    CHPL_WIDE_POINTERS=node16 or struct
    CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=ibv
    CHPL_TASK=qthread
- Backend compiler: gcc-4.4.7, LLVM 3.3
Runtime:
- GASNet-1.22.0 (ibv-conduit, mpi-spawner)
- qthreads-1.10 (2 shepherds, 6 workers per shepherd)
15
Stream-EP
From HPCC benchmark
Array Size: 2^30
16
coforall loc in Locales do on loc {
// per each locale
var A, B, C: [D] real(64);
forall (a, b, c) in zip(A, B, C) do
a = b + alpha * c;
}
Stream-EP Result
17
Execution time (sec); lower is better:

                 1 loc   2 locs  4 locs  8 locs  16 locs  32 locs
  C-Struct       2.56    1.33    0.72    0.41    0.24     0.11
  LLVM w/o wopt  6.62    3.22    1.73    1.01    0.62     0.26
  LLVM w/ wopt   2.45    1.28    0.72    0.40    0.25     0.10

LLVM w/o wopt vs. C-Struct: overhead of introducing LLVM + packed
pointer (2.6x slower)
LLVM w/ wopt vs. LLVM w/o wopt: performance improvement by LLVM
optimizations (2.7x faster)
LLVM w/ wopt vs. C-Struct: LLVM + wide opt is faster than the
conventional C-Struct (1.1x)
Stream-EP Analysis
18
Dynamic number of Chapel PUT/GET APIs actually executed (16 locales):
  C-Struct: 1.39E+11   LLVM w/o wopt: 1.40E+11   LLVM w/ wopt: 5.46E+10
// C-Struct, LLVM w/o wopt
forall (a, b, c) in zip(A, B, C) do
  // 8 GETs / 1 PUT per iteration

// LLVM w/ wopt: 6 GETs (array head, offsets) hoisted by LLVM's LICM
forall (a, b, c) in zip(A, B, C) do
  // 2 GETs / 1 PUT per iteration
Cholesky Decomposition
- Uses futures & distributed arrays
- Input Size: 10,000 x 10,000
- Tile Size: 500 x 500
19
[Figure: the 10,000 x 10,000 matrix is split into 20 x 20 tiles; each
tile is labeled with its owner locale (0-3) under a user-defined
distribution, and arrows show inter-tile dependencies.]
Cholesky Result
20
Execution time (sec); lower is better:

                 8 locales   16 locales   32 locales
  C-Struct       2401.32     941.70       730.94
  LLVM w/o wopt  2781.12     1105.38      902.86
  LLVM w/ wopt   858.77      283.32       216.48

LLVM w/o wopt vs. C-Struct: overhead of introducing LLVM + packed
pointer (1.2x slower)
LLVM w/ wopt vs. LLVM w/o wopt: performance improvement by LLVM
optimizations (4.2x faster)
LLVM w/ wopt vs. C-Struct: LLVM + wide opt is faster than the
conventional C-Struct (3.4x)
Cholesky Analysis
21
Dynamic number of Chapel PUT/GET APIs actually executed (2 locales),
obtained with 1,000 x 1,000 input (100x100 tile size):
  C-Struct: 1.78E+09   LLVM w/o wopt: 1.97E+09   LLVM w/ wopt: 5.89E+08
// C-Struct, LLVM w/o wopt
for jB in zero..tileSize-1 do {
for kB in zero..tileSize-1 do {
4GETS
for iB in zero..tileSize-1 do {
8GETS (+1 GETS w/ LLVM)
1PUT
}}}
// LLVM w/ wopt
for jB in zero..tileSize-1 do {
1GET
for kB in zero..tileSize-1 do {
3GETS
for iB in zero..tileSize-1 do {
2GETS
1PUT
}}}
Smith-Waterman
- Uses futures & distributed arrays
- Input Size: 185,500 x 192,000
- Tile Size: 11,600 x 12,000
22
[Figure: the matrix is split into 16 x 16 tiles assigned cyclically to
locales 0-3; arrows show inter-tile dependencies.]
Smith-Waterman Result
23
Execution time (sec); lower is better:

                 8 locales   16 locales
  C-Struct       381.23      379.01
  LLVM w/o wopt  1260.31     1263.76
  LLVM w/ wopt   626.38      635.45

LLVM w/o wopt vs. C-Struct: overhead of introducing LLVM + packed
pointer (3.3x slower)
LLVM w/ wopt vs. LLVM w/o wopt: performance improvement by LLVM
optimizations (2.0x faster)
LLVM w/ wopt vs. C-Struct: LLVM + wide opt is slower than the
conventional C-Struct (0.6x)
Smith-Waterman Analysis
24
Dynamic number of Chapel PUT/GET APIs actually executed (1 locale),
obtained with 1,856 x 1,920 input (232x240 tile size):
  C-Struct: 1.41E+08   LLVM w/o wopt: 1.41E+08   LLVM w/ wopt: 5.26E+07
// C-Struct, LLVM w/o wopt
for (ii, jj) in tile_1_2d_domain {
  // 33 GETs / 1 PUT per iteration
}

// LLVM w/ wopt
for (ii, jj) in tile_1_2d_domain {
  // 12 GETs / 1 PUT per iteration
}

No LICM was applied, though there are opportunities.
Key Insights
Using address space 100 offers finer-grained
optimization opportunities (e.g. for Chapel arrays)
25
for i in {1..N} {
data = A(i);
}
for i in 1..N {
head = GET(pointer to array head)
offset1 = GET(offset)
data = GET(head+i*offset1)
}
Opportunities for
1.LICM
2.Aggregation
Conclusions
- The first performance evaluation and analysis of an
  LLVM-based Chapel compiler
  - Capable of utilizing the existing optimization passes even for
    remote data (e.g. LICM)
  - Removes a significant number of comm API calls
- LLVM w/ wide opt is always better than LLVM w/o wide opt
- Stream-EP, Cholesky:
  LLVM-based code generation is faster than C-based code
  generation (1.04x and 3.4x)
- Smith-Waterman:
  LLVM-based code generation is slower than C-based code
  generation due to constraints of the address space feature in
  LLVM 3.3:
  - No LICM, though there are opportunities
  - Significant overhead of the packed wide pointer
26
Future Work
Evaluate other applications
 Regular applications
 Irregular applications
Possibly-Remote to Definitely-Local
transformation by compiler
PIR in LLVM
27
local { A(i) = … } // hint by programmer
… = A(i); // Definitely Local
on Locales[1] { // hint by programmer
  var A: [D] int; // Definitely Local
}
Acknowledgements
Special thanks to
Brad Chamberlain (Cray)
Rafael Larrosa Jimenez (UMA)
Rafael Asenjo Plaza (UMA)
Shams Imam (Rice)
Sagnak Tasirlar (Rice)
Jun Shirako (Rice)
28
Backup
29
// modules/internal/DefaultRectangular.chpl
class DefaultRectangularArr: BaseArr {
...
var dom : DefaultRectangularDom(rank=rank, idxType=idxType,
stridable=stridable); /* domain */
var off: rank*idxType; /* per-dimension offset (n-based-> 0-based) */
var blk: rank*idxType; /* per-dimension multiplier */
var str: rank*chpl__signedType(idxType); /* per-dimension stride */
var origin: idxType; /* used for optimization */
var factoredOffs: idxType; /* used for calculating shiftedData */
var data : _ddata(eltType); /* pointer to an actual data */
var shiftedData : _ddata(eltType); /* shifted pointer to an actual data */
var noinit: bool = false;
...
Chapel Array Structure
30
// chpl_module.bc (with LLVM code generation)
%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object = type
{ %chpl_BaseArr_object, %chpl_DefaultRectangularDom_1_int64_t_F_object*, [1 x i64], [1 x i64], [1 x
i64], i64, i64, i64*, i64*, i8 }
Example1: Array Store
(very simple)
proc habanero (A) {
A(0) = 1;
}
31
- Chapel version: 1.8.0.22047
- Compiler option: --llvm --llvm-wide-opt --fast
- A "noinline" attribute was added to the function to avoid dead
  code elimination
define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object
addrspace(100)* %A) #9 {
entry:
// possibly remote access
1: %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object
addrspace(100)* %A, i64 0, i32 8
2: %1 = load i64 addrspace(100)* addrspace(100)* %0, align 1
3: store i64 1, i64 addrspace(100)* %1, align 8, !tbaa !0
}
Example1: Generated LLVM IR
32
Annotations: lines 1-2 load the 8th member (%1 = A->shiftedData);
line 3 stores 1 through it.
Example2: Array Store
proc habanero (A) {
A(1) = 0;
}
33
- Chapel version: 1.8.0.22047
- Compiler option: --llvm --llvm-wide-opt --fast
- A "noinline" attribute was added to the function to avoid dead
  code elimination
define internal fastcc void @habanero(%chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A) #9 {
entry:
// possibly remote access
1: %0 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0,
i32 3, i64 0
2: %agg.tmp = alloca i8, i32 48, align 1
3: %agg.cast = bitcast i64 addrspace(100)* %0 to i8 addrspace(100)*
4: call void @llvm.memcpy.p0i8.p100i8.i64(i8* %agg.tmp, i8 addrspace(100)* %agg.cast, i64 48, i32 0, i1 false)
5: %agg.tmp.cast = bitcast i8* %agg.tmp to i64*
6: %1 = load i64* %agg.tmp.cast, align 1
7: %2 = getelementptr inbounds %chpl_DefaultRectangularArr_int64_t_1_int64_t_F_object addrspace(100)* %A, i64 0,
i32 8
8: %agg.tmp.ptr.i = ptrtoint i64 addrspace(100)* addrspace(100)* %2 to i64
9: %agg.tmp.oldb.i = ptrtoint i64 addrspace(100)* %0 to i64
10:%agg.tmp.newb.i = ptrtoint i8* %agg.tmp to i64
11:%agg.tmp.diff = sub i64 %agg.tmp.ptr.i, %agg.tmp.oldb.i
12:%agg.tmp.sum = add i64 %agg.tmp.newb.i, %agg.tmp.diff
13:%agg.tmp.cast10 = inttoptr i64 %agg.tmp.sum to i64 addrspace(100)**
14:%3 = load i64 addrspace(100)** %agg.tmp.cast10, align 1
15:%4 = getelementptr inbounds i64 addrspace(100)* %3, i64 %1
16:store i64 0, i64 addrspace(100)* %4, align 8, !tbaa !0
…
Example2: Generated LLVM IR
34
Annotations:
  line 1:      %0 = &A->blk[0]
  line 2:      %agg.tmp is the buffer for aggregation
  line 4:      memcpy: the sequence of loads is merged into one
               48-byte bulk GET by the aggregation pass
  line 6:      %1 = A->blk[0]
  line 7:      %2 = &A->shiftedData
  lines 8-13:  compute the offset of A->shiftedData inside the buffer
  line 14:     %3 = A->shiftedData (read from the buffer)
  line 15:     get pointer to A(1)
  line 16:     store 0

Editor's Notes

  • #2 Good afternoon everyone. My name is Akihiro Hayashi. I’m a postdoc at Rice university. Today, I’ll be talking about LLVM-based optimizations for PGAS programs. In particular, I focus on Chapel language and its optimization in this talk.
  • #3 Let me first talk about the Programming model for Large-scale systems. Message passing interface is very common programming model for large-scale system. But It is well known that using MPI introduces non-trivial complexity due to message passing semantics. PGAS languages such as Chapel, X10, Habanero-C and CAF are designed for facilitating programming for large-scale systems by providing high-productivity language features such as task parallelism, data distribution and synchronization.
  • #4 When it comes to compiler optimizations, LLVM is an emerging compiler infrastructure that aims to replace conventional compilers like GCC. Here is an overview of LLVM. LLVM defines a machine-independent intermediate representation, and it also provides a powerful analyzer and optimizer for LLVM IR. If you prepare a frontend that generates LLVM IR, you can analyze and optimize code in a language-independent manner. I think the most famous one is Clang, which takes C/C++ and generates LLVM IR. You'll finally get a target-specific binary by using a target-specific backend. The most important thing in this slide is that the Chapel compiler is now capable of generating LLVM IR.
  • #5 Here is a big picture. We think it's feasible to build an LLVM-based compiler that can uniformly analyze and optimize PGAS languages, because PGAS languages share a similar philosophy and language design. That means one sophisticated compiler can optimize several kinds of PGAS languages and generate binaries for several kinds of supercomputers.
  • #6 This slide shows the details of the universal PGAS compiler. Our plan is to extend LLVM IR to support parallel programs with PGAS and explicit task parallelism. We define two kinds of parallel intermediate representations as extensions to LLVM IR: a runtime-independent IR and a runtime-specific IR. You may want to detect task-parallel constructs and apply some sort of optimization with the runtime-independent IR.
  • #7 In this talk, we focus on the LLVM-based Chapel compiler as the first step toward our ultimate goal.
  • #8 Just read
  • #10 Let’s talk about the pros and cons of using LLVM for Chapel. We believe a good thing about using LLVM is that we can use its address space feature. This offers more opportunities for communication optimization than C code generation. Here are examples of remote get code. If you use the C code generator, a remote get is expressed as a chpl_comm_get API call. But there are few chances of optimization, because remote accesses are already lowered to Chapel comm APIs. On the other hand, if we use LLVM and its address space feature, we can express a remote get as one instruction that involves address space 100.
  • #12 Suppose xptr is loop invariant. We can remove the redundant comm API call by LICM.
  • #13 But using LLVM has a drawback. Chapel uses wide pointers to associate data with a node. A wide pointer is a C struct, and you can extract the node ID and address with the dot operator.