Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages

Akihiro Hayashi (Rice), Sri Raj Paul (Rice), Max Grossman (Rice), Jun Shirako (Rice),
Vivek Sarkar (Georgia Tech)

A Big Picture: Exploring Dynamic Tasking Runtimes
Large Scale Systems
Runtime Systems
  Communication: MPI, GASNet, …
  Dynamic Tasking: Qthreads, OCR, Habanero-C/C++ …
PGAS Languages: X10, CAF, Habanero-UPC++
What are desirable features for future systems?

Chapel Language
High-productivity features:
• Global-View
• Task/Loop parallelism
• Data Distribution
• Synchronization
  • Sync Variables

Chapel’s Tasking Constructs: Task Creation

begin/cobegin
  // spawn a task
  begin task();

  cobegin {
    taskA(); // spawn a task
    taskB(); // spawn a task
  }

forall/coforall
  // spawn tasks (iterations are chunked across tasks)
  forall i in 1..N {
    task();
  }

  // spawn one task per iteration (N tasks)
  coforall i in 1..N {
    task();
  }

Chapel’s Tasking Constructs: Sync Variables
A sync variable has a logical full/empty state associated with its value.

Producer
  // initial state is EMPTY
  var sy$: sync int;
  // write the value and set the state of sy$ to FULL
  sy$ = 1;

Consumer
  begin {
    // blocks until sy$ is FULL, then reads the value and leaves sy$ EMPTY
    var sy = sy$;
    …
  }
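The full/empty behaviour of a sync variable can be sketched in C with a mutex and a condition variable. This is only an illustration of the semantics, not the code of any Chapel tasking layer; the type fe_int and the fe_* functions are hypothetical names introduced here.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type: an int guarded by a logical full/empty state. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;   /* signaled whenever the state flips */
    bool            full;
    int             value;
} fe_int;

void fe_init(fe_int *v) {
    pthread_mutex_init(&v->lock, NULL);
    pthread_cond_init(&v->changed, NULL);
    v->full = false;           /* initial state is EMPTY */
    v->value = 0;
}

/* Like "sy$ = x": wait until EMPTY, store the value, leave FULL. */
void fe_write_ef(fe_int *v, int x) {
    pthread_mutex_lock(&v->lock);
    while (v->full)
        pthread_cond_wait(&v->changed, &v->lock);
    v->value = x;
    v->full = true;
    pthread_cond_broadcast(&v->changed);
    pthread_mutex_unlock(&v->lock);
}

/* Like "var sy = sy$": wait until FULL, read the value, leave EMPTY. */
int fe_read_fe(fe_int *v) {
    pthread_mutex_lock(&v->lock);
    while (!v->full)
        pthread_cond_wait(&v->changed, &v->lock);
    int x = v->value;
    v->full = false;
    pthread_cond_broadcast(&v->changed);
    pthread_mutex_unlock(&v->lock);
    return x;
}

/* Query the state without changing it. */
bool fe_is_full(fe_int *v) {
    pthread_mutex_lock(&v->lock);
    bool f = v->full;
    pthread_mutex_unlock(&v->lock);
    return f;
}
```

A producer thread would call fe_write_ef and a consumer thread fe_read_fe; the consumer blocks inside pthread_cond_wait until the producer has filled the variable.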

Goal of Our Study:
Explore alternatives to the Qthreads-based threading runtime for Chapel to prepare for future systems

Chapel Runtime: Communication, Tasking/Threading, Memory, Launchers, QIO, Timers, Standard
Third-party libs: GASNet (Communication); Qthreads, OCR/HClib (Tasking/Threading); libc; HDFS

Our Approach
• Study how the current Chapel tasking API uses the Qthreads API
  • 23 tasking functions, 9 sync functions, 5 threading functions (chpl-tasks.h)
• Alternative 1: Open Community Runtime (OCR), a task-based runtime with multiple implementations of runtime APIs that adhere to a common specification
• Alternative 2: Habanero-C/C++ library (HClib), which, unlike OCR, can efficiently support blocking operations in tasks (sync vars, barriers, future gets, …)

Chapel Tasking API (chpl-tasks.h)
[9 Sync functions]
chpl_sync_lock();
chpl_sync_unlock();
chpl_sync_waitFullAndLock();
chpl_sync_waitEmptyAndLock();
chpl_sync_markAndSignalFull();
chpl_sync_markAndSignalEmpty();
chpl_sync_isFull();
chpl_sync_initAux();
chpl_sync_destroyAux();
[23 Tasking functions]
chpl_task_init();
chpl_task_exit();
chpl_task_createCommTask();
chpl_task_callMain();
chpl_task_addToTaskList();
chpl_task_executeTasksInList();
chpl_task_taskCallFTable();
chpl_task_startMovedTask();
chpl_task_getSubloc();
chpl_task_setSubloc();
chpl_task_getRequestedSubloc();
chpl_task_getId();
chpl_task_yield();
chpl_task_sleep();
chpl_task_getSerial();
chpl_task_setSerial();
chpl_task_getPrvData();
chpl_task_getMaxPar();
chpl_task_getNumSublocales();
chpl_task_getCallStackSize();
chpl_task_getNumQueuedTasks();
chpl_task_getNumRunningTasks();
chpl_task_getNumBlockedTasks();
[5 Threading functions]
chpl_task_getNumThreads();
chpl_task_getNumIdleThreads();
chpl_task_getenvNumThreadsPerLocale();
chpl_task_getEnvCallStackSize();
chpl_task_getDefaultCallStackSize();

Important Chapel Tasking API
(11 of the 23 tasking functions, based on profiling)

Kind  API                              Description
Task  chpl_task_init();                Called before executing the “main” function (initialization)
      chpl_task_callMain();            Creates a task that runs “main” and then executes it
      chpl_task_exit();                Called when exiting
      chpl_task_yield();               Yields execution to another thread (sched_yield())
      chpl_task_addToTaskList();       Creates a task and executes it (begin, cobegin, …)
      chpl_task_executeTasksInList();  Does nothing in the Qthreads implementation
      chpl_task_getId();               Returns the ID of the thread
      chpl_task_get/setSerial();       “getSerial” usually returns false, i.e. tasks are created
      chpl_task_getMaxPar();           Returns the number of workers on the node
      chpl_task_getCallStackSize();    Returns the call stack size on the node

Important Chapel Tasking API
(8 of the 9 synchronization functions, based on profiling)

Kind  API                              Description
Sync  chpl_sync_lock(sync_var s);      Acquires a lock on the specified sync variable
      chpl_sync_unlock(sync_var s);    Releases a lock on the specified sync variable
      chpl_sync_initAux();             Initializes meta-information associated with a sync variable
      chpl_sync_destroyAux();          Destroys meta-information associated with a sync variable
      chpl_sync_waitFullAndLock();     Blocks until the specified sync variable is FULL
      chpl_sync_waitEmptyAndLock();    Blocks until the specified sync variable is EMPTY
      chpl_sync_markAndSignalFull();   Sets the specified sync variable to FULL
      chpl_sync_markAndSignalEmpty();  Sets the specified sync variable to EMPTY

Note: In the Chapel runtime, sync variables (ChapelSyncvar.chpl) bypass some of these API functions (e.g., waitFullAndLock, …) and call the Qthreads API (e.g., qthread_readFF) directly.

Open Community Runtime (OCR)
• An asynchronous event-driven runtime
• OCR API formalized in a community-developed specification
• Two known open-source implementations of the OCR specification:
  • OCR-REF, developed by Intel, Rice, and others
  • OCR-VSM, developed by U. Vienna on top of the TBB library
• Extensible through the use of hints (separation of concerns)

OCR Components
• GUID: Globally visible Unique ID
• Event Driven Task (EDT): Computation
• Event: Synchronization
• Data Block (DB): Relocatable chunk of data

Habanero-C/C++ Library (HClib)
• Library-based tasking runtime and API
• Semantically derived from X10
• Focused on: lightweight, minimal overheads; flexible synchronization; locality control; composability with other libraries
• Simplified deployment: no custom compiler, entirely library-based, only requires a C++11-compliant compiler
• Uses runtime-managed call stacks to avoid blocking
• https://github.com/habanero-rice/hclib

HClib Constructs

Description                   Example
Asynchronous task creation    async(() -> { S1; });
Bulk task synchronization     finish(() -> {
                                async(() -> { S1; async(() -> { S2; }); });
                              });
Futures and promises          async(() -> { prom->put(42); });
                              async(() -> { prom->get_future()->wait(); });
Bulk task creation            forall(loop, (i, j, k) -> { S3; });
Places for locality control   async_at(pl, () -> { S4; });
Overview of Our Implementation

Chapel Constructs                  Qthreads                   OCR                                   HClib
begin, cobegin, forall, coforall   qthread_fork_copyargs      ocrEdtTemplateCreate, ocrEdtCreate    hclib_async
Sync variables                     Qthreads’ Full/Empty API   Pthreads Synchronization API          Futures/Promises

Example: Implementations of the “begin” statement

Chapel
  proc main() {
    // creating a task
    begin writeln("b");
  }

Code Generated By Compiler
  void func(void *arg) {
    writeln("b");
  }
  int main() {
    // creating a task
    chpl_task_addToTaskList(func, …);
  }

Chapel Runtime (Chapel Tasking API)
  void chpl_task_addToTaskList(…) {
    // one of the following, depending on the tasking layer in use
    // Qthreads Version
    qthread_fork_copyargs(func);
    // OCR Version
    ocrEdtTemplateCreate();
    ocrEdtCreate();
    // HClib Version
    hclib_async();
  }
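The hand-off from compiler-generated code to a tasking layer can be sketched as a function pointer plus a private copy of its argument bundle. The sketch below is a toy under stated assumptions: toy_add_to_task_list, toy_run_all and toy_add are hypothetical names, and the queued tasks run serially in FIFO order, whereas a real tasking layer (Qthreads, OCR, HClib) would schedule them on worker threads.

```c
#include <stdlib.h>
#include <string.h>

typedef void (*task_fn)(void *);

/* Hypothetical task record: outlined function plus a private copy of its argument. */
typedef struct toy_task {
    task_fn fn;
    void *arg;
    struct toy_task *next;
} toy_task;

toy_task *toy_head = NULL, *toy_tail = NULL;

/* Stand-in for a tasking-layer entry point: copy the argument and enqueue the task.
   The copy matters because the spawning frame may be gone when the task runs. */
int toy_add_to_task_list(task_fn fn, const void *arg, size_t arg_size) {
    toy_task *t = malloc(sizeof *t);
    if (!t) return -1;
    t->arg = NULL;
    if (arg_size > 0) {
        t->arg = malloc(arg_size);
        if (!t->arg) { free(t); return -1; }
        memcpy(t->arg, arg, arg_size);
    }
    t->fn = fn;
    t->next = NULL;
    if (toy_tail) toy_tail->next = t; else toy_head = t;
    toy_tail = t;
    return 0;
}

/* Run every queued task in FIFO order; returns the number of tasks run. */
int toy_run_all(void) {
    int n = 0;
    while (toy_head) {
        toy_task *t = toy_head;
        toy_head = t->next;
        if (!toy_head) toy_tail = NULL;
        t->fn(t->arg);
        free(t->arg);
        free(t);
        n++;
    }
    return n;
}

/* Example of an outlined task body, as a compiler might generate for "begin". */
int toy_sum = 0;
void toy_add(void *arg) { toy_sum += *(int *)arg; }
```

Because each task owns a copy of its argument, changing the caller's variable after the spawn does not affect the value the task sees.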

Code Size of different Runtime Systems

Runtime    Logical Lines of Code
HClib      3,891 (< 2,000 lines relevant to Chapel)
OCR-REF    49,967 (~40,000 if only the shared-memory subset of OCR is used)
OCR-TBB    2,182 (~85,000 additional LLOC for TBB)
Qthreads   24,008

Code size measured using the UCC tool (http://csse.usc.edu/ucc_new/wordpress/)

Validation of the OCR and HClib Implementations
• Verified tests/applications:
  • 251 parallel-construct tests from the Chapel code base (taken from https://github.com/chapel-lang/chapel/tree/master/test/parallel)
• OCR
  • 1 failure due to the current lack of support for setting the call stack size
  • 21 failures due to timeouts caused by a deadlock that arises when more tasks try to acquire a sync variable than there are OCR workers
  • All other tests pass with our OCR implementation
• HClib
  • 1 failure due to the current lack of support for setting the call stack size
  • 20 failures due to timeouts caused by a deadlock that arises when more tasks try to acquire a sync variable than there are HClib workers
  • All other tests pass with our HClib implementation

Preliminary Evaluations: Platforms
Cray XC30™ Supercomputer @ NERSC (Edison)
• Node
  • Intel Xeon E5-2695 @ 2.40GHz x 24 cores (only one 12-core socket was used)
  • 64GB of RAM
• Interconnect
  • Cray Aries interconnect with Dragonfly topology

Preliminary Evaluations: Applications

Application  Application Field     Description                        Data Size                            Constructs Used
UTS          Tree Search           Unstructured Tree Search           T1 Tree (4M Nodes)                   begin
Stream       Numerical Computing   A Simple Vector Kernel             N = 256M                             forall
Labelprop    Graph Analytics       An Analysis of Tweets on Twitter   Users = 10K, nTweets = 100K          forall
KMeans       Machine Learning      K-Means Clustering                 N = 10M, K = 3, dim = 3              reduce
CoMD         Simulation            Molecular Dynamics Simulation      Cu, Lennard-Jones, Grid = 20x20x20   coforall

Preliminary Results (Single-Node)
Absolute performance in seconds (lower is better)

           UTS    Stream  Labelprop  KMeans  CoMD
Qthreads   1.23   0.17    0.90       0.19    1.05
OCR-REF    1.14   0.16    1.69       0.37    1.1
OCR-VSM    0.99   0.14    1.19       0.24    1.53
HClib      0.97   0.13    0.62       0.29    0.99

1. HClib is the fastest, 2. OCR-VSM is faster than OCR-REF, 3. Qthreads is in some cases faster and in some cases slower than the other variants

Analysis #1: Work-stealing
(Same single-node chart as in Preliminary Results; lower is better)
Work-stealing (OCR, HClib) accelerates UTS; however, this is not the case with KMeans

Detailed analysis with “perf” (UTS)
• [UTS OCR]
  15.34%  uts-deq.ocr.out  uts-deq.ocr.out  [.] sha1_compile
                                                create_tree_chpl
  12.28%  uts-deq.ocr.out  libc-2.12.so     [.] _int_free
                                                malloc
                                                chpl_user_main
   6.88%  uts-deq.ocr.out  libm-2.12.so     [.] __ieee754_log
                                                _int_malloc
                                                remove3
• [UTS Qthreads]
  25.71%  uts-deq.qthread  uts-deq.qthreads.out  [.] qt_scheduler_get_thread
  14.09%  uts-deq.qthread  uts-deq.qthreads.out  [.] sha1_compile
  10.16%  uts-deq.qthread  libc-2.12.so          [.] _int_free
   9.34%  uts-deq.qthread  libc-2.12.so          [.] malloc
   6.87%  uts-deq.qthread  uts-deq.qthreads.out  [.] create_tree_chpl
   5.60%  uts-deq.qthread  libc-2.12.so          [.] _int_malloc
   5.49%  uts-deq.qthread  libm-2.12.so          [.] __ieee754_log
• Qthreads: the scheduler is the bottleneck
  • The default Qthreads scheduler in Chapel (nemesis) does not perform work-stealing
• OCR: thanks to work stealing, the main computation is the bottleneck
• Percentages are the share of samples that the “perf” command collected in each function
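The work-stealing idea behind the OCR and HClib results can be sketched as a per-worker deque: the owner pushes and pops at one end, and idle workers steal from the other end. The sketch below is a simplified, mutex-protected version with hypothetical names (ws_deque, ws_push, ws_pop, ws_steal); it is not the scheduler of OCR, HClib or Qthreads, and production runtimes typically use lock-free deques instead.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define WS_CAP 64   /* fixed capacity, chosen only for this sketch */

/* Hypothetical per-worker deque of task IDs. */
typedef struct {
    pthread_mutex_t lock;
    int items[WS_CAP];
    int head;   /* index of the oldest task (steal end) */
    int tail;   /* one past the newest task (owner end) */
} ws_deque;

void ws_init(ws_deque *d) {
    pthread_mutex_init(&d->lock, NULL);
    d->head = d->tail = 0;
}

/* Owner: add a newly spawned task at the tail. */
bool ws_push(ws_deque *d, int task) {
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail < WS_CAP) { d->items[d->tail++] = task; ok = true; }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Owner: take the newest task (LIFO), which tends to keep data cache-warm. */
bool ws_pop(ws_deque *d, int *task) {
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail > d->head) { *task = d->items[--d->tail]; ok = true; }
    if (d->head == d->tail) d->head = d->tail = 0;   /* reuse space when empty */
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Thief: take the oldest task (FIFO) from another worker's deque. */
bool ws_steal(ws_deque *d, int *task) {
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail > d->head) { *task = d->items[d->head++]; ok = true; }
    if (d->head == d->tail) d->head = d->tail = 0;   /* reuse space when empty */
    pthread_mutex_unlock(&d->lock);
    return ok;
}
```

In an unbalanced workload such as UTS, a worker whose own deque is empty calls ws_steal on a victim's deque instead of idling.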

Analysis #2: Coforall with OCR-VSM
(Same single-node chart as in Preliminary Results; lower is better)

Detailed analysis of “Coforall”
Figure: Coforall overheads (lower is better); x-axis: # of tasks; y-axis: absolute timings (sec), log scale from 1.00E-06 to 1.00E+00; series: Qthreads, OCR-REF, OCR-VSM, HClib

Chapel
  coforall i in 1..nTasks {
    habanero();
  }

• OCR-VSM is the slowest
• HClib is mostly the fastest when nTasks < 2048
• Qthreads is the fastest when nTasks > 2048
• CoMD’s force computation
  • nTasks = 1728
• Bottlenecks identified with HPCToolkit
  • (OCR-VSM) tbb::internal::private_worker::run()
  • (OCR-REF) wstSchedulerObjectCount

Analysis #3: Overhead Analysis (Single-Node)
Overhead of tasking (%): Qthreads 16.85%, HClib 0.78%
• Measure TS = execution time of the sequential version
• Measure T1 = single-thread execution time of the parallel version
• Overhead = (T1 - TS)/TS (not reported for OCR because T1 cannot be easily measured for OCR)
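The overhead metric is simple arithmetic and can be written as a small helper. The function name tasking_overhead and the timings in the usage note are made up for illustration; they are not measurements from this study.

```c
/* Relative tasking overhead: (T1 - TS) / TS.
   ts_sec: execution time of the sequential version, in seconds.
   t1_sec: single-thread execution time of the parallel version, in seconds. */
double tasking_overhead(double ts_sec, double t1_sec) {
    return (t1_sec - ts_sec) / ts_sec;
}
```

With hypothetical timings TS = 2.0 s and T1 = 2.5 s, tasking_overhead(2.0, 2.5) gives 0.25, i.e. a 25% overhead.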

Optimized Sync Variable Implementation in HClib
• Standard lock-based implementations
  • chpl_sync_waitFullAndLock / chpl_sync_waitEmptyAndLock:
    • Based on general lock (chpl_sync_lock) and state flag
  • chpl_sync_markAndSignalFull / chpl_sync_markAndSignalEmpty:
    • Based on general unlock (chpl_sync_unlock) and state flag
• Efficient lock support using spin-lock
  • Extension of ticket lock approach [1,2]
  • Two-step inter-task coordination
    1. Fast point-to-point synchronization based on busy-wait (with timeout)
    2. Sleep-and-awake synchronization for context switching
[1] Algorithms for Scalable Synchronization on Shared Memory Multiprocessors. J. Mellor-Crummey and M. Scott. ACM Transactions on Computer Systems, 9(1):21–65, February 1991.
[2] Design, Verification and Applications of a New Read-Write Lock Algorithm. Jun Shirako, Nick Vrvilo, Eric G. Mercer, Vivek Sarkar. 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2012.
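A basic ticket lock, and sync-variable operations built from a lock plus a state flag, can be sketched in C11 as follows. This is a minimal sketch and not the HClib implementation: it shows only the plain busy-wait ticket lock and omits the timeout and the sleep-and-awake step, and the names ticket_lock, tl_sync_int and the tl_* functions are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Ticket lock: waiters are served in the order in which they took a ticket. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
} ticket_lock;

void ticket_lock_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me) { /* busy-wait */ }
}

void ticket_lock_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);
}

/* Hypothetical sync variable: a lock plus a full/empty state flag. */
typedef struct {
    ticket_lock lock;
    bool full;                 /* protected by lock */
    int  value;                /* protected by lock */
} tl_sync_int;

void tl_sync_init(tl_sync_int *s) {
    ticket_lock_init(&s->lock);
    s->full = false;
    s->value = 0;
}

/* Returns with the lock held once the variable is FULL. */
void tl_wait_full_and_lock(tl_sync_int *s) {
    for (;;) {
        ticket_lock_acquire(&s->lock);
        if (s->full) return;
        ticket_lock_release(&s->lock);   /* let a writer in, then retry */
    }
}

/* Returns with the lock held once the variable is EMPTY. */
void tl_wait_empty_and_lock(tl_sync_int *s) {
    for (;;) {
        ticket_lock_acquire(&s->lock);
        if (!s->full) return;
        ticket_lock_release(&s->lock);   /* let a reader in, then retry */
    }
}

/* Set the state flag and release the lock taken by a tl_wait_* call. */
void tl_mark_and_signal_full(tl_sync_int *s)  { s->full = true;  ticket_lock_release(&s->lock); }
void tl_mark_and_signal_empty(tl_sync_int *s) { s->full = false; ticket_lock_release(&s->lock); }

/* Write when EMPTY and leave FULL; read when FULL and leave EMPTY. */
void tl_write_ef(tl_sync_int *s, int x) {
    tl_wait_empty_and_lock(s);
    s->value = x;
    tl_mark_and_signal_full(s);
}

int tl_read_fe(tl_sync_int *s) {
    tl_wait_full_and_lock(s);
    int x = s->value;
    tl_mark_and_signal_empty(s);
    return x;
}
```

The retry loops here spin indefinitely; the two-step scheme on the slide bounds the busy-wait with a timeout and then falls back to putting the task to sleep.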

The impact of the Sync Variable Optimization (Single-Node)
UTS-REC, absolute performance in seconds (lower is better)

Qthreads-nemesis (FEBs)                   8.11
Qthreads-sherwood (FEBs)                  7.97
HClib (Promises/Futures)                  9.10
HClib (Promises/Futures + Ticket Locks)   6.26

Conclusions
• We’ve implemented OCR- and HClib-based Chapel tasking runtimes
  • https://github.com/srirajpaul/chapel/tree/hclib_ocr
• Lessons Learned
  • The use of HClib is a promising way to accelerate PGAS tasking runtimes
  • The use of ticket locks can improve the performance of sync variables
  • However, there are still further research opportunities
• Future Work
  • Improving work-stealing policies
  • Adding support for high-level constructs to Chapel’s tasking layer
