Chapel-on-X:
Exploring Tasking Runtimes for PGAS Languages
Akihiro Hayashi (Rice), Sri Raj Paul (Rice), Max Grossman (Rice), Jun Shirako (Rice), Vivek Sarkar (Georgia Tech)
A Big Picture: Exploring Dynamic Tasking Runtimes
Photo Credits: http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
© Argonne National Lab. © RIKEN AICS. © Berkeley Lab.
Large Scale Systems
Runtime Systems
• Dynamic Tasking: Qthreads, OCR, Habanero-C/C++, …
• Communication: MPI, GASNet, …
PGAS Languages: X10, CAF, Habanero-UPC++
What are desirable features for future systems?
Chapel Language
High-productivity features:
• Global-View
• Task/Loop parallelism
• Data Distribution
• Synchronization
  • Sync Variables
Chapel’s Tasking Constructs: Task Creation
begin/cobegin:
// spawn a task
begin task();
cobegin {
  taskA(); // spawn a task
  taskB(); // spawn a task
}
forall/coforall:
// spawn tasks (iterations are chunked across tasks)
forall i in 1..N {
  task();
}
// spawn one task per iteration (nTasks = N)
coforall i in 1..N {
  task();
}
Chapel’s Tasking Constructs: Sync Variables
A sync variable has a logical full/empty state associated with its value.
Producer:
// initial state is EMPTY
var sy$: sync int;
// writing a value sets the state of sy$ to FULL
sy$ = 1;
Consumer:
begin {
  // blocked until sy$ is FULL
  var sy = sy$;
  …
}
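The full/empty behaviour of the producer and consumer above can be sketched in C with a mutex and a condition variable. This is only an illustration under stated assumptions: the type and function names (`sketch_sync_int`, `sketch_sync_write`, `sketch_sync_read`) are hypothetical and are not part of the Chapel runtime, and the sketch models only the default behaviour where a write waits for EMPTY and leaves FULL, and a read waits for FULL and leaves EMPTY.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical full/empty cell; not the Chapel runtime's actual type. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;  /* signalled on every state change */
    bool            full;     /* logical full/empty state */
    int             value;
} sketch_sync_int;

void sketch_sync_init(sketch_sync_int *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->changed, NULL);
    s->full = false;          /* initial state is EMPTY */
    s->value = 0;
}

/* Models `sy$ = v;`: wait until EMPTY, store the value, mark FULL. */
void sketch_sync_write(sketch_sync_int *s, int v) {
    pthread_mutex_lock(&s->lock);
    while (s->full)
        pthread_cond_wait(&s->changed, &s->lock);
    s->value = v;
    s->full = true;
    pthread_cond_broadcast(&s->changed);
    pthread_mutex_unlock(&s->lock);
}

/* Models `var sy = sy$;`: wait until FULL, load the value, mark EMPTY. */
int sketch_sync_read(sketch_sync_int *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->full)
        pthread_cond_wait(&s->changed, &s->lock);
    int v = s->value;
    s->full = false;
    pthread_cond_broadcast(&s->changed);
    pthread_mutex_unlock(&s->lock);
    return v;
}
```

Note that a waiting reader or writer blocks its whole OS thread in this sketch, which is why a runtime with a fixed number of workers can run out of workers when many tasks wait on sync variables.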
Goal of Our Study:
Explore alternatives to the Qthreads-based threading runtime for Chapel to prepare for future systems
Chapel Runtime: Communication, Tasking/Threading, Memory, Launchers, QIO, Timers, Standard
Third-party libs: GASNet, OCR/HClib, Qthreads, libc, HDFS
Our Approach
• Study how the current Chapel tasking API uses the Qthreads API
  • 23 tasking functions, 9 sync functions, 5 threading functions (chpl-tasks.h)
• Alternative 1: Open Community Runtime (OCR): a task-based runtime with multiple implementations of runtime APIs that adhere to a common specification
• Alternative 2: Habanero-C/C++ library (HClib): able to efficiently support blocking operations in tasks (sync vars, barriers, future gets, …), unlike OCR
Chapel Tasking API (chpl-tasks.h)
[9 Sync functions]
chpl_sync_lock();
chpl_sync_unlock();
chpl_sync_waitFullAndLock();
chpl_sync_waitEmptyAndLock();
chpl_sync_markAndSignalFull();
chpl_sync_markAndSignalEmpty();
chpl_sync_isFull();
chpl_sync_initAux();
chpl_sync_destroyAux();
[23 Tasking functions]
chpl_task_init();
chpl_task_exit();
chpl_task_createCommTask();
chpl_task_callMain();
chpl_task_addToTaskList();
chpl_task_executeTasksInList();
chpl_task_taskCallFTable();
chpl_task_startMovedTask();
chpl_task_getSubloc();
chpl_task_setSubloc();
[23 Tasking functions (Cont’d)]
chpl_task_getRequestedSubloc();
chpl_task_getId();
chpl_task_yield();
chpl_task_sleep();
chpl_task_getSerial();
chpl_task_setSerial();
chpl_task_getPrvData();
chpl_task_getMaxPar();
chpl_task_getNumSublocales();
chpl_task_getCallStackSize();
chpl_task_getNumQueuedTasks();
chpl_task_getNumRunningTasks();
chpl_task_getNumBlockedTasks();
[5 Threading functions]
chpl_task_getNumThreads();
chpl_task_getNumIdleThreads();
chpl_task_getenvNumThreadsPerLocale();
chpl_task_getEnvCallStackSize();
chpl_task_getDefaultCallStackSize();
Important Chapel Tasking API (11 of the 23 tasking functions, based on profile)
Kind: Task
API | Description
chpl_task_init(); | Called before executing the “main” function (initialization)
chpl_task_callMain(); | Creates a task that runs “main” and then executes it
chpl_task_exit(); | Called when exiting
chpl_task_yield(); | Yields execution to another thread (sched_yield())
chpl_task_addToTaskList(); | Creates a task and executes it (begin, cobegin, …)
chpl_task_executeTasksInList(); | Does nothing in the Qthreads implementation
chpl_task_getId(); | Returns the ID of the thread
chpl_task_get/setSerial(); | “getSerial” usually returns false, i.e., tasks are created
chpl_task_getMaxPar(); | Returns the number of workers in the node
chpl_task_getCallStackSize(); | Returns the call stack size in the node
Important Chapel Tasking API (8 of the 9 synchronization functions, based on profile)
Kind: Sync
API | Description
chpl_sync_lock(sync_var s); | Acquires a lock on the specified sync variable
chpl_sync_unlock(sync_var s); | Releases a lock on the specified sync variable
chpl_sync_initAux(); | Initializes meta-information associated with a sync var
chpl_sync_destroyAux(); | Destroys meta-information associated with a sync var
chpl_sync_waitFullAndLock(); | Blocks until the specified sync variable is FULL
chpl_sync_waitEmptyAndLock(); | Blocks until the specified sync variable is EMPTY
chpl_sync_markAndSignalFull(); | Sets the specified sync variable to FULL
chpl_sync_markAndSignalEmpty(); | Sets the specified sync variable to EMPTY
Note: In the Chapel runtime, sync variables (ChapelSyncvar.chpl) bypass some of this API (e.g., waitFullAndLock, …) and directly call the Qthreads API (e.g., qthread_readFF).
Open Community Runtime (OCR)
• An asynchronous event-driven runtime
• OCR API formalized in a community-developed specification
• Two known open-source implementations of the OCR specification:
  • OCR-REF, developed by Intel, Rice, and others
  • OCR-VSM, developed by U. Vienna on top of the TBB library
• Extensible through the use of hints – separation of concerns
OCR Components
• GUID: Globally visible Unique ID
• Event Driven Task (EDT): Computation
• Event: Synchronization
• Data Block (DB): Relocatable chunk of data
Habanero-C/C++ Library (HClib)
• Library-based tasking runtime and API
• Semantically derived from X10
• Focused on: lightweight, minimal overheads; flexible synchronization; locality control; composability with other libraries
• Simplified deployment: no custom compiler, entirely library-based, only requires a C++11-compliant compiler
• Uses runtime-managed call stacks to avoid blocking
• https://github.com/habanero-rice/hclib
HClib Constructs
Description | Example
Asynchronous task creation | async(() -> { S1; });
Bulk task synchronization | finish(() -> { async(() -> { S1; async(() -> { S2; }); }); });
Futures and promises | async(() -> { prom->put(42); }); async(() -> { prom->get_future()->wait(); });
Bulk task creation | forall(loop, (i, j, k) -> { S3; });
Places for locality control | async_at(pl, () -> { S4; });
Overview of Our Implementation
Chapel Constructs | Qthreads | OCR | HClib
begin, cobegin, forall, coforall | qthread_fork_copyargs | ocrEdtTemplateCreate, ocrEdtCreate | hclib_async
Sync variables | Qthreads’ Full/Empty API | Pthreads Synchronization API | Futures/Promises
Example: Implementations of the “begin” statement
Chapel source:
proc main() {
  // creating a task
  begin writeln("b");
}
Code Generated By Compiler:
void func(void *arg) {
  writeln("b");
}
int main() {
  // creating a task
  chpl_task_addToTaskList(func, …);
}
Chapel Runtime (implements the Chapel Tasking API):
void chpl_task_addToTaskList(…) {
  // each tasking layer provides its own version of this function
  // Qthreads Version
  qthread_fork_copyargs(func);
  // OCR Version
  ocrEdtTemplateCreate();
  ocrEdtCreate();
  // HClib Version
  hclib_async();
}
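The layering in this example (compiler-outlined function, tasking API entry point, backend spawn call) can be sketched in plain C. All names below (`sketch_task_fn`, `sketch_task_add`, `sketch_backend_spawn`, `sketch_begin_body`) are hypothetical and do not match the real chpl-tasks.h signatures, and the stand-in backend simply runs the task inline instead of calling Qthreads, OCR, or HClib.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not the real chpl-tasks.h API. */
typedef void (*sketch_task_fn)(void *arg);

/* Counts how many tasks the entry point has been asked to create. */
int sketch_tasks_created = 0;

/* Stand-in backend: runs the task inline. A real tasking layer would hand
   fn/arg to its own spawn call (qthread_fork_copyargs, ocrEdtCreate,
   hclib_async) instead. */
void sketch_backend_spawn(sketch_task_fn fn, void *arg) {
    fn(arg);
}

/* Runtime entry point that compiler-generated code calls for `begin`. */
void sketch_task_add(sketch_task_fn fn, void *arg) {
    sketch_tasks_created++;
    sketch_backend_spawn(fn, arg);
}

/* Compiler-outlined body of the `begin` statement: records that it ran. */
void sketch_begin_body(void *arg) {
    int *ran = arg;
    *ran = 1;
}
```

Because the generated code only sees the entry point, swapping the tasking layer means replacing `sketch_backend_spawn` and nothing else, which is the property the study relies on.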
Code Size of Different Runtime Systems
Runtime | Logical Lines of Code
HClib | 3,891 (< 2,000 lines relevant to Chapel)
OCR-REF | 49,967 (~40,000 if only the shared-memory subset of OCR is used)
OCR-TBB | 2,182 (~85,000 additional LLOC for TBB)
Qthreads | 24,008
Code size measured using the UCC tool (http://csse.usc.edu/ucc_new/wordpress/)
Validation of the OCR and HClib Implementations
• Verified Tests/Applications:
  • 251 parallel-construct tests from the Chapel code base (taken from https://github.com/chapel-lang/chapel/tree/master/test/parallel)
• OCR
  • 1 failure due to the current lack of support for setting the call stack size
  • 21 failures due to timeouts caused by deadlock, which occurs when the number of tasks trying to acquire a sync variable exceeds the number of OCR workers
  • All other tests pass with our OCR implementation
• HClib
  • 1 failure due to the current lack of support for setting the call stack size
  • 20 failures due to timeouts caused by deadlock, which occurs when the number of tasks trying to acquire a sync variable exceeds the number of HClib workers
  • All other tests pass with our HClib implementation
Preliminary Evaluations: Platforms
Cray XC30™ Supercomputer @ NERSC (Edison)
• Node
  • Intel Xeon E5-2695 @ 2.40GHz x 24 cores (only one 12-core socket used)
  • 64GB of RAM
• Interconnect
  • Cray Aries interconnect with Dragonfly topology
Preliminary Evaluations: Applications
Application | Application Field | Description | Data Size | Constructs Used
UTS | Tree Search | Unstructured Tree Search | T1 Tree (4M Nodes) | begin
Stream | Numerical Computing | A Simple Vector Kernel | N = 256M | forall
Labelprop | Graph Analytics | An Analysis of Tweets on Twitter | Users = 10K, nTweets = 100K | forall
KMeans | Machine Learning | K-Means Clustering | N = 10M, K = 3, dim = 3 | reduce
CoMD | Simulation | Molecular Dynamics Simulation | Cu, Lennard-Jones, Grid = 20x20x20 | coforall
Preliminary Results (Single-Node)
Absolute performance (sec), lower is better:
Runtime | UTS | Stream | Labelprop | Kmeans | CoMD
Qthreads | 1.23 | 0.17 | 0.90 | 0.19 | 1.05
OCR-REF | 1.14 | 0.16 | 1.69 | 0.37 | 1.1
OCR-VSM | 0.99 | 0.14 | 1.19 | 0.24 | 1.53
HClib | 0.97 | 0.13 | 0.62 | 0.29 | 0.99
1. HClib is the fastest on every benchmark except Kmeans; 2. OCR-VSM is faster than OCR-REF except on CoMD; 3. Qthreads is in some cases faster and in some cases slower than the other variants.
Analysis #1: Work-stealing
[Figure: single-node absolute performance (sec, lower is better) of UTS, Stream, Labelprop, Kmeans, and CoMD for Qthreads, OCR-REF, OCR-VSM, and HClib; same chart as in Preliminary Results]
Work-stealing (OCR, HClib) accelerates UTS; however, this is not the case for KMeans.
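As a reminder of what work-stealing means for a scheduler, here is a minimal C sketch of a per-worker task deque: the owner pushes and pops at one end, and an idle worker steals from the other. It is an illustration only, not the scheduler of OCR, HClib, or Qthreads. All names (`sketch_deque`, `sketch_push`, `sketch_pop`, `sketch_steal`) are hypothetical, tasks are reduced to integer IDs, a single mutex replaces the lock-free protocols real runtimes use, and array slots are never recycled.

```c
#include <pthread.h>
#include <stdbool.h>

#define SKETCH_DEQUE_CAP 64

/* Hypothetical per-worker deque of task IDs, guarded by one mutex. */
typedef struct {
    pthread_mutex_t lock;
    int tasks[SKETCH_DEQUE_CAP];
    int head;  /* index of the oldest task (steal end) */
    int tail;  /* one past the newest task (owner end) */
} sketch_deque;

void sketch_deque_init(sketch_deque *d) {
    pthread_mutex_init(&d->lock, NULL);
    d->head = 0;
    d->tail = 0;
}

/* Owner pushes newly spawned work at the tail; fails when full. */
bool sketch_push(sketch_deque *d, int task) {
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail < SKETCH_DEQUE_CAP) {
        d->tasks[d->tail++] = task;
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Owner pops the newest task (LIFO order). */
bool sketch_pop(sketch_deque *d, int *task) {
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail > d->head) {
        *task = d->tasks[--d->tail];
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Idle worker steals the oldest task (FIFO order) from a victim. */
bool sketch_steal(sketch_deque *victim, int *task) {
    bool ok = false;
    pthread_mutex_lock(&victim->lock);
    if (victim->tail > victim->head) {
        *task = victim->tasks[victim->head++];
        ok = true;
    }
    pthread_mutex_unlock(&victim->lock);
    return ok;
}
```

For an irregular workload such as UTS, where subtrees differ widely in size, stealing lets idle workers pick up work from busy ones; for a regular loop there is little imbalance to correct.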
Detailed analysis with “perf” (UTS)
• [UTS OCR]
15.34% uts-deq.ocr.out uts-deq.ocr.out [.] sha1_compile
13.87% uts-deq.ocr.out uts-deq.ocr.out [.] create_tree_chpl
12.28% uts-deq.ocr.out libc-2.12.so [.] _int_free
11.29% uts-deq.ocr.out libc-2.12.so [.] malloc
8.27% uts-deq.ocr.out uts-deq.ocr.out [.] chpl_user_main
6.88% uts-deq.ocr.out libm-2.12.so [.] __ieee754_log
6.41% uts-deq.ocr.out libc-2.12.so [.] _int_malloc
4.77% uts-deq.ocr.out uts-deq.ocr.out [.] remove3
• [UTS Qthreads]
25.71% uts-deq.qthread uts-deq.qthreads.out [.] qt_scheduler_get_thread
14.09% uts-deq.qthread uts-deq.qthreads.out [.] sha1_compile
10.16% uts-deq.qthread libc-2.12.so [.] _int_free
9.34% uts-deq.qthread libc-2.12.so [.] malloc
6.87% uts-deq.qthread uts-deq.qthreads.out [.] create_tree_chpl
5.60% uts-deq.qthread libc-2.12.so [.] _int_malloc
5.49% uts-deq.qthread libm-2.12.so [.] __ieee754_log
• Qthreads: the scheduler is the bottleneck
  • The default Qthreads scheduler in Chapel (nemesis) does not perform work-stealing
• OCR: the main computation is the bottleneck, thanks to work-stealing
• Percentages are the share of samples that the “perf” command collected in each function
Analysis #2: Coforall with OCR-VSM
[Figure: single-node absolute performance (sec, lower is better) of UTS, Stream, Labelprop, Kmeans, and CoMD for Qthreads, OCR-REF, OCR-VSM, and HClib; same chart as in Preliminary Results]
Detailed analysis of “Coforall”
[Figure: Coforall overheads (lower is better); absolute timings (sec, log scale from 1.00E-06 to 1.00E+00) vs. # of tasks for Qthreads, OCR-REF, OCR-VSM, and HClib]
Chapel:
coforall i in 1..nTasks {
  habanero();
}
• OCR-VSM is the slowest
• HClib is mostly the fastest when nTasks < 2048
• Qthreads is the fastest when nTasks > 2048
• CoMD’s force computation
  • nTasks = 1728
• Bottlenecks identified with HPCToolkit
  • (OCR-VSM) tbb::internal::private_worker::run()
  • (OCR-REF) wstSchedulerObjectCount
Analysis #3: Overhead Analysis
Overhead of tasking (Single-Node): Qthreads 16.85%, HClib 0.78%
• Measure TS = execution time of the sequential version
• Measure T1 = single-thread execution time of the parallel version
• Overhead = (T1 - TS)/TS (not reported for OCR because T1 cannot be easily measured for OCR)
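The overhead formula above is simple enough to write down directly. The helper name `sketch_tasking_overhead` is hypothetical, and the timings in the usage note are made-up values, not measurements from this study.

```c
/* Relative tasking overhead, (T1 - TS) / TS, as a fraction.
   t1_sec: single-thread execution time of the parallel version.
   ts_sec: execution time of the sequential version (must be > 0). */
double sketch_tasking_overhead(double t1_sec, double ts_sec) {
    return (t1_sec - ts_sec) / ts_sec;
}
```

For instance, with hypothetical timings T1 = 2.2 s and TS = 2.0 s the function returns about 0.10, i.e., a tasking overhead of roughly 10%.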
Optimized Sync Variable Implementation in HClib
• Standard lock-based implementations
  • chpl_sync_waitFullAndLock / chpl_sync_waitEmptyAndLock:
    • Based on general lock (chpl_sync_lock) and state flag
  • chpl_sync_markAndSignalFull / chpl_sync_markAndSignalEmpty:
    • Based on general unlock (chpl_sync_unlock) and state flag
• Efficient lock support using spin-lock
  • Extension of the ticket lock approach [1,2]
  • Two-step inter-task coordination
    1. Fast point-to-point synchronization based on busy-wait (with timeout)
    2. Sleep-and-wake synchronization for context switching
[1] J. Mellor-Crummey and M. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, February 1991.
[2] Jun Shirako, Nick Vrvilo, Eric G. Mercer, Vivek Sarkar. Design, Verification and Applications of a New Read-Write Lock Algorithm. 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2012.
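A basic ticket lock with a bounded busy-wait followed by a yield can be sketched with C11 atomics as follows. This is not HClib's implementation: the names (`sketch_ticket_lock`, `sketch_ticket_lock_acquire`, `sketch_ticket_lock_release`) and the spin budget `SKETCH_SPIN_LIMIT` are assumptions, and `sched_yield()` merely stands in for the sleep-and-wake step, which in a tasking runtime would switch to another task.

```c
#include <sched.h>
#include <stdatomic.h>

/* Assumed busy-wait budget (number of polls) before yielding. */
#define SKETCH_SPIN_LIMIT 1000u

/* Hypothetical ticket lock; not HClib's actual data structure. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to enter */
} sketch_ticket_lock;

void sketch_ticket_init(sketch_ticket_lock *l) {
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

void sketch_ticket_lock_acquire(sketch_ticket_lock *l) {
    /* Take a ticket; waiters are served in ticket (FIFO) order. */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1u);
    unsigned spins = 0;
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        /* Step 1: busy-wait for a bounded number of polls. */
        if (++spins >= SKETCH_SPIN_LIMIT) {
            /* Step 2: give up the core; stands in for sleep-and-wake. */
            sched_yield();
            spins = 0;
        }
    }
}

void sketch_ticket_lock_release(sketch_ticket_lock *l) {
    /* Only the holder advances now_serving, admitting the next ticket. */
    atomic_fetch_add_explicit(&l->now_serving, 1u, memory_order_release);
}
```

The bounded spin keeps short hand-offs cheap, while the yield avoids burning a core when the wait is long.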
The Impact of the Sync Variable Optimization (Single-Node)
UTS-REC, absolute performance (sec), lower is better:
Qthreads-nemesis (FEBs) | 8.11
Qthreads-sherwood (FEBs) | 7.97
HClib (Promises/Futures) | 9.10
HClib (Promises/Futures + Ticket Locks) | 6.26
Conclusions
• We’ve implemented OCR- and HClib-based Chapel tasking runtimes
  • https://github.com/srirajpaul/chapel/tree/hclib_ocr
• Lessons Learned
  • The use of HClib is a promising way to accelerate PGAS tasking runtimes
  • The use of ticket locks can improve the performance of sync variables
  • However, there are still further research opportunities
• Future Work
  • Improving work-stealing policies
  • Supporting high-level constructs in Chapel’s tasking layer

Editor's Notes

  • #2 Hello everyone. My name is Akihiro and I’m a research scientist at Rice University. Today I’ll be sharing our preliminary experiences in using different tasking runtimes for the Chapel language.
  • #3 Okay, when it comes to programming models for large-scale systems, we believe that PGAS languages have lots of interesting features that facilitate large-scale programming. Also, these high-level features are usually handled by runtime systems such as tasking and communication runtimes. In this talk, our focus is dynamic tasking runtimes, and our goal is to study desirable features of tasking runtimes for future exascale systems.
  • #4 Currently, our focus is the Chapel language. Chapel offers several high-productivity features. One nice thing about Chapel is that it supports a global-view programming model where programmers can write distributed programs in the same way they do for shared-memory systems. It also supports task and loop parallelism, data distribution, and synchronization to facilitate large-scale programming.
  • #5 For those who don’t know much about Chapel, let me first show some examples using Chapel’s tasking constructs. If you take a look at the left box, you’ll see an example of the begin and cobegin constructs. The begin construct basically spawns a task that runs independently of the main thread of execution, and similarly the cobegin construct spawns a group of tasks, one for each statement. The right box
  • #6 One of the interesting features of Chapel is sync variables. Sync variables have a logical full/empty state associated with their value. Usually sync variables are used for synchronization between tasks. What you see here is an example of producer-consumer synchronization. In the left box, the producer creates a sync variable whose initial state is EMPTY and sets the state of the sync variable to FULL by assigning some value. In the right box, the consumer is blocked until the state of the sync variable is FULL. That’s how synchronization works using sync variables.
  • #7 Okay,
  • #12 Alright, let’s talk about OCR. OCR is an asynchronous event-driven runtime, and its API was formalized in a community-developed specification. So far there are two known open-source implementations. One is what we call OCR-REF, developed by Intel, Rice, and others. The other is OCR-VSM, which is developed by U. Vienna on top of the TBB library.
  • #13 Components of OCR are as follows: GUID is a globally unique ID used for identifying OCR objects. Basically, OCR objects are Event Driven Tasks (EDTs), which actually perform some computation, Events, and Data Blocks.
  • #17 (VIVEK) Replace “Generated Code By Compiler” by “Code Generated By Compiler”
  • #19 (VIVEK) Remove “the”, i.e., replace “the verify the” by “verify the” Edit bullet on “Parallel constructs tests”. Is it correct that qthreads passes 163 tests?
  • #24 (VIVEK) Minor edit to last bullet