1. DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIVERSITY OF IOANNINA
GREECE
Spiros N. Agathos
Efficient OpenMP Runtime Support for General-Purpose and Embedded Multi-core Platforms
2. Parallel Processing Group
UNIVERSITY OF IOANNINA
2
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Multi-core systems
Personal computers: multiple compute cores in a socket
Home appliances now include embedded systems
HPC supercomputers: multicore CPUs / GPGPUs / DSPs / FPGAs
Market trends (CPU + coprocessor):
Tianhe-2 (China), rank 1 (November 2015):
16,000 nodes x {2 Intel Xeon + 3 Xeon Phi}
More cores = more power?
Easy transition from serial to parallel programming?
Efficient programming
Code portability
Exploiting heterogeneous systems
Low-level vs. high-level programming languages
OpenMP
Directive-based model for parallel programming
Base language (C/C++/Fortran)
Incremental parallelization
Highly successful, widely used, supported by many
compilers
Fork-join execution model
First version in 1997, now v4.5 (November 2015)
Parallel loops
Sections
Synchronization (barrier/critical/atomic)
[Fork-join diagram: the initial thread runs the preamble, forks a team of work threads, and all join back before the initial thread continues]
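The fork-join model above can be illustrated with a minimal runnable sketch; the helper name is made up here, and the team size depends on the environment (it is 1 if the code is compiled without OpenMP support, since the pragmas are then ignored):

```c
#include <assert.h>

/* Fork-join in its simplest form: the initial thread forks a team,
 * every member executes the region body once, and all threads join
 * at the implicit barrier before the initial thread continues. */
int count_team_threads(void)
{
    int count = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        count++;            /* one increment per team member */
    }
    return count;           /* team size (1 when OpenMP is disabled) */
}
```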
Since version 3.0 (May 2008), OpenMP supports task-based parallelism.
With tasking, the expressiveness of OpenMP is enriched with flexible management of irregular and recursive parallelism.
An OpenMP task is a unit of work scheduled for
asynchronous execution
Has its own data environment
Can share data with other tasks
OpenMP tasking model
int fib(int n);

void main()
{
int res;
#pragma omp parallel
{
#pragma omp single
res = fib(40);
}
}
int fib(int n)
{
int n1, n2;
if(n <= 1) return n;
#pragma omp task shared(n1)
n1 = fib(n-2);
#pragma omp task shared(n2)
n2 = fib(n-1);
#pragma omp taskwait
return n1+n2;
}
OpenMP task example
Device Support added in version 4.0 (November 2013)
An application can be executed by a set of devices: the host device and other target devices
Host-centric execution model
Create data environment in the device
Instruct device to execute code regions (kernel)
OpenMP device model
void main()
{
int input[10], result[10];
init(input);
#pragma omp target map(to:input) map(from:result)
{
int i;
#pragma omp parallel for
for(i=0; i<10; i++)
result[i] = input[i]++;
}
print(result);
}
OpenMP target example
Color legend: executed on host / communication / executed on device
OMPi (http://paragroup.cs.uoi.gr/wpsite/software/ompi):
OpenMP C infrastructure
Source-to-source compiler + runtime libraries
[Diagram: OpenMP C → OMPi compiler → multithreaded C]
Contributions
Tasking
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Tasking in the OMPi Compiler
#pragma omp task firstprivate(x)
{
<CODE>
}
{
/* task directive replacement */
env = capture_task_environment();
ort_new_task(taskFunc0, env);
}
/* Produced task function */
void taskFunc0(void *env)
{
get_env_data(env);
<CODE>
}
/* Fast Path Optimization */
if(must_execute_serially)
{
declare_local_vars();
ort_immediate_start();
<CODE>
ort_task_immediate_end();
}
else
  { /* otherwise create and enqueue the task, as above */ }
Tasking Runtime Architecture
Task queue (TASK_QUEUE): one per OpenMP thread (worker), a DEQueue
Breadth-first approach; depth-first when the queue is full
Fast-path optimization: lower overheads and better data locality, but less parallelism exploited
Back to breadth-first when the queue is 30% empty
Work-stealing between siblings
Lock-free [Chase & Lev, SPAA 2005]
The worker adds and removes elements at the queue's top
A thief removes elements from the queue's bottom, to avoid contention
Tasking runtime architecture
[Diagram: per-thread memory holds the TASK_QUEUE backed by a fixed array of task descriptors (Td), an overflow list, and a descriptor pool for recycling]
Task descriptor: exec flag, descriptor fields, data environment, actual data
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, "Design and Implementation of OpenMP Tasks in the OMPi Compiler", PCI 2011, 15th Panhellenic Conference on Informatics, Kastoria, Greece, September 2011, pp. 265-269
Modern multicore multiprocessors:
Deep cache hierarchies, private memory channels
Non-Uniform Memory Access (NUMA)
A generic tasking runtime will have sub-optimal behaviour on such machines.
We designed & implemented an optimized tasking runtime for NUMA machines:
Maximizes local operations
Minimizes remote accesses: fewer cache misses, low overheads
The work-stealing system uses an efficient blocking algorithm
Optimizations for NUMA machines
Optimizations for NUMA machines
Td recycling issue for thieves:
A thread steals a task; after the task completes, in which descriptor pool should its Td return?
a) The thief's pool? No contention, but 1-producer/N-thieves?
b) The task's creator's pool? Synchronization issues appear
[Diagram: a thief holding a Td next to a TASK_QUEUE and its descriptor pool]
Optimizations for NUMA machines
[Diagram: the thief places the finished Td in a pending queue in the creator's memory; the creator (worker) later drains its pending queue back into its own descriptor pool]
Less synchronization between threads, fewer cache misses, more local data operations
Work-stealing mechanism
A crucial component of an OpenMP runtime.
TASK_QUEUE is a shared object supporting:
OwnerEnqueue
Enqueues a task in a thread's queue
Executed only by the thread that owns the queue
No need for synchronization
Dequeue
Removes the oldest enqueued task
Executed by any thread in an OpenMP team
Synchronization needed!
A Fast Work-stealing Mechanism
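The two operations can be sketched as follows. This is a conceptual, single-threaded sketch of the TASK_QUEUE interface (names and the fixed-size layout are illustrative, not OMPi's actual lock-free code); locking is elided, and a real Dequeue would synchronize thieves, e.g. in the Chase-Lev style cited above:

```c
#include <stddef.h>

#define QCAP 64

typedef struct {
    void *tasks[QCAP];
    int bottom, top;     /* live tasks occupy indices [bottom, top) */
} task_queue_t;

void tq_init(task_queue_t *q) { q->bottom = q->top = 0; }

/* Executed only by the owning thread: no synchronization needed. */
int owner_enqueue(task_queue_t *q, void *task)
{
    if (q->top - q->bottom >= QCAP)
        return -1;       /* full: caller executes inline (depth-first) */
    q->tasks[q->top % QCAP] = task;
    q->top++;
    return 0;
}

/* Executed by any thread of the team: removes the oldest task.
 * This is the operation that requires synchronization. */
void *dequeue(task_queue_t *q)
{
    if (q->bottom >= q->top)
        return NULL;     /* empty */
    return q->tasks[q->bottom++ % QCAP];
}
```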
Based on the CC-Synch object/algorithm of [Fatourou & Kallimanis, PPoPP 2012].
CC-Synch is used for thread synchronization, to implement the Dequeue operation: one instance of CC-Synch per TASK_QUEUE.
The result of the combining technique:
One thread (the combiner) holds a coarse lock
In addition to applying its own operation, it serves the operations of all other active threads
Greatly reduces synchronization costs
A Fast Work-stealing Mechanism
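The combining idea can be sketched in a few lines. This is a conceptual, single-threaded sketch of the technique, not the actual CC-Synch algorithm (all names and the slot structure are illustrative): threads publish Dequeue requests in announce slots, and the one thread that becomes the combiner serves its own request and every other pending one in a single pass, so N operations pay for one lock acquisition instead of N:

```c
#include <stddef.h>

#define NSLOTS 8

typedef struct {
    int pending;   /* 1 = request published and not yet served */
    int result;    /* filled in by the combiner; -1 = queue empty */
} request_t;

/* The combiner serves all pending requests against a simple queue
 * represented by (queue, *head, tail); returns how many it served. */
int combine(request_t slots[NSLOTS], const int *queue, int *head, int tail)
{
    int served = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (!slots[i].pending)
            continue;
        slots[i].result = (*head < tail) ? queue[(*head)++] : -1;
        slots[i].pending = 0;   /* the waiting thread now returns */
        served++;
    }
    return served;
}
```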
Performance evaluation
NUMA Environment
2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
16GB of main memory
Debian Squeeze on the 2.6.32.5 kernel
GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
Intel icc (version 12.1.0) [-fast -openmp]
Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
OMPi uses GNU gcc as a back-end compiler [-O3]
Default Runtime Settings
Synthetic benchmark
Fine-grain tasks (max workload = 128), 16 threads
1 thread produces tasks; the remaining threads execute them
Task load = for loop with 'max workload' repetitions
Barcelona OpenMP Task Suite
Fibonacci
Computes the nth Fibonacci number
Exploits nested task parallelism, which creates a deep tree
Very large number of fine-grain tasks
Input: the 40th Fibonacci number
Uses the new work-stealing implementation and the fast execution path
No manual cut-off
Barcelona OpenMP Task Suite
NQueens
Calculates solutions of the n-queens chessboard problem
Backtracking search algorithm with pruning creates unbalanced tasks
Exploits nested task parallelism, which creates a deep tree of tasks
Input: 14 queens
No manual cut-off
Spiros N. Agathos, Nikolaos D. Kallimanis, Vassilios V. Dimakopoulos, "Speeding Up OpenMP Tasking", Euro-Par 2012, International European Conference on Parallel and Distributed Computing, Rhodes, Greece, August 2012, pp. 650-661.
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Nested parallelism
A parallel region inside a parallel region
Every thread of a team creates its own team of threads
Difficult to handle efficiently: possible processor oversubscription
Typically, nested loops are parallelized using nested parallelism.
Can we replace a nested parallel loop with tasks?
Introduction
Transforming Loop Code Manually

Original:
#pragma omp parallel num_threads(M)
{
  #pragma omp parallel for schedule(static) num_threads(N)
  for (i=LB; i<UB; i++) {
    <body>
  }
  …………
}

Transformed:
#pragma omp parallel num_threads(M)
{
  for (t=0; t<N; t++)
    #pragma omp task
    {
      calculate(N, LB, UB, &lb, &ub);
      for (i=lb; i<ub; i++)
        <body>
    }
  #pragma omp taskwait
  …………
}

The N implicit tasks of the nested parallel region become N explicit tasks; instead of NxM, only M threads exist in the system.
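The calculate() helper from the transformed code can be sketched as below; this is an assumption of what it computes (the explicit task-id parameter t is added here for illustration): the [lb, ub) range that the t-th of N tasks receives under schedule(static), splitting the UB-LB iterations as evenly as possible, with the first `rem` tasks getting one extra iteration:

```c
/* Compute the half-open iteration range [*lb, *ub) assigned to task t
 * out of N under a static split of the iterations [LB, UB). */
void calculate(int t, int N, int LB, int UB, int *lb, int *ub)
{
    int total = UB - LB;
    int chunk = total / N;     /* base iterations per task */
    int rem   = total % N;     /* first `rem` tasks get one extra */
    *lb = LB + t * chunk + (t < rem ? t : rem);
    *ub = *lb + chunk + (t < rem ? 1 : 0);
}
```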
A similar transformation is possible for:
Dynamic
Guided
But: complicated user code
And: impossible to have access to thread-specific data (example)
Manual transformation limitations
In general:
A mini worksharing runtime must be written: which thread will execute which task?
Impossible to handle thread-specific data
But within an OpenMP runtime system:
All the worksharing functionality is already there
Access to all thread-specific data
→ Automatic transformation in the OMPi compiler
Manual transformation limitations
OMPi’s runtime organization
Each OpenMP thread is associated with an EECB (Execution Entity Control Block) holding all OpenMP thread info:
1) Thread ID
2) Parallel level
3) Pointer to parent EECB
4) ………
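An EECB along these lines can be sketched as a C struct; the field and function names below are illustrative, not OMPi's actual declarations:

```c
#include <stddef.h>

/* Per-thread control block, as described on the slide. */
typedef struct eecb {
    int thread_id;        /* OpenMP thread id within its team */
    int parallel_level;   /* nesting level; 0 for the initial thread */
    struct eecb *parent;  /* EECB of the parent thread (NULL at level 0) */
    /* ... scheduling and worksharing state ... */
} eecb_t;

/* Walking the parent chain recovers ancestor information, in the
 * spirit of omp_get_ancestor_thread_num(). */
int id_at_level(const eecb_t *e, int level)
{
    while (e != NULL && e->parallel_level > level)
        e = e->parent;
    return e != NULL ? e->thread_id : -1;
}
```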
Parallel: creation of parallel tasks
#pragma omp parallel num_threads(4)
[Diagram: the initial parallel task (Thread ID = 0, Level = 0) forks four parallel tasks P0-P3]
4 new threads are created
New EECBs are assigned
They execute the parallel region
Nested parallel: creation of special tasks
#pragma omp parallel for num_threads(4)
[Diagram: parallel task P0 (Thread ID = 0, Level = 1) creates special tasks S0-S3]
Special tasks are stored in the TaskQueue
Sibling threads steal special tasks
They change EECB and execute the ‘parallel region’
Compiler side: no changes
Runtime side:
A new type of task, called pfor_task
Emulation of parallel tasks
The same technique works for nested parallel sections
Auto transformation implementation
NUMA Environment
2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
16GB of main memory
Debian Squeeze on the 2.6.32.5 kernel
GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
Intel icc (version 12.1.0) [-fast -openmp]
Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
OMPi uses GNU gcc as a back-end compiler [-O3]
Original, non-optimized task runtime was used
Default Runtime Settings
Evaluation
Synthetic benchmark results
TASK_LOAD = 500
16 threads in the 1st level
N (2nd-level threads) = 4
Create a parallel region with 16 threads
Each thread creates a nested team to execute the for-loop
Iteration workload = for loop with TASK_LOAD repetitions
Face detection results
Takes an image as input and discovers the number of faces depicted
Utilizes nested parallelism in order to obtain better performance
161 images, CMU test set
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, "Task-based Execution of Nested OpenMP Loops", IWOMP 2012, International Workshop on OpenMP, OpenMP in a Heterogeneous World, Rome, Italy, June 2012, pp. 210-222.
Hybrid policy
Idle cores? Create threads
All cores occupied? Create tasks
Not enough cores? Mix tasks/threads
Face detection results
161 images, CMU test set
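The three cases of the hybrid policy can be sketched as a small decision function; the structure and thresholds below are illustrative, not the thesis's exact heuristic:

```c
/* Given the number of currently idle cores and the number of threads
 * a nested region requests, decide how many real threads to create;
 * the remaining work units are turned into tasks for the team. */
typedef struct { int threads; int tasks; } split_t;

split_t hybrid_split(int idle_cores, int requested)
{
    split_t s;
    if (idle_cores <= 0) {                 /* all cores occupied: tasks   */
        s.threads = 0;
        s.tasks = requested;
    } else if (idle_cores >= requested) {  /* enough idle cores: threads  */
        s.threads = requested;
        s.tasks = 0;
    } else {                               /* not enough cores: mix both  */
        s.threads = idle_cores;
        s.tasks = requested - idle_cores;
    }
    return s;
}
```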
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Can OpenMP form the basis of a programming model for multicore embedded accelerators?
The answer is positive, but not straightforward, since:
OpenMP is designed for homogeneous shared-memory systems
Embedded systems:
– Include groups of weak PEs (processing elements)
– Have limited resources
OpenMP for accelerators?
OpenMP for accelerators?
The design and implementation of a programming model for the STHORM architecture (Artemis EU Project No. 100230, SMECY)
Offers decent parallelization without significant programming effort
Execution model
An accelerator is usually a back-end system attached to a host.
Where and how should the OpenMP programming model be applied?
1. On the host side: multiple OpenMP host threads generate multiple jobs on the accelerator
2. On the accelerator side: each host thread can trigger multiple OpenMP threads on the accelerator
3. Our proposal: on both sides!
A general solution: flexible, programmer-friendly
[Diagram: multicore ARM host attached to the accelerator]
Supporting OpenMP
OpenMP on the ARM host:
Minimal changes
New module for communicating with STHORM
OpenMP on STHORM:
Very difficult to implement, due to the limited hardware resources
Full, albeit non-optimized, implementation
Compilation chain
[Diagram: OpenMP C source containing code for the host & the accelerator is split into separate OpenMP C parts, each compiled down to multithreaded C]
EECB management
One Execution Entity Control Block (EECB) per thread:
Assigned to a thread when it starts execution
Freed when the team is disbanded
Problem: placement of EECBs, which are constantly accessed during execution
Solution: scratchpad memory, for guaranteed performance
OK for 1 level of parallelism; infeasible for multiple levels
Our proposal:
Use the TCDM only for the 16 active threads
Use L2 memory for nested teams
Keep all active EECBs in the TCDM
[Diagram: the EECBs of the 16 active threads (IDs 0-15) live in the TCDM; nested-team EECBs live in L2 memory, managed through an empty list]
Parallel regions
An OpenMP parallel region is a group of jobs executed by different PEs.
The PE that meets a parallel region (the master PE):
Suspends the execution of its current job
Sends a request to the CC
Allocates a new EECB
Executes its implicit task
Waits for the CC to notify the end of the parallel region
Returns to its old EECB
The other PEs:
Receive the request from the CC
Acquire an EECB
Start executing their implicit job
Notify the CC when the job completes
Release the EECB
The CC:
Supplies PEs with implicit jobs
Receives end notifications from PEs
Informs the master PE that the parallel region is finished
[Diagram: a cluster of PEs (EnCore) with the TCDM and the CC; M marks the master PE]
Experimental results
Calculation of the Mandelbrot set for an image of 362x208 pixels (the image fits in the TCDM)
We parallelized the computation of the image pixel values
Results for different scheduling policies of the OpenMP parallel for
Spiros N. Agathos, Vasileios V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis, "Deploying OpenMP on an Embedded Multicore Accelerator", SAMOS XIII, International Conference on Embedded Computer Systems: Architectures, Samos, Greece, July 2013, pp. 180-187
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
void main()
{
int input[10], result[10];
init(input);
#pragma omp target map(to:input) map(from:result)
{
int i;
#pragma omp parallel for
for(i=0; i<10; i++)
result[i] = input[i]++;
}
print(result);
}
OpenMP target example
Color legend: executed on host / communication / executed on device
void main()
{
  int input[10], result[10];
  init(input);
  #pragma omp target data map(to:input) map(from:result)
  {
    #pragma omp target
    {
      <Kernel 01 code, accesses input, result>
    }
    <Host code, prepare for new kernel>
    #pragma omp target
    {
      <Kernel 02 code, accesses input, result>
    }
  }
}
OpenMP target example
Color legend: executed on host / communication / executed on device
• $99 board (basic version)
• Epiphany-16 (25 GFLOPS at less than 2 Watts)
• Zynq host running Linux
• Epiphany-16: no OS
• eSDK tools for native programming
Parallella board
Each kernel (i.e. target region) is outlined to a separate function.
The code generation phase produces multiple output files, one for each different kernel, plus the host code (the host may be called to execute any of them).
Compiling for the new device directives
A full-fledged OpenMP runtime library supports execution on the dual-core ARM processor.
Additional functionality is required for controlling and accessing the Epiphany device:
Communication between the host and the eCores takes place through the shared-memory portion of the system RAM
For offloading a kernel, the first idle eCore is chosen, and the precompiled object file is loaded to it for immediate execution
eCores inform the host about the completion of a kernel through special flags in shared memory
Multiple host threads can offload multiple independent kernels concurrently onto the Epiphany
Runtime architecture - what the host does
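The core selection and completion-flag protocol just described can be sketched as follows; plain memory stands in for the Epiphany shared segment here, and all names and flag values are illustrative:

```c
#define NCORES 16

/* Each eCore owns a flag word in the shared-memory control area. */
enum { CORE_IDLE = 0, CORE_BUSY = 1, CORE_DONE = 2 };

typedef struct { volatile int flag[NCORES]; } ctrl_area_t;

/* Host side: pick the first idle eCore for the next kernel. */
int first_idle_core(const ctrl_area_t *c)
{
    for (int i = 0; i < NCORES; i++)
        if (c->flag[i] == CORE_IDLE)
            return i;
    return -1;   /* all cores busy: the host must wait */
}

/* Host side: mark the core busy before loading its kernel .elf. */
void host_offload(ctrl_area_t *c, int core) { c->flag[core] = CORE_BUSY; }

/* eCore side: signal kernel completion through shared memory. */
void ecore_finish(ctrl_area_t *c, int core) { c->flag[core] = CORE_DONE; }
```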
Runtime architecture - what the host does
int X[10], Y[10];
int k;
#pragma omp target data map(X,Y)
{
#pragma omp target map(to:k)
{
/* Kernel code */
}
}
[Diagram: the 32 MiB shared-memory segment holds a 4 KiB device control area and the target data variables X and Y, plus the mapped k; the kernel's data environment contains pointers to X and Y]
Supporting OpenMP within the Epiphany is non-trivial:
eCores do not execute any OS; no provision for dynamic parallelism
The 32 KiB local memory is quite limited: unable to hold sophisticated OpenMP runtime structures
The runtime infrastructure originally designed for the host was trimmed down to a minimum (e.g. tasking = shared queue protected by a lock); it is linked and offloaded with each kernel
The coordination among the participating eCores utilizes the local memory of the team’s master eCore
OpenMP within the Epiphany
#parallel inside a kernel
[Timeline diagram: the Zynq host offloads the kernel to a master eCore through shared memory; the master runs sequential code, requests workers, and initializes & notifies them through on-chip memory; the workers start worker threads, reply to the master core, run the parallel code, and join the team; when the parallel region ends the master does the bookkeeping, continues with sequential code, and acknowledges kernel end to the host while the workers sit idle]
Environment
Parallella-16 SKUA101020
Ubuntu 14.04, kernel 3.12.0 armv7l
gcc and e-gcc v.4.8.2 as back-end
for OMPi
eSDK 5.13.9.10
Experimental results
Experimental results
Overhead results of the EPCC benchmarks
A modified version of the EPCC benchmarks: the basic routines are offloaded through target directives
Measurements taken from the host side, after subtracting any offloading costs
Resetting an eCore: 0.1 sec
Experimental results
Frames per second for the Mandelbrot deep-zoom application (1024x768)
eSDK version: only 8%-13% better
Original code: 301 lines (3 files); OpenMP code: 198 lines (1 file)
Spiros N. Agathos, Alexandros Papadogiannakis, Vassilios V. Dimakopoulos, "Targeting the Parallella", Euro-Par 2015, International European Conference on Parallel and Distributed Computing, Vienna, Austria, August 2015, pp. 662-674
Alexandros Papadogiannakis, Spiros N. Agathos, Vassilios V. Dimakopoulos, "OpenMP 4.0 Device Support in the OMPi Compiler", IWOMP 2015, International Workshop on OpenMP, Heterogeneous Execution and Data Movements, Aachen, Germany, October 2015, pp. 202-216
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
A kernel may contain any OpenMP directive:
Great flexibility, parallelization expressiveness
But this requires runtime support within the co-processor, and implementing such a runtime is a non-trivial task: GPGPUs, accelerators etc. have special functionality characteristics and/or limited resources
Common solutions:
Provide limited (and sometimes no) support for some of the directives
Design sub-optimal runtime system implementations, depending on the capabilities of a given device
Motivation

Scenario                OMPi executable size (bytes)
Empty kernel            7092
Create a parallel team  10560
A novel runtime organization designed to work with an OpenMP infrastructure.
Instead of a single monolithic runtime system, an adaptive runtime architecture implements only the OpenMP features required by a particular application.
Example: an OpenMP kernel with no explicit tasking
No tasking subsystem required
A barrier with no (time-consuming) tasking extensions
The library includes only the needed functionality → reduced-size executable
Desirable in systems with minimal local memories
Our proposal
The compiler:
Analyzes the kernels
Provides metrics
Selects a particular runtime system configuration to accompany the kernel
The user's code thus implies the choice of an optimized runtime system:
Reduced executable sizes
Faster execution times
Compiler Assisted Runtime Support (CARS)
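The selection step can be sketched as follows; the analysis result is modeled as a bitmask of required OpenMP features, from which the mapper derives the runtime modules to link. All flag and module names here are illustrative, not OMPi's actual configuration:

```c
/* Feature flags a compiler analysis might report for a kernel. */
enum {
    M_TEAMS     = 1 << 0,  /* team management + plain barrier */
    M_TASKING   = 1 << 1,  /* task queues + tasking barrier   */
    M_WORKSHARE = 1 << 2,  /* for/sections/single support     */
    M_CRITICAL  = 1 << 3,  /* named critical / lock support   */
};

/* Map detected features to the module set to link into the
 * kernel-specific library; anything not selected stays out of
 * the executable, shrinking its footprint. */
unsigned select_modules(unsigned features)
{
    unsigned mods = 0;
    if (features & M_TEAMS)     mods |= M_TEAMS;
    if (features & M_WORKSHARE) mods |= M_WORKSHARE | M_TEAMS;
    if (features & M_CRITICAL)  mods |= M_CRITICAL;
    /* explicit tasking also pulls in team support, since the
     * barrier then needs its tasking extensions */
    if (features & M_TASKING)   mods |= M_TASKING | M_TEAMS;
    return mods;
}
```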
CARS Overview
[Diagram: foo.c (user code, C + OpenMP) enters the compiler (transformation & analysis), which emits the kernel code plus analysis metrics; a mapper combines the metrics with the runtime alternatives to build a kernel-specific library; the system compiler/linker then produces the kernel executable and the host code executable]
OpenMP in kernel: whether the kernel code includes any OpenMP directives or not.
Dynamic parallelism: the exact number of threads used in all parallel teams, their nesting level, as well as the presence of the reduction clause along with its parameters.
Work-sharing regions: the types of worksharing regions used in the kernel, i.e. whether for, single or sections regions are present. In the case of for regions, the exact types and parameters of the schedules, that is whether static, dynamic or guided is used, and the chunk size; in addition, the presence of the ordered clause. Finally, whether the program takes advantage of the nowait feature of worksharing.
Explicit tasking: the presence of user-defined tasks, taskgroups and the possible dependencies, if any.
Special synchronization: the presence of the atomic construct and the type of the required functionality; the number and names of the critical regions in the kernel; the number and characteristics (normal or nested) of user-defined locks.
Metrics
Environment
Parallella-16 SKUA101020
Ubuntu 14.04, kernel 3.12.0 armv7l
gcc and e-gcc v.4.8.2 as back-end
for OMPi
eSDK 5.13.9.10
Experimental results
Experimental Results
.elf sizes in bytes
Scenario          Full RTS   CARS    Difference
Mandelbrot        13156      9620    26.88%
Empty Kernel      8228       2252    73.63%
Pi Calculation    11972      8864    25.96%
Nqueens (tasks)   20908      19704   5.76%
EPCC-for-static   14176      10944   22.80%
EPCC-critical     12560      9320    35.80%
EPCC-single       12200      8900    27.05%
EPCC-ordered      14192      10952   22.83%
Experimental Results
Execution times (seconds for the applications, microseconds for the EPCC overheads)
Spiros N. Agathos, Vasileios V. Dimakopoulos, “Compiler-Assisted OpenMP Runtime
Organization for Embedded Multicores”, Technical Report, Number 2016-01, University of
Ioannina, Department of Computer Science & Engineering, April 2016
Scenario Full RTS CARS Difference
Mandelbrot 30.05(sec) 30.00(sec) 0.16%
Empty Kernel 0.10(sec) 0.10(sec) 0%
Pi Calculation 0.28(sec) 0.26(sec) 7.14%
Nqueens (tasks) 1.81(sec) 1.81(sec) 0%
EPCC-for-static 72.65 19.85 72.68%
EPCC-critical 2.17 1.55 39.98%
EPCC-single 83.72 14.92 28.57%
EPCC-ordered 4.70 4.66 0.85%
Conclusion
OpenMP is an easy-to-use programming model:
More powerful due to the addition of tasking facilities
Generally applicable due to the device constructs
Contributions related to tasks
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Contributions related to OpenMP for devices
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Optimize the Epiphany runtime to better exploit the hardware characteristics
Support the OpenMP 4 device constructs for various devices, e.g. GPGPUs
OpenMP extensions:
Data-block transfers
Resident kernels
Fine-grained synchronization between host and device
Future Work