DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIVERSITY OF IOANNINA
GREECE
Efficient OpenMP Runtime Support for General-Purpose and Embedded Multi-core Platforms
Spiros N. Agathos
Parallel Processing Group
UNIVERSITY OF IOANNINA
Presentation overview
 Multi-core systems and their programming
 Summary of contributions
 Tasking support
 General design & optimization for NUMA machines
 Transform nested parallel loops to tasks
 Runtime support for embedded systems
 STHORM multi-core architecture
 Epiphany accelerator
 Application-driven runtime support
Multi-core systems
 Personal computers: multiple compute cores in a socket
 Market trends (CPU + coprocessor)
 Home appliances now include embedded systems
 HPC supercomputers  multicore CPUs / GPGPUs / DSPs / FPGAs
 Tianhe-2 (China): 16,000 nodes x {2 Intel Xeon + 3 Xeon Phi}, ranked 1st (TOP500, November 2015)
More cores = more power?
 Easy transition from serial to parallel programming?
 Efficient programming
 Code portability
 Exploiting heterogeneous systems
 Low-level vs. high-level programming languages
OpenMP
 Directive-based model for parallel programming
 Base language (C/C++/Fortran)
 Incremental parallelization
 Highly successful, widely used, supported by many compilers
 Fork-join execution model
 First version in 1997, now v4.5 (November 2015)
 Parallel loops, sections, synchronization (barrier/critical/atomic)
[Figure: fork-join model — the initial thread executes the preamble, forks a team of work threads, and continues after the join]
OpenMP tasking model
 OpenMP supports task-based parallelism since version 3.0 (May 2008)
 With tasking, the expressiveness of OpenMP is enriched with flexible management of irregular and recursive parallelism
 An OpenMP task is a unit of work scheduled for asynchronous execution:
 Has its own data environment
 Can share data with other tasks
OpenMP task example

int fib(int n)
{
    int n1, n2;
    if (n <= 1) return n;
    #pragma omp task shared(n1)
    n1 = fib(n-2);
    #pragma omp task shared(n2)
    n2 = fib(n-1);
    #pragma omp taskwait
    return n1 + n2;
}

int main()
{
    int res;
    #pragma omp parallel
    {
        #pragma omp single
        res = fib(40);
    }
    return 0;
}
OpenMP device model
 Device support added in version 4.0 (July 2013)
 An application can be executed by a set of devices: the host device and other target devices
 Host-centric execution model:
 Create a data environment on the device
 Instruct the device to execute code regions (kernels)
OpenMP target example

int main()
{
    int input[10], result[10];
    init(input);
    #pragma omp target map(to:input) map(from:result)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < 10; i++)
            result[i] = input[i] + 1;
    }
    print(result);
    return 0;
}

(Code outside the target region executes on the host; the map clauses define the host-device communication; the region body executes on the device.)
The OMPi compiler
 OMPi (http://paragroup.cs.uoi.gr/wpsite/software/ompi):
 OpenMP C infrastructure
 Source-to-source compiler + runtime libraries: OpenMP C in, multithreaded C out
Contributions
 Tasking
 General design & optimization for NUMA machines
 Transform nested parallel loops to tasks
 Runtime support for embedded systems
 STHORM multi-core architecture
 Epiphany accelerator
 Application-driven runtime support
Tasking in the OMPi compiler

#pragma omp task firstprivate(x)
{
    <CODE>
}

is replaced by:

{
    /* task directive replacement */
    env = capture_task_environment();
    ort_new_task(taskFunc0, env);
}

/* Produced task function */
void taskFunc0(void *env)
{
    get_env_data(env);
    <CODE>
}

/* Fast-path optimization */
if (must_execute_serially)
{
    declare_local_vars();
    ort_immediate_start();
    <CODE>
    ort_task_immediate_end();
}
else
{
    env = capture_task_environment();
    ort_new_task(taskFunc0, env);
}
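The environment capture above amounts to a struct copy. The sketch below illustrates the idea for `firstprivate(x)`; the names (`taskenv0`, `capture_task_environment`) mirror the slide's pseudocode, but the layout and signatures are hypothetical, not OMPi's actual generated code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical data environment emitted for
   `#pragma omp task firstprivate(x)`: one field per captured
   variable, copied in at task-creation time. */
struct taskenv0 {
    int x;   /* firstprivate: captured by value when the task is created */
};

static struct taskenv0 *capture_task_environment(int x)
{
    struct taskenv0 *env = malloc(sizeof *env);
    env->x = x;          /* snapshot taken at creation, not at execution */
    return env;
}

/* The produced task function unpacks the environment
   (get_env_data() in the slide's pseudocode). */
static int taskFunc0(void *arg)
{
    struct taskenv0 *env = arg;
    int x = env->x;
    free(env);
    return x * 2;        /* stands in for <CODE> */
}
```

Because the snapshot is taken at creation time, later changes to the original variable do not affect the task.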
Tasking runtime architecture
Task queue (TASK_QUEUE):
 Per OpenMP thread (worker)
 Double-ended queue (deque)
 Breadth-first approach; depth-first when full ("fast path" optimization)
 Lower overheads, better data locality, but less parallelism exploited
 Back to breadth-first when the queue is 30% empty
Work-stealing between siblings:
 Lock-free [Chase & Lev, SPAA 2005]
 Worker  adds and removes elements at the queue's top
 Thief  removes elements from the queue's bottom, to avoid contention
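A minimal sketch of the deque discipline described above, with a hypothetical bounded-array layout: the owner pushes and pops at the top (LIFO, depth-first locality), thieves steal from the bottom (FIFO, oldest task). The real queue is the lock-free Chase-Lev deque; concurrent owner/thief access needs the synchronization that algorithm provides, which this single-threaded sketch omits.

```c
#include <assert.h>
#include <stddef.h>

#define QSIZE 64

typedef struct {
    void *tasks[QSIZE];
    int bottom;   /* next slot a thief steals from (oldest task) */
    int top;      /* next free slot for the owner (newest task) */
} task_queue_t;

static void q_init(task_queue_t *q) { q->bottom = q->top = 0; }

/* Owner side: push/pop at the top. */
static int owner_push(task_queue_t *q, void *t) {
    if (q->top - q->bottom == QSIZE) return 0;  /* full: caller runs task inline */
    q->tasks[q->top++ % QSIZE] = t;
    return 1;
}
static void *owner_pop(task_queue_t *q) {
    if (q->top == q->bottom) return NULL;
    return q->tasks[--q->top % QSIZE];
}

/* Thief side: steal the oldest task from the bottom,
   away from the owner's end, to reduce contention. */
static void *thief_steal(task_queue_t *q) {
    if (q->top == q->bottom) return NULL;
    return q->tasks[q->bottom++ % QSIZE];
}
```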
Tasking runtime architecture
[Figure: each thread keeps a descriptor pool — an array plus an overflow list of task descriptors (Td) — feeding its TASK_QUEUE; a task descriptor holds the descriptor fields and the task's data environment with the actual data]

Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, “Design and Implementation of OpenMP Tasks in the OMPi Compiler”, PCI 2011, 15th Panhellenic Conference on Informatics, Kastoria, Greece, September 2011, pages 265-269.
Optimizations for NUMA machines
 Modern multicore multiprocessors:
 Deep cache hierarchies, private memory channels
 Non-Uniform Memory Access (NUMA)
 A general tasking runtime will have sub-optimal behaviour
 We designed & implemented an optimized tasking runtime for NUMA machines:
 Maximizes local operations
 Minimizes remote accesses; fewer cache misses, low overheads
 The work-stealing system uses an efficient blocking algorithm
Optimizations for NUMA machines
Task-descriptor (Td) recycling after a steal:
 A thread steals a task and completes it
 To which descriptor pool should the Td return?
 a) The thief's?  No contention, but this turns the pool into a 1-producer/N-thieves structure
 b) The task creator's?  Synchronization issues appear
Optimizations for NUMA machines
[Figure: the thief returns stolen descriptors to a pending queue in the creator's memory, which the worker later drains back into its own descriptor pool]
 Less synchronization between threads  fewer cache misses
 More local data operations
A fast work-stealing mechanism
 The work-stealing mechanism is a crucial component of an OpenMP runtime
 TASK_QUEUE is a shared object supporting:
 OwnerEnqueue
 Enqueues a task in a thread's queue
 Executed only by the thread that owns the queue
 No need for synchronization
 Dequeue
 Removes the oldest enqueued task
 Executed by any thread in an OpenMP team
 Synchronization needed!
A fast work-stealing mechanism
 Based on the CC-Synch object/algorithm of [Fatourou & Kallimanis, PPoPP 2012]
 CC-Synch is used for thread synchronization, to implement the Dequeue operation
 One instance of CC-Synch per TASK_QUEUE
 Result of the combining technique:
 One thread (the combiner) holds a coarse lock
 In addition to applying its own operation, it serves the operations of all other active threads
 Greatly reduces synchronization costs
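The combining idea can be sketched as follows: every thread announces its operation, and whichever thread gets the lock applies all announced operations in one pass. This is a simplified flat-combining counter with one announcement slot per thread; CC-Synch proper uses a list of announcement records and a different handoff protocol, so the layout here is illustrative only.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NTHREADS 4
#define OPS 1000

/* One announcement slot per thread (illustrative layout). */
typedef struct {
    atomic_int pending;   /* 1 = operation announced, not yet applied */
    atomic_int done;      /* set by the combiner once applied */
} slot_t;

static slot_t slots[NTHREADS];
static pthread_mutex_t combiner_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;   /* the object all threads operate on */

/* Announce an increment; whoever holds the lock serves everybody. */
static void combined_increment(int tid)
{
    atomic_store(&slots[tid].done, 0);
    atomic_store(&slots[tid].pending, 1);
    for (;;) {
        if (pthread_mutex_trylock(&combiner_lock) == 0) {
            /* We are the combiner: apply every announced operation. */
            for (int i = 0; i < NTHREADS; i++)
                if (atomic_exchange(&slots[i].pending, 0)) {
                    shared_counter++;            /* serve thread i's op */
                    atomic_store(&slots[i].done, 1);
                }
            pthread_mutex_unlock(&combiner_lock);
        }
        if (atomic_load(&slots[tid].done))
            return;   /* our operation was applied, by us or another */
    }
}

static void *worker(void *arg)
{
    int tid = (int)(intptr_t)arg;
    for (int i = 0; i < OPS; i++)
        combined_increment(tid);
    return NULL;
}
```

Note the design point the slide makes: only the combiner touches the shared object, so the cache line holding it ping-pongs far less than with a plain lock per operation.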
Performance evaluation
NUMA Environment
2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
16GB of main memory
Debian Squeeze on the 2.6.32.5 kernel
GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
Intel icc (version 12.1.0) [-fast -openmp]
Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
OMPi uses GNU gcc as a back-end compiler [-O3]
Default Runtime Settings
Synthetic benchmark
 Fine-grain tasks (max workload = 128), 16 threads
 1 thread produces tasks; the rest execute them
 Task load = a for loop with 'max workload' repetitions
Barcelona OpenMP Task Suite
Fibonacci:
 Computes the nth Fibonacci number
 Exploits nested task parallelization, which creates a deep tree
 Very large number of fine-grain tasks
 Input: 40th Fibonacci number; no manual cut-off
 Uses the new work-stealing implementation and the fast execution path
Barcelona OpenMP Task Suite
NQueens:
 Calculates solutions of the n-queens chessboard problem
 Backtracking search algorithm with pruning creates unbalanced tasks
 Exploits nested task parallelization, which creates a deep tree of tasks
 Input: 14 queens; no manual cut-off

Spiros N. Agathos, Nikolaos D. Kallimanis, Vassilios V. Dimakopoulos, “Speeding Up OpenMP Tasking”, Euro-Par 2012, International European Conference on Parallel and Distributed Computing, Rhodes, Greece, August 2012, pages 650-661.
Nested parallelism
 A parallel region inside a parallel region
 Every thread of a team creates its own team of threads
 Difficult to handle efficiently: possible processor oversubscription
 Typically, nested loops are parallelized using nested parallelism
 Can we replace a nested loop with tasks?
Transforming loop code manually

#pragma omp parallel num_threads(M)
{
    #pragma omp parallel for schedule(static) num_threads(N)
    for (i = LB; i < UB; i++) {
        <body>
    }
    …………
}

becomes:

#pragma omp parallel num_threads(M)
{
    for (t = 0; t < N; t++)
        #pragma omp task
        {
            calculate(N, LB, UB, &lb, &ub);
            for (i = lb; i < ub; i++)
                <body>
        }
    #pragma omp taskwait
    …………
}

 The work of a nested team of N threads is transformed into N tasks
 Instead of NxM, only M threads exist in the system
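The `calculate()` helper above is elided on the slide. A sketch of what it might compute for `schedule(static)` follows, with the chunk index `t` made an explicit parameter (a hypothetical signature, not OMPi's actual API): split the `[LB, UB)` iteration space into N near-equal contiguous chunks, giving the first `rem` chunks one extra iteration.

```c
/* Hypothetical stand-in for the slide's calculate():
   compute the [*lb, *ub) iteration range of chunk t out of N,
   as schedule(static) would split [LB, UB) among N threads. */
static void calculate(int t, int N, int LB, int UB, int *lb, int *ub)
{
    int total = UB - LB;
    int chunk = total / N;
    int rem   = total % N;   /* first `rem` chunks get one extra iteration */
    *lb = LB + t * chunk + (t < rem ? t : rem);
    *ub = *lb + chunk + (t < rem ? 1 : 0);
}
```

For example, 10 iterations over 4 chunks yields ranges of sizes 3, 3, 2, 2 that exactly cover the iteration space.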
Manual transformation limitations
 A similar transformation is possible for the dynamic and guided schedules
 Complicated user code
 Impossible to access thread-specific data
Manual transformation limitations
In general:
 A mini worksharing runtime must be written
 Which thread will execute which task?
 Impossible to handle thread-specific data
But within an OpenMP runtime system:
 All the worksharing functionality is already there
 Access to all thread-specific data
 Automatic transformation in the OMPi compiler
OMPi’s runtime organization
Each OpenMP thread is associated with an EECB (Execution Entity Control Block) holding all OpenMP thread info:
 1) Thread ID
 2) Parallel level
 3) Pointer to parent EECB
 4) ………
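The fields listed above can be sketched as a plain struct; this is a minimal illustration based only on the fields the slide names (the real OMPi EECB holds many more members).

```c
#include <stddef.h>

/* Minimal EECB sketch: only the fields named on the slide. */
typedef struct eecb {
    int thread_id;         /* ID within the current team */
    int level;             /* nesting level of the parallel region */
    struct eecb *parent;   /* EECB of the parent thread */
} eecb_t;
```

Following the parent pointers walks up through the nesting levels back to the initial thread (level 0, ID 0).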
Parallel: creation of parallel tasks
#pragma omp parallel num_threads(4)
[Figure: the initial parallel task (thread ID 0, level 0) spawns parallel tasks P0-P3]
 4 new threads are created
 New EECBs are assigned
 They execute the parallel region
Nested parallel: creation of special tasks
#pragma omp parallel for num_threads(4)
[Figure: a parallel task P0 (thread ID 0, level 1) creates special tasks S0-S3]
 Special tasks are stored in the TaskQueue
 Sibling threads steal special tasks
 A stealing thread changes its EECB and executes the 'parallel region'
Auto transformation implementation
 Compiler side: no changes
 Runtime: a new type of task, called pfor_task  emulation of parallel tasks
 The same technique works for nested parallel sections
Evaluation
NUMA Environment:
 2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
 16GB of main memory
 Debian Squeeze on the 2.6.32.5 kernel
 GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
 Intel icc (version 12.1.0) [-fast -openmp]
 Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
 OMPi uses GNU gcc as a back-end compiler [-O3]
 The original, non-optimized task runtime was used
 Default runtime settings
Synthetic benchmark results
 TASK_LOAD = 500; 16 threads in the 1st level; N (level-2 threads) = 4
 Create a parallel region with 16 threads
 Each thread creates a nested team to execute the for-loop
 Iteration workload = a for loop with TASK_LOAD repetitions
Face detection results
 Takes an image as input and discovers the number of faces depicted
 Utilizes nested parallelism to obtain better performance
 161 images, CMU test set

Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, “Task-based Execution of Nested OpenMP Loops”, IWOMP 2012, International Workshop on OpenMP: OpenMP in a Heterogeneous World, Rome, Italy, June 2012, pages 210-222.
Face detection results
 Hybrid policy:
 Idle cores?  Create threads
 All cores occupied?  Create tasks
 Not enough cores?  Mix tasks/threads
 161 images, CMU test set
OpenMP for accelerators?
Can OpenMP form the basis for a programming model for multicore embedded accelerators?
 The answer is positive, but not straightforward, since:
 OpenMP is designed for homogeneous shared-memory systems
 Embedded systems:
– include groups of weak PEs (processing elements)
– have limited resources
OpenMP for accelerators?
 The design and implementation of a programming model for the STHORM architecture (ARTEMIS EU Project No. 100230, SMECY)
 Offer decent parallelization without significant programming effort
STHORM architecture
[Figure: the STHORM many-core — clusters of EnCore processing elements, each cluster with a cluster controller (CC) and a shared TCDM scratchpad, attached to a multicore ARM host]
Execution model
 An accelerator is usually a back-end system attached to a host (here, a multicore ARM)
 Where and how is the OpenMP programming model to be applied?
 1. On the host side: multiple OpenMP host threads generate multiple jobs on the accelerator
 2. On the accelerator side: each host thread can trigger multiple OpenMP threads on the accelerator
 3. Our proposal: on both sides!
 General solution
 Flexible, programmer-friendly
Supporting OpenMP
 OpenMP on ARM:
 Minimal changes
 New module for communicating with STHORM
 OpenMP on STHORM:
 Very difficult to implement: limited hardware resources
 Full, albeit non-optimized, implementation
Compilation chain
[Figure: OpenMP C source containing code for both the host and the accelerator is split into host and accelerator OpenMP C parts, each compiled to multithreaded C]
EECB management
 One Execution Entity Control Block (EECB) per thread:
 Assigned to a thread when it starts execution
 Freed when the team is disbanded
 Problem: placement of EECBs
 EECBs are constantly accessed during execution
 Solution: scratchpad memory  guaranteed performance
 OK for 1 level of parallelism; infeasible for multiple levels
 Our proposal:
 Use the TCDM only for the 16 active threads; keep all active EECBs in TCDM
 Use L2 memory for nested teams
[Figure: active EECBs (ID 0 at level 0, IDs 0-15 at level 1, nested ones at level 2) live in the TCDM; an empty list of spare EECBs lives in L2]
Parallel regions
 An OpenMP parallel region is a group of jobs executed by different PEs
 The PE that meets a parallel region (the master PE):
 Suspends the execution of its current job
 Sends a request to the cluster controller (CC)
 Allocates a new EECB
 Executes its implicit task
 Waits for the CC to notify the end of the parallel region
 Returns to its old EECB
 The other PEs:
 Receive the request from the CC
 Acquire an EECB
 Start executing their implicit job
 When the job is done, notify the CC
 Release the EECB
 The CC:
 Supplies PEs with implicit jobs
 Receives end notifications from PEs
 Informs the master PE that the parallel region is finished
Experimental results
 Calculation of the Mandelbrot set for an image of 362x208 pixels (the image fits in the TCDM)
 We parallelized the computation of the image pixel values
 Results for different scheduling policies of the OpenMP parallel for

Spiros N. Agathos, Vassilios V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis, “Deploying OpenMP on an Embedded Multicore Accelerator”, SAMOS XIII, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Samos, Greece, July 2013, pages 180-187.
OpenMP target example

int main()
{
    int input[10], result[10];
    init(input);
    #pragma omp target data map(to:input) map(from:result)
    {
        #pragma omp target
        {
            <Kernel 01 code, accesses input, output>
        }
        <Host code, prepare for new kernel>
        #pragma omp target
        {
            <Kernel 02 code, accesses input, output>
        }
    }
}

(The target data region keeps input and result on the device across both kernels, avoiding repeated host-device communication.)
Parallella board
• $99 board (basic version)
• Epiphany-16 (25 GFLOPS, less than 2 Watts)
• Zynq (dual ARM)  Linux OS; Epiphany-16  no OS
• eSDK tools for native programming
Compiling for the new device directives
 Each kernel (i.e. target region) is outlined to a separate function
 The code generation phase produces multiple output files, one for each different kernel, plus the host code (the host may be called to execute any of them)
Runtime architecture  what the host does
 A full-fledged OpenMP runtime library supporting execution on the dual-ARM processor
 Additional functionality required for controlling and accessing the Epiphany device
 Host-eCore communication takes place through the shared-memory portion of the system RAM
 For offloading a kernel:
 The first idle eCore is chosen
 The precompiled object file is loaded to it for immediate execution
 eCores inform the host about the completion of a kernel through special flags in shared memory
 Multiple host threads can offload multiple independent kernels concurrently onto the Epiphany
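The "first idle eCore" selection and the completion flags can be sketched as below. The control-block layout (`busy`, `kernel_done` fields) is hypothetical; the actual device control data in shared memory is organized differently.

```c
#define NCORES 16

/* Hypothetical per-eCore control flags kept in the shared-memory
   segment; `volatile` because the eCore writes them concurrently. */
typedef struct {
    volatile int busy;         /* set by the host when a kernel is loaded */
    volatile int kernel_done;  /* set by the eCore when the kernel finishes */
} ecore_ctrl_t;

/* Pick the first idle eCore and mark it busy; -1 if all are occupied
   (the caller would then queue or retry). */
static int choose_idle_ecore(ecore_ctrl_t ctrl[NCORES])
{
    for (int i = 0; i < NCORES; i++) {
        if (!ctrl[i].busy) {
            ctrl[i].busy = 1;
            ctrl[i].kernel_done = 0;   /* armed for the next completion */
            return i;
        }
    }
    return -1;
}
```

After offloading, the host polls `kernel_done` (or checks it when the corresponding host thread needs the results) and clears `busy` to recycle the eCore.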
Runtime architecture  what the host does

int X[10], Y[10];
int k;
#pragma omp target data map(X,Y)
{
    #pragma omp target map(to:k)
    {
        /* Kernel code */
    }
}

[Figure: the 32 MiB shared memory holds a 4 KiB device-control area, the target data variables X and Y, and the kernel's data environment — k plus pointers to X and Y]
OpenMP within the Epiphany
 Supporting OpenMP within the Epiphany is nontrivial:
 eCores do not execute any OS; no provision for dynamic parallelism
 The 32 KiB local memory is quite limited: unable to hold sophisticated OpenMP runtime structures
 The runtime infrastructure originally designed for the host was trimmed down to a minimum
 e.g. tasking = a shared queue protected by a lock
 This is linked and offloaded with each kernel
 The coordination among the participating eCores utilizes the local memory of the team's master eCore
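The "shared queue protected by a lock" can be sketched as a small circular buffer guarded by a spinlock, in line with the no-OS constraint. The layout and names are hypothetical; the real structure lives in the master eCore's local memory and would use the Epiphany's hardware mutex support rather than C11 atomics.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAXTASKS 16   /* small fixed capacity: 32 KiB local memory */

typedef struct {
    atomic_flag lock;        /* spinlock: no OS, so no blocking mutex */
    void *tasks[MAXTASKS];
    int head, tail, count;
} dev_task_queue_t;

static void dq_init(dev_task_queue_t *q) {
    atomic_flag_clear(&q->lock);
    q->head = q->tail = q->count = 0;
}

static int dq_push(dev_task_queue_t *q, void *t) {
    int ok = 0;
    while (atomic_flag_test_and_set(&q->lock)) ;   /* spin */
    if (q->count < MAXTASKS) {
        q->tasks[q->tail] = t;
        q->tail = (q->tail + 1) % MAXTASKS;
        q->count++;
        ok = 1;   /* 0 would mean: queue full, execute the task inline */
    }
    atomic_flag_clear(&q->lock);
    return ok;
}

static void *dq_pop(dev_task_queue_t *q) {
    void *t = NULL;
    while (atomic_flag_test_and_set(&q->lock)) ;
    if (q->count > 0) {
        t = q->tasks[q->head];
        q->head = (q->head + 1) % MAXTASKS;
        q->count--;
    }
    atomic_flag_clear(&q->lock);
    return t;
}
```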
#parallel inside a kernel
[Figure: offload timeline — the host (Zynq) offloads the kernel through shared memory; the master eCore runs the sequential code and, on reaching the parallel region, requests workers; the workers start, initialize through on-chip memory, and join the team; master and workers run the parallel code; when the parallel region ends the workers go idle, and the master continues the sequential code, does its bookkeeping, and signals "kernel ended" back to the host, which acknowledges]
Experimental results
Environment:
 Parallella-16 SKUA101020
 Ubuntu 14.04, kernel 3.12.0 armv7l
 gcc and e-gcc v4.8.2 as back-ends for OMPi
 eSDK 5.13.9.10
Experimental results
 Overhead results of the EPCC microbenchmarks (modified version):
 Basic routines are offloaded through target directives
 Measurements taken from the host side, after subtracting any offloading costs
 Resetting an eCore: 0.1 sec
Experimental results
 Frames per second for the Mandelbrot deep-zoom application (1024x768)
 eSDK version: only 8%-13% better
 Original code: 301 lines (3 files); OpenMP code: 198 lines (1 file)

Spiros N. Agathos, Alexandros Papadogiannakis, Vassilios V. Dimakopoulos, “Targeting the Parallella”, Euro-Par 2015, International European Conference on Parallel and Distributed Computing, Vienna, Austria, August 2015, pages 662-674.
Alexandros Papadogiannakis, Spiros N. Agathos, Vassilios V. Dimakopoulos, “OpenMP 4.0 Device Support in the OMPi Compiler”, IWOMP 2015, International Workshop on OpenMP: Heterogeneous Execution and Data Movements, Aachen, Germany, October 2015, pages 202-216.
Motivation
 A kernel may contain any OpenMP directive:
 Great flexibility, parallelization expressiveness
 Requires runtime support within the co-processor
 Implementing such a runtime is a non-trivial task: GPGPUs, accelerators, etc. have special functionality characteristics and/or limited resources
 Usual solutions:
 Provide limited (and sometimes no) support for some of the directives
 Design sub-optimal runtime system implementations, depending on the capabilities of a given device

Scenario                 OMPi executable size (bytes)
Empty kernel             7092
Create a parallel team   10560
Our proposal
 A novel runtime organization designed to work with an OpenMP infrastructure
 Instead of a single monolithic runtime system, an adaptive runtime-system architecture which implements only the OpenMP features required by a particular application
 Example: an OpenMP kernel with no explicit tasking
 No tasking subsystem required
 A barrier with no (time-consuming) tasking extensions
 The library includes only the needed functionality  reduced-size executable
 Desirable in systems with minimal local memories
Compiler-Assisted Runtime Support (CARS)
 The compiler:
 Analyzes the kernels
 Provides metrics
 Selects a particular runtime-system configuration to accompany the kernel
 The user's code implies the choice of an optimized runtime system:
 Reduced executable sizes
 Faster execution times
CARS overview
[Figure: the user code (foo.c, C + OpenMP) goes through the compiler's transformation & analysis, producing host code, kernel code, and analysis metrics; a mapper picks among the runtime alternatives; the system compiler/linker then builds the host executable, and the kernel executable linked against a kernel-specific library]
Metrics
 OpenMP in kernel: whether the kernel code includes any OpenMP directives
 Dynamic parallelism: the exact number of threads used in all parallel teams, their nesting level, and the presence of the reduction clause along with its parameters
 Work-sharing regions: the types of worksharing constructs used in the kernel — whether for, single or sections regions are present; for for regions, the exact types and parameters of the schedules (static, dynamic or guided, and the chunk size), the presence of the ordered clause, and whether the program takes advantage of the nowait feature of worksharing
 Explicit tasking: the presence of user-defined tasks, taskgroups, and the possible dependencies, if any
 Special synchronization: the presence of the atomic construct and the type of required functionality; the number and names of the critical regions in the kernel; the number and characteristics (normal or nested) of user-defined locks
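A condensed form of these metrics, as the mapper could consume them, might look like the struct below. This is a hypothetical illustration; the actual CARS analysis record is far richer than these few fields.

```c
#include <stdbool.h>

/* Hypothetical condensed per-kernel metrics record. */
typedef struct {
    bool has_openmp;      /* any OpenMP directive in the kernel? */
    int  max_team_size;   /* largest num_threads over all parallel teams */
    int  max_nesting;     /* deepest parallel nesting level */
    bool uses_tasks;      /* explicit tasking present? */
    bool uses_critical;   /* critical regions present? */
} kernel_metrics_t;

/* Toy mapper decision: a tasking-free kernel can be linked against a
   runtime variant whose barrier carries no tasking extensions. */
static bool needs_tasking_subsystem(const kernel_metrics_t *m)
{
    return m->has_openmp && m->uses_tasks;
}
```

Each such predicate selects one runtime alternative (tasking subsystem in or out, which barrier flavor, which lock implementation), so the linked library contains only what the kernel actually uses.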
Experimental Results
.elf sizes in bytes:

Scenario          Full RTS   CARS    Difference
Mandelbrot        13156      9620    26.88%
Empty Kernel      8228       2252    73.63%
Pi Calculation    11972      8864    25.96%
Nqueens (tasks)   20908      19704   5.76%
EPCC-for-static   14176      10944   22.80%
EPCC-critical     12560      9320    35.80%
EPCC-single       12200      8900    27.05%
EPCC-ordered      14192      10952   22.83%
Experimental Results
Execution times (EPCC rows in microseconds):

Scenario          Full RTS    CARS        Difference
Mandelbrot        30.05 sec   30.00 sec   0.16%
Empty Kernel      0.10 sec    0.10 sec    0%
Pi Calculation    0.28 sec    0.26 sec    7.14%
Nqueens (tasks)   1.81 sec    1.81 sec    0%
EPCC-for-static   72.65       19.85       72.68%
EPCC-critical     2.17        1.55        39.98%
EPCC-single       83.72       14.92       28.57%
EPCC-ordered      4.70        4.66        0.85%

Spiros N. Agathos, Vassilios V. Dimakopoulos, “Compiler-Assisted OpenMP Runtime Organization for Embedded Multicores”, Technical Report 2016-01, University of Ioannina, Department of Computer Science & Engineering, April 2016.
Conclusion
 OpenMP is an easy-to-use programming model
 More powerful due to the addition of tasking facilities
 Generally applicable due to the device constructs
 Contributions related to tasks:
 General design & optimization for NUMA machines
 Transforming nested parallel loops to tasks
 Contributions related to OpenMP for devices:
 STHORM multi-core architecture
 Epiphany accelerator
 Application-driven runtime support
Future Work
 Optimize the Epiphany runtime to better exploit the hardware characteristics
 Support the OpenMP 4 device constructs for various devices (e.g. GPGPUs)
 OpenMP extensions:
 Data-block transfers
 Resident kernels
 Fine-grained synchronization between host and device
END…
Containerizing HPC and AI applications using E4S and Performance Monitor tool
 
A Peek into TFRT
A Peek into TFRTA Peek into TFRT
A Peek into TFRT
 
Openmp
OpenmpOpenmp
Openmp
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
eBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current TechniqueseBPF Debugging Infrastructure - Current Techniques
eBPF Debugging Infrastructure - Current Techniques
 
FIR filter on GPU
FIR filter on GPUFIR filter on GPU
FIR filter on GPU
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
 

Viewers also liked

Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpaja
Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpajaPerus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpaja
Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpajaTHL
 
Truepsychotherapy Βασίλης Γιαννακόπουλος
Truepsychotherapy Βασίλης ΓιαννακόπουλοςTruepsychotherapy Βασίλης Γιαννακόπουλος
Truepsychotherapy Βασίλης ΓιαννακόπουλοςTruepsychotherapy.gr
 
SciREX センターシンポジウム 政策分析・影響評価領域
SciREX センターシンポジウム 政策分析・影響評価領域SciREX センターシンポジウム 政策分析・影響評価領域
SciREX センターシンポジウム 政策分析・影響評価領域scirexcenter
 
Spring Framework / Boot / Data 徹底活用 〜Spring Data Redis 編〜
Spring Framework / Boot / Data 徹底活用  〜Spring Data Redis 編〜Spring Framework / Boot / Data 徹底活用  〜Spring Data Redis 編〜
Spring Framework / Boot / Data 徹底活用 〜Spring Data Redis 編〜Naohiro Yoshida
 
Initial title research
Initial title researchInitial title research
Initial title researchCharlie Robson
 
今後必要になるイメージング技術サポート_公開用
今後必要になるイメージング技術サポート_公開用今後必要になるイメージング技術サポート_公開用
今後必要になるイメージング技術サポート_公開用Tatsuaki Kobayashi
 
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välillä
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välilläVertikaalinen integraatio, vaativimman, erityistason ja perustason välillä
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välilläTHL
 

Viewers also liked (9)

Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpaja
Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpajaPerus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpaja
Perus- ja erityistaso lasten, nuorten ja perheiden kanssa LAPE-työpaja
 
Referral_Letter_for_Annabelle 1
Referral_Letter_for_Annabelle 1Referral_Letter_for_Annabelle 1
Referral_Letter_for_Annabelle 1
 
Truepsychotherapy Βασίλης Γιαννακόπουλος
Truepsychotherapy Βασίλης ΓιαννακόπουλοςTruepsychotherapy Βασίλης Γιαννακόπουλος
Truepsychotherapy Βασίλης Γιαννακόπουλος
 
SciREX センターシンポジウム 政策分析・影響評価領域
SciREX センターシンポジウム 政策分析・影響評価領域SciREX センターシンポジウム 政策分析・影響評価領域
SciREX センターシンポジウム 政策分析・影響評価領域
 
Spring Framework / Boot / Data 徹底活用 〜Spring Data Redis 編〜
Spring Framework / Boot / Data 徹底活用  〜Spring Data Redis 編〜Spring Framework / Boot / Data 徹底活用  〜Spring Data Redis 編〜
Spring Framework / Boot / Data 徹底活用 〜Spring Data Redis 編〜
 
Initial title research
Initial title researchInitial title research
Initial title research
 
今後必要になるイメージング技術サポート_公開用
今後必要になるイメージング技術サポート_公開用今後必要になるイメージング技術サポート_公開用
今後必要になるイメージング技術サポート_公開用
 
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välillä
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välilläVertikaalinen integraatio, vaativimman, erityistason ja perustason välillä
Vertikaalinen integraatio, vaativimman, erityistason ja perustason välillä
 
Prefixes
PrefixesPrefixes
Prefixes
 

Similar to defense-linkedin

A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...ChangWoo Min
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8AbdullahMunir32
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchRafael Ferreira da Silva
 
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)Igalia
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberNVIDIA
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeDmitri Nesteruk
 
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...Journal Seminar: Is Singularity-based Container Technology Ready for Running ...
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...Kento Aoyama
 
Enlightenment Foundation Libraries (Overview)
Enlightenment Foundation Libraries (Overview)Enlightenment Foundation Libraries (Overview)
Enlightenment Foundation Libraries (Overview)Samsung Open Source Group
 
parellel computing
parellel computingparellel computing
parellel computingkatakdound
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 

Similar to defense-linkedin (20)

A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...A Survey on in-a-box parallel computing and its implications on system softwa...
A Survey on in-a-box parallel computing and its implications on system softwa...
 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Multicore
MulticoreMulticore
Multicore
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
 
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
D. Fast, Simple User-Space Network Functions with Snabb (RIPE 77)
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- December
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...Journal Seminar: Is Singularity-based Container Technology Ready for Running ...
Journal Seminar: Is Singularity-based Container Technology Ready for Running ...
 
Enlightenment Foundation Libraries (Overview)
Enlightenment Foundation Libraries (Overview)Enlightenment Foundation Libraries (Overview)
Enlightenment Foundation Libraries (Overview)
 
parellel computing
parellel computingparellel computing
parellel computing
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 

defense-linkedin

  • 1. DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING UNIVERSITY OF IOANNINA GREECE Spiros N. Agathos Efficient OpenMP Runtime Support for General-Purpose and Embedded Multi-core Platforms
  • 2. Parallel Processing Group UNIVERSITY OF IOANNINA 2 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 3. Parallel Processing Group UNIVERSITY OF IOANNINA 3  Personal computers  Multiple compute cores in a socket  Market trends (CPU + coprocessor):  Home appliances now include embedded systems  HPC supercomputers  Multicore CPUs / GPGPUs / DSPs / FPGAs Multi-core systems 16,000 nodes X {2 Intel Xeon + 3 Xeon Phi} RANK 1st (November 2015) Tianhe-2 (China)
  • 4. Parallel Processing Group UNIVERSITY OF IOANNINA 4 More cores = More power ? Easy transition from serial to parallel programming? Efficient programming Code Portability Exploiting heterogeneous systems Low level VS high level programming languages
  • 5. Parallel Processing Group UNIVERSITY OF IOANNINA 5 OpenMP Directive-based model for parallel programming Base language (C/C++/Fortran) Incremental parallelization Highly successful, widely used, supported by many compilers Fork-join execution model First version in 1997, now v4.5 (November 2015) Parallel loops Sections Synchronization (barrier/critical/atomic) OpenMP Initial thread (Preamble) Fork() Work threadWork threadWork thread Work threadWork thread Join Initial thread (Continues)
  • 6. Parallel Processing Group UNIVERSITY OF IOANNINA 6 OpenMP since version 3.0 (May 2008) supports task-based parallelism. With tasking, the expressiveness of OpenMP is enriched with flexible management of irregular and recursive parallelism An OpenMP task is a unit of work scheduled for asynchronous execution Has its own data environment Can share data with other tasks OpenMP tasking model
  • 7. Parallel Processing Group UNIVERSITY OF IOANNINA 7 void main() { int res; #pragma omp parallel { #pragma omp single res = fib(40); } } int fib(int n) { int n1, n2; if(n <= 1) return n; #pragma omp task shared(n1) n1 = fib(n-2); #pragma omp task shared(n2) n2 = fib(n-1); #pragma omp taskwait return n1+n2; } OpenMP task example
  • 8. Parallel Processing Group UNIVERSITY OF IOANNINA 8 Device Support added in version 4.0 (November 2013) An application can be executed by a set of devices host device and other target devices Host-centric execution model Create data environment in the device Instruct device to execute code regions (kernel) OpenMP device model
  • 9. Parallel Processing Group UNIVERSITY OF IOANNINA 9 void main() { int input[10], result[10]; init(input); #pragma omp target map(to:input) map(from:result) { int i; #pragma omp parallel for for(i=0; i<10; i++) result[i] = input[i]++; } print(result); } OpenMP target example Executed on host Communication Executed on device
  • 10. Parallel Processing Group UNIVERSITY OF IOANNINA 10  OMPi (http://paragroup.cs.uoi.gr/wpsite/software/ompi):  OpenMP C infrastructure  Source to source compiler + Runtime libraries OMPi compiler OpenMP C Multithreaded C
  • 11. Parallel Processing Group UNIVERSITY OF IOANNINA 11 Contributions  Tasking  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 12. Parallel Processing Group UNIVERSITY OF IOANNINA 12 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 13. Parallel Processing Group UNIVERSITY OF IOANNINA 13 Tasking in the OMPi Compiler #pragma omp task firstprivate(x) { <CODE> } { /* task directive replacement */ env = capture_task_environment(); ort_new_task(taskFunc0, env); } /* Produced task function */ void taskFunc0(void *env) { get_env_data(env); <CODE> } /* Fast Path Optimization */ if(must_execute_serially) { declare_local_vars(); ort_immediate_start(); <CODE> ort_task_immediate_end(); } else
  • 14. Parallel Processing Group UNIVERSITY OF IOANNINA 14 Tasking Runtime Architecture TASK_QUEUE Task Queue:  Per OpenMP Thread (worker)  DEQueue Breadth first approach Depth first when full  Fast Path optimization  Less overheads  Data locality  Less parallelism exploitation Back to Breadth first when 30% empty Work-stealing between siblings  Lock-free [Chase & Lev, SPAA, 2005]  Worker  adds and removes elements to queue’s top  Thief  removes elements from queue’s bottom to avoid contention Create 3 tasks
  • 15. Parallel Processing Group UNIVERSITY OF IOANNINA 15 Thread Memory Array Td Td Td Td Td Td Td Overflow List Td Td Td Td Descriptor Pool TASK_QUEUE Tasking runtime architecture Exec? Descriptor Fields Data Env Actual Data Task Descriptor Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, “Design and Implementation of OpenMP Tasks in the OMPi Compiler”, PCI 2011, 15th Panhellenic Conference on Informatics, Kastoria, Greece, September 2011, pages 265 - 269
  • 16. Parallel Processing Group UNIVERSITY OF IOANNINA 16 Modern multicore multiprocessors Deep cache hierarchies, private memory channels Non Uniform Memory Access (NUMA) A general tasking runtime will have sub-optimal behaviour We designed & implemented an optimized tasking runtime for NUMA machines Maximizes local operations Minimizes remote accesses, less cache misses, low overheads Work-stealing system uses an efficient blocking algorithm Optimizations for NUMA machines
  • 17. Parallel Processing Group UNIVERSITY OF IOANNINA 17 Optimizations for NUMA machines Td thief recycle issue:  A thread wishes to steal a task  After task completion  In which Descriptor pool should Td return? a) Thief?  No contention 1-producer/N-thieves? b) Task’s creator? Synchronization issues appear TASK_QUEUE Td Td Td Td Descriptor Pool Thief
  • 18. Parallel Processing Group UNIVERSITY OF IOANNINA 18 CreatorMemory Optimizations for NUMA machines TASK_QUEUE Td Td Td Td DescriptorPool Thief Pending Queue Worker  Less synchronization between threads Less cache misses  More local data operations
  • 19. Parallel Processing Group UNIVERSITY OF IOANNINA 19 Work-Stealing mechanism Crucial component of an OpenMP runtime TASK_QUEUE is a shared object supporting OwnerEnqueue  Enqueues a task in a thread’s queue  Executed only by the thread that owns queue  No need for synchronization Dequeue  Removes the oldest enqueued task  Executed by any thread in an OpenMP team  Synchronization needed! A Fast Work-stealing Mechanism TASK_QUEUE
  • 20. Parallel Processing Group UNIVERSITY OF IOANNINA 20 Based on the CC-Synch object/algorithm of [Fatourou & Kallimanis, PPoPP, 2012]. CC-Synch for thread synchronization To implement Dequeue operation  Use one instance of CC-Synch for each TASK_QUEUE Result of the combining technique:  One thread (the combiner) holds a coarse lock  Additionally to the application of its own operation, serves the operations of all other active threads  Highly reduces synchronization costs A Fast Work-stealing Mechanism
  • 21. Parallel Processing Group UNIVERSITY OF IOANNINA 21 Performance evaluation NUMA Environment 2 x 8-core AMD Opteron 6128 CPUs @ 2GHz 16GB of main memory Debian Squeeze on the 2.6.32.5 kernel GNU gcc (version 4.4.5-8) [-O3 -fopenmp] Intel icc (version 12.1.0) [-fast -openmp] Oracle suncc (version 12.2) [-fast -xopenmp=parallel] OMPi uses GNU gcc as a back-end compiler [-O3] Default Runtime Settings
  • 22. Parallel Processing Group UNIVERSITY OF IOANNINA 22 Synthetic benchmark Fine grain tasks (max workload = 128) 16 threads 1 thread produces tasks The remaining threads execute them Taskload = for loop with ‘max workload’ repetitions
  • 23. Parallel Processing Group UNIVERSITY OF IOANNINA 23 Barcelona OpenMP Task Suite Fibonacci  Computes the nth Fibonacci number  Exploits nested task parallelization which creates a deep tree  Very large number of fine- grain tasks  40th Fibonacci number  New work-stealing implementation and the fast execution path No manual cut-off
  • 24. Parallel Processing Group UNIVERSITY OF IOANNINA 24 Barcelona OpenMP Task Suite NQueens  Calculates solutions of the n-queens chessboard problem  Backtracking search algorithm with pruning creates unbalanced tasks  Exploits nested task parallelization which creates a deep tree of tasks  Input: 14 queens No manual cut-off Spiros N. Agathos, Nikolaos D. Kallimanis, Vassilios V. Dimakopoulos, “Speeding Up OpenMP Tasking”, Europar 2012, International European Conference on Parallel and Distributed Computing, Rhodes, Greece, August 2012, pages 650-661.
  • 25. Parallel Processing Group UNIVERSITY OF IOANNINA 25 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 26. Parallel Processing Group UNIVERSITY OF IOANNINA 26 Nested parallelism A parallel region inside a parallel region Every thread of team creates its own team of threads Difficult to handle efficiently  Possible processor oversubscription Typically nested loops are parallelized using nested parallelism Can we replace a nested loop with tasks? Introduction
  • 27. Parallel Processing Group UNIVERSITY OF IOANNINA 27 #pragma omp parallel num_threads(M) { #pragma omp parallel for schedule(static) num_threads(N) for (i=LB; i<UB; i++) { <body> } ………… } Transforming Loop Code Manually #pragma omp parallel num_threads(M) { for(t=0; t<N; t++) #pragma omp task { calculate(N, LB, UB, &lb, &ub); for (i=lb; i<ub; i++) <body> } #pragma omp taskwait ………… }  The N implicit tasks of the nested parallel loop are transformed to N explicit tasks  Instead of N×M, only M threads in the system
  • 28. Parallel Processing Group UNIVERSITY OF IOANNINA 28 A similar transformation is possible for  Dynamic  Guided Complicated user code Impossible to have access to thread-specific data (example) Manual transformation limitations
  • 29. Parallel Processing Group UNIVERSITY OF IOANNINA 29 In general:  A mini worksharing runtime must be written  What thread will execute what task?  Impossible to handle thread specific data But within an OpenMP runtime system:  All the worksharing functionality already there  Access to all thread specific data Auto transformation in OMPi compiler Manual transformation limitations
  • 30. Parallel Processing Group UNIVERSITY OF IOANNINA 30 OMPi’s runtime organization EECB All OpenMP thread info: 1)Thread ID 2)Parallel level 3)Pointer to parent EECB 4)……… Each OpenMP thread is associated with an EECB (Execution Entity Control Block)
  • 31. Parallel Processing Group UNIVERSITY OF IOANNINA 31 Parallel: Creation of parallel tasks #pragma omp parallel num_threads(4) P0 parallel task P1 parallel task P2 parallel task P3 parallel task Thread ID = 0 Level = 0 Initial parallel Task 4 new threads are created New EECBs are assigned Execute parallel region
  • 32. Parallel Processing Group UNIVERSITY OF IOANNINA 32 Nested Parallel: Creation of special tasks #pragma omp parallel for num_threads(4) S0 special task S1 special task S2 special task S3 special task Thread ID = 0 Level = 1 P0 parallel task Special tasks are stored in TaskQueue Sibling threads steal special tasks Change EECB && execute ‘parallel region’
  • 33. Parallel Processing Group UNIVERSITY OF IOANNINA 33 Compiler Side  No changes Runtime:  New type of task called pfor_task  Emulation of parallel tasks Same technique works for nested parallel sections Auto transformation implementation
  • 34. Parallel Processing Group UNIVERSITY OF IOANNINA 34 NUMA Environment 2X 8-core AMD Opteron 6128 CPUs @ 2GHz 16GB of main memory Debian Squeeze on the 2.6.32.5 kernel GNU gcc (version 4.4.5-8) [-O3 -fopenmp] Intel icc (version 12.1.0) [-fast -openmp] Oracle suncc (version 12.2) [-fast -xopenmp=parallel] OMPi uses GNU gcc as a back-end compiler [-O3]  Original, non-optimized task runtime was used Default Runtime Settings Evaluation
  • 35. Parallel Processing Group UNIVERSITY OF IOANNINA 35 Synthetic benchmark results TASK_LOAD = 500 16 Threads in 1st Level 16 Threads in 1st Level N (L2 Threads) = 4 Create a parallel region with 16 threads Each thread creates a nested team to execute for-loop Iteration workload = for loop with repetitions
  • 36. Parallel Processing Group UNIVERSITY OF IOANNINA 36  Takes as input an image and discovers the number of faces depicted  Utilizing nested parallelism in order to obtain better performance Face detection results 161 Images CMU test set Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, “Task based Execution of Nested OpenMP Loops”, IWOMP 12, International Workshop on OpenMP, OpenMP in a Heterogeneous World, Rome, Italy, June 2012, pages 210 - 222.
  • 37. Parallel Processing Group UNIVERSITY OF IOANNINA 37  Hybrid policy  Idle cores? Create threads  All cores occupied ? Create tasks  Not enough cores ? Mix tasks/threads Face detection results 161 Images CMU test set
  • 38. Parallel Processing Group UNIVERSITY OF IOANNINA 38 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 39. Parallel Processing Group UNIVERSITY OF IOANNINA 39 Can OpenMP form the basis for a programming model for multicore embedded accelerators?  The answer is positive, but not straightforward since:  OpenMP is designed for homogeneous shared memory systems  Embedded systems – Include groups of weak PEs – Have limited resources OpenMP for accelerators?
  • 40. Parallel Processing Group UNIVERSITY OF IOANNINA 40 OpenMP for accelerators? The design and implementation of a programming model for the STHORM architecture (Artemis EU Project No. 100230 SMECY) Offer decent parallelization without significant programming effort
  • 41. Parallel Processing Group UNIVERSITY OF IOANNINA 41 STHORM architecture
  • 42. Parallel Processing Group UNIVERSITY OF IOANNINA 42 STHORM architecture
  • 43. Parallel Processing Group UNIVERSITY OF IOANNINA 43 Execution model  An accelerator is usually a back-end system attached to a host.  Where and how the OpenMP programming model is to be applied? 1. On the Host side: Multiple OpenMP Host threads generate multiple jobs on the accelerator 2. On the Accelerator side: Each Host thread can trigger multiple OpenMP threads on the accelerator 3. Our proposal : On Both sides!  General solution  Flexible, programmer friendly Multicore ARM HOST Accelerator
  • 44. Parallel Processing Group UNIVERSITY OF IOANNINA 44 Supporting OpenMP OpenMP on ARM Minimal changes New module for communicating with STHORM OpenMP on STHORM Very difficult to implement Limited hardware resources Full albeit non-optimized implementation
  • 45. Parallel Processing Group UNIVERSITY OF IOANNINA 45 Compilation chain Code for the HOST & the Accelerator OpenMP C OpenMP C OpenMP C Multithreaded C Multithreaded C
  • 46. Parallel Processing Group UNIVERSITY OF IOANNINA 46 EECB management L2 TCDM Emptylist ID0 Lev0 ID1 Lev1 ID1 Lev1 ID2 Lev1 ID3 Lev1 ID4 Lev1 ID5 Lev1 ID15 Lev1 Execution Entity Control Block (EECB) per thread Assigned to thread when it starts the execution Freed when the team is disbanded Problem : Placement of EECBs EECBs are constantly accessed during execution Solution : Scratchpad memory guaranteed performance Ok for 1 level of parallelism Infeasible for multiple levels Our Proposal: Use TCDM only for the 16 active threads Use L2 memory for nested teams Keep all active EECBs in TCDM ID 0 Lev2
  • 47. Parallel Processing Group UNIVERSITY OF IOANNINA 47 Parallel regions  OpenMP parallel region is a group of jobs executed by different PEs  PE meets a parallel region (Master PE)  Suspends the execution of its current job  Sends a request to the CC  Allocates a new EECB  Executes its implicit task  Wait the CC to notify the end of PR  Return to its old EECB  Other PE’s:  Receive the request of the CC  Acquire an EECB  Start execute their implicit job  When a job is executed notifies CC  Release EECB  CC:  Supply PE’s with implicit jobs  Receive end notifications of PE’s  Informs the master PE that PR is finished TCDM EnCore CC Cluster M
  • 50. Parallel Processing Group UNIVERSITY OF IOANNINA 50 Experimental results Calculation of the Mandelbrot set for an image of 362x208 pixels Image fits in TCDM We parallelized the computation of image pixel values Results for different scheduling policies of the OpenMP parallel for Spiros N. Agathos, Vassilios V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis, “Deploying OpenMP on an Embedded Multicore Accelerator”, SAMOS XIII, International Conference on Embedded Computer Systems: Architectures, Samos, Greece, July 2013, pages 180 - 187
  • 51. Parallel Processing Group UNIVERSITY OF IOANNINA 51 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 52. Parallel Processing Group UNIVERSITY OF IOANNINA 52 OpenMP target example
    void main() {
        int input[10], result[10];
        init(input);                                        /* executed on host */
        #pragma omp target map(to:input) map(from:result)   /* communication */
        {
            int i;
            #pragma omp parallel for                        /* executed on device */
            for (i = 0; i < 10; i++)
                result[i] = input[i]++;
        }
        print(result);                                      /* executed on host */
    }
  • 53. Parallel Processing Group UNIVERSITY OF IOANNINA 53 OpenMP target example
    void main() {
        int input[10], result[10];
        init(input);                                             /* executed on host */
        #pragma omp target data map(to:input) map(from:result)   /* communication */
        {
            #pragma omp target
            { <Kernel 01 code, accesses input, output> }         /* executed on device */
            <Host code, prepare for new kernel>                  /* executed on host */
            #pragma omp target
            { <Kernel 02 code, accesses input, output> }         /* executed on device */
        }
    }
  • 54. Parallel Processing Group UNIVERSITY OF IOANNINA 54 Parallella board • $99 board (basic version) • Epiphany-16 (25 GFLOPS at less than 2 Watts) • Zynq → runs Linux • Epiphany-16 → no OS • eSDK tools for native programming
  • 55. Parallel Processing Group UNIVERSITY OF IOANNINA 55 Compiling for the new device directives  Each kernel (i.e. target region) is outlined to a separate function  The code generation phase produces multiple output files, one for each different kernel, plus the host code (the host may be called to execute any of them)
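The outlining step above can be sketched as follows. This is an illustrative sketch only: the names `_kernel0` and `_kernel0_args` are invented to show the general idea of turning a target region into a standalone function, not OMPi's actual generated code.

```c
#include <assert.h>

/* Hypothetical struct holding the kernel's mapped variables. */
typedef struct {
    int *input;    /* corresponds to map(to:input)    */
    int *result;   /* corresponds to map(from:result) */
} _kernel0_args;

/* The body of the target region becomes a standalone function; the
   device-side object file exports it and the host runtime invokes it
   with a pointer to the mapped variables. */
static void _kernel0(_kernel0_args *a)
{
    int i;
    for (i = 0; i < 10; i++)
        a->result[i] = a->input[i] + 1;
}
```

On the host side, offloading then amounts to marshalling the mapped variables into such a structure and triggering the device entry point.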
  • 56. Parallel Processing Group UNIVERSITY OF IOANNINA 56 Runtime architecture - what the host does  A full-fledged OpenMP runtime library  Supporting execution on the dual-core ARM processor  Additional functionality  Required for controlling and accessing the Epiphany device  Communication between the host and the eCores takes place through the shared-memory portion of the system RAM  For offloading a kernel:  The first idle eCore is chosen  The precompiled object file is loaded to it for immediate execution  eCores inform the host about the completion of a kernel through special flags in shared memory  Multiple host threads can offload multiple independent kernels concurrently onto the Epiphany
  • 57. Parallel Processing Group UNIVERSITY OF IOANNINA 57 Runtime architecture - what the host does
    int X[10], Y[10];
    int k;
    #pragma omp target data map(X,Y)
    {
        #pragma omp target map(to:k)
        { /* Kernel code */ }
    }
  [Diagram: the 32 MiB shared memory holds a 4 KiB device control area, the target data variables (X, Y, k) and the kernel's data environment with pointers to X and Y]
  • 58. Parallel Processing Group UNIVERSITY OF IOANNINA 58 OpenMP within the Epiphany  Supporting OpenMP within the Epiphany is nontrivial:  eCores do not execute any OS  No provision for dynamic parallelism  The 32 KiB local memory is quite limited: unable to hold sophisticated OpenMP runtime structures  The runtime infrastructure originally designed for the host was trimmed down to a minimum  e.g. tasking = a shared queue protected by a lock  This is linked and offloaded with each kernel  The corresponding coordination among the participating eCores utilizes the local memory of the team's master eCore
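The "shared queue protected by a lock" design can be sketched as below. This is a minimal illustration of the trimmed-down tasking idea, not the actual OMPi structures; all names are invented, and a pthreads mutex stands in for the Epiphany's hardware locks.

```c
#include <pthread.h>

#define QCAP 64   /* bounded: eCore local memory is only 32 KiB */

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    task_t tasks[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;   /* on an eCore this would be a HW lock */
} taskq_t;

static void taskq_init(taskq_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Enqueue a task; if the bounded queue is full, the caller simply
   executes the task inline (a common fallback in tiny runtimes). */
static int taskq_push(taskq_t *q, void (*fn)(void *), void *arg)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == QCAP) {
        pthread_mutex_unlock(&q->lock);
        fn(arg);
        return 0;
    }
    q->tasks[q->tail] = (task_t){ fn, arg };
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_mutex_unlock(&q->lock);
    return 1;
}

/* Dequeue a task, e.g. while a team member waits at a barrier. */
static int taskq_pop(taskq_t *q, task_t *out)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == 0) { pthread_mutex_unlock(&q->lock); return 0; }
    *out = q->tasks[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return 1;
}

/* Demo task used below; purely illustrative. */
static void demo_incr(void *p) { ++*(int *)p; }
```

The single-lock design trades scalability for a tiny memory footprint, which is the constraint that matters inside the eCores.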
  • 59. Parallel Processing Group UNIVERSITY OF IOANNINA 59 #parallel inside a kernel [Timeline diagram: the host (Zynq) offloads the kernel to a master eCore, which runs sequential code until it meets #parallel; the master then requests workers through shared memory; the host starts the worker threads and replies to the master; the master initializes and notifies the workers, which initialize and wait for it; master and workers join the team through on-chip memory and execute the parallel code; when the parallel region ends, the workers notify the master (bookkeeping and ack) and go idle while the master continues the sequential code until the kernel ends]
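The master/worker handshake in the timeline above can be sketched with shared flags. This is an illustrative sketch, assuming invented names throughout: on the Parallella the flags would live in on-chip or shared memory and the "threads" would be eCores polling those flags; here, C11 atomics and pthreads stand in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { IDLE, WORK_READY, WORK_DONE };

typedef struct {
    atomic_int state;   /* the shared-memory notification flag */
    int input, output;
} worker_slot_t;

/* A worker spins until the master publishes work, executes the
   "parallel code", then raises its completion flag and goes idle. */
static void *worker(void *p)
{
    worker_slot_t *s = p;
    while (atomic_load(&s->state) != WORK_READY)
        ;                                   /* poll, as eCores do */
    s->output = s->input * 2;               /* the parallel code  */
    atomic_store(&s->state, WORK_DONE);     /* notify the master  */
    return NULL;
}

/* The master: start workers, publish work, then join the team. */
static int run_team(int n, const int *in, int *out)
{
    pthread_t tid[8];
    worker_slot_t slot[8];
    if (n > 8) return -1;

    for (int i = 0; i < n; i++) {
        atomic_init(&slot[i].state, IDLE);
        slot[i].input = in[i];
        pthread_create(&tid[i], NULL, worker, &slot[i]);
        atomic_store(&slot[i].state, WORK_READY);
    }
    for (int i = 0; i < n; i++) {           /* join the team */
        pthread_join(tid[i], NULL);
        assert(atomic_load(&slot[i].state) == WORK_DONE);
        out[i] = slot[i].output;
    }
    return 0;
}
```

The point of the sketch is the flag protocol (IDLE → WORK_READY → WORK_DONE), which mirrors the request/reply/notify steps of the diagram.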
  • 60. Parallel Processing Group UNIVERSITY OF IOANNINA 60 Environment  Parallella-16 SKUA101020  Ubuntu 14.04, kernel 3.12.0 armv7l  gcc and e-gcc v.4.8.2 as back-end for OMPi  eSDK 5.13.9.10 Experimental results
  • 61. Parallel Processing Group UNIVERSITY OF IOANNINA 61 Experimental results Overhead results of the EPCC benchmarks Modified version of the EPCC benchmarks  Basic routines are offloaded through target directives  Measurements taken from the host side after subtracting any offloading costs Resetting an eCore: 0.1 sec
  • 62. Parallel Processing Group UNIVERSITY OF IOANNINA 62 Experimental results Frames per second for the Mandelbrot deep zoom application (1024x768) eSDK version: only 8%-13% better Original code: 301 lines (3 files) OpenMP code: 198 lines (1 file) Spiros N. Agathos, Alexandros Papadogiannakis, Vassilios V. Dimakopoulos, "Targeting the Parallella", Euro-Par 2015, International European Conference on Parallel and Distributed Computing, Vienna, Austria, August 2015, pages 662-674 Alexandros Papadogiannakis, Spiros N. Agathos, Vassilios V. Dimakopoulos, "OpenMP 4.0 Device Support in the OMPi Compiler", IWOMP 2015, International Workshop on OpenMP (OpenMP: Heterogenous Execution and Data Movements), Aachen, Germany, October 2015, pages 202-216
  • 63. Parallel Processing Group UNIVERSITY OF IOANNINA 63 Presentation overview  Multi-core systems and their programming  Summary of contributions  Tasking support  General design & optimization for NUMA machines  Transform nested parallel loops to tasks  Runtime support for embedded systems  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 64. Parallel Processing Group UNIVERSITY OF IOANNINA 64 Motivation A kernel may contain any OpenMP directive  Great flexibility  Parallelization expressiveness Requires runtime support within the co-processor Implementing such a runtime is a non-trivial task: GPGPUs, accelerators etc. have special functionality characteristics and/or limited resources Solutions:  Provide limited (and sometimes no) support for some of the directives  Design sub-optimal runtime system implementations depending on the capabilities of a given device Motivating example — OMPi executable size (bytes): Empty kernel: 7092; Create a parallel team: 10560
  • 65. Parallel Processing Group UNIVERSITY OF IOANNINA 65 Our proposal A novel runtime organization designed to work with an OpenMP infrastructure Instead of a single monolithic runtime system, an adaptive runtime system architecture which implements only the OpenMP features required by a particular application Example: an OpenMP kernel with no explicit tasking → no tasking subsystem required, and a barrier without the (time-consuming) tasking extensions The library includes only the needed functionality → reduced-size executable Desirable in systems with minimal local memories
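The per-feature trimming described above can be sketched with compile-time configuration. This is a sketch of the idea only, assuming an invented macro (`CARS_HAVE_TASKING`) and invented function names; it is not the actual OMPi mechanism.

```c
/* Hypothetically emitted by the compiler only when the kernel
   actually creates tasks:
   #define CARS_HAVE_TASKING 1
*/

static int tasking_code_ran = 0;

#ifdef CARS_HAVE_TASKING
/* The expensive path: drain the task queue while waiting. */
static void execute_pending_tasks(void) { tasking_code_ran = 1; }
#endif

/* A team barrier that carries the tasking extension only when the
   kernel needs it; otherwise that code is never even compiled in. */
static void team_barrier(void)
{
#ifdef CARS_HAVE_TASKING
    execute_pending_tasks();   /* time-consuming tasking hook */
#endif
    /* ... plain barrier synchronization only ... */
}
```

Because the unused path is excluded at compile time rather than at run time, both the code size and the barrier latency shrink, which is exactly what matters in a 32 KiB local memory.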
  • 66. Parallel Processing Group UNIVERSITY OF IOANNINA 66 Compiler Assisted Runtime Support (CARS) The compiler:  Analyzes the kernels  Provides metrics  Selects a particular runtime system configuration to accompany the kernel The user's code thus implies the choice of an optimized runtime system:  Reduced executable sizes  Faster execution times
  • 67. Parallel Processing Group UNIVERSITY OF IOANNINA 67 CARS Overview foo.c User code (C + OpenMP) Compiler (transformation & analysis) Analysis metrics Mapper Runtime alternatives Kernel code System Compiler/linker Kernel specific library Kernel executable Host code executable
  • 68. Parallel Processing Group UNIVERSITY OF IOANNINA 68 Metrics  OpenMP in kernel: whether the kernel code includes any OpenMP directives at all.  Dynamic parallelism: the exact number of threads used in all parallel teams, their nesting level, as well as the presence of the reduction clause along with its parameters.  Work-sharing regions: the types of worksharing constructs used in the kernel, i.e. whether for, single or sections regions are present. For for regions, the exact types and parameters of the schedules, that is whether static, dynamic or guided is used, and the chunk size; in addition, the presence of the ordered clause; finally, whether the program takes advantage of the nowait feature of worksharing.  Explicit tasking: the presence of user-defined tasks, taskgroups and dependencies, if any.  Special synchronization: the presence of the atomic construct and the type of required functionality; the number and names of the critical regions in the kernel; the number and characteristics (whether normal or nested) of user-defined locks.
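The metrics above can be pictured as a record the analysis fills in per kernel and hands to the mapper. This is a sketch under stated assumptions: every field and function name below is invented for illustration, not taken from OMPi.

```c
#include <string.h>

typedef enum { SCHED_NONE, SCHED_STATIC, SCHED_DYNAMIC, SCHED_GUIDED } sched_t;

/* Hypothetical per-kernel analysis record. */
typedef struct {
    int has_openmp;          /* any OpenMP directives at all?     */
    int max_team_size;       /* threads requested across teams    */
    int max_nesting_level;   /* depth of nested parallelism       */
    int has_reduction;
    sched_t for_schedule;    /* worksharing-for schedule kind     */
    int chunk_size;
    int has_ordered, has_nowait, has_single, has_sections;
    int has_tasks, has_taskgroups, has_task_deps;
    int num_critical_regions;
    int num_locks, has_nested_locks;
} kernel_metrics_t;

/* A toy mapper: choose a runtime configuration from the metrics. */
static const char *pick_runtime(const kernel_metrics_t *m)
{
    if (!m->has_openmp) return "none";        /* link no runtime at all  */
    if (!m->has_tasks)  return "no-tasking";  /* barrier w/o task hooks  */
    return "full";
}
```

In the real system the mapper would select among prebuilt runtime alternatives (per the CARS pipeline) rather than return a string, but the decision structure is the same.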
  • 69. Parallel Processing Group UNIVERSITY OF IOANNINA 69 Environment  Parallella-16 SKUA101020  Ubuntu 14.04, kernel 3.12.0 armv7l  gcc and e-gcc v.4.8.2 as back-end for OMPi  eSDK 5.13.9.10 Experimental results
  • 70. Parallel Processing Group UNIVERSITY OF IOANNINA 70 Experimental Results — .elf sizes in bytes
    Scenario          Full RTS   CARS    Difference
    Mandelbrot        13156      9620    26.88%
    Empty Kernel      8228       2252    73.63%
    Pi Calculation    11972      8864    25.96%
    Nqueens (tasks)   20908      19704   5.76%
    EPCC-for-static   14176      10944   22.80%
    EPCC-critical     12560      9320    35.80%
    EPCC-single       12200      8900    27.05%
    EPCC-ordered      14192      10952   22.83%
  • 71. Parallel Processing Group UNIVERSITY OF IOANNINA 71 Experimental Results — Execution times (EPCC rows in microseconds)
    Scenario          Full RTS    CARS        Difference
    Mandelbrot        30.05 sec   30.00 sec   0.16%
    Empty Kernel      0.10 sec    0.10 sec    0%
    Pi Calculation    0.28 sec    0.26 sec    7.14%
    Nqueens (tasks)   1.81 sec    1.81 sec    0%
    EPCC-for-static   72.65       19.85       72.68%
    EPCC-critical     2.17        1.55        39.98%
    EPCC-single       83.72       14.92       28.57%
    EPCC-ordered      4.70        4.66        0.85%
  Spiros N. Agathos, Vasileios V. Dimakopoulos, "Compiler-Assisted OpenMP Runtime Organization for Embedded Multicores", Technical Report, Number 2016-01, University of Ioannina, Department of Computer Science & Engineering, April 2016
  • 72. Parallel Processing Group UNIVERSITY OF IOANNINA 72 Conclusion  OpenMP is an easy-to-use programming model  More powerful due to the addition of tasking facilities  Generally applicable due to the device constructs  Contributions related to tasks  General design & optimization for NUMA machines  Transformation of nested parallel loops to tasks  Contributions related to OpenMP for devices  STHORM multi-core architecture  Epiphany accelerator  Application-driven runtime support
  • 73. Parallel Processing Group UNIVERSITY OF IOANNINA 73 Future Work Optimize the Epiphany runtime to better exploit the hardware characteristics Support OpenMP 4 device constructs for various devices  GPGPUs OpenMP extensions:  Data-block transfers  Resident kernels  Fine-grained synchronization between host and device
  • 74. Parallel Processing Group UNIVERSITY OF IOANNINA 74 END… Acknowledgements: