1. DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIVERSITY OF IOANNINA
GREECE
Spiros N. Agathos
Efficient OpenMP Runtime Support for General-Purpose and Embedded Multi-core Platforms
2. Parallel Processing Group
UNIVERSITY OF IOANNINA
2
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Multi-core systems
Personal computers: multiple compute cores in a socket
Home appliances now include embedded systems
HPC supercomputers: multicore CPUs / GPGPUs / DSPs / FPGAs
Market trends (CPU + coprocessor):
Tianhe-2 (China), rank 1 (November 2015):
16,000 nodes x {2 Intel Xeon + 3 Xeon Phi}
More cores = more power?
Easy transition from serial to parallel programming?
Efficient programming
Code portability
Exploiting heterogeneous systems
Low-level vs. high-level programming languages
OpenMP
Directive-based model for parallel programming
Base language (C/C++/Fortran)
Incremental parallelization
Highly successful, widely used, supported by many
compilers
Fork-join execution model
First version in 1997, now v4.5 (November 2015)
Parallel loops
Sections
Synchronization (barrier/critical/atomic)
[Fork-join diagram: the initial thread runs the preamble, forks a team of work threads, and all join back before the initial thread continues]
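The fork-join model above can be illustrated with a minimal runnable sketch; the helper name is made up here, and the team size depends on the environment (it is 1 if the code is compiled without OpenMP support, since the pragmas are then ignored):

```c
#include <assert.h>

/* Fork-join in its simplest form: the initial thread forks a team,
 * every member executes the region body once, and all threads join
 * at the implicit barrier before the initial thread continues. */
int count_team_threads(void)
{
    int count = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        count++;            /* one increment per team member */
    }
    return count;           /* team size (1 when OpenMP is disabled) */
}
```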
Since version 3.0 (May 2008), OpenMP supports task-based parallelism.
With tasking, the expressiveness of OpenMP is enriched with flexible management of irregular and recursive parallelism.
An OpenMP task is a unit of work scheduled for
asynchronous execution
Has its own data environment
Can share data with other tasks
OpenMP tasking model
int fib(int n);

void main()
{
int res;
#pragma omp parallel
{
#pragma omp single
res = fib(40);
}
}
int fib(int n)
{
int n1, n2;
if(n <= 1) return n;
#pragma omp task shared(n1)
n1 = fib(n-2);
#pragma omp task shared(n2)
n2 = fib(n-1);
#pragma omp taskwait
return n1+n2;
}
OpenMP task example
Device Support added in version 4.0 (November 2013)
An application can be executed by a set of devices: the host device and other target devices
Host-centric execution model
Create data environment in the device
Instruct device to execute code regions (kernel)
OpenMP device model
void main()
{
int input[10], result[10];
init(input);
#pragma omp target map(to:input) map(from:result)
{
int i;
#pragma omp parallel for
for(i=0; i<10; i++)
result[i] = input[i]++;
}
print(result);
}
OpenMP target example
Color legend: executed on host / communication / executed on device
OMPi (http://paragroup.cs.uoi.gr/wpsite/software/ompi):
OpenMP C infrastructure
Source-to-source compiler + runtime libraries
[Diagram: OpenMP C → OMPi compiler → multithreaded C]
Contributions
Tasking
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Tasking in the OMPi Compiler
#pragma omp task firstprivate(x)
{
<CODE>
}
{
/* task directive replacement */
env = capture_task_environment();
ort_new_task(taskFunc0, env);
}
/* Produced task function */
void taskFunc0(void *env)
{
get_env_data(env);
<CODE>
}
/* Fast Path Optimization */
if(must_execute_serially)
{
declare_local_vars();
ort_immediate_start();
<CODE>
ort_task_immediate_end();
}
else
  { /* otherwise create and enqueue the task, as above */ }
Tasking Runtime Architecture
Task queue (TASK_QUEUE): one per OpenMP thread (worker), a DEQueue
Breadth-first approach; depth-first when the queue is full
Fast-path optimization: lower overheads and better data locality, but less parallelism exploited
Back to breadth-first when the queue is 30% empty
Work-stealing between siblings
Lock-free [Chase & Lev, SPAA 2005]
The worker adds and removes elements at the queue's top
A thief removes elements from the queue's bottom, to avoid contention
Tasking runtime architecture
[Diagram: per-thread memory holds the TASK_QUEUE backed by a fixed array of task descriptors (Td), an overflow list, and a descriptor pool for recycling]
Task descriptor: exec flag, descriptor fields, data environment, actual data
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, "Design and Implementation of OpenMP Tasks in the OMPi Compiler", PCI 2011, 15th Panhellenic Conference on Informatics, Kastoria, Greece, September 2011, pp. 265-269
Modern multicore multiprocessors:
Deep cache hierarchies, private memory channels
Non-Uniform Memory Access (NUMA)
A generic tasking runtime will have sub-optimal behaviour on such machines.
We designed & implemented an optimized tasking runtime for NUMA machines:
Maximizes local operations
Minimizes remote accesses: fewer cache misses, low overheads
The work-stealing system uses an efficient blocking algorithm
Optimizations for NUMA machines
Optimizations for NUMA machines
Td recycling issue for thieves:
A thread steals a task; after the task completes, in which descriptor pool should its Td return?
a) The thief's pool? No contention, but 1-producer/N-thieves?
b) The task's creator's pool? Synchronization issues appear
[Diagram: a thief holding a Td next to a TASK_QUEUE and its descriptor pool]
Optimizations for NUMA machines
[Diagram: the thief places the finished Td in a pending queue in the creator's memory; the creator (worker) later drains its pending queue back into its own descriptor pool]
Less synchronization between threads, fewer cache misses, more local data operations
Work-stealing mechanism
A crucial component of an OpenMP runtime.
TASK_QUEUE is a shared object supporting:
OwnerEnqueue
Enqueues a task in a thread's queue
Executed only by the thread that owns the queue
No need for synchronization
Dequeue
Removes the oldest enqueued task
Executed by any thread in an OpenMP team
Synchronization needed!
A Fast Work-stealing Mechanism
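The two operations can be sketched as follows. This is a conceptual, single-threaded sketch of the TASK_QUEUE interface (names and the fixed-size layout are illustrative, not OMPi's actual lock-free code); locking is elided, and a real Dequeue would synchronize thieves, e.g. in the Chase-Lev style cited above:

```c
#include <stddef.h>

#define QCAP 64

typedef struct {
    void *tasks[QCAP];
    int bottom, top;     /* live tasks occupy indices [bottom, top) */
} task_queue_t;

void tq_init(task_queue_t *q) { q->bottom = q->top = 0; }

/* Executed only by the owning thread: no synchronization needed. */
int owner_enqueue(task_queue_t *q, void *task)
{
    if (q->top - q->bottom >= QCAP)
        return -1;       /* full: caller executes inline (depth-first) */
    q->tasks[q->top % QCAP] = task;
    q->top++;
    return 0;
}

/* Executed by any thread of the team: removes the oldest task.
 * This is the operation that requires synchronization. */
void *dequeue(task_queue_t *q)
{
    if (q->bottom >= q->top)
        return NULL;     /* empty */
    return q->tasks[q->bottom++ % QCAP];
}
```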
Based on the CC-Synch object/algorithm of [Fatourou & Kallimanis, PPoPP 2012].
CC-Synch is used for thread synchronization, to implement the Dequeue operation: one instance of CC-Synch per TASK_QUEUE.
The result of the combining technique:
One thread (the combiner) holds a coarse lock
In addition to applying its own operation, it serves the operations of all other active threads
Greatly reduces synchronization costs
A Fast Work-stealing Mechanism
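The combining idea can be sketched in a few lines. This is a conceptual, single-threaded sketch of the technique, not the actual CC-Synch algorithm (all names and the slot structure are illustrative): threads publish Dequeue requests in announce slots, and the one thread that becomes the combiner serves its own request and every other pending one in a single pass, so N operations pay for one lock acquisition instead of N:

```c
#include <stddef.h>

#define NSLOTS 8

typedef struct {
    int pending;   /* 1 = request published and not yet served */
    int result;    /* filled in by the combiner; -1 = queue empty */
} request_t;

/* The combiner serves all pending requests against a simple queue
 * represented by (queue, *head, tail); returns how many it served. */
int combine(request_t slots[NSLOTS], const int *queue, int *head, int tail)
{
    int served = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (!slots[i].pending)
            continue;
        slots[i].result = (*head < tail) ? queue[(*head)++] : -1;
        slots[i].pending = 0;   /* the waiting thread now returns */
        served++;
    }
    return served;
}
```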
Performance evaluation
NUMA Environment
2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
16GB of main memory
Debian Squeeze on the 2.6.32.5 kernel
GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
Intel icc (version 12.1.0) [-fast -openmp]
Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
OMPi uses GNU gcc as a back-end compiler [-O3]
Default Runtime Settings
Synthetic benchmark
Fine-grain tasks (max workload = 128), 16 threads
1 thread produces tasks; the remaining threads execute them
Task load = for loop with 'max workload' repetitions
Barcelona OpenMP Task Suite
Fibonacci
Computes the nth Fibonacci number
Exploits nested task parallelism, which creates a deep tree
Very large number of fine-grain tasks
Input: the 40th Fibonacci number
Uses the new work-stealing implementation and the fast execution path
No manual cut-off
Barcelona OpenMP Task Suite
NQueens
Calculates solutions of the n-queens chessboard problem
Backtracking search algorithm with pruning creates unbalanced tasks
Exploits nested task parallelism, which creates a deep tree of tasks
Input: 14 queens
No manual cut-off
Spiros N. Agathos, Nikolaos D. Kallimanis, Vassilios V. Dimakopoulos, "Speeding Up OpenMP Tasking", Euro-Par 2012, International European Conference on Parallel and Distributed Computing, Rhodes, Greece, August 2012, pp. 650-661.
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Nested parallelism
A parallel region inside a parallel region
Every thread of a team creates its own team of threads
Difficult to handle efficiently: possible processor oversubscription
Typically, nested loops are parallelized using nested parallelism.
Can we replace a nested parallel loop with tasks?
Introduction
Transforming Loop Code Manually

Original:
#pragma omp parallel num_threads(M)
{
  #pragma omp parallel for schedule(static) num_threads(N)
  for (i=LB; i<UB; i++) {
    <body>
  }
  …………
}

Transformed:
#pragma omp parallel num_threads(M)
{
  for (t=0; t<N; t++)
    #pragma omp task
    {
      calculate(N, LB, UB, &lb, &ub);
      for (i=lb; i<ub; i++)
        <body>
    }
  #pragma omp taskwait
  …………
}

The N implicit tasks of the nested parallel region become N explicit tasks; instead of NxM, only M threads exist in the system.
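The calculate() helper from the transformed code can be sketched as below; this is an assumption of what it computes (the explicit task-id parameter t is added here for illustration): the [lb, ub) range that the t-th of N tasks receives under schedule(static), splitting the UB-LB iterations as evenly as possible, with the first `rem` tasks getting one extra iteration:

```c
/* Compute the half-open iteration range [*lb, *ub) assigned to task t
 * out of N under a static split of the iterations [LB, UB). */
void calculate(int t, int N, int LB, int UB, int *lb, int *ub)
{
    int total = UB - LB;
    int chunk = total / N;     /* base iterations per task */
    int rem   = total % N;     /* first `rem` tasks get one extra */
    *lb = LB + t * chunk + (t < rem ? t : rem);
    *ub = *lb + chunk + (t < rem ? 1 : 0);
}
```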
A similar transformation is possible for:
Dynamic
Guided
But: complicated user code
And: impossible to have access to thread-specific data (example)
Manual transformation limitations
In general:
A mini worksharing runtime must be written: which thread will execute which task?
Impossible to handle thread-specific data
But within an OpenMP runtime system:
All the worksharing functionality is already there
Access to all thread-specific data
→ Automatic transformation in the OMPi compiler
Manual transformation limitations
OMPi’s runtime organization
Each OpenMP thread is associated with an EECB (Execution Entity Control Block) holding all OpenMP thread info:
1) Thread ID
2) Parallel level
3) Pointer to parent EECB
4) ………
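An EECB along these lines can be sketched as a C struct; the field and function names below are illustrative, not OMPi's actual declarations:

```c
#include <stddef.h>

/* Per-thread control block, as described on the slide. */
typedef struct eecb {
    int thread_id;        /* OpenMP thread id within its team */
    int parallel_level;   /* nesting level; 0 for the initial thread */
    struct eecb *parent;  /* EECB of the parent thread (NULL at level 0) */
    /* ... scheduling and worksharing state ... */
} eecb_t;

/* Walking the parent chain recovers ancestor information, in the
 * spirit of omp_get_ancestor_thread_num(). */
int id_at_level(const eecb_t *e, int level)
{
    while (e != NULL && e->parallel_level > level)
        e = e->parent;
    return e != NULL ? e->thread_id : -1;
}
```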
Parallel: creation of parallel tasks
#pragma omp parallel num_threads(4)
[Diagram: the initial parallel task (Thread ID = 0, Level = 0) forks four parallel tasks P0-P3]
4 new threads are created
New EECBs are assigned
They execute the parallel region
Nested parallel: creation of special tasks
#pragma omp parallel for num_threads(4)
[Diagram: parallel task P0 (Thread ID = 0, Level = 1) creates special tasks S0-S3]
Special tasks are stored in the TaskQueue
Sibling threads steal special tasks
They change EECB and execute the ‘parallel region’
Compiler side: no changes
Runtime side:
A new type of task, called pfor_task
Emulation of parallel tasks
The same technique works for nested parallel sections
Auto transformation implementation
NUMA Environment
2 x 8-core AMD Opteron 6128 CPUs @ 2GHz
16GB of main memory
Debian Squeeze on the 2.6.32.5 kernel
GNU gcc (version 4.4.5-8) [-O3 -fopenmp]
Intel icc (version 12.1.0) [-fast -openmp]
Oracle suncc (version 12.2) [-fast -xopenmp=parallel]
OMPi uses GNU gcc as a back-end compiler [-O3]
Original, non-optimized task runtime was used
Default Runtime Settings
Evaluation
Synthetic benchmark results
TASK_LOAD = 500
16 threads in the 1st level
N (2nd-level threads) = 4
Create a parallel region with 16 threads
Each thread creates a nested team to execute the for-loop
Iteration workload = for loop with TASK_LOAD repetitions
Face detection results
Takes an image as input and discovers the number of faces depicted
Utilizes nested parallelism in order to obtain better performance
161 images, CMU test set
Spiros N. Agathos, Panagiotis E. Hadjidoukas, Vassilios V. Dimakopoulos, "Task-based Execution of Nested OpenMP Loops", IWOMP 2012, International Workshop on OpenMP, OpenMP in a Heterogeneous World, Rome, Italy, June 2012, pp. 210-222.
Hybrid policy
Idle cores? Create threads
All cores occupied? Create tasks
Not enough cores? Mix tasks/threads
Face detection results
161 images, CMU test set
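The three cases of the hybrid policy can be sketched as a small decision function; the structure and thresholds below are illustrative, not the thesis's exact heuristic:

```c
/* Given the number of currently idle cores and the number of threads
 * a nested region requests, decide how many real threads to create;
 * the remaining work units are turned into tasks for the team. */
typedef struct { int threads; int tasks; } split_t;

split_t hybrid_split(int idle_cores, int requested)
{
    split_t s;
    if (idle_cores <= 0) {                 /* all cores occupied: tasks   */
        s.threads = 0;
        s.tasks = requested;
    } else if (idle_cores >= requested) {  /* enough idle cores: threads  */
        s.threads = requested;
        s.tasks = 0;
    } else {                               /* not enough cores: mix both  */
        s.threads = idle_cores;
        s.tasks = requested - idle_cores;
    }
    return s;
}
```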
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Can OpenMP form the basis of a programming model for multicore embedded accelerators?
The answer is positive, but not straightforward, since:
OpenMP is designed for homogeneous shared-memory systems
Embedded systems:
– Include groups of weak PEs (processing elements)
– Have limited resources
OpenMP for accelerators?
OpenMP for accelerators?
The design and implementation of a programming model for the STHORM architecture (Artemis EU Project No. 100230, SMECY)
Offers decent parallelization without significant programming effort
Execution model
An accelerator is usually a back-end system attached to a host.
Where and how should the OpenMP programming model be applied?
1. On the host side: multiple OpenMP host threads generate multiple jobs on the accelerator
2. On the accelerator side: each host thread can trigger multiple OpenMP threads on the accelerator
3. Our proposal: on both sides!
A general solution: flexible, programmer-friendly
[Diagram: multicore ARM host attached to the accelerator]
Supporting OpenMP
OpenMP on the ARM host:
Minimal changes
New module for communicating with STHORM
OpenMP on STHORM:
Very difficult to implement, due to the limited hardware resources
Full, albeit non-optimized, implementation
Compilation chain
[Diagram: OpenMP C source containing code for the host & the accelerator is split into separate OpenMP C parts, each compiled down to multithreaded C]
EECB management
One Execution Entity Control Block (EECB) per thread:
Assigned to a thread when it starts execution
Freed when the team is disbanded
Problem: placement of EECBs, which are constantly accessed during execution
Solution: scratchpad memory, for guaranteed performance
OK for 1 level of parallelism; infeasible for multiple levels
Our proposal:
Use the TCDM only for the 16 active threads
Use L2 memory for nested teams
Keep all active EECBs in the TCDM
[Diagram: the EECBs of the 16 active threads (IDs 0-15) live in the TCDM; nested-team EECBs live in L2 memory, managed through an empty list]
Parallel regions
An OpenMP parallel region is a group of jobs executed by different PEs.
The PE that meets a parallel region (the master PE):
Suspends the execution of its current job
Sends a request to the CC
Allocates a new EECB
Executes its implicit task
Waits for the CC to notify the end of the parallel region
Returns to its old EECB
The other PEs:
Receive the request from the CC
Acquire an EECB
Start executing their implicit job
Notify the CC when the job completes
Release the EECB
The CC:
Supplies PEs with implicit jobs
Receives end notifications from PEs
Informs the master PE that the parallel region is finished
[Diagram: a cluster of PEs (EnCore) with the TCDM and the CC; M marks the master PE]
Experimental results
Calculation of the Mandelbrot set for an image of 362x208 pixels (the image fits in the TCDM)
We parallelized the computation of the image pixel values
Results for different scheduling policies of the OpenMP parallel for
Spiros N. Agathos, Vasileios V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis, "Deploying OpenMP on an Embedded Multicore Accelerator", SAMOS XIII, International Conference on Embedded Computer Systems: Architectures, Samos, Greece, July 2013, pp. 180-187
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
void main()
{
int input[10], result[10];
init(input);
#pragma omp target map(to:input) map(from:result)
{
int i;
#pragma omp parallel for
for(i=0; i<10; i++)
result[i] = input[i]++;
}
print(result);
}
OpenMP target example
Color legend: executed on host / communication / executed on device
void main()
{
  int input[10], result[10];
  init(input);
  #pragma omp target data map(to:input) map(from:result)
  {
    #pragma omp target
    {
      <Kernel 01 code, accesses input, result>
    }
    <Host code, prepare for new kernel>
    #pragma omp target
    {
      <Kernel 02 code, accesses input, result>
    }
  }
}
OpenMP target example
Color legend: executed on host / communication / executed on device
• $99 board (basic version)
• Epiphany-16 (25 GFLOPS at less than 2 Watts)
• Zynq host running Linux
• Epiphany-16: no OS
• eSDK tools for native programming
Parallella board
Each kernel (i.e. target region) is outlined to a separate function.
The code generation phase produces multiple output files, one for each different kernel, plus the host code (the host may be called to execute any of them).
Compiling for the new device directives
A full-fledged OpenMP runtime library supports execution on the dual-core ARM processor.
Additional functionality is required for controlling and accessing the Epiphany device:
Communication between the host and the eCores takes place through the shared-memory portion of the system RAM
For offloading a kernel, the first idle eCore is chosen, and the precompiled object file is loaded to it for immediate execution
eCores inform the host about the completion of a kernel through special flags in shared memory
Multiple host threads can offload multiple independent kernels concurrently onto the Epiphany
Runtime architecture - what the host does
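The core selection and completion-flag protocol just described can be sketched as follows; plain memory stands in for the Epiphany shared segment here, and all names and flag values are illustrative:

```c
#define NCORES 16

/* Each eCore owns a flag word in the shared-memory control area. */
enum { CORE_IDLE = 0, CORE_BUSY = 1, CORE_DONE = 2 };

typedef struct { volatile int flag[NCORES]; } ctrl_area_t;

/* Host side: pick the first idle eCore for the next kernel. */
int first_idle_core(const ctrl_area_t *c)
{
    for (int i = 0; i < NCORES; i++)
        if (c->flag[i] == CORE_IDLE)
            return i;
    return -1;   /* all cores busy: the host must wait */
}

/* Host side: mark the core busy before loading its kernel .elf. */
void host_offload(ctrl_area_t *c, int core) { c->flag[core] = CORE_BUSY; }

/* eCore side: signal kernel completion through shared memory. */
void ecore_finish(ctrl_area_t *c, int core) { c->flag[core] = CORE_DONE; }
```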
Runtime architecture - what the host does
int X[10], Y[10];
int k;
#pragma omp target data map(X,Y)
{
#pragma omp target map(to:k)
{
/* Kernel code */
}
}
[Diagram: the 32 MiB shared-memory segment holds a 4 KiB device control area and the target data variables X and Y, plus the mapped k; the kernel's data environment contains pointers to X and Y]
Supporting OpenMP within the Epiphany is non-trivial:
eCores do not execute any OS; no provision for dynamic parallelism
The 32 KiB local memory is quite limited: unable to hold sophisticated OpenMP runtime structures
The runtime infrastructure originally designed for the host was trimmed down to a minimum (e.g. tasking = shared queue protected by a lock); it is linked and offloaded with each kernel
The coordination among the participating eCores utilizes the local memory of the team’s master eCore
OpenMP within the Epiphany
#parallel inside a kernel
[Timeline diagram: the Zynq host offloads the kernel to a master eCore through shared memory; the master runs sequential code, requests workers, and initializes & notifies them through on-chip memory; the workers start worker threads, reply to the master core, run the parallel code, and join the team; when the parallel region ends the master does the bookkeeping, continues with sequential code, and acknowledges kernel end to the host while the workers sit idle]
Environment
Parallella-16 SKUA101020
Ubuntu 14.04, kernel 3.12.0 armv7l
gcc and e-gcc v.4.8.2 as back-end
for OMPi
eSDK 5.13.9.10
Experimental results
Experimental results
Overhead results of the EPCC benchmarks
A modified version of the EPCC benchmarks: the basic routines are offloaded through target directives
Measurements taken from the host side, after subtracting any offloading costs
Resetting an eCore: 0.1 sec
Experimental results
Frames per second for the Mandelbrot deep-zoom application (1024x768)
eSDK version: only 8%-13% better
Original code: 301 lines (3 files); OpenMP code: 198 lines (1 file)
Spiros N. Agathos, Alexandros Papadogiannakis, Vassilios V. Dimakopoulos, "Targeting the Parallella", Euro-Par 2015, International European Conference on Parallel and Distributed Computing, Vienna, Austria, August 2015, pp. 662-674
Alexandros Papadogiannakis, Spiros N. Agathos, Vassilios V. Dimakopoulos, "OpenMP 4.0 Device Support in the OMPi Compiler", IWOMP 2015, International Workshop on OpenMP, Heterogeneous Execution and Data Movements, Aachen, Germany, October 2015, pp. 202-216
Presentation overview
Multi-core systems and their programming
Summary of contributions
Tasking support
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Runtime support for embedded systems
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
A kernel may contain any OpenMP directive:
Great flexibility, parallelization expressiveness
But this requires runtime support within the co-processor, and implementing such a runtime is a non-trivial task: GPGPUs, accelerators etc. have special functionality characteristics and/or limited resources
Common solutions:
Provide limited (and sometimes no) support for some of the directives
Design sub-optimal runtime system implementations, depending on the capabilities of a given device
Motivation

Scenario                OMPi executable size (bytes)
Empty kernel            7092
Create a parallel team  10560
A novel runtime organization designed to work with an OpenMP infrastructure.
Instead of a single monolithic runtime system, an adaptive runtime architecture implements only the OpenMP features required by a particular application.
Example: an OpenMP kernel with no explicit tasking
No tasking subsystem required
A barrier with no (time-consuming) tasking extensions
The library includes only the needed functionality → reduced-size executable
Desirable in systems with minimal local memories
Our proposal
The compiler:
Analyzes the kernels
Provides metrics
Selects a particular runtime system configuration to accompany the kernel
The user's code thus implies the choice of an optimized runtime system:
Reduced executable sizes
Faster execution times
Compiler Assisted Runtime Support (CARS)
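The selection step can be sketched as follows; the analysis result is modeled as a bitmask of required OpenMP features, from which the mapper derives the runtime modules to link. All flag and module names here are illustrative, not OMPi's actual configuration:

```c
/* Feature flags a compiler analysis might report for a kernel. */
enum {
    M_TEAMS     = 1 << 0,  /* team management + plain barrier */
    M_TASKING   = 1 << 1,  /* task queues + tasking barrier   */
    M_WORKSHARE = 1 << 2,  /* for/sections/single support     */
    M_CRITICAL  = 1 << 3,  /* named critical / lock support   */
};

/* Map detected features to the module set to link into the
 * kernel-specific library; anything not selected stays out of
 * the executable, shrinking its footprint. */
unsigned select_modules(unsigned features)
{
    unsigned mods = 0;
    if (features & M_TEAMS)     mods |= M_TEAMS;
    if (features & M_WORKSHARE) mods |= M_WORKSHARE | M_TEAMS;
    if (features & M_CRITICAL)  mods |= M_CRITICAL;
    /* explicit tasking also pulls in team support, since the
     * barrier then needs its tasking extensions */
    if (features & M_TASKING)   mods |= M_TASKING | M_TEAMS;
    return mods;
}
```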
CARS Overview
[Diagram: foo.c (user code, C + OpenMP) enters the compiler (transformation & analysis), which emits the kernel code plus analysis metrics; a mapper combines the metrics with the runtime alternatives to build a kernel-specific library; the system compiler/linker then produces the kernel executable and the host code executable]
OpenMP in kernel: whether the kernel code includes any OpenMP directives or not.
Dynamic parallelism: the exact number of threads used in all parallel teams, their nesting level, as well as the presence of the reduction clause along with its parameters.
Work-sharing regions: the types of worksharing regions used in the kernel, i.e. whether for, single or sections regions are present. In the case of for regions, the exact types and parameters of the schedules, that is whether static, dynamic or guided is used, and the chunk size; in addition, the presence of the ordered clause. Finally, whether the program takes advantage of the nowait feature of worksharing.
Explicit tasking: the presence of user-defined tasks, taskgroups and the possible dependencies, if any.
Special synchronization: the presence of the atomic construct and the type of the required functionality; the number and names of the critical regions in the kernel; the number and characteristics (normal or nested) of user-defined locks.
Metrics
Environment
Parallella-16 SKUA101020
Ubuntu 14.04, kernel 3.12.0 armv7l
gcc and e-gcc v.4.8.2 as back-end
for OMPi
eSDK 5.13.9.10
Experimental results
Experimental Results
.elf sizes in bytes
Scenario          Full RTS   CARS    Difference
Mandelbrot        13156      9620    26.88%
Empty Kernel      8228       2252    73.63%
Pi Calculation    11972      8864    25.96%
Nqueens (tasks)   20908      19704   5.76%
EPCC-for-static   14176      10944   22.80%
EPCC-critical     12560      9320    35.80%
EPCC-single       12200      8900    27.05%
EPCC-ordered      14192      10952   22.83%
Experimental Results
Execution times (seconds for the applications, microseconds for the EPCC overheads)
Spiros N. Agathos, Vasileios V. Dimakopoulos, “Compiler-Assisted OpenMP Runtime
Organization for Embedded Multicores”, Technical Report, Number 2016-01, University of
Ioannina, Department of Computer Science & Engineering, April 2016
Scenario Full RTS CARS Difference
Mandelbrot 30.05(sec) 30.00(sec) 0.16%
Empty Kernel 0.10(sec) 0.10(sec) 0%
Pi Calculation 0.28(sec) 0.26(sec) 7.14%
Nqueens (tasks) 1.81(sec) 1.81(sec) 0%
EPCC-for-static 72.65 19.85 72.68%
EPCC-critical 2.17 1.55 39.98%
EPCC-single 83.72 14.92 28.57%
EPCC-ordered 4.70 4.66 0.85%
Conclusion
OpenMP is an easy-to-use programming model:
More powerful due to the addition of tasking facilities
Generally applicable due to the device constructs
Contributions related to tasks
General design & optimization for NUMA machines
Transform nested parallel loops to tasks
Contributions related to OpenMP for devices
STHORM multi-core architecture
Epiphany accelerator
Application-driven runtime support
Optimize the Epiphany runtime to better exploit the hardware characteristics
Support the OpenMP 4 device constructs for various devices, e.g. GPGPUs
OpenMP extensions:
Data-block transfers
Resident kernels
Fine-grained synchronization between host and device
Future Work