This presentation describes the XiTAO scheduler for heterogeneous computing, currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by building a model of the platform's static and dynamic heterogeneity.
Elastic multicore scheduling with the XiTAO runtime
1. Elastic multicore scheduling with the XiTAO runtime
Jing Chen, Pirah Noor, Mustafa Abduljabbar, Miquel Pericàs
Chalmers University of Technology
Embedded Multicore Programming - Industrial state-of-the-art and future directions
Edinburgh, April 17th, 2019
3. Heterogeneity as a dynamic property
04/25/19 CSW Spring 2019
Heterogeneity: cores in the system have different performance, energy efficiency, etc.
Two types of heterogeneity: static and dynamic
● Static:
  – big.LITTLE, CPU-GPU
● Dynamic:
  – DVFS, cache partitioning, interference
  – Interference:
    ● Intra-process: cache, memory oversubscription
    ● Inter-process: cache, memory, processor timesharing
● Heterogeneity needs to be addressed dynamically by the runtime!
4. EU LEGaTO Project
22/01/2019 HiPEAC CSW Spring 2019
• Create software stack support for energy-efficient heterogeneous computing
6. Mixed-mode parallelism
Many applications can be expressed as mixed-mode parallel applications := external task parallelism + internal data parallelism
Naturally supports hierarchy/heterogeneity in modern architectures
Challenge: how to schedule? how many resources?
(#pragma omp parallel for ... can be generalized to other forms of parallelism!)
7. XiTAO mixed-mode runtime
1. Schedule external task parallelism via work stealing + locally expand internal parallel tasks across multiple cores
2. Reduce inter-task interference by decoupling internal parallelism from resources: Task Assembly Objects (TAO)
Benefits: improves parallel slackness, bulk creation of parallelism (low overhead), interference avoidance, constructive sharing
8. XiTAO application
● Example of 2D stencil execution on XiTAO
[Figure: application DAG of a 2D stencil mapped onto resource partitions of width w=1 and w=2]
9. Elastic Places: Adaptivity
● Example: Cilksort reduction on 48 cores. Dynamically resize places as external parallelism decreases and TAO working set increases
● Each colored box is a resource container, executing one TAO
Quick generation of parallelism, low overheads, and good isolation + constructive sharing
10. XiTAO implementation
● XiTAO is fully implemented in C++11
● Decentralized design targeting scalability
[Figure: XiTAO API layers: basic TAO class (XiTAO); user-level API for defining TAOs; user-level API for defining TAO-DAGs + locality-awareness]
11. 22/01/2019 HiPEAC CSW Spring 2019 11
critical
path
internal DAG
fixed resource
container (cores, caches, ...)
Task Assembly Object (TAO)external
task
DAG
Heterogeneous scheduling
Main Idea: map only those tasks to high performance cores that
benefit due to criticality or due to performance characteristics
Faster Cores Slower Cores
Heterogeneous Platforms:
HiKEY 960,
Nvidia Jetson TX2
PTT
schedule
Performance Monitor
“Performance Trace Table”
12. Performance Trace Table (PTT)
• Function: record the running time on each core at each resource width
• Aim: determine the best core and the best width among the available resources, for efficient resource usage
• Implementation: a table of size core_number * resource_width; one PTT for each task type (in XiTAO: for each TAO type)
Resource width := number of cores that execute a TAO
13. Random DAGs
[Figure: throughput (TAOs/s, 500-1500) versus task number (250-4000) and average DAG parallelism (1-16), comparing the homogeneous scheduler (random work stealing) with the performance-based scheduler (PTT-based)]
Runtime assessment of resource partitions + criticality-aware scheduling