This presentation describes the XiTAO scheduler for heterogeneous computing, currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by building a model of the platform's static and dynamic heterogeneity.
Elastic multicore scheduling with the XiTAO runtime
1. Elastic multicore scheduling with the XiTAO runtime
Jing Chen, Pirah Noor, Mustafa Abduljabbar, Miquel Pericàs
Chalmers University of Technology
Embedded Multicore Programming - Industrial state-of-the-art and future directions
Edinburgh, April 17th, 2019
3. Heterogeneity as a dynamic property
04/25/19 CSW Spring 2019
Heterogeneity: cores in the system have different performance, energy efficiency, etc.
Two types of heterogeneity: static and dynamic
● Static:
  – big.LITTLE, CPU-GPU
● Dynamic:
  – DVFS, cache partitioning, interference
  – Interference:
    ● Intra-process: cache, memory oversubscription
    ● Inter-process: cache, memory, processor timesharing
● Heterogeneity needs to be addressed dynamically by the runtime!
4. EU LEGaTO Project
22/01/2019 HiPEAC CSW Spring 2019
• Create software stack support for energy-efficient heterogeneous computing
6. Mixed-mode parallelism
Many applications can be expressed as mixed-mode parallel applications := external task parallelism + internal data parallelism
Naturally supports hierarchy/heterogeneity in modern architectures
Challenge: how to schedule? how many resources?
(#pragma omp parallel for ... can be generalized to other forms of parallelism!)
7. XiTAO mixed-mode runtime
1. Schedule external task parallelism via work stealing + locally expand internal parallel tasks across multiple cores
2. Reduce inter-task interference by decoupling internal parallelism from resources: Task Assembly Objects (TAO)
Benefits: improves parallel slackness, bulk creation of parallelism (low overhead), interference avoidance, constructive sharing
8. XiTAO application
● Example of 2D stencil execution on XiTAO
[Figure: application DAG of a 2D stencil mapped onto resource partitions of width w=1 and w=2]
9. Elastic Places: Adaptivity
● Example: Cilksort reduction on 48 cores. Dynamically resize places as external parallelism decreases and TAO working set increases
● Each colored box is a resource container, executing one TAO
Quick generation of parallelism, low overheads, and good isolation + constructive sharing
10. XiTAO implementation
● XiTAO is fully implemented in C++11
● Decentralized design targeting scalability
[Figure: XiTAO API layers: basic TAO class (XiTAO); user-level API for defining TAOs; user-level API for defining TAO-DAGs + locality-awareness]
11. 22/01/2019 HiPEAC CSW Spring 2019 11
critical
path
internal DAG
fixed resource
container (cores, caches, ...)
Task Assembly Object (TAO)external
task
DAG
Heterogeneous scheduling
Main Idea: map only those tasks to high performance cores that
benefit due to criticality or due to performance characteristics
Faster Cores Slower Cores
Heterogeneous Platforms:
HiKEY 960,
Nvidia Jetson TX2
PTT
schedule
Performance Monitor
“Performance Trace Table”
12. Performance Trace Table (PTT)
• Function: record the running time on each core at each resource width
• Aim: determine the best core and the best width among the available resources, for efficient resource usage
• Implementation: a table of size core_number * resource_width; one PTT for each task type (in XiTAO: for each TAO type)
Resource width := number of cores that execute a TAO
13. Random DAGs
[Figure: throughput (TAOs/s, 500-1500) versus task number (250-4000) and average DAG parallelism (1-16), comparing the homogeneous scheduler (random work stealing) with the performance-based scheduler (PTT-based)]
Runtime assessment of resource partitions + criticality-aware scheduling