Elastic multicore scheduling with the XiTAO runtime

This presentation describes the XiTAO scheduler for heterogeneous computing that is currently under development in the EU LEGaTO project. The scheduler targets mixed-mode parallelism and assigns resource partitions just-in-time by creating a model of the platform's static and dynamic heterogeneity.

Published in: Technology

  1. Elastic multicore scheduling with the XiTAO runtime
     Jing Chen, Pirah Noor, Mustafa Abduljabbar, Miquel Pericàs
     Chalmers University of Technology
     Embedded Multicore Programming - Industrial state-of-the-art and future directions
     Edinburgh, April 17th, 2019
  2. Heterogeneous-Parallel Platforms (22/01/2019, HiPEAC CSW Spring 2019)
     Heterogeneity + parallelism are common in embedded platforms
     ● Power-efficiency, battery-constrained devices
     ● Examples:
       – ARM big.LITTLE
       – Nvidia Jetson TX2 (Denver2/A57/Pascal)
       – Dynamic heterogeneity: DVFS, interference, cache partitioning
     [Figure labels: HiKEY 960; Nvidia Jetson TX2]
  3. Heterogeneity as a dynamic property
     Heterogeneity: cores in the system differ in performance, energy efficiency, etc.
     Two types of heterogeneity: static and dynamic
     ● Static:
       – big.LITTLE, CPU-GPU
     ● Dynamic:
       – DVFS, cache partitioning, interference
       – Interference:
         ● Intra-process: cache, memory oversubscription
         ● Inter-process: cache, memory, processor timesharing
     ● Heterogeneity needs to be addressed dynamically by the runtime!
  4. EU LEGaTO Project
     • Create software-stack support for energy-efficient heterogeneous computing
  5. EU LEGaTO Project: XiTAO
  6. Mixed-mode parallelism
     ● Many applications can be expressed as mixed-mode parallel applications := external task parallelism + internal data parallelism
     ● Naturally supports hierarchy/heterogeneity in modern architectures
     ● Challenge: how to schedule? how many resources?
     #pragma omp parallel for... can be generalized to other forms of parallelism!
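The definition on this slide (external task parallelism + internal data parallelism) can be illustrated with a minimal sketch. It is not XiTAO code: the function names `scale_task` and `run_mixed_mode` and the array-scaling workload are made up for illustration, and plain OpenMP pragmas stand in for the runtime. The outer loop distributes independent tasks; each task runs a data-parallel loop on `width` threads (assumed to be positive).

```cpp
#include <vector>

// Internal data parallelism: one task scales one array, and the loop inside
// the task is shared among `width` threads (hypothetical example workload).
void scale_task(std::vector<double>& data, double factor, int width) {
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for num_threads(width)
    for (long i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// External task parallelism: every array is an independent task.
void run_mixed_mode(std::vector<std::vector<double> >& arrays,
                    double factor, int width) {
    const long tasks = static_cast<long>(arrays.size());
    #pragma omp parallel for
    for (long t = 0; t < tasks; ++t) {
        scale_task(arrays[t], factor, width);
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result. With OpenMP, the inner regions only run in parallel when nested parallelism is enabled, for example through the `OMP_MAX_ACTIVE_LEVELS` environment variable. Choosing `width` per task is exactly the "how many resources?" question the slide raises.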
  7. XiTAO mixed-mode runtime
     ● Improves parallel slackness
     ● Bulk creation of parallelism (low overhead)
     ● Interference avoidance
     ● Constructive sharing
     1. Schedule external task parallelism via work stealing + locally expand internal parallel tasks across multiple cores
     2. Reduce inter-task interference by decoupling internal parallelism from resources: Task Assembly Objects (TAO)
  8. XiTAO application
     ● Example of 2D stencil execution on XiTAO
     [Figure labels: Application; w=2; w=1]
  9. Elastic Places: Adaptivity
     ● Example: Cilksort reduction on 48 cores. Places are resized dynamically as external parallelism decreases and the TAO working set increases
     ● Each colored box is a resource container executing one TAO
     Quick generation of parallelism, low overheads, and good isolation + constructive sharing
  10. XiTAO implementation
      ● XiTAO is fully implemented in C++11
      ● Decentralized design targeting scalability
      [Figure labels, XiTAO API: Basic TAO class (XiTAO); User-level API for defining TAOs; User-level API for defining TAO-DAGs + locality-awareness]
  11. Heterogeneous scheduling
      Main idea: map to high-performance cores only those tasks that benefit, due to criticality or due to performance characteristics
      Heterogeneous platforms: HiKEY 960, Nvidia Jetson TX2
      [Figure labels: external task DAG; critical path; Task Assembly Object (TAO); internal DAG; fixed resource container (cores, caches, ...); Faster Cores; Slower Cores; PTT; schedule; Performance Monitor "Performance Trace Table"]
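The slide does not say how criticality is determined. One common way to find the critical path of a task DAG is a longest-path computation, sketched below under stated assumptions: every task has unit cost, tasks are numbered in topological order (each edge goes from a lower to a higher index), and the function name `critical_tasks` is hypothetical. This is an illustration of the idea, not the XiTAO implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// succ[i] lists the successors of task i. Assumption: tasks are numbered in
// topological order and every task has unit cost.
// Returns a flag per task: true if the task lies on a longest path.
std::vector<bool> critical_tasks(
        const std::vector<std::vector<std::size_t> >& succ) {
    const std::size_t n = succ.size();
    std::vector<std::size_t> down(n, 1);  // longest chain starting at task i
    std::vector<std::size_t> up(n, 1);    // longest chain ending at task i

    // Walk backwards so that successors are finished before their parents.
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t s : succ[i]) down[i] = std::max(down[i], down[s] + 1);
    }
    // Walk forwards so that parents are finished before their successors.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t s : succ[i]) up[s] = std::max(up[s], up[i] + 1);
    }

    std::size_t longest = 0;
    for (std::size_t i = 0; i < n; ++i) longest = std::max(longest, down[i]);

    // A task is critical when the longest chain through it is a longest path.
    std::vector<bool> critical(n, false);
    for (std::size_t i = 0; i < n; ++i) {
        critical[i] = (up[i] + down[i] - 1 == longest);
    }
    return critical;
}
```

For a DAG with edges 0→1, 0→2, 1→3, 2→4, 3→4, the longest path is 0, 1, 3, 4, so task 2 is the only non-critical task. A scheduler following the slide's main idea would prefer the faster cores for the flagged tasks and leave the others to the slower cores.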
  12. Performance Trace Table (PTT)
      • Function: record the execution time measured on each core for each resource width
      • Aim: determine the best core and the best width for execution within the available resources, so that resources are used efficiently
      • Implementation: table of size core_number * resource_width
      One PTT for each task type (in XiTAO: for each TAO type)
      Resource width := number of cores that execute a TAO
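A table of size core_number * resource_width, as described on this slide, could look roughly like the sketch below. Several details are assumptions made for illustration and are not stated in the slides: the class name `PerformanceTraceTable`, the 4:1 weighted average used to blend new measurements into old ones, and the selection criterion (smallest time multiplied by width, as a stand-in for "efficient resource usage"). Widths are indexed from 1 to `max_width`, and constraints on which cores can form a partition are ignored.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical PTT for one task type: one entry per (core, width) pair.
class PerformanceTraceTable {
public:
    PerformanceTraceTable(std::size_t cores, std::size_t max_width)
        : cores_(cores), max_width_(max_width),
          time_ms_(cores * max_width, 0.0) {}

    // Store a measurement. The first sample is taken as-is; later samples
    // are blended 4:1 with the stored value (assumed smoothing rule).
    void record(std::size_t core, std::size_t width, double elapsed_ms) {
        double& slot = time_ms_[index(core, width)];
        slot = (slot == 0.0) ? elapsed_ms : (4.0 * slot + elapsed_ms) / 5.0;
    }

    double lookup(std::size_t core, std::size_t width) const {
        return time_ms_[index(core, width)];
    }

    // Return the (core, width) pair with the smallest time * width.
    // Entries without measurements hold 0 and are therefore tried first.
    std::pair<std::size_t, std::size_t> best() const {
        std::size_t best_core = 0;
        std::size_t best_width = 1;
        double best_cost = std::numeric_limits<double>::infinity();
        for (std::size_t core = 0; core < cores_; ++core) {
            for (std::size_t width = 1; width <= max_width_; ++width) {
                const double cost =
                    lookup(core, width) * static_cast<double>(width);
                if (cost < best_cost) {
                    best_cost = cost;
                    best_core = core;
                    best_width = width;
                }
            }
        }
        return std::make_pair(best_core, best_width);
    }

private:
    std::size_t index(std::size_t core, std::size_t width) const {
        return core * max_width_ + (width - 1);
    }

    std::size_t cores_;
    std::size_t max_width_;
    std::vector<double> time_ms_;
};
```

Because the table is updated after every execution, a later lookup reflects the platform as it currently behaves rather than as it was specified, which is what lets one structure cover both static and dynamic heterogeneity.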
  13. Random DAGs
      ● Runtime assessment of resource partitions + criticality-aware scheduling
      [Figure: two plots of throughput (performance) in TAOs/s (scale 500 to 1500) over task number (250, 500, 1000, 2000, 4000) and average DAG parallelism (1, 2, 4, 8, 16). Panels: Performance-based Scheduler (PTT-based); Homogeneous Scheduler (random work stealing)]
  14. Interference-awareness
      ● Detects interference episodes and migrates critical tasks
      [Figure: per-thread schedule (threads 0 to 9) over elapsed time (0 to 14 s), together with the PTT value in ms (scale 8 to 20): PTT evolution for core=0 & width=1. Annotations: tasks with multiple resources; critical task schedules; interference episode]
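The slide does not give the detection rule, so the following is only a sketch of how a rising PTT value could trigger migration of a critical task. Everything here is hypothetical: the function names, the use of a per-core baseline, and the relative `tolerance` threshold. The vectors hold the width-1 PTT values of each core in milliseconds and are assumed to be non-empty and of equal length.

```cpp
#include <cstddef>
#include <vector>

// Assumed rule: a core counts as disturbed when its current PTT value
// exceeds its baseline by more than `tolerance` (0.25 means 25 %).
bool is_interfered(double baseline_ms, double current_ms, double tolerance) {
    return current_ms > baseline_ms * (1.0 + tolerance);
}

// Index of the core with the smallest current PTT value.
std::size_t fastest_core(const std::vector<double>& current_ms) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < current_ms.size(); ++i) {
        if (current_ms[i] < current_ms[best]) best = i;
    }
    return best;
}

// Keep a critical task on its core unless that core is disturbed; in that
// case move it to the core that currently looks fastest.
std::size_t place_critical_task(std::size_t current_core,
                                const std::vector<double>& baseline_ms,
                                const std::vector<double>& current_ms,
                                double tolerance) {
    if (is_interfered(baseline_ms[current_core], current_ms[current_core],
                      tolerance)) {
        return fastest_core(current_ms);
    }
    return current_core;
}
```

With baselines of 10, 10 and 12 ms and current values of 18, 10.5 and 12 ms, a critical task on core 0 would move to core 1 at a 25 % tolerance, while a task already on core 1 would stay put.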
  15. Current directions: VGG-16
      ● Porting VGG-16 in the Darknet framework to XiTAO
      ● PTT automatically finds the best widths to execute VGG-16 on the dual-socket Intel platform (20 cores)
      [Figure labels: VGG-16 layers CONV3-64 (x2), CONV3-128 (x2), CONV3-256 (x3), CONV3-512 (x6), FC-4096 (x2), FC-1000, maxpool, softmax; 16 GEMM boxes; XiTAO: TAO 0, TAO 1, ..., TAO N]
      Percentage of TAOs w.r.t. TAO width, by number of threads:
      Threads   w=1     w=2     w=4     w=8     w=16
      2         69.06   30.94   -       -       -
      4         90.89   5.83    3.28    -       -
      8         66.67   3.38    0.74    29.21   -
      16        53.81   1.68    14.76   29.31   0.45
  16. Future Directions
      ● Front-ends for XiTAO
        – OmpSs to XiTAO
        – Array (tensor) programming
      ● Low-energy runtime optimizations
      ● Automatic DAG partitioning for generation of mixed-mode computations
  17. Thank you!
      Acknowledgements: The XiTAO team
      Jing Chen, Pirah Noor, Mustafa Abduljabbar, Miquel Pericàs