Scheduling Task-parallel Applications in Dynamically Asymmetric Environments

LEGATO project
Sep. 3, 2020

  1. Scheduling Task-parallel Applications in Dynamically Asymmetric Environments. Chalmers University of Technology. Jing Chen, Pirah Noor Soomro, Mustafa Abduljabbar, Madhavan Manivannan, Miquel Pericàs. Scheduling and Resource Management for Parallel and Distributed Systems, 2020-08-10.
  2. OUTLINE 1. Motivation 2. Background 3. Dynamic Asymmetry Scheduler 4. Implementation 5. Experimental Setup 6. Evaluation 7. Conclusion
  4. Motivation: Applications observe performance variability due to a variety of dynamic events, collectively called interference (e.g., I/O, user activity, co-scheduled jobs, OS power management). As the scale of system heterogeneity keeps increasing, tackling interference becomes an even larger concern. Common strategies to reduce interference include eliminating unnecessary activities at the system level. How about mitigating interference in task-parallel runtimes?
  5. What is Performance Asymmetry? Individual cores have different progress rates (e.g., MIPS) during execution. Applications are not directly aware of interference or of its source; rather, they observe temporary episodes of performance asymmetry.
  6. Sources of Performance Asymmetry. Fixed asymmetry: hardware asymmetry, i.e., cores with different compute capabilities, e.g., Arm big.LITTLE (permanent; does not evolve over time). Dynamic asymmetry: execution-time activities, e.g., DVFS for power management, sharing of resources between applications, or dynamic recompilation on some platforms (e.g., Nvidia Jetson TX2). Our paper addresses two scenarios of interference due to dynamic asymmetry, caused by (1) DVFS activity and (2) co-scheduled applications.
  7. Proposed runtime scheduling techniques to address interference. Moldability: flexibility to execute a task as either a single-threaded or a multithreaded task. Task criticality: knowledge of task criticality to handle performance asymmetry. How can moldability and knowledge of task criticality be leveraged to address interference?
  9. Application Model. Directed acyclic graph (DAG): a common way to express multithreaded applications. High-priority tasks (critical tasks): tasks that release a large number of dependent tasks, or tasks that lie on the DAG's critical path. DAG parallelism: the total number of tasks divided by the length of the critical path. Task type: functions implemented as tasks. Execution place [leader core, resource width]: the leader core is the starting thread number and the resource width (RW) is the number of cores. In the example DAG: T0, T1, T5 and T9 are high-priority (critical) tasks; the critical path is T0, T1, T5, T9; DAG parallelism = 4; the execution place T0[0,2] means T0 is scheduled on cores 0 and 1.
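The two definitions on this slide (DAG parallelism and execution places) can be illustrated with a minimal Python sketch. The DAG below is hypothetical (four independent chains of four tasks each), not the DAG drawn on the slide, and the helper names `dag_parallelism` and `cores_of` are made up for illustration; they are not part of the XiTAO runtime.

```python
from functools import lru_cache


def dag_parallelism(successors):
    """Return (parallelism, critical path length) for a DAG.

    `successors` maps every task to the list of tasks that depend on it.
    The critical path length is counted in tasks, as on the slide.
    """
    @lru_cache(maxsize=None)
    def longest_from(task):
        # Number of tasks on the longest path that starts at `task`.
        return 1 + max((longest_from(s) for s in successors[task]), default=0)

    critical_path_length = max(longest_from(t) for t in successors)
    return len(successors) / critical_path_length, critical_path_length


def cores_of(leader_core, resource_width):
    """Cores covered by an execution place [leader core, resource width]."""
    return list(range(leader_core, leader_core + resource_width))


# Hypothetical DAG: 4 layers x 4 columns, each task feeds the task below it.
LAYERS, COLUMNS = 4, 4
successors = {
    (layer, col): [(layer + 1, col)] if layer + 1 < LAYERS else []
    for layer in range(LAYERS)
    for col in range(COLUMNS)
}

parallelism, critical_path_length = dag_parallelism(successors)
print(parallelism, critical_path_length)  # 16 tasks / 4 tasks on the critical path
print(cores_of(0, 2))                     # execution place [0,2] covers cores 0 and 1
```

With 16 tasks and a critical path of 4 tasks, this hypothetical DAG has the same parallelism (4) as the slide's example, and `cores_of(0, 2)` matches the slide's reading of T0[0,2].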
  11. Dynamic Asymmetry Scheduler. The proposed scheduler is built on two key ideas: (1) a performance modeling approach called the Performance Trace Table (PTT) is used to detect and predict performance asymmetry at a per-task level; (2) scheduling techniques are applied to minimize the impact of interference: predicting the best possible execution place for high-priority tasks according to the PTT, and molding tasks to reduce inter-task contention and resource oversubscription caused by interference.
  12. Two variants of the Dynamic Asymmetry Scheduler. DAM-C: Dynamic Asymmetry scheduler with Moldability, targeting parallel Cost. It reduces parallel cost (execution time × number of cores), minimizes resource usage, and protects against overcommitting the fast cores. DAM-P: Dynamic Asymmetry scheduler with Moldability, targeting best Performance with knowledge of critical tasks. It performs a global search for critical tasks and selects the execution place that minimizes the predicted execution time.
  14. Performance Trace Table (PTT). Goal: performance prediction for future tasks given a set of resources. Entries are indexed by an elastic execution place tuple (leader core, resource width); table size = #cores × #valid RW; all entries are initialized to 0; the leader core updates the table; update method: weighted ratio 1 (new) : 4 (old); one PTT for each task type.
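The table layout and update rule listed on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the XiTAO implementation: the class name `PerformanceTraceTable`, its methods, and the choice of valid widths (1, 2, 4) are assumptions. The sketch also assumes that the 1 : 4 weighting is applied uniformly, including to entries still at their initial value of 0, which the slide does not specify.

```python
class PerformanceTraceTable:
    """One table per task type, indexed by (leader core, resource width)."""

    def __init__(self, num_cores, valid_widths):
        self.valid_widths = tuple(valid_widths)
        # Table size = #cores x #valid RW; every entry starts at 0.
        self.table = {
            (core, width): 0.0
            for core in range(num_cores)
            for width in self.valid_widths
        }

    def update(self, leader_core, width, measured_time):
        """Blend a new measurement into the entry with weights 1 (new) : 4 (old)."""
        old = self.table[(leader_core, width)]
        self.table[(leader_core, width)] = (4.0 * old + measured_time) / 5.0

    def predict(self, leader_core, width):
        """Predicted execution time for this task type on the given place."""
        return self.table[(leader_core, width)]


# Hypothetical usage: 6 cores, assumed valid widths 1, 2 and 4.
ptt = PerformanceTraceTable(num_cores=6, valid_widths=(1, 2, 4))
ptt.update(0, 1, 10.0)          # (4 * 0 + 10) / 5
first = ptt.predict(0, 1)
ptt.update(0, 1, 10.0)          # (4 * 2 + 10) / 5
second = ptt.predict(0, 1)
print(len(ptt.table), first, second)
```

Under these assumptions a single slow measurement moves an entry only one fifth of the way toward the new value, so the table follows sustained changes in execution time rather than isolated outliers.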
  15. Task Scheduling Flow Chart. High-priority (critical) tasks: global search of the whole PTT; determine the best entry (leader core, resource width); minimize either parallel cost or execution time. Low-priority (noncritical) tasks: local search of one row of the PTT; determine the best resource width without migrating the task away from the current core; minimize the parallel cost. [Flow chart: Start; Priority? Low: local search. High: DAM-P: global search, minimize execution time; DAM-C: global search, minimize parallel cost. Schedule; End.]
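The decision flow on this slide can be sketched as a single selection function. This is a hedged illustration rather than the runtime's code: the function names `parallel_cost` and `choose_place`, the plain-dictionary table, and all timing values are hypothetical, and the sketch assumes every table entry already holds a measured prediction.

```python
def parallel_cost(place, predicted_time):
    """Parallel cost = execution time x number of cores (the resource width)."""
    _leader_core, width = place
    return predicted_time * width


def choose_place(ptt, is_critical, current_core, variant):
    """Pick a (leader core, resource width) entry for one task.

    `ptt` maps (leader core, resource width) to a predicted execution time.
    `variant` is "DAM-C" or "DAM-P".
    """
    if is_critical:
        # High priority: global search over the whole table.
        candidates = ptt
        minimize_time = variant == "DAM-P"
    else:
        # Low priority: local search of the current core's row, no migration.
        candidates = {p: t for p, t in ptt.items() if p[0] == current_core}
        minimize_time = False

    if minimize_time:
        return min(candidates, key=lambda p: candidates[p])
    return min(candidates, key=lambda p: parallel_cost(p, candidates[p]))


# Hypothetical predictions; core 0 is slowed down by interference.
example_ptt = {
    (0, 1): 9.0, (0, 2): 5.0, (1, 1): 4.0,
    (2, 1): 8.0, (2, 2): 4.5, (2, 4): 3.0, (3, 1): 8.0,
}

critical_dam_p = choose_place(example_ptt, True, 2, "DAM-P")
critical_dam_c = choose_place(example_ptt, True, 2, "DAM-C")
noncritical = choose_place(example_ptt, False, 2, "DAM-C")
print(critical_dam_p, critical_dam_c, noncritical)
```

With these made-up numbers, DAM-P sends the critical task to the widest place (2, 4) because it has the lowest predicted time, DAM-C picks (1, 1) because it has the lowest time × width product, and a noncritical task on core 2 stays there at width 1.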
  16. Performance Trace Table (PTT). Advantages: awareness of interference activities is reflected by the dynamic change of execution time records; requires little information, only the number of cores and the core-cluster organization (hwloc); updated dynamically during execution, not in a profiling phase; independent of platforms; low overhead: 0.2% to 0.5% on TX2. Limitation: scalability on large systems (future work).
  18. Experimental Setup. Platforms: NVIDIA Jetson TX2 (asymmetric platform): NVIDIA Denver, faster, 2 cores, 2MB L2 cache + ARM A57, slower, 4 cores, 2MB L2 cache; Intel 2650v3 Haswell node (symmetric platform): 20 cores in two sockets, 10 cores per socket. Benchmarks: synthetic microbenchmark kernels: Matrix Multiplication (compute-bound), Copy (memory-bound), Stencil (cache-bound); two real applications: K-means Clustering and Distributed 2D Heat.
  19. Evaluated Scheduling Policies [1]. Random Work-Stealing (RWS): typical greedy scheduler (baseline); not aware of core asymmetry. RWSM-C: RWS + moldability (PTT => best RW). Fixed Asymmetry (FA): aware of core asymmetry before execution; aware of task criticality before execution; critical tasks => faster cores, cannot be stolen. FAM-C: FA + moldability (PTT => best RW). Dynamic Asymmetry (DA): PTT selects the fastest core for critical tasks; no task moldability, RW = 1. DAM-C: with task moldability; all tasks target minimizing parallel cost; critical tasks cannot be stolen. DAM-P: with task moldability; critical tasks target performance and cannot be stolen. [1] XiTAO runtime: https://github.com/mpericas/xitao
  21. Dynamic Asymmetry Awareness. Interference on TX2: a matrix multiplication kernel co-running on one Denver core (C0,1). [Figure: Synthetic Matrix Multiplication; throughput (tasks/s) vs. DAG parallelism (2 to 6) for RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P.] Task moldability improves the performance of RWS. Dynamic asymmetry schedulers perform better. DAM-C achieves up to 3.5× speedup over RWS. DAM-C achieves up to 90% and 85% performance improvement over FA and FAM-C, respectively.
  22. Priority Task Distribution. Synthetic Matrix Multiplication benchmark, DAG parallelism = 2 (50% priority tasks). [Figure: one pie chart per scheduler (RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P) showing the share of priority tasks per execution place (leader core, resource width). Chart data in extraction order: (C0,1) 50%, (C1,1) 50%; (C0,1) 35%, (C1,1) 48%, (C0,2) 17%; (C0,1) 2%, (C1,1) 98%; (C0,1) 2%, (C1,1) 92%, (C0,2) 2%, (C2,4) 4%; (C0,1) 1.8%, (C1,1) 96.7%, (C0,2) 1.3%; (C0,1) 3.7%, (C1,1) 9%, (C0,2) 4.5%, (C2,1) 16%, (C3,1) 16.6%, (C2,2) 4.4%, (C4,1) 15.5%, (C5,1) 16%, (C4,2) 5.9%, (C2,4) 8.6%; (C0,1) 14%, (C1,1) 24%, (C2,1) 16%, (C3,1) 15%, (C4,1) 15%, (C5,1) 16%.] Annotations: "Interference is running here!"; "Few priority tasks execute on the interference core!"
  23. DVFS Awareness. DVFS: periodic frequency changes of the Denver cluster on TX2: max frequency for 5 s => min frequency for 5 s => max frequency for 5 s => … [Figure: Synthetic Stencil on TX2; throughput (tasks/s) vs. DAG parallelism (2 to 6) for RWS, RWSM-C, FA, FAM-C, DA, DAM-C, DAM-P.] DA, DAM-C and DAM-P are more resilient to DVFS. DAM-P performs better with low parallelism.
  25. Conclusion. Random work-stealing schedulers and fixed asymmetry schedulers are not effectively aware of dynamic changes in the system. The PTT achieves better interference awareness through online dynamic performance tracing on a per-task basis, for both DVFS and co-running background applications. Task moldability sometimes makes schedulers more resilient to interference.
  26. Thank you!