Scheduler performance in manycore architecture

  • Unbalanced work distribution at low capacity. Capacity reduces run-time. Unbalanced scheduling with deep queues.
  • Latency is added to the task’s run time. Queues hide latency. Synchronization points cannot be compensated for by queues.
  • Low capacity cannot utilize all the cores
  • Queues combined with low capacity generate imbalance (only the low cores receive an instance into their queue and hide latency)
  • The low-capacity scheduler spreads the access time to the shared bank
  • An instance of task G is stuck behind task E
  • Possibly requiring a more complex scheduler, more complex hardware, and enhanced communication bandwidth, and incurring higher power and latency
  • Queues are shared between 2 cores
  • Notice that sharing will not always solve this problem
  • The scheduler does not schedule new tasks to a queue to which it has scheduled a “long” task
  • Break long tasks into many fine-grained tasks. In this case, we break task E into 3 parts
  • Very parallel. Long tasks, so a one-slot queue and low capacity are sufficient
  • Task D is very short (7 cycles), so a large-capacity scheduler is needed. The infinite capacity causes collisions in memory (after the first collision the accesses are spread in time). In the no-queue configuration we can see all tasks finish together.
  • Might be solved by unifying several instances together (but it degrades parallelism)
  • Scheduler performance in manycore architecture

    1. Scheduler Performance in Many-Core Architecture
       Itai Avron, MSc Thesis, Technion - Electrical Engineering Dept., May 2, 2012
    2. Agenda
       • Introduction and Motivation
       • The Plural Architecture
       • Improved Scheduler
       • Analysis of Simulation Results
       • Conclusions and Future Work
    3. Background
       • CPU performance improvements
         – In the past: increase of clock frequency
           • We reached the power wall
         – Today: multi-cores
         – The future: many-cores
           • Homogeneous/heterogeneous?
           • What architecture?
           • Memory model?
           • Scheduler?
           • …
    4. Scheduling in Many-Core Architecture
       • Software scheduling is slow
         – A lot of cores to schedule
         – Fine-granularity tasks -> many tasks to schedule at the same time
           • To enhance parallelism
       • Dedicated hardware required!
    5. Scheduler Challenges
       • Latency
         – Message delay
           • From core to scheduler (completed previous task)
           • From scheduler to core (start new task)
         – Schedule time
           • To allocate tasks to cores
       • Capacity
         – Number of instances/tasks scheduled per cycle
    6. Other Architectures
       • Graphics Processing Units (GPUs)
       • Tilera
       • Larrabee
       • XMT
       • Rigel
       • Data-Driven Multithreading Model
       • Task Superscalar
    7. GPU – NVIDIA Fermi
       • Composed of many processing elements (PEs)
       • Scheduling is done in hardware
         – Schedule warps
         – Only one control flow
       • SIMD
    8. Tilera
       • Composed of tiles
         – Each tile is independent
       • Static scheduling
         – Determined during compile time
       • MIMD
       [Agarwal (MIT) 1997- ]
    9. Larrabee (Intel)
       • Array of processor cores
       • Software-controlled scheduling
         – Lightweight distributed task-stealing scheduler
       • MIMD
    10. XMT
        • Composed of TCUs
          – Thread control units
        • Hardware scheduling
          – Using prefix-sum
        • PRAM programming model
        • SPMD
        [Vishkin (UMD) 2005-]
    11. Rigel
        • Composed of tiles of clusters
          – Each cluster holds 8 cores
        • Software scheduling
          – Allocation via task queues
          – Synchronization via barriers
        • SPMD
        [Patel (UIUC) 2008- ]
    12. Data-Driven Multithreading Model
        • A Thread Synchronization Unit (TSU)
          – Connects to existing cores
        • Hardware scheduling
          – Using task map
        • Producer-consumer programming model
        [Evripidou (U Cyprus) 1997- ]
    13. Task Superscalar
        • An out-of-order task pipeline
          – Connects to existing cores
          – No speculation
        • Hardware scheduling
          – Creation of new tasks is done in software
          – Management and allocation are done in hardware
        • StarSs programming model
        [Etsion (BSC) 2009- ]
    14. Agenda (same as slide 2)
    15. The ‘Plural’ System Architecture
        [Figure: block diagram with Scheduler, Cores, Memory Network, and Memory banks]
        [Bayer (Technion) 1988 ]
    16. The System
        • Many RISC cores
          – In-order, blocking load/store
          – No data cache
        • Shared on-chip memory banks
          – Interleaved address
          – Access takes 2 cycles
            • Core retries on collision
        • Hardware synchronization and scheduling unit
          – Distributes tasks to cores according to a task map
          – Collects task completion messages from cores
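The shared-memory bullets on slide 16 (interleaved address, retry on collision) can be illustrated with a small sketch. It is not the Plural hardware: the low-order interleaving, the lowest-core-id-wins arbitration, and the names `bank_of` and `arbitrate` are all assumptions made for illustration, and the 2-cycle access time is not modeled.

```python
from __future__ import annotations

from collections import defaultdict

NUM_BANKS = 256  # matches the 256-bank configuration listed on slide 22


def bank_of(address: int, num_banks: int = NUM_BANKS) -> int:
    """Assumed low-order interleaving: consecutive addresses hit consecutive banks."""
    return address % num_banks


def arbitrate(requests: dict[int, int], num_banks: int = NUM_BANKS):
    """Grant one request per bank for this cycle; colliding cores must retry.

    `requests` maps core id -> address. The lowest core id wins a collision
    (an assumed policy). Returns (granted core ids, core ids that retry).
    """
    by_bank = defaultdict(list)
    for core, addr in sorted(requests.items()):
        by_bank[bank_of(addr, num_banks)].append(core)
    granted = [cores[0] for cores in by_bank.values()]
    retry = [core for cores in by_bank.values() for core in cores[1:]]
    return sorted(granted), sorted(retry)
```

For example, `arbitrate({0: 0, 1: 256, 2: 1})` grants cores 0 and 2 and asks core 1 to retry, because addresses 0 and 256 fall in the same bank under this mapping.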
    17. Plural Task Map
        • Precedence graph
        • Created by the programmer
        • Duplicable tasks
          – All instances are concurrent
        [Figure: example task map with tasks A (1 instance), B (1200), C (5000), D (130), and E (1), and a condition cntr=4; legend: task, task name, number of instances, dependency, condition]
    18. Plural Scheduling
        • Central Synchronization Unit (CSU)
          – Manages allocation, scheduling, and synchronization of tasks
          – Collects task-termination events
          – Programmed by the task map
          – Allocates packs (sets) of parallel task instances
        • Distribution Network (DN)
          – Organized as a tree with the CSU as its root
          – Mediates between the CSU and the processing cores
          – Downstream flow -> decomposes allocated packs of task instances
          – Upstream flow -> unifies task-termination events from the cores
    19. Scheduling Process
        [Figure: cycle diagram]
        • CSU allocates ready-to-run tasks
        • DN distributes packs to cores
        • Cores send termination message on completion
        • DN unifies termination messages
        • CSU processes new eligible-to-run tasks
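The allocate / run / terminate cycle on slides 17 to 19 can be sketched as a toy cycle-based loop. This is a simplification under stated assumptions, not the thesis simulator: latency is zero, there are no queues and no memory model, the distribution network is collapsed into a direct assignment to the first free core, and the `Task` and `simulate` names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    instances: int                                  # concurrent instances (>= 1)
    length: int                                     # cycles each instance runs
    deps: list[str] = field(default_factory=list)   # tasks that must finish first


def simulate(task_map: dict[str, Task], num_cores: int, capacity: int) -> int:
    """Return the total run time in cycles for a zero-latency, queue-less scheduler."""
    pending = {name: t.instances for name, t in task_map.items()}  # not yet allocated
    running = {name: 0 for name in task_map}                       # allocated, unfinished
    cores: list = [None] * num_cores     # (task name, cycles left) or None when idle
    done: set[str] = set()
    cycle = 0
    while len(done) < len(task_map):
        # Scheduler step: allocate at most `capacity` instances of ready tasks.
        budget = capacity
        for name, task in task_map.items():
            if name in done or not all(d in done for d in task.deps):
                continue
            for i, slot in enumerate(cores):
                if budget == 0 or pending[name] == 0:
                    break
                if slot is None:
                    cores[i] = (name, task.length)
                    pending[name] -= 1
                    running[name] += 1
                    budget -= 1
        # Core step: advance one cycle; finished instances report termination.
        cycle += 1
        for i, slot in enumerate(cores):
            if slot is None:
                continue
            name, left = slot
            if left == 1:
                cores[i] = None
                running[name] -= 1
                if pending[name] == 0 and running[name] == 0:
                    done.add(name)       # all instances ended; dependents become ready
            else:
                cores[i] = (name, left - 1)
    return cycle
```

With a made-up two-task map, `{"A": Task(1, 2), "B": Task(4, 3, ["A"])}` on 2 cores, this loop takes 9 cycles at capacity 1 and 8 cycles at capacity 2, which shows in miniature how scheduler capacity can limit core utilization.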
    20. Agenda (same as slide 2)
    21. Scheduler Improvements
        • Enhancing scheduler capacity
        • Reducing scheduling latency
        • Adding task queues to each core
          – Sharing queues
        • Adding task length indicator
    22. Simulation Environment
        • Matlab simulator
          – Based on Eyal and Dima’s simulator [Friedman, Khoretz, Ginosar, PDP 2012]
        • Benchmarks
          – 3 demo programs
          – 3 benchmarks
            • JPEG, Mandelbrot, Linear Solver
        • 24 system configurations
          – 256 cores, 256 banks
          – Scheduler capacity: 5, 10, infinite [instances]
          – Latency (between scheduler and cores): 0, 20 [cycles]
          – Task queue depth: 0, 1, 2, 10 [instances]
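The count of 24 configurations on slide 22 follows from the three swept parameters: 3 capacities × 2 latencies × 4 queue depths. A short sketch of the sweep, with hypothetical dictionary keys and `math.inf` standing in for the infinite-capacity scheduler:

```python
import math
from itertools import product

capacities = [5, 10, math.inf]   # instances scheduled per cycle
latencies = [0, 20]              # cycles between scheduler and cores
queue_depths = [0, 1, 2, 10]     # task-queue slots per core

configs = [
    {"cores": 256, "banks": 256, "capacity": c, "latency": lat, "queue_depth": q}
    for c, lat, q in product(capacities, latencies, queue_depths)
]

print(len(configs))
```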
    23. Benchmark Task Maps
        [Figure: five task-map graphs. Each node is labeled with task name, number of instances, and length in time units, listed here as name instances×length]
        • Normal and Shared Variable: A 1×23, B 100×15, C 500×35, D 600×20, E 130×18, F 1×27; condition cntr=4
        • Parallel: A 1×23, B 2000×25, C 2500×35, D 2600×26, E 2300×18, F 1×19; condition cntr=4
        • Mandelbrot: A 1×540, B 1×225, C 4096×80, D 4096×7
        • JPEG: A 1×10, B 1×10, C 1×5715, D 300×181, E 1×12810, F 300×705, G 300×2418, H 300×2927, I 100×1952, J 200×1490, K 100×1659, L 1×460, M 1×2548, N 1×207
        • Linear Solver: A 1×236, B 1×40, C 1×214, D 1×172, E 100×126, F 1×58, G 7720×197, H 100×78, J 1×47, K 1×87; condition cntr=5
    24. Agenda (same as slide 2)
    25. Analysis of Simulation Results
        • “Normal” Benchmark
        • “Parallel” Benchmark
        • “Shared Variable” Benchmark
        • JPEG Benchmark
        • Linear Solver Benchmark
        • Mandelbrot Benchmark
        • Benchmarks Analysis
    26. “Normal” Benchmark
        [Figure: activity per core, latency = 0 cycles; task map as on slide 23]
    27. “Normal” Benchmark
        [Figure: unbalanced scheduling, latency = 0 cycles; task map as on slide 23]
    28. “Normal” Benchmark
        [Figure: activity per core, latency = 20 cycles; task map as on slide 23]
    29. Analysis of Simulation Results (same as slide 25)
    30. “Parallel” Benchmark
        [Figure: activity per core, latency = 0 cycles; task map as on slide 23]
    31. “Parallel” Benchmark
        [Figure: activity per core, latency = 20 cycles; task map as on slide 23]
        Queues help hide latency only if scheduler capacity is sufficiently high
    32. Analysis of Simulation Results (same as slide 25)
    33. “Shared Variable” Benchmark
        [Figure: activity per cycle, latency = 0 cycles; task map as on slide 23]
        Is this a problem of the scheduler?
    34. Analysis of Simulation Results (same as slide 25)
    35. JPEG Benchmark
        [Figure: activity per cycle, latency = 0 cycles; task map as on slide 23]
    36. JPEG Benchmark
        [Figure: unbalanced scheduling, latency = 0 cycles; task map as on slide 23]
        Queues may degrade system performance
    37. Solutions to imbalance
        1. Queue sharing among multiple cores (simulated)
        2. Scheduling awareness of long tasks (simulated)
        3. Using fine granularity tasks (simulated)
        4. Task migration among queues
        5. Task map optimization
        6. Pipeline multiple instances of an algorithm
    38. Solutions to imbalance
        • Queue sharing among multiple cores
        • Scheduling awareness of long tasks
        • Using fine granularity tasks
    39. JPEG Benchmark: Shared Queues
        [Figure: activity per cycle, latency = 0 cycles; task map as on slide 23]
    40. Solutions to imbalance (same as slide 38)
    41. JPEG Benchmark [Green 2010]: Execution-Time Aware Scheduler
        [Figure: activity per cycle, latency = 0 cycles, task E flagged as long]
        Flag task C as well
    42. JPEG Benchmark: Execution-Time Aware Scheduler
        [Figure: activity per cycle, latency = 0 cycles, tasks E and C flagged as long]
        Need profiling tool
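The execution-time aware rule, as the slide notes describe it, is that the scheduler does not place new tasks in a queue that already holds a task flagged as long. A minimal sketch of that selection rule follows; the `CoreQueue` class, the `pick_queue` helper, and the first-fit search order are assumptions for illustration and are not taken from the thesis.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CoreQueue:
    depth: int                                                   # number of queue slots
    entries: list[tuple[str, bool]] = field(default_factory=list)  # (task name, is_long)

    def has_room(self) -> bool:
        return len(self.entries) < self.depth

    def holds_long_task(self) -> bool:
        return any(is_long for _, is_long in self.entries)


def pick_queue(queues: list[CoreQueue]) -> Optional[int]:
    """Return the index of the first queue with room and no long task, else None."""
    for index, queue in enumerate(queues):
        if queue.has_room() and not queue.holds_long_task():
            return index
    return None


def enqueue(queues: list[CoreQueue], task: str, is_long: bool) -> Optional[int]:
    """Place a task instance using the rule above; return the chosen queue index."""
    index = pick_queue(queues)
    if index is not None:
        queues[index].entries.append((task, is_long))
    return index
```

In this sketch an instance of task G would skip a queue that already holds task E when E is flagged as long, which is the situation the notes describe as "an instance of task G is stuck behind task E".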
    43. Solutions to imbalance (same as slide 38)
    44. JPEG Benchmark: Fine Granularity
        [Figure: activity per cycle, latency = 0 cycles; task map as on slide 23, with task E (1×12810) replaced by E1, E2, and E3 (1×4270 each)]
        Might be further improved by decomposing task E further and by also decomposing task C
    45. Analysis of Simulation Results (same as slide 25)
    46. Linear Solver Benchmark
        [Figure: activity per core, latency = 20 cycles; task map as on slide 23]
    47. Analysis of Simulation Results (same as slide 25)
    48. Mandelbrot Benchmark
        [Figure: activity per cycle, latency = 20 cycles; task map as on slide 23]
    49. Mandelbrot Benchmark
        [Figure: activity per cycle, latency = 20 cycles, zoom on task D execution for infinite capacity; task map as on slide 23]
        Fine-grained tasks require deep queues and a powerful scheduler to assign instances fast enough to hide latencies
    50. Analysis of Simulation Results (same as slide 25)
    51. Total Run-Time
        A 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores
    52. Load Balancing (STD of cores’ busy time, latency = 20)
        • Queues may cause imbalance
        • Larger scheduler capacity -> imbalance decreases
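The load-balancing metric on slide 52 is the standard deviation of the cores' busy time. A minimal sketch of that metric is below. Whether the thesis uses the population or the sample form is not stated, so the population form (`statistics.pstdev`) is an assumption, and the busy-time figures in the usage lines are made up.

```python
from statistics import pstdev


def imbalance(busy_cycles: list[float]) -> float:
    """Population standard deviation of per-core busy time; 0 means perfectly balanced."""
    return pstdev(busy_cycles)


balanced = imbalance([100, 100, 100, 100])   # every core equally busy
skewed = imbalance([0, 200])                 # one core did all the work
print(balanced, skewed)
```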
    53. Effective Allocation Latency
        A 1-slot queue is sufficient to hide much of the latency
    54. Agenda (same as slide 2)
    55. Conclusions
        • Analysis of scheduler effect on many-core architecture
        • A simulation and investigation tool
        • Queues to hide latencies
          – Might cause imbalance
            • Task map optimization and tuning
            • Sharing queues
    56. Future Research
        • Scheduler distribution networks
        • Implications of scheduler on power
        • Other imbalance solutions
          – As described before
        • Profiling for task map optimization and scheduling analysis
    57. Questions?