Scheduler performance in manycore architecture

Itai Avron, Technion

Published in: Technology, Education
Slide notes
  • Unbalanced work distribution occurs at low scheduler capacity; higher capacity reduces run time; deep queues cause unbalanced scheduling.
  • Latency is added to the task's run time; queues hide latency; synchronization points cannot be compensated for by queues.
  • A low-capacity scheduler cannot utilize all the cores.
  • Queues combined with low capacity generate imbalance (only the first cores receive instances into their queues and hide latency).
  • The low-capacity scheduler spreads the accesses to the shared bank over time.
  • An instance of task G is stuck behind task E.
  • Possibly requiring a more complex scheduler, more complex hardware, and enhanced communication bandwidth, and incurring higher power and latency.
  • Queues are shared between two cores.
  • Note that sharing will not always solve this problem.
  • The scheduler does not schedule new tasks to a queue to which it has already scheduled a "long" task.
  • Break long tasks into many fine-grained tasks. In this case, we break task E into three parts.
  • Very parallel, with long tasks, so a one-slot queue and low capacity are sufficient.
  • Task D is very short (7 cycles), so a large-capacity scheduler is needed. The infinite capacity causes collisions in memory (after the first collision, the accesses are spread in time). In the no-queue configuration, all tasks finish together.
  • Might be solved by unifying several instances (but this degrades parallelism).
  • Transcript

    • 1. Scheduler Performance in Many-Core Architecture. Itai Avron, MSc Thesis, Technion, Electrical Engineering Dept. May 2, 2012
    • 2. Agenda: Introduction and Motivation • The Plural Architecture • Improved Scheduler • Analysis of Simulation Results • Conclusions and Future Work
    • 3. Background: CPU performance improvements. In the past: increases in clock frequency (we reached the power wall). Today: multi-cores. The future: many-cores. Homogeneous or heterogeneous? What architecture? Memory model? Scheduler? ...
    • 4. Scheduling in Many-Core Architecture: software scheduling is slow (many cores to schedule, and fine-granularity tasks mean many tasks to schedule at the same time, to enhance parallelism). Dedicated hardware is required!
    • 5. Scheduler Challenges: latency (message delay from core to scheduler on completing the previous task, from scheduler to core to start a new task, plus the schedule time to allocate tasks to cores) and capacity (the number of instances/tasks scheduled per cycle).
    • 6. Other Architectures: Graphics Processing Units (GPUs) • Tilera • Larrabee • XMT • Rigel • Data-Driven Multithreading Model • Task Superscalar
    • 7. GPU, NVIDIA Fermi: composed of many processing elements (PEs); scheduling is done in hardware (schedules warps, only one control flow); SIMD.
    • 8. Tilera: composed of tiles, each tile independent; static scheduling, determined at compile time; MIMD. [Agarwal (MIT) 1997- ]
    • 9. Larrabee (Intel): array of processor cores; software-controlled scheduling (lightweight distributed task-stealing scheduler); MIMD.
    • 10. XMT: composed of TCUs (thread control units); hardware scheduling using prefix-sum; PRAM programming model; SPMD. [Vishkin (UMD) 2005- ]
    • 11. Rigel: composed of tiles of clusters, each cluster holding 8 cores; software scheduling (allocation via task queues, synchronization via barriers); SPMD. [Patel (UIUC) 2008- ]
    • 12. Data-Driven Multithreading Model: a Thread Synchronization Unit (TSU) connects to existing cores; hardware scheduling using a task map; producer-consumer programming model. [Evripidou (U Cyprus) 1997- ]
    • 13. Task Superscalar: an out-of-order task pipeline that connects to existing cores, with no speculation; hardware scheduling (creation of new tasks is done in software, management and allocation in hardware); StarSs programming model. [Etsion (BSC) 2009- ]
    • 14. Agenda: Introduction and Motivation • The Plural Architecture • Improved Scheduler • Analysis of Simulation Results • Conclusions and Future Work
    • 15. The 'Plural' System Architecture: scheduler, cores, memory network, memory banks. [Bayer (Technion) 1988- ]
    • 16. The System: many RISC cores (in-order, blocking load/store, no data cache); shared on-chip memory banks (interleaved addresses; an access takes 2 cycles; a core retries on collision); a hardware synchronization and scheduling unit that distributes tasks to cores according to a task map and collects task-completion messages from the cores. (A toy model of the memory banks follows below.)
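The bank-interleaved memory model on slide 16 can be illustrated with a small Python toy. This is a minimal sketch under my own assumptions: the function names, the one-grant-per-bank-per-cycle rule, and the retry policy are illustrative, not the thesis simulator's actual behavior.

```python
NUM_BANKS = 256       # matches the 256-bank configurations simulated later
ACCESS_CYCLES = 2     # an access takes 2 cycles (slide 16)

def bank_of(address: int) -> int:
    """Interleaved mapping: consecutive addresses hit consecutive banks."""
    return address % NUM_BANKS

def simulate_accesses(requests):
    """requests: list of (core_id, address). Returns completion cycle per core.
    At most one access per bank may start per cycle; losers retry next cycle."""
    pending = list(requests)
    busy_until = [0] * NUM_BANKS       # cycle at which each bank frees up
    done = {}
    cycle = 0
    while pending:
        granted_banks = set()
        still_pending = []
        for core, addr in pending:
            b = bank_of(addr)
            if b not in granted_banks and busy_until[b] <= cycle:
                granted_banks.add(b)
                busy_until[b] = cycle + ACCESS_CYCLES
                done[core] = cycle + ACCESS_CYCLES
            else:
                still_pending.append((core, addr))   # collision: retry
        pending = still_pending
        cycle += 1
    return done

# Two cores hitting the same bank: the second retries and completes
# two cycles later.
print(simulate_accesses([(0, 7), (1, 7 + NUM_BANKS)]))   # -> {0: 2, 1: 4}
```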
    • 17. Plural Task Map: a task precedence graph created by the programmer. Tasks may be duplicable (all instances are concurrent) and conditional (e.g., cntr=4); each node carries a task name, its number of instances, and its dependencies. (A sketch of this structure follows below.)
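To make slide 17 concrete, here is a minimal Python sketch of a task map as a precedence graph, populated with the "Normal" benchmark values from slide 23. The dataclass fields and the exact dependency edges are my reading of the diagram, so treat them as assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    name: str
    instances: int                # duplicable task: all instances run concurrently
    length: int                   # task length in time units
    deps: List[str] = field(default_factory=list)   # predecessor task names
    condition: Optional[str] = None                 # e.g. "cntr=4" guard

# The "Normal" benchmark task map (numbers from slide 23; edges assumed).
TASK_MAP = {
    "A": Task("A", 1, 23),
    "B": Task("B", 100, 15, deps=["A"]),
    "C": Task("C", 500, 35, deps=["B"]),
    "D": Task("D", 600, 20, deps=["B"]),
    "E": Task("E", 130, 18, deps=["C", "D"], condition="cntr=4"),
    "F": Task("F", 1, 27, deps=["E"]),
}

def ready_tasks(completed: set) -> List[str]:
    """Tasks become eligible once all their predecessors have terminated."""
    return [t.name for t in TASK_MAP.values()
            if t.name not in completed and all(d in completed for d in t.deps)]

print(ready_tasks({"A"}))   # -> ['B']
```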
    • 18. Plural Scheduling: a Central Synchronization Unit (CSU) manages allocation, scheduling, and synchronization of tasks, collects task-termination messages, is programmed by the task map, and allocates packs (sets) of parallel task instances. A Distribution Network (DN), organized as a tree with the CSU as its root, mediates between the CSU and the processing cores: the downstream flow decomposes allocated packs of task instances, and the upstream flow unifies task-termination events from the cores.
    • 19. Scheduling Process: the CSU allocates ready-to-run tasks; the DN distributes packs to cores; cores send a termination message on completion; the DN unifies termination messages; the CSU processes newly eligible-to-run tasks. (A minimal loop sketch follows below.)
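The scheduling process on slide 19 can be summarized in a few lines of Python. This is only a sketch of the allocate/terminate handshake: the pack decomposition in the DN tree is flattened away, and the CAPACITY constant and function names are assumptions.

```python
from collections import deque

CAPACITY = 10   # max task instances allocated per cycle (assumed value)

def csu_allocate(ready, idle_cores, running, cycle):
    """One CSU cycle: hand up to CAPACITY ready instances to idle cores."""
    allocated = 0
    while ready and idle_cores and allocated < CAPACITY:
        name, length = ready.popleft()
        core = idle_cores.popleft()
        running[core] = (name, cycle + length)   # instance finishes at cycle+length
        allocated += 1

def dn_collect(running, idle_cores, cycle):
    """Cores whose instance finished send terminations; the DN unifies them."""
    for core, (name, finish) in list(running.items()):
        if finish <= cycle:
            del running[core]
            idle_cores.append(core)

# Drive task B of the "Normal" benchmark: 100 instances, 15 time units each.
ready = deque([("B", 15)] * 100)
idle, running = deque(range(256)), {}
for cycle in range(12):
    dn_collect(running, idle, cycle)
    csu_allocate(ready, idle, running, cycle)
print(len(running))   # 100: at capacity 10/cycle, all instances start in 10 cycles
```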
    • 20. Agenda: Introduction and Motivation • The Plural Architecture • Improved Scheduler • Analysis of Simulation Results • Conclusions and Future Work
    • 21. Scheduler Improvements: enhancing scheduler capacity; reducing scheduling latency; adding task queues to each core (and sharing queues); adding a task-length indicator.
    • 22. Simulation Environment: a Matlab simulator, based on Eyal and Dima's simulator [Friedman, Khoretz, Ginosar, PDP 2012]. Benchmarks: 3 demo programs and 3 benchmarks (JPEG, Mandelbrot, Linear Solver). 24 system configurations: 256 cores, 256 banks; scheduler capacity 5, 10, or infinite [instances]; scheduler-to-core latency 0 or 20 [cycles]; task queue depth 0, 1, 2, or 10 [instances]. (The sweep is sketched below.)
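The 24-configuration sweep on slide 22 is just the cross product of the three parameter lists. A sketch; the run_benchmark entry point is hypothetical:

```python
from itertools import product

CAPACITIES = [5, 10, float("inf")]   # instances scheduled per cycle
LATENCIES = [0, 20]                  # scheduler-to-core latency, in cycles
QUEUE_DEPTHS = [0, 1, 2, 10]         # per-core task queue depth, in instances

configs = list(product(CAPACITIES, LATENCIES, QUEUE_DEPTHS))
assert len(configs) == 24            # the 24 system configurations on slide 22
for capacity, latency, depth in configs:
    # run_benchmark(capacity, latency, depth)  # hypothetical simulator call
    pass
```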
    • 23. Benchmark Task Maps: task precedence graphs for the Normal, Parallel, Shared Variable, JPEG, Linear Solver, and Mandelbrot benchmarks; each node lists the task name, its number of instances, and its length in time units.
    • 24. Agenda: Introduction and Motivation • The Plural Architecture • Improved Scheduler • Analysis of Simulation Results • Conclusions and Future Work
    • 25. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 26. "Normal" Benchmark: activity per core, latency = 0 cycles. (Task map: A, 1 instance, 23 units; B, 100, 15; C, 500, 35; D, 600, 20; E, 130, 18, cntr=4; F, 1, 27.)
    • 27. "Normal" Benchmark: unbalanced scheduling, latency = 0 cycles.
    • 28. "Normal" Benchmark: activity per core, latency = 20 cycles.
    • 29. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 30. "Parallel" Benchmark: activity per core, latency = 0 cycles. (Task map: A, 1, 23; B, 2000, 25; C, 2500, 35; D, 2600, 26; E, 2300, 18, cntr=4; F, 1, 19.)
    • 31. "Parallel" Benchmark: activity per core, latency = 20 cycles. Queues help hide latency only if scheduler capacity is sufficiently high. (A toy model of this effect follows below.)
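Slide 31's observation, that queues hide latency only while the scheduler's capacity can keep them refilled, can be reproduced with a toy discrete-time model. Everything here is my own simplification: the queue accounting, the round-robin issue order, and the parameters (which mirror task B of the "Parallel" benchmark) are assumptions, not the thesis simulator.

```python
def run(cores, capacity, latency, qdepth, instances, length):
    """Total core-idle cycles until `instances` copies of one task finish."""
    in_flight = []                  # (arrival_cycle, core): allocations in transit
    queue = [0] * cores             # instances waiting at each core
    busy_until = [-1] * cores       # cycle at which each core's current task ends
    to_send, completed, idle, cycle = instances, 0, 0, 0
    while completed < instances:
        completed += sum(1 for c in range(cores) if busy_until[c] == cycle)
        # allocations sent `latency` cycles ago arrive now
        for _, c in (f for f in in_flight if f[0] == cycle):
            queue[c] += 1
        in_flight = [f for f in in_flight if f[0] > cycle]
        # the scheduler issues up to `capacity` instances per cycle, round-robin,
        # never exceeding a core's queue depth (counting queued + in transit)
        sent = 0
        for c in range(cores):
            outstanding = queue[c] + sum(1 for f in in_flight if f[1] == c)
            if to_send > 0 and sent < capacity and outstanding <= qdepth:
                in_flight.append((cycle + latency, c))
                to_send, sent = to_send - 1, sent + 1
        # free cores pull from their queue, or sit idle this cycle
        for c in range(cores):
            if busy_until[c] <= cycle:
                if queue[c] > 0:
                    queue[c] -= 1
                    busy_until[c] = cycle + length
                else:
                    idle += 1
        cycle += 1
    return idle

for cap in (5, 10, 256):   # idle time shrinks as scheduler capacity rises
    print(cap, run(cores=256, capacity=cap, latency=20, qdepth=2,
                   instances=2000, length=25))
```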
    • 32. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 33. "Shared Variable" Benchmark: activity per cycle, latency = 0 cycles. Is this a problem of the scheduler? (Task map as in the "Normal" benchmark.)
    • 34. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 35. JPEG Benchmark: activity per cycle, latency = 0 cycles. (Task map: A, 1, 10; B, 1, 10; C, 1, 5715; E, 1, 12810; G, 300, 2418; J, 200, 1490; I, 100, 1952; K, 100, 1659; D, 300, 181; F, 300, 705; H, 300, 2927; L, 1, 460; M, 1, 2548; N, 1, 207.)
    • 36. JPEG Benchmark: unbalanced scheduling, latency = 0 cycles. Queues may degrade system performance.
    • 37. Solutions to imbalance: 1. Queue sharing among multiple cores; 2. Scheduling awareness of long tasks; 3. Using fine-granularity tasks (these three were simulated); 4. Task migration among queues; 5. Task map optimization; 6. Pipelining multiple instances of an algorithm.
    • 38. Solutions to imbalance: queue sharing among multiple cores; scheduling awareness of long tasks; using fine-granularity tasks.
    • 39. JPEG Benchmark with shared queues: activity per cycle, latency = 0 cycles. (A sketch of the sharing policy follows below.)
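A sketch, under an assumed earliest-free-core policy, of the queue sharing on slide 39: two cores drain one queue, so an instance is not stuck behind a long task on a single core. The task lengths come from the JPEG task map; the function itself is illustrative.

```python
import heapq

def run_pair(shared_queue, n_cores=2):
    """Run (name, length) items in order; whichever core frees first pulls next.
    Returns each item's finish time."""
    free_at = [(0, c) for c in range(n_cores)]    # (cycle the core frees, core id)
    heapq.heapify(free_at)
    finish = {}
    for name, length in shared_queue:
        t, c = heapq.heappop(free_at)             # earliest-free core takes it
        finish[name] = t + length
        heapq.heappush(free_at, (t + length, c))
    return finish

# Long task E (12810 units) ahead of two G instances (2418 each): with the
# shared queue, the second core drains the G's instead of waiting behind E.
print(run_pair([("E", 12810), ("G1", 2418), ("G2", 2418)]))
# -> {'E': 12810, 'G1': 2418, 'G2': 4836}
```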
    • 40. Solutions to imbalance: queue sharing among multiple cores; scheduling awareness of long tasks; using fine-granularity tasks.
    • 41. JPEG Benchmark, execution-time aware scheduler [Green 2010]: activity per cycle, latency = 0 cycles, task E flagged as long. Flag task C as well.
    • 42. JPEG Benchmark, execution-time aware scheduler: activity per cycle, latency = 0 cycles, tasks E and C flagged as long. A profiling tool is needed. (A sketch of the policy follows below.)
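For slides 41 and 42, a sketch of the flagging policy: the scheduler refuses to queue new instances behind a task flagged as long, and gives a long task an empty core to itself. The policy details and data layout are assumptions; the "long" flags would come from the programmer or from the profiling tool the slide calls for.

```python
LONG_TASKS = {"C", "E"}   # flagged as long (by the programmer or a profiler)

def pick_core(task, core_queues, core_running):
    """Return an acceptable core index for `task`, or None if there is none."""
    for c, q in enumerate(core_queues):
        if core_running[c] in LONG_TASKS or any(t in LONG_TASKS for t in q):
            continue                   # never queue anything behind a long task
        if task in LONG_TASKS and (q or core_running[c] is not None):
            continue                   # a long task only goes to an empty core
        return c
    return None

# Core 0 is running long task E, so a new G instance goes to core 1 instead.
print(pick_core("G", core_queues=[[], []], core_running=["E", None]))   # -> 1
```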
    • 43. Solutions to imbalance: queue sharing among multiple cores; scheduling awareness of long tasks; using fine-granularity tasks.
    • 44. JPEG Benchmark with fine granularity: task E is broken into E1, E2, E3 (1 instance each, 4270 time units each); activity per cycle, latency = 0 cycles. Might be further improved by decomposing task E further and by also decomposing task C. (A sketch of the split follows below.)
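Slide 44's fix, sketched: replace long task E (one instance, 12810 units) with three chained sub-tasks of 4270 units each, so other instances get scheduling opportunities between the parts. The chaining and the dict layout are my assumptions; whether the parts could instead run concurrently depends on the algorithm.

```python
def split_task(name, instances, length, parts):
    """Replace one long task with `parts` chained shorter sub-tasks."""
    sub_length = length // parts            # 12810 // 3 == 4270, as on the slide
    subtasks, prev = [], None
    for i in range(1, parts + 1):
        sub = {"name": f"{name}{i}", "instances": instances,
               "length": sub_length, "deps": [prev] if prev else []}
        subtasks.append(sub)
        prev = sub["name"]
    return subtasks

for t in split_task("E", instances=1, length=12810, parts=3):
    print(t)   # E1 (no deps), E2 (deps E1), E3 (deps E2), 4270 units each
```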
    • 45. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 46. Linear Solver Benchmark: activity per core, latency = 20 cycles. (Task map: A, 1, 236; B, 1, 40; C, 1, 214; D, 1, 172; F, 1, 58; E, 100, 126; G, 7720, 197; H, 100, 78; J, 1, 47, cntr=5; K, 1, 87.)
    • 47. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 48. Mandelbrot Benchmark: activity per cycle, latency = 20 cycles. (Task map: A, 1, 540; B, 1, 225; C, 4096, 80; D, 4096, 7.)
    • 49. Mandelbrot Benchmark: activity per cycle, latency = 20 cycles, zoomed in on task D execution at infinite capacity. Fine-grained tasks require deep queues and a powerful scheduler to assign instances fast enough to hide latencies.
    • 50. Analysis of Simulation Results: "Normal" Benchmark • "Parallel" Benchmark • "Shared Variable" Benchmark • JPEG Benchmark • Linear Solver Benchmark • Mandelbrot Benchmark • Benchmarks Analysis
    • 51. Total Run-Time: a 2-slot queue and a scheduler capacity of 10 are enough to utilize 256 cores.
    • 52. Load Balancing (standard deviation of the cores' busy time, latency = 20): queues may cause imbalance; as scheduler capacity grows, the imbalance decreases.
    • 53. Effective Allocation Latency: a 1-slot queue is sufficient to hide much of the latency.
    • 54. Agenda: Introduction and Motivation • The Plural Architecture • Improved Scheduler • Analysis of Simulation Results • Conclusions and Future Work
    • 55. Conclusions: an analysis of the scheduler's effect on many-core architecture; a simulation and investigation tool; queues to hide latencies, which might cause imbalance (mitigated by task map optimization and tuning, and by sharing queues).
    • 56. Future Research: scheduler distribution networks; implications of the scheduler on power; other imbalance solutions, as described before; profiling for task map optimization and scheduling analysis.
    • 57. QUESTIONS?
