Ph.D. thesis presentation

Slides from my Ph.D. thesis presentation.

"Operating System Management of Shared Caches on Multicore Processors"


  1. Operating System Management of Shared Caches on Multicore Processors
     Ph.D. Thesis Presentation, Apr. 20, 2010
     David Tam
     Supervisor: Michael Stumm
  2. Multicores Today
     [Diagram: chips with private caches vs. a multicore chip with a shared cache]
     Multicores are ubiquitous
     ● Unexpected by most software developers
     ● Software support is lacking (e.g., the OS)
     General role of the OS
     ● Manage shared hardware resources
     New candidate
     ● The shared cache: performance critical
     ● Focus of this thesis
  3. Thesis
     OS should manage on-chip shared caches of multicore processors
     Demonstrate:
     ● Properly managing shared caches at the OS level can increase performance
     Management principles
     1. Promote sharing
        ● For threads that share data
        ● Maximize the major advantage of shared caches
     2. Provide isolation
        ● For threads that do not share data
        ● Minimize the major disadvantage of shared caches
     Supporting role
     ● Provision the shared cache online
  4. #1 - Promote Sharing
     Problem: cross-chip accesses are slow
     Solution: exploit the major advantage of shared caches: fast access to shared data
     OS actions: identify & localize data sharing
     View: match software sharing to hardware sharing
     [Diagram: Thread A on Chip A and Thread B on Chip B access the same shared data, causing shared-data traffic between the two L2 caches]
  5. #1 - Promote Sharing (animation build of slide 4: Thread B has been migrated onto Chip A, next to Thread A and the shared data in its L2)
  6. Thread Clustering [EuroSys'07]
     Identify data sharing
     ● Detect sharing online with hardware performance counters
     ● Monitor remote cache accesses (data addresses)
     ● Track on a per-thread basis
     ● Data addresses identify memory regions shared with other threads
     Localize data sharing
     ● Identify clusters of threads that access the same memory regions
     ● Migrate the threads of a cluster onto the same chip
     [Diagram: Thread A and Thread B on separate chips, with shared-data traffic crossing between their L2 caches]
  7. Thread Clustering [EuroSys'07] (animation build of slide 6: Thread B migrated onto Chip A, so the shared data is served from a single L2)
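The clustering step on slides 6-7 can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the per-thread sharing signatures below are hypothetical per-region access counts, standing in for the remote-cache-access addresses the real system samples with hardware performance counters, and threads are grouped greedily by cosine similarity.

```python
# Minimal sketch of similarity-based thread clustering.
# Signatures are hypothetical per-region access counts, not real counter data.

def cosine(u, v):
    """Cosine similarity between two sharing-signature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_threads(signatures, threshold=0.5):
    """Greedily group threads whose signatures are similar.

    signatures: {thread_id: [per-region access counts]}
    Returns a list of clusters (lists of thread ids); each cluster
    would then be migrated onto the same chip.
    """
    clusters = []  # each entry: (representative vector, [thread ids])
    for tid, sig in signatures.items():
        for rep, members in clusters:
            if cosine(rep, sig) >= threshold:
                members.append(tid)
                break
        else:
            clusters.append((list(sig), [tid]))
    return [members for _, members in clusters]

# Four threads over three memory regions: t0/t1 mostly touch region 0,
# t2/t3 mostly touch region 2, so two clusters emerge.
sigs = {0: [9, 1, 0], 1: [8, 0, 1], 2: [0, 1, 9], 3: [1, 0, 8]}
print(cluster_threads(sigs))  # → [[0, 1], [2, 3]]
```

The real mechanism works from sampled data addresses rather than complete counts, but the grouping idea is the same: threads whose signatures overlap belong on the same chip.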
  8. Visualization of Clusters
     ● SPECjbb 2000: 4 warehouses, 16 threads per warehouse
     ● Threads have been sorted by cluster for visualization
     [Figure: threads (y-axis, in groups of 16) vs. memory regions by virtual address (x-axis, 0 to 2^64), shaded by sharing intensity: high, medium, low, none]
  9. Visualization of Clusters (animation build of slide 8)
  10. Performance Results
      [Diagram: two-chip system; each chip has a 1.9MB shared L2, a 36MB L3, and 4GB of memory]
      ● Multithreaded commercial workloads: RUBiS, VolanoMark, SPECjbb2k
      ● 8-way IBM POWER5 Linux system
        ● 22%, 32%, 70% reduction in stalls caused by cross-chip accesses
        ● 7%, 5%, 6% performance improvement
      ● 32-way IBM POWER5+ Linux system
        ● 14% potential improvement for SPECjbb2k
  11. #2 – Provide Isolation
      Problem: the major disadvantage of shared caches is cache space interference
      Solution: provide cache space isolation between applications
      OS actions: enforce isolation during physical page allocation
      View: partition the shared cache into smaller private caches
      [Diagram: Apache and MySQL competing for space in the shared cache]
  12.-15. #2 – Provide Isolation (animation builds of slide 11: a partition boundary is drawn between Apache's and MySQL's portions of the shared cache)
  16. Cache Partitioning [WIOSCA'07]
      ● Apply the page-coloring technique
      ● Guide physical page allocation to control cache line usage
      ● Works on existing processors
      [Diagram: an application's virtual pages map (OS managed) to color-A physical pages, which map (fixed, in hardware) to the color-A sets (N sets) of the L2 cache]
  17. Cache Partitioning [WIOSCA'07] (animation build of slide 16: a second application is added; Application A's virtual pages map to color-A physical pages and Application B's to color-B pages, so each application occupies its own N sets of the L2 cache)
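Page coloring (slides 16-17) works because, in a physically indexed cache, some of the set-index bits lie above the page-offset bits, so the OS controls them whenever it picks a physical frame. A minimal sketch of the arithmetic, using illustrative cache parameters (not POWER5's actual geometry) and a toy free list:

```python
# Sketch of page-color arithmetic for a physically indexed,
# set-associative cache. Parameters are illustrative only.
PAGE_SIZE  = 4096              # bytes
LINE_SIZE  = 128               # bytes per cache line
CACHE_SIZE = 2 * 1024 * 1024   # 2 MB
WAYS       = 8

num_sets      = CACHE_SIZE // (LINE_SIZE * WAYS)  # 2048 sets
sets_per_page = PAGE_SIZE // LINE_SIZE            # sets one page can touch
NUM_COLORS    = num_sets // sets_per_page         # 64 colors

def color_of_frame(pfn):
    """Color of a physical frame: the set-index bits above the page offset."""
    return pfn % NUM_COLORS

def allocate_frame(free_frames, allowed_colors):
    """Return a free frame whose color lies in the partition's allowed set."""
    for pfn in free_frames:
        if color_of_frame(pfn) in allowed_colors:
            free_frames.remove(pfn)
            return pfn
    raise MemoryError("no free frame of an allowed color")

# Give Application A colors 0-31 and Application B colors 32-63,
# splitting the cache in half:
free = list(range(1000))
a_frame = allocate_frame(free, set(range(0, 32)))
b_frame = allocate_frame(free, set(range(32, 64)))
print(color_of_frame(a_frame), color_of_frame(b_frame))  # → 0 32
```

Because the pfn-to-set mapping is fixed in hardware, restricting an application's frames to a subset of colors restricts it to the corresponding subset of cache sets, which is exactly the isolation the slides describe.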
  18. Impact of Partitioning
      [Graph: performance of mcf and art vs. L2 cache size (# of colors, 0-16), with and without isolation]
      Performance of other combos
      ● 10 pairs of applications: SPECcpu2k, SPECjbb2k
      ● 4% to 17% improvement (36MB L3 cache)
      ● 28%, 50% improvement (no L3 cache)
  19. Provisioning the Cache
      Problem: how to determine the cache partition size
      Solution: use the application's L2 cache miss rate curve (MRC)
      Criteria: obtain the MRC rapidly, accurately, online, with low overhead, on existing hardware
      OS actions: monitor L2 cache accesses using hardware performance counters
      [Graph: miss rate (%) vs. allocated cache size (%) for Application X]
  20. RapidMRC [ASPLOS'09]
      Design
      ● Upon every L2 access:
        ● Update the sampling register with the data address
        ● Trigger an interrupt to copy the register to a trace log in main memory
      ● Feed the trace log into Mattson's stack algorithm [1970] to obtain the L2 MRC
      Results
      ● Workloads: 30 apps from SPECcpu2k, SPECcpu2k6, SPECjbb2k
      ● Latency: 227 ms to generate an online L2 MRC
      ● Accuracy: good, e.g. up to 27% performance improvement when applied to cache partitioning
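The trace-to-MRC step can be illustrated with a simplified version of Mattson's stack algorithm: an access's stack distance is the number of distinct lines touched since the last access to the same line, and it hits in any LRU cache holding at least distance+1 lines, so one pass over the trace yields the miss rate at every cache size. This sketch assumes a fully associative LRU cache and a complete (unsampled) trace, unlike the real RapidMRC:

```python
from collections import OrderedDict

def miss_rate_curve(trace, max_size):
    """Mattson's stack algorithm, simplified: one pass over the trace
    yields the miss rate for every LRU cache size up to max_size lines."""
    stack = OrderedDict()        # LRU stack; most recently used at the end
    dist_hist = [0] * max_size   # dist_hist[d]: accesses with stack distance d
    for line in trace:
        if line in stack:
            # Stack distance = depth from the top of the LRU stack.
            depth = list(reversed(stack)).index(line)
            if depth < max_size:
                dist_hist[depth] += 1
            del stack[line]
        # (first-time accesses are cold misses at every cache size)
        stack[line] = True       # push to the top of the stack
    total = len(trace)
    # A cache of s lines hits every access with stack distance < s.
    mrc, hits = [], 0
    for s in range(1, max_size + 1):
        hits += dist_hist[s - 1]
        mrc.append((total - hits) / total)
    return mrc

trace = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(miss_rate_curve(trace, 4))  # misses drop once all 3 lines fit
```

The thesis applies corrections for set associativity and for the sampled nature of the hardware trace; this sketch shows only the core single-pass idea.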
  21. Accuracy of RapidMRC
      ● Execution slice at 10 billion instructions
      [Graphs: miss rate (MPKI) vs. cache size (# of colors) for jbb2k, gzip, mgrid, mcf, xalancbmk, ammp]
  22. Effectiveness on Provisioning
      [Graph: performance of twolf and equake vs. L2 cache size (# of colors, 0-16), comparing no isolation, the real MRC, and RapidMRC]
      Performance of other combos using RapidMRC
      ● 12% improvement for vpr+applu
      ● 14% improvement for ammp+3applu
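Given two co-running applications' MRCs, a provisioner can pick the color split that minimizes the combined miss rate. A minimal sketch with hypothetical curves (`best_split` and the MRC values below are illustrative, not the thesis's actual policy or data):

```python
def best_split(mrc_a, mrc_b, total_colors):
    """Pick the partition point minimizing the combined miss rate.

    mrc_x[c] is app X's miss rate when given c colors (index 0 = 0 colors).
    Exhaustive search over split points; each app gets at least one color.
    """
    best = min(range(1, total_colors),
               key=lambda c: mrc_a[c] + mrc_b[total_colors - c])
    return best, total_colors - best

# Hypothetical MRCs over 0..8 colors: app A's misses plateau after
# 2 colors (small working set), while app B keeps benefiting from
# every additional color.
mrc_a = [1.0, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
mrc_b = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(best_split(mrc_a, mrc_b, 8))  # → (2, 6)
```

The split lands where app A's curve flattens: giving A more than 2 colors buys nothing, so the remaining colors go to B, which still gains from them.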
  23. Contributions
      On commodity multicores, first to demonstrate:
      ● Mechanism: detect data sharing online & automatically cluster threads
        Benefit: promoting sharing [EuroSys'07]
      ● Mechanism: partition the shared cache by applying page coloring
        Benefit: providing isolation [WIOSCA'07]
      ● Mechanism: approximate L2 MRCs online in software
        Benefit: provisioning the cache [ASPLOS'09]
      ...all performed by the OS.
  24. Concluding Remarks
      Demonstrated performance improvements
      ● Promoting sharing
        ● 5% – 7%: SPECjbb2k, RUBiS, VolanoMark (2 chips)
        ● 14% potential: SPECjbb2k (8 chips)
      ● Providing isolation
        ● 4% – 17%: 8 combos of SPECcpu2k, SPECjbb2k (36MB L3 cache)
        ● 28%, 50%: 2 combos of SPECcpu2k (no L3 cache)
      ● Provisioning the cache online
        ● 12% – 27%: 3 combos of SPECcpu2k
      OS should manage on-chip shared caches
  25. Thank You
  26. 24-9=15 slides
  27. Future Research Opportunities
      Shared cache management principles can be applied to other layers:
      ● Application, managed runtime, virtual machine monitor
      Promoting sharing
      ● Improve locality on NUMA multiprocessor systems
      Providing isolation
      ● Finer granularity, within one application [MICRO'08]
        ● Regions
        ● Objects
      RapidMRC
      ● Online L2 MRCs
        ● Reducing energy
        ● Guiding co-scheduling
      ● Underlying tracing mechanism
        ● Trace other hardware events
