Operating System Management of Shared Caches on Multicore Processors
-- Ph.D. Thesis Presentation --
  1. Operating System Management of Shared Caches on Multicore Processors
     -- Ph.D. Thesis Presentation --
     Apr. 20, 2010
     David Tam
     Supervisor: Michael Stumm
  2. Multicores Today
     [Figure: multicore chips, each with a shared on-chip cache]
     Multicores are Ubiquitous
     ● Unexpected by most software developers
     ● Software support is lacking (e.g., OS)
     General Role of OS
     ● Manage shared hardware resources
     New Candidate
     ● Shared cache: performance critical
     ● Focus of thesis
  3. Thesis
     OS should manage on-chip shared caches of multicore processors
     Demonstrate:
     ● Properly managing shared caches at OS level can increase performance
     Management Principles
     1. Promote sharing
        ● For threads that share data
        ● Maximize major advantage of shared caches
     2. Provide isolation
        ● For threads that do not share data
        ● Minimize major disadvantage of shared caches
     Supporting Role
     ● Provision the shared cache online
  4. #1 - Promote Sharing
     Problem: Cross-chip accesses are slow
     Solution: Exploit major advantage of shared caches: fast access to shared data
     OS Actions: Identify & localize data sharing
     View: Match software sharing to hardware sharing
     [Figure: Thread A on Chip A and Thread B on Chip B, with shared-data traffic crossing between the two L2 caches]
  5. #1 - Promote Sharing
     (same text as slide 4)
     [Figure: Thread A and Thread B placed on the same chip, with a single copy of the shared data in one L2 cache]
  6. Thread Clustering [EuroSys'07]
     Identify Data Sharing
     ● Detect sharing online with hardware performance counters
     ● Monitor remote cache accesses (data addresses)
     ● Track on a per-thread basis
     ● Data addresses are memory regions shared with other threads
     Localize Data Sharing
     ● Identify clusters of threads that access same memory regions
     ● Migrate threads of a cluster onto same chip
     [Figure: before clustering, Thread A and Thread B on separate chips, with shared-data traffic between the two L2 caches]
  7. Thread Clustering [EuroSys'07]
     (same text as slide 6)
     [Figure: after clustering, both threads on the same chip, with the shared data in one L2 cache]
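To make the clustering step above concrete, here is a minimal user-level sketch in C. The sharing signatures, the cosine-similarity threshold, the greedy grouping, and the use of sched_setaffinity() to pin a cluster onto one chip's CPUs are all illustrative assumptions; the EuroSys'07 mechanism works inside the kernel scheduler and fills the signatures from hardware-performance-counter samples of remote cache accesses.

/* Sketch only: signature layout, threshold, and user-level pinning are
 * assumptions, not the EuroSys'07 kernel implementation. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <math.h>
#include <stdio.h>

#define NREGIONS 256                 /* memory regions tracked per thread (assumed) */
#define MAXCLUSTERS 64               /* cap on clusters, for this sketch only */

struct thread_sig {
    pid_t tid;                       /* Linux thread id */
    unsigned counts[NREGIONS];       /* remote-cache-access samples per region */
    int cluster;                     /* assigned cluster, -1 if none yet */
};

/* Cosine similarity between two threads' sharing signatures. */
static double similarity(const struct thread_sig *a, const struct thread_sig *b)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int r = 0; r < NREGIONS; r++) {
        dot += (double)a->counts[r] * b->counts[r];
        na  += (double)a->counts[r] * a->counts[r];
        nb  += (double)b->counts[r] * b->counts[r];
    }
    return (na > 0.0 && nb > 0.0) ? dot / (sqrt(na) * sqrt(nb)) : 0.0;
}

/* Greedy clustering: a thread joins the first cluster whose founding thread
 * it is similar enough to, otherwise it founds a new cluster. */
static int cluster_threads(struct thread_sig t[], int n, double threshold)
{
    int founder[MAXCLUSTERS];
    int nclusters = 0;
    for (int i = 0; i < n; i++) {
        t[i].cluster = -1;
        for (int c = 0; c < nclusters; c++) {
            if (similarity(&t[i], &t[founder[c]]) >= threshold) {
                t[i].cluster = c;
                break;
            }
        }
        if (t[i].cluster < 0 && nclusters < MAXCLUSTERS) {
            founder[nclusters] = i;
            t[i].cluster = nclusters++;
        }
    }
    return nclusters;
}

/* Localize: pin every thread of one cluster onto the CPUs of one chip. */
static void migrate_cluster(struct thread_sig t[], int n, int cluster,
                            const int *chip_cpus, int ncpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(chip_cpus[i], &set);
    for (int i = 0; i < n; i++)
        if (t[i].cluster == cluster)
            sched_setaffinity(t[i].tid, sizeof(set), &set);
}

int main(void)
{
    /* Two synthetic threads that hammer the same region and one that does
     * not: expect threads 0 and 1 in one cluster, thread 2 in another. */
    struct thread_sig t[3] = {{0}};
    t[0].counts[7] = 100; t[1].counts[7] = 90; t[2].counts[42] = 80;
    int k = cluster_threads(t, 3, 0.5);
    printf("%d clusters: %d %d %d\n", k, t[0].cluster, t[1].cluster, t[2].cluster);
    return 0;
}

Any similarity measure over per-thread region vectors would serve; cosine similarity is used here only because it ignores differences in absolute sample counts between threads.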
  8. Visualization of Clusters
     ● SPECjbb 2000
     ● 4 warehouses, 16 threads per warehouse
     ● Threads have been sorted by cluster for visualization
     [Figure: sharing-intensity map (none/low/medium/high); y-axis: threads in groups of 16, x-axis: memory regions across the virtual address space]
  9. Visualization of Clusters
     (same as slide 8)
  10. Performance Results
      [Figure: two-chip system, each chip with a 1.9 MB L2 cache, a 36 MB L3 cache, and 4 GB of memory]
      ● Multithreaded commercial workloads
        ● RUBiS, VolanoMark, SPECjbb2k
      ● 8-way IBM POWER5 Linux system
        ● 22%, 32%, 70% reduction in stalls caused by cross-chip accesses
        ● 7%, 5%, 6% performance improvement
      ● 32-way IBM POWER5+ Linux system
        ● 14% SPECjbb2k potential improvement
  11. #2 – Provide Isolation
      Problem: Major disadvantage of shared caches: cache space interference
      Solution: Provide cache space isolation between applications
      OS Actions: Enforce isolation during physical page allocation
      View: Partition into smaller private caches
      [Figure: Apache and MySQL competing for the shared cache]
  12. #2 – Provide Isolation (same text as slide 11)
  13. #2 – Provide Isolation (same text as slide 11)
  14. #2 – Provide Isolation (same text as slide 11; figure adds a partition boundary between Apache and MySQL)
  15. #2 – Provide Isolation (same text as slide 11; figure adds a partition boundary between Apache and MySQL)
  16. Cache Partitioning [WIOSCA'07]
      ● Apply page-coloring technique
      ● Guide physical page allocation to control cache line usage
      ● Works on existing processors
      [Figure: an application's virtual pages mapped by the OS to physical pages of Color A, which map via the fixed hardware mapping to the Color A group of N sets in the L2 cache]
  17. Cache Partitioning [WIOSCA'07]
      (same bullets as slide 16)
      [Figure: Application A mapped to Color A pages and Application B to Color B pages, so each occupies a disjoint group of N sets in the L2 cache]
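The page-coloring mechanism on the two slides above can be summarized in a few lines of C: a physical frame's color is its frame number modulo the number of colors, where the number of colors is the number of cache sets divided by the number of sets a single page spans, and the OS simply restricts each application's physical page allocation to its assigned colors. The cache geometry constants below are assumed for illustration and are not the POWER5 parameters from the talk.

/* Minimal page-coloring sketch; geometry constants are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE      4096u                /* bytes per page (assumed) */
#define LINE_SIZE      128u                 /* bytes per cache line (assumed) */
#define CACHE_SIZE     (2u * 1024 * 1024)   /* shared L2 size (assumed) */
#define ASSOCIATIVITY  8u                   /* ways (assumed) */

#define NUM_SETS       (CACHE_SIZE / (LINE_SIZE * ASSOCIATIVITY))
#define SETS_PER_PAGE  (PAGE_SIZE / LINE_SIZE)
#define NUM_COLORS     (NUM_SETS / SETS_PER_PAGE)   /* colors the OS can hand out */

/* A physical frame's color: which group of cache sets its lines land in.
 * This mapping is fixed by the hardware; the OS controls only which frames
 * (and therefore which colors) an application receives. */
static unsigned page_color(uint64_t phys_addr)
{
    uint64_t frame = phys_addr / PAGE_SIZE;
    return (unsigned)(frame % NUM_COLORS);
}

/* Page-allocator check: is this frame inside the application's partition
 * of colors [first, first + count)? */
static int frame_allowed(uint64_t phys_addr, unsigned first, unsigned count)
{
    unsigned c = page_color(phys_addr);
    return c >= first && c < first + count;
}

int main(void)
{
    printf("%u sets, %u sets per page, %u colors\n",
           NUM_SETS, SETS_PER_PAGE, NUM_COLORS);
    printf("frame at 0x12345000 has color %u, allowed in [0,8): %d\n",
           page_color(0x12345000ull), frame_allowed(0x12345000ull, 0, 8));
    return 0;
}

Because the color is a function of the physical frame number alone, no hardware support is needed: partition sizes are changed simply by changing which colors the allocator will give an application.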
  18. Impact of Partitioning
      [Figure: performance of mcf and art versus L2 cache size (0-16 colors), compared against running without isolation]
      Performance of Other Combos
      ● 10 pairs of applications: SPECcpu2k, SPECjbb2k
      ● 4% to 17% improvement (36MB L3 cache)
      ● 28%, 50% improvement (no L3 cache)
  19. Provisioning the Cache
      Problem: How to determine cache partition size
      Solution: Use L2 cache miss rate curve (MRC) of application
      Criteria: Obtain MRC rapidly, accurately, online, with low overhead, on existing hardware
      OS Actions: Monitor L2 cache accesses using hardware performance counters
      [Figure: example MRC for Application X: miss rate (%) versus allocated cache size (%)]
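One way to read "use the MRC to determine partition size": given an MRC for each co-running application, pick the split of colors that minimizes their combined misses. The sketch below does this with an exhaustive scan over splits; the additive miss objective and the dummy curves are simplifying assumptions for illustration, not the exact policy from the thesis.

/* Illustrative provisioning sketch: choose the color split minimizing the
 * two applications' combined misses, given each one's MRC. */
#include <stdio.h>

#define NCOLORS 16   /* total colors in the shared cache (assumed) */

/* misses_a[k] = misses (e.g. MPKI) of application A when given k colors. */
static int best_split(const double misses_a[NCOLORS + 1],
                      const double misses_b[NCOLORS + 1])
{
    int best_k = 1;
    double best_cost = misses_a[1] + misses_b[NCOLORS - 1];
    for (int k = 2; k < NCOLORS; k++) {      /* give k colors to A, the rest to B */
        double cost = misses_a[k] + misses_b[NCOLORS - k];
        if (cost < best_cost) {
            best_cost = cost;
            best_k = k;
        }
    }
    return best_k;
}

int main(void)
{
    /* Dummy monotonically decreasing MRCs, purely for demonstration. */
    double a[NCOLORS + 1], b[NCOLORS + 1];
    for (int k = 0; k <= NCOLORS; k++) {
        a[k] = 40.0 / (k + 1);     /* A benefits strongly from more cache */
        b[k] = 10.0 - 0.2 * k;     /* B is fairly insensitive */
    }
    int k = best_split(a, b);
    printf("give %d colors to A, %d to B\n", k, NCOLORS - k);
    return 0;
}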
  20. RapidMRC [ASPLOS'09]
      Design
      ● Upon every L2 access:
        ● Update sampling register with data address
        ● Trigger interrupt to copy register to trace log in main memory
      ● Feed trace log into Mattson's stack algorithm [1970] to obtain L2 MRC
      Results
      ● Workloads: 30 apps from SPECcpu2k, SPECcpu2k6, SPECjbb2k
      ● Latency: 227 ms to generate online L2 MRC
      ● Accuracy: good, e.g. up to 27% performance improvement when applied to cache partitioning
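The trace-to-MRC step can be illustrated with a straightforward version of Mattson's stack algorithm: each access's reuse (stack) distance is the number of distinct cache lines touched since the last reference to that line, and an access misses in any cache smaller than that distance, so a histogram of distances yields the whole miss rate curve in one pass over the trace. The sketch below is a conceptual O(N * stack depth) version; RapidMRC's online implementation, including its handling of set-associativity and trace-log limits, is more involved.

/* Conceptual sketch of Mattson's stack algorithm over a trace of cache-line
 * addresses; sizes and the synthetic trace are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINES (1 << 16)              /* distinct lines tracked (assumed bound) */

static uint64_t stack[MAX_LINES];        /* LRU stack: stack[0] is most recent */
static unsigned depth = 0;

static unsigned long hist[MAX_LINES + 1];   /* hist[d] = accesses at stack distance d */
static unsigned long cold = 0;              /* first-time (infinite-distance) refs */

static void access_line(uint64_t line)
{
    unsigned d;
    for (d = 0; d < depth; d++)          /* find position on the LRU stack */
        if (stack[d] == line)
            break;
    if (d == depth) {                    /* not on the stack: cold reference */
        cold++;
        if (depth < MAX_LINES) depth++;
        else d = MAX_LINES - 1;          /* stack full: oldest entry falls off */
    } else {
        hist[d + 1]++;                   /* reuse distance = d + 1 distinct lines */
    }
    memmove(&stack[1], &stack[0], d * sizeof(stack[0]));   /* move line to front */
    stack[0] = line;
}

/* Miss ratio of a fully associative LRU cache holding `capacity` lines:
 * every access whose reuse distance exceeds the capacity is a miss. */
static double miss_ratio(unsigned capacity, unsigned long total_accesses)
{
    unsigned long hits = 0;
    for (unsigned d = 1; d <= capacity && d <= MAX_LINES; d++)
        hits += hist[d];
    return total_accesses ? (double)(total_accesses - hits) / total_accesses : 0.0;
}

int main(void)
{
    /* Tiny synthetic "trace log" of line addresses, for illustration only. */
    uint64_t trace[] = {1, 2, 3, 1, 2, 3, 4, 1};
    unsigned long n = sizeof(trace) / sizeof(trace[0]);
    for (unsigned long i = 0; i < n; i++)
        access_line(trace[i]);
    for (unsigned c = 1; c <= 4; c++)
        printf("capacity %u lines: miss ratio %.2f\n", c, miss_ratio(c, n));
    return 0;
}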
  21. Accuracy of RapidMRC
      ● Execution slice at 10 billion instructions
      [Figure: miss rate (MPKI) versus cache size (# colors) for jbb2k, gzip, mgrid, mcf, xalancbmk, and ammp]
  22. Effectiveness on Provisioning
      [Figure: performance of twolf and equake versus L2 cache size (0-16 colors), comparing RapidMRC, the real MRC, and no isolation]
      Performance of Other Combos Using RapidMRC
      ● 12% improvement for vpr+applu
      ● 14% improvement for ammp+3applu
  23. Contributions
      On commodity multicores, first to demonstrate
      ● Mechanism: To detect data sharing online & automatically cluster threads
      ● Benefits: Promoting sharing [EuroSys'07]
      ● Mechanism: To partition shared cache by applying page-coloring
      ● Benefits: Providing isolation [WIOSCA'07]
      ● Mechanism: To approximate L2 MRCs online in software
      ● Benefits: Provisioning the cache [ASPLOS'09]
      ...all performed by the OS.
  24. Concluding Remarks
      Demonstrated Performance Improvements
      ● Promoting Sharing
        ● 5% – 7%: SPECjbb2k, RUBiS, VolanoMark (2 chips)
        ● 14% potential: SPECjbb2k (8 chips)
      ● Providing Isolation
        ● 4% – 17%: 8 combos: SPECcpu2k, SPECjbb2k (36MB L3 cache)
        ● 28%, 50%: 2 combos: SPECcpu2k (no L3 cache)
      ● Provisioning the Cache Online
        ● 12% – 27%: 3 combos: SPECcpu2k
      OS should manage on-chip shared caches
  25. Thank You
  26. 24-9=15 slides
  27. Future Research Opportunities
      Shared cache management principles can be applied to other layers:
      ● Application, managed runtime, virtual machine monitor
      Promoting sharing
      ● Improve locality on NUMA multiprocessor systems
      Providing isolation
      ● Finer granularity, within one application [MICRO'08]
        ● Regions
        ● Objects
      RapidMRC
      ● Online L2 MRCs
        ● Reducing energy
        ● Guiding co-scheduling
      ● Underlying Tracing Mechanism
        ● Trace other hardware events
