thread-clustering
Upcoming SlideShare
Loading in...5
×
 

thread-clustering

on

  • 778 views

Presentation slides from EuroSys 2007.

Presentation slides from EuroSys 2007.

Statistics

Views

Total Views
778
Views on SlideShare
777
Embed Views
1

Actions

Likes
2
Downloads
16
Comments
0

1 Embed 1

https://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

thread-clustering thread-clustering Presentation Transcript

  • Thread Clustering: Sharing-Aware Thread Scheduling on SMP-CMP-SMT Multiprocessors David Tam, Reza Azimi, Michael Stumm University of Toronto {tamda, azimi, stumm}@eecg.toronto.edu Thread Clustering
  • Multiprocessors Today Example: IBM Power 5 system 1 Thread Clustering
  • Multiprocessors Today Example: IBM Power 5 system SMP CMP SMT SHARED CACHE 1 Thread Clustering
  • Multiprocessors Today Example: IBM Power 5 system Disparity in L2 latencies 120 14 1 Thread Clustering
  • Operating Systems Today CPU Schedulers: Ignore disparity in L2 latencies ● Ignore data sharing among threads ● Distribute threads poorly ● Cross-chip traffic ● Remote L2 cache accesses ● Causes performance problem ● 2 Thread Clustering
  • Our Goal: Sharing-Aware Scheduling Detect sharing patterns ● Cluster threads ● Benefits: Decrease cross-chip traffic ● Increase on-chip cache locality ● Exploit shared L2 caches ● 3 Thread Clustering
  • Our Online Technique STEPS: 1) Monitor remote cache access rate 2) Detect thread sharing patterns REPEAT 3) Determine thread clusters 4) Migrate thread clusters 4 Thread Clustering
  • Sharing Detection To observe remote cache accesses: ● Exploit HPCs (hardware performance counters) ● Sample remote cache miss addresses ● Local cache misses satisfied by remote cache ● IBM Power 5 continuous data sampling ● 1 X 5 Thread Clustering
  • Sharing Detection To observe remote cache accesses: ● Exploit HPCs (hardware performance counters) ● Sample remote cache miss addresses ● Local cache misses satisfied by remote cache ● IBM Power 5 continuous data sampling ● X 2 5 Thread Clustering
  • Sharing Detection To observe remote cache accesses: ● Exploit HPCs (hardware performance counters) ● Sample remote cache miss addresses ● Local cache misses satisfied by remote cache ● IBM Power 5 continuous data sampling ● 3 5 Thread Clustering
  • Sharing Signatures Construct for each thread ● Counts remote cache accesses ● Conceptually virtual address virtual address 264 0 block 8-bit counter 6 Thread Clustering
  • Sharing Signatures Construct for each thread ● Counts remote cache accesses ● Conceptually virtual address virtual address 264 0 ctri++ block 8-bit counter 6 Thread Clustering
  • Optimizations CPU: Temporal Sampling ● Sample every Nth remote cache access ● Memory: Spatial Sampling ● 256-entry vector ● Hash function ● Block ID filter ● Vectors still effective at indicating sharing ● 7 Thread Clustering
  • Spatial Sampling Hash collision & alias removal ● Filter Legend Empty Reserved Block ID 255 0 255 0 8 Thread Clustering
  • Spatial Sampling Hash collision & alias removal ● Filter Legend Empty Reserved hash EMPTY Block ID 255 0 255 0 8 Thread Clustering
  • Spatial Sampling Hash collision & alias removal ● Filter Legend Empty Reserved hash (First-Come-First-Reserved) Block ID 255 0 255 0 8 Thread Clustering
  • Spatial Sampling Hash collision & alias removal ● Filter Legend Empty Reserved hash MATCH Block ID Block ID 255 0 255 0 8 Thread Clustering
  • Spatial Sampling Hash collision & alias removal ● Filter Legend Empty Reserved hash MISMATCH Block ID Block ID 255 0 ALIASING PREVENTED 255 0 8 Thread Clustering
  • Automated Clustering Clustering Heuristic: Simple, one-pass algorithm ● Compare vector against existing clusters ● If not similar, create a new cluster ● Similarity Metric: N ∑ V1[i] * V2[i] i=0 Shared blocks amplified ● Non-shared blocks nullified ● 9 Thread Clustering
  • Experimental Platform 8-way Power 5, 1.5GHz ● Linux 2.6 ● IBM J2SE 5.0 JVM ● 1.9MB L2 1.9MB L2 36MB 36MB 4 GB 4 GB 10 Thread Clustering
  • Workloads Microbenchmark expect 4 clusters ● 4 threads per cluster ● SPECjbb2000 (modified) expect 2 clusters ● 2 warehouses, 8 threads per warehouse ● RUBiS + MySQL expect 2 clusters ● 2 databases, 16 threads per database ● VolanoMark chat server expect 2 clusters ● 2 rooms, 8 threads per room ● 11 Thread Clustering
  • Visualizing Clusters Counter Values An example 255 ● 128 64 0 { Cluster A, 4 vectors { Cluster B, 4 vectors 12 Thread Clustering
  • Visualizing Clusters Counter Values An example 255 ● 128 64 0 { Cluster A, 4 vectors { Cluster B, 4 vectors 12 Thread Clustering
  • Visualizing Clusters Counter Values An example 255 ● 128 64 0 { Cluster A, 4 vectors { Cluster B, 4 vectors 12 Thread Clustering
  • Visualizing Clusters Microbenchmark ● { 4 vectors 13 Thread Clustering
  • Visualizing Clusters Modified SPECjbb2000 (4 warehouses) ● { 16 vectors 14 Thread Clustering
  • Visualizing Clusters RUBiS + MySQL (2 databases) ● { 24 vectors 15 Thread Clustering
  • Visualizing Clusters VolanoMark (4 rooms) ● 16 Thread Clustering
  • Remote Cache Impact Normalized to default Linux ● 90 70 72 43 32 22 9 2 -17 17 Thread Clustering
  • Performance Impact IPC: instructions per cycle ● Normalized to default Linux ● 7.4 7.4 7.1 6.1 6.1 5.1 5.0 3.7 -0.8 18 Thread Clustering
  • Summary AFTER: BEFORE: Operating System With Current Operating Systems Thread Clustering 19 Thread Clustering
  • Conclusions HPCs can detect sharing ● Sharing signatures are effective ● Automated thread clustering: ● Reduces remote cache access up to 70% ● Improves performance up to 7% ● All with low overhead ● Future Work: More workloads ● Improve clustering algorithm ● Integration with load-balancing aspects ● 20 Thread Clustering
  • Thread Clustering
  • Sampling Overhead Modified SPECjbb2000 ● Thread Clustering