
Hardware-aware thread scheduling: the case of asymmetric multicore processors


Talk given at ICPADS 2012 in Singapore.


  1. Hardware-aware thread scheduling: the case of asymmetric multicore processors
     Achille Peternier*, Danilo Ansaloni, Daniele Bonetta, Cesare Pautasso and Walter Binder
     * achille.peternier@usi.ch
     http://sosoa.inf.unisi.ch
  2. Introduction: CONTEXT AND OVERALL IDEA
  3. Context
     • Modern CPUs increase computational power through additional cores
     • HW architectures are becoming increasingly complex
       – Shared caches
       – Non-Uniform Memory Access (NUMA)
       – Single Instruction Multiple Data (SIMD) registers
       – Simultaneous MultiThreading (SMT) units
  4. Context
     • The Operating System (OS) kernel and scheduler try to automatically optimize applications’ performance according to the available resources
       – Based on the underlying HW
       – Using a limited set of performance indicators (CPU time, memory usage, etc.)
  5. “Today it is impossible to estimate performance: you have to measure it. Programming has become an empirical science.”
     Performance Anxiety: Performance analysis in the new millennium, Joshua Bloch, Google Inc.
  6. Contributions
     1) Automated workload analysis technique relying on a specific set of performance metrics that are currently not used by common OS schedulers
     2) Hardware-aware optimized scheduler making decisions based on hardware resource usage and the output of the workload analysis
        - to improve processing unit occupancy on SMT/asymmetric processors
  7. The big picture [diagram labels: Monitoring daemon, FPU, INT, Workload characterization, OS threads and processes]
  8. The big picture [diagram labels: Hardware-aware scheduler, FPU, INT, Workload characterization]
  9. Target architecture: AMD BULLDOZER PROCESSOR
  10. AMD Bulldozer
      • AMD Bulldozer architecture
        – Each CPU is implemented as a series of modules (a.k.a. “cores”), each with two cores (a.k.a. “processing units” or “SMT units”)
        – Each SMT unit has its own dedicated Arithmetic-Logic Units (ALUs)
        – A module is more similar to:
          • A dual core when doing integer ops
          • A single core with SMT=2 when doing floating point ops
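
The module layout described on this slide can be modelled in a few lines. The sketch below assumes that the two cores of a module carry consecutive logical ids (0 and 1 in module 0, 2 and 3 in module 1, and so on). That numbering is an assumption made for illustration; on a real machine the topology should be read from the OS or from a library such as hwloc. All helper names are hypothetical.

```python
CORES_PER_MODULE = 2  # a Bulldozer module holds two integer cores that share one FPU


def module_of(core_id: int) -> int:
    """Return the module index of a logical core (assumes consecutive numbering)."""
    return core_id // CORES_PER_MODULE


def sibling_of(core_id: int) -> int:
    """Return the other core of the same module, i.e. the one sharing the FPU."""
    # Flipping the lowest bit maps 0<->1, 2<->3, ... (valid for two cores per module).
    return core_id ^ 1


def shares_fpu(core_a: int, core_b: int) -> bool:
    """True when two distinct cores sit in the same module."""
    return core_a != core_b and module_of(core_a) == module_of(core_b)
```

Under this numbering, two floating-point-heavy threads pinned to cores 0 and 1 compete for one FPU, while the same threads on cores 0 and 2 do not.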
  11. AMD Bulldozer
  12. AMD Bulldozer
  13. AMD Bulldozer
  14. WORKLOAD CHARACTERIZATION
  15. Workload characterization
      • Used to identify the processes and threads that are floating-point intensive
        – Among the X most active threads
          • (where X = the number of cores available)
      • Based on a real-time monitoring system using Hardware Performance Counters (HPCs)
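
The selection step on this slide (keep the X busiest threads, then order them by floating-point intensity) can be sketched as follows. The `ThreadLoad` record, its field names and the `rank_threads` helper are hypothetical and chosen only for this illustration. The slides do not say how BulldOver stores its samples.

```python
from typing import NamedTuple


class ThreadLoad(NamedTuple):
    tid: int
    cpu_cycles: int   # cycles consumed during the last sampling interval
    fpu_share: float  # fraction of those cycles with FPU activity, in [0, 1]


def rank_threads(loads, num_cores):
    """Keep the num_cores busiest threads, most FPU-intensive first."""
    busiest = sorted(loads, key=lambda t: t.cpu_cycles, reverse=True)[:num_cores]
    return sorted(busiest, key=lambda t: t.fpu_share, reverse=True)
```

Restricting the ranking to as many threads as there are cores keeps mostly idle helper threads out of the scheduling decision.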
  16. …about HPCs…
      • Registers embedded into processors to keep track of hardware-related events such as cache misses, number of CPU cycles, branch mispredictions, etc.
      • Very low overhead (about 1%)
      • Extremely accurate
      • Limited resources: only a few of them can be used at the same time
        – This has so far limited their wide adoption on a large scale
      • HW-specific
  17. Workload characterization
      • HPCs used:
        – PERF_COUNT_HW_CPU_CYCLES: measures the total number of CPU cycles consumed by a thread during its execution time
        – CYCLES_FPU_EMPTY: keeps track of the number of CPU cycles during which the floating point units are not being used by a thread during its execution time
        – L2_CACHE_MISSES: counts the number of L2 cache misses generated by a thread during its execution time
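
One way the three counters above could be combined into per-thread metrics is sketched below. The formulas (FPU utilization as one minus the share of empty-FPU cycles, and L2 misses per thousand cycles) and the 0.5 classification threshold are assumptions made for illustration. The slides name the counters but do not give the exact metric or threshold used.

```python
def fpu_utilization(cpu_cycles: int, fpu_empty_cycles: int) -> float:
    """Fraction of a thread's cycles during which the FPU was not empty."""
    if cpu_cycles <= 0:
        return 0.0
    busy = max(cpu_cycles - fpu_empty_cycles, 0)  # guard against counter skew
    return busy / cpu_cycles


def l2_misses_per_kilocycle(cpu_cycles: int, l2_misses: int) -> float:
    """L2 cache misses normalized to 1000 CPU cycles."""
    if cpu_cycles <= 0:
        return 0.0
    return 1000.0 * l2_misses / cpu_cycles


def classify(cpu_cycles: int, fpu_empty_cycles: int, threshold: float = 0.5) -> str:
    """Label a thread "FPU" or "INT"; the threshold is an illustrative guess."""
    if fpu_utilization(cpu_cycles, fpu_empty_cycles) >= threshold:
        return "FPU"
    return "INT"
```

Normalizing by cycles makes threads with different run times comparable within the same sampling interval.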
  18. MONITORING AND SCHEDULING INFRASTRUCTURE DESIGN
  19. BulldOver design
      • Bulldozer Overseer -> BulldOver
      • Client-server architecture
  20. BulldOver design
      • Server
        – Daemon
        – Scans the underlying architecture
        – Time-based HPC monitoring (once per sec)
          • We target scientific workloads; short-lived threads are not well suited
        – Applies scheduling policies
        – libHpcOverseer, hwloc, libpfm
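
The time-based monitoring described for the server can be sketched as a loop that reads cumulative counters once per interval and hands per-thread deltas to a callback. This is a minimal sketch, not the BulldOver implementation: `read_counters` is a hypothetical callable standing in for whatever the daemon obtains through libpfm, and the `rounds` and `sleep` parameters exist only so the loop can be bounded and tested.

```python
import time


def monitor(read_counters, on_sample, interval_s=1.0, rounds=3, sleep=time.sleep):
    """Call on_sample with per-thread counter deltas once per interval.

    read_counters() is a hypothetical callable returning
    {tid: {counter_name: cumulative_value}}.
    """
    previous = read_counters()
    for _ in range(rounds):
        sleep(interval_s)
        current = read_counters()
        deltas = {}
        for tid, counters in current.items():
            # A thread seen for the first time has no earlier reading.
            before = previous.get(tid, {})
            deltas[tid] = {
                name: value - before.get(name, 0)
                for name, value in counters.items()
            }
        # Threads that terminated since the last reading simply drop out.
        on_sample(deltas)
        previous = current
```

Working on deltas rather than cumulative values is what lets the classification follow a thread whose behavior changes between intervals.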
  21. BulldOver design
      • Client
        – Command-line tool
          • prompt> bulldover java myprogram
        – Traces the creation/termination of threads/processes
        – Shares information with the server through shared memory
        – libmonitor, boost
  22. BulldOver design [diagram label: User space]
  23. EVALUATION
  24. Testing environment
      • Dell PowerEdge M915
        – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8 modules each)
          • Limited to 1 CPU with 8 cores/4 modules
        – Test limited to a single NUMA node
          • Avoiding latencies and other well-known NUMA-related effects
        – Turbo mode and freq. scaling disabled
  25. Benchmark suites
      • SPEC CPU 2006
        – Perfect match for evaluating Integer vs. Floating point behaviors
      • SciMark 2.0
        – Java based
        – Noisy environment (additional threads for garbage collection, JIT, etc.)
        – Mainly FPU-oriented, with different levels of stress
        – Modified multi-threaded version running several random benchmarks over a thread pool
  26. Workload characterization: SPEC CPU 2006 [chart series: Empty FPU Cycles, Total CPU Cycles]
  27. Workload characterization: SciMark 2.0 [chart series: Empty FPU Cycles, Total CPU Cycles]
  28. FPU usage and caches
  29. Results for SPEC CPU 2006
      Running 4x Int and 4x FPU benchmarks on a single NUMA node (4 modules/8 cores)
      [chart series: Inefficient baseline, Improved scheduling, Default OS scheduling]
  30. Discussion
      • BulldOver avoids the worst-case scenario
        – The default OS scheduler is not aware of the workload characterization
      • Benefits come both from improved cache usage AND from better FPU/Integer unit occupancy
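
The worst case mentioned above is two FPU-intensive threads sharing one module. A simple placement policy that avoids it is sketched below: the heaviest FPU users each get a module of their own, and the lightest FPU users are paired with them on the sibling cores. This is one plausible policy under stated assumptions, not necessarily the one BulldOver applies. The function name is hypothetical, and cores 2*m and 2*m + 1 are assumed to belong to module m.

```python
def plan_placement(fpu_share_by_tid, num_modules):
    """Map thread ids to logical cores, spreading FPU-heavy threads over modules.

    fpu_share_by_tid: {tid: fraction of cycles with FPU activity}.
    Assumes cores 2*m and 2*m + 1 form module m; extra threads are ignored.
    """
    ranked = sorted(fpu_share_by_tid, key=fpu_share_by_tid.get, reverse=True)
    ranked = ranked[: 2 * num_modules]
    placement = {}
    for slot, tid in enumerate(ranked):
        if slot < num_modules:
            # The heaviest FPU users each take the first core of their own module.
            placement[tid] = 2 * slot
        else:
            # Remaining threads fill the sibling cores in reverse order, so the
            # lightest FPU user ends up next to the heaviest one.
            module = 2 * num_modules - 1 - slot
            placement[tid] = 2 * module + 1
    return placement
```

On Linux, a plan like this could be applied per thread with `os.sched_setaffinity(tid, {core})`. With four integer and four floating-point benchmarks on four modules, every module then hosts exactly one thread of each kind.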
  31. Results for SciMark 2.0
      Running 8x benchmarks that change randomly over time on a single NUMA node (4 modules/8 cores)
      [chart series: Default OS scheduling, Improved scheduling]
  32. Discussion
      • All the threads are FPU-intensive
        – But at different levels
      • Still a reasonable speedup “for free”
      • Dynamic adaptation, since the FPU usage intensity varies over time
        – BulldOver reacts accordingly
  33. Conclusions
      - We show that thread scheduling that is not aware of the shared HW resources available on the AMD Bulldozer processor can incur a significant performance penalty
      - We presented a monitoring system that is able to characterize the most active threads according to their FPU/Integer usage
      - Thanks to the real-time analysis, improved scheduling can be applied and performance improved
      - Our system is minimally intrusive:
        - Low overhead (below 2%)
        - No kernel patching required
        - No code instrumentation
        - Works on any application
  34. Conclusions
      • Currently tuned for a specific HW architecture
      • Good for scientific workloads
        – A sampling interval is required (1 sec in our case; it could be shorter but can’t be 0…)
      • Based on a very simple scheduling policy
        – More sophisticated policies could be used
  35. Thanks!
      Achille Peternier
      achille.peternier@usi.ch
      http://sosoa.inf.unisi.ch
  36. “Pow7Over”
      • Work in progress on IBM Power7 processors
        – 1 CPU, 8 cores, up to 4 SMT units per core
        – Completely different…
          • …operating system: RHEL 6.3
          • …architecture: PowerPC
          • …HPCs: IBM-specific ones (more than 500 available…)
          • …compiler: autotools 6.0
      • Similar approach
      • Slightly less significant speedup
        – But this is a full SMT
        – Similar overall behavior both for the PUs and L2 caches
