Hardware-aware thread scheduling: the
case of asymmetric multicore processors
    Achille Peternier*, Danilo Ansaloni, Daniele Bonetta,
             Cesare Pautasso and Walter Binder

                 * achille.peternier@usi.ch
                  http://sosoa.inf.unisi.ch
Introduction

CONTEXT AND OVERALL IDEA


Context
• Modern CPUs increase the computational
  power through additional cores
• HW architectures are becoming increasingly complex
  – Shared caches
  – Non Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units


Context
• Operating System (OS) kernel and scheduler
  try to automatically optimize applications’
  performance according to the available
  resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU
    time, memory usage, etc.)



“Today it is impossible to estimate performance:
you have to measure it. Programming has become
an empirical science.”

 Performance Anxiety: Performance analysis in the new millennium
                                       Joshua Bloch, Google Inc.




Contributions
1) Automated workload analysis technique relying on a
specific set of performance metrics that are currently not
used by common OS schedulers


2) Hardware-aware optimized scheduler making
decisions based on hardware resource usage and the
output of the workload analysis
       - to improve processing-unit occupancy on
       SMT/asymmetric processors
The big picture
[Figure: a monitoring daemon observes the OS threads and processes and
characterizes their workload as FPU- or INT-intensive.]
The big picture
[Figure: the hardware-aware scheduler consumes the FPU/INT workload
characterization to drive its placement decisions.]
Target architecture

AMD BULLDOZER PROCESSOR


AMD Bulldozer
• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules
    (a.k.a. “cores”), each containing two cores
    (a.k.a. “processing units” or “SMT units”)
  – Integer Arithmetic-Logic Units (ALUs) are private
    to each SMT unit, whereas the floating point unit
    is shared within a module
  – A module therefore behaves like:
     • A dual core when doing integer ops
     • A single core with SMT=2 when
       doing floating point ops

AMD Bulldozer
[Three figure-only slides: block diagrams of the Bulldozer module
architecture.]
WORKLOAD CHARACTERIZATION


Workload characterization

• Used to identify which of the running processes and
  threads are floating point intensive (a ranking
  sketch follows this slide)
  – Among the X most active threads
     • (where X = the number of available cores)


• Based on a realtime monitoring system using
  Hardware Performance Counters (HPCs)

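
A minimal sketch of how such a ranking could be computed, assuming the
monitoring daemon has already collected per-thread samples of the two
cycle counters (the struct and function names below are hypothetical,
not BulldOver's actual code):

    #include <stdlib.h>

    /* Hypothetical per-thread sample collected over one interval. */
    typedef struct {
        int    tid;          /* OS thread id                          */
        double cpu_cycles;   /* PERF_COUNT_HW_CPU_CYCLES in interval  */
        double fpu_empty;    /* CYCLES_FPU_EMPTY in interval          */
    } thread_sample;

    /* FPU intensity: fraction of cycles in which the FPU was busy. */
    static double fpu_intensity(const thread_sample *s)
    {
        return s->cpu_cycles > 0.0 ? 1.0 - s->fpu_empty / s->cpu_cycles : 0.0;
    }

    /* Descending order: the first X entries (X = number of cores)
     * are the most floating point intensive threads. */
    static int by_intensity(const void *a, const void *b)
    {
        double ia = fpu_intensity(a), ib = fpu_intensity(b);
        return (ia < ib) - (ia > ib);
    }

    void rank_threads(thread_sample *samples, size_t n)
    {
        qsort(samples, n, sizeof *samples, by_intensity);
    }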
…about HPCs…
• Registers embedded into processors to keep track
  of hardware-related events such as cache misses,
  number of CPU cycles, branch mispredictions,
  etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resource: only a few of them can be
  used at the same time
  – This has limited their large-scale adoption so far
• HW-specific

Workload characterization
• HPCs used (a sampling sketch follows this slide):
  – PERF_COUNT_HW_CPU_CYCLES: measures the
    total number of CPU cycles consumed by a thread
    during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number
    of CPU cycles the floating point units are not being
    used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2
    cache misses generated by a thread during its
    execution time

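
For illustration, this is roughly how a per-thread counter is opened and
read on Linux through the perf_event interface (a hedged sketch, not
BulldOver's code; the two AMD-specific events above are raw events whose
encodings would typically be obtained through libpfm):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Open a counter for total CPU cycles, attached to one thread.
     * CYCLES_FPU_EMPTY and L2_CACHE_MISSES would be opened the same
     * way, with type = PERF_TYPE_RAW and an AMD-specific encoding. */
    static int open_cycles_counter(pid_t tid)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type   = PERF_TYPE_HARDWARE;
        attr.size   = sizeof attr;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        /* pid = tid, cpu = -1: follow this thread on any CPU. */
        return (int) syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
    }

    /* Read the current value (the daemon samples it once per second). */
    static uint64_t read_counter(int fd)
    {
        uint64_t value = 0;
        if (read(fd, &value, sizeof value) != sizeof value)
            return 0;
        return value;
    }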
MONITORING AND SCHEDULING
INFRASTRUCTURE DESIGN

BulldOver design
• Bulldozer Overseer -> BulldOver
• Client-server architecture




BulldOver design
• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per second)
     • We target scientific workloads; short-lived
       threads are not a good fit
  – Applies scheduling policies (see the affinity
    sketch after this slide)
  – Built on libHpcOverseer, hwloc, libpfm


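
The policy applied by the daemon ultimately translates into CPU-affinity
decisions. A minimal sketch on Linux, assuming the module/SMT-unit
numbering is known from the hwloc topology scan (the helper below is
illustrative, not BulldOver's actual policy code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin a thread to a single SMT unit (logical CPU). On Bulldozer a
     * module exposes two SMT units sharing one FPU, so an FP-intensive
     * thread can be given a module of its own by pinning it to one unit
     * and keeping the sibling unit for integer work. */
    static int pin_thread_to_cpu(pid_t tid, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(tid, sizeof set, &set);
    }

For example, pairing one FP-intensive and one integer-intensive thread
per module avoids two threads competing for the same shared FPU (the
pairing is an assumption for illustration; the deck only states that a
simple policy is used).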
BulldOver design
• Client
  – Command-line tool
     • prompt> bulldover java myprogram
  – Traces the creation/termination of
    threads/processes
  – Shares information with the server through
    shared memory (see the sketch after this slide)
  – Built on libmonitor, boost


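
A minimal sketch of such a client-to-server channel using a POSIX
shared-memory segment (names and layout are hypothetical; the real
implementation relies on boost):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define MAX_THREADS 256

    /* Hypothetical layout of the segment shared with the daemon. */
    typedef struct {
        int count;
        int tids[MAX_THREADS];  /* threads created by the launched program */
    } thread_table;

    /* Client side: map the segment created by the server. */
    static thread_table *attach_table(const char *name)
    {
        int fd = shm_open(name, O_RDWR, 0);
        if (fd < 0)
            return 0;
        void *p = mmap(0, sizeof(thread_table),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? 0 : (thread_table *) p;
    }

    /* Called from the client's thread-creation hook (e.g. via libmonitor). */
    static void publish_thread(thread_table *t, int tid)
    {
        if (t && t->count < MAX_THREADS)
            t->tids[t->count++] = tid;   /* real code needs synchronization */
    }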
BulldOver design
[Figure: client/server architecture diagram; all BulldOver components
run in user space.]
EVALUATION


Testing environment
• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8
    modules each)
     • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
     • Avoiding latencies and other well-known
       NUMA-related effects
  – Turbo mode and freq. scaling disabled


Benchmark suites
• SPEC CPU 2006
  – Perfect match for evaluating Integer vs. Floating point
    behaviors

• SciMark 2.0
  – Java based
  – Noisy environment (additional threads for garbage
    collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several
    random benchmarks over a thread-pool

Workload characterization
SPEC CPU 2006
[Figure: per-benchmark breakdown of empty FPU cycles vs. total CPU cycles.]
Workload characterization
SciMark 2.0
[Figure: per-benchmark breakdown of empty FPU cycles vs. total CPU cycles.]
FPU usage and caches
[Figure-only slide.]
Results for SPEC CPU 2006
[Figure: 4x integer and 4x FPU benchmarks running on a single NUMA node
(4 modules/8 cores); compares an inefficient baseline, the default OS
scheduling, and the improved scheduling.]
Discussion
• BulldOver avoids the worst case scenario
  – The default OS scheduler is not aware of the
    workload characterization
• Benefits come both from improved cache usage
  AND from better FPU/integer unit occupancy




Results for SciMark 2.0
[Figure: 8x benchmarks whose behavior changes randomly over time,
running on a single NUMA node (4 modules/8 cores); compares the default
OS scheduling with the improved scheduling.]
Discussion
• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage
  intensity varies over time
  – BulldOver reacts accordingly




Conclusions
- We show how thread scheduling not aware of the shared
  HW resources available on the AMD Bulldozer processor
  can incur a significant performance penalty
- We presented a monitoring system that is able to
  characterize the most active threads according to their
  FPU/Integer usage
- Thanks to the realtime analysis, improved scheduling can
  be applied and performance improved
- Our system is minimally intrusive:
   -   Low overhead (below 2%)
   -   No kernel patching required
   -   No code instrumentation
   -   Works on any application

Conclusions
• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case;
    it could be shorter, but it cannot be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used




Thanks!




   Achille Peternier
                       achille.peternier@usi.ch
                       http://sosoa.inf.unisi.ch

“Pow7Over”
• Work in progress on IBM Power7 processors
   – 1 CPU, 8 cores, up to 4 SMT units per core
   – Completely different…
      •   …operating system: RHEL 6.3
      •   …architecture: PowerPC
      •   …HPCs: IBM-specific ones (more than 500 available…)
      •   …compiler: autotools 6.0
• Similar approach
   – Slightly smaller speedup
      • But the Power7 is a full SMT design
   – Similar overall behavior both for the PUs and L2 caches
