Hardware-aware thread scheduling: the
case of asymmetric multicore processors
    Achille Peternier*, Danilo Ansaloni, Daniele Bonetta,
             Cesare Pautasso and Walter Binder

                 * achille.peternier@usi.ch
                  http://sosoa.inf.unisi.ch
Introduction

CONTEXT AND OVERALL IDEA


Context
• Modern CPUs increase the computational
  power through additional cores
• HW architectures are becoming increasingly complex
  – Shared caches
  – Non Uniform Memory Access (NUMA)
  – Single Instruction Multiple Data (SIMD) registers
  – Simultaneous MultiThreading (SMT) units


Context
• Operating System (OS) kernel and scheduler
  try to automatically optimize applications’
  performance according to the available
  resources
  – Based on the underlying HW
  – Using a limited set of performance indicators (CPU
    time, memory usage, etc.)



“Today it is impossible to estimate performance:
you have to measure it. Programming has become
an empirical science.”

 Performance Anxiety: Performance analysis in the new millennium
                                       Joshua Bloch, Google Inc.




Contributions
1) Automated workload analysis technique relying on a
specific set of performance metrics that are currently not
used by common OS schedulers


2) Hardware-aware optimized scheduler making
decisions based on hardware resource usage and the
output of the workload analysis
       - to improve processing-unit occupancy on
       SMT/asymmetric processors
The big picture
[Figure: a monitoring daemon observes the OS threads and processes and
characterizes their workload as FPU- or INT-intensive.]
The big picture
[Figure: the hardware-aware scheduler consumes the FPU/INT workload
characterization to drive its placement decisions.]
Target architecture

AMD BULLDOZER PROCESSOR


AMD Bulldozer
• AMD Bulldozer architecture
  – Each CPU is implemented as a series of modules
    (a.k.a. “cores”), each containing two cores
    (a.k.a. “processing units” or “SMT units”)
  – Integer Arithmetic-Logic Units (ALUs) are private
    to each SMT unit, whereas the floating point unit
    is shared within a module
  – A module therefore behaves like:
     • A dual core when doing integer ops
     • A single core with SMT=2 when
       doing floating point ops

AMD Bulldozer
[Three figure-only slides: block diagrams of the Bulldozer module
architecture.]
WORKLOAD CHARACTERIZATION


Workload characterization

• Used to identify which of the running processes and
  threads are floating point intensive (a ranking
  sketch follows this slide)
  – Among the X most active threads
     • (where X = the number of available cores)


• Based on a realtime monitoring system using
  Hardware Performance Counters (HPCs)

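
A minimal sketch of how such a ranking could be computed, assuming the
monitoring daemon has already collected per-thread samples of the two
cycle counters (the struct and function names below are hypothetical,
not BulldOver's actual code):

    #include <stdlib.h>

    /* Hypothetical per-thread sample collected over one interval. */
    typedef struct {
        int    tid;          /* OS thread id                          */
        double cpu_cycles;   /* PERF_COUNT_HW_CPU_CYCLES in interval  */
        double fpu_empty;    /* CYCLES_FPU_EMPTY in interval          */
    } thread_sample;

    /* FPU intensity: fraction of cycles in which the FPU was busy. */
    static double fpu_intensity(const thread_sample *s)
    {
        return s->cpu_cycles > 0.0 ? 1.0 - s->fpu_empty / s->cpu_cycles : 0.0;
    }

    /* Descending order: the first X entries (X = number of cores)
     * are the most floating point intensive threads. */
    static int by_intensity(const void *a, const void *b)
    {
        double ia = fpu_intensity(a), ib = fpu_intensity(b);
        return (ia < ib) - (ia > ib);
    }

    void rank_threads(thread_sample *samples, size_t n)
    {
        qsort(samples, n, sizeof *samples, by_intensity);
    }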
…about HPCs…
• Registers embedded into processors to keep track
  of hardware-related events such as cache misses,
  number of CPU cycles, branch mispredictions,
  etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resource: only a few of them can be
  used at the same time
  – This has limited their large-scale adoption so far
• HW-specific

Workload characterization
• HPCs used (a sampling sketch follows this slide):
  – PERF_COUNT_HW_CPU_CYCLES: measures the
    total number of CPU cycles consumed by a thread
    during its execution time
  – CYCLES_FPU_EMPTY: keeps track of the number
    of CPU cycles the floating point units are not being
    used by a thread during its execution time
  – L2_CACHE_MISSES: counts the number of L2
    cache misses generated by a thread during its
    execution time

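
For illustration, this is roughly how a per-thread counter is opened and
read on Linux through the perf_event interface (a hedged sketch, not
BulldOver's code; the two AMD-specific events above are raw events whose
encodings would typically be obtained through libpfm):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Open a counter for total CPU cycles, attached to one thread.
     * CYCLES_FPU_EMPTY and L2_CACHE_MISSES would be opened the same
     * way, with type = PERF_TYPE_RAW and an AMD-specific encoding. */
    static int open_cycles_counter(pid_t tid)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type   = PERF_TYPE_HARDWARE;
        attr.size   = sizeof attr;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        /* pid = tid, cpu = -1: follow this thread on any CPU. */
        return (int) syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
    }

    /* Read the current value (the daemon samples it once per second). */
    static uint64_t read_counter(int fd)
    {
        uint64_t value = 0;
        if (read(fd, &value, sizeof value) != sizeof value)
            return 0;
        return value;
    }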
MONITORING AND SCHEDULING
INFRASTRUCTURE DESIGN

BulldOver design
• Bulldozer Overseer -> BulldOver
• Client-server architecture




BulldOver design
• Server
  – Daemon
  – Scans the underlying architecture
  – Time-based HPC monitoring (once per second)
     • We target scientific workloads; short-lived
       threads are not a good fit
  – Applies scheduling policies (see the affinity
    sketch after this slide)
  – Built on libHpcOverseer, hwloc, libpfm


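
The policy applied by the daemon ultimately translates into CPU-affinity
decisions. A minimal sketch on Linux, assuming the module/SMT-unit
numbering is known from the hwloc topology scan (the helper below is
illustrative, not BulldOver's actual policy code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin a thread to a single SMT unit (logical CPU). On Bulldozer a
     * module exposes two SMT units sharing one FPU, so an FP-intensive
     * thread can be given a module of its own by pinning it to one unit
     * and keeping the sibling unit for integer work. */
    static int pin_thread_to_cpu(pid_t tid, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(tid, sizeof set, &set);
    }

For example, pairing one FP-intensive and one integer-intensive thread
per module avoids two threads competing for the same shared FPU (the
pairing is an assumption for illustration; the deck only states that a
simple policy is used).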
BulldOver design
• Client
  – Command-line tool
     • prompt> bulldover java myprogram
  – Traces the creation/termination of
    threads/processes
  – Shares information with the server through
    shared memory (see the sketch after this slide)
  – Built on libmonitor, boost


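
A minimal sketch of such a client-to-server channel using a POSIX
shared-memory segment (names and layout are hypothetical; the real
implementation relies on boost):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define MAX_THREADS 256

    /* Hypothetical layout of the segment shared with the daemon. */
    typedef struct {
        int count;
        int tids[MAX_THREADS];  /* threads created by the launched program */
    } thread_table;

    /* Client side: map the segment created by the server. */
    static thread_table *attach_table(const char *name)
    {
        int fd = shm_open(name, O_RDWR, 0);
        if (fd < 0)
            return 0;
        void *p = mmap(0, sizeof(thread_table),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? 0 : (thread_table *) p;
    }

    /* Called from the client's thread-creation hook (e.g. via libmonitor). */
    static void publish_thread(thread_table *t, int tid)
    {
        if (t && t->count < MAX_THREADS)
            t->tids[t->count++] = tid;   /* real code needs synchronization */
    }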
BulldOver design
[Figure: client/server architecture diagram; all BulldOver components
run in user space.]
EVALUATION


Testing environment
• Dell PowerEdge M915
  – 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8
    modules each)
     • Limited to 1 CPU with 8 cores/4 modules
  – Test limited to a single NUMA node
     • Avoiding latencies and other well-known
       NUMA-related effects
  – Turbo mode and freq. scaling disabled


Benchmark suites
• SPEC CPU 2006
  – Perfect match for evaluating Integer vs. Floating point
    behaviors

• SciMark 2.0
  – Java based
  – Noisy environment (additional threads for garbage
    collection, JIT, etc.)
  – Mainly FPU-oriented, with different levels of stress
  – Modified multi-threaded version running several
    random benchmarks over a thread-pool

Workload characterization
SPEC CPU 2006
[Figure: per-benchmark breakdown of empty FPU cycles vs. total CPU cycles.]
Workload characterization
SciMark 2.0
[Figure: per-benchmark breakdown of empty FPU cycles vs. total CPU cycles.]
FPU usage and caches
[Figure-only slide.]
Results for SPEC CPU 2006
[Figure: 4x integer and 4x FPU benchmarks running on a single NUMA node
(4 modules/8 cores); compares an inefficient baseline, the default OS
scheduling, and the improved scheduling.]
Discussion
• BulldOver avoids the worst case scenario
  – The default OS scheduler is not aware of the
    workload characterization
• Benefits come both from improved cache usage
  AND from better FPU/integer unit occupancy




Results for SciMark 2.0
[Figure: 8x benchmarks whose behavior changes randomly over time,
running on a single NUMA node (4 modules/8 cores); compares the default
OS scheduling with the improved scheduling.]
Discussion
• All the threads are FPU-intensive
  – But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage
  intensity varies over time
  – BulldOver reacts accordingly




Conclusions
- We show how thread scheduling not aware of the shared
  HW resources available on the AMD Bulldozer processor
  can incur a significant performance penalty
- We presented a monitoring system that is able to
  characterize the most active threads according to their
  FPU/Integer usage
- Thanks to the realtime analysis, improved scheduling can
  be applied and performance improved
- Our system is minimally intrusive:
   -   Low overhead (below 2%)
   -   No kernel patching required
   -   No code instrumentation
   -   Works on any application

Conclusions
• Currently tuned for a specific HW architecture
• Good for scientific workloads
  – A sampling interval is required (1 sec in our case;
    it could be shorter, but it cannot be 0…)
• Based on a very simple scheduling policy
  – More sophisticated policies could be used




Thanks!




   Achille Peternier
                       achille.peternier@usi.ch
                       http://sosoa.inf.unisi.ch

“Pow7Over”
• Work in progress on IBM Power7 processors
   – 1 CPU, 8 cores, up to 4 SMT units per core
   – Completely different…
      •   …operating system: RHEL 6.3
      •   …architecture: PowerPC
      •   …HPCs: IBM-specific ones (more than 500 available…)
      •   …compiler: autotools 6.0
• Similar approach
   – Slightly smaller speedup
      • But the Power7 is a full SMT design
   – Similar overall behavior both for the PUs and L2 caches
