"Fast Switching of Threads Between Cores" is a published research paper on operating systems. This is our attempt to decode the research and present it to the class.
Fast switching of threads between cores - Advanced Operating Systems
1. Fast Switching of Threads
Between Cores
Richard Strong & Dean Tullsen (University of California, San Diego)
Jayaram Mudigonda, Jeffrey C. Mogul & Nathan Binkert (HP Labs)
Ruhaim Izmeth | MS14901218
Nipuna Pannala | MS14902208
2. Introduction
● We are now in the MULTICORE era.
● Multicore CPUs enable inter-core communication at a far lower cost, in order-of-magnitude terms, than traditional multiprocessors. [This reduces the time the hardware needs to move a migrating thread's working set.]
● But the software cost of moving a thread remains high.
3. Asymmetric Multicore Processor
● Core-to-core performance asymmetry appears to be a very useful way to improve energy and area efficiency.
● Relatively little performance cost, but greater throughput per watt.
● Asymmetric multicore processors increase the need for frequent, very efficient migration of threads between cores.
4. Fast Switching of Threads between
Cores
● To get good performance when switching threads between cores:
○ The OS scheduler needs to migrate a thread from a slow core to a fast or ideal core.
○ It is also necessary to balance the load between cores (in a symmetric or asymmetric system).
○ All thread execution time segments should be relatively short.
5. Simple Cores…
● Simple cores can normally be a better match for memory-bound application code.
○ Operating systems and OS-like code are typical memory-bound workloads.
6. Thread Migration Techniques
● Migration Mechanism 1 : Constantinou
○ This mechanism considers a variety of costs associated with thread migration, but its primary focus is the warm-up a thread needs after moving (caches and branch predictors).
○ It does not address the software cost of migrating threads between cores.
7. Thread Migration Techniques
● Migration Mechanism 2 : Choi
○ This mechanism covers the specific case of migrating the branch predictor state when a thread switches cores.
○ It does not address the software overhead issues.
8. Thread Migration Techniques
Shared Thread Multiprocessor: Brown & Tullsen
● Hardware manages thread movement.
● Thread state is represented in hardware and shared among all the cores on a chip.
● Therefore hardware can move threads between cores without direct OS involvement.
9. Software Approaches to Core Switching
● Is core B in the IDLE state?
● Is there any thread to run on core A after T switches to B?
● Can we ensure T is the most appropriate thread to run on B?
● Transfer the architectural state of thread T from A to B.
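The questions on this slide can be sketched as a small model. This is only an illustration, not the paper's kernel code: the `Core` class, `can_switch`, and `switch_core` are hypothetical names, and "transferring architectural state" is reduced to handing over the thread's name. The test for whether T is the best thread for B (nothing else queued on B) is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Core:
    """Hypothetical model of a core: one running thread plus a run queue."""
    name: str
    running: Optional[str] = None               # None means the core is idle
    run_queue: List[str] = field(default_factory=list)


def can_switch(src: Core, dst: Core) -> Dict[str, bool]:
    """Answer the three questions for moving the running thread from src to dst."""
    return {
        "target_idle": dst.running is None,
        "source_has_other_work": len(src.run_queue) > 0,
        # Assumption: T is the best candidate for dst if nothing else waits there.
        "thread_is_best_for_target": len(dst.run_queue) == 0,
    }


def switch_core(thread: str, src: Core, dst: Core) -> bool:
    """Move `thread` from src to dst if the target core is idle."""
    if src.running != thread or not can_switch(src, dst)["target_idle"]:
        return False
    # Stand-in for transferring the architectural state from A to B.
    dst.running = thread
    # The source core picks up its next runnable thread, or goes idle.
    src.running = src.run_queue.pop(0) if src.run_queue else None
    return True
```

For example, with thread T running on core A and thread N waiting in A's run queue, `switch_core("T", a, b)` leaves N running on A and T running on B.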
10. Approaches used in the research
● V1: Linux’s thread-migration mechanism
● V2: Modified scheduler
● V3: Scheduler fast-paths
● V4: Addressing IPI costs
● V5: Cross-core wakeup from quiesce
11. V1: Linux Thread Migration Mechanism
● Normally used for relatively long-term load balancing across the cores.
● Linux's thread-migration mechanism is the state of the art in core switching.
● One thread is available to initiate the migration.
12. V1: Linux Thread Migration Mechanism
● When a task wants to migrate, it puts itself on the per-core migration queue.
● If the target core is idle, the thread wakes up from the per-core migration queue and moves to the run queue of the target core.
● After getting approval from the target queue, the thread executes on the target core.
13. V1: Linux Thread Migration Mechanism
Cons...
● This migration approach involves an "extra" context switch between the initiating thread and the migrating thread.
14. Linux Thread Migration Mechanism
Increase Efficiency
● To remove the extra context switch:
○ Let threads make migration decisions themselves
○ Centralize the thread status
○ Increase the number of per-core queues
○ Create cross-core signals
15. V2: Modified scheduler
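[Figure: Core 0 with a run queue holding threads N and T; Core 1 with an Alternative Queue (AQ) and a run queue. Thread T calls SwitchCore(), its control block is marked "Core : 1", T is placed on Core 1's AQ, and an interrupt triggers schedule() on Core 1, which moves T to its run queue. Steps are numbered 1 to 7.]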
● Removes the extra context switch described in V1.
● Thread migration is initiated by the process itself.
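A toy simulation may help make the Alternative Queue idea concrete. It is a sketch under stated assumptions, not the modified Linux scheduler: the `Core` class, its fields, and `switch_core` are hypothetical, the cross-core interrupt is modelled as a boolean flag, and the choice to drain the AQ into the run queue at the start of `schedule()` is an assumption.

```python
from collections import deque


class Core:
    """Hypothetical core model with a run queue and an Alternative Queue (AQ)."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.run_queue = deque()        # threads runnable on this core
        self.alt_queue = deque()        # AQ: threads arriving from other cores
        self.current = None             # thread currently running, None if idle
        self.pending_interrupt = False  # stand-in for a cross-core interrupt

    def schedule(self):
        """Drain the AQ into the run queue, then pick the next thread."""
        while self.alt_queue:
            self.run_queue.append(self.alt_queue.popleft())
        if self.current is not None:
            self.run_queue.append(self.current)  # requeue a preempted thread
        self.current = self.run_queue.popleft() if self.run_queue else None
        return self.current

    def handle_interrupt(self):
        """The interrupt causes schedule() to run on the target core."""
        if self.pending_interrupt:
            self.pending_interrupt = False
            self.schedule()


def switch_core(thread, src, dst):
    """Called by the running thread itself, so no helper thread is involved."""
    if src.current != thread:
        raise ValueError("only the running thread may initiate its own switch")
    src.current = None              # the thread leaves the source core
    dst.alt_queue.append(thread)    # hand it over through the target's AQ
    dst.pending_interrupt = True    # signal the target core
    src.schedule()                  # source core moves on to its next thread
    dst.handle_interrupt()          # target core reacts to the signal
```

With T and N queued on core 0 and T running, `switch_core("T", core0, core1)` ends with N running on core 0 and T running on core 1, and no separate migration thread is scheduled in between.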
16. V3: Scheduler fast-paths
The scheduler now comes in three versions:
● The original modified schedule()
● A fast schedule source version (FSS), called to initiate a core switch
● A fast schedule target version (FST), called at the target core in response to the cross-core signal
FSS and FST omit a number of housekeeping functions normally done in schedule() (e.g. priority calculation).
FSS only passes a hint to FST, so no locking takes place.
FST checks the AQ; FSS does not.
18. V4: Addressing inter-processor
interrupt (IPI) costs
Inter-processor interrupts are sent to "wake up" polling or paused processors.
The modified scheduler wakes up the target core if it is idle.
The "IPI sending code" was modified to be more efficient, as it sends interrupts to all members of a specified set.
schedule() is invoked on the target core by the interrupt.
19. Modified System Calls
Long-running system calls were modified to initiate CoreSwitch().
Modified system calls: open, stat, read, write, readv, writev, select, poll, fsync, fdatasync, recvfrom, sendto and sendfile.
4096 bytes
20. Simulation Environment
The M5 simulator is used to generate detailed timelines showing when interesting events such as procedure calls, cache misses, and long-latency instructions occur.
The x86 models in M5 are not debugged, so Alpha-based cores are modelled.
Complex core: Alpha EV6 (21264), 64KB L1
Simple core: EV4-based (21064), 8KB L1
Shared L2 cache: 3.5 MB
Main-memory access time: 25 ns
21. Simulation Environment -
Configuration naming scheme
sim_XXX: the number of letters after "sim_" denotes the number of processors; each letter gives one core's type and clock speed.
e.g.:
sim_c - single processor
sim_sC - dual processor

Core type    750 MHz    3 GHz
Complex      c          C
Simple       s          S
Tests run on Linux v 2.6.18 kernel
Only one trial run per experiment, as the
simulator is deterministic
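The naming scheme can be decoded mechanically. The helper below is a hypothetical illustration written for these notes (`parse_config` and `CORE_TYPES` are not part of M5 or the paper); it only applies the letter table above, where the letter gives the core type and its case gives the clock speed.

```python
# Letter -> core type; case -> clock speed (lower = 750 MHz, upper = 3 GHz).
CORE_TYPES = {"c": "complex", "s": "simple"}


def parse_config(name):
    """Decode a name such as 'sim_sC' into a list of (core type, MHz) pairs."""
    prefix = "sim_"
    if not name.startswith(prefix) or len(name) == len(prefix):
        raise ValueError(f"not a configuration name: {name!r}")
    cores = []
    for letter in name[len(prefix):]:
        kind = CORE_TYPES.get(letter.lower())
        if kind is None:
            raise ValueError(f"unknown core letter: {letter!r}")
        freq_mhz = 3000 if letter.isupper() else 750
        cores.append((kind, freq_mhz))
    return cores


print(parse_config("sim_c"))   # [('complex', 750)]
print(parse_config("sim_sC"))  # [('simple', 750), ('complex', 3000)]
```

So sim_sC describes a two-core system that pairs a 750 MHz simple core with a 3 GHz complex core.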
23. Cross-core wakeup from quiesce
● Idle-loop polling is inefficient.
● Initiating a cross-CPU interrupt is slow, as a powered-down CPU needs to be awakened.
● The kernel should dynamically decide between spinning (polling) and powering down, based on recent history.
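One way to picture a history-based decision is the sketch below. It is purely illustrative: the slides do not say how the kernel should decide, so the `IdlePolicy` class, the moving-average rule, the 50 microsecond threshold, and the window of 8 samples are all assumptions chosen for the example.

```python
from collections import deque


class IdlePolicy:
    """Hypothetical policy: poll if recent idle periods were short, else power down."""

    def __init__(self, threshold_us=50.0, history=8):
        self.threshold_us = threshold_us      # assumed break-even idle time
        self.recent = deque(maxlen=history)   # sliding window of idle durations

    def record_idle(self, duration_us):
        """Remember how long the core stayed idle before its last wakeup."""
        self.recent.append(duration_us)

    def decide(self):
        """Return 'poll' or 'power_down' based on the average recent idle time."""
        if not self.recent:
            return "poll"                     # no history yet: stay responsive
        average = sum(self.recent) / len(self.recent)
        return "poll" if average < self.threshold_us else "power_down"
```

The intuition is that short idle gaps favour polling, because the wakeup latency of a powered-down core would dominate, while long gaps favour powering down to save energy.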
25. Macrobenchmark results -
Database Benchmark
Using “TPC-B-like” example from the Berkeley DB
distribution
Core switch done only on fdatasync()
Eliminated disk I/O delays by using a RAM disk on the
real hardware, and by setting the access time to zero
in M5’s disk simulator.
26. Future Work
● Energy measurement/savings benchmarks
for the above tests
● Determining the best core to switch to and the best time to switch
● An optimal mechanism for choosing whether to poll or power down a processor
27. Summary
● The cost of core switching matters more when using asymmetric multicores.
● Core switching to slower OS cores on frequent, expensive system calls sometimes reduces performance.
○ But it also provides the opportunity to power down the complex application cores.
28. References
● J. Aas. Understanding the Linux 2.6.8.1 CPU Scheduler. http://josh.trancesoftware.com/linux/, Feb. 2005.
● S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai. The Impact of Performance Asymmetry in Emerging Multicore Architectures. In Proc. ISCA, pages 506–517, 2005.
● M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. J. Instruction Level Parallelism, pages 1–26, June 2008.
● N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52–60, 2006.
● D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proc. ISCA, pages 83–94, Jun. 2000.