This document presents an overview of Remote Core Locking (RCL), a locking technique designed to reduce cache misses and access contention in multithreaded applications running on multicore systems. RCL migrates the execution of critical sections to a dedicated "server core" to avoid these issues. The approach has three main phases: profiling an application to determine which locks are candidates for RCL, re-engineering the code to extract critical sections into separate functions, and running them on the RCL runtime. The runtime uses a "request array" to queue client requests and a "service thread" on the server core to execute critical sections remotely, and it employs several techniques to ensure liveness and responsiveness. An evaluation shows that RCL improves performance and reduces cache misses compared to other lock algorithms.
Presentation Outline
1. Introduction
2. Motivation
3. RCL
3.1 Core algorithm
3.2 Profiling
3.3 Re-engineering
3.4 RCL runtime implementation
4. Evaluation
4.1 Comparison with other locks
4.2 Comparison of application performance
4.3 Locality analysis
5. Related work
6. References
1. Introduction
The lock algorithm of a multithreaded application
is a key factor in scaling up performance in the
multicore world.
Remote Core Locking (RCL) is a newly invented
locking technique that reduces cache misses and
access contention simultaneously.
3. RCL
The main idea of RCL:
- Requests from the client cores are entered into a
request array.
- A remote server core executes the critical section
(CS) and returns the result to the client core.
3.1 Core algorithm
Fig. 3: The request array [1]
"Service thread“(ST) of the server core, searches for the non-NULL 3rd
element of each request over and over again. If it finds a non-Null 3rd
element and the requested lock is free, executes the critical section
using function pointer and context.
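The server loop above can be sketched as follows. This is a minimal single-pass model of the core algorithm, not the real RCL runtime: the names (RequestSlot, serve_once, rcl_post) are illustrative, and the client/server spinning is driven by the caller rather than by dedicated cores.

```python
import threading

class RequestSlot:
    """One slot per client core, holding the triple from the request array:
    the lock to acquire, the context (shared variables), and the function
    pointer whose non-None value marks a pending request."""
    def __init__(self):
        self.lock = None
        self.context = None
        self.func = None   # the "3rd element": non-None means request pending

def serve_once(request_array):
    """One scan pass of the service thread: for each slot with a non-None
    function whose lock is free, run the critical section on the server and
    signal completion by clearing the function pointer."""
    for slot in request_array:
        func = slot.func
        if func is not None and not slot.lock.locked():
            with slot.lock:                        # lock is free: take it
                slot.context = func(slot.context)  # execute the CS remotely
            slot.func = None                       # completion signal

def rcl_post(slot, lock, func, context):
    """Client side: publish the request. The function pointer is written
    last, since it is what the service thread polls on."""
    slot.lock, slot.context = lock, context
    slot.func = func
```

A client would then spin until `slot.func` becomes None and read the result from `slot.context`; here a single `serve_once` pass plays the role of the service thread.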
3.2 Profiling
The profiler is a dynamically loaded library that
intercepts the application's calls involving POSIX
locks, condition variables, threads, etc.
From the extracted information it determines
which locks can be improved by using RCL.
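A rough sketch of the kind of measurement the profiler makes, assuming it can wrap lock operations (the real tool interposes on POSIX lock calls via a dynamically loaded library; ProfiledLock and rcl_candidates are illustrative names, and the selection threshold is an assumption):

```python
import threading
import time

class ProfiledLock:
    """A lock wrapper that records how often it is acquired and how long
    it is held, the raw data for choosing RCL candidates."""
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.time_in_cs = 0.0   # total time spent inside the critical section

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.time_in_cs += time.perf_counter() - self._start
        self._lock.release()

def rcl_candidates(locks, min_acquisitions=1000):
    """Frequently taken, long-held locks are the ones RCL can improve:
    keep the hot locks and rank them by time spent in their CS."""
    hot = [l for l in locks if l.acquisitions >= min_acquisitions]
    return sorted(hot, key=lambda l: l.time_in_cs, reverse=True)
```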
3.3 Re-engineering
The re-engineering tool extracts the critical-section
code into a separate function.
Such a function receives the values of the
variables the critical section uses and returns
their updated values.
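A hand-worked example of this transformation, on a made-up counter-update critical section (the deposit functions are illustrative; the real tool performs the extraction automatically and hands the extracted function to the RCL runtime by pointer):

```python
import threading

# Before: the critical section is inline under a lock.
def deposit_inline(lock, balance, amount):
    with lock:
        balance = balance + amount   # critical section
    return balance

# After: the critical-section body is a separate function that receives
# the variables it uses and returns their updated values.
def deposit_cs(context):
    balance, amount = context
    return balance + amount          # same critical section, now remotely callable

def deposit_rcl(execute_remotely, balance, amount):
    # execute_remotely stands in for the RCL client stub that posts
    # (lock, context, function) to the request array and waits for the result.
    return execute_remotely(deposit_cs, (balance, amount))
```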
3.4 RCL runtime implementation
It is difficult to ensure liveness and
responsiveness using only a single server thread,
because the servicing thread may:
(i) be blocked by the operating system;
(ii) spin while acquiring a spin lock, handling a
nested RCL, or executing some form of ad hoc
synchronization;
(iii) be preempted by the operating system when
its time slice runs out or on a page fault.
The RCL runtime includes a "management thread"
(MT) that is responsible for keeping RCL live by
managing the ST pool.
The MT wakes up at a given frequency; while
active it runs at the highest priority.
The MT checks a global flag that indicates whether
the ST has made progress since the MT's last
activation.
If the flag has not been updated, the MT considers
the ST to be waiting or blocked and adds a free
ST to the service thread pool.
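The flag-checking logic can be sketched as below. This is a simplified model, assuming a single progress flag and a pool counter (ServiceThreadPool and ManagementThread are illustrative names; the real MT runs periodically at the highest priority rather than being called explicitly):

```python
class ServiceThreadPool:
    """Stand-in for the ST pool: tracks how many STs are active."""
    def __init__(self):
        self.active = 1
    def add_free_thread(self):
        self.active += 1   # the real runtime wakes or spawns a service thread

class ManagementThread:
    def __init__(self, pool):
        self.pool = pool
        self.st_made_progress = False   # the global flag

    def on_request_served(self):
        # Set by the service thread after each completed critical section.
        self.st_made_progress = True

    def periodic_check(self):
        # Runs at a fixed frequency. If the flag was not set since the last
        # activation, the ST is assumed waiting or blocked: add a free ST.
        if not self.st_made_progress:
            self.pool.add_free_thread()
        self.st_made_progress = False   # re-arm for the next period
```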
Four strategies improve responsiveness:
i) The RCL runtime uses the POSIX FIFO scheduling
policy to prevent thread preemption by the OS
scheduler.
ii) The RCL runtime minimizes the number of STs run
before an unblocked servicing thread is rescheduled,
in order to reduce delay.
iii) When servicing threads are blocked by the
OS, the RCL runtime uses a backup thread with
lower priority than the STs to clear the global flag
and wake up the MT.
iv) When a nested RCL is handled by the same
core, the lock may already be owned by another
servicing thread. In this case the servicing thread
yields without delay, so that the owner of the
lock can release it.
Using the FIFO policy introduces two further problems.
1) FIFO scheduling can cause priority mismatches,
e.g. between the backup thread and an ST, or between
an ST and the MT. This problem is solved by using
only lock-free algorithms in the RCL runtime.
2) When an ST spins in an active wait loop, it will
not be preempted, so a free thread cannot be elected.
In this case the MT detects that the servicing thread
is making no progress, decreases the priority of that
particular ST, and then increases the priorities of
all other STs.
4. Evaluation
Comparison with other locks using a custom microbenchmark
Comparison of application performance
Locality analysis
5. Related work
Attaluri et al. proposed concurrency control
with lock preemption and restoration in 1995 [2].
Abellán et al. proposed the concept of
GLocks [3].
Suleman et al. proposed executing critical sections
on a special fast core of an ACMP, introducing
new instructions to hand over control [4].
Related work contd.
Hendler et al. suggested a software-only
solution called "flat combining", based on
coarse-grained locking [5].
6. References
[1] Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, and Gilles
Muller. Remote core locking: Migrating critical-section execution to improve
the performance of multithreaded applications. In USENIX Annual Technical
Conference, Boston, MA, 2012.
[2] G. K. Attaluri, J. Slonim, and P. Larson. Concurrency control with lock
preemption and restoration. In CASCON '95, 1995.
[3] J. L. Abellán, J. Fernández, and M. E. Acacio. GLocks: Efficient support
for highly-contended locks in many-core CMPs. In 25th IPDPS, 2011.
[4] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating
critical section execution with asymmetric multi-core architectures. In
ASPLOS, pages 253-264, 2009.
[5] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the
synchronization-parallelism tradeoff. In SPAA '10, pages 355-364, 2010.