Linux Synchronization Mechanism: RCU (Read-
Copy-Update)
Adrian Huang | Apr, 2023
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
task 0
task 1
task N read_unlock()
read_lock()
Critical Section
write_unlock()
write_lock()
Readers
Critical Section
[Overview] rwlock (reader-writer spinlock) vs RCU
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Deferred destruction: Safe to free memory
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers
task 0
task 1
task N
Writers/Updaters
(Publisher)
rwlock_t
1. Mutual exclusion between reader and writer
2. Writer might be starved
1. RCU is a non-blocking synchronization mechanism: No mutual exclusion between readers and a writer
2. No specific *lock* data structure
[Overview] rwlock (reader-writer spinlock) vs RCU
* Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
How does RCU achieve concurrent readers and one writer?
(No mutual exclusion between readers and one writer)
rwlock
• Mutual exclusion between reader and writer
• Writer might be starved
RCU
• A non-blocking synchronization mechanism
• No specific *lock* data structure
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
6 important APIs
Readers or
Updaters
Primitive Purpose
Readers
rcu_read_lock() Start an RCU read-side critical section
rcu_read_unlock() End an RCU read-side critical section
rcu_dereference() Safely load an RCU-protected pointer
Updaters
synchronize_rcu() Wait for all pre-existing RCU read-side critical sections to complete
call_rcu() Invoke the specified function after all pre-existing RCU read-side critical sections
complete
rcu_assign_pointer() Safely update an RCU-protected pointer
RCU manipulates pointers
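The table above can be sketched as a toy userspace model of the six APIs. This is an assumption-laden sketch, not the kernel implementation: reader-side APIs are modeled as no-ops (in the kernel they disable/enable preemption), pointer publish/subscribe is modeled with C11 atomics, and the function signatures are simplified (the real `rcu_dereference()`/`rcu_assign_pointer()` are macros taking the pointer as an argument).

```c
/* Toy userspace model of the six RCU APIs (NOT the kernel implementation).
 * Reader-side calls are no-ops here; publish/subscribe uses C11 atomics. */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct foo { int number; };

static _Atomic(struct foo *) shared_abc;

/* Reader side: in the kernel these disable/enable preemption. */
static inline void rcu_read_lock(void)   { /* preempt_disable() in-kernel */ }
static inline void rcu_read_unlock(void) { /* preempt_enable() in-kernel  */ }

/* Subscribe: dependency-ordered load of the RCU-protected pointer. */
static inline struct foo *rcu_dereference_abc(void)
{
    return atomic_load_explicit(&shared_abc, memory_order_consume);
}

/* Publish: release store so readers see a fully initialized structure. */
static inline void rcu_assign_pointer_abc(struct foo *p)
{
    atomic_store_explicit(&shared_abc, p, memory_order_release);
}

/* Grace-period wait: trivially returns in this single-threaded model;
 * the kernel blocks until all pre-existing readers are done. */
static inline void synchronize_rcu_model(void) { }

static int reader_sees(void)
{
    rcu_read_lock();
    struct foo *p = rcu_dereference_abc();
    int v = p ? p->number : -1;
    rcu_read_unlock();
    return v;
}

static int update_and_free(int new_number)
{
    struct foo *p = malloc(sizeof(*p));
    p->number = new_number;
    struct foo *old = atomic_load_explicit(&shared_abc, memory_order_relaxed);
    rcu_assign_pointer_abc(p);   /* removal phase: swap the pointer       */
    synchronize_rcu_model();     /* wait for pre-existing readers         */
    free(old);                   /* reclamation phase (free(NULL) is ok)  */
    return reader_sees();
}
```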
RCU Implementation Overview
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer:
removal phase
Wait for job completion of readers
Safe to free memory:
reclamation phase
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
• RCU is often used as a replacement for reader-writer locking
• Lockless readers
✓ rcu_read_lock() simply disables preemption: No lock mechanism (Ex: spinlock, mutex and so on)
✓ Readers need not wait for updates (non-blocking): low overhead and excellent scalability
Reader
RCU Implementation Overview
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer:
removal phase
Wait for job completion of readers
Safe to free memory:
reclamation phase
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
• [Writer] Waiting for readers
✓ Removal phase: remove references to data items (possibly by replacing them with references to new versions of these
data items)
➢ Can run concurrently with readers: readers see either the old or the new version of the data structure rather than
a partially updated reference.
✓ synchronize_rcu() and call_rcu(): wait for readers exiting critical section
✓ Block or register a callback that is invoked after active readers have completed.
✓ Reclamation phase: reclaim data items.
• Quiescent State (QS): [per-core] The time after pre-existing readers are done
✓ Context switch: a valid quiescent state
• Grace Period (GP): All cores have passed through the quiescent state → Complete a grace period
Typical RCU Update Sequence
RCU: List Manipulation – Old and New data
Head 5 2 9
Allocate/fill a structure 8
Insert to the list
1
2
Head 5 2 9
8
RCU: List Manipulation – Old and New data
Head 5 2 9
8
Remove
1
Head 5 2 9
rcu_read_unlock()
rcu_read_lock()
Critical Section
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
Update or remove data (pointer)
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
task 0
task 1
task N
Readers
(Subscribers)
task 0
task 1
task N
Writers/Updaters
(Publisher)
2
Old data 2 8
New data 2 9
Either the old data or the new data is read by readers.
RCU: List Manipulation – Old and New data
Head 5 2 9
8
Remove
1
Head 5 2 9
2
Old data 2 8
New data 2 9
Either the old data or the new
data is read by readers.
RCU reader Old data is read
New data is read
Legend
* Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
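The list manipulation above can be sketched as userspace C. This is a model, not the kernel's `list_add_rcu()`/`list_del_rcu()`; the point it illustrates is the same: a node is fully initialized *before* its pointer is published, so a concurrent reader sees either the old list or the new list, never a half-built node, and a removed node may only be freed after a grace period.

```c
/* Sketch of RCU-style list insert/remove (userspace model with C11
 * atomics, not the kernel list_add_rcu()/list_del_rcu()). */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int val;
    _Atomic(struct node *) next;
};

static _Atomic(struct node *) head;

/* Insert at head: fill the node first, then publish with a release store. */
static void list_insert_rcu(int val)
{
    struct node *n = malloc(sizeof(*n));
    n->val = val;
    atomic_store_explicit(&n->next,
        atomic_load_explicit(&head, memory_order_relaxed),
        memory_order_relaxed);
    atomic_store_explicit(&head, n, memory_order_release); /* publish */
}

/* Remove the first node matching val by swinging the predecessor's
 * pointer past it. The caller frees it only after a grace period. */
static struct node *list_remove_rcu(int val)
{
    _Atomic(struct node *) *pprev = &head;
    struct node *cur = atomic_load_explicit(pprev, memory_order_consume);
    while (cur) {
        if (cur->val == val) {
            atomic_store_explicit(pprev,
                atomic_load_explicit(&cur->next, memory_order_relaxed),
                memory_order_release);
            return cur;   /* free after synchronize_rcu() */
        }
        pprev = &cur->next;
        cur = atomic_load_explicit(pprev, memory_order_consume);
    }
    return NULL;
}

static int list_count(void)
{
    int n = 0;
    for (struct node *c = atomic_load_explicit(&head, memory_order_consume);
         c; c = atomic_load_explicit(&c->next, memory_order_consume))
        n++;
    return n;
}
```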
Reader/Writer synchronization
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
synchronize_rcu();
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, NULL);
kfree(tmp);
RCU Reader
RCU Writer w/ valid pointer assignment
RCU Writer w/ NULL pointer
assignment and free memory
RCU Spatial/Temporal Synchronization: Sample Code
RCU Reader
RCU Writer/Updater
Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
RCU Spatial/Temporal Synchronization
5,25 9,81
curconfig
Address Space
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (5)
*cur_b = mcp->b; (25)
rcu_read_unlock();
mcp = kmalloc(…);
rcu_assign_pointer(curconfig, mcp);
synchronize_rcu();
…
…
kfree(old_mcp);
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (9)
*cur_b = mcp->b; (81)
rcu_read_unlock();
Grace
Period
Readers
Readers
• Temporal Synchronization
✓ [Reader] rcu_read_lock() / rcu_read_unlock()
✓ [Writer/Update] synchronize_rcu() / call_rcu()
• Spatial Synchronization
• [Reader] rcu_dereference()
• [Writer/Update] rcu_assign_pointer()
Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
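The `curconfig` sequence above can be written as compilable userspace C. This single-threaded sketch stands in for the kernel code: the temporal primitives appear as comments (they have no userspace equivalent here) and the spatial primitives are modeled with C11 release/consume atomics; `struct myconfig` and the function names are made up for illustration.

```c
/* The slide's curconfig sequence as a userspace sketch (single-threaded
 * model; the grace-period wait is a comment, not a real wait). */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct myconfig { int a; int b; };

static _Atomic(struct myconfig *) curconfig;

static void read_config(int *cur_a, int *cur_b)
{
    /* rcu_read_lock();  -- temporal: open the read-side critical section */
    struct myconfig *mcp =
        atomic_load_explicit(&curconfig, memory_order_consume); /* spatial */
    *cur_a = mcp->a;
    *cur_b = mcp->b;
    /* rcu_read_unlock(); -- temporal: close the critical section */
}

static void update_config(int a, int b)
{
    struct myconfig *mcp = malloc(sizeof(*mcp));
    mcp->a = a;                       /* build the new version first... */
    mcp->b = b;
    struct myconfig *old =
        atomic_load_explicit(&curconfig, memory_order_relaxed);
    /* rcu_assign_pointer(): spatial -- publish the fully built copy */
    atomic_store_explicit(&curconfig, mcp, memory_order_release);
    /* synchronize_rcu(): temporal -- wait out pre-existing readers */
    free(old);   /* old is NULL on the first update; free(NULL) is a no-op */
}
```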
Time
RCU Spatial/Temporal Synchronization
5,25 9,81
curconfig
Address Space
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (5)
*cur_b = mcp->b; (25)
rcu_read_unlock();
mcp = kmalloc(…);
rcu_assign_pointer(curconfig, mcp);
synchronize_rcu();
…
…
kfree(old_mcp);
rcu_read_lock();
mcp = rcu_dereference(curconfig);
*cur_a = mcp->a; (9)
*cur_b = mcp->b; (81)
rcu_read_unlock();
Grace
Period
Readers
Readers
RCU combines temporal and spatial synchronization in order to approximate
reader-writer locking
• Temporal Synchronization
✓ [Reader] rcu_read_lock() / rcu_read_unlock()
✓ [Writer/Update] synchronize_rcu() / call_rcu()
• Spatial Synchronization
• [Reader] rcu_dereference()
• [Writer/Update] rcu_assign_pointer()
Reader: rcu_read_lock() & rcu_read_unlock():
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
✓ Simply disable/enable preemption when entering/exiting RCU critical section
• [Why] A QS is detected by a context switch.
Kernel Source Reference: 2.6.24
Reader: rcu_read_lock() & rcu_read_unlock()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
Kernel Source Reference: 2.6.24
• __acquire() / __release(): sparse checker (semantic parser)
✓ Sparse checking of RCU-protected pointers: Add __rcu marker
➢ Reference: PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
Reader: rcu_read_lock() & rcu_read_unlock()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
Kernel Source Reference: 2.6.24
• rcu_read_acquire() / rcu_read_release(): lockdep (runtime locking correctness validator)
✓ Detects programming errors (e.g., deadlock)
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
It’s illegal to block while in an RCU read-side CS (Exception: CONFIG_PREEMPT_RCU=y)
Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
Ensure only one writer
Wait for job completion of readers
Safe to free memory
printk("%d\n", ptr->number);
RCU Reader
RCU Writer w/ valid pointer assignment
rcu_assign_pointer() and rcu_dereference() invocations communicate spatial
synchronization via stores to and loads from the RCU-protected pointer
Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
Reader: rcu_dereference()
rcu_read_unlock()
rcu_read_lock()
ptr = rcu_dereference(shared_abc);
printk("%d\n", ptr->number);
RCU Reader
• Preserve order
• Load -> Load, Load -> Store and Store -> Store
• Might be reordered because of the store buffer
• Store -> Load
Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system
Kernel Source Reference: 2.6.24
Writer: rcu_assign_pointer()
synchronize_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
• Preserve order
• Load -> Load, Load -> Store and Store -> Store
• Might be reordered because of the store buffer
• Store -> Load
Total Store Order (TSO) – x86 Memory Model: x86 is relatively strongly ordered system
RCU Writer
Kernel Source Reference: 2.6.24
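The ordering point above can be made concrete with a sketch of how `rcu_assign_pointer()`/`rcu_dereference()` map onto ordering primitives. This is an approximation using GCC/Clang `__atomic` builtins (the 2.6.24 kernel used `smp_wmb()` and `smp_read_barrier_depends()` instead); the `my_` prefix marks these as illustrative stand-ins. On x86's TSO model, stores are not reordered with stores and loads not with loads, so both mostly reduce to compiler barriers.

```c
/* Sketch: rcu_assign_pointer()/rcu_dereference() as release/consume
 * accesses (illustrative stand-ins, hence the my_ prefix). */
#include <assert.h>

#define my_rcu_assign_pointer(p, v) \
    __atomic_store_n(&(p), (v), __ATOMIC_RELEASE)
#define my_rcu_dereference(p) \
    __atomic_load_n(&(p), __ATOMIC_CONSUME)

struct item { int number; };

static struct item *shared_abc;
static struct item storage;

static int publish_and_read(int n)
{
    storage.number = n;                           /* initialize first... */
    my_rcu_assign_pointer(shared_abc, &storage);  /* ...then publish     */
    struct item *ptr = my_rcu_dereference(shared_abc);
    return ptr->number;
}
```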
Writer: synchronize_rcu() & call_rcu()
synchronize_rcu() or call_rcu()
spin_unlock
spinlock
rcu_assign_pointer(shared_abc, ptr);
Optional: free memory
RCU Writer
• Mark the end of updater code and the beginning of reclaimer code
• [Synchronous] synchronize_rcu()
✓ Block until all pre-existing RCU read-side critical sections on all CPUs have completed
✓ leverages call_rcu()
• [Asynchronous] call_rcu()
✓ Queue a callback for invocation after a grace period
✓ [Scenario]
➢ It’s illegal for the RCU updater to block
➢ Update-side performance is critically important
[Updater] Ensure only one writer: removal phase
Wait for job completion of readers
[Reclaimer] Safe to free memory: reclamation phase
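The asynchronous path can be sketched as a toy callback queue. This is a userspace model, not kernel code: `call_rcu()` only queues a `(head, func)` pair, and the end of a grace period is simulated by an explicit `run_grace_period()` helper (a made-up name) that invokes every queued callback.

```c
/* Toy model of call_rcu(): queue a callback, run it "after a grace
 * period" (simulated by run_grace_period(), an illustrative helper). */
#include <assert.h>
#include <stddef.h>

struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *head);
};

static struct rcu_head *cb_list;

/* Asynchronous: queue the callback and return immediately. */
static void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
    head->func = func;
    head->next = cb_list;
    cb_list = head;
}

/* Stand-in for the end of a grace period: invoke every queued callback. */
static int run_grace_period(void)
{
    int invoked = 0;
    struct rcu_head *h = cb_list;
    cb_list = NULL;
    while (h) {
        struct rcu_head *next = h->next;
        h->func(h);
        invoked++;
        h = next;
    }
    return invoked;
}

static int freed_count;
static void my_rcu_func(struct rcu_head *head) { (void)head; freed_count++; }
```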
Why the name RCU (read-copy update)?
RCU Writer
Create a copy
Why the name RCU (read-copy update)?
RCU Writer
Update the copy
Why the name RCU (read-copy update)?
RCU Writer
Replace the old entry with
the newly created one
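The three steps above (create a copy, update the copy, replace the old entry) can be sketched in userspace C. This is a single-threaded model: the grace-period wait is a comment, and `struct entry`/`rcu_update()` are illustrative names, not kernel APIs.

```c
/* The sequence that gives RCU its name: READ the old version, COPY it,
 * UPDATE the copy, publish the copy, then reclaim the old version after
 * a grace period (stubbed out in this single-threaded sketch). */
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct entry { int key; int val; };

static _Atomic(struct entry *) slot;

static int rcu_update(int new_val)
{
    struct entry *old = atomic_load_explicit(&slot, memory_order_relaxed);
    struct entry *new = malloc(sizeof(*new));
    if (old)
        memcpy(new, old, sizeof(*new));  /* read + copy the old version  */
    else
        new->key = 0;
    new->val = new_val;                  /* update the copy              */
    atomic_store_explicit(&slot, new, memory_order_release); /* replace  */
    /* synchronize_rcu() would go here before freeing the old version */
    free(old);                           /* reclaim (free(NULL) is ok)   */
    return atomic_load_explicit(&slot, memory_order_consume)->val;
}
```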
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
reader
list_del_rcu() synchronize_rcu() kfree()
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
Quiescent State (QS)
• A point in the code where there can be no references held to RCU-protected data
structures, which is normally any point outside of an RCU read-side critical section.
✓ RCU read-side critical section is defined by the range between rcu_read_lock() and
rcu_read_unlock().
Grace Period (GP)
• All threads (cores) pass through at least one quiescent state.
Quotes from the book “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
reader
list_del_rcu() synchronize_rcu() kfree()
RCU: Quiescent State & Grace Period
Removal Reclamation
reader
reader
reader
reader
reader
Old data is read
New data is read
Legend
reader
reader
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
(RCU Updater)
Grace Period
Time
reader
list_del_rcu() synchronize_rcu() kfree()
synchronize_rcu(): How to detect a grace period?
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers - 6 Approaches
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Reference Counter
• Reference counter (shared data) updated by rcu_read_lock() and rcu_read_unlock()
✓ Scalability problem due to cache bouncing
Reference Counter
* Figure reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
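The reference-counter approach can be sketched as follows (a minimal single-threaded model; the `refcnt_` names are made up). It is correct but unscalable: every reader writes the same cache line, which is exactly the cache-bouncing problem the slide names.

```c
/* Approach #1 sketch: readers bump a shared counter on entry and drop it
 * on exit; the updater spins until the counter reaches zero. */
#include <assert.h>
#include <stdatomic.h>

static atomic_int rcu_refcnt;   /* shared: every reader writes this line */

static void refcnt_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void refcnt_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

/* Updater: wait until no reader holds a reference; returns spin count. */
static int refcnt_synchronize(void)
{
    int spins = 0;
    while (atomic_load(&rcu_refcnt) != 0)
        spins++;            /* real code would cpu_relax() or sleep here */
    return spins;
}
```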
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – CPU Register
• Check each CPU’s registers (avoids accessing shared data such as a reference counter)
✓CPUs’ register: Check each CPU’s program counter (PC)
➢The updater polls each relevant PC. If the PC is not within read-side code, the corresponding CPU is within a quiescent state.
➢A complete grace period: all CPUs’ PCs have been observed to be outside of read-side code.
✓Challenges
➢Readers might invoke other functions
➢Code-motion optimization
CPU Register
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Wait for a fixed period time
• Wait long enough to comfortably exceed the lifetime of any reasonable reader
✓An unreasonable reader → issue!
Wait for a fixed period time
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Wait forever
• Accommodate the unreasonable reader.
• Bad reputation: leaking memory
✓ Memory leaks often require untimely and inconvenient reboots.
✓ Worked well in a high-availability cluster where systems were periodically
crashed in order to ensure that the cluster really remained highly available.
Wait forever
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – Avoid the periodic crashes
• Covered by stop-the-world garbage collector
• In today’s always-connected always-on world, stopping the world can
gravely degrade response times.
Avoid the periodic crashes
Approach #1:
Reference
Counter
Approach #2:
CPU Register
Approach #3:
Wait for a fixed
period time
Approach #4:
Wait forever
Approach #5:
Avoid the
periodic crashes
Approach #6: Quiescent-state-
based reclamation (QSBR)
Grace Period: Waiting for readers – QSBR
• Numerous applications already have states (termed quiescent states) that
can be reached only after all pre-existing readers are done.
✓ Transaction-processing application: the time between a pair of successive
transactions might be a quiescent state.
✓ Non-preemptive OS kernel: Context switch can be a quiescent state.
✓ [Non-preemptive OS kernel] RCU readers must be prohibited from blocking
while referencing global data.
QSBR
* Reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
Grace Period: Waiting for readers – QSBR
[Concept] Implementation for non-preemptive Linux kernel
* [Not production quality] sched_setaffinity() function causes the
current thread to execute on the specified CPU, which forces the
destination CPU to execute a context switch.
[Concept] QSBR: non-production and non-preemptible
implementation
Force the destination CPU to execute a context switch: completion of RCU readers on that CPU
Reference: Page #142 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
synchronize_rcu()
rcu_read_lock() & rcu_read_unlock() → disable/enable preemption
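The QSBR idea can be sketched as a per-CPU quiescent-state counter (a userspace, single-threaded model; `note_quiescent_state()` and `grace_period_elapsed()` are illustrative names): `synchronize_rcu()` snapshots every CPU's counter and waits until each one has advanced, meaning each CPU passed through a quiescent state such as a context switch.

```c
/* QSBR sketch: each "CPU" advertises a quiescent-state counter; a grace
 * period has elapsed once every counter has moved past a snapshot. */
#include <assert.h>

#define NCPUS 4

static unsigned long qs_counter[NCPUS];

/* Called at each quiescent state (e.g., from schedule()). */
static void note_quiescent_state(int cpu) { qs_counter[cpu]++; }

static void snapshot(unsigned long snap[NCPUS])
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        snap[cpu] = qs_counter[cpu];
}

/* Returns 1 once every CPU has passed a QS since the snapshot. */
static int grace_period_elapsed(const unsigned long snap[NCPUS])
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (qs_counter[cpu] == snap[cpu])
            return 0;        /* this CPU has not passed a QS yet */
    return 1;
}
```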
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
✓Will discuss implementation details about classic RCU *only*
• RCU Flavors
• RCU Usage Summary & RCU Case Study
Classic RCU (< 2.6.29) and Tree RCU (>= 2.6.29)
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
• Global cpumask: each bit indicates each core
• A grace period is complete if rcu_ctrlblk.cpumask = 0
• Scalability problems arise as the number of cores increases
✓ Lock contention: cores passing a QS contend to update rcu_ctrlblk.cpumask
✓ Cache bouncing due to frequent writes
CPU 0 CPU 1 CPU N
. . .
Classic RCU
• Group cpumask: reduce lock contention
• A grace period is complete if root’s rcu_node->qsmask = 0
• Excellent scalability
Tree RCU
qsmask
struct rcu_node
CPU 2 CPU 3
qsmask
struct rcu_node
CPU M . . .
qsmask
struct rcu_node
Root node
struct rcu_state
Reference code: v2.6.24
Reference code: v2.6.29
Classic RCU (< 2.6.29)
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
• Global cpumask: each bit indicates each core
• Scalability problems arise as the number of cores increases
✓ Lock contention
✓ Cache bouncing due to frequent writes
start a new grace period
rcu_ctrlblk.cpumask=0xffff
schedule(): clear the
corresponding bit in
rcu_ctrlblk.cpumask
rcu_ctrlblk.cpumask=0?
Context switch
Y: Finish a grace period
N: Wait for all CPUs
to pass a QS
High-level concept: Suppose it’s a 16-core system
Kernel Source Reference: 2.6.24
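The 16-core cpumask flow above can be modeled in a few lines (a single-threaded sketch; in the kernel rcu_ctrlblk.lock protects the mask, which this model omits): start a grace period by setting one bit per online CPU, clear a CPU's bit when it passes a QS at a context switch, and declare the grace period complete when the mask reaches zero.

```c
/* High-level model of Classic RCU's global cpumask (16-core example). */
#include <assert.h>
#include <stdint.h>

static uint16_t cpumask;   /* bit i set: CPU i has not yet passed a QS */

static void start_grace_period(void) { cpumask = 0xffff; }

/* Called from schedule() when CPU `cpu` passes a quiescent state. */
static void cpu_quiescent(int cpu) { cpumask &= (uint16_t)~(1u << cpu); }

static int grace_period_complete(void) { return cpumask == 0; }
```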
Classic RCU: Data Structure
rcu_head
struct rcu_head *next
rcu_ctrlblk
func
cur
completed
next_pending
signaled
lock
cpumask
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
Global control block per-cpu
qs_pending: Avoids cacheline thrashing by accessing
this per-cpu flag instead of the shared cpumask bitmap
rcu_head rcu_head
Callback supplied by call_rcu()
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
Kernel Source Reference: 2.6.24
Classic RCU: start_kernel() -> rcu_init()
rcu_ctrlblk
cur = -300
completed = -300
next_pending
signaled
lock
cpumask = 0
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
Global control block:
rcu_ctrlblk
per-cpu: rcu_data
struct rcu_ctrlblk rcu_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
Kernel Source Reference: 2.6.24
struct rcu_ctrlblk rcu_bh_ctrlblk
rcu_ctrlblk.lock
(spinlock protection)
rcu_ctrlblk.cpumask
CPU 0 CPU 1 CPU N
. . .
rcu_ctrlblk
cur = -300
completed = -300
next_pending
signaled
lock
cpumask = 0
Global control block:
rcu_bh_ctrlblk
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
per-cpu: rcu_bh_data
1 2
init percpu data init percpu data
rcu_tasklet per-cpu
rcu_process_callbacks()
3 init rcu_tasklet
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
[Global control block]
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu variable initialized by rcu_init_percpu_data()
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: start_kernel() -> rcu_init(): show “rcu_ctrlblk” only
Classic RCU: call_rcu()
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
call_rcu(&my_rcu_head, my_rcu_func);
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: synchronize_rcu() leverages call_rcu()
Kernel Source Reference: 2.6.24
Classic RCU: Invoke rcu_pending() when a timer interrupt is triggered
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
tasklet_schedule if rdp->donelist is non-empty: not all callbacks have been
invoked yet due to rdp->blimit (limit on callbacks processed per batch)
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 0 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
address of
Kernel Source Reference: 2.6.24
Classic RCU: Check if a CPU passes a QS
rcu_check_callbacks
update_process_times
rcu_pending()?
Y
timer interrupt
Kernel Source Reference: 2.6.24
rcu_qsctr_inc
schedule
Timer interrupt Scheduler: context switch
[Note] “rcu_data.passed_quiesc = 1” does not clear the corresponding rcu_ctrlblk.cpumask
directly. More checks are performed. See later slides.
Classic RCU: first timer interrupt
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
[Global control block]
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
address of
Manipulation
Kernel Source Reference: 2.6.24
Classic RCU
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = rcp->cur + 1 = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = 0
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
__rcu_process_callbacks()
Kernel Source Reference: 2.6.24
Classic RCU: first timer interrupt
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = rcp->cur + 1 = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 0 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_start_batch()
• A new grace period is started
• All cpus must go through a quiescent state
✓ at least two calls to rcu_check_quiescent_state() are required
➢ The first call: A new grace period is running
➢ The second call: If there was a quiescent state, then
1. Update rcu_ctrlblk.cpumask: Clear the
corresponding CPU bit
2. rcu_ctrlblk.cpumask is empty: the grace period is
completed.
Kernel Source Reference: 2.6.24
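The two-call logic above can be sketched as a small per-CPU state machine (a simplified model of the 2.6.24 `rcu_check_quiescent_state()`; the global cpumask update is omitted and the function/field names are trimmed down for illustration).

```c
/* Sketch of the two-call rcu_check_quiescent_state() logic: the first
 * call notices the new grace period and arms qs_pending; the second
 * call, after schedule() set passed_quiesc, reports this CPU's QS. */
#include <assert.h>

struct rcu_percpu {
    long quiescbatch;     /* grace-period number this CPU is tracking  */
    int  qs_pending;      /* waiting for this CPU to pass a QS         */
    int  passed_quiesc;   /* set by rcu_qsctr_inc() at context switch  */
};

/* Returns 1 when this CPU's QS for grace period `cur` may be reported
 * (in the kernel: clear this CPU's bit in rcu_ctrlblk.cpumask). */
static int check_quiescent_state(struct rcu_percpu *rdp, long cur)
{
    if (rdp->quiescbatch != cur) {       /* first call: new GP running */
        rdp->qs_pending = 1;
        rdp->passed_quiesc = 0;
        rdp->quiescbatch = cur;
        return 0;
    }
    if (!rdp->qs_pending || !rdp->passed_quiesc)
        return 0;                        /* no QS observed yet         */
    rdp->qs_pending = 0;                 /* second call: report the QS */
    return 1;
}
```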
Classic RCU: first timer interrupt
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 1
signaled = 0
lock
cpumask = 0
rcu_ctrlblk
cur++ → -299
completed = -300
next_pending = 1 0
signaled = 0 0
lock
cpumask = cpu_online_map & ~nohz_cpu_mask
Kernel Source Reference: 2.6.24
Classic RCU
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -300
completed = -300
next_pending = 1
signaled = 0
lock
cpumask = 0
static struct rcu_ctrlblk rcu_ctrlblk;
per-cpu
rcu_data
quiescbatch = rcp->completed = -300
passed_quiesc = 0
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur++ → -299
completed = -300
next_pending = 1 0
signaled = 0 0
lock
cpumask = 0 all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
rcu_start_batch()
Kernel Source Reference: 2.6.24
Classic RCU: first timer interrupt
rcu_start_batch()
• A new grace period is started
• All cpus must go through a quiescent state
✓ at least two calls to rcu_check_quiescent_state() are required
➢ The first call: A new grace period is running
➢ The second call: If there was a quiescent state, then
1. Update rcu_ctrlblk.cpumask: Clear the
corresponding CPU bit
2. rcu_ctrlblk.cpumask is empty: the grace period is
completed.
The first call
The second call
Kernel Source Reference: 2.6.24
The first call, then wait for
the next qs (context switch)
The second call
per-cpu
rcu_data
quiescbatch = rcp->cur = -299
passed_quiesc = 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
After the first call
Classic RCU (< 2.6.29): first timer interrupt Kernel Source Reference: 2.6.24
per-cpu
rcu_data
quiescbatch = rcp->cur = -299
passed_quiesc = 0 1
qs_pending = 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
Context switch happens
rcu_data
quiescbatch
passed_quiesc
qs_pending
batch
*nxtlist
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist
**donetail
qlen
# of queued callbacks
cpu
schedule
rcu_qsctr_inc
rdp->passed_quiesc = 1
Classic RCU: first timer interrupt: quiescent state (context switch)
2
1
Kernel Source Reference: 2.6.24
Classic RCU: second timer interrupt
per-cpu
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = -300
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
Second timer interrupt -> __rcu_pending()
True
Kernel Source Reference: 2.6.24
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
timer interrupt
Classic RCU
per-cpu
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 1 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
rcu_ctrlblk
cur = -299
completed = cur = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
static struct rcu_ctrlblk rcu_ctrlblk;
The second call
cpu_clear()
1
2
3
4
5
6 All CPUs pass QS
7
Classic RCU: Second timer interrupt: rcu_start_batch()
rcu_ctrlblk
cur++ → -298
completed = -299
next_pending = 0
signaled = 0 0
lock
cpumask = 0 all_online_cpu_bitmask
rcu_ctrlblk
cur = -299
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = 0
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
rcu_tasklet per-cpu
rcu_process_callbacks()
rcu_check_callbacks
tasklet_schedule
update_process_times
rcu_pending()?
Y
Third timer interrupt
__rcu_process_callbacks()
rcu_check_quiescent_state
call rcu_do_batch if
rdp->donelist
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
rcu_data
quiescbatch = -299
passed_quiesc = 1
qs_pending = 0
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist
**curtail
*donelist = NULL
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
address of
1
2
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = -299 = rcp->cur = -298
passed_quiesc = 1 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
1
2
3
4
5
Kernel Source Reference: 2.6.24
Classic RCU: Third timer interrupt
rcu_data
quiescbatch = rcp->cur = -298
passed_quiesc = 1 0
qs_pending = 0 1
batch = -299
*nxtlist = NULL
**nxttail
qs
handling
batch
handling
struct
rcu_head
*curlist = NULL
**curtail
*donelist
**donetail
qlen = 1
# of queued callbacks
cpu = current cpu’s ID
blimit = 10 (default value)
rcu_head
struct rcu_head *next
func = my_rcu_func
address of
static struct rcu_ctrlblk rcu_ctrlblk;
6 All CPUs pass QS
rcu_ctrlblk
cur++ = -298
completed = -299
next_pending = 0
signaled = 0
lock
cpumask = all_online_cpu_bitmask
1
2
3
Invoke callbacks: depends on
rdp->blimit (see function
‘rcu_do_batch’ for details)
Kernel Source Reference: 2.6.24
Tree RCU (>= 2.6.29)
CPU 0 CPU 1 CPU N
. . .
• Group cpumask: reduce lock contention
• A grace period is complete if root’s rcu_node->qsmask = 0
• Excellent scalability
Tree RCU
qsmask
struct rcu_node
CPU 2 CPU 4
qsmask
struct rcu_node
CPU M . . .
qsmask
struct rcu_node
Root node
struct rcu_state
Implementation details of Tree RCU are not described here (the per-CPU flow is similar to Classic RCU).
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
• Vanilla RCU
• Bottom-half Flavor (Historical)
• Sched Flavor (Historical)
• Sleepable RCU (SRCU)
• Tasks RCU
• Tasks Rude RCU
• Tasks Trace RCU
RCU Flavors
• Reader RCU APIs
✓rcu_read_lock() / rcu_read_unlock()
✓rcu_dereference()
• Writer RCU APIs
✓rcu_assign_pointer()
✓Synchronous grace-period wait primitive: synchronize_rcu()
✓Asynchronous grace-period wait primitive: call_rcu()
RCU Flavors: Vanilla RCU (Classic RCU and Tree RCU)
• Networking data structures that may be subjected to remote denial-
of-service attacks.
✓Under a DoS attack, CPUs might never exit softirq execution.
➢Prevents CPUs from performing a context switch → prevents grace periods from ever ending.
◼ Result: out-of-memory and a system hang
✓Disabling bottom halves prevents the issue.
• Reader RCU APIs
✓rcu_read_lock_bh() / rcu_read_unlock_bh()
➢ local_bh_disable() / local_bh_enable()
✓rcu_dereference_bh()
• Writer: No change
RCU Flavors: Bottom-half Flavor (Historical)
• Before preemptible RCU, context switch is a quiescent state
✓ [A complete grace period] Need to wait for all pre-existing interrupt and NMI handlers
• [CONFIG_PREEMPTION=n]
✓ Vanilla RCU and RCU-sched grace period waits for pre-existing interrupt and NMI handlers
✓ Vanilla RCU and RCU-sched have identical implementations
• [CONFIG_PREEMPTION=Y] for RCU-sched
✓ Preemptible RCU does not need to wait for pre-existing interrupt and NMI handlers.
✓ The code outside of an RCU read-side critical section → a QS
✓ rcu_read_lock_sched() → disable preemption
✓ rcu_read_unlock_sched() → re-enable preemption
✓ A preemption attempt during the RCU-sched read-side critical section:
✓ rcu_read_unlock_sched() will enter the scheduler
• Reader RCU APIs
✓ rcu_read_lock_sched() / rcu_read_unlock_sched()
✓ preempt_disable() / preempt_enable()
✓ local_irq_save() / local_irq_restore()
✓ hardirq enter / hardirq exit
✓ NMI enter / NMI exit
✓ rcu_dereference_sched()
• Writer: No change
RCU Flavors: Sched Flavor (Historical)
• Classic RCU: blocking or sleeping is strictly prohibited
• SRCU: Allow arbitrary sleeping (or blocking) within RCU read-side critical
section
✓Real-time kernel
➢ Require that spinlock critical section be preemptible
➢ Require that RCU read-side critical section be preemptible
✓Extend grace periods
• Different domains (srcu_struct structure) are defined
• Benefit: a slow SRCU reader in one domain does not delay an SRCU grace period in
another domain
RCU Flavors: Sleepable RCU (SRCU)
struct srcu_struct ss;  /* one SRCU domain: init_srcu_struct(&ss) before use */
int idx;

idx = srcu_read_lock(&ss);
do_something();         /* may block/sleep inside this SRCU critical section */
srcu_read_unlock(&ss, idx);
• RCU mechanism
✓Keep old version of data structure until no CPU holds a reference to it. → the
structure can be freed (context switch happens).
• Tasks RCU
✓Defer the destruction of an old data structure until it is known that no process holds a reference to it
✓Scenario: Handle the trampolines used in Linux-kernel tracing
➢ Tracer subsystem: ftrace and kprobe
➢ Kprobe: Return probe – Trampoline
✓Reader RCU APIs
➢ No explicit read-side marker
➢ Voluntary context switches separate successive Tasks RCU read-side critical sections.
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks()
RCU Flavors: Tasks RCU
• Tasks RCU does not wait for idle tasks
✓Idle tasks do not perform voluntary context switches
➢ May remain idle for long periods of time
✓So Tasks RCU cannot handle tracing of code within the idle loop
• Tasks Rude RCU
✓Scenario: Trampoline that might be involved in tracing of code within the idle loop
✓Reader RCU APIs
➢ No explicit read-side marker
➢ Preemption-disabled region of code is a Tasks Rude RCU reader
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_rude()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_rude()
RCU Flavors: Tasks Rude RCU
• Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given
trampoline
• Tasks Trace RCU
✓Scenario: BPF programs need to sleep
✓Reader RCU APIs
➢ Explicit read-side marker: rcu_read_lock_trace() / rcu_read_unlock_trace()
✓Writer RCU APIs
➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_trace()
➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_trace()
RCU Flavors: Tasks Trace RCU
Agenda
• [Overview] rwlock (reader-writer spinlock) vs RCU
• RCU Implementation Overview
✓High-level overview
✓RCU: List Manipulation – Old and New data
✓Reader/Writer synchronization: six basic APIs
➢Reader
➢ rcu_read_lock() & rcu_read_unlock()
➢ rcu_dereference()
➢Writer
➢ rcu_assign_pointer()
➢ synchronize_rcu() & call_rcu()
• Classic RCU vs Tree RCU
• RCU Flavors
• RCU Usage Summary & RCU Case Study
RCU Usage Summary
• Routing table
✓ Read mostly
✓ Stale and inconsistent data is permissible
→ Works great
• Linux kernel’s mapping from user-level System-V semaphore IDs to in-kernel data structures
✓ Read mostly: semaphores tend to be used far more frequently than they are created and destroyed
✓ Need consistent data: must not perform a semaphore operation on a semaphore that has already been deleted
→ Works well
• dentry cache in the Linux kernel
✓ Read/write workload
✓ Need consistent data
→ Might be OK
• SLAB_TYPESAFE_BY_RCU slab-allocator flag (provides type-safe memory to RCU readers)
✓ Write mostly
✓ Need consistent data
→ Not best
Case Study: Design Patterns and Lock Granularity
• Code locking: use global locks only
✓ Lock contention & scalability issue
• Data locking
✓ Many data structures may be partitioned
➢ Example: hash table
✓ Each partition of data structure has its own lock
✓ Reduced lock contention & better scalability
• Data ownership
✓ Data structure is partitioned over threads or CPUs
➢ Each thread/CPU accesses its subset of data structure without any
synchronization overhead
➢ The most-heavily used data owned by a single CPU → Hot spot in this CPU
➢ No sharing is required → achieves ideal performance
➢ Example: percpu variables in Linux kernel
Case Study: Code locking: dentry lookup
• v2.5.61 and earlier
• dcache_lock
✓ protect the hash chain (hash table), d_child, d_alias,
d_lru lists and d_inode
.
.
dentry_hashtable
Source code from: v2.5.61
Case Study: Code locking: dentry lookup
• v2.5.61 and earlier
• dcache_lock
✓ protect the hash chain (hash table), d_child, d_alias,
d_lru lists and d_inode
.
.
dentry_hashtable
Code locking by `dcache_lock`
• dcache_lock protects the whole hash table
• More lock contention, poor scalability
Code locking
Source code from: v2.5.61
Case Study: Data locking: dentry lookup
• v2.5.62 and later
• RCU
✓ lock-free for dcache (dentry) look-up
✓ No need to acquire `dcache_lock` when traversing the
hash table in d_lookup
✓ Rely on RCU to ensure the dentry has not been *freed*.
• dcache_lock
✓ dcache_lock must be taken for the following:
✓ Traversing and updating
✓ Hashtable update
.
.
dentry_hashtable
Data locking by `dentry->d_lock`
• Reduced lock contention
• Better scalability
Source code from: v2.5.62
Data locking
Case Study: Data locking: dentry lookup - seqlock
Source code from: v2.5.67
Case Study: Data locking: dentry lookup - seqlock
Source code from: v2.5.67
This approach is still used in the latest kernel (v6.3)
Reference
• Is Parallel Programming Hard, And, If So, What Can You Do About It?
• What is RCU? – “Read, Copy, Update”
• What Does It Mean To Be An RCU Implementation?
• https://docs.kernel.org/RCU/index.html
• Using RCU for linked lists — a case study
• Scaling dcache with RCU
• Sleepable RCU (SRCU)
• https://lwn.net/Articles/253651/
• https://zhuanlan.zhihu.com/p/90223380
• Preemptible RCU
• https://lwn.net/Articles/253651/
• https://zhuanlan.zhihu.com/p/90223380
Backup
RCU (Read-Copy-Update)
• Non-blocking synchronization
✓Deadlock Immunity: RCU read-side primitives do not block, spin or even do
backwards branches → execution time is deterministic.
➢Exception (programming error):
➢Immunity to priority inversion
◼ Low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the
update-side lock
◼ A low-priority RCU updater cannot prevent high-priority RCU readers from entering read-side
critical section
➢[-rt kernel] RCU is susceptible to priority inversion scenarios:
➢ A high-priority process blocked waiting for an RCU grace period to elapse can be blocked by
low-priority RCU readers. → Solved by RCU priority boosting.
➢ [RCU priority boosting] Requires rcu_read_unlock() to do deboosting, which entails acquiring
scheduler locks.
➢ Need to avoid deadlocks within the scheduler and RCU: the v5.15 kernel requires RCU to
avoid invoking the scheduler while holding any of RCU’s locks
➢ rcu_read_unlock() is not always lockless when RCU priority boosting is enabled.
RCU Properties
• Reader
✓Reads need not wait for updates
➢Low-cost or even no-cost readers → excellent scalability
✓Each reader has a coherent view of each object within the block of rcu_read_lock()
and rcu_read_unlock()
• Updater
✓synchronize_rcu(): ensure that objects are not freed until after the completion of all
readers that might be using them.
✓rcu_assign_pointer() and rcu_dereference(): efficient and scalable mechanisms for
publishing and reading new versions of an object.
Wait for pre-existing RCU readers
• Wait for the RCU read-side critical section: rcu_read_lock()/rcu_read_unlock()
✓Illegal to sleep within an RCU read-side critical section because a context switch is a
quiescent state
* If any portion of a given critical section precedes the
beginning of a given grace period, then RCU
guarantees that all of that critical section will precede
the end of that grace period.
Wait for pre-existing RCU readers: if-then relationship
• If any portion of a given critical section precedes the beginning of a
given grace period, then RCU guarantees that all of that critical
section will precede the end of that grace period.
✓ r1 = 0 and r2 = 0
• If any portion of a given critical section follows the end of a given
grace period, then RCU guarantees that all of that critical section will
follow the beginning of that grace period.
✓ r1 = 1 and r2 = 1
• What would happen if the order of P()’s two accesses was reversed?
✓ Nothing change because loads from x and y are in the same
RCU read-side critical section.
Wait for pre-existing RCU readers: if-then relationship
• An RCU read-side critical section can be completely overlapped by an
RCU grace period.
✓ r1 = 1 and r2 = 0
✓ Cannot be r1 = 0 and r2 = 1
Wait for pre-existing RCU readers: if-then relationship
RCU Grace-Period Ordering Guarantee
Given a grace period, each reader ends before the end of that grace period,
starts after the beginning of that grace period, or both.
Maintain Multiple Versions of Recently Updated Objects
• RCU accommodates synchronization-free readers (weak temporal
synchronization) by maintaining multiple versions of data
* This slide is adapted from page 44 of What is RCU?
* This slide is adapted from page 80 of the slides
task_struct
cg_list
rcu_node_entry
(CONFIG_PREEMPT_RCU=y)
tasks
thread_node
(linked to signal_struct)
Kernel Source Reference: 6.2
Not use RCU: not read-mostly
data structures (use spinlock)
RCU: Defer destruction
ptraced
ptrace_entry
RCU + spinlock
children
sibling
Case study: Access task_struct linked list
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock);
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Linux Synchronization Mechanism: RCU (Read Copy Update)

  • 1. Linux Synchronization Mechanism: RCU (Read- Copy-Update) Adrian Huang | Apr, 2023
  • 2. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 3. task 0 task 1 task N read_unlock() read_lock() Critical Section write_unlock() write_lock() Readers Critical Section [Overview] rwlock (reader-writer spinlock) vs RCU rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer Wait for job completion of readers Deferred destruction: Safe to free memory task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers task 0 task 1 task N Writers/Updaters (Publisher) rwlock_t 1. Mutual exclusion between reader and writer 2. Writer might be starved 1. RCU is a non-blocking synchronization mechanism: No mutual exclusion between readers and a writer 2. No specific *lock* data structure
  • 4. [Overview] rwlock (reader-writer spinlock) vs RCU * Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It? How does RCU achieve concurrent readers and one writer? (No mutual exclusion between readers and one writer) rwlock • Mutual exclusion between reader and writer • Writer might be starved RCU • A non-blocking synchronization mechanism • No specific *lock* data structure
  • 5. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 6. 6 important APIs Readers or Updaters Primitive Purpose Readers rcu_read_lock() Start an RCU read-side critical section rcu_read_unlock() End an RCU read-side critical section rcu_dereference() Safely load an RCU-protected pointer Updaters synchronize_rcu() Wait for all pre-existing RCU read-side critical sections to complete call_rcu() Invoke the specified function after all pre-existing RCU read-side critical sections complete rcu_assign_pointer() Safely update an RCU-protected pointer RCU manipulates pointers
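The pairing of these six APIs can be mimicked in userspace. The sketch below is illustrative only: the struct and function names are hypothetical, and C11 atomics stand in for the kernel primitives (an acquire load models rcu_dereference(), a release store models rcu_assign_pointer()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config { int a; int b; };

static _Atomic(struct config *) shared_cfg;

/* Reader side: in the kernel this body would sit between
 * rcu_read_lock() and rcu_read_unlock(). */
static int read_sum(void)
{
    struct config *p = atomic_load_explicit(&shared_cfg, memory_order_acquire);
    return p ? p->a + p->b : -1;
}

/* Updater side: publish a new version and hand the old one back so
 * the caller can free it after synchronize_rcu()/call_rcu(). */
static struct config *publish_config(int a, int b)
{
    struct config *n = malloc(sizeof(*n));
    n->a = a;
    n->b = b;
    return atomic_exchange_explicit(&shared_cfg, n, memory_order_release);
}
```

Readers always see either the old or the new fully initialized version — never a half-built one — because the release store only becomes visible after the fields are filled in.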
  • 7. RCU Implementation Overview rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer: removal phase Wait for job completion of readers Safe to free memory: reclamation phase task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) • RCU is often used as a replacement for reader-writer locking • Lockless readers ✓ rcu_read_lock() simply disables preemption: No lock mechanism (Ex: spinlock, mutex and so on) ✓ Readers need not wait for updates (non-blocking): low overhead and excellent scalability Reader
  • 8. RCU Implementation Overview rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer: removal phase Wait for job completion of readers Safe to free memory: reclamation phase task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) • [Writer] Waiting for readers ✓ Removal phase: remove references to data items (possibly by replacing them with references to new versions of these data items) ➢ Can run concurrently with readers: readers see either the old or the new version of the data structure rather than a partially updated reference. ✓ synchronize_rcu() and call_rcu(): wait for readers exiting critical section ✓ Block or register a callback that is invoked after active readers have completed. ✓ Reclamation phase: reclaim data items. • Quiescent State (QS): [per-core] The time after pre-existing readers are done ✓ Context switch: a valid quiescent state • Grace Period (GP): All cores have passed through the quiescent state → Complete a grace period Typical RCU Update Sequence
  • 9. RCU: List Manipulation – Old and New data Head 5 2 9 Allocate/fill a structure 8 Insert to the list 1 2 Head 5 2 9 8
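The two-step insert above can be sketched in userspace (hypothetical names, front insertion in the style of list_add_rcu()): the node is completely filled in *before* the release store that links it into the list, so a concurrent reader can never observe a half-initialised node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node { int key; struct node *next; };

static _Atomic(struct node *) list_head;

static void insert_front(int key)
{
    struct node *n = malloc(sizeof(*n));
    n->key = key;                                   /* 1. allocate/fill */
    n->next = atomic_load_explicit(&list_head, memory_order_relaxed);
    atomic_store_explicit(&list_head, n,            /* 2. publish       */
                          memory_order_release);
}

static int sum_keys(void)       /* stand-in for an RCU list reader */
{
    int sum = 0;
    for (struct node *p = atomic_load_explicit(&list_head, memory_order_acquire);
         p != NULL; p = p->next)
        sum += p->key;
    return sum;
}
```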
  • 10. RCU: List Manipulation – Old and New data Head 5 2 9 8 Remove 1 Head 5 2 9 rcu_read_unlock() rcu_read_lock() Critical Section synchronize_rcu() or call_rcu() spin_unlock spinlock Update or remove data (pointer) Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory task 0 task 1 task N Readers (Subscribers) task 0 task 1 task N Writers/Updaters (Publisher) 2 Old data 2 8 New data 2 9 Either the old data or the new data is read by readers.
  • 11. RCU: List Manipulation – Old and New data Head 5 2 9 8 Remove 1 Head 5 2 9 2 Old data 2 8 New data 2 9 Either the old data or the new data is read by readers. RCU reader Old data is read New data is read Legend * Reference from section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 12. Reader/Writer synchronization rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); synchronize_rcu(); spin_unlock spinlock rcu_assign_pointer(shared_abc, NULL); kfree(tmp); RCU Reader RCU Writer w/ valid pointer assignment RCU Writer w/ NULL pointer assignment and free memory
  • 13. RCU Spatial/Temporal Synchronization: Sample Code RCU Reader RCU Writer/Updater Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 14. RCU Spatial/Temporal Synchronization 5,25 9,81 curconfig Address Space rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (5) *cur_b = mcp->b; (25) rcu_read_unlock(); mcp = kmalloc(…); rcu_assign_pointer(curconfig, mcp); synchronize_rcu(); … … kfree(old_mcp); rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (9) *cur_b = mcp->b; (81) rcu_read_unlock(); Grace Period Readers Readers • Temporal Synchronization ✓ [Reader] rcu_read_lock() / rcu_read_unlock() ✓ [Writer/Update] synchronize_rcu() / call_rcu() • Spatial Synchronization • [Reader] rcu_dereference() • [Writer/Update] rcu_assign_pointer() Reference: Is Parallel Programming Hard, And, If So, What Can You Do About It? Time
  • 15. RCU Spatial/Temporal Synchronization 5,25 9,81 curconfig Address Space rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (5) *cur_b = mcp->b; (25) rcu_read_unlock(); mcp = kmalloc(…); rcu_assign_pointer(curconfig, mcp); synchronize_rcu(); … … kfree(old_mcp); rcu_read_lock(); mcp = rcu_dereference(curconfig); *cur_a = mcp->a; (9) *cur_b = mcp->b; (81) rcu_read_unlock(); Grace Period Readers Readers RCU combines temporal and spatial synchronization in order to approximate reader-writer locking • Temporal Synchronization ✓ [Reader] rcu_read_lock() / rcu_read_unlock() ✓ [Writer/Update] synchronize_rcu() / call_rcu() • Spatial Synchronization • [Reader] rcu_dereference() • [Writer/Update] rcu_assign_pointer()
  • 16. Reader: rcu_read_lock() & rcu_read_unlock() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment ✓ Simply disable/enable preemption when entering/exiting an RCU critical section • [Why] A QS is detected by a context switch. Kernel Source Reference: 2.6.24
  • 17. Reader: rcu_read_lock() & rcu_read_unlock() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment Kernel Source Reference: 2.6.24 • __acquire() / __release(): sparse checker (semantic parser) ✓ Sparse checking of RCU-protected pointers: Add __rcu marker ➢ Reference: PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
  • 18. Reader: rcu_read_lock() & rcu_read_unlock() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment Kernel Source Reference: 2.6.24 • rcu_read_acquire() / rcu_read_release(): lockdep (runtime locking correctness validator) ✓ Detects programming errors (e.g., deadlock)
  • 19. rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment It’s illegal to block while in an RCU read-side CS (Exception: CONFIG_PREEMPT_RCU=y) Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
  • 20. rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory Ensure only one writer Wait for job completion of readers Safe to free memory printk("%d\n", ptr->number); RCU Reader RCU Writer w/ valid pointer assignment rcu_assign_pointer() and rcu_dereference() invocations communicate spatial synchronization via stores to and loads from the RCU-protected pointer Reader: rcu_read_lock() & rcu_read_unlock() Kernel Source Reference: 2.6.24
  • 21. Reader: rcu_dereference() rcu_read_unlock() rcu_read_lock() ptr = rcu_dereference(shared_abc); printk("%d\n", ptr->number); RCU Reader • Preserve order • Load -> Load, Load -> Store and Store -> Store • Might be reordered because of the store buffer • Store -> Load Total Store Order (TSO) – x86 Memory Model: x86 is a relatively strongly ordered system Kernel Source Reference: 2.6.24
  • 22. Writer: rcu_assign_pointer() synchronize_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory • Preserve order • Load -> Load, Load -> Store and Store -> Store • Might be reordered because of the store buffer • Store -> Load Total Store Order (TSO) – x86 Memory Model: x86 is a relatively strongly ordered system RCU Writer Kernel Source Reference: 2.6.24
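The ordering the two spatial-synchronization primitives provide can be sketched with C11 atomics (an assumption, not the kernel's actual macro bodies: in the kernel, rcu_assign_pointer() is essentially a release store and rcu_dereference() a dependency-ordered load; memory_order_consume is promoted to acquire by most compilers). The my_ names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Release store: everything written before the assignment is visible
 * to any reader that subsequently dereferences the pointer. */
#define my_rcu_assign_pointer(p, v) \
    atomic_store_explicit(&(p), (v), memory_order_release)

/* Dependency-ordered load: reads through the returned pointer are
 * ordered after the load of the pointer itself. */
#define my_rcu_dereference(p) \
    atomic_load_explicit(&(p), memory_order_consume)

static _Atomic(int *) gp;

static int demo_assign_and_deref(void)
{
    static int x;
    x = 42;                         /* initialise before publishing */
    my_rcu_assign_pointer(gp, &x);  /* publish                      */
    int *p = my_rcu_dereference(gp);
    return *p;
}
```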
  • 23. Writer: synchronize_rcu() & call_rcu() synchronize_rcu() or call_rcu() spin_unlock spinlock rcu_assign_pointer(shared_abc, ptr); Optional: free memory RCU Writer • Mark the end of updater code and the beginning of reclaimer code • [Synchronous] synchronize_rcu() ✓ Block until all pre-existing RCU read-side critical sections on all CPUs have completed ✓ leverages call_rcu() • [Asynchronous] call_rcu() ✓ Queue a callback for invocation after a grace period ✓ [Scenario] ➢ It’s illegal to block the RCU updater ➢ Update-side performance is critically important [Updater] Ensure only one writer: removal phase Wait for job completion of readers [Reclaimer] Safe to free memory: reclamation phase
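The asynchronous path can be modelled with a toy callback queue. The struct fields mirror the kernel's struct rcu_head and the nxtlist/nxttail naming from rcu_data; everything else (the my_ prefixes, the demo) is illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct my_rcu_head {
    struct my_rcu_head *next;
    void (*func)(struct my_rcu_head *head);
};

static struct my_rcu_head *nxtlist;
static struct my_rcu_head **nxttail = &nxtlist;

static void my_call_rcu(struct my_rcu_head *head,
                        void (*func)(struct my_rcu_head *head))
{
    head->func = func;
    head->next = NULL;
    *nxttail = head;            /* tail append keeps FIFO order */
    nxttail = &head->next;
}

/* Invoked by the reclaimer once all pre-existing readers are done:
 * run every queued callback. */
static int my_rcu_do_batch(void)
{
    int invoked = 0;
    while (nxtlist) {
        struct my_rcu_head *head = nxtlist;
        nxtlist = head->next;
        head->func(head);
        invoked++;
    }
    nxttail = &nxtlist;
    return invoked;
}

static int cb_runs;
static void count_cb(struct my_rcu_head *head) { (void)head; cb_runs++; }

static int demo_call_rcu(void)
{
    static struct my_rcu_head h1, h2;
    my_call_rcu(&h1, count_cb);
    my_call_rcu(&h2, count_cb);
    my_rcu_do_batch();          /* "grace period has elapsed" */
    return cb_runs;
}
```

synchronize_rcu() then falls out naturally: queue one callback that signals a completion, and block until it fires.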
  • 24. Why the name RCU (read-copy update)? RCU Writer Create a copy
  • 25. Why the name RCU (read-copy update)? RCU Writer Update the copy
  • 26. Why the name RCU (read-copy update)? RCU Writer Replace the old entry with the newly created one
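The three slides above — create a copy, update the copy, replace the old entry — can be written as one userspace function (a sketch with hypothetical names; in the kernel the final store would be rcu_assign_pointer() and the free deferred past a grace period):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct entry { int a; int b; };

static _Atomic(struct entry *) cur_entry;

static struct entry *read_copy_update(int new_b)
{
    struct entry *old = atomic_load_explicit(&cur_entry, memory_order_acquire);
    struct entry *copy = malloc(sizeof(*copy));
    memcpy(copy, old, sizeof(*copy));          /* 1. create a copy   */
    copy->b = new_b;                           /* 2. update the copy */
    atomic_store_explicit(&cur_entry, copy,    /* 3. replace the old */
                          memory_order_release);
    return old;   /* caller frees this after a grace period */
}

static int demo_read_copy_update(void)
{
    struct entry *e0 = malloc(sizeof(*e0));
    e0->a = 1;
    e0->b = 2;
    atomic_store_explicit(&cur_entry, e0, memory_order_release);

    struct entry *old = read_copy_update(7);
    struct entry *now = atomic_load_explicit(&cur_entry, memory_order_acquire);
    int ok = (old == e0) && now->a == 1 && now->b == 7;
    free(old);
    free(now);
    return ok;
}
```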
  • 27. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time reader list_del_rcu() synchronize_rcu() kfree()
  • 28. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time Quiescent State (QS) • A point in the code where there can be no references held to RCU-protected data structures, which is normally any point outside of an RCU read-side critical section. ✓ RCU read-side critical section is defined by the range between rcu_read_lock() and rcu_read_unlock(). Grace Period (GP) • All threads (cores) pass through at least one quiescent state. Quotes from the book “Is Parallel Programming Hard, And, If So, What Can You Do About It?” reader list_del_rcu() synchronize_rcu() kfree()
  • 29. RCU: Quiescent State & Grace Period Removal Reclamation reader reader reader reader reader Old data is read New data is read Legend reader reader Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 (RCU Updater) Grace Period Time reader list_del_rcu() synchronize_rcu() kfree() synchronize_rcu(): How to detect a grace period?
  • 30. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers - 6 Approaches
  • 31. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – Reference Counter • Reference counter (shared data) updated by rcu_read_lock() and rcu_read_unlock() ✓ Scalability problem due to cache bouncing Reference Counter * Figure reference from Is Parallel Programming Hard, And, If So, What Can You Do About It?
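A toy version of approach #1 (names hypothetical): one shared counter that every reader bumps on entry and drops on exit, and that the updater polls until it drains to zero. This is exactly what the slide criticises — every reader writes the same cache line, so it bounces between cores.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int active_readers;

static void toy_read_lock(void)   { atomic_fetch_add(&active_readers, 1); }
static void toy_read_unlock(void) { atomic_fetch_sub(&active_readers, 1); }

/* One poll step; a real synchronize_rcu() would spin/sleep on this
 * until it returns true. */
static int toy_grace_period_over(void)
{
    return atomic_load(&active_readers) == 0;
}
```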
  • 32. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – CPU Register • Check CPUs’ registers (Prevent from accessing shared data such as the reference counter) ✓CPUs’ register: Check each CPU’s program counter (PC) ➢The updater polls each relevant PC. If the PC is not within read-side code, the corresponding CPU is within a quiescent state. ➢A complete grace period: All CPUs’ PCs have been observed to be outside of read-side code. ✓Challenges ➢Readers might invoke other functions ➢Code-motion optimization CPU Register
  • 33. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – Wait for a fixed period of time • Wait long enough to comfortably exceed the lifetime of any reasonable reader ✓Unreasonable reader → issue! Wait for a fixed period of time
  • 34. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – Wait forever • Accommodate the unreasonable reader. • Bad reputation: leaking memory ✓ Memory leaks often require untimely and inconvenient reboots. ✓ Works well in a high-availability cluster where systems were periodically crashed in order to ensure that the cluster really remained highly available. Wait forever
  • 35. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – Avoid the periodic crashes • Covered by a stop-the-world garbage collector • In today’s always-connected always-on world, stopping the world can gravely degrade response times. Avoid the periodic crashes
  • 36. Approach #1: Reference Counter Approach #2: CPU Register Approach #3: Wait for a fixed period of time Approach #4: Wait forever Approach #5: Avoid the periodic crashes Approach #6: Quiescent-state-based reclamation (QSBR) Grace Period: Waiting for readers – QSBR • Numerous applications already have states (termed quiescent states) that can be reached only after all pre-existing readers are done. ✓ Transaction-processing application: the time between a pair of successive transactions might be a quiescent state. ✓ Non-preemptive OS kernel: Context switch can be a quiescent state. ✓ [Non-preemptive OS kernel] An RCU reader must be prohibited from blocking while referencing global data. QSBR
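The QSBR bookkeeping can be sketched for a fixed set of "CPUs" (sizes and names below are illustrative, not the book's implementation): each CPU bumps its counter whenever it passes a quiescent state (e.g. a context switch), and a grace period started by gp_start() is over once every counter has moved past its snapshot.

```c
#include <assert.h>

#define NCPU 4

static unsigned long qs_count[NCPU];
static unsigned long qs_snap[NCPU];

/* Called from each CPU's quiescent state, e.g. the context switch. */
static void note_quiescent(int cpu) { qs_count[cpu]++; }

static void gp_start(void)
{
    for (int i = 0; i < NCPU; i++)
        qs_snap[i] = qs_count[i];   /* snapshot at grace-period start */
}

static int gp_done(void)
{
    for (int i = 0; i < NCPU; i++)
        if (qs_count[i] == qs_snap[i])
            return 0;               /* CPU i has not passed a QS yet */
    return 1;
}
```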
  • 37. * Reference from Is Parallel Programming Hard, And, If So, What Can You Do About It? Grace Period: Waiting for readers – QSBR [Concept] Implementation for non-preemptive Linux kernel * [Not production quality] sched_setaffinity() function causes the current thread to execute on the specified CPU, which forces the destination CPU to execute a context switch.
  • 38. [Concept] QSBR: non-production and non-preemptible implementation Force the destination CPU to execute a context switch: completion of RCU readers Reference: Page #142 of Is Parallel Programming Hard, And, If So, What Can You Do About It? synchronize_rcu() rcu_read_lock() & rcu_read_unlock() → disable/enable preemption
  • 39. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU ✓Will discuss implementation details about classic RCU *only* • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 40. Classic RCU (< 2.6.29) and Tree RCU (>= 2.6.29) struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . • Global cpumask: each bit indicates each core • A grace period is complete if rcu_ctrlblk.cpumask = 0 • A scalability problem arises as the number of cores increases ✓ Lock contention: cores reporting a QS contend to update rcu_ctrlblk.cpumask ✓ Cache bouncing for frequent writes CPU 0 CPU 1 CPU N . . . Classic RCU • Group cpumask: reduce lock contention • A grace period is complete if root’s rcu_node->qsmask = 0 • Excellent scalability Tree RCU qsmask struct rcu_node CPU 2 CPU 3 qsmask struct rcu_node CPU M . . . qsmask struct rcu_node Root node struct rcu_state Reference code: v2.6.24 Reference code: v2.6.29
  • 41. Classic RCU (< 2.6.29) struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . • Global cpumask: each bit indicates each core • A scalability problem arises as the number of cores increases ✓ Lock contention ✓ Cache bouncing for frequent writes start a new grace period rcu_ctrlblk.cpumask=0xffff schedule(): clear the corresponding bit in rcu_ctrlblk.cpumask rcu_ctrlblk.cpumask=0? Context switch Y: Finish a grace period N: Wait for all CPUs to pass a QS High-level concept: Suppose it’s a 16-core system Kernel Source Reference: 2.6.24
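The classic-RCU bookkeeping above can be sketched for the 16-core example (function names hypothetical, locking omitted): every online CPU gets a bit in the mask; each clears its bit at its first context switch inside the grace period; the grace period completes when the mask reaches 0.

```c
#include <assert.h>

#define ONLINE_CPUS 16

static unsigned int cpumask;

/* start a new grace period: set one bit per online CPU */
static void gp_begin(void)         { cpumask = (1u << ONLINE_CPUS) - 1; }

/* schedule(): this CPU passed a QS, clear its bit */
static void cpu_quiescent(int cpu) { cpumask &= ~(1u << cpu); }

static int gp_complete(void)       { return cpumask == 0; }
```

In the real kernel every cpu_quiescent() update is serialized by rcu_ctrlblk.lock — the very contention point Tree RCU was built to remove.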
  • 42. Classic RCU: Data Structure rcu_head struct rcu_head *next rcu_ctrlblk func cur completed next_pending signaled lock cpumask rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu Global control block per-cpu qs_pending: Avoid cacheline thrashing by accessing this per-cpu flag instead of the cpumask bitmap rcu_head rcu_head Callback supplied by call_rcu() struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . Kernel Source Reference: 2.6.24
  • 43. Classic RCU: start_kernel() -> rcu_init() rcu_ctrlblk cur = -300 completed = -300 next_pending signaled lock cpumask = 0 rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu Global control block: rcu_ctrlblk per-cpu: rcu_data struct rcu_ctrlblk rcu_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . Kernel Source Reference: 2.6.24 struct rcu_ctrlblk rcu_bh_ctrlblk rcu_ctrlblk.lock (spinlock protection) rcu_ctrlblk.cpumask CPU 0 CPU 1 CPU N . . . rcu_ctrlblk cur = -300 completed = -300 next_pending signaled lock cpumask = 0 Global control block: rcu_bh_ctrlblk rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu per-cpu: rcu_bh_data 1 2 init percpu data init percpu data rcu_tasklet per-cpu rcu_process_callbacks() 3 init rcu_tasklet
  • 44. rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 [Global control block] static struct rcu_ctrlblk rcu_ctrlblk; per-cpu variable initialized by rcu_init_percpu_data() rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) address of address of address of Kernel Source Reference: 2.6.24 Classic RCU: start_kernel() -> rcu_init(): show “rcu_ctrlblk” only
  • 45. Classic RCU: call_rcu() per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func call_rcu(&my_rcu_head, my_rcu_func); address of address of address of Kernel Source Reference: 2.6.24
  • 46. Classic RCU: synchronize_rcu() leverages call_rcu() Kernel Source Reference: 2.6.24
  • 47. Classic RCU: Invoke rcu_pending() when a timer interrupt is triggered rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist tasklet_schedule if rdp->donelist: All callbacks are not invoked yet due to rdp->blimit (limit on a processed batch) per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 0 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of address of Kernel Source Reference: 2.6.24
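The blimit behaviour in this flow can be sketched with the callbacks abstracted to a simple count: at most blimit callbacks run per tasklet invocation, and a non-empty donelist means the tasklet is rescheduled (modelled here by the return value; all names other than the default of 10 from the slide are hypothetical).

```c
#include <assert.h>

#define BLIMIT 10   /* rdp->blimit default value in the slide */

static int donelist_len;

static int toy_rcu_do_batch(void)   /* returns 1: tasklet must rerun */
{
    int n = donelist_len < BLIMIT ? donelist_len : BLIMIT;
    donelist_len -= n;              /* "invoke" n callbacks */
    return donelist_len > 0;        /* leftovers -> tasklet_schedule() */
}
```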
  • 48. Classic RCU: Check if a CPU passes a QS rcu_check_callbacks update_process_times rcu_pending()? Y timer interrupt Kernel Source Reference: 2.6.24 rcu_qsctr_inc schedule Timer interrupt Scheduler: context switch [Note] “rcu_data.passed_quiesc = 1” does not clear the corresponding rcu_ctrlblk.cpumask directly. More checks are performed. See later slides.
  • 49. Classic RCU: first timer interrupt rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 [Global control block] static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of address of Manipulation Kernel Source Reference: 2.6.24
  • 50. Classic RCU per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = rcp->cur + 1 = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = 0 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of __rcu_process_callbacks() Kernel Source Reference: 2.6.24
  • 51. Classic RCU: first timer interrupt per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = rcp->cur + 1 = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 0 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; rcu_start_batch() • A new grace period is started • All cpus must go through a quiescent state ✓ at least two calls to rcu_check_quiescent_state() are required ➢ The first call: A new grace period is running ➢ The second call: If there was a quiescent state, then 1. Update rcu_ctrlblk.cpumask: Clear the corresponding CPU bit 2. rcu_ctrlblk.cpumask is empty: the grace period is completed. Kernel Source Reference: 2.6.24
  • 52. Classic RCU: first timer interrupt rcu_ctrlblk cur = -300 completed = -300 next_pending = 1 signaled = 0 lock cpumask = 0 rcu_ctrlblk cur++ → -299 completed = -300 next_pending = 1 0 signaled = 0 0 lock cpumask = cpu_online_map & ~nohz_cpu_mask Kernel Source Reference: 2.6.24
  • 53. Classic RCU per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -300 completed = -300 next_pending = 1 signaled = 0 lock cpumask = 0 static struct rcu_ctrlblk rcu_ctrlblk; per-cpu rcu_data quiescbatch = rcp->completed = -300 passed_quiesc = 0 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur++ → -299 completed = -300 next_pending = 1 0 signaled = 0 0 lock cpumask = 0 all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; rcu_start_batch() Kernel Source Reference: 2.6.24
  • 54. Classic RCU: first timer interrupt rcu_start_batch() • A new grace period is started • All cpus must go through a quiescent state ✓ at least two calls to rcu_check_quiescent_state() are required ➢ The first call: A new grace period is running ➢ The second call: If there was a quiescent state, then 1. Update rcu_ctrlblk.cpumask: Clear the corresponding CPU bit 2. rcu_ctrlblk.cpumask is empty: the grace period is completed. The first call The second call Kernel Source Reference: 2.6.24
  • 55. The first call, then wait for the next qs (context switch) The second call per-cpu rcu_data quiescbatch = rcp->cur = -299 passed_quiesc = 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; After the first call Classic RCU (< 2.6.29): first timer interrupt Kernel Source Reference: 2.6.24
  • 56. per-cpu rcu_data quiescbatch = rcp->cur = -299 passed_quiesc = 0 1 qs_pending = 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; Context switch happens rcu_data quiescbatch passed_quiesc qs_pending batch *nxtlist **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist **donetail qlen # of queued callbacks cpu schedule rcu_qsctr_inc rdp->passed_quiesc = 1 Classic RCU: first timer interrupt: quiescent state (context switch) 2 1 Kernel Source Reference: 2.6.24
  • 57. Classic RCU: second timer interrupt per-cpu rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = -300 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; Second timer interrupt -> __rcu_pending() True Kernel Source Reference: 2.6.24 rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y timer interrupt
  • 58. Classic RCU per-cpu rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 1 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of rcu_ctrlblk cur = -299 completed = cur = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask static struct rcu_ctrlblk rcu_ctrlblk; The second call cpu_clear() 1 2 3 4 5 6 All CPUs pass QS 7
  • 59. Classic RCU: Second timer interrupt: rcu_start_batch() rcu_ctrlblk cur++ → -298 completed = -299 next_pending = 0 signaled = 0 0 lock cpumask = 0 all_online_cpu_bitmask rcu_ctrlblk cur = -299 completed = -299 next_pending = 0 signaled = 0 lock cpumask = 0 Kernel Source Reference: 2.6.24
  • 60. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask rcu_tasklet per-cpu rcu_process_callbacks() rcu_check_callbacks tasklet_schedule update_process_times rcu_pending()? Y Third timer interrupt __rcu_process_callbacks() rcu_check_quiescent_state call rcu_do_batch if rdp->donelist Kernel Source Reference: 2.6.24
  • 61. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask rcu_data quiescbatch = -299 passed_quiesc = 1 qs_pending = 0 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist **curtail *donelist = NULL **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of address of 1 2 Kernel Source Reference: 2.6.24
  • 62. Classic RCU: Third timer interrupt rcu_data quiescbatch = -299 = rcp->cur = -298 passed_quiesc = 1 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask 1 2 3 4 5 Kernel Source Reference: 2.6.24
  • 63. Classic RCU: Third timer interrupt rcu_data quiescbatch = rcp->cur = -298 passed_quiesc = 1 0 qs_pending = 0 1 batch = -299 *nxtlist = NULL **nxttail qs handling batch handling struct rcu_head *curlist = NULL **curtail *donelist **donetail qlen = 1 # of queued callbacks cpu = current cpu’s ID blimit = 10 (default value) rcu_head struct rcu_head *next func = my_rcu_func address of static struct rcu_ctrlblk rcu_ctrlblk; 6 All CPUs pass QS rcu_ctrlblk cur++ = -298 completed = -299 next_pending = 0 signaled = 0 lock cpumask = all_online_cpu_bitmask 1 2 3 Invoke callbacks: depend on rdp->blimit (Check function ‘rcu_do_batch’ for detail) Kernel Source Reference: 2.6.24
• 64. Tree RCU (>= 2.6.29) CPU 0 CPU 1 CPU N . . . • Group cpumask: reduce lock contention • A grace period is complete if the root’s rcu_node->qsmask = 0 • Excellent scalability Tree RCU qsmask struct rcu_node CPU 2 CPU 4 qsmask struct rcu_node CPU M . . . qsmask struct rcu_node Root node struct rcu_state Tree RCU implementation details are not covered here (conceptually similar to Classic RCU).
• 65. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
  • 66. • Vanilla RCU • Bottom-half Flavor (Historical) • Sched Flavor (Historical) • Sleepable RCU (SRCU) • Tasks RCU • Tasks Rude RCU • Tasks Trace RCU RCU Flavors
  • 67. • Reader RCU APIs ✓rcu_read_lock() / rcu_read_unlock() ✓rcu_dereference() • Writer RCU APIs ✓rcu_assign_pointer() ✓Synchronous grace-period wait primitive: synchronize_rcu() ✓Asynchronous grace-period wait primitive: call_rcu() RCU Flavors: Vanilla RCU (Classic RCU and Tree RCU)
• 68. • Networking data structures that may be subject to remote denial-of-service attacks. ✓CPUs never exit softirq execution due to DoS attacks. ➢Prevent CPUs from executing a context switch → prevent grace periods from ever ending. ◼ Out-of-memory and a system hang ✓Disabling bottom halves prevents the issue. • Reader RCU APIs ✓rcu_read_lock_bh() / rcu_read_unlock_bh() ➢ local_bh_disable() / local_bh_enable() ✓rcu_dereference_bh() • Writer: No change RCU Flavors: Bottom-half Flavor (Historical)
• 69. • Before preemptible RCU, a context switch is a quiescent state ✓ [A complete grace period] Need to wait for all pre-existing interrupt and NMI handlers • [CONFIG_PREEMPTION=n] ✓ Vanilla RCU and RCU-sched grace periods wait for pre-existing interrupt and NMI handlers ✓ Vanilla RCU and RCU-sched have identical implementations • [CONFIG_PREEMPTION=y] for RCU-sched ✓ Preemptible RCU does not need to wait for pre-existing interrupt and NMI handlers. ✓ The code outside of an RCU read-side critical section → a QS ✓ rcu_read_lock_sched() → disable preemption ✓ rcu_read_unlock_sched() → re-enable preemption ✓ A preemption attempt during the RCU-sched read-side critical section: ✓ rcu_read_unlock_sched() will enter the scheduler • Reader RCU APIs ✓ rcu_read_lock_sched() / rcu_read_unlock_sched() ✓ preempt_disable() / preempt_enable() ✓ local_irq_save() / local_irq_restore() ✓ hardirq enter / hardirq exit ✓ NMI enter / NMI exit ✓ rcu_dereference_sched() • Writer: No change RCU Flavors: Sched Flavor (Historical)
• 70. • Classic RCU: blocking or sleeping is strictly prohibited • SRCU: Allows arbitrary sleeping (or blocking) within an RCU read-side critical section ✓Real-time kernel ➢ Requires that spinlock critical sections be preemptible ➢ Requires that RCU read-side critical sections be preemptible ✓Extend grace periods • Different domains (srcu_struct structure) are defined • Benefit: a slow SRCU reader in one domain does not delay an SRCU grace period in another domain RCU Flavors: Sleepable RCU (SRCU) struct srcu_struct ss; int idx; idx = srcu_read_lock(&ss); do_something(); srcu_read_unlock(&ss, idx);
  • 71. • RCU mechanism ✓Keep old version of data structure until no CPU holds a reference to it. → the structure can be freed (context switch happens). • Tasks RCU ✓Defer the destruction of an old data structure until it is known that no process holds a reference to it ✓Scenario: Handle the trampolines used in Linux-kernel tracing ➢ Tracer subsystem: ftrace and kprobe ➢ Kprobe: Return probe – Trampoline ✓Reader RCU APIs ➢ No explicit read-side marker ➢ Voluntary context switches separate successive Tasks RCU read-side critical sections. ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks() RCU Flavors: Tasks RCU
• 72. • Tasks RCU does not wait for idle tasks ✓Idle tasks: never do voluntary context switches ➢ May remain idle for long periods of time ✓So Tasks RCU cannot cover tracing of code within the idle loop • Tasks Rude RCU ✓Scenario: Trampolines that might be involved in tracing of code within the idle loop ✓Reader RCU APIs ➢ No explicit read-side marker ➢ A preemption-disabled region of code is a Tasks Rude RCU reader ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_rude() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_rude() RCU Flavors: Tasks Rude RCU
• 73. • Tasks RCU and Tasks Rude RCU disallow sleeping while executing in a given trampoline • Tasks Trace RCU ✓Scenario: BPF programs that need to sleep ✓Reader RCU APIs ➢ Explicit read-side marker: rcu_read_lock_trace() / rcu_read_unlock_trace() ✓Writer RCU APIs ➢ Synchronous grace-period-wait primitives: synchronize_rcu_tasks_trace() ➢ Asynchronous grace-period-wait primitives: call_rcu_tasks_trace() RCU Flavors: Tasks Trace RCU
• 74. Agenda • [Overview] rwlock (reader-writer spinlock) vs RCU • RCU Implementation Overview ✓High-level overview ✓RCU: List Manipulation – Old and New data ✓Reader/Writer synchronization: six basic APIs ➢Reader ➢ rcu_read_lock() & rcu_read_unlock() ➢ rcu_dereference() ➢Writer ➢ rcu_assign_pointer() ➢ synchronize_rcu() & call_rcu() • Classic RCU vs Tree RCU • RCU Flavors • RCU Usage Summary & RCU Case Study
• 75. RCU Usage Summary (Usage: Description → Applicable?)
✓ Routing table: read mostly; stale and inconsistent data is permissible → works great
✓ Linux kernel's mapping from user-level System-V semaphore IDs to in-kernel data structures: read mostly (semaphores tend to be used far more frequently than they are created and destroyed); needs consistent data (must not perform a semaphore operation on a semaphore that has already been deleted) → works well
✓ dentry cache in the Linux kernel: read/write workload; needs consistent data → might be OK
✓ SLAB_TYPESAFE_BY_RCU slab-allocator flag (provides type-safe memory to RCU readers): write mostly; needs consistent data → not best
• 76. Case Study: Design Patterns and Lock Granularity • Code locking: use global locks only ✓ Lock contention & scalability issues • Data locking ✓ Many data structures may be partitioned ➢ Example: hash table ✓ Each partition of the data structure has its own lock ✓ Reduces lock contention & improves scalability • Data ownership ✓ Data structure is partitioned over threads or CPUs ➢ Each thread/CPU accesses its subset of the data structure without any synchronization overhead ➢ If the most-heavily used data is owned by a single CPU, that CPU becomes a hot spot ➢ No sharing required → ideal performance ➢ Example: percpu variables in the Linux kernel
  • 77. Case Study: Code locking: dentry lookup • v2.5.61 and earlier • dcache_lock ✓ protect the hash chain (hash table), d_child, d_alias, d_lru lists and d_inode . . dentry_hashtable Source code from: v2.5.61
  • 78. Case Study: Code locking: dentry lookup • v2.5.61 and earlier • dcache_lock ✓ protect the hash chain (hash table), d_child, d_alias, d_lru lists and d_inode . . dentry_hashtable Code locking by `dcache_lock` • dcache_lock protects the whole hash table • More lock contention, poor scalability Code locking Source code from: v2.5.61
• 79. Case Study: Data locking: dentry lookup • v2.5.62 and later • RCU ✓ lock-free dcache (dentry) look-up ✓ No need to acquire `dcache_lock` when traversing the hash table in d_lookup ✓ Rely on RCU to ensure the dentry has not been *freed*. • dcache_lock ✓ dcache_lock must be taken for the following: ✓ Traversing and updating ✓ Hash table updates . . dentry_hashtable Data locking by `dentry->d_lock` • Improved lock contention • Better scalability Source code from: v2.5.62 Data locking
  • 80. Case Study: Data locking: dentry lookup - seqlock Source code from: v2.5.67
  • 81. Case Study: Data locking: dentry lookup - seqlock Source code from: v2.5.67 This approach is still used in latest kernel (v6.3)
  • 82. Reference • Is Parallel Programming Hard, And, If So, What Can You Do About It? • What is RCU? – “Read, Copy, Update” • What Does It Mean To Be An RCU Implementation? • https://docs.kernel.org/RCU/index.html • Using RCU for linked lists — a case study • Scaling dcache with RCU • Sleepable RCU (SRCU) • https://lwn.net/Articles/253651/ • https://zhuanlan.zhihu.com/p/90223380 • Preemptible RCU • https://lwn.net/Articles/253651/ • https://zhuanlan.zhihu.com/p/90223380
• 84. RCU (Read-Copy-Update) • Non-blocking synchronization ✓Deadlock Immunity: RCU read-side primitives do not block, spin or even do backwards branches → execution time is deterministic. ➢Exception (programming error): ➢Immunity to priority inversion ◼ Low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the update-side lock ◼ A low-priority RCU updater cannot prevent high-priority RCU readers from entering a read-side critical section ➢[-rt kernel] RCU is susceptible to priority inversion scenarios: ➢ A high-priority process blocked waiting for an RCU grace period to elapse can be blocked by low-priority RCU readers → solved by RCU priority boosting. ➢ [RCU priority boosting] Requires rcu_read_unlock() to do deboosting, which entails acquiring scheduler locks. ➢ Need to avoid deadlocks between the scheduler and RCU: the v5.15 kernel requires RCU to avoid invoking the scheduler while holding any of RCU’s locks ➢ rcu_read_unlock() is not always lockless when RCU priority boosting is enabled.
• 85. RCU Properties • Reader ✓Reads need not wait for updates ➢Low-cost or even no-cost readers → excellent scalability ✓Each reader has a coherent view of each object within the block of rcu_read_lock() and rcu_read_unlock() • Updater ✓synchronize_rcu(): ensures that objects are not freed until after the completion of all readers that might be using them. ✓rcu_assign_pointer() and rcu_dereference(): efficient and scalable mechanisms for publishing and reading new versions of an object.
  • 86. Wait for pre-existing RCU readers • Wait for the RCU read-side critical section: rcu_read_lock()/rcu_read_unlock() ✓Illegal to sleep within an RCU read-side critical section because a context switch is a quiescent state * If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period.
  • 87. Wait for pre-existing RCU readers: if-then relationship • If any portion of a given critical section precedes the beginning of a given grace period, then RCU guarantees that all of that critical section will precede the end of that grace period. ✓ r1 = 0 and r2 = 0
  • 88. • If any portion of a given critical section follows the end of a given grace period, then RCU guarantees that all of that critical section will follow the beginning of that grace period. ✓ r1 = 1 and r2 = 1 • What would happen if the order of P()’s two accesses was reversed? ✓ Nothing change because loads from x and y are in the same RCU read-side critical section. Wait for pre-existing RCU readers: if-then relationship
  • 89. • An RCU read-side critical section can be completely overlapped by an RCU grace period. ✓ r1 = 1 and r2 = 0 ✓ Cannot be r1 = 0 and r2 = 1 Wait for pre-existing RCU readers: if-then relationship
• 90. RCU Grace-Period Ordering Guarantee Given a grace period, each reader ends before the end of that grace period, starts after the beginning of that grace period, or both.
  • 91. Maintain Multiple Versions of Recently Updated Objects • RCU accommodates synchronization-free readers (weak temporal synchronization) by maintaining multiple versions of data
• 92. * This slide is adapted from page 44 of What is RCU?
• 93. * This slide is adapted from page 80 of slides
• 94. task_struct cg_list rcu_node_entry (CONFIG_PREEMPT_RCU=y) tasks thread_node (linked to signal_struct) Kernel Source Reference: 6.2 Does not use RCU: not read-mostly data structures (protected by spinlock) RCU: Deferred destruction ptraced ptrace_entry RCU + spinlock children sibling Case study: Access task_struct linked list __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);