* Based on kernel 5.11 (x86_64) – QEMU
* 2-socket CPUs (4 cores/socket)
* 16GB memory
* Kernel parameter: nokaslr norandmaps
* KASAN: disabled
* Userspace: ASLR is disabled
* Legacy BIOS
Linux Synchronization Mechanism: spinlock
Adrian Huang | Dec, 2022
Agenda
• Spinlock history – Approach evolution
✓Simple Approach: Spin on test-and-set
✓Test and test-and-set (spin on read)
✓Ticket spinlock
✓MCS (Mellor-Crummey & Scott) lock
➢Performance benchmark: Ticket spinlock vs MCS lock
➢MCS Lock History in Linux Kernel
• Current spinlock approach in Linux kernel: qspinlock (Queue spinlock)
• spin_lock() SMP & UP
• spin_lock() API variants
✓How to use those variants in different scenarios
• Spinlock derivative: rwlock and seqlock
Simple Approach: Spin on test-and-set
[Diagram: 16 cores (4 per socket) sharing a memory bus; one core holds the lock while five spinning cores keep issuing TestAndSet on the lock word in memory]
[Spinning Cores] Keep consuming the memory bus (every TestAndSet writes 1): cache-coherence traffic
Test-and-set (atomic):
1. old_value = read a memory location
2. Write 1 to a memory location
3. Return old_value
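To make the mechanism concrete, here is a minimal user-space sketch of a spin-on-test-and-set lock in C11. This is an illustration only, not the kernel's code; the type and function names are made up for this example.

#include <stdatomic.h>

typedef struct { atomic_int locked; } tas_lock;   /* hypothetical type */

static void tas_lock_acquire(tas_lock *l)
{
    /* test-and-set: atomically write 1 and return the old value;
     * every spin iteration is another bus-level atomic write */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;
}

static void tas_lock_release(tas_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}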
Simple Approach: Spin on test-and-set
[Diagram: the lock holder plus five spinning cores; each core's TestAndSet write sends Invalidate messages to the other cores' caches]
Because every test-and-set is a memory write, each spinning core invalidates the other cores' cached copies of the lock even when the value does not change.
• Read Invalidate message of MESI Protocol Messages
✓ Reference: C.2.2 MESI Protocol Messages of Is Parallel Programming Hard, And, If So, What Can You Do About It?
Simple Approach: Spin on test-and-set
[Diagram: after the invalidation, each spinning core takes a cache miss and reloads the lock line from memory while continuing to issue TestAndSet]
[Spinning Cores] Cores reload the line from memory due to the cache miss: performance impact
Test and test-and-set (spin on read)
1. Spinning is done in the cache without consuming the memory bus
2. Reduces the repeated test-and-set cost while the lock is held
[Diagram: while the lock is held, each spinning core spins on its locally cached copy of the lock instead of issuing TestAndSet on the memory bus]
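A hedged sketch of the "spin on read" variant, extending the tas_lock type from the test-and-set sketch above: the inner loop is a plain load that hits in the local cache while the lock is held, and the atomic write is attempted only when the lock looks free.

static void ttas_lock_acquire(tas_lock *l)
{
    for (;;) {
        /* test: spin on a read; served from the local cache while held */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* test-and-set: try the atomic write only once the lock looks free */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}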
[Diagram: the lock holder releases the lock; the release write invalidates the cached lock copy on every spinning core]
Test and test-and-set (spin on read)
1. Spinning cores incur a read miss and fetch the new value back into cache
2. Spinning cores compete for the memory bus
3. The first core to test-and-set acquires the lock
4. The other spinning cores cannot get the lock: more cache invalidations & cache misses
• Quiescence: the time until all of these operations have finished
[Diagram: after the release, all spinning cores race to TestAndSet the lock; the winner becomes the new lock holder and the rest resume spinning]
Performance Comparison
Reference from: T. E. Anderson, "The performance of spin lock alternatives for shared-memory multiprocessors",
IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, 1990.
Issue Statements about “Spin on test-and-set”
and “Test and test-and-set”
• Cache coherence
✓[Cache-line bouncing] Scalability issue
• All spinning cores compete for the lock when it is freed by the lock holder
✓Unfairness
➢Ticket spinlock is proposed to fix the unfairness
Ticket spinlock (v2.6.25: released in April 2008)
Concept: pseudo code
Ticket spinlock (v2.6.25: released in April 2008)
Concept: pseudo code
current_ticket, next_ticket: the two ticket counters, packed into the single `slock` word
v2.6.25 implementation
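The slide's pseudo-code image is not captured in this text; below is a hedged C11 sketch of the same concept. The field names follow the slide, but this is not the v2.6.25 x86 implementation.

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;     /* ticket handed to the next arriving core */
    atomic_uint current_ticket;  /* ticket currently being served */
} ticket_lock;

static void ticket_lock_acquire(ticket_lock *l)
{
    unsigned int my_ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    /* FIFO: spin until the serving counter reaches our ticket */
    while (atomic_load_explicit(&l->current_ticket, memory_order_acquire)
           != my_ticket)
        ;
}

static void ticket_lock_release(ticket_lock *l)
{
    /* hand the lock to the next waiter in ticket order */
    atomic_fetch_add_explicit(&l->current_ticket, 1, memory_order_release);
}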
Ticket spinlock: Acquire a spinlock
[Diagram: the lock-holder core caches the {current_ticket, next_ticket} pair from memory]
Ticket spinlock: Other cores acquire a locked spinlock
[Diagram: each of the four spinning cores caches {current_ticket, next_ticket}; each holds its own next_ticket value and spins until current_ticket matches it]
Ticket spinlock: Lock holder unlocks the spinlock and increments current_ticket
[Diagram: the unlock-and-update write to current_ticket invalidates the cached {current_ticket, next_ticket} line on every spinning core]
Ticket spinlock: Cache miss → Reload
[Diagram: every spinning core misses on the invalidated line and reloads {current_ticket, next_ticket} from memory]
[Ticket spinlock] Cache coherence issue: non-scalable spinlock (Cache-line bouncing)
Ticket spinlock: Performance Measurement
Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks
are dangerous." Proceedings of the Linux Symposium. 2012.
Benchmarks
• FOPS: creates a single file and starts one
process on each core; each process repeatedly
calls open() and close() on the file.
• MEMPOP
✓ One process per core
✓ Each process mmaps 64 kB of memory
with the MAP_POPULATE flag, then
munmaps the memory
• PFIND: searches for a file by executing several
instances of the GNU find utility
• EXIM (mail server): A single master process
listens for incoming SMTP connections via TCP
and forks a new process for each connection
Ticket spinlock: Performance Measurement
Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks
are dangerous." Proceedings of the Linux Symposium. 2012.
MCS (Mellor-Crummey & Scott) lock
MCS Lock
• Adhere to fairness: FIFO (Implemented via linked list)
• Scalable spinlock: Prevent cache coherence issue
✓ Each core references its own node (struct mcs_spinlock) and spins on its own `locked` field
• Data ownership concept: Is Parallel Programming Hard, And, If So, What Can You Do About It?
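Before the step-by-step walkthrough on the following slides, here is a simplified user-space sketch of the MCS algorithm, modeled loosely on kernel/locking/mcs_spinlock.h. GCC/Clang __atomic builtins stand in for the kernel's xchg()/cmpxchg_release(); this is not the kernel source itself.

#include <stddef.h>

struct mcs_spinlock {
    struct mcs_spinlock *next;
    int locked;
};

/* lock points to the tail of the waiter queue (the "main MCS lock") */
static void mcs_spin_lock(struct mcs_spinlock **lock,
                          struct mcs_spinlock *node)
{
    struct mcs_spinlock *prev;

    node->next = NULL;
    node->locked = 0;

    /* xchg(): atomically become the new tail, get the old tail back */
    prev = __atomic_exchange_n(lock, node, __ATOMIC_ACQ_REL);
    if (!prev)
        return;                 /* queue was empty: lock acquired, no spinning */

    /* WRITE_ONCE(prev->next, node): link behind the previous tail */
    __atomic_store_n(&prev->next, node, __ATOMIC_RELEASE);
    /* spin on our own node's cache line until the predecessor hands over */
    while (!__atomic_load_n(&node->locked, __ATOMIC_ACQUIRE))
        ;
}

static void mcs_spin_unlock(struct mcs_spinlock **lock,
                            struct mcs_spinlock *node)
{
    struct mcs_spinlock *next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE);

    if (!next) {
        /* no visible successor: try to reset the tail to NULL */
        struct mcs_spinlock *expected = node;
        if (__atomic_compare_exchange_n(lock, &expected, NULL, 0,
                                        __ATOMIC_RELEASE, __ATOMIC_RELAXED))
            return;             /* cmpxchg_release() succeeded: queue empty */
        /* a successor is mid-enqueue: wait for its next-pointer write */
        while (!(next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE)))
            ;
    }
    __atomic_store_n(&next->locked, 1, __ATOMIC_RELEASE);  /* hand over */
}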
MCS (Mellor-Crummey & Scott) lock
1. Adhere to fairness
2. Scalable spinlock: Prevent cache coherence issue
Main MCS lock: next = NULL, locked = 0
Core 0's mcs_spinlock node: next = NULL, locked = 0
Core 0 calls mcs_spin_lock(…): prev = xchg(lock, node); prev == NULL → lock acquired (no spinning)
MCS (Mellor-Crummey & Scott) lock
1. [Spinning core] checks its own `locked` field
2. ‘next’ pointer of the main MCS lock indicates the tail of the queue of waiting cores
Main MCS lock: next → tail of the waiter queue, locked = 0
Core 0's node: next = NULL, locked = 0; lock already acquired (no spinning)
Core 1 calls mcs_spin_lock(…):
1. prev = xchg(…): the main lock's tail pointer is replaced by core 1's node
2. prev != NULL → locked by another core
3-4. Core 1's node (next = NULL, locked = 0) is now the queue tail
5. WRITE_ONCE(prev->next, node), then spin until its own locked = 1
MCS (Mellor-Crummey & Scott) lock
Main MCS lock: next → core 1's node, locked = 0
Core 0 calls mcs_spin_unlock(…):
1. next = READ_ONCE(node->next)
2-3. next != NULL → set `next->locked = 1`
4. Core 1 sees locked = 1 → enters the critical section
MCS (Mellor-Crummey & Scott) lock
Cache coherence traffic only touches the next core in the queue (which stops spinning and enters the critical section)
MCS (Mellor-Crummey & Scott) lock
Core 1 (now the lock holder, locked = 1) calls mcs_spin_unlock(…):
1. next = READ_ONCE(node->next)
2. next == NULL → no visible successor: try cmpxchg_release(lock, node, NULL) to detach this node from the main MCS lock
MCS (Mellor-Crummey & Scott) lock
3. cmpxchg_release(lock, node, NULL) succeeds
4. The main MCS lock's next pointer is reset to NULL: the queue is empty again
Ticket spinlock vs MCS lock
Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks
are dangerous." Proceedings of the Linux Symposium. 2012.
Benchmarks
• FOPS: creates a single file and starts one
process on each core; each process repeatedly
calls open() and close() on the file.
✓ Executing the critical section increases
from 450 cycles on two cores to 852 cycles
on four cores
✓ The critical section is executed multiple
times per-operation and modifies shared
data, which incurs costly cache misses
• MEMPOP
✓ One process per core
✓ Each process mmaps 64 kB of memory
with the MAP_POPULATE flag, then
munmaps the memory
• PFIND: searches for a file by executing several
instances of the GNU find utility
• EXIM (mail server): A single master process
listens for incoming SMTP connections via TCP
and forks a new process for each connection
MCS Lock History in Linux Kernel
• Two variants
✓Standard: v3.15
➢ No Linux kernel subsystem uses it directly (nothing calls mcs_spin_lock() and mcs_spin_unlock()) because:
• sizeof(struct mcs_spinlock) > 4
- Note: sizeof(struct mcs_spinlock) = 16
• The spinlock struct is embedded into kernel structures. Example:
- struct page (size = 64) cannot tolerate the increased size
➢ kernel/locking/mcs_spinlock.h
➢ Replacement of ticket spinlock: qspinlock – based on standard MCS lock (v4.2)
• A simple generic 4-byte queued spinlock
• qspinlock is the default spinlock mechanism
✓Cancelable MCS lock (OSQ - Optimistic Spin Queue: MCS-like lock): v3.15
➢ Used by mutex implementation & rw-semaphores
➢ Mutex implementation paths:
• Fastpath: Uncontended case by using cmpxchg()
• Midpath (optimistic spinning) – taken while the lock owner is running
- Spin for mutex lock acquisition while the lock owner is running
- The lock owner is likely to release the lock soon
• Slowpath: The task is added to the waiting queue and sleeps until woken up by the unlock path.
➢ Mutex is a hybrid type (spinning & sleeping): Busy-waiting for a few cycles instead of immediately sleeping
➢ kernel/locking/{mutex.c, osq_lock.c}
➢ Reference: Generic Mutex Subsystem
qspinlock (Queue spinlock)
struct qspinlock (arch_spinlock_t): atomic_t val, 32 bits
• locked: bits 0-7
• pending: bit 8
• locked_pending: bits 0-15 (locked + pending)
• tail index: bits 16-17
• tail cpu: bits 18-31
• tail: bits 16-31 (tail index + tail cpu)
spin_lock(), spin_unlock()
struct spinlock (spinlock_t) → rlock: struct raw_spinlock (raw_spinlock_t) → raw_lock: struct qspinlock (arch_spinlock_t)
spin_lock → raw_spin_lock → _raw_spin_lock → __raw_spin_lock → preempt_disable, then do_raw_spin_lock → arch_spin_lock → queued_spin_lock
spin_unlock → raw_spin_unlock → _raw_spin_unlock → __raw_spin_unlock → do_raw_spin_unlock → arch_spin_unlock → queued_spin_unlock, then preempt_enable
qspinlock (Queue spinlock) – First core to acquire the lock
qspinlock state: locked = 0 → 1, pending = 0, tail = 0 (atomic_t val)
Core 0 [Nobody occupies this lock]
spin_lock() → Lock owner
atomic_try_cmpxchg_acquire()
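A hedged C11 sketch of this fastpath: the kernel's queued_spin_lock() is equivalent in spirit, but the names below (queued_spin_lock_sketch, Q_LOCKED_VAL) are invented for this example, and the slowpath is only a declaration.

#include <stdatomic.h>

#define Q_LOCKED_VAL 1u          /* mirrors the kernel's _Q_LOCKED_VAL */

void queued_spin_lock_slowpath_sketch(atomic_uint *val);  /* pending/MCS path */

static inline void queued_spin_lock_sketch(atomic_uint *val)
{
    unsigned int old = 0;
    /* fastpath: claim the lock only if locked, pending and tail are all 0 */
    if (atomic_compare_exchange_strong_explicit(val, &old, Q_LOCKED_VAL,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return;
    queued_spin_lock_slowpath_sketch(val);
}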
qspinlock – Second core to acquire the lock
qspinlock state: locked = 1, pending = 0 → 1, tail = 0 (atomic_t val)
Core 0: lock owner
Core 1 calls spin_lock():
1. atomic_try_cmpxchg_acquire() → false
2. queued_spin_lock_slowpath() → queued_fetch_set_pending_acquire(): set the `pending` bit
3. atomic_cond_read_acquire(): spin until `locked = 0`, then clear the `pending` field and take the lock
[Second core for lock acquisition] Spins on the `locked` field of arch_spinlock_t
(Optimization: no need to set up a per-cpu MCS node)
qspinlock – Third core to acquire the lock
qspinlock state: locked = 1, pending = 1, tail = 0 (atomic_t val)
Core 0: lock owner
Core 1: spinning until `locked = 0`
Core 2:
1. atomic_try_cmpxchg_acquire() → false
2. queued_spin_lock_slowpath() → val = queued_fetch_set_pending_acquire()
• if (val & ~_Q_LOCKED_MASK): queue this task
qspinlock – Third core to acquire the lock
qspinlock state: locked = 1, pending = 1, tail → core 2's per-cpu MCS node (tail index + tail cpu)
Core 0: lock owner
Core 1: spinning until `locked = 0`
Core 2 (per-cpu mcs_spinlock node: next = NULL, locked = 0, count):
3. xchg_tail(): publish this node as the queue tail
4. val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
• Spinning until `locked = 0 && pending = 0`
[Third core for lock acquisition] Spins until `locked = 0 && pending = 0`
qspinlock – Fourth core to acquire the lock
qspinlock state: locked = 1, pending = 1, tail → core 3's per-cpu MCS node (tail index + tail cpu)
Core 0: lock owner
Core 1: spinning until `locked = 0`
Core 2 (per-cpu mcs_spinlock node): spinning until `locked = 0 && pending = 0`
Core 3 (per-cpu mcs_spinlock node: next = NULL, locked = 0, count):
1. xchg_tail(): the tail now points to core 3; core 2's node links to core 3 via `next`
2-3. Core 3 spins until its own `locked = 1`
[Fourth and later cores: N >= 4] Spin on their own per-cpu `locked` field: reduces cache-line bouncing
• Who sets `mcs_spinlock.locked = 1`? (Hint: follow the MCS lock implementation)
qspinlock – queued_spin_unlock()
qspinlock state: locked = 1 → 0, pending = 1
1. Core 0 (the lock owner) calls queued_spin_unlock(): the `locked` byte is cleared
2. Core 1, which was spinning until `locked = 0`, becomes the new lock owner
Core 2 (per-cpu mcs_spinlock node): still spinning until `locked = 0 && pending = 0`
Core 3 (per-cpu mcs_spinlock node): still spinning until its own `locked = 1`
qspinlock – queued_spin_unlock() & queued_spin_lock()
qspinlock state: locked = 1 → 0, pending = 1
Core 0: lock owner, releasing via queued_spin_unlock()
Core 1: spinning until `locked = 0`
Core 2 (per-cpu mcs_spinlock node): spinning until `locked = 0 && pending = 0`
Core 3 (per-cpu mcs_spinlock node): spinning until its own `locked = 1`
What happens to core 4 if it invokes queued_spin_lock() under this circumstance?
qspinlock – queued_spin_unlock() & queued_spin_lock()
qspinlock state: locked = 1 → 0, pending = 1
Core 0: lock owner, releasing; Core 1: spinning until `locked = 0`
Core 2 (per-cpu mcs_spinlock node): spinning until `locked = 0 && pending = 0`
Core 3 (per-cpu mcs_spinlock node): spinning until its own `locked = 1`
Core 4 (per-cpu mcs_spinlock node: next = NULL, locked = 0, count):
1. Core 4 is queued behind core 3: FIFO ordering, adhering to the same fairness principle as the ticket spinlock
2. It spins on its own `mcs_spinlock.locked`: reduces cache-line bouncing
• atomic_try_cmpxchg_acquire(…)
◼ Can acquire the lock (set `locked = 1`) only when the whole 32-bit qspinlock (val) is 0.
◼ Otherwise, the core goes to the slowpath.
spin_lock() SMP & UP
SMP spin_lock(): spin_lock → raw_spin_lock → _raw_spin_lock → __raw_spin_lock → preempt_disable, then do_raw_spin_lock → arch_spin_lock → queued_spin_lock
UP spin_lock(): spin_lock → raw_spin_lock → _raw_spin_lock → preempt_disable, then __LOCK
spinlock() variants
• spin_lock() & spin_unlock()
✓Multiple process contexts share the same data
• spin_lock_irq() & spin_unlock_irq()
✓Process context and top half (hardirq) share the same data
• spin_lock_irqsave() & spin_unlock_irqrestore()
✓Process context and top half (hardirq) share the same data
✓Save/restore eflags
• spin_lock_bh() & spin_unlock_bh()
✓Process context and bottom half share the same data
• spin_lock_nest_lock() & spin_lock_nested()
✓lockdep: annotate places where multiple locks of the same class are taken, so lockdep does not report a false deadlock – Commit b7d39aff9145 (“lockdep: spin_lock_nest_lock()”)
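A hedged kernel-style usage sketch for the hardirq case, as motivated by the deadlock scenarios below. The lock name, shared variable, and handler are hypothetical; spin_lock_irqsave()/spin_unlock_irqrestore() are the real API.

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(dev_lock);       /* hypothetical lock */
static unsigned int dev_events;         /* data shared with the ISR */

void dev_process_context(void)          /* hypothetical helper */
{
    unsigned long flags;

    /* save the current interrupt state and disable local IRQs:
     * the ISR below can no longer preempt this critical section */
    spin_lock_irqsave(&dev_lock, flags);
    dev_events = 0;
    spin_unlock_irqrestore(&dev_lock, flags);
}

static irqreturn_t dev_isr(int irq, void *data)   /* hypothetical ISR */
{
    /* hardirq context: local IRQs are already disabled on this core */
    spin_lock(&dev_lock);
    dev_events++;
    spin_unlock(&dev_lock);
    return IRQ_HANDLED;
}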
Deadlock between process context and interrupt context (the same core)
static DEFINE_SPINLOCK(lock);
1. Process context: spin_lock(&lock) → enters the critical section
2. A HW interrupt arrives on the same core
3. The ISR (hardirq, interrupt context) calls spin_lock(&lock) → spins forever: deadlock
Deadlock only happens on the same core in this circumstance.
Deadlock between process context and interrupt context (the same core)
static DEFINE_SPINLOCK(lock);
1. Process context: spin_lock_irq(&lock) → local interrupt delivery disabled, critical section entered
2. A HW interrupt arrives but stays pending (local interrupts are disabled)
3-4. The process context finishes the critical section and calls spin_unlock_irq(&lock) → local interrupts re-enabled
5-6. The ISR (hardirq, interrupt context) now runs: spin_lock(&lock) … spin_unlock(&lock) completes normally
• Use spin_{un}lock_irq() or spin_lock_irqsave()/spin_unlock_irqrestore() to prevent the deadlock
✓ spin_lock_irq() or spin_lock_irqsave(): disable local interrupt delivery
✓ spin_unlock_irq(): enable local interrupt delivery; spin_unlock_irqrestore(): restore the saved interrupt state
Who disables the local interrupt before entering ISR?
• Linux kernel common interrupt code?
• CPU HW?
Does Linux kernel disable local interrupt before entering ISR?
SYM_CODE_START(asm_common_interrupt): common_interrupt() is the entry point for all normal device IRQs.
At this point the current stack is not switched yet (that is done in error_entry()), and IF (the Interrupt Enable Flag) is already cleared!
So local interrupts are disabled before asm_common_interrupt() runs any instruction: the CPU disables them.
Call chain: asm_common_interrupt → common_interrupt(…) → __common_interrupt(…)
Common Interrupt Call Path: Serial Driver
handle_irq → run_irq_on_irqstack_cond → __run_irq_on_irqstack → asm_call_on_stack → handle_edge_irq → handle_irq_event → handle_irq_event_percpu → __handle_irq_event_percpu → serial8250_interrupt
Common interrupt in Linux kernel: handles vector numbers 32-255
Common Interrupt Code Path
[Reference] Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4
The CPU disables local interrupts when the interrupt handler is invoked via an interrupt gate.
Deadlock between process context and interrupt context (the same core)
1. Process context: spin_lock_irq(&lock1) → local interrupts disabled, critical section for lock1 entered
2. Nested inside it: spin_lock_irq(&lock2) … spin_unlock_irq(&lock2) → spin_unlock_irq(&lock2) re-enables local interrupts while lock1 is still held
3. A HW interrupt now arrives; the ISR (hardirq, interrupt context) calls spin_lock(&lock1) → deadlock
Deadlock between process context and interrupt context (the same core) – Solution 1
1. Process context: spin_lock_irq(&lock1) → local interrupts disabled
Nested inside it: spin_lock_irqsave(&lock2) … spin_unlock_irqrestore(&lock2) → restores the saved irq flags: local interrupts stay disabled because spin_lock_irq(&lock1) disabled them
spin_unlock_irq(&lock1) finally re-enables local interrupts; an ISR taking spin_lock(&lock1) can no longer interrupt the critical section
Deadlock between process context and interrupt context (the same core) – Solution 2
1. Process context: spin_lock_irqsave(&lock1) → local interrupts disabled, previous flags saved
Nested inside it: spin_lock_irqsave(&lock2) … spin_unlock_irqrestore(&lock2) → restores the saved irq flags: local interrupts remain disabled because spin_lock_irqsave(&lock1) disabled them
spin_unlock_irqrestore(&lock1) restores the original interrupt state
Deadlock between process context and bottom half (the same core)
static DEFINE_SPINLOCK(lock);
1. Process context: spin_lock(&lock) → enters the critical section
2. A HW interrupt arrives; the ISR (hardirq, interrupt context) takes spin_lock(&lock2) for its own data and releases it
3-4. On return from the ISR, the bottom half (softirq or tasklet) runs
5. The bottom half calls spin_lock(&lock) → spins forever: deadlock
The bottom half is invoked when returning from the ISR. A softirq runs either on return from an ISR or in the ksoftirqd kernel thread.
Deadlock between process context and bottom half (the same core) – solution
static DEFINE_SPINLOCK(lock);
1. Process context: spin_lock_bh(&lock) → disables bottom-half processing on this CPU and enters the critical section
2. A HW interrupt arrives; the ISR (hardirq) takes spin_lock(&lock2) for its own data
3. On return from the ISR, softirq processing returns immediately because in_interrupt() is true
4. spin_unlock_bh(&lock) re-enables bottom halves; the deferred bottom half then runs and takes spin_lock(&lock) safely
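A hedged usage sketch of the bottom-half variant. The lock and tasklet names are hypothetical; spin_lock_bh() and the tasklet callback style are the real kernel API.

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(bh_lock);        /* hypothetical lock */

void bh_process_context(void)           /* hypothetical helper */
{
    /* lock + disable bottom-half (softirq/tasklet) processing locally,
     * so the tasklet below cannot run on this CPU and deadlock */
    spin_lock_bh(&bh_lock);
    /* ... touch data shared with the tasklet ... */
    spin_unlock_bh(&bh_lock);
}

static void bh_tasklet_fn(struct tasklet_struct *t)   /* hypothetical tasklet */
{
    /* softirq context: a plain spin_lock() is sufficient here */
    spin_lock(&bh_lock);
    /* ... */
    spin_unlock(&bh_lock);
}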
rwlock (reader-writer spinlock)
[Diagram: N reader tasks enter the critical section via read_lock()/read_unlock(); writer tasks serialize on the same critical section via write_lock()/write_unlock()]
• Concept
✓ [Without readers] Only one writer can enter CS
✓ [Without writer] N-readers can enter CS simultaneously
✓ Readers and writer can enter CS simultaneously
✓ Reader(s) in CS → The writer needs to wait (spinning)
✓ Writer in CS → Readers need to wait (spinning)
• Mutual exclusion between reader and writer
• Writer might be starved!
• Reader has higher priority than writer
• Useful scenario: searching a linked list without modifying it
✓ Example: tasklist_lock → __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);
• Linux kernel developers are trying to remove rwlock in most cases.
[Reference] Lesson 2: reader-writer spinlocks
rwlock (reader-writer spinlock): Data structure
rwlock_t → raw_lock: struct qrwlock (arch_rwlock_t)
• atomic_t cnts (32 bits): writer lock (bits 0-7), W = a writer is waiting (bit 8), reader count (bits 9-31)
• arch_spinlock_t wait_lock: the lock contenders spin on
read_lock(), read_unlock(), write_lock(), write_unlock() operate on cnts.
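A hedged usage sketch in the spirit of tasklist_lock. The lock and function names are hypothetical; DEFINE_RWLOCK(), read_lock() and write_lock() are the real API.

#include <linux/spinlock.h>

static DEFINE_RWLOCK(obj_list_lock);    /* hypothetical lock */

void obj_list_search(void)              /* hypothetical reader */
{
    read_lock(&obj_list_lock);          /* N readers may hold this at once */
    /* ... walk the list without modifying it ... */
    read_unlock(&obj_list_lock);
}

void obj_list_update(void)              /* hypothetical writer */
{
    write_lock(&obj_list_lock);         /* exclusive: waits for all readers */
    /* ... modify the list ... */
    write_unlock(&obj_list_lock);
}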
seqlock (Sequential lock)
Reader (lockless retry loop): prev_count = read count → execute the critical section → if prev_count != current count, a writer intervened → retry; otherwise done.
Writer: spin_lock() → count++ → critical section → count++ → spin_unlock(). The count is odd while a write is in progress, and writers are mutually excluded by the spinlock.
• Lockless readers (only retry loop)
• No writer starvation
✓ Writer has higher priority than reader
✓ No mutual exclusion between reader and writer
➢ Mutual exclusion between writers because of writers’ spinlock
• Scenario: lots of readers and few writers
✓ Example: jiffies → Check get_jiffies_64(), tick_setup_periodic() and so on.
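A hedged usage sketch of the reader retry loop and writer path. The state variable and function names are hypothetical; DEFINE_SEQLOCK(), read_seqbegin()/read_seqretry() and write_seqlock()/write_sequnlock() are the real API, mirroring how get_jiffies_64() reads jiffies_64.

#include <linux/seqlock.h>

static DEFINE_SEQLOCK(state_lock);      /* hypothetical lock */
static u64 shared_state;                /* hypothetical shared data */

u64 state_read(void)                    /* lockless reader with retry loop */
{
    unsigned int seq;
    u64 val;

    do {
        seq = read_seqbegin(&state_lock);   /* snapshot the sequence count */
        val = shared_state;
    } while (read_seqretry(&state_lock, seq)); /* retry if a writer ran */

    return val;
}

void state_write(u64 val)               /* writers serialize on the spinlock */
{
    write_seqlock(&state_lock);         /* count++ → odd: write in progress */
    shared_state = val;
    write_sequnlock(&state_lock);       /* count++ → even again */
}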
spinlock derivative
          spinlock  rwlock  seqlock
Reader(s)     1        N        N
Writer(s)     1        1        1
spinlock derivative and RCU

          spinlock  rwlock  seqlock  RCU
Reader(s)     1        N        N      N
Writer(s)     1        1        1      1
Reference
• Boyd-Wickizer, Silas, et al. "Non-scalable locks are dangerous." Proceedings of the Linux
Symposium. 2012. http://pdos.csail.mit.edu/papers/linux:lock.pdf
• T. E. Anderson, "The performance of spin lock alternatives for shared-memory
multiprocessors", IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp.
6-16, 1990.
• J. M. Mellor-Crummey and M. L. Scott. “Algorithms for scalable synchronization on shared-
memory multiprocessors”, ACM Transactions on Computer Systems, 9(1):21–65, 1991.
• MCS locks and qspinlocks, LWN
• Is Parallel Programming Hard, And, If So, What Can You Do About It?

More Related Content

What's hot

Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelAdrian Huang
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...Adrian Huang
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Adrian Huang
 
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is bootedVmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is bootedAdrian Huang
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File SystemAdrian Huang
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)shimosawa
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtAnne Nicolas
 
Linux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBLinux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBshimosawa
 
Memory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdfMemory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdfAdrian Huang
 
Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)Adrian Huang
 
Kdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisKdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisBuland Singh
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBshimosawa
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)shimosawa
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionGene Chang
 
Qemu device prototyping
Qemu device prototypingQemu device prototyping
Qemu device prototypingYan Vugenfirer
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisBuland Singh
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory ManagementNi Zo-Ma
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugginglibfetion
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKBshimosawa
 

What's hot (20)

Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux Kernel
 
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
qemu + gdb: The efficient way to understand/debug Linux kernel code/data stru...
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
 
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is bootedVmlinux: anatomy of bzimage and how x86 64 processor is booted
Vmlinux: anatomy of bzimage and how x86 64 processor is booted
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
 
Linux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKBLinux Kernel Booting Process (2) - For NLKB
Linux Kernel Booting Process (2) - For NLKB
 
Memory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdfMemory Compaction in Linux Kernel.pdf
Memory Compaction in Linux Kernel.pdf
 
Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)
 
Kdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysisKdump and the kernel crash dump analysis
Kdump and the kernel crash dump analysis
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
Qemu device prototyping
Qemu device prototypingQemu device prototyping
Qemu device prototyping
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_Analysis
 
Linux Memory Management
Linux Memory ManagementLinux Memory Management
Linux Memory Management
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKB
 

Similar to spinlock.pdf

Basics_of_Kernel_Panic_Hang_and_ Kdump.pdf
Basics_of_Kernel_Panic_Hang_and_ Kdump.pdfBasics_of_Kernel_Panic_Hang_and_ Kdump.pdf
Basics_of_Kernel_Panic_Hang_and_ Kdump.pdfstroganovboris
 
Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Mauro Pagano
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itZubair Nabi
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuningosa_ora
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr
 
Introduction to memory order consume
Introduction to memory order consumeIntroduction to memory order consume
Introduction to memory order consumeYi-Hsiu Hsu
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...IndicThreads
 
Colin-Ian-King-Mentorship-Stress-ng.pdf
Colin-Ian-King-Mentorship-Stress-ng.pdfColin-Ian-King-Mentorship-Stress-ng.pdf
Colin-Ian-King-Mentorship-Stress-ng.pdfxiso
 
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsHeechul Yun
 
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...Jean-Philippe BEMPEL
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Tokyo Institute of Technology
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephRongze Zhu
 

Similar to spinlock.pdf (20)

Memory model
Memory modelMemory model
Memory model
 
Basics_of_Kernel_Panic_Hang_and_ Kdump.pdf
Basics_of_Kernel_Panic_Hang_and_ Kdump.pdfBasics_of_Kernel_Panic_Hang_and_ Kdump.pdf
Basics_of_Kernel_Panic_Hang_and_ Kdump.pdf
 
opt-mem-trx
opt-mem-trxopt-mem-trx
opt-mem-trx
 
Is your SQL Exadata-aware?
Is your SQL Exadata-aware?Is your SQL Exadata-aware?
Is your SQL Exadata-aware?
 
Memory model
Memory modelMemory model
Memory model
 
Kernel
KernelKernel
Kernel
 
Sge
SgeSge
Sge
 
AOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on itAOS Lab 4: If you liked it, then you should have put a “lock” on it
AOS Lab 4: If you liked it, then you should have put a “lock” on it
 
Docker vs kvm
Docker vs kvmDocker vs kvm
Docker vs kvm
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Profiler Guided Java Performance Tuning
Profiler Guided Java Performance TuningProfiler Guided Java Performance Tuning
Profiler Guided Java Performance Tuning
 
Instaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandraInstaclustr introduction to managing cassandra
Instaclustr introduction to managing cassandra
 
Introduction to memory order consume
Introduction to memory order consumeIntroduction to memory order consume
Introduction to memory order consume
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...
 
Colin-Ian-King-Mentorship-Stress-ng.pdf
Colin-Ian-King-Mentorship-Stress-ng.pdfColin-Ian-King-Mentorship-Stress-ng.pdf
Colin-Ian-King-Mentorship-Stress-ng.pdf
 
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
 
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
Devoxx Fr 2022 - Remèdes aux oomkill, warm-ups, et lenteurs pour des conteneu...
 
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
Integrating Cache Oblivious Approach with Modern Processor Architecture: The ...
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 

Recently uploaded

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 

Recently uploaded (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 

spinlock.pdf

  • 1. * Based on kernel 5.11 (x86_64) – QEMU * 2-socket CPUs (4 cores/socket) * 16GB memory * Kernel parameter: nokaslr norandmaps * KASAN: disabled * Userspace: ASLR is disabled * Legacy BIOS Linux Synchronization Mechanism: spinlock Adrian Huang | Dec, 2022
  • 2. Agenda • Spinlock history – Approach evolution ✓Simple Approach: Spin on test-and-set ✓Test and test-and-set (spin on read) ✓Ticket spinlock ✓MCS (Mellor-Crummey & Scott) lock ➢Performance benchmark: Ticket spinlock vs MCS lock ➢MCS Lock History in Linux Kernel • Current spinlock approach in Linux kernel: qspinlock (Queue spinlock) • spin_lock() SMP & UP • spin_lock() API variants ✓How to use those variants in different scenarios • Spinlock derivative: rwlock and seqlock
  • 3. Simple Approach: Spin on test-and-set Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet [Spinning Cores] Keep consuming memory bus (write 1): Cache coherence Test-and-set (atomic): 1. old_value = read a memory location 2. Write 1 to a memory location 3. Return old_value
  • 4. Simple Approach: Spin on test-and-set Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet Due to the memory write, spinning cores invalidate cache copies of cores even if the value is not changed Invalidate Invalidate Test-and-set (atomic): 1. old_value = read a memory location 2. Write 1 to a memory location 3. Return old_value • Read Invalidate message of MESI Protocol Messages ✓ Reference: C.2.2 MESI Protocol Messages of Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 5. Simple Approach: Spin on test-and-set Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder Test-and-set (atomic): 1. old_value = read a memory location 2. Write 1 to a memory location 3. Return old_value TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet [Spinning Cores] Cores reload the memory due to the cache miss: performance impact Cache miss: reload Cache miss: reload
  • 6. Test and test-and-set (spin on read) 1. Spinning is done in the cache without consuming memory bus 2. Reduce the repeated test-and-set cost if the lock is held Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet
  • 7. 1. Spinning is done in the cache without consuming memory bus 2. Reduce the repeated test-and-set cost if the lock is held Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder: release lock TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet Invalidate Invalidate Invalidate Invalidate Test and test-and-set (spin on read)
  • 8. 1. Spinning cores incur a read miss and fetch the new value back into cache 2. Spinning cores compete for accessing memory bus 3. The first core to test-and-set will acquire the lock 4. Other spinning cores cannot get lock: invalidate caches & cache misses • Quiescence: These operations are finished Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory lock Lock holder: release lock TestAndSet TestAndSet TestAndSet TestAndSet TestAndSet Test and test-and-set (spin on read)
  • 9. Performance Comparison Reference from: T. E. Anderson, "The performance of spin lock alternatives for shared-money multiprocessors", IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, 1990.
  • 10. Issue Statements about “Spin on test-and-set” and “Test and test-and-set” • Cache coherence ✓[Cache-line bouncing] Scalability issue • All spinning cores compete the lock when the lock is freed by lock holder ✓Unfairness ➢Ticket spinlock is proposed to fix the unfairness
  • 11. Ticket spinlock (v2.6.25: released in April 2008) Concept: pseudo code
  • 12. Ticket spinlock (v2.6.25: released in April 2008) Concept: pseudo code current_ticket next_ticket slock v2.6.25 implementation
  • 13. Ticket spinlock: Acquire a spinlock Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory {current_ticket, next_ticket} current_ticket next_ticket cache lock holder
  • 14. Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory {current_ticket, next_ticket} current_ticket next_ticket cache lock holder current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket spinning core spinning core spinning core spinning core Ticket spinlock: Other cores acquire a locked spinlock
  • 15. Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory {current_ticket, next_ticket} current_ticket next_ticket cache lock holder current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket unlock & update invalidate Ticket spinlock: Lock holder unlocks a spinlock and accumulates current_ticket
  • 16. Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 Core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Memory {current_ticket, next_ticket} current_ticket next_ticket cache lock holder current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket current_ticket next_ticket cache miss: reload Ticket spinlock: Cache miss → Reload [Ticket spinlock] Cache coherence issue: non-scalable spinlock (Cache-line bouncing)
  • 17. Ticket spinlock: Performance Measurement Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks are dangerous." Proceedings of the Linux Symposium. 2012. Benchmarks • FOPS: creates a single file and starts one process on each core. Repeated system calls “open() and close()”. • MEMPOP ✓ One process per core ✓ Each process mmaps 64 kB of memory with the MAP_POPULATE flag, then munmaps the memory • PFIND: searches for a file by executing several instances of the GNU find utility • EXIM (mail server): A single master process listens for incoming SMTP connections via TCP and forks a new process for each connection
  • 18. Ticket spinlock: Performance Measurement Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks are dangerous." Proceedings of the Linux Symposium. 2012.
  • 19. MCS (Mellor-Crummey & Scott) lock MCS Lock • Adhere to fairness: FIFO (Implemented via linked list) • Scalable spinlock: Prevent cache coherence issue ✓ Each core reference its own data ‘next’ (struct mcs_spinlock) • Data ownership concept: Is Parallel Programming Hard, And, If So, What Can You Do About It?
  • 20. MCS (Mellor-Crummey & Scott) lock 1. Adhere to fairness 2. Scalable spinlock: Prevent cache coherence issue mcs_spinlock next = NULL locked = 0 mcs_spinlock (core 0) next = NULL locked = 0 mcs_spin_lock(…) prev = xchg(…) prev = NULL → lock acquired (no spinning) Main MCS lock
  • 21. MCS (Mellor-Crummey & Scott) lock 1. [Spinning core] check its owner `locked` 2. ‘next’ pointer of the main MCS lock indicates the tail of the queue of waiting cores mcs_spinlock next locked = 0 Main MCS lock mcs_spinlock (core 0) next = NULL locked = 0 prev = xchg(…) lock acquired (no spinning) mcs_spinlock (core 1) next = NULL locked = 0 mcs_spin_lock(…) prev != NULL → locked by another core Will be replaced by xchg() 1 2 3 4 WRITE_ONCE(prev->next, node) 5 Spinning until locked = 1 prev
• 22. MCS (Mellor-Crummey & Scott) lock
[Diagram] Core 0 calls mcs_spin_unlock(...): (1) next = READ_ONCE(node->next), (2) next != NULL → set `next->locked = 1`, (3)(4) core 1 sees locked = 1 → enters the critical section.
• 23. MCS (Mellor-Crummey & Scott) lock
Same unlock sequence as the previous slide: cache coherence traffic only hits the next core in the queue (which stops spinning and enters the critical section), not every waiter.
• 24. MCS (Mellor-Crummey & Scott) lock
[Diagram] Core 1 calls mcs_spin_unlock(...): (1)(2) next = READ_ONCE(node->next) is NULL → no visible successor, so try cmpxchg_release(lock, node, NULL) to empty the queue.
• 25. MCS (Mellor-Crummey & Scott) lock
[Diagram] (3)(4) cmpxchg_release(lock, node, NULL) succeeds: the main MCS lock's tail pointer returns to NULL and the lock is free; see the matching unlock sketch below.
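The unlock side, again condensed from kernel/locking/mcs_spinlock.h (the real code uses arch_mcs_spin_unlock_contended() for the hand-off):

static inline
void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
	struct mcs_spinlock *next = READ_ONCE(node->next);

	if (likely(!next)) {
		/* No visible successor: try to empty the queue (slides 24-25). */
		if (likely(cmpxchg_release(lock, node, NULL) == node))
			return;
		/* A new waiter won the xchg() race but has not linked in yet:
		 * wait for its WRITE_ONCE(prev->next, node). */
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}
	/* Hand the lock to the successor (slides 22-23): only its cache
	 * line is written, so coherence traffic hits one core only. */
	smp_store_release(&next->locked, 1);
}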
• 26. Ticket spinlock vs MCS lock
Reference Paper: Boyd-Wickizer, Silas, et al. "Non-scalable locks are dangerous." Proceedings of the Linux Symposium. 2012.
Benchmarks
• FOPS: creates a single file and starts one process on each core; each process repeatedly calls open() and close()
✓ Executing the critical section increases from 450 cycles on two cores to 852 cycles on four cores
✓ The critical section is executed multiple times per operation and modifies shared data, which incurs costly cache misses
• MEMPOP
✓ One process per core
✓ Each process mmaps 64 kB of memory with the MAP_POPULATE flag, then munmaps the memory
• PFIND: searches for a file by executing several instances of the GNU find utility
• EXIM (mail server): a single master process listens for incoming SMTP connections via TCP and forks a new process for each connection
• 27. MCS Lock History in Linux Kernel
• Two variants
✓ Standard: v3.15
➢ No Linux kernel subsystem uses it directly (nobody calls mcs_spin_lock() and mcs_spin_unlock()) because:
• sizeof(struct mcs_spinlock) > 4 — Note: sizeof(struct mcs_spinlock) = 16
• The spinlock struct is embedded into kernel structures. Example: struct page (size = 64) cannot tolerate the increased size
➢ kernel/locking/mcs_spinlock.h
➢ Replacement of ticket spinlock: qspinlock – based on the standard MCS lock (v4.2)
• A simple, generic 4-byte queued spinlock
• qspinlock is the default spinlock mechanism
✓ Cancelable MCS lock (OSQ – Optimistic Spin Queue, an MCS-like lock): v3.15
➢ Used by the mutex implementation & rw-semaphores
➢ Mutex implementation paths:
• Fastpath: uncontended case, handled with cmpxchg()
• Midpath (optimistic spinning) – entered when the lock owner is running (its priority is the highest)
 Spin for mutex lock acquisition while the lock owner is running
 The lock owner is likely to release the lock soon
• Slowpath: the task is added to the wait queue and sleeps until woken up by the unlock path
➢ Mutex is a hybrid type (spinning & sleeping): busy-wait for a few cycles instead of immediately sleeping
➢ kernel/locking/{mutex.c, osq_lock.c}
➢ Reference: Generic Mutex Subsystem
• 28. qspinlock (Queue spinlock)
struct qspinlock (arch_spinlock_t) is a 32-bit atomic_t val: bits 0-7 locked, bit 8 pending (bits 0-15 together form locked_pending), bits 16-17 tail index, bits 18-31 tail cpu.
Structure nesting: struct spinlock (spinlock_t) → rlock: struct raw_spinlock (raw_spinlock_t) → raw_lock: arch_spinlock_t (struct qspinlock)
spin_lock() → raw_spin_lock() → _raw_spin_lock() → __raw_spin_lock(): preempt_disable(), then do_raw_spin_lock() → arch_spin_lock() → queued_spin_lock()
spin_unlock() → raw_spin_unlock() → _raw_spin_unlock() → __raw_spin_unlock(): do_raw_spin_unlock() → arch_spin_unlock() → queued_spin_unlock(), then preempt_enable()
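The field boundaries shown above, condensed from include/asm-generic/qspinlock_types.h (little-endian case, NR_CPUS < 16384):

typedef struct qspinlock {
	union {
		atomic_t val;			/* whole 32-bit lock word */
		struct {
			u8  locked;		/* bits 0-7 */
			u8  pending;		/* bit 8 (within bits 8-15) */
		};
		struct {
			u16 locked_pending;	/* bits 0-15 */
			u16 tail;		/* bits 16-17: index of the
						 * per-cpu MCS node (task/
						 * softirq/hardirq/NMI);
						 * bits 18-31: tail cpu + 1 */
		};
	};
} arch_spinlock_t;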
• 29. qspinlock (Queue spinlock) – First core to acquire the lock
[Diagram] [Nobody holds the lock] Core 0 calls spin_lock(): atomic_try_cmpxchg_acquire() succeeds and sets locked 0 → 1; core 0 becomes the lock owner.
• 30. qspinlock – Second core to acquire the lock
[Diagram] Core 1 calls spin_lock():
1. atomic_try_cmpxchg_acquire() → false (locked = 1)
2. queued_spin_lock_slowpath() → queued_fetch_set_pending_acquire() sets pending 0 → 1
3. atomic_cond_read_acquire(): spins until `locked = 0`, then clears the `pending` field and takes the lock
[Second core for lock acquisition] Spins on the `locked` field of arch_spinlock_t (optimization: no need to set up a per-cpu MCS node)
• 31. qspinlock – Third core to acquire the lock
[Diagram] Core 1 keeps spinning until `locked = 0`. Core 2 calls spin_lock():
1. atomic_try_cmpxchg_acquire() → false
2. queued_spin_lock_slowpath() → val = queued_fetch_set_pending_acquire()
• if (val & ~_Q_LOCKED_MASK): someone is already pending or queued → queue this task
• 32. qspinlock – Third core to acquire the lock
[Diagram] 3. Core 2 takes its per-cpu mcs_spinlock node (next = NULL, locked = 0, count) and xchg_tail() publishes it as the queue tail
4. val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
• Spins until `locked = 0 && pending = 0`
[Third core for lock acquisition] Spins on the whole lock word until `locked = 0 && pending = 0`
• 33. qspinlock – Fourth core to acquire the lock
[Diagram] 1. Core 3 takes its per-cpu mcs_spinlock node and xchg_tail() makes it the new tail
2. Core 2's node now points to core 3's node via `next`
3. Core 3 spins until its own `locked = 1`
[Fourth and later cores: N >= 4] Spin on their own per-cpu `mcs_spinlock.locked`, which avoids cache-line bouncing
• Who sets `mcs_spinlock.locked = 1`? (Hint: adhere to the MCS lock implementation)
• 34. qspinlock – queued_spin_unlock()
[Diagram] 1. Core 0 releases the lock: locked 1 → 0
2. Core 1 (which was spinning on `locked`) clears `pending`, sets `locked = 1` and becomes the new lock owner
(Core 2 still spins until `locked = 0 && pending = 0`; core 3 still spins on its own `locked`.)
• 35. qspinlock – queued_spin_unlock() & queued_spin_lock()
[Diagram] 1. Core 1 is now the lock owner; cores 2 and 3 are still queued and spinning
2. What happens when core 4 invokes queued_spin_lock() under this circumstance?
• 36. qspinlock – queued_spin_unlock() & queued_spin_lock()
• atomic_try_cmpxchg_acquire(...)
◼ Can acquire the lock (set `locked = 1`) only when the whole 32-bit qspinlock value (val) is 0
◼ Otherwise, the core takes the slowpath
1. Core 4 is queued behind core 3: FIFO order, adhering to the same fairness principle as the ticket spinlock
2. Core 4 spins on its own `mcs_spinlock.locked`, which avoids cache-line bouncing
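The fast path that makes this decision, as defined in include/asm-generic/qspinlock.h (v5.11; comments added):

static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	/* Succeeds only when the WHOLE 32-bit word is 0: no owner,
	 * no pending core, no queued waiters. */
	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	/* Somebody holds, pends or queues -- core 4's case above. */
	queued_spin_lock_slowpath(lock, val);
}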
• 37. spin_lock() SMP & UP
SMP spin_lock(): spin_lock() → raw_spin_lock() → _raw_spin_lock() → __raw_spin_lock(): preempt_disable(), then do_raw_spin_lock() → arch_spin_lock() → queued_spin_lock()
UP spin_lock(): spin_lock() → raw_spin_lock() → _raw_spin_lock() → __LOCK: preempt_disable() only (no spinning on UP)
• 38. spinlock() variants
• spin_lock() & spin_unlock()
✓ Multiple process contexts share the same data
• spin_lock_irq() & spin_unlock_irq()
✓ Process context and top half (hardirq) share the same data
• spin_lock_irqsave() & spin_unlock_irqrestore()
✓ Process context and top half (hardirq) share the same data
✓ Save/restore EFLAGS
• spin_lock_bh() & spin_unlock_bh()
✓ Process context and bottom half share the same data
• spin_lock_nest_lock() & spin_lock_nested()
✓ lockdep: annotate places where we take multiple locks of the same class and avoid deadlock – Commit b7d39aff9145 (“lockdep: spin_lock_nest_lock()”)
• 40. Deadlock between process context and interrupt context (the same core)
[Diagram] static DEFINE_SPINLOCK(lock);
(1) Process context takes spin_lock(&lock) and enters the critical section. (2) A HW interrupt arrives on the same core; the ISR (hardirq) calls spin_lock(&lock) and spins forever: deadlock. (3) In this circumstance the deadlock only happens on the same core.
• 41. Deadlock between process context and interrupt context (the same core)
[Diagram] (1) Process context takes spin_lock_irq(&lock): local interrupt delivery is disabled while the lock is held. (2)(3) The HW interrupt stays pending; (4)(5) spin_unlock_irq(&lock) re-enables interrupts, and (6) only then does the ISR run and take the lock safely.
• Use spin_{un}lock_irq() or spin_lock_irqsave()/spin_unlock_irqrestore() to prevent this deadlock
✓ spin_lock_irq() or spin_lock_irqsave(): disable local interrupt delivery
✓ spin_unlock_irq() or spin_unlock_irqrestore(): enable/restore local interrupt delivery
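A hedged sketch of the pattern (hypothetical driver names; my_isr() would be registered with request_irq()):

static DEFINE_SPINLOCK(lock);
static unsigned long shared_count;	/* data shared with the ISR */

/* Process context: disable local interrupts while holding the lock so
 * the ISR below cannot preempt us and spin on the lock we hold. */
void update_from_process_context(void)
{
	spin_lock_irq(&lock);
	shared_count++;
	spin_unlock_irq(&lock);
}

/* ISR (hardirq): local interrupts are already disabled here, so a
 * plain spin_lock() suffices. */
irqreturn_t my_isr(int irq, void *dev_id)
{
	spin_lock(&lock);
	shared_count++;
	spin_unlock(&lock);
	return IRQ_HANDLED;
}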
• 42. Deadlock between process context and interrupt context (the same core)
[Same diagram] Who disables the local interrupt before entering the ISR?
• Linux kernel common interrupt code?
• CPU HW?
• 43. Does the Linux kernel disable local interrupts before entering the ISR? [Code screenshot] SYM_CODE_START(asm_common_interrupt) – common_interrupt(): entry point for all normal device IRQs
• 44. Does the Linux kernel disable local interrupts before entering the ISR? [Code screenshot] SYM_CODE_START(asm_common_interrupt)
• 45. Does the Linux kernel disable local interrupts before entering the ISR? [Debugger screenshot] At asm_common_interrupt the current stack has not been switched yet (that is done in error_entry()), and IF (Interrupt Enable Flag) is already cleared → local interrupts were disabled before entering asm_common_interrupt(): the CPU disables them.
• 46. Does the Linux kernel disable local interrupts before entering the ISR? [Code screenshot]
• 47. Does the Linux kernel disable local interrupts before entering the ISR? [Code screenshot] common_interrupt(…) → __common_interrupt(…)
• 48. Does the Linux kernel disable local interrupts before entering the ISR?
Common interrupt in the Linux kernel handles vector numbers 32-255.
Common interrupt call path (serial driver example): asm_common_interrupt → common_interrupt → __common_interrupt → handle_irq → run_irq_on_irqstack_cond → __run_irq_on_irqstack → asm_call_on_stack → handle_edge_irq → handle_irq_event → handle_irq_event_percpu → __handle_irq_event_percpu → serial8250_interrupt
• 50. [Reference] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4
The CPU disables local interrupts (clears IF) when the interrupt handler is called via an interrupt gate.
• 51. Deadlock between process context and interrupt context (the same core)
[Diagram] Process context nests two locks: (1) spin_lock_irq(&lock1), then spin_lock_irq(&lock2) ... spin_unlock_irq(&lock2). (2) The inner spin_unlock_irq(&lock2) unconditionally re-enables local interrupts while lock1 is still held. (3) If the HW interrupt fires now, the ISR's spin_lock(&lock1) spins on a lock held by the interrupted context: deadlock.
• 52. Deadlock between process context and interrupt context (the same core) – Solution 1
[Diagram] (1) Outer lock: spin_lock_irq(&lock1) disables local interrupts. Inner lock: spin_lock_irqsave(&lock2) ... spin_unlock_irqrestore(&lock2) restores the saved IRQ flags, so local interrupts remain disabled (they were already off because of spin_lock_irq(&lock1)).
• 53. Deadlock between process context and interrupt context (the same core) – Solution 2
[Diagram] (1) Both levels use spin_lock_irqsave()/spin_unlock_irqrestore(): spin_lock_irqsave(&lock1), then spin_lock_irqsave(&lock2) ... spin_unlock_irqrestore(&lock2) restores the saved flags, so local interrupts remain disabled (they were turned off by spin_lock_irqsave(&lock1)); see the sketch below.
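A hedged sketch of Solution 2 (nested_example() is hypothetical): because the inner critical section cannot know whether interrupts were already disabled, each unlock restores the EFLAGS its matching lock saved instead of unconditionally re-enabling interrupts:

static DEFINE_SPINLOCK(lock1);
static DEFINE_SPINLOCK(lock2);

void nested_example(void)
{
	unsigned long flags1, flags2;

	spin_lock_irqsave(&lock1, flags1);	/* interrupts off, old IF saved */

	spin_lock_irqsave(&lock2, flags2);
	/* ... critical section for lock2 ... */
	spin_unlock_irqrestore(&lock2, flags2);	/* interrupts stay off */

	/* ... critical section for lock1 ... */
	spin_unlock_irqrestore(&lock1, flags1);	/* IF restored to caller state */
}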
• 54. spinlock() variants (recap)
• spin_lock() & spin_unlock()
✓ Multiple process contexts share the same data
• spin_lock_irq() & spin_unlock_irq()
✓ Process context and top half (hardirq) share the same data
• spin_lock_irqsave() & spin_unlock_irqrestore()
✓ Process context and top half (hardirq) share the same data
✓ Save/restore EFLAGS
• spin_lock_bh() & spin_unlock_bh()
✓ Process context and bottom half share the same data
• spin_lock_nest_lock() & spin_lock_nested()
✓ lockdep: annotate places where we take multiple locks of the same class and avoid deadlock – Commit b7d39aff9145 (“lockdep: spin_lock_nest_lock()”)
• 55. Deadlock between process context and bottom half (the same core)
[Diagram] static DEFINE_SPINLOCK(lock);
(1) Process context takes spin_lock(&lock). (2) A HW interrupt arrives; (3) the ISR takes and releases an unrelated lock2. (4) On return from the ISR, the bottom half (softirq or tasklet) runs and (5) calls spin_lock(&lock), spinning forever: deadlock.
• 56. Deadlock between process context and bottom half (the same core)
Same scenario as the previous slide: the bottom half is invoked when returning from the ISR.
• 57. Deadlock between process context and bottom half (the same core)
Same scenario: a softirq is invoked either when returning from the ISR or when ksoftirqd runs.
• 58. Deadlock between process context and bottom half (the same core)
[Diagram] (1) Process context takes spin_lock_bh(&lock): bottom-half processing is disabled on this CPU. (2) A HW interrupt arrives; (3) the ISR runs with lock2. (4) On return from the ISR, softirq processing returns immediately because in_interrupt() is true (spin_lock_bh() raised the BH-disable count), so the bottom half cannot spin on `lock` until spin_unlock_bh(&lock).
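A hedged sketch of the spin_lock_bh() pattern (hypothetical names; event_bh_handler() would be called from a tasklet or softirq callback):

static DEFINE_SPINLOCK(ev_lock);
static unsigned int pending_events;	/* data shared with the bottom half */

/* Process context: disable bottom-half processing on this CPU so the
 * handler below cannot run here and spin on a lock we already hold. */
void queue_event(void)
{
	spin_lock_bh(&ev_lock);
	pending_events++;
	spin_unlock_bh(&ev_lock);
}

/* Bottom half: a plain spin_lock() is enough, since bottom halves do
 * not preempt each other on the same CPU. */
void event_bh_handler(void)
{
	spin_lock(&ev_lock);
	while (pending_events) {
		pending_events--;
		/* ... process one event ... */
	}
	spin_unlock(&ev_lock);
}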
• 59. rwlock (reader-writer spinlock)
[Diagram] Reader tasks 0..N call read_lock()/read_unlock() around the critical section; writer tasks 0..N call write_lock()/write_unlock().
• Concept
✓ [Without readers] Only one writer can enter the CS
✓ [Without a writer] N readers can enter the CS simultaneously
✓ Readers and a writer cannot enter the CS simultaneously
✓ Reader(s) in the CS → the writer must wait (spinning)
✓ Writer in the CS → readers must wait (spinning)
• Mutual exclusion between readers and the writer
• The writer might starve: readers have higher priority than the writer
• Useful scenario: searching a linked list without changing the list (see the sketch below)
✓ Example: tasklist_lock → __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);
• Linux kernel developers are trying to remove rwlock usage in most cases.
[Reference] Lesson 2: reader-writer spinlocks
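A hedged usage sketch matching the linked-list scenario (dev_list, dev_exists() and dev_add() are hypothetical):

static DEFINE_RWLOCK(dev_list_lock);
static LIST_HEAD(dev_list);

struct my_dev {
	struct list_head node;
	int id;
};

/* Many readers may search the list concurrently. */
bool dev_exists(int id)
{
	struct my_dev *dev;
	bool found = false;

	read_lock(&dev_list_lock);
	list_for_each_entry(dev, &dev_list, node) {
		if (dev->id == id) {
			found = true;
			break;
		}
	}
	read_unlock(&dev_list_lock);
	return found;
}

/* Only one writer at a time, and only when no reader is in the CS. */
void dev_add(struct my_dev *dev)
{
	write_lock(&dev_list_lock);
	list_add(&dev->node, &dev_list);
	write_unlock(&dev_list_lock);
}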
• 60. rwlock (reader-writer spinlock): data structure
rwlock_t → raw_lock: struct qrwlock (arch_rwlock_t)
• atomic_t cnts: bits 0-7 writer lock byte, bit 8 W (a writer is waiting), bits 9-31 reader count
• arch_spinlock_t wait_lock: waiters spin on this lock
read_lock()/read_unlock() and write_lock()/write_unlock() operate on cnts.
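The layout above, condensed from include/asm-generic/qrwlock_types.h and qrwlock.h (little-endian case):

typedef struct qrwlock {
	union {
		atomic_t cnts;			/* readers + writer state */
		struct {
			u8 wlocked;		/* locked for write? */
			u8 __lstate[3];
		};
	};
	arch_spinlock_t wait_lock;		/* waiters spin on this */
} arch_rwlock_t;

#define _QW_WAITING	0x100			/* a writer is waiting */
#define _QW_LOCKED	0x0ff			/* a writer holds the lock */
#define _QR_SHIFT	9			/* reader count shift */
#define _QR_BIAS	(1U << _QR_SHIFT)	/* one reader */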
• 61. seqlock (Sequential lock)
[Flowchart] Reader: prev_count = read count → run the critical section → retry if prev_count != current count. Writer: spin_lock → count++ → critical section → count++ → spin_unlock.
• Lockless readers (only a retry loop)
• No writer starvation
✓ The writer has higher priority than readers
✓ No mutual exclusion between readers and the writer
➢ Mutual exclusion between writers, enforced by the writers' spinlock
• Scenario: lots of readers and few writers
✓ Example: jiffies → check get_jiffies_64(), tick_setup_periodic() and so on.
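A hedged sketch of the pattern (state_value, state_read() and state_write() are hypothetical; compare get_jiffies_64()):

static DEFINE_SEQLOCK(state_lock);
static u64 state_value;

/* Lockless reader: retry while a writer changed the sequence count. */
u64 state_read(void)
{
	unsigned int seq;
	u64 val;

	do {
		seq = read_seqbegin(&state_lock);
		val = state_value;
	} while (read_seqretry(&state_lock, seq));

	return val;
}

/* Writer: takes the writers' spinlock and bumps the count around the CS. */
void state_write(u64 val)
{
	write_seqlock(&state_lock);	/* count++ (odd: write in progress) */
	state_value = val;
	write_sequnlock(&state_lock);	/* count++ (even again) */
}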
• 62. spinlock derivative
            spinlock   rwlock   seqlock
Reader(s)       1         N        N
Writer(s)       1         1        1
• 63. spinlock derivative and RCU
            spinlock   rwlock   seqlock   RCU
Reader(s)       1         N        N       N
Writer(s)       1         1        1       1
• 64. Reference
• Boyd-Wickizer, Silas, et al. "Non-scalable locks are dangerous." Proceedings of the Linux Symposium. 2012. http://pdos.csail.mit.edu/papers/linux:lock.pdf
• T. E. Anderson, "The performance of spin lock alternatives for shared-memory multiprocessors", IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp. 6-16, 1990.
• J. M. Mellor-Crummey and M. L. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors", ACM Transactions on Computer Systems, 9(1):21-65, 1991.
• MCS locks and qspinlocks, LWN
• Is Parallel Programming Hard, And, If So, What Can You Do About It?