1. The timing behavior of the OS must be predictable: every OS service must have an upper bound on its execution time!
2. The OS must manage timing and scheduling: the OS may have to be aware of task deadlines (unless scheduling is done off-line).
3. The OS must be fast.
UNIT IV FAILURE RECOVERY AND FAULT TOLERANCE 9
Basic Concepts-Classification of Failures – Basic Approaches to Recovery; Recovery in
Concurrent System; Synchronous and Asynchronous Checkpointing and Recovery; Check
pointing in Distributed Database Systems; Fault Tolerance; Issues - Two-phase and Nonblocking
Commit Protocols; Voting Protocols; Dynamic Voting Protocols;
Operating System 23: Process Synchronization, by Vaibhav Khanna
Processes can execute concurrently
May be interrupted at any time, partially completing execution
Concurrent access to shared data may result in data inconsistency
Maintaining data consistency requires mechanisms to ensure the orderly execution of cooperating processes
Illustration of the problem: suppose that we want to provide a solution to the producer-consumer problem that fills all the buffers. We can do so by having an integer counter that keeps track of the number of full buffers. Initially, counter is set to 0. It is incremented by the producer after it produces a new buffer and decremented by the consumer after it consumes a buffer.
LSC Revisited - From Scenarios to Distributed Components, by Dirk Fahland
Scenario-based techniques such as Message Sequence Charts
(MSC) and Live Sequence Charts (LSC) are a technique to specify
behavior of complex, distributed systems in an intuitive manner,
particularly at early stages of system design. Despite its intuitive
nature, the technique poses some challenges. The most prominent is to
automatically synthesize an operational system model (a statechart or
a Petri net) from a given specification; the model can then serve as a
blueprint for implementation in hardware and software. While MSCs are
essentially too weak to specify complex systems, LSCs are too strong:
synthesis of the components of a distributed system fails.
In my talk, I will reconsider the semantics of LSC-style scenarios
regarding expressive power, ability to specify distributed behaviors
and solving the synthesis problem. I will show that by changing the
interpretation of LSC from linear time to simple branching time
semantics, one obtains a simple, yet very expressive and intuitive
scenario-based specification language. By choosing partial orders
instead of sequential runs as semantic domain, one can faithfully
specify the behaviors of a distributed system. We call this notation
distributed LSC (dLSC). As the main result, I will present a complete
technique for synthesizing Petri net components from any given dLSC
specification, in polynomial time.
Remote seminar talk held in the Advanced Software Tools Research Seminar of S. Maoz and A. Yehudai at Tel Aviv University, January 7, 2013.
These slides were presented during a technical event at my organization. They give an overview of how to find the root cause of unexpected system-down events, and are mainly useful for Linux or Unix system administrators. Here, I tried to cover all aspects of the topic. It took me more than two hours to present these slides, but one can also cover them within a shorter time span. The gray background of the slides hides the company logo and preserves the confidentiality of the private template. The knowledge, however, is not restricted :)
Mutual exclusion:
A concurrency-control property introduced to prevent race conditions.
Distributed transaction:
A transaction that spans multiple database servers connected by a network.
CS 301 Computer Architecture
Student # 1 (E), ID: 09
Kingdom of Saudi Arabia Royal Commission at Yanbu Yanbu University College Yanbu Al-Sinaiyah
Student # 2 (H), ID: 09
1. Introduction
High-performance processor design has recently taken two distinct approaches. One approach is to increase the execution rate by increasing the clock frequency of the processor or by reducing the execution latency of the operations. While this approach is important, much of its performance gain comes as a consequence of circuit and layout improvements and is beyond the scope of this research. The other approach is to directly exploit the instruction-level parallelism (ILP) in the program and to issue and execute multiple operations concurrently. This approach requires both compiler and microarchitecture support.
Traditional processor designs that issue and execute at most one operation per cycle are often called scalar designs. Static and dynamic scheduling techniques have been used to achieve better-than-scalar performance by issuing and executing more than one operation per cycle. While Johnson [7] defines a superscalar processor as any design that achieves better-than-scalar performance, popular usage of the term refers exclusively to processors that use dynamic scheduling techniques. For clarity, we use the term instruction-level parallel processors to refer to the general class of processors that execute more than one operation per cycle.
The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependencies within the set. These operations can then be scheduled within the same instruction, subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware is able to issue and execute these operations directly, with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions.
Review of Some Checkpointing Schemes for Distributed and Mobile Computing Environments (Eswar Publications)
Fault-tolerance techniques enable systems to carry out tasks in the presence of faults. A checkpoint is a local state of a process saved on stable storage. In a distributed system, since the processes do not share memory, a global state of the system is defined as a combination of local states, one from each process. In case of a fault in a distributed system, checkpointing enables the execution of a program to be resumed from a previous consistent global state rather than from the beginning. In this way, the amount of useful processing lost because of the fault is appreciably reduced. In this paper, we discuss various issues related to checkpointing for distributed systems and mobile computing environments. We also discuss various types of checkpointing: coordinated checkpointing, asynchronous checkpointing, communication-induced checkpointing, and message-logging-based checkpointing. We also present a survey of some checkpointing algorithms for distributed systems.
1. RGLock: Recoverable Mutual
Exclusion for Non-Volatile Main
Memory Systems
MASc Thesis Seminar
by
Aditya Ramaraju
Academic Supervisor: Prof. Wojciech Golab
2. Outline
Preliminaries: Spinlocks
Motivation: Crash-recovery, NVMM
Shortcomings in Related Work
Execution Model
Recoverable Mutual Exclusion
RGLock Algorithm
Proof Sketch
Conclusion:
Learnings
Limitations
Further research
Summary of contributions
3. Preliminaries
The primary challenge of concurrency is managing access to shared, mutable state.
If there is no controlled access to shared data, some processes will obtain an inconsistent
view of this data.
A race condition arises when two concurrent processes simultaneously modify the value of a
shared variable and can produce different outcomes, depending on their sequence of
operations.
Controlled access to a Critical Section (CS), a block of code that manipulates shared data, is
needed to avoid race conditions in multiprocessor programming.
4. Preliminaries
Mutual Exclusion is the problem of implementing a CS such that no two concurrent
processes execute the CS simultaneously.
Generally, processes gain permission to access CS by acquiring the lock in an entry
protocol and then release the lock in an exit protocol, after completing the CS.
Actions that do not involve the protected shared resource are categorized under the non-critical section (NCS).
5. Preliminaries
A concurrent program is thus defined as a non-terminating loop alternating between
critical and non-critical sections.
A passage is a single iteration of such a loop, consisting of four sections of code in a
concurrent program with the following structure:
Doorway: a wait-free block of code in the entry protocol.
If the mutex is already being held by another process, busy-waiting is performed by a
technique called spinning.
[Slide diagram: Entry Protocol → Critical Section → Exit Protocol → Non-Critical Section]
6. Preliminaries
Spin-locks:
• Attempt to acquire lock by repeatedly polling a shared variable.
• Release the lock by resetting the spin variable.
• E.g., Test-and-Set lock, Ticket lock, etc.
• Prone to high contention on single cache line.
Queue-based locks:
• Contending processes “line up” in a queue; only the head enters the CS.
• FCFS guarantee, high scalability.
In-depth surveys by Raynal (1986), Anderson et al. (2003), and Buhr et al. (2014).
7. Preliminaries
MCS Lock (1991):
• Gained most widespread usage and popularity.
• Relies on fetch_and_store availability for doorway.
• Makes use of compare_and_swap (CAS) in lock release.
• Generates 𝒪(1) remote memory references.
• Requires only a constant amount of space per lock per process.
• Guarantees Mutual Exclusion, FCFS order, and Starvation freedom.
8. Motivation
Crash-recovery:
• Examples of crash failures: system crash, power loss, accidental or intentional termination, heuristic
deadlock recovery mechanisms, etc.
• In a crash-recovery model, a failed process may be resurrected after a crash failure to resume execution of
its algorithm.
• Several crash-recovery techniques exist for the message-passing model, which use check-pointing and
message logging.
• For DSM and CC models with SRAM-based caches and DRAM-based memories, such techniques are
poorly suited owing to frequent disk accesses.
9. Motivation
Crash-recoverable Mutex:
• Lamport was the first to consider failures, in his Bakery algorithm: processes ‘restart’ in the NCS when they fail.
• However, none of the prominent mutual exclusion algorithms (Peterson’s, Lamport’s Bakery, MCS, etc.) can
provide fault-tolerance “out-of-the box” if the state of the spin variable is lost in a crash failure.
Goals for a Crash-recoverable Mutex:
– No process’s queue entry is lost in the crash, i.e., no process in the system should starve due to a crash.
– Each process contains at most one instance of its record in the lock queue.
– At most one process owns the lock. Also, at most one process at a time believes it is the lock-holder.
– If a lock-holder crashes, then it should not lose the ownership when it recovers from the crash.
– No process should wait indefinitely to relinquish its lock ownership.
10. Motivation
How NVMM is a big step in the quest for a crash-recoverable mutex:
• Potentially the most advanced alternative to the 40-year-old CPU, DRAM, and disk design.
• Combines the high speed of SRAM, the density of DRAM and the non-volatility of flash memory.
• All execution state can be dissociated from process crashes and power failures by storing it on a persistent
non-volatile medium (PCM, FeRAM, MRAM, memristors, etc.).
Image: K. Bailey and L. Ceze, “Operating system implications of fast, cheap, non-volatile memory,” Proceedings
of the 13th USENIX conference on Hot topics in operating systems. USENIX Association, pp. 2–2, 2011.
11. Motivation
Why “out-of-the box” MCS is a poor fit in the event of a crash
(even in NVMM systems):
• Besides the state of the PC, the evidence of a process ever completing
the doorway is lost in the crash.
• A lock holder
• may attempt to acquire lock again
• may never set the lock free
• may never relinquish the lock
• A busy-waiting process
• may attempt to enter the queue twice!
• may never link itself behind last known predecessor
• may block itself even though it was just promoted
• In all cases above, the progress of most active processes in the queue
is impeded.
12. Shortcomings in Related Work
Bohannon et al. (1995 & 96) proposed recovery mechanisms for test_and_set lock and
MCS Lock. Michael and Kim (2009) proposed a CAS-based implementation of a
recoverable queue lock.
However, in the event of a crash, their solutions
require the OS/scheduler to play ‘Big Brother’
are highly inefficient in large non-homogeneous systems
involve a ‘cleanup’ routine that itself is assumed to never crash
do not account for system crash, i.e., all processes fail simultaneously
do not guarantee FCFS due to “usurping” of lock from a dead process
do not guarantee starvation freedom and are also prone to priority inversion during “cleanup”
13. Execution Model
Hardware considerations:
An asynchronous multi-processor architecture of Cache Coherent (CC) model – write-through approach
The main memory modules are based on the persistent and reliable Non-Volatile Random Access Memory
(NVRAM) medium. We assume that
• Information once stored in NVRAM is never lost or corrupted.
• the caching and memory ordering can be controlled to the point where the shared memory operations
are atomic and durable.
Local memory references (e.g., in-cache reads) vs Remote Memory References (RMRs).
The time complexity of our algorithm is measured by counting the RMRs performed during a passage.
Support for swap_and_store (SAS) and compare_and_swap (CAS) instructions.
14. Execution Model
Formalism:
Rather than the full I/O automata model, we take a less formal approach, defining the behavior of
processes using a pseudo-code representation.
A process is a sequential program consisting of operations on variables. Each variable is either private or
shared. Each process also has a special private variable, program counter (PC).
A step in a history corresponds to a statement execution or a crash.
The processes in the system interact with a finite set of variables in corresponding sequence of steps
recorded in an execution history 𝐻 ∈ ℋ.
In a fair history, each individual process in the system is given an opportunity to perform its locally controlled
steps infinitely often.
15. Execution Model
Formalism (contd.):
A crash is a failure in an execution of one process where the private variables of the crashed process are
reset to their initial values and the process simply stops executing any computation until it is active again.
A crash-recovery procedure reconstructs a crashed process’s state and resumes its active execution from
the point of failure in the algorithm.
A process is said to be in recovery until the execution of its crash-recovery procedure is complete.
Classification of steps:
Normal step
Crash-recovery step
CS step
16. Execution Model
Formalism (contd.):
A crash-recoverable execution history is a fair history wherein every process either executes infinitely many
passages or crashes a finite number of times.
In other words, if a process is ever inactive, it is not because it is crashing indefinitely.
17. Execution Model
Summary of assumptions:
A process in recovery reconstructs its state from the shared variables stored in non-volatile memory.
Process crashes are independent, i.e., failure of one process does not crash other active processes in the
system.
Other active processes in the system may read, modify and write to the globally accessible shared variables
of a process in recovery.
The code for critical section is idempotent and harmlessly repeatable by a process in recovery if it has the
necessary exclusive access to do so.
18. Recoverable Mutual Exclusion
To the best of our knowledge, we are the first to provide a formal specification to the
correctness properties of Recoverable Mutual Exclusion.
A crash-recoverable mutex satisfies all of the following:
Mutual Exclusion (ME)
First-come-first-served (FCFS)
Livelock-freedom (LF)
Starvation-freedom (SF)
Terminating Exit (TE)
Finite Recovery (FR)
20. RMEQ
𝑅𝑀𝐸𝑄 is a linked-list of qnodes.
Each qnode contains:
a checkpoint number 𝑐ℎ𝑘.
an 𝑎ℎ𝑒𝑎𝑑 pointer that determines the links in 𝑅𝑀𝐸𝑄 and also acts as the spin variable.
a 𝑛𝑒𝑥𝑡 pointer to hold the address of the successor qnode.
The lock is represented by pointer 𝐿, set either to 𝑛𝑢𝑙𝑙 when the lock is free or to the tail
qnode of 𝑅𝑀𝐸𝑄.
Processes append their qnodes to 𝑅𝑀𝐸𝑄 using the SAS instruction (doorway).
The process with head qnode in 𝑅𝑀𝐸𝑄 is the lock-holder.
To release a lock a process either sets 𝐿 to 𝑛𝑢𝑙𝑙 if it has no immediate successor in 𝑅𝑀𝐸𝑄, or
flips the successor’s spin variable to 𝑛𝑢𝑙𝑙.
22. RGLock Algorithm
Overview:
All processes start from an initial state in the NCS.
In a failure-free passage, execute 𝑎𝑐𝑞𝑢𝑖𝑟𝑒_𝑙𝑜𝑐𝑘, CS and 𝑟𝑒𝑙𝑒𝑎𝑠𝑒_𝑙𝑜𝑐𝑘 before returning to NCS.
A process may take several steps in NCS until subsequent request for lock acquisition.
If a process crashes at any point of execution within a failure-free passage, it
reads the state of its qnode from NVRAM;
invokes corresponding recovery procedure based on the 𝑐ℎ𝑘 value;
identifies the position of its qnode in RMEQ; and then
completes the crash-recoverable passage accordingly and returns to NCS.
23. RGLock Algorithm
atomic 𝒔𝒘𝒂𝒑_𝒂𝒏𝒅_𝒔𝒕𝒐𝒓𝒆 (SAS):
In one indivisible atomic step, a fetch_and_store is immediately followed by another store that writes the result
of the fetch_and_store operation to a location in the invoking process’s non-volatile memory.
Ensures strict FCFS order in lock acquisitions.
Aids a process in recovery in identifying the position of its qnode in 𝑅𝑀𝐸𝑄.
Pseudo-code representation:
function SAS (old_element: address, new_element: value, location: address)
atomic {
temp: val_type := *old_element
*old_element := new_element
*location := temp
}
27. Crash-recovery procedures
recoverBlocked:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 1 immediately after a crash.
• Check if 𝑞𝑖 ∈ 𝑅𝑀𝐸𝑄.
• If yes: busy-wait in waitForCS until 𝑞𝑖 is the head; proceed to the CS in recoverHead; release the lock.
• If no: return false and execute failureFree.

recoverHead:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 2 immediately after a crash, or from within recoverBlocked.
• Execute the CS, then release the lock.

recoverRelease:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 3 immediately after a crash.
• Check if 𝑞𝑖 ∈ 𝑅𝑀𝐸𝑄.
• If yes: release the lock.
• If no: reset 𝑞𝑖.𝑐ℎ𝑘 and return to the NCS.

failureFree:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 0 immediately after a crash, or if recoverBlocked returns false.
28. Proof Sketch
The correctness of our algorithm is derived by an induction on the length of the execution history or by contradiction,
where applicable.
We use a history variable 𝑄 which represents the sequence of process IDs whose qnodes are in 𝑅𝑀𝐸𝑄.
An invariant is established with respect to the state of 𝐿, 𝑄, 𝑅𝑀𝐸𝑄 and the 𝑎ℎ𝑒𝑎𝑑 and 𝑐ℎ𝑘 fields on a qnode.
We show that the elements of 𝑄 are the same as the qnodes in 𝑅𝑀𝐸𝑄 at the end of a finite history, in that order.
The head qnode of 𝑅𝑀𝐸𝑄 is the lock holder and since 𝑄 always has at most one head element, ME is guaranteed.
FCFS, SF, LF, and TE are proved by contradiction, using the invariant.
And since every procedure in the RGLock algorithm terminates in a finite number of steps, FR is guaranteed.
Finally, we show that the RGLock algorithm incurs 𝒪(1) RMRs per process per failure-free passage.
29. Conclusion
Learnings (for me, that is):
Less is more.
For designing synchronization data structures.
Evolution of qnodes in RMEQ.
Asynchrony is a harsh mistress.
𝑓𝑖𝑛𝑑𝑀𝑒 accuracy.
𝑤𝑎𝑖𝑡𝐹𝑜𝑟𝐶𝑆 correctness.
30. Conclusion
Known Limitations
Requires support for an unconventional hardware instruction (SAS).
𝑓𝑖𝑛𝑑𝑀𝑒 presets the number of processes in the system.
Further Research
Programmatic implementation of the algorithm.
Simplify the code for more rigorous analysis.
Bakery algorithm in the context of crash-recovery for NVMM.
Make provision for processes to be added to the system even after the algorithm is initialized.
Potential Impact
In-memory databases for ‘always-on’ applications and high-performance computing.
31. Conclusion
Summary of Contributions:
Formal specification of the correctness properties of Recoverable Mutual Exclusion.
RGLock: a first-of-its-kind crash-recoverable mutual exclusion lock for NVMM systems.
Proposed doorway instruction could help guide design of future NVMM architectures.
Distinguishing RGLock from earlier attempts for crash-recoverable mutex:
RGLock satisfies all safety and liveness properties simultaneously in presence of crash failures.
RGLock tolerates failures on any individual component, including a lock-holder, and system-wide crashes as well.
Compared to MCS Lock, RGLock does not inflate time complexity in failure-free execution.
A comprehensive proof of correctness for the RGLock algorithm.