1. The timing behavior of the OS must be predictable: every OS service must have an upper bound on its execution time!
2. The OS must manage timing and scheduling: the OS may have to be aware of task deadlines (unless scheduling is done off-line).
3. The OS must be fast.
UNIT IV FAILURE RECOVERY AND FAULT TOLERANCE 9
Basic Concepts-Classification of Failures – Basic Approaches to Recovery; Recovery in
Concurrent System; Synchronous and Asynchronous Checkpointing and Recovery; Check
pointing in Distributed Database Systems; Fault Tolerance; Issues - Two-phase and Nonblocking
Commit Protocols; Voting Protocols; Dynamic Voting Protocols;
Operating System 23: Process Synchronization, by Vaibhav Khanna
Processes can execute concurrently
May be interrupted at any time, partially completing execution
Concurrent access to shared data may result in data inconsistency
Maintaining data consistency requires mechanisms to ensure the orderly execution of cooperating processes
Illustration of the problem: suppose that we want to provide a solution to the producer-consumer problem that fills all the buffers. We can do so by having an integer counter that keeps track of the number of full buffers. Initially, counter is set to 0. It is incremented by the producer after it produces a new buffer and decremented by the consumer after it consumes a buffer.
LSC Revisited - From Scenarios to Distributed Components, by Dirk Fahland
Scenario-based techniques such as Message Sequence Charts
(MSC) and Live Sequence Charts (LSC) are a technique to specify
behavior of complex, distributed systems in an intuitive manner,
particularly at early stages of system design. Despite its intuitive
nature, the technique poses some challenges. The most prominent is to
automatically synthesize an operational system model (a statechart or
a Petri net) from a given specification; the model can then serve as a
blueprint for implementation in hardware and software. While MSCs are
essentially too weak to specify complex systems, LSCs are too strong:
synthesis of the components of a distributed system fails.
In my talk, I will reconsider the semantics of LSC-style scenarios
regarding expressive power, ability to specify distributed behaviors
and solving the synthesis problem. I will show that by changing the
interpretation of LSC from linear time to simple branching time
semantics, one obtains a simple, yet very expressive and intuitive
scenario-based specification language. By choosing partial orders
instead of sequential runs as semantic domain, one can faithfully
specify the behaviors of a distributed system. We call this notation
distributed LSC (dLSC). As the main result, I will present a complete
technique for synthesizing Petri net components from any given dLSC
specification, in polynomial time.
Remote seminar talk held in the Advanced Software Tools Research Seminar of S. Maoz and A. Yehudai at Tel Aviv University, January 7, 2013.
These slides were presented during a technical event at my organization. They give an overview of how to find the root cause of unexpected system-down events, and are mainly useful for Linux or Unix system administrators. Here, I tried to cover all aspects of the topic. It took me more than two hours to present these slides, but one can also cover them within a shorter time span. The gray background of the slides hides the company logo and preserves the confidentiality of the private template. The knowledge, however, is not restricted :)
Mutual exclusion:
A concurrency-control property introduced to prevent race conditions.
Distributed transaction:
A transaction that spans multiple database servers connected by a network.
CS 301 Computer Architecture
Student # 1 (E), ID: 09
Kingdom of Saudi Arabia Royal Commission at Yanbu Yanbu University College Yanbu Al-Sinaiyah
Student # 2 (H), ID: 09
1. Introduction
High-performance processor design has recently taken two distinct approaches. One approach is to increase the execution rate by increasing the clock frequency of the processor or by reducing the execution latency of the operations. While this approach is important, much of its performance gain comes as a consequence of circuit and layout improvements and is beyond the scope of this research. The other approach is to directly exploit the instruction-level parallelism (ILP) in the program and to issue and execute multiple operations concurrently. This approach requires both compiler and microarchitecture support.
Traditional processor designs that issue and execute at most one operation per cycle are often called scalar designs. Static and dynamic scheduling techniques have been used to achieve better-than-scalar performance by issuing and executing more than one operation per cycle. While Johnson [7] defines a superscalar processor as any design that achieves better-than-scalar performance, popular usage of the term refers exclusively to processors that use dynamic scheduling techniques. For clarity, we use the term instruction-level parallel processors to refer to the general class of processors that execute more than one operation per cycle.
The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependencies within the set. These operations can then be scheduled within the same instruction, subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware is able to issue and execute these operations directly, with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions.
Review of Some Checkpointing Schemes for Distributed and Mobile Computing Environments (Eswar Publications)
Fault-tolerance techniques enable systems to carry out tasks in the presence of faults. A checkpoint is a local state of a process saved on stable storage. In a distributed system, since the processes do not share memory, a global state of the system is defined as a combination of local states, one from each process. In case of a fault in a distributed system, checkpointing enables the execution of a program to be resumed from a previous consistent global state rather than from the beginning. In this way, the amount of useful processing lost because of the fault is appreciably reduced. In this paper, we discuss various issues related to checkpointing for distributed systems and mobile computing environments. We also discuss various types of checkpointing: coordinated checkpointing, asynchronous checkpointing, communication-induced checkpointing, and message-logging-based checkpointing. We also present a survey of some checkpointing algorithms for distributed systems.
1. RGLock: Recoverable Mutual
Exclusion for Non-Volatile Main
Memory Systems
MASc Thesis Seminar
by
Aditya Ramaraju
Academic Supervisor: Prof. Wojciech Golab
2. Outline
Preliminaries: Spinlocks
Motivation: Crash-recovery, NVMM
Shortcomings in Related Work
Execution Model
Recoverable Mutual Exclusion
RGLock Algorithm
Proof Sketch
Conclusion:
Learnings
Limitations
Further research
Summary of contributions
3. Preliminaries
The primary challenge of concurrency is managing access to shared, mutable state.
If there is no controlled access to shared data, some processes will obtain an inconsistent
view of this data.
A race condition arises when two concurrent processes simultaneously modify the value of a
shared variable and can produce different outcomes, depending on their sequence of
operations.
Controlled access to a Critical Section (CS), a block of code that manipulates shared data, is
needed to avoid race conditions in multiprocessor programming.
4. Preliminaries
Mutual Exclusion is the problem of implementing a CS such that no two concurrent
processes execute the CS simultaneously.
Generally, processes gain permission to access CS by acquiring the lock in an entry
protocol and then release the lock in an exit protocol, after completing the CS.
Actions that do not involve the protected shared resource are categorized under the non-critical section (NCS).
5. Preliminaries
A concurrent program is thus defined as a non-terminating loop alternating between
critical and non-critical sections.
A passage is a single iteration of such a loop, consisting of four sections of code in a
concurrent program with the following structure:
Doorway: a wait-free block of code in the entry protocol.
If the mutex is already being held by another process, busy-waiting is performed by a
technique called spinning.
[Slide diagram: Entry Protocol → Critical Section → Exit Protocol → Non-Critical Section]
6. Preliminaries
Spin-locks:
• Attempt to acquire lock by repeatedly polling a shared variable.
• Release the lock by resetting the spin variable.
• E.g., Test-and-Set lock, Ticket lock, etc.
• Prone to high contention on single cache line.
Queue-based locks:
• Contending processes “line up” in a queue; only the head enters the CS.
• FCFS guarantee, high scalability.
In-depth surveys by Raynal (1986), Anderson et al. (2003), and Buhr et al. (2014).
7. Preliminaries
MCS Lock (1991):
• Gained most widespread usage and popularity.
• Relies on fetch_and_store availability for doorway.
• Makes use of compare_and_swap (CAS) in lock release.
• Generates 𝒪(1) remote memory references.
• Requires only a constant amount of space per lock per process.
• Guarantees Mutual Exclusion, FCFS order, and Starvation freedom.
8. Motivation
Crash-recovery:
• Examples of crash failures: system crash, power loss, accidental or intentional termination, heuristic
deadlock recovery mechanisms, etc.
• In a crash-recovery model, a failed process may be resurrected after a crash failure to resume execution of
its algorithm.
• Several crash-recovery techniques exist for the message-passing model, which use check-pointing and
message logging.
• For DSM and CC models with SRAM-based caches and DRAM-based memories, such techniques are
poorly suited owing to frequent disk accesses.
9. Motivation
Crash-recoverable Mutex:
• Lamport was the first to consider failures, in his Bakery algorithm: processes ‘restart’ in the NCS when they fail.
• However, none of the prominent mutual exclusion algorithms (Peterson’s, Lamport’s Bakery, MCS, etc.) can
provide fault-tolerance “out-of-the box” if the state of the spin variable is lost in a crash failure.
Goals for a Crash-recoverable Mutex:
– No process’s queue entry is lost in the crash, i.e., no process in the system should starve due to a crash.
– Each process contains at most one instance of its record in the lock queue.
– At most one process owns the lock. Also, at most one process at a time believes it is the lock-holder.
– If a lock-holder crashes, then it should not lose the ownership when it recovers from the crash.
– No process should wait indefinitely to relinquish its lock ownership.
10. Motivation
How NVMM is a big step in the quest for a crash-recoverable mutex:
• Potentially the most advanced alternative to the 40-year-old CPU, DRAM, and disk design.
• Combines the high speed of SRAM, the density of DRAM and the non-volatility of flash memory.
• All execution state can be dissociated from process crashes and power failures by storing it on a persistent
non-volatile medium (PCM, FeRAM, MRAM, memristors, etc.).
Image: K. Bailey and L. Ceze, “Operating system implications of fast, cheap, non-volatile memory,” Proceedings
of the 13th USENIX conference on Hot topics in operating systems. USENIX Association, pp. 2–2, 2011.
11. Motivation
Why “out-of-the box” MCS is a poor fit in the event of a crash
(even in NVMM systems):
• Besides the state of the PC, the evidence of a process ever completing
the doorway is lost in the crash.
• A lock holder
• may attempt to acquire lock again
• may never set the lock free
• may never relinquish the lock
• A busy-waiting process
• may attempt to enter the queue twice!
• may never link itself behind last known predecessor
• may block itself even though it was just promoted
• In all cases above, the progress of most active processes in the queue
is impeded.
12. Shortcomings in Related Work
Bohannon et al. (1995 & 96) proposed recovery mechanisms for test_and_set lock and
MCS Lock. Michael and Kim (2009) proposed a CAS-based implementation of a
recoverable queue lock.
However, in the event of a crash, their solutions
require the OS/scheduler to play ‘Big Brother’
are highly inefficient in large non-homogeneous systems
involve a ‘cleanup’ routine that itself is assumed to never crash
do not account for system crash, i.e., all processes fail simultaneously
do not guarantee FCFS due to “usurping” of lock from a dead process
do not guarantee starvation freedom and are also prone to priority inversion during “cleanup”
13. Execution Model
Hardware considerations:
An asynchronous multi-processor architecture of Cache Coherent (CC) model – write-through approach
The main memory modules are based on the persistent and reliable Non-Volatile Random Access Memory
(NVRAM) medium. We assume that
• Information once stored in NVRAM is never lost or corrupted.
• the caching and memory ordering can be controlled to the point where the shared memory operations
are atomic and durable.
Local memory references (e.g., in-cache reads) vs Remote Memory References (RMRs).
The time complexity of our algorithm is measured by counting the RMRs performed during a passage.
Support for swap_and_store (SAS) and compare_and_swap (CAS) instructions.
14. Execution Model
Formalism:
Rather than the full I/O automata model, we take a less formal approach, defining the behavior of
processes using a pseudo-code representation.
A process is a sequential program consisting of operations on variables. Each variable is either private or
shared. Each process also has a special private variable, program counter (PC).
A step in a history corresponds to a statement execution or a crash.
The processes in the system interact with a finite set of variables in corresponding sequence of steps
recorded in an execution history 𝐻 ∈ ℋ.
In a fair history, each individual process in the system is given an opportunity to perform its locally controlled
steps infinitely often.
15. Execution Model
Formalism (contd.):
A crash is a failure in an execution of one process where the private variables of the crashed process are
reset to their initial values and the process simply stops executing any computation until it is active again.
A crash-recovery procedure reconstructs a crashed process’s state and resumes its active execution from
the point of failure in the algorithm.
A process is said to be in recovery until the execution of its crash-recovery procedure is complete.
Classification of steps:
Normal step
Crash-recovery step
CS step
16. Execution Model
Formalism (contd.):
A crash-recoverable execution history is a fair history wherein every process either executes infinitely many
passages or crashes a finite number of times.
In other words, if a process is ever inactive, it is not because it is crashing indefinitely.
17. Execution Model
Summary of assumptions:
A process in recovery reconstructs its state from the shared variables stored in non-volatile memory.
Process crashes are independent, i.e., failure of one process does not crash other active processes in the
system.
Other active processes in the system may read, modify and write to the globally accessible shared variables
of a process in recovery.
The code for critical section is idempotent and harmlessly repeatable by a process in recovery if it has the
necessary exclusive access to do so.
18. Recoverable Mutual Exclusion
To the best of our knowledge, we are the first to provide a formal specification to the
correctness properties of Recoverable Mutual Exclusion.
A crash-recoverable mutex satisfies all of the following:
Mutual Exclusion (ME)
First-come-first-served (FCFS)
Livelock-freedom (LF)
Starvation-freedom (SF)
Terminating Exit (TE)
Finite Recovery (FR)
20. RMEQ
𝑅𝑀𝐸𝑄 is a linked-list of qnodes.
Each qnode contains:
a checkpoint number 𝑐ℎ𝑘.
an 𝑎ℎ𝑒𝑎𝑑 pointer that determines the links in 𝑅𝑀𝐸𝑄 and also acts as the spin variable.
a 𝑛𝑒𝑥𝑡 pointer to hold the address of the successor qnode.
The lock is represented by pointer 𝐿, set either to 𝑛𝑢𝑙𝑙 when the lock is free or to the tail
qnode of 𝑅𝑀𝐸𝑄.
Processes append their qnodes to 𝑅𝑀𝐸𝑄 using the SAS instruction (doorway).
The process with head qnode in 𝑅𝑀𝐸𝑄 is the lock-holder.
To release a lock a process either sets 𝐿 to 𝑛𝑢𝑙𝑙 if it has no immediate successor in 𝑅𝑀𝐸𝑄, or
flips the successor’s spin variable to 𝑛𝑢𝑙𝑙.
22. RGLock Algorithm
Overview:
All processes start from an initial state in the NCS.
In a failure-free passage, execute 𝑎𝑐𝑞𝑢𝑖𝑟𝑒_𝑙𝑜𝑐𝑘, CS and 𝑟𝑒𝑙𝑒𝑎𝑠𝑒_𝑙𝑜𝑐𝑘 before returning to NCS.
A process may take several steps in NCS until subsequent request for lock acquisition.
If a process crashes at any point of execution within a failure-free passage, it
reads the state of its qnode from NVRAM;
invokes corresponding recovery procedure based on the 𝑐ℎ𝑘 value;
identifies the position of its qnode in RMEQ; and then
completes the crash-recoverable passage accordingly and returns to NCS.
23. RGLock Algorithm
atomic 𝒔𝒘𝒂𝒑_𝒂𝒏𝒅_𝒔𝒕𝒐𝒓𝒆 (SAS):
In one indivisible atomic step, a fetch_and_store is immediately followed by another store that writes the result
of the fetch_and_store operation to a location in the invoking process’s non-volatile memory.
Ensures strict FCFS order in lock acquisitions.
Aids a process in recovery in identifying the position of its qnode in 𝑅𝑀𝐸𝑄.
Pseudo-code representation:
function SAS (old_element: address, new_element: value, location: address)
atomic {
temp: val_type := *old_element
*old_element := new_element
*location := temp
}
27. Crash-recovery procedures
recoverBlocked:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 1 immediately after a crash.
• Check if 𝑞𝑖 ∈ 𝑅𝑀𝐸𝑄.
• If yes: busy-wait in waitForCS until 𝑞𝑖 is the head; proceed to the CS in recoverHead; release the lock.
• If no: return false and execute failureFree.

recoverHead:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 2 immediately after a crash, or from within recoverBlocked.
• Execute the CS, then release the lock.

recoverRelease:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 3 immediately after a crash.
• Check if 𝑞𝑖 ∈ 𝑅𝑀𝐸𝑄.
• If yes: release the lock.
• If no: reset 𝑞𝑖.𝑐ℎ𝑘 and return to the NCS.

failureFree:
• Invoked if 𝑞𝑖.𝑐ℎ𝑘 = 0 immediately after a crash, or if recoverBlocked returns false.
28. Proof Sketch
The correctness of our algorithm is derived by an induction on the length of the execution history or by contradiction,
where applicable.
We use a history variable 𝑄 which represents the sequence of process IDs whose qnodes are in 𝑅𝑀𝐸𝑄.
An invariant is established with respect to the state of 𝐿, 𝑄, 𝑅𝑀𝐸𝑄 and the 𝑎ℎ𝑒𝑎𝑑 and 𝑐ℎ𝑘 fields on a qnode.
We show that the elements of 𝑄 are the same as the qnodes in 𝑅𝑀𝐸𝑄 at the end of a finite history, in that order.
The head qnode of 𝑅𝑀𝐸𝑄 is the lock holder and since 𝑄 always has at most one head element, ME is guaranteed.
FCFS, SF, LF, and TE are proved by contradiction, using the invariant.
And since every procedure in the RGLock algorithm terminates in a finite number of steps, FR is guaranteed.
Finally, we show that the RGLock algorithm incurs 𝒪(1) RMRs per process per failure-free passage.
29. Conclusion
Learnings (for me, that is):
Less is more.
For designing synchronization data structures.
Evolution of qnodes in RMEQ.
Asynchrony is a harsh mistress.
𝑓𝑖𝑛𝑑𝑀𝑒 accuracy.
𝑤𝑎𝑖𝑡𝐹𝑜𝑟𝐶𝑆 correctness.
30. Conclusion
Known Limitations
Requires support for an unconventional hardware instruction (SAS).
𝑓𝑖𝑛𝑑𝑀𝑒 presets the number of processes in the system.
Further Research
Programmatic implementation of the algorithm.
Simplify the code for more rigorous analysis.
Bakery algorithm in the context of crash-recovery for NVMM.
Make provision for processes to be added to the system even after the algorithm is initialized.
Potential Impact
In-memory databases for ‘always-on’ applications and high-performance computing.
31. Conclusion
Summary of Contributions:
Formal specification of the correctness properties of Recoverable Mutual Exclusion.
RGLock: a first-of-its-kind crash-recoverable mutual exclusion lock for NVMM systems.
Proposed doorway instruction could help guide design of future NVMM architectures.
Distinguishing RGLock from earlier attempts for crash-recoverable mutex:
RGLock satisfies all safety and liveness properties simultaneously in presence of crash failures.
RGLock tolerates failures on any individual component, including a lock-holder, and system-wide crashes as well.
Compared to MCS Lock, RGLock does not inflate time complexity in failure-free execution.
A comprehensive proof of correctness for the RGLock algorithm.