Fault Tolerance
Basic Concepts 
• Availability 
The system is ready to work immediately 
• Reliability 
The system can run continuously 
• Safety 
When the system fails, nothing catastrophic happens 
• Maintainability 
A failed system can be easily repaired. 
Fault types: transient, intermittent, permanent
Failure Models 
Type of failure             Description 
Crash failure               A server halts, but is working correctly until it halts 
Omission failure            A server fails to respond to requests 
  Receive omission          A server fails to receive incoming messages 
  Send omission             A server fails to send messages 
Timing failure              A server's response lies outside the specified time interval 
Response failure            The server's response is incorrect 
  Value failure             The value of the response is wrong 
  State transition failure  The server deviates from the correct flow of control 
Arbitrary failure           A server may produce arbitrary responses at arbitrary times 
Different types of failures.
Failure Masking by Redundancy 
• Information redundancy (extra bits) 
• Time redundancy (extra operations) 
• Physical redundancy (extra equipment or processes)
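As a minimal illustration of information redundancy, the sketch below appends one even-parity bit to a word. The function names are illustrative and do not come from the slides. A single parity bit only detects an odd number of flipped bits; it cannot correct them.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = word.copy()
corrupted[2] ^= 1                 # simulate one bit flipped in transit

print(parity_ok(word), parity_ok(corrupted))
```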
Failure Masking by Redundancy 
Triple modular redundancy (TMR). 
An electronic circuit example
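The voting step of triple modular redundancy can be sketched in software as a majority function over three replica outputs. This is a simplified illustration, assuming the replicas produce comparable values; the name `vote` is hypothetical.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ: more than one replica is faulty.
        raise ValueError("no majority among the three replicas")
    return value


print(vote(1, 1, 0))   # the single faulty output 0 is masked
```

With three replicas, one faulty output is masked; two simultaneous faults are not.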
Process failures 
To tolerate a faulty process, identical processes are organized into a 
group 
When one process of the group fails, some other process in the group 
takes over its work 
Process groups may be dynamic 
Mechanisms are needed for managing group membership 
• A group server maintains information on membership (centralized) 
• Distributed management (more complex and time-consuming)
Flat Groups versus Hierarchical Groups 
a) Communication in a flat group (voting mechanism, slow decision) 
Replicated write protocols 
b) Communication in a simple hierarchical group (single point of failure) 
Primary based protocols
Client-server communication failures 
Using a reliable transport protocol (TCP) masks omission failures, 
but many failures are not masked. 
Classes of failure 
• The client is unable to locate the server – raising an exception is a solution, but 
transparency is lost 
• The request message from the client to the server is lost – retransmission 
• The server crashes after receiving a request 
• The reply message from the server to the client is lost – retransmission, but… 
• The client crashes after sending a request – an orphan is generated. 
(extermination, reincarnation with epoch #, gentle reincarnation, expiration…)
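The lost-reply case can be sketched as follows: the client cannot tell a lost request from a lost reply, so retransmission may cause the server to execute the same request twice. One common remedy, which is an assumption here and is not stated on the slide, is to tag each request with an identifier and let the server resend a cached reply for duplicates. All names are hypothetical.

```python
class Server:
    def __init__(self, filter_duplicates):
        self.filter_duplicates = filter_duplicates
        self.executions = 0
        self.replies = {}                      # request id -> cached reply

    def handle(self, request_id):
        if self.filter_duplicates and request_id in self.replies:
            return self.replies[request_id]    # resend reply, do not re-execute
        self.executions += 1
        reply = f"done-{request_id}"
        self.replies[request_id] = reply
        return reply


def call(server, request_id, lost_replies, max_attempts):
    """Send a request; the network drops the first `lost_replies` replies."""
    for attempt in range(max_attempts):
        reply = server.handle(request_id)
        if attempt >= lost_replies:
            return reply
    return None                                # give up and report the failure


plain = Server(filter_duplicates=False)
call(plain, 1, lost_replies=1, max_attempts=3)
filtered = Server(filter_duplicates=True)
call(filtered, 1, lost_replies=1, max_attempts=3)
print(plain.executions, filtered.executions)
```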
Server Crashes (1) 
A server in client-server communication 
a) Normal case 
b) Crash after execution 
c) Crash before execution 
At-least-once semantics: keep retrying after the server reboots until a reply is received 
At-most-once semantics: give up immediately and report the failure 
Exactly-once semantics: cannot be guaranteed in general
Server Crashes (2) 
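Example: a client sends a message to a server asking it to print some text (P); the server 
sends a completion message (M) back. The server can crash (C). 
The server either sends the completion message before printing (strategy M -> P) or after 
printing (strategy P -> M). Events in parentheses never happen because of the crash. 
Client                   Server 
                         Strategy M -> P           Strategy P -> M 
Reissue strategy         MPC    MC(P)  C(MP)       PMC    PC(M)  C(PM) 
Always                   DUP    OK     OK          DUP    DUP    OK 
Never                    OK     ZERO   ZERO        OK     OK     ZERO 
Only when ACKed          DUP    OK     ZERO        DUP    OK     ZERO 
Only when not ACKed      OK     ZERO   OK          OK     DUP    OK 
OK = text is printed once, DUP = text is printed twice, ZERO = text is not printed at all 
Different combinations of client and server strategies in the presence of server crashes.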
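The table can be reproduced with a small simulation. This is a sketch under two assumptions that are not spelled out on the slide: the client treats receipt of M as the acknowledgement, and the recovered server does not crash a second time. The dictionary and function names are hypothetical.

```python
# Events the server completed before crashing, per crash scenario.
# M = completion message sent, P = text printed, C = crash.
SCENARIOS = {
    "MPC": "MP", "MC(P)": "M", "C(MP)": "",    # server strategy M -> P
    "PMC": "PM", "PC(M)": "P", "C(PM)": "",    # server strategy P -> M
}

# Client reissue strategies: should the request be sent again after the crash?
REISSUE = {
    "Always":              lambda acked: True,
    "Never":               lambda acked: False,
    "Only when ACKed":     lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}


def outcome(strategy, scenario):
    done = SCENARIOS[scenario]
    acked = "M" in done              # the client saw the completion message
    prints = done.count("P")         # 0 or 1 prints before the crash
    if REISSUE[strategy](acked):
        prints += 1                  # the reissued request is printed once
    return ("ZERO", "OK", "DUP")[prints]


for strategy in REISSUE:
    print(f"{strategy:<20}", *(f"{outcome(strategy, s):<5}" for s in SCENARIOS))
```

No row contains only OK, which matches the observation that no combination of client and server strategy works for every crash sequence.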
Group Communication 
Basic Reliable-Multicasting Schemes 
Important for messaging in process groups 
A simple solution to reliable multicasting when all receivers are known and are 
assumed not to fail 
a) Message transmission b) Reporting feedback 
Efficient only for a small number of receivers (possible refinements: sending only NACKs, timers, etc.)
Nonhierarchical Feedback Control 
To scale, we need to reduce the number of messages, 
with feedback suppression 
Several receivers have scheduled a request for retransmission, but the 
first retransmission request leads to the suppression of others (Scalable 
Reliable Multicasting protocol). 
It leads to timing problems, useless retransmissions or a complicated 
organization of the group membership
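The timing issue can be made concrete with a small deterministic sketch. In a real protocol each receiver picks a random delay before multicasting its retransmission request; here the delays are given as input so the result is reproducible. The function name and the single uniform propagation delay are simplifying assumptions.

```python
def nack_senders(timers_ms, propagation_delay_ms):
    """Return the receivers that actually multicast a retransmission request.

    timers_ms maps a receiver name to the delay after which it would send.
    A receiver suppresses its request if it hears another receiver's request
    before its own timer fires.
    """
    first = min(timers_ms.values())
    heard_at = first + propagation_delay_ms   # when the first request reaches the others
    return sorted(
        name for name, t in timers_ms.items()
        if t == first or t < heard_at
    )


timers = {"A": 300, "B": 100, "C": 120, "D": 500}
print(nack_senders(timers, propagation_delay_ms=50))   # C fires before hearing B
print(nack_senders(timers, propagation_delay_ms=10))   # only B sends
```

When timers are close together relative to the propagation delay, several receivers send the same request, which is the kind of redundant feedback the scheme tries to avoid.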
Hierarchical Feedback Control 
The essence of hierarchical reliable multicasting. The receivers are partitioned into 
subgroups, which are organized as a tree 
• Each local coordinator forwards the message to its children. 
• A local coordinator handles retransmission requests. 
Acknowledgements are exchanged between coordinators
Atomic Multicast 
In the presence of process failures, we need the guarantee that a message is delivered 
either to all receivers or to none of them. This leads to the atomic multicast problem 
Atomic multicasting ensures that group members maintain consistency 
The logical organization of a distributed system to distinguish between message receipt 
and message delivery 
In atomic multicasting a multicast message is uniquely associated to a list of receiving 
processes ( Group view ) 
A view change takes place when a process joins or leaves the group
Virtual Synchrony 
We need an ordered reliable multicasting. 
Virtual Synchrony guarantees that a message sent to a group view is delivered to each 
non-faulty member of the group. 
If the sender crashes, the message may be either delivered to all the other processes or 
ignored by each of them. 
The principle of virtual synchronous multicast (view change similar to 
synchronization variable)
Message Ordering 
Four different types of multicast ordering: 
• Reliable, unordered multicast 
no guarantee is given on the order in which messages are delivered 
• FIFO-ordered multicast 
messages from the same process are delivered in the order in which they were sent 
• Causally ordered multicast 
causality between messages is preserved 
• Totally ordered multicast 
messages are delivered in the same order to all members of the group
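FIFO-ordered delivery is commonly built on the distinction between receiving and delivering a message: a message received out of order is held back until its predecessors from the same sender have been delivered. The sketch below assumes each sender numbers its messages from 0; the sequence numbers and the class name are assumptions, not something the slides specify.

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number to deliver
        self.held = {}        # (sender, seq) -> message received but not yet delivered
        self.delivered = []   # messages handed to the application, in delivery order

    def receive(self, sender, seq, msg):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return                                  # duplicate, already delivered
        self.held[(sender, seq)] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected


r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early, held back
r.receive("P4", 0, "m3")   # first message from P4, delivered at once
r.receive("P1", 0, "m1")   # releases m1 and then m2
print(r.delivered)
```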
Message Ordering 
Unordered multicast: 
Process P1    Process P2     Process P3 
sends m1      receives m1    receives m2 
sends m2      receives m2    receives m1 
Three communicating processes in the same group. The ordering of events per process is shown 
along the vertical axis. 
Process P1    Process P2     Process P3     Process P4 
sends m1      receives m1    receives m3    sends m3 
sends m2      receives m3    receives m1    sends m4 
              receives m2    receives m2 
              receives m4    receives m4 
Four processes in the same group with two different senders, and a possible delivery order of 
messages under FIFO-ordered multicasting
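The four-process example can be checked mechanically. The sketch below encodes the send and delivery orders from the example and tests two properties: per-sender FIFO order and identical delivery order at all receivers. The function names are illustrative.

```python
def respects_fifo(sent, delivered):
    """True if every receiver delivers each sender's messages in sending order."""
    for order in delivered.values():
        for msgs in sent.values():
            seen = [m for m in order if m in msgs]        # this sender's messages, as delivered
            expected = [m for m in msgs if m in seen]     # the same messages, as sent
            if seen != expected:
                return False
    return True


def same_total_order(delivered):
    """True if all receivers deliver the messages in exactly the same order."""
    orders = list(delivered.values())
    return all(order == orders[0] for order in orders)


sent = {"P1": ["m1", "m2"], "P4": ["m3", "m4"]}
delivered = {
    "P2": ["m1", "m3", "m2", "m4"],
    "P3": ["m3", "m1", "m2", "m4"],
}
print(respects_fifo(sent, delivered), same_total_order(delivered))
```

The example satisfies FIFO ordering but not total ordering, since P2 and P3 deliver m1 and m3 in different orders.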
Message Ordering 
Virtually synchronous reliable multicasting offering totally ordered delivery 
is called atomic multicasting 
Multicast                  Basic Message Ordering     Total-ordered Delivery? 
Reliable multicast         None                       No 
FIFO multicast             FIFO-ordered delivery      No 
Causal multicast           Causal-ordered delivery    No 
Atomic multicast           None                       Yes 
FIFO atomic multicast      FIFO-ordered delivery      Yes 
Causal atomic multicast    Causal-ordered delivery    Yes 
Six different versions of virtually synchronous reliable multicasting.
Distributed Commit 
Distributed commit means that an operation has to be performed by 
each member of a group or none at all 
One-phase distributed commit is performed using a coordinator (if a participant 
cannot perform the operation, it has no means to tell the coordinator) 
a) The finite state machine for the coordinator in two-phase commit. 
b) The finite state machine for a participant. 
The first phase is the voting phase, the second is the decision phase 
Timeout mechanisms are necessary because the coordinator can crash
Two Phase Commit 
• The coordinator sends a vote_request to all participants 
• A participant returns a vote_commit (it is ready to commit its 
part of the transaction) or a vote_abort 
• The coordinator collects the votes and sends a global_commit, or a 
global_abort if at least one participant has sent a vote_abort 
• A participant that receives a global_commit locally commits the 
transaction; a participant that receives a global_abort locally aborts the 
transaction 
1 – voting phase 
2 – decision phase 
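The coordinator's decision rule in the two phases can be sketched as below. This is a simplified, single-process illustration: participants are modeled as functions, and a return value of None stands for a vote that did not arrive before the timeout, which the sketch treats as an abort. Those modeling choices and the function name are assumptions.

```python
def two_phase_commit(participants):
    """Run one round of two-phase commit.

    participants maps a name to a function returning True (vote_commit),
    False (vote_abort), or None (no vote before the timeout).
    """
    # Phase 1 (voting): send vote_request to everyone and collect the votes.
    votes = {name: ask() for name, ask in participants.items()}

    # Phase 2 (decision): commit only if every participant voted to commit.
    if all(vote is True for vote in votes.values()):
        decision = "global_commit"
    else:
        decision = "global_abort"
    return decision, votes


decision, votes = two_phase_commit({
    "db1": lambda: True,
    "db2": lambda: True,
    "db3": lambda: False,    # one vote_abort is enough to abort everywhere
})
print(decision)
```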
Three-Phase Commit 
It avoids blocking processes in case of coordinator crash 
• There is no single state from which it is possible to make a transition directly to 
either COMMIT or ABORT 
• There is no state in which it is not possible to make a final decision and 
from which a transition to COMMIT can be made
Recovery 
• Backward recovery brings the system back to a previous correct 
state. It is necessary to record the state (checkpointing) 
• Forward recovery attempts to bring the system into a correct new 
state from which it can continue to execute.
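Backward recovery with checkpointing can be sketched in memory as follows. A real system would write checkpoints to stable storage so that they survive a crash; the in-memory copy and the class name here are simplifying assumptions.

```python
import copy


class Checkpointed:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)   # initial state is the first checkpoint

    def checkpoint(self):
        """Record the current state as the latest known-correct state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: return to the most recent checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


account = Checkpointed({"balance": 100})
account.state["balance"] = 150
account.checkpoint()
account.state["balance"] = -999   # an error corrupts the state
account.rollback()
print(account.state)
```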

More Related Content

What's hot

Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
Kavya Barnadhya Hazarika
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared Memory
SHIKHA GAUTAM
 
Chapter 6 synchronization
Chapter 6 synchronizationChapter 6 synchronization
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systems
mridul mishra
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed system
Sunita Sahu
 
Peer to Peer services and File systems
Peer to Peer services and File systemsPeer to Peer services and File systems
Peer to Peer services and File systems
MNM Jain Engineering College
 
clock synchronization in Distributed System
clock synchronization in Distributed System clock synchronization in Distributed System
clock synchronization in Distributed System
Harshita Ved
 
Chapter 14 replication
Chapter 14 replicationChapter 14 replication
Chapter 14 replicationAbDul ThaYyal
 
remote procedure calls
  remote procedure calls  remote procedure calls
remote procedure callsAshish Kumar
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
Dr Sandeep Kumar Poonia
 
Foult Tolerence In Distributed System
Foult Tolerence In Distributed SystemFoult Tolerence In Distributed System
Foult Tolerence In Distributed SystemRajan Kumar
 
Distributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock DetectionDistributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock Detection
SHIKHA GAUTAM
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactionsNilu Desai
 
6.Distributed Operating Systems
6.Distributed Operating Systems6.Distributed Operating Systems
6.Distributed Operating Systems
Dr Sandeep Kumar Poonia
 
Distributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithmDistributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithm
pinki soni
 
Aggrement protocols
Aggrement protocolsAggrement protocols
Aggrement protocolsMayank Jain
 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
MNM Jain Engineering College
 
resource management
  resource management  resource management
resource managementAshish Kumar
 

What's hot (20)

Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 
Agreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared MemoryAgreement Protocols, distributed File Systems, Distributed Shared Memory
Agreement Protocols, distributed File Systems, Distributed Shared Memory
 
Chapter 6 synchronization
Chapter 6 synchronizationChapter 6 synchronization
Chapter 6 synchronization
 
Optimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed SystemsOptimistic concurrency control in Distributed Systems
Optimistic concurrency control in Distributed Systems
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed system
 
Peer to Peer services and File systems
Peer to Peer services and File systemsPeer to Peer services and File systems
Peer to Peer services and File systems
 
Synch
SynchSynch
Synch
 
clock synchronization in Distributed System
clock synchronization in Distributed System clock synchronization in Distributed System
clock synchronization in Distributed System
 
Chapter 14 replication
Chapter 14 replicationChapter 14 replication
Chapter 14 replication
 
remote procedure calls
  remote procedure calls  remote procedure calls
remote procedure calls
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
 
Foult Tolerence In Distributed System
Foult Tolerence In Distributed SystemFoult Tolerence In Distributed System
Foult Tolerence In Distributed System
 
Distributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock DetectionDistributed Mutual Exclusion and Distributed Deadlock Detection
Distributed Mutual Exclusion and Distributed Deadlock Detection
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactions
 
6.Distributed Operating Systems
6.Distributed Operating Systems6.Distributed Operating Systems
6.Distributed Operating Systems
 
Distributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithmDistributed system lamport's and vector algorithm
Distributed system lamport's and vector algorithm
 
Aggrement protocols
Aggrement protocolsAggrement protocols
Aggrement protocols
 
Distributed deadlock
Distributed deadlockDistributed deadlock
Distributed deadlock
 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
 
resource management
  resource management  resource management
resource management
 

Viewers also liked

Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency control
balamurugan.k Kalibalamurugan
 
02. Fault Tolerance Pattern 위한 mindset
02. Fault Tolerance Pattern 위한 mindset02. Fault Tolerance Pattern 위한 mindset
02. Fault Tolerance Pattern 위한 mindseteva
 
Distributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere MortalsDistributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere Mortals
Ensar Basri Kahveci
 
ElSayed Fouda-Oracle Fusion SCM Consultant
ElSayed Fouda-Oracle Fusion SCM ConsultantElSayed Fouda-Oracle Fusion SCM Consultant
ElSayed Fouda-Oracle Fusion SCM ConsultantELSayed Fouda
 
Uomini di cultura nel mondo arabo
Uomini di cultura nel mondo araboUomini di cultura nel mondo arabo
Uomini di cultura nel mondo arabo
Giulio Negri
 
まちクエストサミット#2
まちクエストサミット#2まちクエストサミット#2
まちクエストサミット#2
Machiquest, Inc.
 
Pp Tik IX bab IV
Pp Tik IX bab IVPp Tik IX bab IV
Pp Tik IX bab IV
Arifkurkur
 
види мистецтва
види мистецтвавиди мистецтва
види мистецтваolenasar
 
Global warming effects
Global warming effectsGlobal warming effects
Global warming effects
Marcelino Santos
 
Poezie havo 5.2014
Poezie havo 5.2014Poezie havo 5.2014
Poezie havo 5.2014
henkthaar
 
Conventions of a ad media a2
Conventions of a ad   media a2Conventions of a ad   media a2
Conventions of a ad media a2
shaheenatarafdar241
 
Писанка
ПисанкаПисанка
Писанкаolenasar
 
strategiesofamulbranding -phpapp02
strategiesofamulbranding -phpapp02strategiesofamulbranding -phpapp02
strategiesofamulbranding -phpapp02
Manish Thakur
 
презен. досвіду
презен. досвідупрезен. досвіду
презен. досвідуolenasar
 

Viewers also liked (16)

Distributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency controlDistributed datababase Transaction and concurrency control
Distributed datababase Transaction and concurrency control
 
02. Fault Tolerance Pattern 위한 mindset
02. Fault Tolerance Pattern 위한 mindset02. Fault Tolerance Pattern 위한 mindset
02. Fault Tolerance Pattern 위한 mindset
 
Distributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere MortalsDistributed Systems Theory for Mere Mortals
Distributed Systems Theory for Mere Mortals
 
ElSayed Fouda-Oracle Fusion SCM Consultant
ElSayed Fouda-Oracle Fusion SCM ConsultantElSayed Fouda-Oracle Fusion SCM Consultant
ElSayed Fouda-Oracle Fusion SCM Consultant
 
Uomini di cultura nel mondo arabo
Uomini di cultura nel mondo araboUomini di cultura nel mondo arabo
Uomini di cultura nel mondo arabo
 
まちクエストサミット#2
まちクエストサミット#2まちクエストサミット#2
まちクエストサミット#2
 
Pp Tik IX bab IV
Pp Tik IX bab IVPp Tik IX bab IV
Pp Tik IX bab IV
 
види мистецтва
види мистецтвавиди мистецтва
види мистецтва
 
Global warming effects
Global warming effectsGlobal warming effects
Global warming effects
 
Poezie havo 5.2014
Poezie havo 5.2014Poezie havo 5.2014
Poezie havo 5.2014
 
Young and Beautiful
Young and BeautifulYoung and Beautiful
Young and Beautiful
 
page 1
page 1page 1
page 1
 
Conventions of a ad media a2
Conventions of a ad   media a2Conventions of a ad   media a2
Conventions of a ad media a2
 
Писанка
ПисанкаПисанка
Писанка
 
strategiesofamulbranding -phpapp02
strategiesofamulbranding -phpapp02strategiesofamulbranding -phpapp02
strategiesofamulbranding -phpapp02
 
презен. досвіду
презен. досвідупрезен. досвіду
презен. досвіду
 

Similar to 9 fault-tolerance

Unit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptxUnit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptx
rameshwarchintamani
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
Pratik Tambekar
 
fault-tolerance-slide.ppt
fault-tolerance-slide.pptfault-tolerance-slide.ppt
fault-tolerance-slide.ppt
Shailendra61
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systemsguest61205606
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
guest0f5a7d
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
guest61205606
 
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Message Passing, Remote Procedure Calls and  Distributed Shared Memory as Com...Message Passing, Remote Procedure Calls and  Distributed Shared Memory as Com...
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Sehrish Asif
 
Fault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihunFault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihun
Tsegabrehan Am
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
Kathirvel Ayyaswamy
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
Alagappa Govt Arts College, Karaikudi
 
Chapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.pptChapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.ppt
Habib246314
 
Fault Tolerant and Distributed System
Fault Tolerant and Distributed SystemFault Tolerant and Distributed System
Fault Tolerant and Distributed System
sreenivas1591
 
DATA LINK LAYER.pdf
DATA LINK LAYER.pdfDATA LINK LAYER.pdf
DATA LINK LAYER.pdf
electricalengineerin42
 
Client Server Model and Distributed Computing
Client Server Model and Distributed ComputingClient Server Model and Distributed Computing
Client Server Model and Distributed Computing
Abhishek Jaisingh
 
Synchronization in Distributed Systems.pptx
Synchronization in Distributed Systems.pptxSynchronization in Distributed Systems.pptx
Synchronization in Distributed Systems.pptx
RichardMathengeSPASP
 
Reactive Messaging Patterns.
Reactive Messaging Patterns.Reactive Messaging Patterns.
Reactive Messaging Patterns.
Knoldus Inc.
 
DDB_lec_05_Concurrency_Control.pdf
DDB_lec_05_Concurrency_Control.pdfDDB_lec_05_Concurrency_Control.pdf
DDB_lec_05_Concurrency_Control.pdf
AhmedImmamImmam
 
Osi model
Osi model Osi model
Osi model maha tce
 

Similar to 9 fault-tolerance (20)

Unit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptxUnit_4_Fault_Tolerance.pptx
Unit_4_Fault_Tolerance.pptx
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
 
fault-tolerance-slide.ppt
fault-tolerance-slide.pptfault-tolerance-slide.ppt
fault-tolerance-slide.ppt
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed SystemsCommunication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
 
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
Message Passing, Remote Procedure Calls and  Distributed Shared Memory as Com...Message Passing, Remote Procedure Calls and  Distributed Shared Memory as Com...
Message Passing, Remote Procedure Calls and Distributed Shared Memory as Com...
 
Fault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihunFault tolerance review by tsegabrehan zerihun
Fault tolerance review by tsegabrehan zerihun
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
Fault tolerance-omer-rana
Fault tolerance-omer-ranaFault tolerance-omer-rana
Fault tolerance-omer-rana
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Chapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.pptChapter 7-Fault Tolerance.ppt
Chapter 7-Fault Tolerance.ppt
 
Fault Tolerant and Distributed System
Fault Tolerant and Distributed SystemFault Tolerant and Distributed System
Fault Tolerant and Distributed System
 
DATA LINK LAYER.pdf
DATA LINK LAYER.pdfDATA LINK LAYER.pdf
DATA LINK LAYER.pdf
 
Client Server Model and Distributed Computing
Client Server Model and Distributed ComputingClient Server Model and Distributed Computing
Client Server Model and Distributed Computing
 
Synchronization in Distributed Systems.pptx
Synchronization in Distributed Systems.pptxSynchronization in Distributed Systems.pptx
Synchronization in Distributed Systems.pptx
 
Reactive Messaging Patterns.
Reactive Messaging Patterns.Reactive Messaging Patterns.
Reactive Messaging Patterns.
 
DDB_lec_05_Concurrency_Control.pdf
DDB_lec_05_Concurrency_Control.pdfDDB_lec_05_Concurrency_Control.pdf
DDB_lec_05_Concurrency_Control.pdf
 
Osi model
Osi model Osi model
Osi model
 

Recently uploaded

ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
ongomchris
 
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
seandesed
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 

Recently uploaded (20)

ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 

9 fault-tolerance

  • 2. Basic Concepts
      • Availability: the system is ready to work immediately
      • Reliability: the system can run continuously
      • Safety: when the system fails, nothing catastrophic happens
      • Maintainability: a failed system can be easily repaired
    Fault types: transient, intermittent, permanent
  • 3. Failure Models
    Different types of failures:
      • Crash failure: a server halts, but is working correctly until it halts
      • Omission failure: a server fails to respond to requests
          • Receive omission: a server fails to receive incoming messages
          • Send omission: a server fails to send messages
      • Timing failure: a server's response lies outside the specified time interval
      • Response failure: the server's response is incorrect
          • Value failure: the value of the response is wrong
          • State transition failure: the server deviates from the correct flow of control
      • Arbitrary failure: a server may produce arbitrary responses at arbitrary times
  • 4. Failure Masking by Redundancy
      • Information redundancy (extra bits)
      • Time redundancy (extra operations)
      • Physical redundancy (extra equipment or processes)
  • 5. Failure Masking by Redundancy
    Triple modular redundancy (TMR): an electronic circuit example.
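    The voting idea behind TMR can be sketched in software. This is only an illustration, assuming three replicas that each produce a hashable value; the function name `tmr_vote` is hypothetical and does not come from the slides.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas.

    Raises ValueError when all three replicas disagree, because a single
    voter cannot mask more than one faulty input.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


# One faulty replica (the third) is masked by the other two.
result = tmr_vote(1, 1, 0)
```

    With one faulty input the voter still returns the correct value; with two faulty inputs that happen to agree, it returns the wrong one, which is the limit of this form of physical redundancy.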
  • 6. Process failures
    To tolerate a faulty process, identical processes are organized into a group. When one process of the group fails, some other process in the group takes over its work.
    Process groups may be dynamic, so mechanisms are needed for managing group membership:
      • A group server maintains information on membership (centralized)
      • Distributed management (less simple and more time consuming)
  • 7. Flat Groups versus Hierarchical Groups
      a) Communication in a flat group (voting mechanism, slow decisions): replicated-write protocols
      b) Communication in a simple hierarchical group (single point of failure): primary-based protocols
  • 8. Client-server communication failures
    Using a reliable transport protocol (TCP) masks omission failures, but many failures are not masked.
    Classes of failure:
      • The client is unable to locate the server: raising an exception is a solution, but we lose transparency
      • The request message from the client to the server is lost: retransmission
      • The server crashes after receiving a request
      • The reply message from the server to the client is lost: retransmission, but…
      • The client crashes after sending a request: an orphan is generated (extermination, reincarnation with epoch number, gentle reincarnation, expiration…)
  • 9. Server Crashes (1)
    A server in client-server communication: a) normal case; b) crash after execution; c) crash before execution.
      • At-least-once semantics: after the server reboots, keep trying until a reply is received
      • At-most-once semantics: report the failure immediately
      • Exactly-once semantics: there is no way to guarantee it in general
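    The first two semantics can be contrasted with a toy client. This is a minimal sketch, not a real RPC layer: `ServerDown`, `FlakyServer`, and the `max_tries` bound are assumptions made for the example (a true at-least-once client would keep retrying without a bound).

```python
class ServerDown(Exception):
    """Raised by the hypothetical transport when no reply arrives."""


def at_least_once(send, request, max_tries=5):
    """Keep reissuing the request until a reply is received."""
    for _ in range(max_tries):
        try:
            return send(request)
        except ServerDown:
            continue  # the server may be rebooting: try again
    raise ServerDown(f"no reply after {max_tries} tries")


def at_most_once(send, request):
    """Send once; a failure is reported to the caller immediately."""
    return send(request)  # ServerDown propagates, there is no retry


class FlakyServer:
    """Toy server that crashes right after executing its first `crashes` requests."""

    def __init__(self, crashes):
        self.crashes = crashes
        self.executions = 0

    def handle(self, request):
        self.executions += 1  # the operation is carried out
        if self.crashes > 0:
            self.crashes -= 1
            raise ServerDown("crashed after execution, no reply sent")
        return f"done: {request}"


server = FlakyServer(crashes=1)
reply = at_least_once(server.handle, "print")
```

    In this run the client eventually gets a reply, but the server has executed the operation twice. The client cannot tell a crash after execution from a crash before execution, which is why exactly-once semantics cannot be built from retries alone.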
  • 10. Server Crashes (2)
    Example: a client sends a message to a server asking it to print the message (P); the server sends a completion message back (M). The server can crash (C).

                                Server strategy M -> P       Server strategy P -> M
    Client reissue strategy     MPC     MC(P)   C(MP)        PMC     PC(M)   C(PM)
    Always                      DUP     OK      OK           DUP     DUP     OK
    Never                       OK      ZERO    ZERO         OK      OK      ZERO
    Only when ACKed             DUP     OK      ZERO         DUP     OK      ZERO
    Only when not ACKed         OK      ZERO    OK           OK      DUP     OK

    Different combinations of client and server strategies in the presence of server crashes. Events in parentheses never happen because the server has already crashed. OK = the text is printed once, DUP = the text is printed twice, ZERO = the text is never printed.
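    The table can be reproduced mechanically. The sketch below assumes that a reissued request is executed exactly once by the rebooted server (no second crash); the names `outcome`, `REISSUE`, and `SEQUENCES` are introduced here for illustration only.

```python
# Event sequences from the table; events in parentheses never happen.
SEQUENCES = ["MPC", "MC(P)", "C(MP)", "PMC", "PC(M)", "C(PM)"]

# Each client strategy decides whether to reissue, given whether it saw the ACK (M).
REISSUE = {
    "Always": lambda acked: True,
    "Never": lambda acked: False,
    "Only when ACKed": lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}

LABELS = {0: "ZERO", 1: "OK", 2: "DUP"}


def outcome(sequence, strategy):
    """Return ZERO, OK, or DUP: how many times the text ends up printed."""
    before_crash = sequence.split("C")[0]  # events that really happened
    printed = "P" in before_crash
    acked = "M" in before_crash
    reissued = REISSUE[strategy](acked)  # assumed to print exactly once more
    return LABELS[int(printed) + int(reissued)]


table = {s: [outcome(seq, s) for seq in SEQUENCES] for s in REISSUE}
```

    No row of `table` is OK in all six columns, which restates the point of the slide: no combination of client and server strategy works correctly for every crash sequence.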
  • 11. Group Communication: Basic Reliable-Multicasting Schemes
    Reliable multicasting is important for messaging in process groups.
    A simple solution to reliable multicasting when all receivers are known and are assumed not to fail: a) message transmission; b) reporting feedback.
    Efficient only for a small number of receivers (only NACKs, timers, etc.)
  • 12. Nonhierarchical Feedback Control
    To scale, we need to reduce the number of feedback messages by using feedback suppression.
    Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others (Scalable Reliable Multicasting protocol).
    This leads to timing problems, useless retransmissions, or a complicated organization of the group membership.
  • 13. Hierarchical Feedback Control
    The essence of hierarchical reliable multicasting: the receivers are partitioned into subgroups, which are organized as a tree.
      • Each local coordinator forwards the message to its children
      • A local coordinator handles retransmission requests
    Acknowledgements are exchanged between coordinators.
  • 14. Atomic Multicast
    In the presence of process failures, we need the guarantee that a message is delivered to either all or none of the receivers. This leads to the atomic multicast problem.
    Atomic multicasting ensures that group members maintain consistency.
    The logical organization of a distributed system distinguishes between message receipt and message delivery.
    In atomic multicasting, a multicast message is uniquely associated with a list of receiving processes (the group view).
    A view change takes place when a process joins or leaves the group.
  • 15. Virtual Synchrony
    We need ordered reliable multicasting.
    Virtual synchrony guarantees that a message sent to a group view is delivered to each non-faulty member of the group. If the sender crashes, the message may either be delivered to all the other processes or be ignored by each of them.
    The principle of virtually synchronous multicast (a view change is similar to a synchronization variable).
  • 16. Message Ordering
    Four different types of multicast ordering:
      • Reliable, unordered multicast: no guarantee is given on the order in which messages are delivered
      • FIFO-ordered multicast: messages from the same process are delivered in the order in which they were sent
      • Causally ordered multicast: causality between messages is preserved
      • Totally ordered multicast: messages are delivered in the same order to all members of the group
  • 17. Message Ordering

    Process P1    Process P2     Process P3
    sends m1      receives m1    receives m2
    sends m2      receives m2    receives m1

    Unordered multicast: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

    Process P1    Process P2     Process P3     Process P4
    sends m1      receives m1    receives m3    sends m3
    sends m2      receives m3    receives m1    sends m4
                  receives m2    receives m2
                  receives m4    receives m4

    Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
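    The FIFO condition in the second example can be checked with a few lines of code. This is a sketch under simple assumptions: each message has one sender, and the helper names `is_fifo_ordered` and `same_total_order` are hypothetical.

```python
def is_fifo_ordered(sent, delivered):
    """Check that one receiver's delivery order respects each sender's send order.

    sent: dict mapping a sender to its messages in send order.
    delivered: list of messages in the order one receiver delivered them.
    """
    position = {m: i for i, m in enumerate(delivered)}
    for messages in sent.values():
        seen = [position[m] for m in messages if m in position]
        if seen != sorted(seen):
            return False  # a later message overtook an earlier one
    return True


def same_total_order(deliveries):
    """Check that every receiver delivered the messages in the same order."""
    return all(d == deliveries[0] for d in deliveries)


# The four-process example: P1 sends m1, m2; P4 sends m3, m4.
sent = {"P1": ["m1", "m2"], "P4": ["m3", "m4"]}
p2 = ["m1", "m3", "m2", "m4"]
p3 = ["m3", "m1", "m2", "m4"]

fifo_ok = is_fifo_ordered(sent, p2) and is_fifo_ordered(sent, p3)
total_ok = same_total_order([p2, p3])
```

    Both receivers satisfy FIFO ordering, yet they disagree on the relative order of m1 and m3, so the delivery is not totally ordered. FIFO ordering constrains only messages from the same sender.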
  • 18. Message Ordering
    Virtually synchronous reliable multicasting offering totally ordered delivery is called atomic multicasting.

    Multicast                  Basic message ordering     Total-ordered delivery?
    Reliable multicast         None                       No
    FIFO multicast             FIFO-ordered delivery      No
    Causal multicast           Causal-ordered delivery    No
    Atomic multicast           None                       Yes
    FIFO atomic multicast      FIFO-ordered delivery      Yes
    Causal atomic multicast    Causal-ordered delivery    Yes

    Six different versions of virtually synchronous reliable multicasting.
  • 19. Distributed Commit
    Distributed commit means that an operation has to be performed by every member of a group or by none at all.
    One-phase distributed commit is performed using a coordinator (if a participant cannot perform the operation, it has no means to tell the coordinator).
    a) The finite state machine for the coordinator in two-phase commit. b) The finite state machine for a participant.
    The first phase is the voting phase, the second is the decision phase.
    Timeout mechanisms are necessary, because the coordinator can crash.
  • 20. Two Phase Commit
    Phase 1 (voting phase):
      • The coordinator sends a vote_request to all participants
      • A participant returns a vote_commit (it is ready to commit its part of the transaction) or a vote_abort
    Phase 2 (decision phase):
      • The coordinator collects the votes and sends a global_commit, or a global_abort if one of the participants has sent a vote_abort
      • A participant that receives a global_commit locally commits the transaction; one that receives a global_abort locally aborts it
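    The two phases can be sketched as a single coordinator function. This is a failure-free illustration only: it has no timeouts, no logging, and no crash handling, and the `Participant` class with its `can_commit` flag is a hypothetical stand-in for a real resource manager.

```python
class Participant:
    """Toy participant; `can_commit` is a hypothetical flag for this sketch."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        """Phase 1: answer the coordinator's vote_request."""
        if self.can_commit:
            self.state = "READY"
            return "vote_commit"
        self.state = "ABORT"
        return "vote_abort"

    def on_decision(self, decision):
        """Phase 2: apply the coordinator's global decision locally."""
        self.state = "COMMIT" if decision == "global_commit" else "ABORT"


def two_phase_commit(participants):
    """Run both phases and return the coordinator's global decision."""
    # Phase 1 (voting): collect one vote from every participant.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2 (decision): commit only if every vote is vote_commit.
    if all(v == "vote_commit" for v in votes):
        decision = "global_commit"
    else:
        decision = "global_abort"
    for p in participants:
        p.on_decision(decision)
    return decision


group = [Participant("a"), Participant("b"), Participant("c", can_commit=False)]
decision = two_phase_commit(group)
```

    A single vote_abort forces every participant to abort, including those that voted to commit. A participant in the READY state depends on the coordinator for the outcome, which is where blocking arises if the coordinator crashes.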
  • 21. Three-Phase Commit
    It avoids blocking processes when the coordinator crashes. The states of the coordinator and of each participant satisfy two conditions:
      • There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state
      • There is no state in which it is not possible to make a final decision and from which a transition to a COMMIT state can be made
  • 22. Recovery
      • Backward recovery brings the system back to a previous correct state. This requires recording the state from time to time (checkpointing)
      • Forward recovery attempts to bring the system into a correct new state from which it can continue to execute
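    Backward recovery with checkpointing can be illustrated on a single in-memory object. This is a minimal sketch: the class name `Checkpointed` is hypothetical, and a real system would write checkpoints to stable storage and coordinate them across processes.

```python
import copy


class Checkpointed:
    """Holds a state plus the last recorded checkpoint of that state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Record the current state as the latest known-correct state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: return to the last recorded checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


account = Checkpointed({"balance": 100})
account.state["balance"] -= 30
account.checkpoint()               # record a correct state
account.state["balance"] = -999    # an erroneous state appears
account.rollback()                 # recover the checkpointed state
```

    After the rollback the balance is 70 again. The deep copies keep the checkpoint independent of later changes to the live state.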