This document discusses fault tolerance in distributed systems. It covers basic concepts like types of faults and failure modes. It then discusses techniques for achieving fault tolerance like replication, failure detection, and agreement algorithms. It discusses reliable client-server communication and group communication. It also covers distributed commit protocols and recovery techniques like checkpointing and message logging. The overall document provides an overview of key concepts and techniques for building fault tolerant distributed systems.
Fault tolerance review by tsegabrehan zerihun
Kotebe Metropolitan University, College of Natural and Computational Science
Name: Tsegabrehan Zerihun
ID Number: PGE/19587/13
Program: MSc in Computer Science (Postgraduate Evening)
Contents
Introduction
1. Basic Concepts
1.1 Faults are classified into three types
1.2 Failure Modes
1.3 Failure Masking by Redundancy
2. Process Resilience
2.1 Failure Masking and Replication
2.2 Agreement in Faulty Systems
2.3 Failure Detection
3. Reliable Client-Server Communication
3.1 Point-to-Point Communication
3.2 RPC Semantics in the Presence of Failures
4. Reliable Group Communication
4.1 Scalability in Reliable Multicasting
4.2 Atomic Multicast
4.3 Message Ordering
5. Distributed Commit
5.1 One-Phase Commit Protocol
5.2 Two-Phase Commit Protocol (2PC)
5.3 Three-Phase Commit Protocol (3PC)
6. Recovery
6.1 Introduction
6.2 Checkpointing
6.3 Message Logging
6.4 Recovery Oriented Computing
Fault Tolerance
Introduction
An important goal in distributed systems design is to construct the system in such a way that
it can automatically recover from partial failures without seriously affecting the overall
performance. Such a failure may affect some components while others continue to
function properly.
1. Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems. Dependability covers the following requirements:
➢ Availability refers to the probability that the system is operating correctly at any
given moment and is available to perform its function on behalf of its users.
➢ Reliability refers to the property that a system can run continuously without failure.
➢ Safety refers to the situation that when a system temporarily fails to operate correctly,
no catastrophic event happens.
➢ Maintainability refers to how easily a failed system can be repaired.
1.1 Faults are classified into three types
➢ A transient fault occurs once and then disappears; if the operation is repeated, the
fault goes away; e.g., a bird flying through the beam of a microwave transmitter may
cause some lost bits
➢ An intermittent fault occurs, then vanishes of its own accord, then reappears,
and so on; e.g., a loose connection. Such faults are difficult to diagnose: it is like taking a sick child to the nearest
clinic, only for the child to show no sign of sickness by the time you get there
➢ A permanent fault is one that continues to exist until the faulty component is
repaired; e.g., disk head crash, software bug
1.2 Failure Modes
➢ Crash failure: a server halts, but was working correctly until it stopped
➢ Omission failure: a server fails to respond to incoming requests
• Receive omission: a server fails to receive incoming messages; e.g., perhaps
no thread is listening
• Send omission: a server fails to send messages
➢ Timing failure: a server's response lies outside the specified time interval; e.g., it may
respond too fast, flooding the receiver, or too slowly
➢ Response failure: the server's response is incorrect
• Value failure: the value of the response is wrong; e.g., a search engine
returning wrong Web pages as a result of a search
• State transition failure: the server deviates from the correct flow of
control; e.g., taking default actions when it fails to understand the request
➢ Arbitrary failure (or Byzantine failure): a server may produce arbitrary responses
at arbitrary times; this is the most serious failure mode
1.3 Failure Masking by Redundancy
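➢ to be fault tolerant, the system tries to hide the occurrence of failures from other
processes; this is called masking
➢ the key technique for masking faults is redundancy
➢ three kinds are possible
a. Information redundancy: extra bits are added so that garbled bits can be detected or recovered
b. Time redundancy: an action is performed and, if need be, performed again
c. Physical redundancy: extra equipment or processes are added so that the loss of some components can be tolerated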
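The first two kinds of redundancy can be illustrated with a minimal Python sketch. The function names (add_parity, parity_ok, retry, make_flaky) are hypothetical and chosen only for this illustration. Note that a single parity bit only detects a flipped bit; actually masking it would need an error-correcting code.

```python
from typing import Callable, List


def add_parity(bits: List[int]) -> List[int]:
    """Information redundancy: append one even-parity bit to the data."""
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    """A word with a single flipped bit has an odd number of 1s."""
    return sum(word) % 2 == 0


def retry(operation: Callable[[], int], attempts: int) -> int:
    """Time redundancy: repeat an operation that hit a transient fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # the fault did not go away; give up
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], int]:
    """Hypothetical operation that fails `failures` times, then returns 42."""
    remaining = [failures]

    def operation() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise IOError("transient fault")
        return 42

    return operation
```

Retrying helps only with transient faults; a permanent fault exhausts the attempts and the error is passed on to the caller.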
2. Process Resilience
➢ how can fault tolerance be achieved in distributed systems?
➢ one method is protection against process failures by replicating processes into
groups
➢ Design Issues
✓ the key approach to tolerating a faulty process is to organize several identical
processes into a group
✓ all members of a group receive each message sent to the group, in the hope that if one process fails,
another one will take over
✓ process groups may be dynamic
▪ new groups can be created and old groups can be destroyed
▪ a process can join or leave a group
▪ a process can be a member of several groups at the same time
2.1 Failure Masking and Replication
➢ how can processes be replicated so that they form groups? there are two ways
• primary-based protocols: for fault tolerance, a primary-backup protocol is used;
processes are organized hierarchically and the primary coordinates all writes; if
the primary crashes, the backups hold an election to choose a new primary
• replicated-write protocols: in the form of active replication or by means of
quorum-based protocols; processes are organized as flat groups
➢ another important issue is how much replication is needed
➢ a system is said to be k fault tolerant if it can survive faults in k components and
still meet its specifications
• if the processes fail silently, then having k+1 replicas is enough; if k of them
fail, the remaining one can still function
• if processes exhibit Byzantine failures, at least 2k+1 replicas are required; in the worst case the k faulty processes
all generate the same (wrong) reply, but the remaining k+1 correct processes also produce the same (correct) answer;
a client that believes the majority therefore gets the correct result
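As a small illustration of these replica counts, the following Python sketch computes the group size for k fault tolerance and picks the majority reply. The names replicas_needed and majority_reply are made up for this example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def replicas_needed(k: int, byzantine: bool) -> int:
    """Group size for k fault tolerance: k+1 if processes fail silently,
    2k+1 if they may send wrong replies."""
    return 2 * k + 1 if byzantine else k + 1


def majority_reply(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the reply given by more than half of the replicas, if any."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if 2 * count > len(replies) else None
```

With k = 2 and Byzantine failures there are 5 replicas, so even if both faulty replicas return the same wrong value, the three correct ones still form the majority.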
2.2 Agreement in Faulty Systems
➢ agreement is required in many cases; some examples are
• electing a coordinator
• deciding whether or not to commit a transaction
• dividing tasks among workers
➢ the objective of all agreement algorithms is to have all the non-faulty
processes reach consensus on some issue in a finite number of steps
➢ but reaching an agreement in a faulty environment (where process and/or
communication failures exist) is difficult; there are two cases
➢ case 1: processes are perfect but communication is unreliable
✓ this is similar to the problem known as the two-army problem; the sender of a
message does not know whether its message has arrived
✓ even if the receiver acknowledges, the receiver does not know whether its
acknowledgement has been received
✓ hence, it is difficult to reach an agreement even when there are only two
perfect processes
➢ case 2: communication is perfect but processes are not
• this is similar to the problem known as the Byzantine agreement problem,
where out of n generals k are traitors (they send wrong responses); can the loyal
generals reach an agreement?
• a recursive algorithm by Lamport
✓ each general is assumed to know how many troops s/he has
✓ the goal is to exchange troop strengths (by telephone, hence reliable
communication)
✓ at the end of the algorithm, each general has a vector of length n
corresponding to all the armies
✓ if general i is loyal, then element i is his troop strength; otherwise, it is
undefined
❖ the algorithm works in four steps (with the generals replaced by processes)
1. each nonfaulty process i sends its value vi to every other process; faulty processes
may send anything
2. each process collects the values it received into a vector
3. every process passes its vector to every other process
4. each process examines the ith element of each of the vectors it received; if any value has a majority, that
value is put into the result vector; otherwise the element is marked UNKNOWN
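The four steps can be traced with a small simulation. This is only a sketch under stated assumptions: the function names (agree, majority) and the lie callback that models a faulty process are invented for the illustration, and it uses n = 4 processes with k = 1 traitor, which matches the known requirement of at least 3k+1 processes for agreement.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Set

UNKNOWN = None  # marker for an element on which there is no majority


def majority(candidates: List[int]) -> Optional[int]:
    """Return the value held by more than half of the candidates, if any."""
    if not candidates:
        return UNKNOWN
    value, count = Counter(candidates).most_common(1)[0]
    return value if 2 * count > len(candidates) else UNKNOWN


def agree(
    values: List[int],
    faulty: Set[int],
    lie: Callable[[int, int, int], int],
) -> Dict[int, List[Optional[int]]]:
    """Result vector of every non-faulty process.

    lie(sender, receiver, slot) is whatever a faulty sender tells a receiver
    about a slot (a hypothetical stand-in for arbitrary behaviour).
    """
    n = len(values)
    # Steps 1 and 2: each process sends its value; receivers build a vector.
    vectors = {
        r: [values[s] if s == r or s not in faulty else lie(s, r, s)
            for s in range(n)]
        for r in range(n)
    }
    results: Dict[int, List[Optional[int]]] = {}
    for r in range(n):
        if r in faulty:
            continue
        # Step 3: r receives one vector from every other process.
        received = [
            [lie(s, r, slot) for slot in range(n)] if s in faulty else vectors[s]
            for s in range(n) if s != r
        ]
        # Step 4: majority vote on each element of the received vectors.
        results[r] = [majority([vec[slot] for vec in received])
                      for slot in range(n)]
    return results
```

For example, with values [1, 2, 3, 4], process 2 faulty, and a lie function that returns a different number to every receiver, each loyal process ends with the same vector: the loyal processes' values in their slots and UNKNOWN in slot 2.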
2.3 Failure Detection
• failure detection is one of the cornerstones of fault tolerance in a distributed system
• for a group of processes, nonfaulty members should be able to decide who is still a
member, and who is not
• two possibilities
1. pinging: processes actively send “are you alive?” messages to each other and use a timeout
mechanism to decide that a process has failed; but
✓ if the network is unreliable, a timeout does not necessarily mean failure (there
could be false positives)
2. passively waiting until messages come in from different processes
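A timeout-based detector can be sketched in a few lines of Python. The class name TimeoutFailureDetector and its methods are assumptions made for this example; time is passed in explicitly so the behaviour is easy to follow.

```python
from typing import Dict, List


class TimeoutFailureDetector:
    """Suspects a process once nothing has been heard from it for `timeout`."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heard_from(self, process: str, now: float) -> None:
        """Record a ping reply or any other message from `process`."""
        self.last_seen[process] = now

    def suspects(self, now: float) -> List[str]:
        """Processes whose last message is older than the timeout."""
        return sorted(
            p for p, seen in self.last_seen.items() if now - seen > self.timeout
        )
```

A suspected process is not necessarily dead: if a delayed message arrives later, heard_from removes it from the suspect list again, which is exactly the false-positive case mentioned above.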
3. Reliable Client-Server Communication
✓ fault tolerance in distributed systems concentrates on faulty processes, but
communication failures also have to be considered
✓ a communication channel may exhibit crash, omission,
timing, and arbitrary failures (e.g., duplicate messages as a result of buffering at nodes and the
sender retransmitting)
3.1 Point-to-Point Communication
✓ reliable transport protocols such as TCP mask most
communication failures, such as omissions (lost messages), by using
acknowledgements and retransmissions
3.2 RPC Semantics in the Presence of Failures
✓ the goal of RPC is to hide communication by making remote procedure calls look like
local ones; this is difficult to achieve when there are failures
✓ five different classes of failures can occur in RPC systems, each requiring a different
solution
1. the client is unable to locate the server
✓ perhaps the client cannot find a suitable server, or the server is down
2. the client’s request to the server is lost
✓ let the OS or the client stub start a timer when sending a request; if the timer expires
before a reply or acknowledgement comes back, the message is retransmitted
3. the server receives the request and crashes
✓ the failure can happen before or after the request is executed
4. the reply message from the server to the client is lost
✓ the client retransmits when its timer expires, so there is a risk that the request is sent more than once; perhaps the server was merely slow
✓ in addition, we can have a bit in the message header that is used to distinguish initial
requests from retransmissions so that retransmissions can be handled with more care
5. the client crashes after sending a request
✓ in this case a computation may be active with no parent waiting for the result; such
unwanted computations are called orphans
▪ orphans cause several problems; they
▪ waste CPU cycles
▪ can lock valuable resources such as files
▪ can cause confusion if the client reboots and does the RPC again and the
reply from the orphan comes back immediately afterward
▪ proposed solutions
✓ extermination: let the client stub write a log entry before sending an RPC message; after a
reboot, inspect the log and kill the orphan; some problems
▪ the expense of writing a log entry for every RPC
▪ it may not work if the orphans themselves do RPCs, creating grandorphans
that are difficult to locate
✓ reincarnation: divide time into sequentially numbered epochs; when a client reboots, it
broadcasts a message to all machines declaring the start of a new epoch; when such a
broadcast comes in, all remote computations on behalf of that client are killed
✓ gentle reincarnation: when an epoch broadcast comes in, each machine checks to see if
it has remote computations, and if so, tries to locate their owner; a computation is killed
only if the owner cannot be found
✓ expiration: each RPC is given a standard amount of time, T, to do the job; if it cannot
finish, it must explicitly ask for another quantum; if after a crash the client waits a time T
before rebooting, all orphans are sure to be gone; the problem is how to choose a reasonable value
of T, since RPCs have differing needs
✓ none of these solutions is satisfactory; for example, files or database records locked by an orphan may remain
locked after the orphan is killed
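For the lost-reply case (class 4), one common way to make retransmissions safe is to tag every request with a client identifier and a sequence number and let the server remember the replies it has already produced. The following Python sketch illustrates that idea; DedupServer and call_with_retransmit are hypothetical names, and the list of lost replies simply simulates the network.

```python
from typing import Callable, Dict, List, Tuple


class DedupServer:
    """Executes each (client, sequence number) request at most once."""

    def __init__(self, handler: Callable[[int], int]) -> None:
        self.handler = handler
        self.replies: Dict[Tuple[str, int], int] = {}
        self.executions = 0

    def handle(self, client_id: str, seq: int, arg: int) -> int:
        key = (client_id, seq)
        if key not in self.replies:
            # First time this request is seen: really execute it.
            self.executions += 1
            self.replies[key] = self.handler(arg)
        # A retransmission just gets the remembered reply back.
        return self.replies[key]


def call_with_retransmit(
    server: DedupServer, client_id: str, seq: int, arg: int, reply_lost: List[bool]
) -> int:
    """Send the request once per entry in reply_lost; True means the reply
    to that attempt is dropped and the client's timer expires."""
    for lost in reply_lost:
        reply = server.handle(client_id, seq, arg)
        if not lost:
            return reply
    raise TimeoutError("no reply after all retransmissions")
```

With a non-idempotent handler such as a bank deposit, the request can be retransmitted several times and the deposit is still applied only once. The sketch does not cover a server crash, after which the remembered replies would be gone.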
4. Reliable Group Communication
▪ how can messages be reliably delivered to a process group (multicasting)?
▪ Basic Reliable-Multicasting Schemes
✓ reliable multicasting means a message sent to a process group should be delivered to
each member of that group
✓ transport protocols typically do not offer reliable communication to a collection of processes
✓ problems:
• what happens if a process joins a group during communication?
• what happens if a (sending) process crashes during communication?
• what if there are faulty processes?
✓ a weaker solution, assuming that all receivers are known (and limited in number) and
that none will fail, is for the sending process to assign a sequence number to each
message and to buffer all messages so that lost ones can be retransmitted
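The sequence-number scheme can be sketched as follows. Sender and Receiver are illustrative class names, and the receiver reports missing sequence numbers as return values, an assumption made to keep the example free of networking code.

```python
from typing import Dict, List, Tuple


class Sender:
    """Numbers every message and keeps it in a history buffer."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.history: Dict[int, str] = {}

    def multicast(self, payload: str) -> Tuple[int, str]:
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq: int) -> Tuple[int, str]:
        return seq, self.history[seq]


class Receiver:
    """Delivers messages in sequence order and reports gaps."""

    def __init__(self) -> None:
        self.expected = 0
        self.pending: Dict[int, str] = {}
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> List[int]:
        """Return the sequence numbers that should be retransmitted."""
        if seq < self.expected:
            return []  # duplicate of a message already delivered
        self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]
```

When message 1 is lost and message 2 arrives, the receiver holds message 2 back and asks for 1; after the retransmission both are delivered in order.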
4.1 Scalability in Reliable Multicasting
➢ the basic reliable multicast scheme does not support a large number of receivers; if every receiver
returns feedback for every message, the sender is swamped, which is called feedback implosion
➢ two approaches for solving the problem
1. Nonhierarchical Feedback Control
• reduce the number of feedback messages returned to the sender using feedback
suppression
• the Scalable Reliable Multicasting (SRM) protocol uses this principle
2. Hierarchical Feedback Control
• hierarchical solutions scale better to very large groups of receivers
4.2 Atomic Multicast
✓ how to achieve reliable multicasting in the presence of process failures
✓ for example, in a replicated database, how to handle update operations when a replica
crashes during update operations
✓ the atomic multicast problem: to guarantee that a message is delivered to either all
processes or none at all and that messages are delivered in the same order to all processes
4.3 Message Ordering
✓ virtual synchrony allows an application developer to think about multicasts as taking
place in epochs that are separated by group membership changes
✓ there are four different orderings
1. Unordered multicasts
2. FIFO-ordered multicasts
3. Causally-ordered multicasts
4. Totally-ordered multicasts
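As an illustration of one of these orderings, the sketch below shows a common way to obtain causally-ordered delivery with vector timestamps. It is an assumption-laden example rather than part of the notes: the class name `CausalReceiver` is hypothetical, and each message is assumed to carry the sender's index and a vector timestamp in which the sender's own entry counts the messages it has multicast so far.

```python
class CausalReceiver:
    """Holds back messages until all causally preceding messages are delivered."""

    def __init__(self, num_processes):
        self.clock = [0] * num_processes  # messages delivered so far, per sender
        self.pending = []                 # (sender, timestamp, payload) held back
        self.delivered = []

    def _deliverable(self, sender, ts):
        # Next message from this sender, and nothing it depends on is still missing.
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, sender, ts, payload):
        self.pending.append((sender, tuple(ts), payload))
        progress = True
        while progress:  # one delivery may unblock other pending messages
            progress = False
            for msg in list(self.pending):
                s, t, p = msg
                if self._deliverable(s, t):
                    self.pending.remove(msg)
                    self.clock[s] = t[s]
                    self.delivered.append(p)
                    progress = True
```

If process 0 multicasts a message with timestamp (1, 0, 0) and process 1 replies with timestamp (1, 1, 0), a receiver that gets the reply first holds it back until the original message arrives.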
5 Distributed Commit
✓ atomic multicasting is an example of a more general problem known as distributed
commit
✓ in atomic multicasting, the operation is the delivery of a message
✓ the distributed commit problem, however, involves having an arbitrary operation
performed by each member of a process group, or by none at all
✓ there are three protocols: one-phase commit, two-phase commit, and three-phase commit
5.1 One-Phase Commit Protocol
✓ a coordinator tells all other processes, called participants, whether or not to (locally)
perform an operation
✓ drawback: if one of the participants cannot perform the operation (for example, because
doing so would violate concurrency control constraints in a distributed transaction), it
has no way to tell the coordinator
5.2 Two-Phase Commit Protocol (2PC)
✓ it has two phases: voting phase and decision phase, each involving two steps
❖ voting phase
• the coordinator sends a VOTE_REQUEST message to all participants
• each participant then replies with a VOTE_COMMIT or VOTE_ABORT message,
depending on its local situation
❖ decision phase
• the coordinator collects all votes; if all participants vote to commit the transaction, it
sends a GLOBAL_COMMIT message; if at least one participant sends
VOTE_ABORT, it sends a GLOBAL_ABORT message
• each participant that voted to commit waits for the coordinator's final decision and
then commits or aborts accordingly
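The two phases can be sketched in a few lines. This is a failure-free illustration only: the `Participant` class, its `can_commit` flag (standing in for the participant's local situation), and the state names are assumptions made for the example, and message passing is replaced by direct method calls, so timeouts and crashes are not modeled.

```python
class Participant:
    """Hypothetical participant; can_commit stands in for its local situation."""

    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        if self.can_commit:
            self.state = "READY"   # promised to commit, now waits for the decision
            return "VOTE_COMMIT"
        self.state = "ABORT"       # a participant voting abort can abort right away
        return "VOTE_ABORT"

    def on_decision(self, decision):
        if self.state == "READY":
            self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants):
    # Voting phase: send VOTE_REQUEST to everyone and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    # Decision phase: commit only if every single vote was VOTE_COMMIT.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single VOTE_ABORT is enough to abort the whole group, which is what gives the protocol its all-or-nothing behavior.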
5.3 Three-Phase Commit Protocol (3PC)
✓ the problem with 2PC is that, if the coordinator crashes, participants may need to block
until the coordinator recovers (such a situation is rare in practice, which is why 2PC is
still widely used)
✓ 3PC avoids blocking processes in the presence of crashes
✓ the states of the coordinator and each participant satisfy the following two conditions
(necessary and sufficient conditions for a commit protocol to be nonblocking)
❖ there is no single state from which it is possible to make a transition directly to either
a COMMIT or an ABORT state
❖ there is no state in which it is not possible to make a final decision, and from which a
transition to a COMMIT state can be made
6 Recovery
6.1 Introduction
✓ fundamental to fault tolerance is recovery from an error; recovery means replacing an
erroneous state with an error-free state
▪ Backward Recovery
✓ bring the system from its present erroneous state back into a previously correct state;
e.g., retransmitting lost or damaged packets in the implementation of reliable
communication
✓ the most widely used method, since it is generally applicable and can be integrated into
the middleware layer of a distributed system
✓ disadvantages:
• checkpointing and restoring a process to its previous state are costly and can become
performance bottlenecks
• no guarantee can be given that the error will not recur, which may take an application
into a loop of recovery
• some actions may be irreversible; e.g., deleting a file or handing over cash to a customer
▪ Forward Recovery
✓ bring the system from its present erroneous state to a correct new state from which it
can continue to execute
✓ it has to be known in advance which errors may occur so that they can be corrected; e.g.,
erasure correction (or simply error correction), where a lost or damaged packet is
reconstructed from other successfully delivered packets
6.2 Checkpointing
✓ recovering after a process or system failure requires that we construct a consistent global
state (also called a distributed snapshot): if a process has recorded the receipt of a
message, then there should be another process that has recorded the sending of that
message
✓ it is best to recover from the most recent distributed snapshot, also referred to as a
recovery line; it corresponds to the most recent consistent collection of checkpoints
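The consistency condition and the search for a recovery line can be sketched as follows. The data layout is an assumption made for the example: each checkpoint is a dictionary holding the set of message identifiers its process had sent and received up to that point, each process's list is ordered oldest first, and the first checkpoint of every process is its initial state with no receipts recorded. The function names are hypothetical.

```python
def is_consistent(cut):
    """A cut (one checkpoint per process) is consistent if every recorded
    receipt has a matching recorded send somewhere in the cut."""
    sent = set().union(*(c["sent"] for c in cut))
    return all(c["received"] <= sent for c in cut)


def recovery_line(checkpoints):
    """Return, per process, the index of the checkpoint on the recovery line."""
    idx = [len(cps) - 1 for cps in checkpoints]  # start from the latest checkpoints
    while True:
        cut = [cps[i] for cps, i in zip(checkpoints, idx)]
        sent = set().union(*(c["sent"] for c in cut))
        # Processes that recorded a receipt whose send is not in the cut.
        orphaned = [p for p, c in enumerate(cut) if not c["received"] <= sent]
        if not orphaned:
            return idx
        for p in orphaned:
            if idx[p] == 0:
                raise ValueError("initial checkpoint must not record receipts")
            idx[p] -= 1  # roll this process back one checkpoint and re-check
```

Rolling one process back can remove sends that another process's checkpoint depends on, so the loop may force several processes to roll back in turn before a consistent cut is found.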
6.3 Message Logging
Considering that checkpointing can be an expensive operation, techniques have been
sought to reduce the number of checkpoints but still enable recovery. One such technique
is logging messages.
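The idea can be illustrated with a small sketch in which a process restores its last checkpoint and then replays the messages logged since that checkpoint. The example is deliberately simplified and rests on assumptions not stated above: the class name `LoggedProcess` is hypothetical, the state is a single running total, the log is assumed to survive the crash, and message handling is assumed to be deterministic so that replay reproduces the same state.

```python
class LoggedProcess:
    """Process that logs each delivered message so it can replay after a crash."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []  # messages delivered since the last checkpoint (assumed stable)

    def deliver(self, amount):
        self.log.append(amount)  # log first, then apply
        self.state += amount

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()  # messages before the checkpoint are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        for amount in self.log:
            self.state += amount
```

Because the log fills the gap between the last checkpoint and the crash, checkpoints can be taken less often without losing the work done in between.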
6.4 Recovery Oriented Computing
A related way of handling recovery is essentially to start over again. The underlying
principle of this way of masking failures is that it may be much cheaper to optimize
for recovery than it is to aim for systems that are free from failures for a long time.