SlideShare a Scribd company logo
1 of 26
Download to read offline
Exploring Failure Transparency and the Limits of
Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen
Presented by Miroslav Cupak
10/29/2012
Introduction
failure recovery as an illusion of failure-free
operation
goal of the paper:
2 invariants for failure transparency - save/lose
work
evaluation of real-life applications
Guaranteeing Failure Transparency
requirements:
automatized
fast
generic
tools:
commit events
rollback of a process
reexecution
problems:
undoable and redoable operations
some things hard to undo
some things hard to redo
Guaranteeing Failure Transparency: Failures
Guaranteeing Failure Transparency: Stop Failures
consistent recovery
Recovery is consistent if and only if there exists a
complete, failure-free execution of the computation
that would result in a sequence of visible events
equivalent to the sequence of visible events
actually output in the failed and recovered run.
ignore ordering and duplicates
non-deterministic events
Guaranteeing Failure Transparency: Save Work
save-work invariant
A computation is guaranteed consistent recovery
from stop failures if and only if for each executed
non-deterministic event ei
p that causally precedes a
visible or commit event e, process p executes a
commit event ej
p such that ej
p happens-before (or
atomic with) e, and i <= j.
save-work visible and save-work orphan invariant
many protocols
Guaranteeing Failure Transparency: Protocol Space
Guaranteeing Failure Transparency: Protocol Space
Guaranteeing Failure Transparency: Propagation Failures
non-determinism helpful
fixed vs transient events
Guaranteeing Failure Transparency: Lose Work
lose-work invariant
Application-generic recovery from propagation
failures is guaranteed to be possible if and only if
the application executes no commit event on a
dangerous path.
single/multi-process dangerous paths
algorithms
techniques for adding non-determinism
Evaluation: Performance Penalty
4 applications: nvi, magic, xpilot, TreadMarks
failure transparency: Discount Checking
modified to intercept non-deterministic system
calls
reliable memory: Rio File Cache
transactions: Vista transaction library
7 protocols: CAND, CPVS, CBNDVS,
CAND-LOG, CBNVS-LOG, CPV-2PC,
CBNDV-2PC
Evaluation: Performance Penalty - Results
24 performance analyses (no. of checkpoints,
increase in execution time)
interesting observations:
commit frequency decreases and performance
increases with increasing coordinates (except for
xpilot)
at least 1 protocol performs well for each app
disk-based recovery with reasonably low overhead
for interactive apps
Evaluation: Conflicts (Lose-work vs Save-work)
sample: 50 crashes for each of 7 fault types for
2 applications (nvi, postgres)
application faults (commit on dangerous path
after fault activation)
lose-work violation: 35%
guess for conflicts based on non-deterministic bug
distribution in other applications: 90%
OS faults
failure to recover: 3% (postgres) - 15% (nvi)
propagation failures: 10% (postgres) - 41% (nvi)
Related work
lot of work on transactions, fault tolerance and
recovery from stop failures
notable papers:
E. N. Elnozahy et al.: A Survey of
Rollback-Recovery Protocols in Message-Passing
Systems (thorough analysis of checkpointing &
log-based recovery protocols, over 240 references)
S. Chandra: Evaluating the Recovery-Related
Properties of Software Faults (PhD thesis
explaining most of the failure-recovery terms used
in the paper)
D. E. Lowell: Theory and Practice of Failure
Transparency (consistent recovery, commit
invariant, protocol space, discount checking)
D. E. Lowell, P. M. Chen: Discount Checking (fast
light checkpointing system)
Conclusions & Contributions
extending the concept of the stop failures to
propagation failures
invariant for surviving propagation failures
evaluation of propagation failures for which
consistent recovery is not possible
failure transparency for stop failures possible,
help from the application needed for
propagation failures
Questions?
Discount Checking system used in evaluations is
built on top of Vista transaction library. Why are
transactions needed?
What are the transactions needed for?
DC doesn’t know anything about semantics of
application operations and cannot use it for
rollback
transactions can be easily used to implement
checkpointing
interval between checkpoints = body of a
transaction
taking a checkpoint = transaction commiting
process state rollback to the last checkpoint =
transaction aborting
Applications failed to recover in up to 15% cases
using Discount Checking with the possibility to roll
back to the last checkpoint. Why don’t they use
multiple checkpoints? Would it help?
Would using multiple checkpoints help? Why don’t they use them?
it would definitely help to decrease the recovery
failure rate
but it wouldn’t solve all the problems (starting
states)
it’s not used probably because of
implementation difficulty (system based on
transactions, this kind of action would require
another encapsulation)
performance overhead would increase
Failure transparency is great. What are the
drawbacks?
What are the drawbacks of failure transparency?
increased cost
complicated debugging (masked failures)
interference with other components
lower priority of fault correction
How representative are the results of the
evaluations?
How representative are the results of the evaluations?
small sample of different types of applications
with prevailing types of events
even though only 2 applications tested for
performance, significantly different results
only 1 checkpointing system
part of the statistics inferred from other
programs
12 years ago - new HW and possibly different
bottlenecks, new possibly more efficient
protocols etc.
As a developer trying to make an application
transparent to failures, would you use the same
approach as the authors did for they evaluations
(Discount Checking)? Why?
Would you use Discount Checking?
good solution for many use cases
amazingly flexible, very easy setup for existing
applications compared to other systems we
mentioned
works out of the box, but extensible if needed
low performance overhead

More Related Content

What's hot

Using Control Charts for Detecting and Understanding Performance Regressions ...
Using Control Charts for Detecting and Understanding Performance Regressions ...Using Control Charts for Detecting and Understanding Performance Regressions ...
Using Control Charts for Detecting and Understanding Performance Regressions ...SAIL_QU
 
Jerry banks introduction to simulation
Jerry banks   introduction to simulationJerry banks   introduction to simulation
Jerry banks introduction to simulationsarubianoa
 
Software testing-life-cycle-process Process
Software testing-life-cycle-process ProcessSoftware testing-life-cycle-process Process
Software testing-life-cycle-process ProcessAjit Waje
 
Metrics formulas
Metrics formulasMetrics formulas
Metrics formulasmd_taufeeq
 
Defects in software testing
Defects in software testingDefects in software testing
Defects in software testingsandeepsingh2808
 
Adaptive software testing
Adaptive software testingAdaptive software testing
Adaptive software testingJohan Hoberg
 
Unit 5 general principles, simulation software
Unit 5 general principles, simulation softwareUnit 5 general principles, simulation software
Unit 5 general principles, simulation softwareraksharao
 
Manual testing-training-institute-in-marathahalli
Manual testing-training-institute-in-marathahalliManual testing-training-institute-in-marathahalli
Manual testing-training-institute-in-marathahallisiyaram ray
 
Introduction to Software Testing
Introduction to Software TestingIntroduction to Software Testing
Introduction to Software TestingHenry Muccini
 
Software techniques
Software techniquesSoftware techniques
Software techniqueshome
 
Testing concepts ppt
Testing concepts pptTesting concepts ppt
Testing concepts pptRathna Priya
 
Automated Detection of Performance Regressions Using Statistical Process Cont...
Automated Detection of Performance Regressions Using Statistical Process Cont...Automated Detection of Performance Regressions Using Statistical Process Cont...
Automated Detection of Performance Regressions Using Statistical Process Cont...SAIL_QU
 

What's hot (20)

Software Testing
Software TestingSoftware Testing
Software Testing
 
unit testing and debugging
unit testing and debuggingunit testing and debugging
unit testing and debugging
 
Using Control Charts for Detecting and Understanding Performance Regressions ...
Using Control Charts for Detecting and Understanding Performance Regressions ...Using Control Charts for Detecting and Understanding Performance Regressions ...
Using Control Charts for Detecting and Understanding Performance Regressions ...
 
Tlc
TlcTlc
Tlc
 
Jerry banks introduction to simulation
Jerry banks   introduction to simulationJerry banks   introduction to simulation
Jerry banks introduction to simulation
 
Software testing-life-cycle-process Process
Software testing-life-cycle-process ProcessSoftware testing-life-cycle-process Process
Software testing-life-cycle-process Process
 
Metrics formulas
Metrics formulasMetrics formulas
Metrics formulas
 
Defects in software testing
Defects in software testingDefects in software testing
Defects in software testing
 
Adaptive software testing
Adaptive software testingAdaptive software testing
Adaptive software testing
 
Software testing Report
Software testing ReportSoftware testing Report
Software testing Report
 
Unit 5 general principles, simulation software
Unit 5 general principles, simulation softwareUnit 5 general principles, simulation software
Unit 5 general principles, simulation software
 
Manual testing-training-institute-in-marathahalli
Manual testing-training-institute-in-marathahalliManual testing-training-institute-in-marathahalli
Manual testing-training-institute-in-marathahalli
 
Introduction to Software Testing
Introduction to Software TestingIntroduction to Software Testing
Introduction to Software Testing
 
Software testing ppt
Software testing pptSoftware testing ppt
Software testing ppt
 
Software techniques
Software techniquesSoftware techniques
Software techniques
 
Testing concepts ppt
Testing concepts pptTesting concepts ppt
Testing concepts ppt
 
Automated Detection of Performance Regressions Using Statistical Process Cont...
Automated Detection of Performance Regressions Using Statistical Process Cont...Automated Detection of Performance Regressions Using Statistical Process Cont...
Automated Detection of Performance Regressions Using Statistical Process Cont...
 
Sop test planning
Sop test planningSop test planning
Sop test planning
 
CTFL chapter 05
CTFL chapter 05CTFL chapter 05
CTFL chapter 05
 
Software Testing and Debugging
Software Testing and DebuggingSoftware Testing and Debugging
Software Testing and Debugging
 

Similar to Exploring Failure Transparency and the Limits of Generic Recovery

Spm unit v-software maintenance-intro
Spm unit v-software maintenance-introSpm unit v-software maintenance-intro
Spm unit v-software maintenance-introKanchana Devi
 
Software reliability
Software reliabilitySoftware reliability
Software reliabilityAnand Kumar
 
Bt0081 software engineering2
Bt0081 software engineering2Bt0081 software engineering2
Bt0081 software engineering2Techglyphs
 
Defect correction-Software Testing
Defect correction-Software TestingDefect correction-Software Testing
Defect correction-Software Testingmrinmoy mukherjee
 
Review Paper on Recovery of Data during Software Fault
Review Paper on Recovery of Data during Software FaultReview Paper on Recovery of Data during Software Fault
Review Paper on Recovery of Data during Software FaultAM Publications
 
Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Gurpreet singh
 
MBT_Installers_Dev_Env
MBT_Installers_Dev_EnvMBT_Installers_Dev_Env
MBT_Installers_Dev_EnvChris Struble
 
Sech1920 1200112979886874-3
Sech1920 1200112979886874-3Sech1920 1200112979886874-3
Sech1920 1200112979886874-3Mateti Anilraja
 
Software reliability & quality
Software reliability & qualitySoftware reliability & quality
Software reliability & qualityNur Islam
 
Spm unit v-software reliability-
Spm unit v-software reliability-Spm unit v-software reliability-
Spm unit v-software reliability-Kanchana Devi
 
Partitioned Based Regression Verification
Partitioned Based Regression VerificationPartitioned Based Regression Verification
Partitioned Based Regression VerificationAung Thu Rha Hein
 
Building blocks of Algblocks of Alg.pptx
Building blocks of Algblocks of Alg.pptxBuilding blocks of Algblocks of Alg.pptx
Building blocks of Algblocks of Alg.pptxNISHASOMSCS113
 
Real software engineering
Real software engineeringReal software engineering
Real software engineeringatr2006
 
SE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingSE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingAmr E. Mohamed
 
Efficient failure detection and consensus at extreme-scale systems
Efficient failure detection and consensus at extreme-scale  systemsEfficient failure detection and consensus at extreme-scale  systems
Efficient failure detection and consensus at extreme-scale systemsIJECEIAES
 
Performance testing interview questions and answers
Performance testing interview questions and answersPerformance testing interview questions and answers
Performance testing interview questions and answersGaruda Trainings
 

Similar to Exploring Failure Transparency and the Limits of Generic Recovery (20)

Spm unit v-software maintenance-intro
Spm unit v-software maintenance-introSpm unit v-software maintenance-intro
Spm unit v-software maintenance-intro
 
Testing
TestingTesting
Testing
 
Software reliability
Software reliabilitySoftware reliability
Software reliability
 
Bt0081 software engineering2
Bt0081 software engineering2Bt0081 software engineering2
Bt0081 software engineering2
 
Defect correction-Software Testing
Defect correction-Software TestingDefect correction-Software Testing
Defect correction-Software Testing
 
Review Paper on Recovery of Data during Software Fault
Review Paper on Recovery of Data during Software FaultReview Paper on Recovery of Data during Software Fault
Review Paper on Recovery of Data during Software Fault
 
Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2Software Testing and Quality Assurance Assignment 2
Software Testing and Quality Assurance Assignment 2
 
MBT_Installers_Dev_Env
MBT_Installers_Dev_EnvMBT_Installers_Dev_Env
MBT_Installers_Dev_Env
 
Sech1920 1200112979886874-3
Sech1920 1200112979886874-3Sech1920 1200112979886874-3
Sech1920 1200112979886874-3
 
Intro-Soft-Engg-2.pptx
Intro-Soft-Engg-2.pptxIntro-Soft-Engg-2.pptx
Intro-Soft-Engg-2.pptx
 
Types
TypesTypes
Types
 
Testing type
Testing typeTesting type
Testing type
 
Software reliability & quality
Software reliability & qualitySoftware reliability & quality
Software reliability & quality
 
Spm unit v-software reliability-
Spm unit v-software reliability-Spm unit v-software reliability-
Spm unit v-software reliability-
 
Partitioned Based Regression Verification
Partitioned Based Regression VerificationPartitioned Based Regression Verification
Partitioned Based Regression Verification
 
Building blocks of Algblocks of Alg.pptx
Building blocks of Algblocks of Alg.pptxBuilding blocks of Algblocks of Alg.pptx
Building blocks of Algblocks of Alg.pptx
 
Real software engineering
Real software engineeringReal software engineering
Real software engineering
 
SE2_Lec 20_Software Testing
SE2_Lec 20_Software TestingSE2_Lec 20_Software Testing
SE2_Lec 20_Software Testing
 
Efficient failure detection and consensus at extreme-scale systems
Efficient failure detection and consensus at extreme-scale  systemsEfficient failure detection and consensus at extreme-scale  systems
Efficient failure detection and consensus at extreme-scale systems
 
Performance testing interview questions and answers
Performance testing interview questions and answersPerformance testing interview questions and answers
Performance testing interview questions and answers
 

More from Miro Cupak

Exploring the latest and greatest from Java 14
Exploring the latest and greatest from Java 14Exploring the latest and greatest from Java 14
Exploring the latest and greatest from Java 14Miro Cupak
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in JavaMiro Cupak
 
Exploring the last year of Java
Exploring the last year of JavaExploring the last year of Java
Exploring the last year of JavaMiro Cupak
 
Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Miro Cupak
 
The Good, the Bad and the Ugly of Java API design
The Good, the Bad and the Ugly of Java API designThe Good, the Bad and the Ugly of Java API design
The Good, the Bad and the Ugly of Java API designMiro Cupak
 
Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Miro Cupak
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in JavaMiro Cupak
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designMiro Cupak
 
Master class in modern Java
Master class in modern JavaMaster class in modern Java
Master class in modern JavaMiro Cupak
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designMiro Cupak
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in JavaMiro Cupak
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designMiro Cupak
 
Writing clean code with modern Java
Writing clean code with modern JavaWriting clean code with modern Java
Writing clean code with modern JavaMiro Cupak
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designMiro Cupak
 
Master class in modern Java
Master class in modern JavaMaster class in modern Java
Master class in modern JavaMiro Cupak
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in JavaMiro Cupak
 
Writing clean code with modern Java
Writing clean code with modern JavaWriting clean code with modern Java
Writing clean code with modern JavaMiro Cupak
 
Exploring what's new in Java 10 and 11 (and 12)
Exploring what's new in Java 10 and 11 (and 12)Exploring what's new in Java 10 and 11 (and 12)
Exploring what's new in Java 10 and 11 (and 12)Miro Cupak
 
Exploring what's new in Java 10 and 11
Exploring what's new in Java 10 and 11Exploring what's new in Java 10 and 11
Exploring what's new in Java 10 and 11Miro Cupak
 
Exploring what's new in Java in 2018
Exploring what's new in Java in 2018Exploring what's new in Java in 2018
Exploring what's new in Java in 2018Miro Cupak
 

More from Miro Cupak (20)

Exploring the latest and greatest from Java 14
Exploring the latest and greatest from Java 14Exploring the latest and greatest from Java 14
Exploring the latest and greatest from Java 14
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in Java
 
Exploring the last year of Java
Exploring the last year of JavaExploring the last year of Java
Exploring the last year of Java
 
Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Local variable type inference - Will it compile?
Local variable type inference - Will it compile?
 
The Good, the Bad and the Ugly of Java API design
The Good, the Bad and the Ugly of Java API designThe Good, the Bad and the Ugly of Java API design
The Good, the Bad and the Ugly of Java API design
 
Local variable type inference - Will it compile?
Local variable type inference - Will it compile?Local variable type inference - Will it compile?
Local variable type inference - Will it compile?
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in Java
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API design
 
Master class in modern Java
Master class in modern JavaMaster class in modern Java
Master class in modern Java
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API design
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in Java
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API design
 
Writing clean code with modern Java
Writing clean code with modern JavaWriting clean code with modern Java
Writing clean code with modern Java
 
The good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API designThe good, the bad, and the ugly of Java API design
The good, the bad, and the ugly of Java API design
 
Master class in modern Java
Master class in modern JavaMaster class in modern Java
Master class in modern Java
 
Exploring reactive programming in Java
Exploring reactive programming in JavaExploring reactive programming in Java
Exploring reactive programming in Java
 
Writing clean code with modern Java
Writing clean code with modern JavaWriting clean code with modern Java
Writing clean code with modern Java
 
Exploring what's new in Java 10 and 11 (and 12)
Exploring what's new in Java 10 and 11 (and 12)Exploring what's new in Java 10 and 11 (and 12)
Exploring what's new in Java 10 and 11 (and 12)
 
Exploring what's new in Java 10 and 11
Exploring what's new in Java 10 and 11Exploring what's new in Java 10 and 11
Exploring what's new in Java 10 and 11
 
Exploring what's new in Java in 2018
Exploring what's new in Java in 2018Exploring what's new in Java in 2018
Exploring what's new in Java in 2018
 

Recently uploaded

CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...Neo4j
 
Novo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMsNovo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMsNeo4j
 
[GRCPP] Introduction to concepts (C++20)
[GRCPP] Introduction to concepts (C++20)[GRCPP] Introduction to concepts (C++20)
[GRCPP] Introduction to concepts (C++20)Dimitrios Platis
 
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...Abortion Clinic
 
GraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with GraphGraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with GraphNeo4j
 
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale IbridaUNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale IbridaNeo4j
 
Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?Maxim Salnikov
 
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...Flutter Agency
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfSrushith Repakula
 
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Andreas Granig
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfICS
 
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdfAzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdfryanfarris8
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMarkus Moeller
 
Software Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements EngineeringSoftware Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements EngineeringPrakhyath Rai
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio, Inc.
 

Recently uploaded (20)

CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
 
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
 
Novo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMsNovo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMs
 
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
Abortion Pill Prices Germiston ](+27832195400*)[ 🏥 Women's Abortion Clinic in...
 
[GRCPP] Introduction to concepts (C++20)
[GRCPP] Introduction to concepts (C++20)[GRCPP] Introduction to concepts (C++20)
[GRCPP] Introduction to concepts (C++20)
 
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
 
GraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with GraphGraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with Graph
 
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale IbridaUNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
 
Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?
 
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
Navigation in flutter – how to add stack, tab, and drawer navigators to your ...
 
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
Abortion Clinic Pretoria ](+27832195400*)[ Abortion Clinic Near Me ● Abortion...
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
Abortion Pill Prices Jozini ](+27832195400*)[ 🏥 Women's Abortion Clinic in Jo...
Abortion Pill Prices Jozini ](+27832195400*)[ 🏥 Women's Abortion Clinic in Jo...Abortion Pill Prices Jozini ](+27832195400*)[ 🏥 Women's Abortion Clinic in Jo...
Abortion Pill Prices Jozini ](+27832195400*)[ 🏥 Women's Abortion Clinic in Jo...
 
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024Automate your OpenSIPS config tests - OpenSIPS Summit 2024
Automate your OpenSIPS config tests - OpenSIPS Summit 2024
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdfAzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
AzureNativeQumulo_HPC_Cloud_Native_Benchmarks.pdf
 
Abortion Pill Prices Mthatha (@](+27832195400*)[ 🏥 Women's Abortion Clinic In...
Abortion Pill Prices Mthatha (@](+27832195400*)[ 🏥 Women's Abortion Clinic In...Abortion Pill Prices Mthatha (@](+27832195400*)[ 🏥 Women's Abortion Clinic In...
Abortion Pill Prices Mthatha (@](+27832195400*)[ 🏥 Women's Abortion Clinic In...
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
Software Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements EngineeringSoftware Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements Engineering
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 

Exploring Failure Transparency and the Limits of Generic Recovery

  • 1. Exploring Failure Transparency and the Limits of Generic Recovery David E. Lowell, Subhachandra Chandra, Peter M. Chen Presented by Miroslav Cupak 10/29/2012
  • 2. Introduction failure recovery as an illusion of failure-free operation goal of the paper: 2 invariants for failure transparency - save/lose work evaluation of real-life applications
  • 3. Guaranteeing Failure Transparency requirements: automatized fast generic tools: commit events rollback of a process reexecution problems: undoable and redoable operations some things hard to undo some things hard to redo
  • 5. Guaranteeing Failure Transparency: Stop Failures consistent recovery Recovery is consistent if and only if there exists a complete, failure-free execution of the computation that would result in a sequence of visible events equivalent to the sequence of visible events actually output in the failed and recovered run. ignore ordering and duplicates non-deterministic events
  • 6. Guaranteeing Failure Transparency: Save Work save-work invariant A computation is guaranteed consistent recovery from stop failures if and only if for each executed non-deterministic event ei p that causally precedes a visible or commit event e, process p executes a commit event ej p such that ej p happens-before (or atomic with) e, and i <= j. save-work visible and save-work orphan invariant many protocols
  • 9. Guaranteeing Failure Transparency: Propagation Failures non-determinism helpful fixed vs transient events
  • 10. Guaranteeing Failure Transparency: Lose Work lose-work invariant Application-generic recovery from propagation failures is guaranteed to be possible if and only if the application executes no commit event on a dangerous path. single/multi-process dangerous paths algorithms techniques for adding non-determinism
  • 11. Evaluation: Performance Penalty 4 applications: nvi, magic, xpilot, TreadMarks failure transparency: Discount Checking modified to intercept non-deterministic system calls reliable memory: Rio File Cache transactions: Vista transaction library 7 protocols: CAND, CPVS, CBNDVS, CAND-LOG, CBNVS-LOG, CPV-2PC, CBNDV-2PC
  • 12. Evaluation: Performance Penalty - Results 24 performance analyses (no. of checkpoints, increase in execution time) interesting observations: commit frequency decreases and performance increases with increasing coordinates (except for xpilot) at least 1 protocol performs well for each app disk-based recovery with reasonably low overhead for interactive apps
  • 13. Evaluation: Conflicts (Lose-work vs Save-work) sample: 50 crashes for each of 7 fault types for 2 applications (nvi, postgres) application faults (commit on dangerous path after fault activation) lose-work violation: 35% guess for conflicts based on non-deterministic bug distribution in other applications: 90% OS faults failure to recover: 3% (postgres) - 15% (nvi) propagation failures: 10% (postgres) - 41% (nvi)
  • 14. Related work lot of work on transactions, fault tolerance and recovery from stop failures notable papers: E. N. Elnozahy et al.: A Survey of Rollback-Recovery Protocols in Message-Passing Systems (thorough analysis of checkpointing & log-based recovery protocols, over 240 references) S. Chandra: Evaluating the Recovery-Related Properties of Software Faults (PhD thesis explaining most of the failure-recovery terms used in the paper) D. E. Lowell: Theory and Practice of Failure Transparency (consistent recovery, commit invariant, protocol space, discount checking) D. E. Lowell, P. M. Chen: Discount Checking (fast light checkpointing system)
  • 15. Conclusions & Contributions extending the concept of the stop failures to propagation failures invariant for surviving propagation failures evaluation of propagation failures for which consistent recovery is not possible failure transparency for stop failures possible, help from the application needed for propagation failures
  • 17. Discount Checking system used in evaluations is built on top of Vista transaction library. Why are transactions needed?
  • 18. What are the transactions needed for? DC doesn’t know anything about semantics of application operations and cannot use it for rollback transactions can be easily used to implement checkpointing interval between checkpoints = body of a transaction taking a checkpoint = transaction commiting process state rollback to the last checkpoint = transaction aborting
  • 19. Applications failed to recover in up to 15% cases using Discount Checking with the possibility to roll back to the last checkpoint. Why don’t they use multiple checkpoints? Would it help?
  • 20. Would using multiple checkpoints help? Why don’t they use them? it would definitely help to decrease the recovery failure rate but it wouldn’t solve all the problems (starting states) it’s not used probably because of implementation difficulty (system based on transactions, this kind of action would require another encapsulation) performance overhead would increase
  • 21. Failure transparency is great. What are the drawbacks?
  • 22. What are the drawbacks of failure transparency? increased cost complicated debugging (masked failures) interference with other components lower priority of fault correction
  • 23. How representative are the results of the evaluations?
  • 24. How representative are the results of the evaluations? small sample of different types of applications with prevailing types of events even though only 2 applications tested for performance, significantly different results only 1 checkpointing system part of the statistics inferred from other programs 12 years ago - new HW and possibly different bottlenecks, new possibly more efficient protocols etc.
  • 25. As a developer trying to make an application transparent to failures, would you use the same approach as the authors did for they evaluations (Discount Checking)? Why?
  • 26. Would you use Discount Checking? good solution for many use cases amazingly flexible, very easy setup for existing applications compared to other systems we mentioned works out of the box, but extensible if needed low performance overhead