Gray & Reuter FT 2: 1
Dependable Computing Systems
Jim Gray
Microsoft, Gray@Microsoft.com
Andreas Reuter
International University, Andreas.Reuter@i-u.de
        Mon        Tue           Wed          Thur             Fri
9:00    Overview   TP mons       Log          Files & Buffers  B-tree
11:00   Faults     Lock Theory   ResMgr       COM+             Access Paths
1:30    Tolerance  Lock Techniq  CICS & Inet  Corba            Groupware
3:30    T Models   Queues        Adv TM       Replication      Benchmark
7:00    Party      Workflow      Cyberbrick   Party
Gray & Reuter FT 2: 2
The Airplane Rule
[Figure: 100 nodes (1 Tips) on a high-speed network (10 Gb/s); 1,000 discs = 10 Terrorbytes; 100 tape transports = 1,000 tapes = 1 PetaByte]
A two-engine airplane has twice as many engine problems.
A thousand-engine airplane has thousands of engine problems.
Fault Tolerance is KEY!
Mask and repair faults
Internet: Node fails every 2 weeks
Vendors: Disk fails every 40 years
Here: node “fails” every 20 minutes
      disk fails every 2 weeks.
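The disc figures above follow from dividing the per-unit MTTF by the number of units. A minimal sketch, assuming identical discs that fail independently (the function name is illustrative, not from the slides):

```python
HOURS_PER_YEAR = 8760

def fleet_time_between_failures(unit_mttf_hours, units):
    """Mean time between failures anywhere in a fleet of identical,
    independent units: the per-unit MTTF divided by the unit count."""
    return unit_mttf_hours / units

# Vendor rating from the slide: one disc fails every 40 years.
disc_mttf_hours = 40 * HOURS_PER_YEAR
fleet_hours = fleet_time_between_failures(disc_mttf_hours, 1000)
print(f"1,000 discs: one failure every {fleet_hours / 24:.1f} days")
```

With 1,000 discs the result is about two weeks, which is the "disk fails every 2 weeks" line on the slide.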
Gray & Reuter FT 2: 3
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
Gray & Reuter FT 2: 4
DEPENDABILITY: The 3 ITIES
• Reliability / Integrity: Does the right thing (also large MTTF)
• Availability: Does it now (also large MTTF / (MTTF + MTTR))
  System Availability:
  If 90% of terminals up & 99% of DB up?
  (=> 89% of transactions are serviced on time).
• Holistic vs Reductionist view
[Figure: triangle of Security, Integrity / Reliability, and Availability]
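The availability ratio and the 89% figure can be checked with a few lines. This is a sketch that assumes the terminals and the database fail independently; the 2000-hour and 1-hour inputs are illustrative values only.

```python
def availability(mttf, mttr):
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def series_availability(*parts):
    """Availability of a chain in which every part must be up
    (assumes the parts fail independently)."""
    result = 1.0
    for part in parts:
        result *= part
    return result

# Slide example: 90% of terminals up and 99% of the database up.
end_to_end = series_availability(0.90, 0.99)
print(f"end-to-end availability: {end_to_end:.3f}")

# Illustrative module: fails every 2000 hours, takes 1 hour to repair.
print(f"module availability: {availability(2000, 1):.4f}")
```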
Gray & Reuter FT 2: 5
High Availability System Classes
Goal: Build Class 6 Systems

System Type             Unavailable (min/year)   Availability   Availability Class
Unmanaged               50,000                   90.%           1
Managed                 5,000                    99.%           2
Well Managed            500                      99.9%          3
Fault Tolerant          50                       99.99%         4
High-Availability       5                        99.999%        5
Very-High-Availability  .5                       99.9999%       6
Ultra-Availability      .05                      99.99999%      7
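The class number is the count of leading nines in the availability, and the unavailable minutes follow from the length of a year. A small sketch, assuming a 365-day year (the table rounds 52,560 minutes down to 50,000 and so on); the function names are illustrative:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def unavailable_minutes_per_year(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def availability_class(availability):
    """Number of leading nines: floor(log10(1 / (1 - A))).
    The small epsilon guards against floating-point rounding."""
    return math.floor(math.log10(1.0 / (1.0 - availability)) + 1e-9)

for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(availability_class(a), f"{unavailable_minutes_per_year(a):,.2f} min/year")
```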
Gray & Reuter FT 2: 6
Sources of Failures
                      MTTF         MTTR
Power Failure:        2000 hr      1 hr
Phone Lines
    Soft              >.1 hr       .1 hr
    Hard              4000 hr      10 hr
Hardware Modules:     100,000 hr   10 hr   (many are transient)
Software:
    1 Bug/1000 Lines Of Code (after vendor-user testing)
    => Thousands of bugs in System!
    Most software failures are transient: dump & restart system.
Useful fact: 8,760 hrs/year ~ 10k hr/year
Gray & Reuter FT 2: 7
Case Studies - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).

Cause                            Share of outages   MTTF
Vendor (hardware and software)   42%                5 Months
Application software             25%                9 Months
Communications lines             12%                1.5 Years
Operations                       9.3%               2 Years
Environment                      11.2%              2 Years
Overall                                             10 Weeks

1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES
To get 10 year mttf must attack all these problems
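The per-cause MTTFs and the outage percentages on this slide are two views of the same data: if a cause accounts for a fraction of all outages, its own MTTF is roughly the overall MTTF divided by that fraction. A sketch under that assumption (the variable names and the 52/12 weeks-per-month conversion are choices made here, not from the survey):

```python
OVERALL_MTTF_WEEKS = 10.0
WEEKS_PER_MONTH = 52.0 / 12.0

# Share of outages attributed to each cause, as read from the slide.
outage_share = {
    "vendor": 0.42,
    "application software": 0.25,
    "communications lines": 0.12,
    "operations": 0.093,
    "environment": 0.112,
}

def per_cause_mttf_weeks(overall_mttf_weeks, share):
    """A cause responsible for `share` of the outages fails
    once every overall_mttf / share on average."""
    return overall_mttf_weeks / share

per_cause = {cause: per_cause_mttf_weeks(OVERALL_MTTF_WEEKS, share)
             for cause, share in outage_share.items()}
for cause, weeks in per_cause.items():
    print(f"{cause}: {weeks / WEEKS_PER_MONTH:.1f} months")
```

The vendor share of 42% gives about 24 weeks, close to the 5 months quoted; the other causes land near their quoted values as well.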
Gray & Reuter FT 2: 8
Case Studies - Tandem
Outage Reports to Vendor
Totals:
More than 7,000 Customer years
More than 30,000 System years
More than 80,000 Processor years
More than 200,000 Disc Years
Summary Tandem EWR Data
1985 1987 1989
Customers 1000 1300 2000
EWR Customers ? ? 267
Outage Customers 176 205 164
Systems 2400 6000 9000
Processors 7,000 15,000 25,500
Discs 16,000 46,000 74,000
Cases 305 227 501
Reports 491 535 766
Faults 592 609 892
Outages 285 294 438
System MTTF 8 years 20 years 21 years
Systematic Under-reporting
But ratios & trends interesting
Gray & Reuter FT 2: 9
Case Studies - Tandem Trends
MTTF improved: WOW! Outages per millennium.
Shift from Hardware & Maintenance (from 50% to 10%)
      to   Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of Environment
                                    Operations errors
                                    Application Software
[Figure: two charts for 1985, 1987, 1989: "Outages/1000 System Years by Primary Cause" and "% of Outages by Primary Cause"; causes: unknown, environment, operations, maintenance, hardware, software]
Gray & Reuter FT 2: 10
Case Studies - Tandem Trends
Reported MTTF by Component
[Figure: "Mean Time to System Failure (years) by Cause" for 1985, 1987, 1989; series: software, hardware, maintenance, operations, environment, total]
               1985   1987   1990
SOFTWARE          2     53     33   Years
HARDWARE         29     91    310   Years
MAINTENANCE      45    162    409   Years
OPERATIONS       99    171    136   Years
ENVIRONMENT     142    214    346   Years
SYSTEM            8     20     21   Years
Remember Systematic Under-reporting
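The SYSTEM row can be cross-checked against the per-cause rows: failure rates add, so the system MTTF is the reciprocal of the summed reciprocals. A sketch using the last column of the table, assuming the causes are independent and ignoring outages of unknown cause:

```python
# MTTF in years by cause, last column of the table above.
mttf_years = {
    "software": 33,
    "hardware": 310,
    "maintenance": 409,
    "operations": 136,
    "environment": 346,
}

def combined_mttf(mttfs):
    """Failure rates (1/MTTF) of independent causes add;
    the combined MTTF is the reciprocal of the total rate."""
    return 1.0 / sum(1.0 / m for m in mttfs)

system_mttf = combined_mttf(mttf_years.values())
print(f"system MTTF: {system_mttf:.1f} years")
```

The result is a little under 22 years, in line with the 21 years reported, and it shows how the 33-year software figure dominates the total.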
Gray & Reuter FT 2: 11
Summary
Current Situation: ~4-year MTTF
=> Fault Tolerance Works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
New System Software.
New Application Software.
Utilities.
Must make all software ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable Goal: 100-year MTTF.
class 4 today =>class 6 tomorrow.
Gray & Reuter FT 2: 12
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
Gray & Reuter FT 2: 13
Key Idea
{ Architecture, Software, Distribution } Masks { Hardware Faults, Environmental Faults, Maintenance }
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!
Gray & Reuter FT 2: 14
Fault Tolerance Techniques
FAIL FAST MODULES: work or stop
SPARE MODULES: instant repair time.
INDEPENDENT MODULE FAILS by design
    MTTF_Pair ~ MTTF^2 / MTTR (so want tiny MTTR)
MESSAGE BASED OS: Fault Isolation
    software has no shared memory.
SESSION-ORIENTED COMM: Reliable messages
    detect lost/duplicate messages
    coordinate messages with commit
PROCESS PAIRS: Mask Hardware & Software Faults
TRANSACTIONS: give A.C.I.D. (simple fault model)
Gray & Reuter FT 2: 15
Example: the FT Bank
Modularity & Repair are KEY:
    von Neumann needed 20,000x redundancy in wires and switches
    We use 2x redundancy.
Redundant hardware can support peak loads (so not redundant)
[Figure: Fault Tolerant Computer and Backup System]
System MTTF >10 YEAR (except for power & terminals)
Gray & Reuter FT 2: 16
Fail-Fast is Good, Repair is Needed
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
[Figure: Lifecycle of a module: Fault, Detect, Repair, return]
fail-fast gives short fault latency
High Availability is low UN-Availability
Unavailability ~ MTTR / MTTF
Gray & Reuter FT 2: 17
Hardware Reliability/Availability
(how to make it fail fast)
Comparator Strategies:
Duplex:   Fail-Fast: fail if either fails (e.g. duplexed cpus)
    vs    Fail-Soft: fail if both fail (e.g. disc, atm,...)
    Note: in recursive pairs, parent knows which is bad.
Triplex:  Fail-Fast: fail if 2 fail (triplexed cpus)
          Fail-Soft: fail if 3 fail (triplexed FailFast cpus)
[Figure: Basic FailFast Designs (Pair, Triplex); Recursive Availability Designs (Pair & Spare, Triple Modular Redundancy)]
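The comparator idea can be sketched in software terms: a duplexed module stops as soon as its two halves disagree, while a triplexed module keeps going as long as two of three agree. The names below (`ModuleFailure`, `duplex_fail_fast`, `triplex_vote`) are hypothetical and only illustrate the comparison logic, not any real hardware interface.

```python
from collections import Counter

class ModuleFailure(Exception):
    """Raised when the comparator cannot produce a trusted result."""

def duplex_fail_fast(a, b):
    """Duplexed module: stop at once if the two results differ."""
    if a != b:
        raise ModuleFailure("comparator mismatch")
    return a

def triplex_vote(a, b, c):
    """Triplexed module: return the majority result, stop if none exists."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ModuleFailure("no majority among the three results")
    return value

print(triplex_vote(7, 7, 9))   # one faulty module is outvoted
print(duplex_fail_fast(7, 7))  # both halves agree
```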
Gray & Reuter FT 2: 18
Redundant Designs Have Worse MTTF!
THIS IS NOT GOOD: Variance is lower but MTTF is worse
Simple redundancy does not improve MTTF (sometimes hurts).
This is just an example of the airplane rule.
[Figure: state-transition diagrams without repair; each state is the number of working modules, each arc the mean time to the next failure]

Design                     Time in successive states             MTTF
Duplex: fail fast          mttf/2                                mttf/2
Duplex: fail soft          mttf/2 + mttf/1                       1.5*mttf
TMR: fail fast             mttf/3 + mttf/2                       5/6*mttf
TMR: fail soft             mttf/3 + mttf/2 + mttf/1              11/6*mttf
Pair & Spare: fail fast    mttf/4 + 0 + mttf/2                   3/4*mttf
Pair & Spare: fail soft    mttf/4 + mttf/3 + mttf/2 + mttf/1     ~2.1*mttf
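The MTTF figures on this slide can be reproduced by adding up the expected time spent in each state: with k independent modules working, the next failure arrives after mttf/k on average. The sketch below assumes exponential failures and no repair; the list of module counts per design is my reading of the diagrams.

```python
from fractions import Fraction

# For each design, the number of working modules in each state the system
# passes through before it stops. Pair & Spare fail-fast skips the
# three-module state because the broken pair is switched off at once.
designs = {
    "Duplex: fail fast": [2],
    "Duplex: fail soft": [2, 1],
    "TMR: fail fast": [3, 2],
    "TMR: fail soft": [3, 2, 1],
    "Pair & Spare: fail fast": [4, 2],
    "Pair & Spare: fail soft": [4, 3, 2, 1],
}

def mttf_without_repair(working_counts):
    """System MTTF in units of one module's mttf: sum of 1/k per state."""
    return sum(Fraction(1, k) for k in working_counts)

for name, counts in designs.items():
    total = mttf_without_repair(counts)
    print(f"{name}: {total} = {float(total):.2f} * mttf")
```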
Gray & Reuter FT 2: 19
Add Repair: Get 10^4 Improvement
[Figure: state-transition diagrams with repair; failure arcs as before, plus mttr arcs leading back toward the all-working state]

Design               MTTF with repair
Duplex: fail fast    mttf/2
Duplex: fail soft    10^4 mttf
TMR: fail fast       10^4 mttf
TMR: fail soft       10^5 mttf
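The effect of repair on a fail-soft duplex can be worked out from the three-state diagram (2, 1, or 0 modules working). The sketch below solves the two expected-time equations directly. It assumes exponential failure and repair times and a repair time of 1/10,000 of the module mttf; that ratio is an assumption chosen here to match the order of magnitude on the slide.

```python
def duplex_fail_soft_mttf(mttf, mttr):
    """Expected time until both modules of a repairable pair are down.

    T2 = 1/(2*lam) + T1
    T1 = 1/(lam + mu) + (mu / (lam + mu)) * T2
    where lam is the module failure rate and mu the repair rate.
    """
    lam = 1.0 / mttf
    mu = 1.0 / mttr
    dwell_two = 1.0 / (2.0 * lam)   # mean time with both modules up
    dwell_one = 1.0 / (lam + mu)    # mean time with one module up
    p_repair = mu / (lam + mu)      # chance repair wins the race
    t_one = (dwell_one + p_repair * dwell_two) / (1.0 - p_repair)
    return dwell_two + t_one

exact = duplex_fail_soft_mttf(mttf=1.0, mttr=1e-4)
print(f"exact: {exact:.1f} mttf")
print(f"mttf^2 / (2 * mttr): {1.0 / (2 * 1e-4):.1f} mttf")
```

The exact value is close to mttf^2 / (2 * mttr), within a factor of two of the MTTF^2 / MTTR rule of thumb, so a tiny MTTR is what buys the four orders of magnitude.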
Gray & Reuter FT 2: 20
When To Repair?
Chances Of Tolerating A Fault are 1000:1 (class 3)
A 1995 study: Processor & Disc Rated At ~ 10khr MTTF
    Computed Single Failures   Observed Double Fails   Ratio
    10k Processor Fails        14 Double               ~ 1000 : 1
    40k Disc Fails             26 Double               ~ 1000 : 1
Hardware Maintenance:
    On-Line Maintenance "Works" 999 Times Out Of 1000.
    The chance a duplexed disc will fail during maintenance ~ 1:1000
    Risk Is 30x Higher During Maintenance
    => Do It Off Peak Hour
Software Maintenance:
    Repair Only Virulent Bugs
    Wait For Next Release To Fix Benign Bugs
Gray & Reuter FT 2: 21
OK: So Far
Hardware fail-fast is easy
Redundancy plus Repair is great (Class 7 availability)
Hardware redundancy & repair is via modules.
How can we get instant software repair?
We Know How To Get Reliable Storage
    RAID Or Dumps And Transaction Logs.
We Know How To Get Available Storage
    Fail Soft Duplexed Discs (RAID 1...N).
? HOW DO WE GET RELIABLE EXECUTION?
? HOW DO WE GET AVAILABLE EXECUTION?
Gray & Reuter FT 2: 22
Outline
• Does fault tolerance work?
• General methods to mask faults.
• Software-fault tolerance
• Summary
Gray & Reuter FT 2: 23
Software Techniques: Learning from Hardware
Most outages in Fault Tolerant Systems are SOFTWARE
Fault Avoidance Techniques: Good & Correct design.
After that: Software Fault Tolerance Techniques:
    Modularity (isolation, fault containment)
    Design diversity
    N-Version Programming: N-different implementations
    Defensive Programming: Check parameters and data
    Auditors: Check data structures in background
    Transactions: to clean up state after a failure
Paradox: Need Fail-Fast Software
Gray & Reuter FT 2: 24
Fail-Fast and High-Availability Execution
Software N-Plexing: Design Diversity
    N-Version Programming
    Write the same program N-Times (N > 3)
    Compare outputs of all programs and take majority vote
Process Pairs: Instant restart (repair)
    Use Defensive programming to make a process fail-fast
    Have restarted process ready in separate environment
    Second process “takes over” if primary faults
    Transaction mechanism can clean up distributed state
    if takeover in middle of computation.
[Figure: a SESSION attached to a LOGICAL PROCESS = PROCESS PAIR; the PRIMARY PROCESS sends STATE INFORMATION to the BACKUP PROCESS]
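A toy version of a checkpointing process pair may help make the takeover idea concrete. Everything here is hypothetical (the class names, the counter workload, the `alive` flag standing in for a fail-fast stop); it is a sketch of the pattern, not of Tandem's implementation.

```python
class ProcessFailure(Exception):
    """A fail-fast process stops rather than continue in a bad state."""

class CounterProcess:
    def __init__(self, state=0):
        self.state = state
        self.alive = True

    def apply(self, amount):
        if not self.alive:
            raise ProcessFailure("process is down")
        self.state += amount
        return self.state

class ProcessPair:
    """Logical process: the primary does the work, the backup holds a checkpoint."""

    def __init__(self):
        self.primary = CounterProcess()
        self.backup = CounterProcess()

    def request(self, amount):
        try:
            result = self.primary.apply(amount)
        except ProcessFailure:
            # Takeover: the backup becomes primary and a fresh backup
            # is started from the last checkpointed state.
            self.primary, self.backup = self.backup, CounterProcess(self.backup.state)
            result = self.primary.apply(amount)
        self.backup.state = self.primary.state  # checkpoint after each request
        return result

pair = ProcessPair()
first = pair.request(5)
pair.primary.alive = False   # simulate a fault in the primary
second = pair.request(3)     # the session sees no outage
print(first, second)
```

The client of `ProcessPair` keeps one session and never learns which physical process served it, which is the point of the logical process in the figure.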
Gray & Reuter FT 2: 25
What Is MTTF of N-Version Program?
First fails after MTTF/N
Second fails after MTTF/(N-1),...
so MTTF * (1/N + 1/(N-1) + ... + 1/2)
harmonic series goes to infinity, but VERY slowly
for example 100-version programming gives
~4 MTTF of 1-version programming
Reduces variance
N-Version Programming Needs REPAIR
If a program fails, must reset its state from other programs.
=> programs have common data/state representation.
How does this work for Database Systems?
                       Operating Systems?
                       Network Systems?
Answer: I don’t know.
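The "~4 MTTF" claim is a partial harmonic sum and is easy to check. A minimal sketch, assuming independent versions with equal MTTF and no repair (the function name is illustrative):

```python
def n_version_mttf_factor(n):
    """MTTF of an N-version program in units of one version's MTTF:
    1/N + 1/(N-1) + ... + 1/2 (the vote is lost once one version remains)."""
    return sum(1.0 / k for k in range(2, n + 1))

for n in (3, 10, 100, 1000):
    print(n, f"{n_version_mttf_factor(n):.2f}")
```

Going from 100 to 1000 versions adds only about 2.3 more MTTFs, which is what "VERY slowly" means here.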
Gray & Reuter FT 2: 26
Why Process Pairs Mask Faults
Many Software Faults are Soft
After Design Review
      Code Inspection
      Alpha Test
      Beta Test
      10k Hrs Of Gamma Test (Production)
Most Software Faults Are Transient
    MVS Functional Recovery Routines   5:1
    Tandem Spooler                     100:1
    Adams                              >100:1
Terminology:
    Heisenbug: Works On Retry
    Bohrbug: Faults Again On Retry
Adams: "Optimizing Preventative Service of Software Products", IBM J R&D,28.1,1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.
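The Heisenbug/Bohrbug distinction is operational: retry and see what happens. The helper below is a hypothetical illustration of that test; `classify_fault` and `make_transient` are names invented for this sketch.

```python
def classify_fault(operation, retries=1):
    """Run the operation; if it faults, retry it.
    A fault that goes away on retry is a Heisenbug, one that repeats is a Bohrbug."""
    try:
        operation()
        return "no fault"
    except Exception:
        pass
    for _ in range(retries):
        try:
            operation()
            return "Heisenbug"
        except Exception:
            continue
    return "Bohrbug"

def make_transient(failures):
    """Build an operation that faults `failures` times and then works."""
    remaining = [failures]
    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient fault")
        return "ok"
    return operation

print(classify_fault(make_transient(1)))  # faults once, works on retry
print(classify_fault(make_transient(5)))  # still faulting on retry
```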
Gray & Reuter FT 2: 27
Process Pair Repair Strategy
If software fault (bug) is a Bohrbug, then there is no repair
    “wait for the next release” or
    “get an emergency bug fix” or
    “get a new vendor”
If software fault is a Heisenbug, then repair is
    reboot and retry or
    switch to backup process (instant restart)
PROCESS PAIRS Tolerate Hardware Faults
                       Heisenbugs
Repair time is seconds, could be milliseconds if time is critical
Flavors Of Process Pair: Lockstep
                         Automatic
                         State Checkpointing
                         Delta Checkpointing
                         Persistent
[Figure: a SESSION attached to a LOGICAL PROCESS = PROCESS PAIR; the PRIMARY PROCESS sends STATE INFORMATION to the BACKUP PROCESS]
Gray & Reuter FT 2: 28
How Takeover Masks Failures
Server Resets At Takeover But What About Application State?
                                         Database State?
                                         Network State?
Answer: Use Transactions To Reset State!
    Abort Transaction If Process Fails.
    Keeps Network "Up"
    Keeps System "Up"
    Reprocesses Some Transactions On Failure
[Figure: a SESSION attached to a LOGICAL PROCESS = PROCESS PAIR; the PRIMARY PROCESS sends STATE INFORMATION to the BACKUP PROCESS]
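"Abort the transaction, then reprocess it" can be demonstrated with any transactional store. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `account` table, the `transfer` function, and the `crash` flag that simulates a process failure are all made up for the illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount, crash=False):
    """Move money inside one transaction. The connection's context manager
    commits on success and rolls back if an exception escapes."""
    with conn:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        if crash:
            raise RuntimeError("server process failed mid-transaction")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    transfer(conn, "a", "b", 30, crash=True)   # fault in the middle of the work
except RuntimeError:
    pass                                       # the transaction was aborted
after_abort = dict(conn.execute("SELECT id, balance FROM account"))

transfer(conn, "a", "b", 30)                   # reprocess after takeover
after_retry = dict(conn.execute("SELECT id, balance FROM account"))
print(after_abort, after_retry)
```

After the abort the half-done debit is gone, so the retry starts from a clean state and the money moves exactly once.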
Gray & Reuter FT 2: 29
PROCESS PAIRS - SUMMARY
Transactions Give Reliability
Process Pairs Give Availability
Process Pairs Are Expensive & Hard To Program
Transactions + Persistent Process Pairs
    => Fault Tolerant Sessions & Execution
When Tandem Converted To This Style
    Saved 3x Messages
    Saved 5x Message Bytes
    Made Programming Easier
Gray & Reuter FT 2: 30
SYSTEM PAIRS FOR HIGH AVAILABILITY
Programs, Data, Processes Replicated at two sites.
Pair looks like a single system.
System becomes logical concept
Like Process Pairs: System Pairs.
Backup receives transaction log (spooled if backup down).
If primary fails or operator Switches, backup offers service.
[Figure: Primary and Backup systems]
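The "spooled if backup down" behaviour can be sketched as a primary that queues log records it cannot deliver and drains the queue when the backup returns. The classes and the `up` flag below are hypothetical stand-ins for two sites and the link between them.

```python
class BackupSite:
    def __init__(self):
        self.up = True
        self.log = []

    def receive(self, record):
        if not self.up:
            raise ConnectionError("backup is down")
        self.log.append(record)

class PrimarySite:
    def __init__(self, backup):
        self.backup = backup
        self.log = []     # the primary's own transaction log
        self.spool = []   # records not yet delivered to the backup

    def commit(self, record):
        self.log.append(record)
        self.spool.append(record)
        self.drain()

    def drain(self):
        """Ship spooled records in order; stop quietly if the backup is down."""
        while self.spool:
            try:
                self.backup.receive(self.spool[0])
            except ConnectionError:
                return
            self.spool.pop(0)

backup = BackupSite()
primary = PrimarySite(backup)
primary.commit("t1")
backup.up = False
primary.commit("t2")
primary.commit("t3")
spooled = list(primary.spool)   # records waiting for the backup
backup.up = True
primary.drain()
print(spooled, backup.log)
```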
Gray & Reuter FT 2: 31
SYSTEM PAIR CONFIGURATION OPTIONS
Mutual Backup:
    each has 1/2 of Database & Application
Hub:
    One site acts as backup for many others
In General can be any directed graph
Stale replicas: Lazy replication
[Figure: four configurations: a Primary/Backup pair; a mutual-backup pair; a hub Backup serving several Primaries; a Primary feeding several Copies]
Gray & Reuter FT 2: 32
SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
Step 1: Both systems are running V1.        (Primary) V1    (Backup) V1
Step 2: Backup is cold-loaded as V2.        (Primary) V1    (Backup) V2
Step 3: SWITCH to Backup.                   (Backup) V1     (Primary) V2
Step 4: Backup is cold-loaded as V2 D30.    (Backup) V2     (Primary) V2
Similar ideas apply to:
    Database Reorganization
    Hardware modification (e.g. add discs, processors,...)
    Hardware maintenance
    Environmental changes (rewire, new air conditioning)
    Move primary or backup to new location.
Gray & Reuter FT 2: 33
SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT: different sites
    weather
    utilities
    sabotage
Protects against OPERATOR FAILURE:
    two sites, two sets of operators
Protects against MAINTENANCE OUTAGES
    work on backup
    software/hardware install/upgrade/move...
Protects against HARDWARE FAILURES
    backup takes over
Protects against TRANSIENT SOFTWARE ERRORS
Commercial systems: Digital's Remote Transaction Router (RTR)
                    Tandem's Remote Database Facility (RDF)
                    IBM's Cross Recovery XRF (both in same campus)
                    Oracle, Sybase, Informix, Microsoft... replication
Gray & Reuter FT 2: 34
SUMMARY
FT systems fail for the conventional reasons
    Environment   mostly
    People        sometimes
    Software      mostly
    Hardware      Rarely
MTTF of FT SYSTEMS ~ 50X conventional
    ~ years vs weeks
Fail-Fast Modules + Reconfiguration + Repair =>
    Good Hardware Fault Tolerance
Transactions + Process Pairs =>
    Good Software Fault Tolerance (Repair)
System Pairs Hide Many Faults
Challenge: Tolerate Human Errors
    (make system simpler to manage, operate, and maintain)
Gray & Reuter FT 2: 35
Key Idea
{ Architecture, Software, Distribution } Masks { Hardware Faults, Environmental Faults, Maintenance }
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!
Gray & Reuter FT 2: 36
References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15’th FTCS. 2-11.
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10’th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.

More Related Content

Viewers also liked (10)

14 turing wics
14 turing wics14 turing wics
14 turing wics
 
09 workflow
09 workflow09 workflow
09 workflow
 
18 philbe replication stanford99
18 philbe replication stanford9918 philbe replication stanford99
18 philbe replication stanford99
 
11 tm
11 tm11 tm
11 tm
 
13 tm adv
13 tm adv13 tm adv
13 tm adv
 
8 application servers_v2
8 application servers_v28 application servers_v2
8 application servers_v2
 
06 07 lock
06 07 lock06 07 lock
06 07 lock
 
10 replication
10 replication10 replication
10 replication
 
7 concurrency controltwo
7 concurrency controltwo7 concurrency controltwo
7 concurrency controltwo
 
6 two phasecommit
6 two phasecommit6 two phasecommit
6 two phasecommit
 

Similar to 02 fault tolerance

[iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors? [iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors?
iROCTech
 
Technology overview
Technology overviewTechnology overview
Technology overview
virtuehm
 
Pci lecture 2 software it failures (1)
Pci lecture 2 software it failures (1)Pci lecture 2 software it failures (1)
Pci lecture 2 software it failures (1)
cooper0236
 

Similar to 02 fault tolerance (20)

Get your Lost Data Back Now - Understanding Data Recovery
Get your Lost Data Back Now - Understanding Data RecoveryGet your Lost Data Back Now - Understanding Data Recovery
Get your Lost Data Back Now - Understanding Data Recovery
 
[iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors? [iROC Webinar] Do I Need to Worry About Soft Errors?
[iROC Webinar] Do I Need to Worry About Soft Errors?
 
Technology overview
Technology overviewTechnology overview
Technology overview
 
Optimal+ GSA 2014
Optimal+ GSA  2014Optimal+ GSA  2014
Optimal+ GSA 2014
 
A trial investigation system for vulnerability on M2M network
A trial investigation system for vulnerability on M2M networkA trial investigation system for vulnerability on M2M network
A trial investigation system for vulnerability on M2M network
 
A Trial Investigation System for Vulnerability on M2M Network
A Trial Investigation System for Vulnerability on M2M NetworkA Trial Investigation System for Vulnerability on M2M Network
A Trial Investigation System for Vulnerability on M2M Network
 
Rf technology 5-8-2011-final-revised
Rf technology 5-8-2011-final-revisedRf technology 5-8-2011-final-revised
Rf technology 5-8-2011-final-revised
 
RF to IF Mixers by IDT: Low-power Downconversion Mixers with Zero-Distortion
RF to IF Mixers by IDT: Low-power Downconversion Mixers with Zero-DistortionRF to IF Mixers by IDT: Low-power Downconversion Mixers with Zero-Distortion
RF to IF Mixers by IDT: Low-power Downconversion Mixers with Zero-Distortion
 
ECI OpenFlow 2.0 the Future of SDN
ECI OpenFlow 2.0 the Future of SDN ECI OpenFlow 2.0 the Future of SDN
ECI OpenFlow 2.0 the Future of SDN
 
Electronics Reliability Prediction Using the Product Bill of Materials
Electronics Reliability Prediction Using the Product Bill of MaterialsElectronics Reliability Prediction Using the Product Bill of Materials
Electronics Reliability Prediction Using the Product Bill of Materials
 
Purchase order
Purchase orderPurchase order
Purchase order
 
iperfTZ: Understanding Network Bottlenecks for TrustZone-based Applications
iperfTZ: Understanding Network Bottlenecks for TrustZone-based ApplicationsiperfTZ: Understanding Network Bottlenecks for TrustZone-based Applications
iperfTZ: Understanding Network Bottlenecks for TrustZone-based Applications
 
LM358 dual operational amplifier
LM358 dual operational amplifierLM358 dual operational amplifier
LM358 dual operational amplifier
 
C08 – Updated planning and commissioning guidelines for Profinet - Xaver Sch...
C08 – Updated planning and commissioning guidelines for Profinet -  Xaver Sch...C08 – Updated planning and commissioning guidelines for Profinet -  Xaver Sch...
C08 – Updated planning and commissioning guidelines for Profinet - Xaver Sch...
 
cFrame framework slides
cFrame framework slidescFrame framework slides
cFrame framework slides
 
EDCC14 Keynote, Newcastle 15may14
EDCC14 Keynote, Newcastle 15may14EDCC14 Keynote, Newcastle 15may14
EDCC14 Keynote, Newcastle 15may14
 
Expert Talks Cardiff 2017 - Keeping your ci-cd system as fast as it needs to be
Expert Talks Cardiff 2017 - Keeping your ci-cd system as fast as it needs to beExpert Talks Cardiff 2017 - Keeping your ci-cd system as fast as it needs to be
Expert Talks Cardiff 2017 - Keeping your ci-cd system as fast as it needs to be
 
Pci lecture 2 software it failures (1)
Pci lecture 2 software it failures (1)Pci lecture 2 software it failures (1)
Pci lecture 2 software it failures (1)
 
Improving substation reliability & availability
Improving substation reliability & availability Improving substation reliability & availability
Improving substation reliability & availability
 
RAIDShield of Fast15 slides
RAIDShield of Fast15 slides RAIDShield of Fast15 slides
RAIDShield of Fast15 slides
 

More from ashish61_scs (20)

7 concurrency controltwo
7 concurrency controltwo7 concurrency controltwo
7 concurrency controltwo
 
Transactions
TransactionsTransactions
Transactions
 
22 levine
22 levine22 levine
22 levine
 
21 domino mohan-1
21 domino mohan-121 domino mohan-1
21 domino mohan-1
 
20 access paths
20 access paths20 access paths
20 access paths
 
19 structured files
19 structured files19 structured files
19 structured files
 
17 wics99 harkey
17 wics99 harkey17 wics99 harkey
17 wics99 harkey
 
16 greg hope_com_wics
16 greg hope_com_wics16 greg hope_com_wics
16 greg hope_com_wics
 
15 bufferand records
15 bufferand records15 bufferand records
15 bufferand records
 
14 scaleabilty wics
14 scaleabilty wics14 scaleabilty wics
14 scaleabilty wics
 
10b rm
10b rm10b rm
10b rm
 
10a log
10a log10a log
10a log
 
08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick
 
05 tp mon_orbs
05 tp mon_orbs05 tp mon_orbs
05 tp mon_orbs
 
04 transaction models
04 transaction models04 transaction models
04 transaction models
 
03 fault model
03 fault model03 fault model
03 fault model
 
01 whirlwind tour
01 whirlwind tour01 whirlwind tour
01 whirlwind tour
 
Solution5.2012
Solution5.2012Solution5.2012
Solution5.2012
 
Solution6.2012
Solution6.2012Solution6.2012
Solution6.2012
 
Solution7.2012
Solution7.2012Solution7.2012
Solution7.2012
 

02 fault tolerance

  • 1. Gray & Reuter FT 2: 1. Dependable Computing Systems. Jim Gray, Microsoft, Gray @ Microsoft.com; Andreas Reuter, International University, Andreas.Reuter@i-u.de. Course schedule (sessions at 9:00, 11:00, 1:30, 3:30, 7:00): Mon: Overview, Faults, Tolerance, T Models, Party. Tue: TP mons, Lock Theory, Lock Techniq, Queues, Workflow. Wed: Log, ResMgr, CICS & Inet, Adv TM, Cyberbrick. Thur: Files & Buffers, COM+, Corba, Replication, Party. Fri: B-tree, Access Paths, Groupware, Benchmark.
  • 2. The Airplane Rule: A two-engine airplane has twice as many engine problems. A thousand-engine airplane has thousands of engine problems. Fault tolerance is KEY: mask and repair faults. Example configuration: 100 nodes (1 Tips), 1,000 discs = 10 Terrorbytes, 100 tape transports = 1,000 tapes = 1 PetaByte, high-speed network (10 Gb/s). Internet: node fails every 2 weeks. Vendors: disk fails every 40 years. Here: node “fails” every 20 minutes; disk fails every 2 weeks.
  • 3. Outline: • Does fault tolerance work? • General methods to mask faults. • Software-fault tolerance • Summary
  • 4. DEPENDABILITY: The 3 ITIES (Security, Integrity / Reliability, Availability). • Reliability / Integrity: does the right thing (also large MTTF). • Availability: does it now (also large MTTF / (MTTF + MTTR)). System availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time. • Holistic vs reductionist view.
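The two availability formulas on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names `availability` and `series_availability` are illustrative, and it assumes the terminals and the database fail independently and that a transaction needs both.

```python
def availability(mttf, mttr):
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def series_availability(*parts):
    """Availability of a system that needs every part to be up.

    Assumes the parts fail independently, so the availabilities multiply.
    """
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of the DB up.
system = series_availability(0.90, 0.99)
print(f"system availability: {system:.1%}")
```

The product is 0.891, which is the "about 89%" quoted on the slide.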
  • 5. High Availability System Classes. Goal: Build Class 6 Systems.
      System Type | Unavailable (min/year) | Availability | Availability Class
      Unmanaged | 50,000 | 90% | 1
      Managed | 5,000 | 99% | 2
      Well Managed | 500 | 99.9% | 3
      Fault Tolerant | 50 | 99.99% | 4
      High-Availability | 5 | 99.999% | 5
      Very-High-Availability | .5 | 99.9999% | 6
      Ultra-Availability | .05 | 99.99999% | 7
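The class table follows a simple rule: the class is the number of leading nines in the availability, and the downtime column is the unavailable fraction of a year. The sketch below reproduces it. The helper names are illustrative, and the slide rounds the 525,600 minutes in a year down to 500,000, so the computed minutes come out slightly higher than the table's.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_class(availability):
    """Number of leading nines, i.e. floor(log10(1 / (1 - A))).

    The small epsilon guards against floating-point values such as
    2.9999999999999996 that should count as 3.
    """
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(availability_class(a), round(downtime_minutes_per_year(a), 2))
```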
  • 6. Sources of Failures (MTTF / MTTR). Power failure: 2000 hr / 1 hr. Phone lines, soft: >.1 hr / .1 hr; hard: 4000 hr / 10 hr. Hardware modules: 100,000 hr / 10 hr (many are transient). Software: 1 bug/1000 lines of code (after vendor-user testing) => thousands of bugs in system! Most software failures are transient: dump & restart system. Useful fact: 8,760 hrs/year ~ 10k hr/year.
  • 7. Case Studies - Japan. Source: "Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe). 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES. To get 10 year MTTF, must attack all these problems.
      Cause | Share of outages | MTTF
      Vendor (hardware and software) | 42% | 5 months
      Application software | 25% | 9 months
      Communications lines | 12% | 1.5 years
      Operations | 9.3% | 2 years
      Environment | 11.2% | 2 years
      Overall | | 10 weeks
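The overall 10-week figure is roughly what the per-cause MTTFs predict if failure rates simply add. The sketch below makes that arithmetic explicit. It assumes independent causes with constant failure rates, which the slide does not state, and the dictionary keys are just labels.

```python
# Per-cause MTTF from the survey, converted to months.
MTTF_MONTHS = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 18,  # 1.5 years
    "operations": 24,            # 2 years
    "environment": 24,           # 2 years
}

# With independent causes, failure rates (1 / MTTF) add.
rates = {cause: 1.0 / months for cause, months in MTTF_MONTHS.items()}
total_rate = sum(rates.values())

system_mttf_months = 1.0 / total_rate
system_mttf_weeks = system_mttf_months * 52 / 12

# Share of outages attributable to each cause.
shares = {cause: rate / total_rate for cause, rate in rates.items()}

print(f"system MTTF: {system_mttf_weeks:.1f} weeks")
for cause, share in shares.items():
    print(f"{cause}: {share:.0%}")
```

The computed shares land close to the survey's percentages but do not match them exactly, since the published MTTFs are rounded.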
  • 8. Case Studies - Tandem: Outage Reports to Vendor. Totals: more than 7,000 customer years, more than 30,000 system years, more than 80,000 processor years, more than 200,000 disc years. Systematic under-reporting, but ratios & trends are interesting.
      Summary Tandem EWR Data | 1985 | 1987 | 1989
      Customers | 1000 | 1300 | 2000
      EWR Customers | ? | ? | 267
      Outage Customers | 176 | 205 | 164
      Systems | 2400 | 6000 | 9000
      Processors | 7,000 | 15,000 | 25,500
      Discs | 16,000 | 46,000 | 74,000
      Cases | 305 | 227 | 501
      Reports | 491 | 535 | 766
      Faults | 592 | 609 | 892
      Outages | 285 | 294 | 438
      System MTTF | 8 years | 20 years | 21 years
  • 9. Case Studies - Tandem Trends. MTTF improved: WOW! Outages per millennium. Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%). NOTE: systematic under-reporting of Environment, Operations errors, and Application Software. [Charts for 1985, 1987, 1989: outages per 1000 system years by primary cause; % of outages by primary cause. Causes: unknown, environment, operations, maintenance, hardware, software.]
  • 10. Case Studies - Tandem Trends: Reported MTTF by Component. Remember systematic under-reporting. [Chart for 1985, 1987, 1989: MTTF in years for software, hardware, maintenance, operations, environment, total.]
      Mean Time to System Failure (years) by Cause | 1985 | 1987 | 1990
      SOFTWARE | 2 | 53 | 33
      HARDWARE | 29 | 91 | 310
      MAINTENANCE | 45 | 162 | 409
      OPERATIONS | 99 | 171 | 136
      ENVIRONMENT | 142 | 214 | 346
      SYSTEM | 8 | 20 | 21
  • 11. Summary. Current situation: ~4-year MTTF => fault tolerance works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults. Many hidden software outages in operations: new system software, new application software, utilities. Must make all software ONLINE. Software seems to define a 30-year MTTF ceiling. Reasonable goal: 100-year MTTF. Class 4 today => class 6 tomorrow.
  • 12. Outline: • Does fault tolerance work? • General methods to mask faults. • Software-fault tolerance • Summary
  • 13. Key Idea. {Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}. • Software automates / eliminates operators. So, • in the limit there are only software & design faults. Software-fault tolerance is the key to dependability. INVENT IT!
  • 14. Fault Tolerance Techniques. FAIL-FAST MODULES: work or stop. SPARE MODULES: instant repair time. INDEPENDENT MODULE FAILS by design: MTTF_pair ~ MTTF^2 / MTTR (so want tiny MTTR). MESSAGE-BASED OS: fault isolation; software has no shared memory. SESSION-ORIENTED COMM: reliable messages; detect lost/duplicate messages; coordinate messages with commit. PROCESS PAIRS: mask hardware & software faults. TRANSACTIONS: give A.C.I.D. (simple fault model).
  • 15. Example: the FT Bank. [Diagram: fault-tolerant computer and backup system.] System MTTF > 10 years (except for power & terminals). Modularity & repair are KEY: von Neumann needed 20,000x redundancy in wires and switches; we use 2x redundancy. Redundant hardware can support peak loads (so not redundant).
  • 16. Fail-Fast is Good, Repair is Needed. Lifecycle of a module: fault, detect, repair, return. Fail-fast gives short fault latency. High availability is low unavailability: Unavailability ~ MTTR / MTTF. Improving either MTTR or MTTF gives benefit. Simple redundancy does not help much.
  • 17. Hardware Reliability/Availability (how to make it fail fast). Comparator strategies. Duplex: fail-fast: fail if either fails (e.g. duplexed cpus) vs fail-soft: fail if both fail (e.g. disc, atm, ...). Note: in recursive pairs, parent knows which is bad. Triplex: fail-fast: fail if 2 fail (triplexed cpus); fail-soft: fail if 3 fail (triplexed FailFast cpus). [Diagrams: basic FailFast designs (pair, triplex); recursive designs; recursive availability designs (pair & spare, triple modular redundancy).]
  • 18. Redundant Designs Have Worse MTTF! THIS IS NOT GOOD: variance is lower but MTTF is worse. Simple redundancy does not improve MTTF (sometimes hurts). This is just an example of the airplane rule. With k modules working, the next failure arrives after mttf/k, so without repair:
      Duplex, fail-fast: mttf/2
      Duplex, fail-soft: mttf/2 + mttf/1 = 1.5*mttf
      TMR, fail-fast: mttf/3 + mttf/2 = 5/6*mttf
      TMR, fail-soft: mttf/3 + mttf/2 + mttf/1 = 11/6*mttf
      Pair & Spare, fail-fast: mttf/4 + 0 + mttf/2 = 3/4*mttf
      Pair & Spare, fail-soft: mttf/4 + mttf/3 + mttf/2 + mttf/1 ~ 2.1*mttf
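The figures on this slide can be reproduced by summing the expected time spent in each state of the failure chain. The following is a sketch, assuming independent modules with exponentially distributed lifetimes and no repair. The function name `mttf_without_repair` is illustrative, and results are in units of a single module's mttf.

```python
from fractions import Fraction


def mttf_without_repair(n_modules, min_working):
    """Expected time until fewer than `min_working` of `n_modules` still work.

    With k modules working, the next failure arrives after mttf/k on
    average, so the total is the sum of 1/k over the surviving states.
    """
    return sum(Fraction(1, k) for k in range(min_working, n_modules + 1))


designs = {
    "duplex fail-fast": mttf_without_repair(2, 2),
    "duplex fail-soft": mttf_without_repair(2, 1),
    "TMR fail-fast": mttf_without_repair(3, 2),
    "TMR fail-soft": mttf_without_repair(3, 1),
    # Pair & spare, fail-fast: the first of 4 failures kills one pair
    # (mttf/4); the surviving pair then stops at its first failure (mttf/2).
    "pair & spare fail-fast": Fraction(1, 4) + Fraction(1, 2),
    "pair & spare fail-soft": mttf_without_repair(4, 1),
}

for name, value in designs.items():
    print(f"{name}: {value} = {float(value):.2f} * mttf")
```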
  • 19. Add Repair: Get 10^4 Improvement. Same state diagrams as the previous slide, with a repair transition (mttr) added from each degraded state back to the state with one more working module. Results: Duplex, fail-fast: mttf/2. Duplex, fail-soft: 10^4 mttf. TMR, fail-fast: 10^4 mttf. TMR, fail-soft: 10^5 mttf.
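To see where an improvement of this size comes from, the duplex fail-soft case can be solved as a three-state Markov chain (2 working, 1 working, 0 working). The sketch below is a standard textbook derivation rather than something taken from the slides. It assumes exponential failure and repair times, and the example numbers (100,000 hr MTTF, 10 hr MTTR) are the hardware-module figures from the Sources of Failures slide.

```python
def duplex_failsoft_mttf_with_repair(mttf, mttr):
    """Mean time until both modules of a repairable fail-soft pair are down.

    States: 2 working --(2*lam)--> 1 working --(lam)--> 0 working,
    with repair 1 working --(mu)--> 2 working.
    Solving the two mean-time equations gives (3*lam + mu) / (2*lam**2).
    """
    lam = 1.0 / mttf  # failure rate of one module
    mu = 1.0 / mttr   # repair rate
    return (3.0 * lam + mu) / (2.0 * lam ** 2)


def duplex_failsoft_mttf_approx(mttf, mttr):
    """Approximation for mttr << mttf: mttf**2 / (2 * mttr)."""
    return mttf ** 2 / (2.0 * mttr)


exact = duplex_failsoft_mttf_with_repair(100_000, 10)
approx = duplex_failsoft_mttf_approx(100_000, 10)
print(f"pair MTTF: {exact:.3e} hr, about {exact / 100_000:.0f} x module MTTF")
```

Under these assumptions the pair lasts thousands of times longer than a single module, the same order of magnitude as the slide's 10^4. Letting the repair time grow without bound recovers the 1.5*mttf of the no-repair case.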
  • 20. When To Repair? Chances of tolerating a fault are 1000:1 (class 3). A 1995 study: processor & disc rated at ~10 khr MTTF. Computed single failures vs observed double fails: 10k processor fails, 14 double, ratio ~1000:1; 40k disc fails, 26 double, ratio ~1000:1. Hardware maintenance: on-line maintenance "works" 999 times out of 1000. The chance a duplexed disc will fail during maintenance ~ 1:1000. Risk is 30x higher during maintenance => do it off peak hour. Software maintenance: repair only virulent bugs; wait for next release to fix benign bugs.
  • 21. OK: So Far. Hardware fail-fast is easy. Redundancy plus repair is great (class 7 availability). Hardware redundancy & repair is via modules. How can we get instant software repair? We know how to get reliable storage: RAID or dumps and transaction logs. We know how to get available storage: fail-soft duplexed discs (RAID 1...N). ? HOW DO WE GET RELIABLE EXECUTION? ? HOW DO WE GET AVAILABLE EXECUTION?
  • 22. Outline: • Does fault tolerance work? • General methods to mask faults. • Software-fault tolerance • Summary
  • 23. Software Techniques: Learning from Hardware. Most outages in fault-tolerant systems are SOFTWARE. Fault avoidance techniques: good & correct design. After that, software fault tolerance techniques: Modularity (isolation, fault containment). Design diversity, N-version programming: N different implementations. Defensive programming: check parameters and data. Auditors: check data structures in background. Transactions: to clean up state after a failure. Paradox: need fail-fast software.
  • 24. Fail-Fast and High-Availability Execution. Software N-plexing: design diversity. N-version programming: write the same program N times (N > 3); compare outputs of all programs and take majority vote. Process pairs: instant restart (repair). Use defensive programming to make a process fail-fast. Have restarted process ready in separate environment. Second process “takes over” if primary faults. Transaction mechanism can clean up distributed state if takeover happens in the middle of a computation. [Diagram: a session connects to a logical process = process pair; the primary process sends state information to the backup process.]
  • 25. What Is MTTF of N-Version Program? First fails after MTTF/N, second fails after MTTF/(N-1), ..., so the total is MTTF * (1/N + 1/(N-1) + ... + 1/2). The harmonic series goes to infinity, but VERY slowly: for example, 100-version programming gives ~4x the MTTF of 1-version programming. Reduces variance. N-version programming needs REPAIR: if a program fails, must reset its state from other programs => programs have common data/state representation. How does this work for database systems? Operating systems? Network systems? Answer: I don’t know.
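The "~4" on this slide is the partial harmonic sum from 1/2 to 1/100. A short sketch to check it follows. The function name `n_version_mttf_factor` is illustrative, and the model is the slide's own: independent versions, no repair, system stops when only one version is left.

```python
def n_version_mttf_factor(n):
    """MTTF of an n-version program in units of one version's MTTF.

    Sums 1/n + 1/(n-1) + ... + 1/2: the expected time spent with
    n, n-1, ..., 2 versions still running.
    """
    return sum(1.0 / k for k in range(2, n + 1))


for n in (3, 10, 100, 1000):
    print(n, round(n_version_mttf_factor(n), 2))
```

Going from 100 to 1000 versions adds only a little over 2 more module lifetimes, which is the "VERY slowly" in the slide.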
  • 26. Why Process Pairs Mask Faults. Many software faults are soft after: design review, code inspection, alpha test, beta test, 10k hrs of gamma test (production). Most software faults are transient: MVS Functional Recovery Routines 5:1; Tandem Spooler 100:1; Adams >100:1. Terminology: Heisenbug: works on retry. Bohrbug: faults again on retry. Sources: Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984. Gray: "Why Do Computers Stop", Tandem TR85.7, 1985. Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.
  • 27. Process Pair Repair Strategy. If the software fault (bug) is a Bohrbug, then there is no repair: “wait for the next release” or “get an emergency bug fix” or “get a new vendor”. If the software fault is a Heisenbug, then repair is: reboot and retry, or switch to backup process (instant restart). PROCESS PAIRS tolerate hardware faults and Heisenbugs. Repair time is seconds, could be milliseconds if time is critical. Flavors of process pair: Lockstep, Automatic, State Checkpointing, Delta Checkpointing, Persistent. [Diagram: a session connects to a logical process = process pair; the primary process sends state information to the backup process.]
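A toy model of the state-checkpointing flavor shows why takeover masks a Heisenbug. Everything in this sketch is hypothetical (the `Process` and `ProcessPair` classes, the `fail_on_calls` fault injector, the balance example). It runs in a single Python process, whereas a real process pair lives in separate processors and exchanges checkpoint messages.

```python
class TransientFault(Exception):
    """Raised by a fail-fast process instead of continuing in a bad state."""


class Process:
    def __init__(self, name, fail_on_calls=frozenset()):
        self.name = name
        self.state = {"balance": 0}
        self.calls = 0
        self.fail_on_calls = fail_on_calls  # injected Heisenbugs (hypothetical)

    def handle(self, amount):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            # Fail fast: stop before touching any state.
            raise TransientFault(f"{self.name} faulted on call {self.calls}")
        self.state["balance"] += amount
        return self.state["balance"]


class ProcessPair:
    """Logical process = primary + backup kept current by checkpoints."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup
        self.takeovers = 0

    def request(self, amount):
        try:
            result = self.primary.handle(amount)
        except TransientFault:
            # Takeover: the backup resumes from the last checkpoint and
            # retries. A Heisenbug does not recur; a Bohrbug would fault again.
            self.primary, self.backup = self.backup, self.primary
            self.takeovers += 1
            result = self.primary.handle(amount)
        # State checkpoint: copy the whole state to the backup.
        self.backup.state = dict(self.primary.state)
        return result


pair = ProcessPair(Process("primary", fail_on_calls={2}), Process("backup"))
results = [pair.request(100), pair.request(50), pair.request(25)]
print(results, "takeovers:", pair.takeovers)
```

The caller sees three successful replies even though the original primary faulted on its second call. Copying the whole state on every request corresponds to state checkpointing; delta checkpointing would send only the changes.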
  • 28. How Takeover Masks Failures. Server resets at takeover, but what about application state? Database state? Network state? Answer: use transactions to reset state! Abort transaction if process fails. Keeps network "up". Keeps system "up". Reprocesses some transactions on failure. [Diagram: a session connects to a logical process = process pair; the primary process sends state information to the backup process.]
  • 29. PROCESS PAIRS - SUMMARY. Transactions give reliability. Process pairs give availability. Process pairs are expensive & hard to program. Transactions + persistent process pairs => fault-tolerant sessions & execution. When Tandem converted to this style: saved 3x messages, saved 5x message bytes, made programming easier.
  • 30. SYSTEM PAIRS FOR HIGH AVAILABILITY. Programs, data, processes replicated at two sites (primary and backup). Pair looks like a single system. System becomes a logical concept. Like process pairs: system pairs. Backup receives transaction log (spooled if backup down). If primary fails or operator switches, backup offers service.
  • 31. SYSTEM PAIR CONFIGURATION OPTIONS. Mutual backup: each has 1/2 of database & application. Hub: one site acts as backup for many others. In general, can be any directed graph. Stale replicas: lazy replication. [Diagrams of primary, backup, and copy sites for each option.]
  • 32. SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE. Step 1: Both systems are running V1 (primary V1, backup V1). Step 2: Backup is cold-loaded as V2 (primary V1, backup V2). Step 3: SWITCH to backup (the V2 system becomes primary, the V1 system becomes backup). Step 4: Backup is cold-loaded as V2 (primary V2, backup V2). Similar ideas apply to: database reorganization; hardware modification (e.g. add discs, processors, ...); hardware maintenance; environmental changes (rewire, new air conditioning); move primary or backup to new location.
  • 33. SYSTEM PAIR BENEFITS. Protects against ENVIRONMENT: different sites (weather, utilities, sabotage). Protects against OPERATOR FAILURE: two sites, two sets of operators. Protects against MAINTENANCE OUTAGES: work on backup (software/hardware install/upgrade/move...). Protects against HARDWARE FAILURES: backup takes over. Protects against TRANSIENT SOFTWARE ERRORS. Commercial systems: Digital's Remote Transaction Router (RTR); Tandem's Remote Database Facility (RDF); IBM's Cross Recovery XRF (both in same campus); Oracle, Sybase, Informix, Microsoft... replication.
  • 34. SUMMARY. FT systems fail for the conventional reasons: environment mostly, people sometimes, software mostly, hardware rarely. MTTF of FT SYSTEMS ~ 50X conventional (~ years vs weeks). Fail-fast modules + reconfiguration + repair => good hardware fault tolerance. Transactions + process pairs => good software fault tolerance (repair). System pairs hide many faults. Challenge: tolerate human errors (make system simpler to manage, operate, and maintain).
  • 35. Key Idea. {Architecture, Software, Distribution} masks {Hardware Faults, Environmental Faults, Maintenance}. • Software automates / eliminates operators. So, • in the limit there are only software & design faults. Software-fault tolerance is the key to dependability. INVENT IT!
  • 36. References
      Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
      Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
      Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
      Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
      Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
      Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
      Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
      Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
      Long, D. D., J. L. Carroll, and C. J. Park (1991). A study of the reliability of Internet sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.