This presentation discusses graceful degradation and fault tolerance in computing systems. It covers topics like degradation allowance, diagnosis and isolation of faulty components, checkpointing and rollback for recovery, and optimal checkpoint insertion. Checkpointing involves periodically saving the state of a computation to allow restarting from the last checkpoint if a fault occurs. The optimal number of checkpoints balances the overhead of checkpointing against the work lost due to rollbacks. Various data distribution techniques like replication and dispersion are also presented to improve the reliability of data storage.
This presentation was delivered in a fault tolerance class and discusses achieving fault tolerance in databases through replication. Several commercial databases were studied to see what approaches they took to replication. Based on that study, an architecture was suggested for a military database design using an asynchronous approach and cluster patterns.
Introduction to Reliability Evaluation Techniques –
Reliability Models for Hardware Redundancy –
Permanent faults only - Transient faults.
Introduction to clock synchronization –
A Non-Fault-Tolerant Synchronization Algorithm –
Fault-Tolerant Synchronization in Hardware –
Completely connected zero propagation time system –
Sparse interconnection zero propagation time system –
Fault tolerant analysis with Signal Propagation delays.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1kS4Bkv.
Joe Armstrong describes the foundations of fault tolerant computation and the basic properties a system should have in order to be able to function in an adequate manner despite the occurrence of hardware and software errors, summarizing the key features of Erlang and showing how they can be used for programming fault-tolerant and scalable systems on multi-core clusters. Filmed at qconlondon.com.
Joe Armstrong is the principal inventor of the Erlang programming language and coined the term "Concurrency Oriented Programming". He worked for Ericsson, where he developed Erlang and was the chief software architect of the project that produced the Erlang OTP system. He is the author of several books, the latest being "Programming Erlang: Software for a Concurrent World, 2nd edition".
Real Time Systems – Issues in Real Time Computing – Structure of a real time system – Process – Task – Threads.
Classification of Tasks – Task Periodicity – Periodic Tasks- Sporadic Tasks – Aperiodic Tasks – Task Scheduling –
Classification of Scheduling Algorithms – Event Driven Scheduling – Rate monotonic scheduling – Earliest deadline first scheduling.
Inter Process Communication:- Shared data problem, Use of Semaphore(s), Priority Inversion Problem and Deadlock Situations -
Just because your application is containerized, Kubernetized, and cloudified doesn't mean it is production grade and disruption free. You have to use all the features Kubernetes provides to make it truly production ready.
In these slides, I explain the possible disruptions that can happen in your Kubernetes cluster and impact your application or workloads, and then the Kubernetes features and configuration that help make your application production grade.
Mike Bartley – Innovations for Testing Parallel Software – EuroSTAR 2012 (TEST Huddle)
EuroSTAR Software Testing Conference 2012 presentation on Innovations for Testing Parallel Software by Mike Bartley.
See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
In this paper we describe paradigms for building and designing parallel computing machines. First, we
elaborate the uniqueness of the MIMD model for the execution of diverse applications. Then we compare the
General Purpose Architecture of Parallel Computers with Special Purpose Architecture of Parallel
Computers in terms of cost, throughput and efficiency. Then we describe how Parallel Computer
Architecture employs parallelism and concurrency through pipelining. Since Pipelining improves the
performance of a machine by dividing an instruction into a number of stages, therefore we describe how
the performance of a vector processor is enhanced by employing multi pipelining among its processing
elements. We also elaborate the RISC architecture and pipelining in RISC machines. After comparing
RISC computers with CISC computers, we observe that although the high speed of RISC computers is very
desirable, the significance of a computer's speed depends on implementation strategies. CPU clock speed is
not the only parameter in moving system software from CISC to RISC computers; other parameters should
also be considered, such as instruction size and format, addressing modes, complexity of instructions, and
the machine cycles required by instructions. Considering all parameters will give a performance
gain. We discuss multiprocessor and data flow machines in a concise manner. Then we discuss three
SIMD (Single Instruction stream Multiple Data stream) machines which are DEC/MasPar MP-1, Systolic
Processors and Wavefront array Processors. The DEC/MasPar MP-1 is a massively parallel SIMD array
processor. A wide variety of number representations and arithmetic systems for computers can be
implemented easily on the DEC/MasPar MP-1 system. The principal advantages of using such 64×64
SIMD array of 4-bit processors for the implementation of a computer arithmetic laboratory arise out of its
flexibility. After comparing systolic processors with wavefront processors, we find that both are fast and
implemented in VLSI. The major drawback of systolic processors is the availability of inputs at clock ticks
because of propagation delays in the connection buses. Wavefront processors combine the systolic processor
architecture with the data flow machine architecture. Although wavefront processors use an asynchronous
data flow computing structure, the timing in the interconnection buses, at the input, and at the output is
not problematic.
Big Data Day LA 2016 / Big Data Track – Portable Stream and Batch Processing w... (Data Con LA)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea... (Flink Forward)
Apache Beam is Flink’s sibling in the Apache family of streaming processing frameworks. The Beam and Flink teams work closely together on advancing what is possible in streaming processing, including Streaming SQL extensions and code interoperability on both platforms.
Beam was originally developed at Google as the amalgamation of its internal batch and streaming frameworks to power the exabyte-scale data processing for Gmail, YouTube and Ads. It now powers a fully-managed, serverless service Google Cloud Dataflow, as well as is available to run in other Public Clouds and on-premises when deployed in portability mode on Apache Flink, Spark, Samza and other runners. Users regularly run distributed data processing jobs on Beam spanning tens of thousands of CPU cores and processing millions of events per second.
In this session, Sergei Sokolenko, Cloud Dataflow product manager, and Reuven Lax, the founding member of the Dataflow and Beam team, will share Google’s learnings from building and operating a global streaming processing infrastructure shared by thousands of customers, including:
safe deployment to dozens of geographic locations,
resource autoscaling to minimize processing costs,
separating compute and state storage for better scaling behavior,
dynamic work rebalancing of work items away from overutilized worker nodes,
offering a throughput-optimized batch processing capability with the same API as streaming,
grouping and joining of 100s of terabytes in a hybrid in-memory/on-disk file system,
integrating with the Google Cloud security ecosystem, and other lessons.
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Pipelining understanding:
Pipelining is running multiple stages of the same process in parallel in a way that efficiently uses
all the available hardware while respecting the dependencies of each stage upon the previous
stages. In the laundry example, the stages are washing, drying, and folding. By starting a wash
stage as soon as the previous wash stage is moved to the dryer, the idle time of the washer is
minimized. Notice that the wash stage takes less time than the dry stage, so the wash stage must
remain idle until the dry stage finishes: the steady state throughput of the pipeline is limited by
the slowest stage in the pipeline. This can be mitigated by breaking up the bottleneck stage into
smaller sub-stages. For those less concerned with laundry-based examples, consider a video
game. The CPU computes the keyboard/mouse input each frame and moves the camera
accordingly, then the GPU takes that information and actually renders the scene; meanwhile, the
CPU has already begun calculating what's going to happen in the next frame.
How pipelining is done:
In class, we mentioned that interpreting each computer instruction is a four step process: fetching
the instruction, decoding it and reading the register, executing it, and recording the results. Each
instruction may take 4 cycles to complete, but if our throughput is one instruction each cycle,
then we would like to perform, on average, n instructions every n cycles. To accomplish
this, we can split up an instruction's work into the 4 different steps so that other pieces of
hardware work to decode, execute, and record results while the CPU performs the fetch. The
latency to process each instruction is fixed at 4 cycles, so by processing a new instruction every
cycle, after four cycles, one instruction has been completed and three are "in progress" (they're
in the pipeline). After many cycles the steady state throughput approaches one completed
instruction every cycle.
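The cycle counts described above can be checked with a short sketch (a toy model of a hypothetical 4-stage pipeline; the function names are ours, not the text's):

```python
def cycles_unpipelined(n_instructions, n_stages=4):
    # Each instruction occupies the hardware for all stages before the next begins.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=4):
    # Fill the pipeline once (n_stages cycles), then retire one instruction per cycle.
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 4, 1000):
    print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For 1000 instructions the pipelined count is 1003 cycles versus 4000 unpipelined, so the steady-state throughput approaches one instruction per cycle, as the text argues.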
An assembly line in an auto manufacturing plant is another good example of a pipelined process.
There are many steps in the assembly of the car, each of which is assigned a stage in the pipeline.
Typically the depth of these pipelines is very large: cars are pretty complex, so there need to be a
lot of stages in the assembly line. The more stages, the longer it takes to crank the system up to a
steady state. The larger the depth, the more costly it is to turn the system around: A branch
misprediction in an instruction pipeline would be like getting one of the steps wrong in the
assembly line: all the cars affected would have to go back to the beginning of the assembly line
and be processed again.
OnLive Example[Realtime]:
OnLive is a company that allows gamers to play video games in the cloud. The games are run on
one of the company's server farms, and video of the game is sent back to your computer. The
idea is that even the lamest of computers can run the most highly intensive games because all the
computer does .
UNIT IV FAILURE RECOVERY AND FAULT TOLERANCE 9
Basic Concepts – Classification of Failures – Basic Approaches to Recovery; Recovery in Concurrent Systems; Synchronous and Asynchronous Checkpointing and Recovery; Checkpointing in Distributed Database Systems; Fault Tolerance; Issues – Two-phase and Nonblocking Commit Protocols; Voting Protocols; Dynamic Voting Protocols;
Software-defined migration: how to migrate a bunch of VMs and volumes within a... (OPNFV)
Kentaro Matsumoto, KDDI Corporation, Hyde Sugiyama, Red Hat, Inc
As a telecom carrier, we at KDDI have been managing thousands of physical servers and running various kinds of workloads. In operating such a huge environment, we are frequently required to shut down our servers for maintenance, but it is not easy to negotiate with our tenant users to allow downtime. To make this easier, we are developing a structure called "Zone Migration", using the framework of the OpenStack project "Watcher". "Zone Migration" makes it possible to migrate tenants' workloads from the compute nodes and storage devices we want to maintain (source zone) to new blank ones (destination zone) efficiently, automatically, and with minimum downtime.
The following requirements are realized:
-A lot of VMs and volumes should be migrated within a limited time frame
-Operations should be automated, but also can be controlled manually
-Time and load of migration should be under control so that tenants’ systems will not be affected
We are proceeding with the project in cooperation with NEC and Red Hat, and developing this structure on Red Hat OpenStack Platform.
Embedded Linux Conference 2013
https://github.com/ystk/sched-deadline/tree/dlmiss-detection-dev
Real-time systems need to meet deadlines. From this point of view, a system requires two functions to achieve determinism: interrupt latency stabilization and processing time reservation. SCHED_DEADLINE has a feature to reserve CPU time in advance to ensure predictable behavior. However, it lacks a feature to control processes that miss their deadlines.
In this presentation, we would like to discuss the requirement for the feature and also show a sample implementation to control deadline missed processes.
Immunizing Image Classifiers Against Localized Adversary Attacks
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Student information management system project report II (Kamal Acharya)
Our project explains student management. It mainly covers the various actions related to student details, providing an easy way to add, edit, and delete student details, and a less time-consuming process for viewing, adding, editing, and deleting the marks of the students.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co-editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
CFD Simulation of By-pass Flow in a HRSG module (R&R Consult)
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Nov. 2020 – Part VI, Degradations: Behavioral Lapses
21 Degradation Allowance
[Cartoon captions: "Our computers are down, so we have to do everything manually..." / "Redundancy is such an ugly word. Let's talk about your 'employment crunch'."]
Robust Parallel Processing
Resilient Algorithms
21.1 Graceful Degradation
Degradation allowance
Diagnose malfunctions and provide capability for the system to work
without the modules which are malfunctioning
Degradation management
Adapt: Prioritize tasks and redistribute load
Monitor: Keep track of system operation in degraded mode
Reverse: Return system to the intact (or less degraded) state ASAP
Return: Go back to normal operation
Terminology: n. Graceful degradation
adj. Gracefully degrading/degradable = fail-soft
Strategies for failure prevention
1. Quick malfunction diagnosis
2. Effective isolation of malfunctioning elements
3. On-line repair (preferably via hot-pluggable modules)
4. Avoidance of catastrophic malfunctions
Degradation Allowance Is Not Automatic
A car possessing extra wheels compared with the minimum number
required does not guarantee that it can operate with fewer wheels
Performability of a Fail-Soft System
Area under the performance curve represents work accomplished
On-line repair: Done by removal/replacement of affected modules in a way that does not disrupt the operation of the remaining system parts
Off-line repair: Involves shutting down the entire system while affected modules are removed and replacements are plugged in
[Figure: performance vs. time, with intact, degraded, partially failed, and crashed levels; fail-soft operation with on-line repair avoids the crashed interval that off-line repair entails]
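The "area under the performance curve" notion above can be made concrete with a few lines (the performance levels and durations below are made-up illustration values, not from the slide):

```python
def work_done(segments):
    """Area under a piecewise-constant performance curve.
    segments: list of (duration, performance_level) pairs."""
    return sum(d * p for d, p in segments)

# Hypothetical 100-hour window with a malfunction at t = 40 h.
fail_soft = work_done([(40, 1.0), (50, 0.6), (10, 1.0)])  # degraded, on-line repair
crash     = work_done([(40, 1.0), (50, 0.0), (10, 1.0)])  # crashed, off-line repair
print(fail_soft, crash)
```

With these numbers the fail-soft system accomplishes 80 units of work versus 50 for the system that crashes, which is exactly the area argument the slide makes.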
21.2 Diagnosis, Isolation, and Repair
Diagnose the malfunction
Remove the malfunctioning unit (physical isolation?)
Update system (OS) tables
Initiate repair, if applicable
Create new working configuration
Exclude processor, channel, controller, I/O device (e.g., sensor)
Avoid bad tracks on disk, garbled files, noisy communication links
Remove parts of the memory via virtual address mapping
Bypass a cache or use only half of it (more restricted mapping?)
[Figure: a unit isolated from the bus by Guard 1 and Guard 2]
Recover processes and associated data
Recover state information from removed unit
Initialize any new resource brought on-line
Reactivate processes (via rollback or restart)
Additional steps are needed to return repaired units to operating status
21.3 Stable Storage
Storage that won’t lose its contents (unlike registers and SRAM/DRAM)
Combined stability and reliability can be provided with RAID-like methods
Possible implementation methods: battery backup for a duration long enough to save the contents of a disk cache or other volatile memory; flash memory
Malfunction-Stop Modules
Malfunction tolerance would be much easier if modules simply stopped functioning, rather than engaging in arbitrary behavior
Unpredictable (Byzantine) malfunctions are notoriously hard to handle
Assuming the availability of reliable stable storage along with its controlling s-process and (approximately) synchronized clocks, a k-malfunction-stop module can be implemented from k + 1 units
Operation of the s-process to decide whether the module has stopped:
R := bag of received requests with appropriate timestamps
if |R| = k + 1 and all requests are identical and come from different sources and not stop
then if the request is a write
     then perform the write operation in stable storage
     else if the request is a read, send the value to all processes
else set the variable stop in stable storage to TRUE
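A minimal sketch of this s-process decision rule in Python (the data shapes and names are our assumptions; the slide gives only pseudocode, and real implementations must also validate timestamps):

```python
def s_process_decide(requests, k, stop):
    """Decide whether to honor a request or halt the module.
    requests: list of (source, request) pairs received with valid timestamps.
    Returns (new_stop, action): action is the agreed request, or None on halt."""
    sources = {src for src, _ in requests}
    payloads = {req for _, req in requests}
    if (len(requests) == k + 1 and len(payloads) == 1
            and len(sources) == k + 1 and not stop):
        return stop, next(iter(payloads))  # unanimous: honor the request
    return True, None                      # disagreement or already stopped: halt

stop, action = s_process_decide([(0, "write x=5"), (1, "write x=5")], k=1, stop=False)
print(stop, action)
```

With k = 1, two identical requests from different units are honored; any disagreement among the k + 1 units sets stop to TRUE, which is what turns arbitrary misbehavior into clean stopping.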
21.4 Process and Data Recovery
Use of logs with process restart
Impossible when the system operates in real time and performs actions that cannot be undone
Such actions must be compensated for as part of degradation management
21.5 Checkpointing and Rollback
Early computers had a short MTTF
It was impossible to complete any computation that ran for several hours
If MTTF is shorter than the running time, many restarts may be needed
[Figure: a long-running computation on a time axis, with checkpoints #1–#3 spaced at roughly MTTF intervals]
Checkpointing entails some overhead
Too few checkpoints would lead to a lot of wasted work
Too many checkpoints would lead to a lot of overhead
Checkpoints are placed at convenient points along the computation path (not necessarily at equal intervals)
Why Checkpointing Helps
A computation’s running time is T = 2 MTTF = 2/λ. What is the probability that we can finish the computation in time 2T:
a. Assuming no checkpointing
b. Assuming checkpointing at regular intervals of T/2
Ignore all overheads.
[Figure: the length-T computation on a 2T time axis, with checkpoints #1–#3 at T/2 intervals; per-attempt success probabilities are e^–2 ≈ 0.135335 for the full computation and e^–1 ≈ 0.367879 for a half-length segment]
a. 25% success probability
b. 47% success probability
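One way to reproduce the 25% and 47% figures is to read the example as discrete attempt slots (our reading of the slide's model, with all overheads ignored): without checkpointing, the 2T budget holds two independent length-T attempts; with checkpoints every T/2, it holds four length-T/2 slots, any two of which suffice because completed halves are saved.

```python
import math

MTTF = 1.0                    # mean time to failure; failure rate lam = 1/MTTF
T = 2 * MTTF                  # computation length from the example
# (a) No checkpointing: two attempts of length T, each failure-free
#     with probability e^(-T/MTTF) = e^-2.
p_slot = math.exp(-T / MTTF)
p_no_chk = 1 - (1 - p_slot) ** 2
# (b) Checkpoints every T/2: four length-T/2 slots, each failure-free
#     with probability e^-1; at least 2 successes finish the work.
q = math.exp(-(T / 2) / MTTF)
p_chk = sum(math.comb(4, k) * q**k * (1 - q)**(4 - k) for k in range(2, 5))
print(round(p_no_chk, 2), round(p_chk, 2))
```

This yields 0.25 and 0.47, matching the slide's two answers.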
Recovery via Rollback
Rollback or restart creates no problem for tasks that do I/O at the end
Interactive processes must be handled with more care
e.g., a bank ATM transaction to withdraw money or transfer funds (check balance, reduce balance, dispense cash or increase balance)
Roll back process 2 to the last checkpoint (#2)
Restart process 6
[Figure: six processes on a time axis with checkpoints #1 and #2; a detected malfunction affects processes 2 and 6]
Checkpointing for Data
Consider data objects stored on a primary site and k backup sites (with appropriate design, such a scheme will be k-malfunction-tolerant)
Each access request is sent to the primary site
A read request is honored immediately by the primary site
One way to deal with a write request:
Update requests are sent to the backup sites
The request is honored after all messages are ack’ed
If one or more backup sites malfunction, service is not disrupted
If the primary site malfunctions, a new primary site is “elected” (distributed election algorithms exist that can tolerate malfunctions)
Analysis by Huang and Jalote (time spent in each state is a function of k):
Normal state (primary OK, data available)
Recovery state (primary site is changing)
Checkpoint state (primary doing back-up)
Idle state (no site is OK)
k   Availability
0   0.922
1   0.987
2   0.996
4   0.997
8   0.997
Alternative: Primary site does frequent back-ups
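A toy sketch of the primary/backup scheme described above (the class and method names are ours; a real protocol needs timeouts, explicit acks, and a distributed election algorithm, all of which are elided here):

```python
class ReplicatedStore:
    """Toy primary/backup store: a write is honored only after every live
    backup has applied it; reads go to the primary alone."""
    def __init__(self, k):
        self.primary = {}
        self.backups = [dict() for _ in range(k)]
        self.alive = [True] * k

    def write(self, key, value):
        for b, up in zip(self.backups, self.alive):
            if up:                  # malfunctioning backups are skipped
                b[key] = value      # the "ack" is implicit in this sketch
        self.primary[key] = value   # honor the request after all acks

    def read(self, key):
        return self.primary.get(key)

    def fail_backup(self, i):
        self.alive[i] = False

    def promote(self, i):
        """Stand-in for electing backup i as the new primary."""
        self.primary = self.backups[i]

store = ReplicatedStore(k=2)
store.write("x", 1)
store.fail_backup(0)   # a backup malfunction does not disrupt service
store.write("y", 2)
store.promote(1)       # primary malfunction: a surviving backup takes over
print(store.read("x"), store.read("y"))
```

After one backup fails and the other is promoted, both writes survive, illustrating why the scheme tolerates up to k site malfunctions.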
17. Nov. 2020 Part VI – Degradations: Behavioral Lapses Slide 17
Asynchronous Distributed Checkpointing
For noninteracting processes, asynchronous checkpoints not a problem
Process 1
Process 3
Process 2
Process 4
Process 5
Time
I
Chkpt 1.1
Detected
malfunction
Affected
process
Process 6
I
Chkpt 1.2
I
Chkpt 1.3
I
Chkpt. 2.1
I
Chkpt 3.1
I
Chkpt 3.2
I
Chkpt 5.1
I
Chkpt 5.2
I
Chkpt 4.1
e3.2
e1.1
e5.1
e3.1 e3.3
e5.3
e1.2
e5.2
When one process is rolled back, other processes may have to be
rolled back also, and this has the potential of creating a domino effect
Identifying a consistent set of checkpoints (recovery line) is nontrivial
21.6 Optimal Checkpoint Insertion
There is a clear tradeoff in the decision regarding checkpoints
Too few checkpoints lead to long rollbacks in the event of a malfunction
Too many checkpoints lead to excessive time overhead
As in many other engineering problems, there is a happy medium
[Graph: as the number of checkpoints grows, checkpointing overhead increases while the expected work for recovering from a malfunction decreases; their sum is minimized at an intermediate point]
Optimal Checkpointing for Long Computations
T = Total computation time without checkpointing
q = Number of computation segments; there will be q – 1 checkpoints
Tcp = Time needed to capture a checkpoint snapshot
λ = Malfunction rate
[Timeline: the computation runs from Start at time 0 to End at time T, divided into q segments of length T/q, with Chkpt #1, Chkpt #2, ..., Chkpt #(q – 1) between segments]

Discrete Markov model: each segment completes in one time step of length T/q with probability 1 – λT/q, so the expected length of stay in each state is 1/(1 – λT/q) time steps
Computation time with checkpointing:
Ttotal = T/(1 – λT/q) + (q – 1)Tcp = T + λT²/(q – λT) + (q – 1)Tcp

dTtotal/dq = –λT²/(q – λT)² + Tcp = 0 ⇒ qopt = T(λ + (λ/Tcp)^1/2)

Example: T = 200 hr, λ = 0.01/hr, Tcp = 1/8 hr
qopt = 200(0.01 + (0.01/0.125)^1/2) ≈ 59; Ttotal ≈ 214 hr

Warning: the model is accurate only when T/q << 1/λ
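The tradeoff can be sanity-checked numerically. Below is a minimal sketch (function and variable names are mine, not from the slides) that evaluates Ttotal and the closed-form optimum:

```python
import math

def t_total(T, lam, t_cp, q):
    # Ttotal = T + lam*T^2/(q - lam*T) + (q - 1)*Tcp, valid for q > lam*T
    return T + lam * T**2 / (q - lam * T) + (q - 1) * t_cp

def q_opt(T, lam, t_cp):
    # Setting dTtotal/dq = 0 gives q_opt = T*(lam + sqrt(lam/Tcp))
    return T * (lam + math.sqrt(lam / t_cp))

T, lam, t_cp = 200.0, 0.01, 1 / 8      # hr, per hr, hr
q_star = q_opt(T, lam, t_cp)           # about 58.6 segments
best = t_total(T, lam, t_cp, round(q_star))
```

Scanning q over a wide range confirms that checkpointing far too rarely or far too often costs noticeably more than the optimum.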
Elaboration on Optimal Computation Checkpoints
T = Total computation time without checkpointing (Example: 200 hr)
q = Number of computation segments; there will be q – 1 checkpoints
Tcp = Time needed to capture a checkpoint snapshot (Range: 1/8 to 10 hr)
λ = Malfunction rate (Example: 0.01/hr)

Computation time with checkpointing: Ttotal = T + λT²/(q – λT) + (q – 1)Tcp
dTtotal/dq = –λT²/(q – λT)² + Tcp = 0 ⇒ qopt = T(λ + (λ/Tcp)^1/2)
d²Ttotal/dq² = 2λT²/(q – λT)³ > 0 ⇒ bowl-like curve for Ttotal, with a minimum
Example (Ttotal, in hours, for T = 200 hr and λ = 0.01/hr):

  Tcp    q = 4      6      8     10     20     30     40     50
    0      400    300    267    250    222    214    211    208
  1/8      400    301    267    251    225    218    215    214
  1/3      401    302    269    253    229    224    224    224
    1      403    305    274    259    241    243    250    257
   10      430    350    337    340    412    504    601    698
  100      700    800    967   1150   2122   3114   4111   5108
 1000     3400   5300   7267   9250  19222  29214  39211  49208
Optimal Checkpointing in Transaction Processing
Pcp = Checkpointing period
Tcp = Checkpointing time overhead (for capturing a database snapshot)
Trb = Expected rollback time upon malfunction detection

Relative checkpointing overhead: O = (Tcp + Trb)/Pcp

Assume that the rollback time, given a malfunction at time x, is a + bx
(b is typically small, because only updates need to be reprocessed)

[Timeline: checkpoint i – 1 at time 0, checkpoint i at time Pcp, with a malfunction occurring in the interval from x to x + dx]

r(x): expected rollback time due to a malfunction in the time interval [0, x]
r(x + dx) = r(x) + (a + bx)λdx ⇒ dr(x)/dx = (a + bx)λ ⇒ r(x) = λx(a + bx/2)
Trb = r(Pcp) = λPcp(a + bPcp/2)

O = (Tcp + Trb)/Pcp = Tcp/Pcp + λ(a + bPcp/2) is minimized for Pcp(opt) = (2Tcp/(λb))^1/2
Examples for Optimal Database Checkpointing
O = (Tcp + Trb)/Pcp = Tcp/Pcp + λ(a + bPcp/2) is minimized for Pcp(opt) = (2Tcp/(λb))^1/2

Tcp = Time needed to capture a checkpoint snapshot = 16 min
λ = Malfunction rate = 0.0005/min (MTTF = 2000 min ≈ 33.3 hr)
b = 0.1

Pcp(opt) = (2 × 16/(0.0005 × 0.1))^1/2 = 800 min ≈ 13.3 hr

Suppose that by using faster memory for saving the checkpoint snapshots (e.g., disk rather than tape) we reduce Tcp to 1 min:

Pcp(opt) = (2 × 1/(0.0005 × 0.1))^1/2 = 200 min ≈ 3.3 hr
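The same optimization is easy to script; in the sketch below (names are mine), the constant a drops out of the optimization, so it is left as a parameter:

```python
import math

def optimal_period(t_cp, lam, b):
    # Pcp(opt) = sqrt(2*Tcp/(lam*b)), minimizing O = Tcp/Pcp + lam*(a + b*Pcp/2)
    return math.sqrt(2 * t_cp / (lam * b))

def overhead(t_cp, lam, a, b, p_cp):
    # O = (Tcp + Trb)/Pcp with Trb = lam*Pcp*(a + b*Pcp/2)
    return t_cp / p_cp + lam * (a + b * p_cp / 2)

p1 = optimal_period(16, 0.0005, 0.1)   # 800 min, the slide's first example
p2 = optimal_period(1, 0.0005, 0.1)    # 200 min, after reducing Tcp to 1 min
```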
22 Degradation Management
“Budget cuts.”
Robust Parallel Processing
Resilient Algorithms
22.1 Data Distribution Methods
Reliable data storage requires that the availability and integrity of data
not be dependent on the health of any one site
Data Replication
Data dispersion
Data Replication
Resilient objects using the primary site approach
Active replicas: the state-machine approach
Request is sent to all replicas
All replicas are equivalent and any one of them can service the request
Ensure that all replicas are in the same state (e.g., via atomic broadcast)
Maintaining replica consistency very difficult under Byzantine faults
Will discuss Byzantine agreement later
Read and write quorums
Example: 9 replicas, arranged in 2D grid
Rows constitute write quorums
Columns constitute read quorums
A read quorum is guaranteed to contain the latest update, since every column intersects every row
[Figure: 3 × 3 replica grid, with the most up-to-date replicas in one row (the last write quorum) and a possible read quorum shown as one column]
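The row/write, column/read quorum rule can be illustrated with a toy in-memory model (my own sketch, not from the slides): a write updates one full row with a fresh version number, and a read takes the freshest value in one full column; because every column intersects every row, the read is guaranteed to see the latest write.

```python
N = 3  # replicas arranged in an N x N grid; each holds a (version, value) pair
grid = [[(0, None) for _ in range(N)] for _ in range(N)]

def write(row, value):
    # Write quorum = one full row; bump the version past anything stored so far
    version = max(v for r in grid for (v, _) in r) + 1
    for col in range(N):
        grid[row][col] = (version, value)

def read(col):
    # Read quorum = one full column; return the value with the highest version
    freshest = max((grid[row][col] for row in range(N)), key=lambda pair: pair[0])
    return freshest[1]

write(1, "alpha")
write(2, "beta")        # the latest update now lives in row 2
latest = read(0)        # any column crosses row 2, so this sees "beta"
```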
Data Dispersion
Instead of replicating data objects completely, one can divide each
one into k pieces, encode the pieces, and distribute the encoded
pieces such that any q of the pieces suffice to reconstruct the data
[Figure: the original data word is divided into k pieces; after encoding, the pieces are roughly three times larger, and the reconstruction algorithm recovers the original data word from any k/3 of the encoded pieces. A possible read set and a possible update set each have size 2k/3, so any two such sets intersect in at least k/3 up-to-date pieces]
22.2 Multiphase Commit Protocols
The two generals problem: Two generals lead divisions of an army
camped on the mountains on the two sides of an enemy-occupied valley
The two divisions can only communicate via messengers
We need a scheme for the generals to agree on a common attack time,
given that attack by only one division would be disastrous
Messengers are totally reliable, but may need
an arbitrary amount of time to cross the valley
(they may even be captured and never arrive)
G1 decides on T and sends a messenger to tell G2 (“Tomorrow at noon”)
G2 acknowledges receipt of the attack time T (“Got it!”)
G1 confirms receiving the acknowledgment (“Got your ack!”)
G2, unsure whether G1 got the ack (without which G1 would not attack), will need an ack of the ack!
This can go on forever, without either general ever being sure
Maintaining System Consistency
One key tool is the ability to ensure atomicity despite malfunctions
Atomic action: either the entire action is completed or none of it is done
Similar to a computer guaranteeing sequential execution of instructions,
even though it may perform some steps in parallel or out of order
Where atomicity is useful:
Upon a write operation, ensure that all data replicas are updated
Electronic funds transfer (reduce one balance, increase the other one)
In centralized systems atomicity can be ensured via locking mechanisms
Acquire (read or write) lock for desired data object and operation
Perform operation
Release lock
A key challenge of locks is to avoid deadlock (circular waiting for locks)
Two-Phase Commit Protocol
Ensuring atomicity of actions in a distributed environment

Coordinator (where the transaction is initiated), states Quiet, Wait, Abort, Commit:
Quiet → Wait: – / Begin to all
Wait → Commit: Yes from all / Commit to all
Wait → Abort: No from some / Abort to all

Participant (where some part of the transaction is performed), states Quiet, Wait, Abort, Commit:
Quiet → Wait: Begin / Yes
Quiet → Abort: Begin / No
Wait → Abort: Abort / –
Wait → Commit: Commit / –
On time-out in the wait state: execute the termination protocol

To avoid participants being stranded in the wait state (e.g., when the coordinator malfunctions), a time-out scheme may be implemented
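The coordinator's decision rule reduces to an all-or-nothing vote; a minimal sketch of the message pattern (participant behavior is modeled as a simple vote function, an assumption of mine):

```python
def two_phase_commit(participants):
    """Coordinator: Quiet -> Wait (Begin to all), then Commit iff all vote Yes."""
    votes = [p() for p in participants]   # phase 1: collect Yes/No votes
    if all(v == "yes" for v in votes):
        return "commit"                   # phase 2: Commit to all
    return "abort"                        # phase 2: Abort to all

ok = two_phase_commit([lambda: "yes", lambda: "yes"])    # "commit"
mixed = two_phase_commit([lambda: "yes", lambda: "no"])  # "abort"
```

A single No (or, with timeouts, a missing vote treated as No) aborts the whole transaction, which is what makes the action atomic.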
Three-Phase Commit Protocol
Two-phase commit is a blocking protocol, even with timeout transitions; three-phase commit inserts a Prepare state between Wait and Commit

Coordinator, states Quiet, Wait, Prepare, Abort, Commit:
Quiet → Wait: – / Begin to all
Wait → Prepare: Yes from all / Prepare to all
Wait → Abort: No from some / Abort to all
Prepare → Commit: Ack from all / Commit to all

Participant, states Quiet, Wait, Prepare, Abort, Commit:
Quiet → Wait: Begin / Yes
Quiet → Abort: Begin / No
Wait → Abort: Abort / –
Wait → Prepare: Prepare / Ack
Prepare → Commit: Commit / –

Safe from blocking, given the absence of a local state that is adjacent to both a commit and an abort state
22.3 Dependable Communication
Point-to-point message: encoding + acknowledgment + timeout
Reliable broadcast: message guaranteed to be received by all nodes
Forwarding along branches of a broadcast tree, with possible repetition
(duplicate messages recognized from their sequence numbers)
Positive and negative acknowledgments piggybacked on subsequent
broadcast messages (P broadcasts message m1, Q receives it and
tacks a positive ack for m1 to message m2 that it broadcasts, R did not
receive m1 but finds out about it from Q’s ack and requests retransmit)
Atomic broadcast: reliable broadcast, plus the requirement that
multiple broadcasts be received in the same order by all nodes
(much more complicated to ensure common ordering of messages)
Causal broadcast: if m2 is sent after m1, any message triggered by m2
must not cause actions before those of m1 have been completed
22.4 Dependable Collaboration
Distributed systems, built from COTS nodes (processors plus memory) and interconnects, have inherent redundancy and allow malfunction tolerance to be implemented in software
Interconnect malfunctions are dealt with by synthesizing reliable
communication primitives (point-to-point, broadcast, multicast)
Node malfunctions are modeled differently, with the more general
models requiring greater redundancy to deal with
Node malfunction models, from most specific to most general:
Crash: node stops (does not undergo incorrect transitions)
Omission: node does not respond to some inputs
Timing: node responds either too early or too late
Byzantine: totally arbitrary behavior
Malfunction Detectors in Distributed Systems
Malfunction detector: Distributed oracle related to malfunction detection
Creates and maintains a list of suspected processes
Defined by two properties: completeness and accuracy
Advantages:
Allows decoupling of the effort to detect malfunctions (e.g., site crashes) from that of the actual computation, leading to more modular design
Improves portability, because the same application can be used on a
different platform if suitable malfunction detectors are available for it
Example malfunction detectors:
P (Perfect): strong completeness, strong accuracy (minimum required for interactive consistency)
S: strong completeness, eventual weak accuracy (minimum required for consensus)
Reference: M. Raynal, “A Short Introduction to Failure Detectors for Asynchronous
Distributed Systems,” ACM SIGACT News, Vol. 36, No. 1, pp. 53-70, March 2005.
Reliable Group Membership Service
A group of processes may be cooperating for solving a problem
The group’s membership may expand and contract owing to changing
processing requirements or because of malfunctions and repairs
Reliable multicast: message guaranteed to be received by all
members within the group
ECE 254C: Advanced Computer Architecture – Distributed Systems
(course devoted to distributed computing and its reliability issues)
22.5 Remapping and Load Balancing
After remapping, various parts of the computation are performed by different modules compared with the original mapping
When pieces of a computation are performed on different modules, remapping may expose hidden malfunctions, because it is quite unlikely that the same incorrect answers will be obtained in the remapped version
Load balancing is the act of redistributing the computational load in the
face of lost/recovered resources and dynamically changing
computational requirements
Recomputation with Shift in Space
A linear array with an extra cell can redo the same pipelined computation with each step of the original computation shifted in space
[Figure: computation steps 1–5 mapped onto cells 1–5 in the first pass and onto cells 2–6 in the shifted second pass]
Each cell i + 1 compares the result of step i that it received from the left in the first computation to the result of step i that it obtains in the second computation
With two extra cells in the linear array, three computations can be pipelined and voting used to derive highly reliable results
22.6 Modeling of Degradable Systems
[State model: Intact, Degraded, and Failed states; the direct Intact → Failed transition is a catastrophic malfunction, to be avoided]
Reducing the probability of catastrophic malfunctions
Reduce the probability of malfunctions going undetected
Increase the accuracy of malfunction diagnosis
Make repair rates much greater than malfunction rates (keep spares)
Provide sufficient “safety factor” in computational capacity
Importance of Coverage in Fail-Soft Systems
A fail-soft system can fail either indirectly, due to resource exhaustion,
or directly because of imperfect coverage (analogous to leakage)
Providing more resources (“safety factor”) lengthens the indirect path,
thus slowing indirect failures but does nothing to block the direct path
Saturation effect: For a given coverage factor, addition of resources
beyond a certain point would not be cost-effective with regard to the
resulting reliability gain (same effect observed in standby sparing)
[Figure: paths from the Intact state to the Failed state: a direct path to failure (imperfect coverage), a semidirect path, and an indirect path to failure (resource exhaustion)]
23 Resilient Algorithms
“Just a darn minute! Yesterday you said X equals two!”
Robust Parallel Processing
Resilient Algorithms
23.1 COTS-Based Paradigms
Many of the hardware and software redundancy methods assume that
we are building the entire system (or a significant part of it) from scratch
Question: What can be done to ensure the dependability of
computations using commercial off-the-shelf (COTS) components?
Some companies with fault-tolerant systems and related services:
ARM: Fault-tolerant ARM (launched in late 2006), automotive applications
Nth Generation Computing: High-availability and enterprise storage systems
Resilience Corp.: Emphasis on data security
Stratus Technologies: “The Availability Company”
Sun Microsystems: Fault-tolerant SPARC (ft-SPARC™)
Tandem Computers: An early fault-tolerance leader, part of HP/Compaq since 1997
A number of algorithm and data-structure design methods are available
Some History: The SIFT Experience
SIFT (software-implemented fault tolerance), developed at SRI International in the early 1970s using mostly COTS components, was one of two competing “concept systems” for fly-by-wire aircraft control
The other one, FTMP (fault-tolerant multiprocessor), developed at MIT, used a hardware-intensive approach
System failure rate goal: 10^(–9)/hr over a 10-hour flight
SIFT allocated tasks for execution on multiple, loosely synchronized
COTS processor-memory pairs (skew of up to 50 ms was acceptable);
only the bus system was custom designed
Some fundamental results on, and methods for, clock synchronization
emerged from this project
To prevent errors from propagating, processors obtained multiple copies
of data from different memories over different buses (local voting)
Limitations of the COTS-Based Approach
Some modern microprocessors have dependability features built in:
Parity and other codes in memory, TLB, microcode store
Retry at various levels, from bus transmissions to full instructions
Machine check facilities and registers to hold the check results
Manufacturers could incorporate more advanced and novel features, and at times have experimented with a number of mechanisms, but the low volume of the application base has hindered commercial viability
According to Avizienis [Aviz97], however:
These are often not documented enough to allow users to build on them
Protection is nonsystematic and uneven
Recovery options are limited to shutdown and restart
Description of error handling is scattered among a lot of other detail
There is no top-down view of the features and their interrelationships
23.2 Robust Data Structures
Stored and transmitted data can be protected against unwanted changes
through encoding, but coding does not protect the structure of the data
Consider, e.g., an ordered list of numbers
Individual numbers can be protected by encoding
The set of values can be protected by a checksum
The ordering, however, remains unprotected
Can we devise some general methods for protecting commonly used
data structures?
Idea – Use a checksum that weighs each value differently: (Σj jxj) mod A
Idea – Add a “difference with next item” field to each list entry:

Entry   Difference
x       x – y
y       y – z
z       . . .
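The point of the position-weighted checksum is that it catches ordering errors a plain checksum misses; a small sketch (the modulus A = 251 is an arbitrary choice of mine):

```python
A = 251  # checksum modulus, arbitrary for illustration

def plain_checksum(xs):
    return sum(xs) % A

def weighted_checksum(xs):
    # (sum of j * x_j) mod A: each position weighs its value differently
    return sum(j * x for j, x in enumerate(xs, start=1)) % A

original = [10, 20, 30, 40]
swapped = [10, 30, 20, 40]   # two items exchanged: same multiset of values

plain_agrees = plain_checksum(original) == plain_checksum(swapped)           # True: swap undetected
weighted_agrees = weighted_checksum(original) == weighted_checksum(swapped)  # False: swap detected
```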
Recoverable Linear Linked Lists
Simple linked list: 0-detectable, 0-correctable
(cannot recover from even one erroneous link)

Circular list, with a node count and a unique ID in every node: 1-detectable, 0-correctable

Doubly linked list, with node count and IDs: 2-detectable, 1-correctable

Adding skip links makes the list 3-detectable, 1-correctable

[Figures: each variant drawn as nodes carrying ID fields, a size (node count) field, and forward, backward, and skip links]
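The extra links buy detectability only if they are audited; a minimal audit of a doubly linked list with a stored node count (my own sketch) cross-checks each forward link against the backward links and the size:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None

def build_list(values):
    nodes = [Node(v) for v in values]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes, len(nodes)  # nodes plus the stored node count

def audit(head, size):
    """True if the forward chain is consistent with the prev links and size."""
    count, node = 1, head
    while node.next is not None:
        if node.next.prev is not node:   # forward and backward links must agree
            return False
        node = node.next
        count += 1
        if count > size:                 # guards against loops from bad links
            return False
    return count == size

nodes, size = build_list(["a", "b", "c", "d"])
healthy = audit(nodes[0], size)       # True
nodes[1].next = nodes[3]              # corrupt one forward link (skips node "c")
detected = not audit(nodes[0], size)  # the audit flags the inconsistency
```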
Other Robust Data Structures
Trees, FIFOs, stacks (LIFOs), heaps, queues
In general, a linked data structure is 2-detectable and 1-correctable iff
the link network is 2-connected
Robust data structures provide fairly good protection with little design
effort or run-time overhead
Audits can be performed during idle time
Reuse possibility makes the method even more effective
Robustness features to protect the structure can be combined with
coding methods (such as checksums) to protect the content
Recoverable Binary Trees
Add “parent links” and/or “threads”
(threads are links that connect leaves to higher-level nodes)

Threads can be added with little overhead by taking advantage of unused leaf links (one bit in every node can be used to identify leaves, thus freeing their link fields for other uses)

[Figure: a binary tree whose nodes carry ID fields and a size field, with leaf (L) and nonleaf (N) nodes and threads from leaves back up the tree]
Adding redundancy to data structures has three types of cost:
Storage requirements for the additional information
Slightly more difficult updating procedures
Time overhead for periodic checking of structural integrity
23.3 Data Diversity and Fusion
Alternate formulations of the same information (input re-expression)

Example: the shape of a rectangle can be specified:
By its two sides x and y: A = xy
By the length z of its diagonals and the angle α between them: A = ½ z² sin α
By the radii r and R of its inscribed and circumscribed circles: A = 4r(R² – r²)^1/2

Area calculations with computation and data diversity
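A quick numerical check of the three formulations, taking r as the radius of the largest inscribed circle (half the shorter side) and R as half the diagonal (these interpretations are mine, inferred from the formulas):

```python
import math

def area_sides(x, y):
    return x * y

def area_diagonal(x, y):
    z = math.hypot(x, y)              # diagonal length
    sin_alpha = 2 * x * y / z**2      # sine of the angle between the diagonals
    return 0.5 * z**2 * sin_alpha

def area_circles(x, y):
    r = min(x, y) / 2                 # largest inscribed circle
    R = math.hypot(x, y) / 2          # circumscribed circle
    return 4 * r * math.sqrt(R**2 - r**2)

areas = [f(3.0, 4.0) for f in (area_sides, area_diagonal, area_circles)]  # all 12.0
```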
23.4 Self-Checking Algorithms
Error coding applied to data structures, rather than at the level of atomic data elements
Example: mod-8 checksums used for matrices

Matrix M:
2 1 6
5 3 4
3 2 7

Row checksum matrix Mr (each row extended with its sum mod 8):
2 1 6 1
5 3 4 4
3 2 7 4

Column checksum matrix Mc (a row of column sums mod 8 appended):
2 1 6
5 3 4
3 2 7
2 6 1

Full checksum matrix Mf (both):
2 1 6 1
5 3 4 4
3 2 7 4
2 6 1 1

If Z = X × Y, then Zf = Xc × Yr
In Mf, any single error is correctable and any 3 errors are detectable; four errors may go undetected
Matrix Multiplication Using ABFT
If Z = X × Y, then Zf = Xc × Yr

X:            Column checksum matrix Xc:
2 1 6         2 1 6
5 3 4         5 3 4
3 2 7         3 2 7
              2 6 1

Y:            Row checksum matrix Yr:
1 5 3         1 5 3 1
2 4 6         2 4 6 4
7 1 5         7 1 5 5

Zf = Xc × Yr:
46 20 42 36
39 41 53 37
56 30 56 46
21 35 47 31

Row check (mod 8): 46 + 20 + 42 = 108 ≡ 4, and 36 ≡ 4 mod 8
Column check (mod 8): 20 + 41 + 30 = 91 ≡ 3, and 35 ≡ 3 mod 8
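The whole scheme fits in a few lines of code; the sketch below (helper names are mine) reproduces the slide's numbers and then shows how a single corrupted entry of the product is pinpointed by its inconsistent row and column:

```python
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def with_col_checksum(M, mod=8):
    # Append a row of column sums mod 8 (Xc in the slides)
    return M + [[sum(col) % mod for col in zip(*M)]]

def with_row_checksum(M, mod=8):
    # Append to each row its sum mod 8 (Yr in the slides)
    return [row + [sum(row) % mod] for row in M]

def bad_positions(Zf, mod=8):
    # Rows/columns whose data sum disagrees (mod 8) with the check entry
    bad_rows = [i for i, row in enumerate(Zf[:-1])
                if sum(row[:-1]) % mod != row[-1] % mod]
    bad_cols = [j for j in range(len(Zf[0]) - 1)
                if sum(Zf[i][j] for i in range(len(Zf) - 1)) % mod != Zf[-1][j] % mod]
    return bad_rows, bad_cols

X = [[2, 1, 6], [5, 3, 4], [3, 2, 7]]
Y = [[1, 5, 3], [2, 4, 6], [7, 1, 5]]
Zf = matmul(with_col_checksum(X), with_row_checksum(Y))
clean = bad_positions(Zf)          # ([], []): all checks pass
Zf[1][2] += 3                      # inject a single error
located = bad_positions(Zf)        # ([1], [2]): the error's row and column
```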
23.5 Self-Adapting Algorithms
This section to be completed
23.6 Other Algorithmic Methods
This section to be completed
24 Software Redundancy
“We are neither hardware nor
software; we are your parents.”
“I haven’t the slightest idea who he is.
He came bundled with the software.”
“Well, what’s a piece of software
without a bug or two?”
Robust Parallel Processing
Resilient Algorithms
24.1 Software Dependability
Imagine the following product disclaimers:
For a steam iron
There is no guarantee, explicit or implied, that this device will remove
wrinkles from clothing or that it will not lead to the user’s electrocution.
The manufacturer is not liable for any bodily harm or property damage
resulting from the operation of this device.
For an electric toaster
The name “toaster” for this product is just a symbolic identifier. There
is no guarantee, explicit or implied, that the device will prepare toast.
Bread slices inserted in the product may be burnt from time to time,
triggering smoke detectors or causing fires. By opening the package,
the user acknowledges that s/he is willing to assume sole responsibility
for any damages resulting from the product’s operation.
How Is Software Different from Hardware?
Software unreliability is caused predominantly by design slips, not by
operational deviations – we use flaw or bug, rather than fault or error
Not much sense in replicating the same software and doing comparison
or voting, as we did for hardware
At the current levels of hardware complexity, latent design slips also
exist in hardware, thus the two aren’t totally dissimilar
The curse of complexity
The 7-Eleven convenience store chain spent nearly $9M to make its
point-of-sale software Y2K-compliant for its 5200 stores
The modified software was subjected to 10,000 tests (all successful)
The system worked with no problems throughout the year 2000
On January 1, 2001, however, the system began rejecting credit cards,
because it “thought” the year was 1901 (bug was fixed within a day)
Software Development Life Cycle
Project initiation
Needs
Requirements
Specifications (evaluated by both the developer and customer)
Prototype design
Prototype test
Revision of specs
Final design
Coding (implementation or programming)
Unit test (separate testing of each major unit, or module)
Integration test (test modules within a pretested control structure)
System test
Acceptance test (customer or third-party conformance-to-specs test)
Field deployment
Field maintenance
System redesign (new contract for changes and additional features)
Software discard (obsolete software is discarded, perhaps replaced)

Software flaws may arise at several points within these life-cycle phases
What Is Software Dependability?
Major structural and logical problems are removed very early in the
process of software testing
What remains after extensive verification and validation is a collection of
tiny flaws which surface under rare conditions or particular combinations
of circumstances, thus giving software failure a statistical nature
Software usually contains one or more flaws per thousand lines of code; fewer than 1 flaw per thousand lines is considered good (Linux has been estimated to have 0.1)
If there are f flaws in a software component, the hazard rate, that is,
rate of failure occurrence per hour, is kf, with k being the constant of
proportionality which is determined experimentally (e.g., k = 0.0001)
Software reliability: R(t) = e^(–kft)
The only way to improve software reliability is to reduce the number of
residual flaws through more rigorous verification and/or testing
Residual Software Flaws
[Figure: the input space, containing a few small flaw regions, some of which lie in parts of the space not expected to occur in practice]
24.2 Software Malfunction Models
Software flaw/bug → Operational error → Software-induced failure
“Software failure” used informally to denote any software-related problem
[Graphs: number of flaws vs. time, from the start of testing to software release.
(a) Removing flaws without generating new ones: the rate of flaw removal decreases with time; initial flaws = removed flaws + residual flaws.
(b) New flaws introduced in proportion to the removal rate: initial flaws + added flaws = removed flaws + residual flaws]
Software Reliability Models and Parameters
For simplicity, we focus on the case of no new flaw generation
[Graph: number of flaws vs. testing time, showing initial, removed, and residual flaws between the start of testing and software release]

Assume a linearly decreasing flaw removal rate (F = number of residual flaws, t = testing time in months):
dF(t)/dt = –(a – bt)
F(t) = F0 – at(1 – bt/(2a))
Example: F(t) = 130 – 30t(1 – t/16)

Hazard function: z(t) = k(F0 – at(1 – bt/(2a)))
In our example, let k = 0.000132:
R(t) = exp(–0.000132(130 – 30t(1 – t/16))t)

Assume testing for t = 8 months: R(t) = e^(–0.00132t)

Testing time t (months)    MTTF (hr)
0                          58
2                          98
4                          189
6                          433
8                          758
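The example model can be evaluated directly; this sketch (names mine) reproduces the MTTF column, using MTTF = 1/z(t) for the constant hazard that remains after testing stops:

```python
def residual_flaws(t, f0=130, a=30, b_over_2a=1 / 16):
    # F(t) = F0 - a*t*(1 - (b/(2a))*t); example: F(t) = 130 - 30t(1 - t/16)
    return f0 - a * t * (1 - b_over_2a * t)

def mttf_hours(t, k=0.000132):
    # Hazard z(t) = k*F(t); for a constant hazard, MTTF = 1/z
    return 1 / (k * residual_flaws(t))

table = {t: round(mttf_hours(t)) for t in (0, 2, 4, 6, 8)}
# {0: 58, 2: 98, 4: 189, 6: 433, 8: 758}, matching the slide's table
```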
The Phenomenon of Software Aging
Software does not wear out or age in the same sense as hardware
Yet, we do observe deterioration in software that has been running for a long time

Reasons for and types of software aging:
Accumulation of junk in the state part (reversible via restoration)
Long-term cumulative effects of updates (patches and the like)

As the software’s structure deviates from its original clean form, unexpected failures begin to occur
Eventually software becomes so mangled that it must be discarded and redeveloped from scratch
So, the bathtub curve is also applicable to software
More on Software Reliability Models
Exponentially decreasing flaw removal rate is more realistic than
linearly decreasing, since flaw removal rate never really becomes 0
How does one go about estimating the model constants?
Use handbook: public ones, or compiled from in-house data
Match moments (mean, 2nd moment, . . .) to flaw removal data
Least-squares estimation, particularly with multiple data sets
Maximum-likelihood estimation (a statistical method)
Linearly decreasing flaw removal rate isn’t the only option in modeling
Constant flaw removal rate has also been considered, but it does not
lead to a very realistic model
24.3 Software Verification and Validation
Verification: “Are we building the system right?” (meets specifications)
Validation: “Are we building the right system?” (meets requirements)
Both verification and validation use testing as well as formal methods
Software testing
Exhaustive testing impossible
Test with many typical inputs
Identify and test fringe cases
Formal methods
Program correctness proof
Formal specification
Model checking
Example for program correctness proofs: overlap of rectangles
Examples of formal methods applied to safety/security-critical systems:
Smart cards [Requet 2000]
Cryptography device [Kirby 1999]
Railway interlocking system [Hlavaty 2001]
Automated lab analysis test equipment [Bicarregui 1997]
Formal Proofs for Software Verification
Program to find the greatest common divisor of integers m > 0 and n > 0:

    input m and n
    {m and n are positive integers}
    x := m
    y := n
    {x and y are positive integers, x = m, y = n}
    while x ≠ y
      if x < y
        then y := y – x
        else x := x – y
      endif
      {loop invariant: x > 0, y > 0, gcd(x, y) = gcd(m, n)}
    endwhile
    {x = gcd(m, n)}
    output x

The four steps of a correctness proof relating to a program loop:
1. Loop invariant is implied by the assertion before the loop (precondition)
2. If satisfied before an iteration begins, the invariant is also satisfied at the end
3. Loop invariant and exit condition imply the assertion after the loop
4. Loop executes a finite number of times (termination condition)

Steps 1–3 establish “partial correctness”; step 4 ensures “total correctness”
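The proof obligations can also be exercised mechanically; the rendering below checks the loop invariant on every iteration, using the library gcd as an oracle:

```python
import math

def gcd_subtractive(m, n):
    assert m > 0 and n > 0                     # precondition
    x, y = m, n
    while x != y:
        # Loop invariant: x > 0, y > 0, gcd(x, y) = gcd(m, n)
        assert x > 0 and y > 0 and math.gcd(x, y) == math.gcd(m, n)
        if x < y:
            y = y - x
        else:
            x = x - y
    return x                                   # postcondition: x = gcd(m, n)

result = gcd_subtractive(36, 60)               # 12
```

Termination (step 4) holds because each iteration strictly decreases x + y, which is bounded below.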
Software Flaw Tolerance
Given that a complex piece of software will contain bugs, can we use
redundancy to reduce the probability of software-induced failures?
Sources: Software Fault Tolerance, ed. by Michael R. Lyu, Wiley, 2005
(on-line book at http://www.cse.cuhk.edu.hk/~lyu/book/sft/index.html)
Also, “Software Fault Tolerance: A Tutorial,” 2000 (NASA report, available on-line)
The ideas of masking redundancy, standby redundancy, and
self-checking design have been shown to be applicable to software,
leading to various types of fault-tolerant software
“Flaw tolerance” is a better term; “fault tolerance” has been overused
Flaw avoidance strategies include (structured) design methodologies,
software reuse, and formal methods
Masking redundancy: N-version programming
Standby redundancy: the recovery-block scheme
Self-checking design: N-self-checking programming
24.4 N-Version Programming
Independently develop N different programs (known as “versions”)
from the same initial specification
The greater the diversity in the N versions, the less likely
that they will have flaws that produce correlated errors
Diversity in:
Programming teams (personnel and structure)
Software architecture
Algorithms used
Programming languages
Verification tools and methods
Data (input re-expression and output adjustment)
[Figure: the input feeds versions 1, 2, and 3 in parallel; their outputs go to a voter (also called an adjudicator, decider, or data fuser), which produces the final output]
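A majority voter over N versions might look as follows (a sketch of mine; the three “versions” are trivially different implementations of midpoint computation, with a contrived flaw planted in the third):

```python
from collections import Counter

def nvp(versions, inputs):
    """Run all versions on the same input and return the majority output."""
    outputs = [v(*inputs) for v in versions]
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority: adjudication failed")
    return winner

v1 = lambda a, b: (a + b) // 2
v2 = lambda a, b: a + (b - a) // 2
v3 = lambda a, b: (a + b) // 2 if a + b < 100 else 0  # flawed for large inputs

result = nvp([v1, v2, v3], (60, 80))  # v3 errs, but the majority gives 70
```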
Some Objections to N-Version Programming
Objection: Developing programs is already a very expensive and slow process; why multiply the difficulties by N?
Response: We cannot produce flawless software, regardless of cost

Objection: Diversity does not ensure independent flaws (it has been amply documented that multiple programming teams tend to overlook the same details and to fall into identical traps, thereby committing very similar errors)
Response: This is a criticism of reliability modeling with the independence assumption, not of the method itself

Objection: An imperfect specification can be the source of common flaws
Response: Multiple diverse specifications?

Objection: With truly diverse implementations, the output selection mechanism (adjudicator) is complicated and may contain its own flaws
Response: We will discuss the adjudication problem in a future lecture
Reliability Modeling for N-Version Programs
Fault-tree model: the version shown
here is fairly simple, but the power of
the method comes in handy when
combined hardware/software
modeling is attempted
Probabilities of coincident flaws
are estimated from experimental
failure data
Source: Dugan & Lyu, 1994 and 1995
Applications of N-Version Programming
Back-to-back testing: multiple versions can help in the testing process
Source: P. Bishop, 1995
Some experiments in N-version programming
B777 flight computer: 3 diverse processors running diverse software
Airbus A320/330/340 flight control: 4 dissimilar hardware/software
modules drive two independent sets of actuators
24.5 The Recovery Block Method
The software counterpart to standby sparing for hardware.
Suppose we can verify the result of a software module by subjecting
it to an acceptance test:

ensure   acceptance test     (e.g., sorted list)
by       primary module      (e.g., quicksort)
else by  first alternate     (e.g., bubblesort)
  ...
else by  last alternate      (e.g., insertion sort)
else fail

The acceptance test can range from a simple reasonableness check
to a sophisticated and thorough test.
Design diversity helps ensure that an alternate can succeed when the
primary module fails.
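The recovery-block structure above can be sketched in Python. This is an illustrative sketch, not the slides' own code; the acceptance test checks both monotonicity and preservation of the input multiset (anticipating the point on the next slide that monotonicity alone is not enough).

```python
# Recovery-block sketch (illustrative): try the primary, then each
# alternate in order; return the first output that passes the
# acceptance test, mirroring "ensure ... by ... else by ... else fail".
from collections import Counter

def acceptance_test(inp, out):
    # Thorough test for sorting: output is monotonic AND is a
    # permutation of the input.
    return all(a <= b for a, b in zip(out, out[1:])) and Counter(inp) == Counter(out)

def quicksort(xs):            # primary module
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[0]
    return (quicksort([x for x in xs[1:] if x < pivot]) + [pivot]
            + quicksort([x for x in xs[1:] if x >= pivot]))

def insertion_sort(xs):       # last alternate (diverse design)
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def recovery_block(inp, modules, test):
    for module in modules:            # primary first, then alternates
        try:
            out = module(inp)
        except Exception:
            continue                  # a crash is treated like a failed test
        if test(inp, out):
            return out
    raise RuntimeError("recovery block failed: no alternate passed the test")

print(recovery_block([3, 1, 2], [quicksort, insertion_sort], acceptance_test))
# -> [1, 2, 3]
```

If the primary crashes or produces an output that fails the test, control silently falls through to the next alternate, just as standby sparing switches in a spare hardware unit.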
Slide 76
The Acceptance Test Problem
Designing acceptance tests (ATs) that are both simple and thorough is
very difficult; for example, to check the result of sorting, it is not
enough to verify that the output sequence is monotonic.
Simplicity is desirable because the acceptance test is executed after
the primary computation, thus lengthening the critical path.
Thoroughness ensures that an incorrect result does not pass the test
(of course, a correct result always passes a properly designed test).
Some computations do have simple tests (inverse computation).
Examples: square-rooting can be checked through squaring, and
roots of a polynomial can be verified via polynomial evaluation.
At worst, the acceptance test might be as complex as the primary
computation itself.
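The two inverse-computation examples above can be sketched as follows; the tolerance values are illustrative assumptions needed for floating-point arithmetic.

```python
# Inverse-computation acceptance tests (illustrative sketch): verify a
# result by applying the cheap inverse operation.
import math

def check_sqrt(x, root, tol=1e-9):
    # Squaring the claimed root should recover x, within a tolerance
    # because floating-point arithmetic is inexact.
    return abs(root * root - x) <= tol * max(1.0, abs(x))

def check_poly_root(coeffs, r, tol=1e-9):
    # Evaluating the polynomial at a claimed root should give ~0.
    # coeffs are highest-degree first, e.g. [1, -3, 2] for x^2 - 3x + 2.
    value = 0.0
    for c in coeffs:
        value = value * r + c          # Horner's rule
    return abs(value) <= tol

print(check_sqrt(2.0, math.sqrt(2.0)))    # True
print(check_poly_root([1, -3, 2], 1.0))   # True: x^2 - 3x + 2 has root 1
```

Both checks are much cheaper than recomputing the primary result, which is exactly what makes inverse computation an attractive acceptance test.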
Slide 77
24.6 Hybrid Software Redundancy
Recoverable N-version block (RNVB) scheme = N-self-checking program
(NSCP): the voter acts only on module outputs that have passed an
acceptance test.
Consensus recovery block (CRB) scheme: only when there is no majority
agreement is the acceptance test applied (in a prespecified order) to
the module outputs until one passes its test.
Source: Parhami, B., “An Approach to Component-Based Synthesis of
Fault-Tolerant Software,” Informatica, Vol. 25, pp. 533-543, Nov. 2001.
[Figure: (a) RNVB / NSCP — modules 1-3 each feed an acceptance test
(pass/fail), and the voter acts only on outputs that passed; (b) CRB —
modules 1-3 feed a voter, and acceptance tests are applied only when no
majority (more than v/2 agreeing votes) exists; both schemes signal an
error when no output can be selected.]
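The consensus recovery block described above can be sketched as follows; the three modules and the acceptance test are illustrative assumptions, not components from the cited paper.

```python
# Consensus recovery block sketch (illustrative): vote on the outputs
# of the N modules; only when no majority exists, apply the acceptance
# test to the outputs in a prespecified order until one passes.
from collections import Counter

def consensus_recovery_block(inp, modules, acceptance_test):
    outputs = [m(inp) for m in modules]
    # Majority vote first (outputs must be hashable to be counted).
    winner, count = Counter(outputs).most_common(1)[0]
    if count > len(modules) / 2:
        return winner
    # No majority: fall back to acceptance testing in module order.
    for out in outputs:
        if acceptance_test(inp, out):
            return out
    raise RuntimeError("consensus recovery block failed")

# Hypothetical triple of modules computing x*x; one is faulty, so the
# two correct modules form a majority and no acceptance test is needed.
modules = [lambda x: x * x, lambda x: x * x, lambda x: x + 1]
print(consensus_recovery_block(5, modules, lambda i, o: o == i * i))  # -> 25
```

The design choice here is that voting is cheap and runs first, while the (possibly slow or imperfect) acceptance test is held in reserve for the rare case of complete disagreement.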
Slide 78
More General Hybrid NVP-AT Schemes
[Figure: (a) legend — module, test, and voter blocks with In, Out, and
Error signals; (b) 5VP — a five-version scheme with modules 1-5 feeding
a voter; (c) ALT1 — an alternative hybrid arrangement of modules 1-4
with acceptance tests and a voter.]
Source: Parhami, B., “An Approach to Component-Based Synthesis of
Fault-Tolerant Software,” Informatica, Vol. 25, pp. 533-543, Nov. 2001.