SlideShare a Scribd company logo
1 of 32
Download to read offline
Dependable Systems
!
Introduction
Dr. Peter Tröger
Dependable Systems Course PT 2014
Dependability
• Umbrella term for operational requirements on a system

• IFIP WG 10.4: "[..] the trustworthiness of a computing system which allows reliance
to be justifiably placed on the service it delivers [..]"

• IEC IEV: "dependability (is) the collective term used to describe the availability
performance and its influencing factors : reliability performance, maintainability
performance and maintenance support performance"

• Laprie: „ Trustworthiness of a computer system such that 

reliance can be placed on the service it delivers to the user “

• Adds a third dimension to system quality

• General question: How to deal with unexpected events ?

• In German: ,Verlässlichkeit‘ vs. ,Zuverlässigkeit‘

2
Dependable Systems Course PT 2014
English vs. German
3
Dependability Verlässlichkeit
Reliability Zuverlässigkeit
Availability Verfügbarkeit
Safety Sicherheit
Security Schutz
Maintainability Wartbarkeit
Fault Fehlerursache (Fehler)
Error Fehlerzustand (Fehler)
Failure Ausfall
Error Propagation Fehlerausbreitung
Fault Prevention Fehlerursachenverhinderung
Fault Removal Fehlerursachenbeseitigung
Fault Tolerance Fehlertoleranz
Fault Forecasting Fehlervorhersage
Dependable Systems Course PT 2014
Computers in Safety-Critical Systems
• Different levels of critical participation for a computer system

• Information provisioning to human controller on request

• Interpretation of data and presentation to the user

• Issues command on behalf of the human controller

• Replaces human controller

• Trend to realize critical systems with commercial-of-the-shelf components

• Driven by budget cuts and performance advantage

• Puts sole responsibility on software layer, in contrast to early hardware-only
redundancy approaches
4
Dependable Systems Course PT 2014
Dependable Systems Motivation
Money

Life saving

Human users

Harsh environments

Complexity
5
Dependable Systems Course PT 2014
Example: Therac 25
• Radiation therapy engine, involved in 6 accidents between 1985 and 1987

• Two operation modes: High-power spreaded beam, or low power focus beam

• Accident: High-power mode without spreader plate activated

• Harmful operation not detected due to software flaws

• Institutional issues: No independent software review, 

no risk analysis for failure modes, no test of hardware / software assembling

• Engineering issues: No hardware interlock, software re-use from older models, 

no sensors for hardware check, arithmetic overflow in the check routines

• No investigation of single modules
6
Dependable Systems Course PT 2014
Example: Patriot Missile Launcher
• Mobile missile launcher, designed for a few hours of operation

• Used for Scud defense operation, never designed for it

• February 1991 - Battery in Dharan (Saudi Arabia) failed to intercept Scud missile,

hit army barrack

• Software aging problem in system‘s weapon control computer

• Target velocity and time 

demanded as real values, 

stored as 24-bit integer

• Inaccurate tracking computation 

due to overlong operation 

( > 100 hours)

• Modified software reached the 

base one day after the accident
7
Dependable Systems Course PT 2014
Example: Ariane 5
• June 4 1996, self-destruction firework for $500 million

• Initiated because boosters were ripping from the rocket

• Reasoned by on-board computer reaction on inertia system data

• Was not data, but a diagnostic bit pattern due to an overflow error

• Identical inertia backup system had shut down milliseconds before

• Inertia system from Ariane 4, numbers were never expected to get that high

• 64-bit FP to 16-bit signed integer, data conversion in Ada was not protected

• No test of full system due to budget limits

• System was running after lift off for no special purpose

• Possibility for fast restart with brief hold in the countdown, only for Ariane 4
8
declare
v_veloc_sensor: float; -- 64 bit floating point
h_veloc_sensor: float;
v_veloc_bias: integer; -- 16 bit signed integer
h_veloc_bias: integer;
begin
sensor_get (v_veloc_sensor);
sensor_get (h_veloc_sensor);
v_veloc_bias := integer (v_veloc_sensor);
h_veloc_bias := integer (h_veloc_sensor);
exception
when others => use_irs1();
end;
Dependable Systems Course PT 2014
Example: Ariane 5
9
Dependable Systems Course PT 2014
Example: Mars Pathfinder
• Successful landing on Mars at July 4th, 1997

• Spontaneous resets after landing

• VxWorks operating system, real-time scheduling of data streams and commands

• Classical priority inversion - long-running medium-communication task, 

low-priority meteorological task, reset triggered by watchdog

• Antennas performed too good, unexpected timing behavior in data transmission

• Problem reproduced with identical Pathfinder on earth

• Semaphore had priority inheritance protection, which was switched off for
performance optimization
10
Dependable Systems Course PT 2014
Unbounded Priority Inversion
• Pathfinder scenario

• T1: Periodically checks
health of systems

• T2: Processes image data

• T3: Occasional test on
equipment status

• T1 and T3 share a protected
data structure

• T1 must indirectly wait for
T2 suspend, which has no
relation to it‘s own purpose

• T1 timeout lead to full
system reset and check
11
Dependable Systems Course PT 2014
Example: Air France AF 447 (May 2009)
• Airbus A330 flight at 10.000m through turbulent weather conditions

• Ice crystals trigger malfunction in Pitot air pressure sensors

• Stream of air normally used for speed computation

• Flight speed instruments shut down, followed by auto pilot

• Demanded manual angle and thrust level computation by pilots

• Flight computer switches to emergency mode (,alternate law 2‘)

• Still operational (,fly-by-wire‘), but different flight feeling

• Pilots lost physical control over machine, restart of flight computer not successful

• Specification for sensors not feasible (written in 1947)

• Sensor must be operational till -40 degrees / 9000m height
12
Dependable Systems Course PT 2014
Importance of Dependability for Business
• Average costs per hour of downtime (Gartner 1998)

• Brokerage operations in finance: $6.5 million

• Credit card authorization: $2.6 million

• Home catalog sales: $90.000

• Airline reservation: $89.500

• 22-hour service outage of eBay on June 6th 1999

• Interruption of around 2.3 million auctions

• 9.2% stock value drop, $3-5 billion of lost revenues

• Problems blamed on Sun server software

• Banking, telephone system, online games, cloud operations, ...
13
Dependable Systems Course PT 2014
Example: New York Stock Exchange
• Credit Suisse trading desk, Nov 14th, 2007

• Automated trading algorithm went mad

• Reasoned by misplaced double click in the trading software UI

• Hundreds of thousands of bogus order cancellations

• DoS nature of fault caused NYSE to freeze up - $150.000 fine
14
Dependable Systems Course PT 2014
Example: Amazon Cloud Outages
• 2006 - Elevated levels of authenticated requests from multiple users

• Request volumes are monitored, but cryptographic overhead was not considered

• Overload on authentication services - performs request processing and validation

• S3 service (cloud storage service) became inaccessible, but is the only storage
facility supported for Amazon-hosted virtual machines 

• Took 3,25 hours to solve the problem - live fixing, started at 3am

• No SLA promises ever, but trust in cloud-only hosted web sited dropped dramatically

• 2009 - 19 hours of outage for bitbucket.org

• Code hosting service, relies on EC2 and block storage

• UDP / TCP flooding attack influenced also access to storage facility
16
Hello,
A few days ago we sent you an email letting you know that we were
working on recovering an inconsistent data snapshot of one or
more of your Amazon EBS volumes. We are very sorry, but ultimately
our efforts to manually recover your volume were unsuccessful. The
hardware failed in such a way that we could not forensically restore the
data.
What we were able to recover has been made available via a snapshot,
although the data is in such a state that it may have little to no
utility…
If you have no need for this snapshot, please delete it to avoid
incurring storage charges.
We apologize for this volume loss and any impact to your business.
Sincerely,

Amazon Web Services, EBS Support
Dependable Systems Course PT 2014
Example: IBM S/390 Outage Sources
• Examination of 16 managed mainframe systems over 6 months (Pfister, 1990)

• 92% planned outages, 8% unplanned outages

• 32% database backup

• 24% database reorganization

• 14% system software maintenance

• 12% network

• 8% applications

• 8% hardware

• 2% other
18
Dependable Systems Course PT 2014
Tradeoffs
19
0.99999
0.9999
0.999
0.99
0.9
Massively Parallel/
Distributed Systems
Commercial
Fault-Tolerant
Systems
Ultra Reliable
Systems
Availability
Throughput (MIPS)
Dependable Systems Course PT 2014
Example: AT&T Electronic Switching System ESS1A
• Downtime for entire system not more than 2 hours over 40 years of life

• System outage < 3 minutes / year

• 100% availability from user perspective

• Goal achieved, two minutes of down time [Malek]

• 24 sec hardware faults, 18 sec software problems, 

36 sec procedural errors, 42 sec recovery problems

• Full duplication of all critical components

• Automated error detection on hardware and software level

• Replication checks - duplex system with comparison

on every cycle

• Timing checks, coding checks, internal checks with comparators
20
Dependable Systems Course PT 2014
Example: AT&T Electronic Switching System ESS1A
21
(C) Malek
Dependable Systems Course PT 2014
Example: Boeing 777 Flight Controller
22
• Fly-by-wire system, requirements

• Probability 1x10-10 for degradation 

from ,MIN-OP configuration‘

• ,Never give up‘ redundancy 

management strategy

• Mean time between 

maintenance actions: 25.000 operating hours

• TMR for all hardware resources, fail-passive electronics

• Candidate architectures were evaluated according to many trade aspects
Dependable Systems Course PT 2014
Example: Boeing 777 Flight Controller (Buus et al.)
23
Dependable Systems Course PT 2014
Example: VAX Hardware Redundancy
• VMS operating system on VAX hardware, since mid-1970‘s, outgrow of PDP-11

• VAXft in the early 1990‘s
24
Dependable Systems Course PT 2014
IBM zEnterprise
25
© 2011 IBM Corporation16
IBM zEnterprise
16
Designed to deliver continuous availability at the application level
Single System z Multiple System z
Site 1 Site 2
Multiple Data Centers
Where mean time
between failure is
measured in decades
Designed for
application availability
of 99.999%
Industry-leading
solution for
disaster recovery
 Avoiding the cost of downtime
 Ensuring access to critical
applications
 Maintaining productivity of users
 Open to clients 24/7
Manage Risk with System z Resiliency
Availability built in across the system
(C) IBM
Dependable Systems Course PT 2014
IBM zEnterprise
26
© 2011 IBM Corporation56
IBM zEnterprise
Scheduled(CIE+DisruptivePatches+ECs)
Planned- (MES+Driver Upgrades)
Unscheduled(UIRA)
Sources of Outages
Pre z9
-Hrs/Year/Syst-
Prior
Servers
z9 EC Z10 EC z196
Unscheduled
Outages
Scheduled
Outages
Planned
Outages
Preplanning
requirements
Power &
Thermal
Management
Increased Focus over time
  
 
 





Temperature = Silicon Reliability Worst Enemy
Wearout = Mechanical Components Reliability Worst Enemy.
System z overall RAS Strategy
…..Continuing our RAS focus helps avoid outages
ImpactofOutage
(C) IBM
Dependable Systems Course PT 2014
IBM zEnterprise
27
© 2011 IBM Corporation14
IBM zEnterprise
The potential performance impact of the Linux server farm is
isolated from the other Logical partitions (LPAR)
LPAR – Up to 60 Logical partitions
System z
Deploy virtual servers in seconds
Highly granular resource sharing (<1%)
Add physical resources without taking system
down, scale out to 1000s of virtual servers
Do more with less:
More virtual servers per core,
Share more physical resources across servers
Extensive built-in facilities for virtual server life-
cycle management
Hardware-enforced isolation
Distributed Platforms
Limited per-core virtual server scalability
Physical server sprawl is needed to scale
Operational complexity increases as virtual
server images grow
VMware tools only support VMware hypervisor
(ESX)
PR/SM
Hypervisor in
firmware
z/VM – 100s of virtual servers – Shared Memory
Hardware support:
10% of circuits are
used for virtualization
Each LPAR is one
Virtual Mainframe
Extreme Virtualization in zEnterprise
- Built into the architecture not an “add on” feature
(C) IBM
Dependable Systems Course PT 2014
Dependability Examples
• A segmentation fault can lead to a system failure.

• A faulty controller in an RAID array lowers the reliability of the storage facility.

• The measurable availability of a web server depends on the fault tolerance
capabilities in all its components.

• A database fault creates an erroneous state in the authentication component, which
can lead to a system failure influencing the system confidentiality. 

• The integrity of electronic voting systems depends on the complete absence of
failures in the voting and the calculation procedures.

• Fault forecasting can support the maintainability of a system.

• Testing can only prove the presence of faults, not their absence.

• Our service level agreement guarantees an availability of 99.99%.
28
Dependable Systems Course PT 2014
Consequences of Failures [Knight]
• Human injury or loss of life

• Damage to the environment

• Damage to or loss of equipment

• Damage to or loss of data

• Financial loss by theft

• Financial loss through production of useless or defective products

• Financial loss through reduced capacity for production or service

• Loss of business reputation

• Loss of customer base

• Loss of jobs
29
Dependable Systems Course PT 2014
Dependability Concerns [Knight]
• Do I know all available techniques to prevent some damage in / by my system ?

• I don‘t know how the system is used, so how can I decide for an appropriate
dependability mechanism ?

• How do I get an accurate and complete set of term definitions to communicate reliably
with the other engineers ?

• What is a technology to tackle all weak points in my system architecture, not only the
weakest or the obvious ones ?

• How can I measure the effectiveness of my dependability techniques ?

• Software is special ...

• Definition of the required function is harder than in other engineering disciplines

• Software needs to take action when hardware fails

• Software may have to operate on platforms whose design was influenced by
dependability goals
30
Dependable Systems Course PT 2014
Course Topics
• Definitions and metrics

• Fault, error, failure, fault models, fault tolerance, fault forecasting, ...

• Reliability, availability, maintainability, error mitigation, damage confinement, ...

• Reliability engineering, reliability testing, MTTF, MTBF, failure rate, ...

• Fault tolerance patterns

• N-out-of-M, voting, heartbeat, ...

• Analytical evaluation

• Reliability block diagrams, fault tree analysis, ...

• Petri nets, Markov chains

• Root cause analysis, risk analysis and assessment
31
Dependable Systems Course PT 2014
Course Topics
• Hardware-Implemented Dependability

• Failure rates, testing, hardware redundancy approaches 

• Software-Implemented Dependability

• Fault-tolerant programming: simplex, recovery blocks, checkpointing, roll-back, ...

• Testing, software metrics

• Fault-tolerant distributed systems: Transactions, consensus, clocks, ...

• Advanced approaches

• Autonomic computing, rejuvenation, online failure prediction, performability

• Case studies from practice
32

More Related Content

What's hot

Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Peter Tröger
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Peter Tröger
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsPeter Tröger
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.Peter Tröger
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systemsleo3004
 
Components in real time systems
Components in real time systemsComponents in real time systems
Components in real time systemsSaransh Garg
 
Fault tolerant presentation
Fault tolerant presentationFault tolerant presentation
Fault tolerant presentationskadyan1
 
Fault tolerance and computing
Fault tolerance  and computingFault tolerance  and computing
Fault tolerance and computingPalani murugan
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time SystemsDeepak John
 
Software Fault Tolerance
Software Fault ToleranceSoftware Fault Tolerance
Software Fault ToleranceAnkit Singh
 
Availability tactics
Availability tacticsAvailability tactics
Availability tacticsahsan riaz
 
Fault tolearant system
Fault tolearant systemFault tolearant system
Fault tolearant systemarvinthsaran
 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemanujos25
 

What's hot (20)

Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)Dependable Systems -Software Dependability (15/16)
Dependable Systems -Software Dependability (15/16)
 
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
Dependable Systems - Structure-Based Dependabiilty Modeling (6/16)
 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissions
 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systems
 
Rtos
RtosRtos
Rtos
 
Components in real time systems
Components in real time systemsComponents in real time systems
Components in real time systems
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Fault tolerant presentation
Fault tolerant presentationFault tolerant presentation
Fault tolerant presentation
 
Fault tolerance and computing
Fault tolerance  and computingFault tolerance  and computing
Fault tolerance and computing
 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systems
 
Software Fault Tolerance
Software Fault ToleranceSoftware Fault Tolerance
Software Fault Tolerance
 
Voting protocol
Voting protocolVoting protocol
Voting protocol
 
Availability tactics
Availability tacticsAvailability tactics
Availability tactics
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Fault tolearant system
Fault tolearant systemFault tolearant system
Fault tolearant system
 
Real time system tsp
Real time system tspReal time system tsp
Real time system tsp
 
RTOS Basic Concepts
RTOS Basic ConceptsRTOS Basic Concepts
RTOS Basic Concepts
 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating system
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
 

Similar to Dependable Systems - Introduction (1/16)

RuSIEM overview (english version)
RuSIEM overview (english version)RuSIEM overview (english version)
RuSIEM overview (english version)Olesya Shelestova
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applicationsAmit Kejriwal
 
04. availability-concepts
04. availability-concepts04. availability-concepts
04. availability-conceptsMuhammad Ahad
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinTechWell
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx20EUEE018DEEPAKM
 
Rapid_Recovery-T75-v2204j.pdf
Rapid_Recovery-T75-v2204j.pdfRapid_Recovery-T75-v2204j.pdf
Rapid_Recovery-T75-v2204j.pdfTony Pearson
 
The Great Disconnect of Data Protection: Perception, Reality and Best Practices
The Great Disconnect of Data Protection: Perception, Reality and Best PracticesThe Great Disconnect of Data Protection: Perception, Reality and Best Practices
The Great Disconnect of Data Protection: Perception, Reality and Best Practicesiland Cloud
 
10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public CloudIntuit Inc.
 
Performance Testing
Performance TestingPerformance Testing
Performance TestingAnu Shaji
 
Embedded System Introduction and Basics
Embedded System Introduction  and BasicsEmbedded System Introduction  and Basics
Embedded System Introduction and Basicsgkesavan11
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slidesMuhammad Ahad
 
Building an Experimentation Platform in Clojure
Building an Experimentation Platform in ClojureBuilding an Experimentation Platform in Clojure
Building an Experimentation Platform in ClojureSrihari Sriraman
 
Cloudciti Disaster Recovery as a Service
Cloudciti Disaster Recovery as a Service   Cloudciti Disaster Recovery as a Service
Cloudciti Disaster Recovery as a Service PT Datacomm Diangraha
 
November 2014 Webinar - Disaster Recovery Worthy of a Zombie Apocalypse
November 2014 Webinar - Disaster Recovery Worthy of a Zombie ApocalypseNovember 2014 Webinar - Disaster Recovery Worthy of a Zombie Apocalypse
November 2014 Webinar - Disaster Recovery Worthy of a Zombie ApocalypseRapidScale
 
Automating the cip compliance test lab
Automating the cip compliance test labAutomating the cip compliance test lab
Automating the cip compliance test labChuck Reynolds
 
Fdp embedded systems
Fdp embedded systemsFdp embedded systems
Fdp embedded systemsKavya G
 
Acceleration_and_Security_draft_v2
Acceleration_and_Security_draft_v2Acceleration_and_Security_draft_v2
Acceleration_and_Security_draft_v2Srinivasa Addepalli
 

Similar to Dependable Systems - Introduction (1/16) (20)

RuSIEM overview (english version)
RuSIEM overview (english version)RuSIEM overview (english version)
RuSIEM overview (english version)
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Disaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planningDisaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planning
 
04. availability-concepts
04. availability-concepts04. availability-concepts
04. availability-concepts
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the Coin
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx
 
Rapid_Recovery-T75-v2204j.pdf
Rapid_Recovery-T75-v2204j.pdfRapid_Recovery-T75-v2204j.pdf
Rapid_Recovery-T75-v2204j.pdf
 
The Great Disconnect of Data Protection: Perception, Reality and Best Practices
The Great Disconnect of Data Protection: Perception, Reality and Best PracticesThe Great Disconnect of Data Protection: Perception, Reality and Best Practices
The Great Disconnect of Data Protection: Perception, Reality and Best Practices
 
A075434624
A075434624A075434624
A075434624
 
10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud10 Tips for Your Journey to the Public Cloud
10 Tips for Your Journey to the Public Cloud
 
Performance Testing
Performance TestingPerformance Testing
Performance Testing
 
Embedded System Introduction and Basics
Embedded System Introduction  and BasicsEmbedded System Introduction  and Basics
Embedded System Introduction and Basics
 
05. performance-concepts-26-slides
05. performance-concepts-26-slides05. performance-concepts-26-slides
05. performance-concepts-26-slides
 
Designing Scalable Applications
Designing Scalable ApplicationsDesigning Scalable Applications
Designing Scalable Applications
 
Building an Experimentation Platform in Clojure
Building an Experimentation Platform in ClojureBuilding an Experimentation Platform in Clojure
Building an Experimentation Platform in Clojure
 
Cloudciti Disaster Recovery as a Service
Cloudciti Disaster Recovery as a Service   Cloudciti Disaster Recovery as a Service
Cloudciti Disaster Recovery as a Service
 
November 2014 Webinar - Disaster Recovery Worthy of a Zombie Apocalypse
November 2014 Webinar - Disaster Recovery Worthy of a Zombie ApocalypseNovember 2014 Webinar - Disaster Recovery Worthy of a Zombie Apocalypse
November 2014 Webinar - Disaster Recovery Worthy of a Zombie Apocalypse
 
Automating the cip compliance test lab
Automating the cip compliance test labAutomating the cip compliance test lab
Automating the cip compliance test lab
 
Fdp embedded systems
Fdp embedded systemsFdp embedded systems
Fdp embedded systems
 
Acceleration_and_Security_draft_v2
Acceleration_and_Security_draft_v2Acceleration_and_Security_draft_v2
Acceleration_and_Security_draft_v2
 

More from Peter Tröger

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspectivePeter Tröger
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and VirtualizationPeter Tröger
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Peter Tröger
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded SystemsPeter Tröger
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.Peter Tröger
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Peter Tröger
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Peter Tröger
 
Operating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryOperating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryPeter Tröger
 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyPeter Tröger
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsPeter Tröger
 
Operating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsOperating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsPeter Tröger
 
Operating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryOperating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryPeter Tröger
 
Operating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsOperating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsPeter Tröger
 
Operating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputOperating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputPeter Tröger
 
Operating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesOperating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesPeter Tröger
 
Operating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesOperating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesPeter Tröger
 
Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Peter Tröger
 
Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Peter Tröger
 

More from Peter Tröger (18)

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspective
 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and Virtualization
 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2
 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded Systems
 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.
 
Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)
 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
 
Operating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryOperating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - Summary
 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - Concurrency
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management Concepts
 
Operating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsOperating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - Threads
 
Operating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryOperating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - History
 
Operating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsOperating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware Basics
 
Operating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputOperating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / Output
 
Operating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesOperating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - Architectures
 
Operating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesOperating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - Processes
 
Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)
 
Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)
 

Recently uploaded

AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationAadityaSharma884161
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 

Recently uploaded (20)

AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint Presentation
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 

Dependable Systems - Introduction (1/16)

  • 2. Dependable Systems Course PT 2014 Dependability • Umbrella term for operational requirements on a system • IFIP WG 10.4: "[..] the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers [..]" • IEC IEV: "dependability (is) the collective term used to describe the availability performance and its influencing factors : reliability performance, maintainability performance and maintenance support performance" • Laprie: „ Trustworthiness of a computer system such that 
 reliance can be placed on the service it delivers to the user “ • Adds a third dimension to system quality • General question: How to deal with unexpected events ? • In German: ,Verlässlichkeit‘ vs. ,Zuverlässigkeit‘
 2
  • 3. Dependable Systems Course PT 2014 English vs. German 3 Dependability Verlässlichkeit Reliability Zuverlässigkeit Availability Verfügbarkeit Safety Sicherheit Security Schutz Maintainability Wartbarkeit Fault Fehlerursache (Fehler) Error Fehlerzustand (Fehler) Failure Ausfall Error Propagation Fehlerausbreitung Fault Prevention Fehlerursachenverhinderung Fault Removal Fehlerursachenbeseitigung Fault Tolerance Fehlertoleranz Fault Forecasting Fehlervorhersage
  • 4. Dependable Systems Course PT 2014 Computers in Safety-Critical Systems • Different levels of critical participation for a computer system • Information provisioning to human controller on request • Interpretation of data and presentation to the user • Issues command on behalf of the human controller • Replaces human controller • Trend to realize critical systems with commercial-of-the-shelf components • Driven by budget cuts and performance advantage • Puts sole responsibility on software layer, in contrast to early hardware-only redundancy approaches 4
  • 5. Dependable Systems Course PT 2014 Dependable Systems Motivation Money Life saving Human users Harsh environments Complexity 5
  • 6. Dependable Systems Course PT 2014 Example: Therac 25 • Radiation therapy engine, involved in 6 accidents between 1985 and 1987 • Two operation modes: High-power spreaded beam, or low power focus beam • Accident: High-power mode without spreader plate activated • Harmful operation not detected due to software flaws • Institutional issues: No independent software review, 
 no risk analysis for failure modes, no test of hardware / software assembling • Engineering issues: No hardware interlock, software re-use from older models, 
 no sensors for hardware check, arithmetic overflow in the check routines • No investigation of single modules 6
  • 7. Dependable Systems Course PT 2014 Example: Patriot Missile Launcher • Mobile missile launcher, designed for a few hours of operation • Used for Scud defense operation, never designed for it • February 1991 - Battery in Dharan (Saudi Arabia) failed to intercept Scud missile,
 hit army barrack • Software aging problem in system‘s weapon control computer • Target velocity and time 
 demanded as real values, 
 stored as 24-bit integer • Inaccurate tracking computation 
 due to overlong operation 
 ( > 100 hours) • Modified software reached the 
 base one day after the accident 7
  • 8. Dependable Systems Course PT 2014 Example: Ariane 5 • June 4 1996, self-destruction firework for $500 million • Initiated because boosters were ripping from the rocket • Reasoned by on-board computer reaction on inertia system data • Was not data, but a diagnostic bit pattern due to an overflow error • Identical inertia backup system had shut down milliseconds before • Inertia system from Ariane 4, numbers were never expected to get that high • 64-bit FP to 16-bit signed integer, data conversion in Ada was not protected • No test of full system due to budget limits • System was running after lift off for no special purpose • Possibility for fast restart with brief hold in the countdown, only for Ariane 4 8
  • 9. declare v_veloc_sensor: float; -- 64 bit floating point h_veloc_sensor: float; v_veloc_bias: integer; -- 16 bit signed integer h_veloc_bias: integer; begin sensor_get (v_veloc_sensor); sensor_get (h_veloc_sensor); v_veloc_bias := integer (v_veloc_sensor); h_veloc_bias := integer (h_veloc_sensor); exception when others => use_irs1(); end; Dependable Systems Course PT 2014 Example: Ariane 5 9
  • 10. Dependable Systems Course PT 2014 Example: Mars Pathfinder • Successful landing on Mars at July 4th, 1997 • Spontaneous resets after landing • VxWorks operating system, real-time scheduling of data streams and commands • Classical priority inversion - long-running medium-communication task, 
 low-priority meteorological task, reset triggered by watchdog • Antennas performed too good, unexpected timing behavior in data transmission • Problem reproduced with identical Pathfinder on earth • Semaphore had priority inheritance protection, which was switched off for performance optimization 10
  • 11. Dependable Systems Course PT 2014 Unbounded Priority Inversion • Pathfinder scenario • T1: Periodically checks health of systems • T2: Processes image data • T3: Occasional test on equipment status • T1 and T3 share a protected data structure • T1 must indirectly wait for T2 suspend, which has no relation to it‘s own purpose • T1 timeout lead to full system reset and check 11
  • 12. Dependable Systems Course PT 2014 Example: Air France AF 447 (May 2009) • Airbus A330 flight at 10.000m through turbulent weather conditions • Ice crystals trigger malfunction in Pitot air pressure sensors • Stream of air normally used for speed computation • Flight speed instruments shut down, followed by auto pilot • Demanded manual angle and thrust level computation by pilots • Flight computer switches to emergency mode (,alternate law 2‘) • Still operational (,fly-by-wire‘), but different flight feeling • Pilots lost physical control over machine, restart of flight computer not successful • Specification for sensors not feasible (written in 1947) • Sensor must be operational till -40 degrees / 9000m height 12
  • 13. Dependable Systems Course PT 2014 Importance of Dependability for Business • Average costs per hour of downtime (Gartner 1998) • Brokerage operations in finance: $6.5 million • Credit card authorization: $2.6 million • Home catalog sales: $90.000 • Airline reservation: $89.500 • 22-hour service outage of eBay on June 6th 1999 • Interruption of around 2.3 million auctions • 9.2% stock value drop, $3-5 billion of lost revenues • Problems blamed on Sun server software • Banking, telephone system, online games, cloud operations, ... 13
  • 14. Dependable Systems Course PT 2014 Example: New York Stock Exchange • Credit Suisse trading desk, Nov 14th, 2007 • Automated trading algorithm went mad • Reasoned by misplaced double click in the trading software UI • Hundreds of thousands of bogus order cancellations • DoS nature of fault caused NYSE to freeze up - $150.000 fine 14
  • 15.
  • 16. Dependable Systems Course PT 2014 Example: Amazon Cloud Outages • 2006 - Elevated levels of authenticated requests from multiple users • Request volumes are monitored, but cryptographic overhead was not considered • Overload on authentication services - performs request processing and validation • S3 service (cloud storage service) became inaccessible, but is the only storage facility supported for Amazon-hosted virtual machines • Took 3,25 hours to solve the problem - live fixing, started at 3am • No SLA promises ever, but trust in cloud-only hosted web sited dropped dramatically • 2009 - 19 hours of outage for bitbucket.org • Code hosting service, relies on EC2 and block storage • UDP / TCP flooding attack influenced also access to storage facility 16
  • 17. Hello, A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data. What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility… If you have no need for this snapshot, please delete it to avoid incurring storage charges. We apologize for this volume loss and any impact to your business. Sincerely,
 Amazon Web Services, EBS Support
  • 18. Dependable Systems Course PT 2014 Example: IBM S/390 Outage Sources • Examination of 16 managed mainframe systems over 6 months (Pfister, 1990) • 92% planned outages, 8% unplanned outages • 32% database backup • 24% database reorganization • 14% system software maintenance • 12% network • 8% applications • 8% hardware • 2% other 18
  • 19. Dependable Systems Course PT 2014 Tradeoffs 19 0.99999 0.9999 0.999 0.99 0.9 Massively Parallel/ Distributed Systems Commercial Fault-Tolerant Systems Ultra Reliable Systems Availability Throughput (MIPS)
  • 20. Dependable Systems Course PT 2014 Example: AT&T Electronic Switching System ESS1A • Downtime for entire system not more than 2 hours over 40 years of life • System outage < 3 minutes / year • 100% availability from user perspective • Goal achieved, two minutes of down time [Malek] • 24 sec hardware faults, 18 sec software problems, 
 36 sec procedural errors, 42 sec recovery problems • Full duplication of all critical components • Automated error detection on hardware and software level • Replication checks - duplex system with comparison
 on every cycle • Timing checks, coding checks, internal checks with comparators 20
  • 21. Dependable Systems Course PT 2014 Example: AT&T Electronic Switching System ESS1A 21 (C) Malek
  • 22. Dependable Systems Course PT 2014 Example: Boeing 777 Flight Controller 22 • Fly-by-wire system, requirements • Probability 1x10-10 for degradation 
 from ,MIN-OP configuration‘ • ,Never give up‘ redundancy 
 management strategy • Mean time between 
 maintenance actions: 25.000 operating hours • TMR for all hardware resources, fail-passive electronics • Candidate architectures were evaluated according to many trade aspects
  • 23. Dependable Systems Course PT 2014 Example: Boeing 777 Flight Controller (Buus et al.) 23
  • 24. Dependable Systems Course PT 2014 Example: VAX Hardware Redundancy • VMS operating system on VAX hardware, since mid-1970‘s, outgrow of PDP-11 • VAXft in the early 1990‘s 24
  • 25. Dependable Systems Course PT 2014 IBM zEnterprise 25 © 2011 IBM Corporation16 IBM zEnterprise 16 Designed to deliver continuous availability at the application level Single System z Multiple System z Site 1 Site 2 Multiple Data Centers Where mean time between failure is measured in decades Designed for application availability of 99.999% Industry-leading solution for disaster recovery  Avoiding the cost of downtime  Ensuring access to critical applications  Maintaining productivity of users  Open to clients 24/7 Manage Risk with System z Resiliency Availability built in across the system (C) IBM
  • 26. Dependable Systems Course PT 2014 IBM zEnterprise 26 © 2011 IBM Corporation56 IBM zEnterprise Scheduled(CIE+DisruptivePatches+ECs) Planned- (MES+Driver Upgrades) Unscheduled(UIRA) Sources of Outages Pre z9 -Hrs/Year/Syst- Prior Servers z9 EC Z10 EC z196 Unscheduled Outages Scheduled Outages Planned Outages Preplanning requirements Power & Thermal Management Increased Focus over time             Temperature = Silicon Reliability Worst Enemy Wearout = Mechanical Components Reliability Worst Enemy. System z overall RAS Strategy …..Continuing our RAS focus helps avoid outages ImpactofOutage (C) IBM
  • 27. Dependable Systems Course PT 2014 IBM zEnterprise 27 © 2011 IBM Corporation14 IBM zEnterprise The potential performance impact of the Linux server farm is isolated from the other Logical partitions (LPAR) LPAR – Up to 60 Logical partitions System z Deploy virtual servers in seconds Highly granular resource sharing (<1%) Add physical resources without taking system down, scale out to 1000s of virtual servers Do more with less: More virtual servers per core, Share more physical resources across servers Extensive built-in facilities for virtual server life- cycle management Hardware-enforced isolation Distributed Platforms Limited per-core virtual server scalability Physical server sprawl is needed to scale Operational complexity increases as virtual server images grow VMware tools only support VMware hypervisor (ESX) PR/SM Hypervisor in firmware z/VM – 100s of virtual servers – Shared Memory Hardware support: 10% of circuits are used for virtualization Each LPAR is one Virtual Mainframe Extreme Virtualization in zEnterprise - Built into the architecture not an “add on” feature (C) IBM
  • 28. Dependable Systems Course PT 2014 Dependability Examples • A segmentation fault can lead to a system failure. • A faulty controller in an RAID array lowers the reliability of the storage facility. • The measurable availability of a web server depends on the fault tolerance capabilities in all its components. • A database fault creates an erroneous state in the authentication component, which can lead to a system failure influencing the system confidentiality. • The integrity of electronic voting systems depends on the complete absence of failures in the voting and the calculation procedures. • Fault forecasting can support the maintainability of a system. • Testing can only prove the presence of faults, not their absence. • Our service level agreement guarantees an availability of 99.99%. 28
  • 29. Dependable Systems Course PT 2014 Consequences of Failures [Knight] • Human injury or loss of life • Damage to the environment • Damage to or loss of equipment • Damage to or loss of data • Financial loss by theft • Financial loss through production of useless or defective products • Financial loss through reduced capacity for production or service • Loss of business reputation • Loss of customer base • Loss of jobs 29
  • 30. Dependable Systems Course PT 2014 Dependability Concerns [Knight] • Do I know all available techniques to prevent some damage in / by my system ? • I don‘t know how the system is used, so how can I decide for an appropriate dependability mechanism ? • How do I get an accurate and complete set of term definitions to communicate reliably with the other engineers ? • What is a technology to tackle all weak points in my system architecture, not only the weakest or the obvious ones ? • How can I measure the effectiveness of my dependability techniques ? • Software is special ... • Definition of the required function is harder than in other engineering disciplines • Software needs to take action when hardware fails • Software may have to operate on platforms whose design was influenced by dependability goals 30
  • 31. Dependable Systems Course PT 2014 Course Topics • Definitions and metrics • Fault, error, failure, fault models, fault tolerance, fault forecasting, ... • Reliability, availability, maintainability, error mitigation, damage confinement, ... • Reliability engineering, reliability testing, MTTF, MTBF, failure rate, ... • Fault tolerance patterns • N-out-of-M, voting, heartbeat, ... • Analytical evaluation • Reliability block diagrams, fault tree analysis, ... • Petri nets, Markov chains • Root cause analysis, risk analysis and assessment 31
  • 32. Dependable Systems Course PT 2014 Course Topics • Hardware-Implemented Dependability • Failure rates, testing, hardware redundancy approaches • Software-Implemented Dependability • Fault-tolerant programming: simplex, recovery blocks, checkpointing, roll-back, ... • Testing, software metrics • Fault-tolerant distributed systems: Transactions, consensus, clocks, ... • Advanced approaches • Autonomic computing, rejuvenation, online failure prediction, performability • Case studies from practice 32