Software safety in embedded systems & software safety why, what, and how

Software Safety in Embedded Systems
&
Software Safety: Why, What, and How
– Leveson
UC San Diego
CSE 294
Spring Quarter 2006
Barry Demchak

Previous Paper
 System Safety in Computer-Controlled Automotive
Systems – Leveson (2000)
 Types of accidents
 Safeware Methodology
 Project Management
 Software Hazard Analysis
 Software Requirements Specification & Analysis
 Software Design & Analysis
 Design & Analysis of Human-Machine Interaction
 Software Verification
 Feedback from Operational Experience
 Change Control and Analysis

Roadmap
 Safety definitions
 Industrial safety and risk
 Systems Issues – hardware and software
 Software Safety
 Analysis and Modeling
 Verification and Validation
 System Safety Engineering

Safety Before Computers
 NASA: 10^-9 chance of failure over a 10-hour
flight
 British nuclear reactors: no single fault can
cause a reactor to trip, and 10^-7 chance over
5000 hours of failure to meet a demand to trip
 FAA: 10^-9 chance per flight hour (i.e., not
within total life span of entire fleet)

Introduction of Computers
 Nuclear Power Plants
 Space Shuttle
 Airbus Aircraft
 Space Satellites
 NORAD
 Purpose: perform functions that are too
dangerous, quick, or complex for humans

System Safety (def.)
 Subdiscipline of systems engineering
 Applies scientific, management, and
engineering principles
 Ensures adequate safety throughout the
system life cycle
 Constrained by operational effectiveness,
time, and cost
 MilSpec: “freedom from those conditions that
can cause death, injury, occupational illness,
or damage to or loss of equipment or
property”

More Definitions
 Accident
 Unwanted and unexpected release of energy
 Mishap (or failure)
 Unplanned event or series of events
 Death, injury, occupational illness, damage, or
loss of equipment or property, or
environmental harm
 Hazard
 A condition that can lead to a mishap

More Definitions (cont’d)
 Risk
 Probability of a hazardous state occurring
 Probability of a hazardous state leading to a
mishap
 Perceived severity of the worst potential
mishap that could result from a hazard
 Hazard probability
 Hazard criticality (severity)
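
The three risk factors above can be combined into a single number for ranking hazards. The sketch below multiplies them, which is an assumption: the slide lists the factors but does not say how to combine them. The `Hazard` class, the hazard names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    p_hazard: float  # probability that the hazardous state occurs
    p_mishap: float  # probability that the hazardous state leads to a mishap
    severity: float  # perceived severity of the worst mishap (arbitrary units)

    def risk(self) -> float:
        # Assumed combination rule: simple product of the three factors.
        return self.p_hazard * self.p_mishap * self.severity


# Hypothetical hazards with made-up numbers, for illustration only.
hazards = [
    Hazard("sensor drift", p_hazard=1e-2, p_mishap=0.01, severity=10.0),
    Hazard("valve stuck open", p_hazard=1e-3, p_mishap=0.5, severity=100.0),
]

# Rank hazards from highest to lowest risk.
ranked = sorted(hazards, key=Hazard.risk, reverse=True)
```

With these numbers the rarer hazard ranks first, because its higher chance of leading to a mishap and its higher severity outweigh its lower probability of occurring.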

Early Approach
 Operational or Industrial Safety
 Examining system during operating life
 Correcting unacceptable hazards
 Ignores crushing effect of single catastrophe
 Assumptions
 All faults caused by human errors could be
avoided completely or located and removed
prior to delivery and operation
 Relatively low complexity of hardware

Ford Pinto (early 1970s)
 Specifications: 2000 pounds, $2000 sale price
 Use existing factory tooling
 Safety issue with gas tank placement
 Analysis
 Deaths cost $200,000, burns cost $67,000
 Cost to make change $137M, benefit $49M
 Ford engineer: “But you miss the point entirely. You
see, safety isn't the issue, trunk space is. You have
no idea how stiff the competition is over trunk space.”
 Ford president: “Safety doesn’t sell”
 Verdict: $100M

Anecdotes
 Safety devices themselves have been
responsible for losses or increasing chances
of mishaps
 Redundancy sometimes degrades safety
 Unrelated (but related) systems cause errors

Later Approach
 System Safety
 Design acceptable safety level before actual
production or operation
 Optimize safety by applying scientific and
engineering principles to identify and control
hazards through analysis, design, and
management procedures
 Hazard analysis identifies and assesses
 Criticality level of hazards
 Risks involved in system design

Later approach (cont’d)
 Assumptions
 Complexity of software and hardware
interaction causes non-linear increase in
human-error-induced faults
 Impossible to demonstrate safety ahead of
usage
 Complexity and coupling are covariant

Hardware vs Software
 Hardware
 Widgets have long history of use and fault
analysis … highly responsive to redundant
techniques
 Infinite number of stable states
 Software
 No history with software … reuse is rare
 Large number of discrete states without
repetitive structure
 Difficult to test under realistic conditions

More Systems Issues
 Difficult to specify completely – what it does,
and what it does not do
 Cannot identify misunderstandings about
requirements
 Engineers assume perfect execution
environments, don’t consider transient faults
 Lack of system-level methods and viewpoints

Even Bigger Systems Issues
 Specifying and implementing individual
components is not the same as specifying and
implementing the interactions between them
 Between-component interactions grow
exponentially and are often underrepresented
in analyses
 Components include
 Software and components
 Hardware
 Human operators

Still Bigger Systems Issues
 More Components
 Development Methodologies
 Source code maintenance
 Verification/Validation Methodologies
 Stakeholder Values
 Management
 Individual Programmers
 Customer
 Human Users
 Suppliers

Definitions
 Reliability
 Probability that system will perform intended
function
 Safety
 Probability that hazard will not lead to a
mishap
 Reliability = failure free
 Safety = mishap free
 Reliability and Safety often conflict

Safety
 Studied separately from security, reliability, or
availability
 Separation of concerns
 Safety requirements are identified and
separated from operational requirements
 Conflicts resolved in a well-reasoned manner

Definitions
 System
 Sum total of all component parts
 Software is only a part, and its correctness
exists only in relation to other system
components

Software Safety
 Ensures software will execute within a system
context without resulting in unacceptable risk
 Safety-critical software functions
 Directly or indirectly allow a hazardous system
state to exist
 Safety-critical software
 Contains safety-critical functions

System Characteristics
 Inputs and outputs over time
 Control subsystem
 Description of function to be performed
 Specification of operating constraints (quality,
capacity, process, and safety)
 Safety constraints are hazards rewritten as
constraints
 Safety constraints written, maintained, and
audited separately

Constraints, Requirements, Design

Analysis and Modeling
 Preliminary Hazard Analysis (PHA)
 Subsystem Hazard Analysis (SSHA)
 System Hazard Analysis (SHA)
 Operating and Support Hazard Analysis
(OSHA)
 Safeware – Leveson

Hazard Analysis
 Start with list of identifiable hazards
 Work backward to discover combination of
faults that produce the hazard
 Categorization
 Frequent
 Occasional
 Reasonably remote
 Remote
 … physically impossible
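
Working backward from a hazard to the fault combinations that produce it can be sketched as a small fault tree. The code below is a minimal illustration, not the method from the paper: the tuple representation, the `cut_sets` function, and the fault names are all hypothetical.

```python
from itertools import product

# A node is either a basic fault (a string) or a gate: ("AND" | "OR", [children]).


def cut_sets(node):
    """Return the minimal fault combinations (frozensets) that produce `node`."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's combination is enough.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # Need one combination from every child at the same time.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    # Keep only minimal sets: drop any set that strictly contains another.
    return {s for s in combined if not any(other < s for other in combined)}


# Hypothetical hazard: tank overpressure.
hazard = ("AND", [
    ("OR", ["relief valve stuck", "open command dropped"]),
    ("OR", ["pressure sensor reads low", "open command dropped"]),
])

minimal = cut_sets(hazard)
```

Here `minimal` holds two combinations: the single fault "open command dropped", and the pair "relief valve stuck" with "pressure sensor reads low". The single-fault set is the kind of single-point failure this analysis is meant to expose.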

Hazard Examples (Nuclear Weapons)
 Inadvertent nuclear detonation
 Inadvertent prearming, arming, launching,
firing, or releasing
 Deliberate prearming, arming, launching,
firing, or releasing under inappropriate
conditions

Software Requirement Analysis
 Hard to do
 Cubby-hole mentality
 Rarely includes what the system should not
do
 Techniques
 Fault Tree Analysis (FTA)
 Real Time Logic (RTL)
 Petri nets

Real Time Logic
 Model the system in terms of events and
actions (both data dependency and temporal
ordering)
 Generate predicates
 Determine whether a safety assertion is a
theorem derivable from the model
 Inherently unsafe means that the assertion
is unsatisfiable with respect to the model

Time Petri Nets
 Mathematical modeling of discrete event
systems in terms of conditions and events
and the relationship between them
 Facilitates backward analysis
 Points to failures and faults which are
potentially most hazardous
 Nontrivial to build and maintain
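
A net of conditions and events can be explored mechanically to ask whether a hazardous state is reachable. The sketch below is a simplification: it uses an untimed net with at most one token per place, and it searches forward from the initial marking rather than backward from the hazard. The railway-crossing places and transitions are hypothetical.

```python
from collections import deque


def reachable(initial, transitions):
    """Return every marking reachable from `initial` (breadth-first search)."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        for _name, inputs, outputs in transitions:
            if inputs <= marking:  # transition is enabled
                nxt = frozenset((marking - inputs) | outputs)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def is_hazardous(marking):
    # Hazard: train in the crossing while the gate is still up.
    return {"train_in_crossing", "gate_up"} <= marking


start = frozenset({"train_far", "gate_up"})

# Each transition: (name, places consumed, places produced).
UNSAFE = [
    ("train_approaches", {"train_far"}, {"train_near"}),
    ("gate_lowers", {"gate_up", "train_near"}, {"gate_down", "train_near"}),
    ("train_enters", {"train_near"}, {"train_in_crossing"}),
]

# Same net, but the train may enter only once the gate is down.
SAFE = [
    ("train_approaches", {"train_far"}, {"train_near"}),
    ("gate_lowers", {"gate_up", "train_near"}, {"gate_down", "train_near"}),
    ("train_enters", {"train_near", "gate_down"}, {"train_in_crossing", "gate_down"}),
]

unsafe_hit = any(is_hazardous(m) for m in reachable(start, UNSAFE))
safe_hit = any(is_hazardous(m) for m in reachable(start, SAFE))
```

In the first net nothing forces the gate to lower before the train enters, so the hazardous marking is reachable; adding `gate_down` as a precondition removes it. A timed net would express the same requirement as a timing constraint instead of a structural one.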

Research Question
 What is the place of these analysis
techniques in an agile development
environment??

Safety Verification and Validation
 Showing that a fault cannot occur
 Showing that if a fault occurs, it is not
dangerous
 Only as good as the specifications
 Specifications are usually incomplete, and
hardware specifications are rare

 Methodologies
 Proofs of adequacy
 Software Fault Tree (proofs of fault tree
analyses)
 Determine safety requirements
 Detect software logic errors
 Identify multiple failure sequences involving
different parts of the system
 Inform critical runtime checks
 Inform testing

 Methodologies
 Nuclear Safety Cross Check Analysis
(NSCCA)
 Demonstrate that software will not contribute to a
nuclear mishap
 Multiple technical analyses demonstrate
adherence to specifications
 Demonstrate security and control measures
 A lot of qualitative judgment regarding criticality
 Software Common Mode Analysis
 Sneak Software Analysis

Safety Analysis – Quantitative
 Requires statistical histories which may not
exist
 Applies mostly to physical systems
 Single-valued Best Estimate
 Information sufficient for determinate models
 Probabilistic
 Science is understood, but limited parameters
available
 Bounding
 Putting a ceiling on the answer

System Safety Engineering
 Identify hazards
 Assess hazards (likelihood and criticality)
 Design to eliminate or control hazards
 Assess risks that cannot be eliminated or
controlled

Failure Mode Definitions
 Fail-safe
 Default is safe mode, no attempt to execute
operational mission
 Fail-operational
 Default is to correct fault and continue with
operational mission
 Fail-soft
 Default is to continue with degraded
operations
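
The three failure modes can be read as three default responses to the same fault. The sketch below assumes a hypothetical motor controller; the `FailurePolicy` enum, the `respond_to_fault` function, and the 50% speed limit in degraded mode are illustrative choices, not part of the source.

```python
from enum import Enum, auto


class FailurePolicy(Enum):
    FAIL_SAFE = auto()         # drop to a safe mode, abandon the mission
    FAIL_OPERATIONAL = auto()  # correct the fault, continue the mission
    FAIL_SOFT = auto()         # continue with degraded operations


def respond_to_fault(policy: FailurePolicy, speed: float) -> tuple[str, float]:
    """Return (mode, commanded speed) for a hypothetical motor after a fault."""
    if policy is FailurePolicy.FAIL_SAFE:
        return ("safe", 0.0)
    if policy is FailurePolicy.FAIL_OPERATIONAL:
        # Assumes the fault was corrected, e.g. by switching to a spare unit.
        return ("nominal", speed)
    if policy is FailurePolicy.FAIL_SOFT:
        # The 50% limit is an arbitrary assumption for this sketch.
        return ("degraded", speed * 0.5)
    raise ValueError(policy)
```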

Designing for Safety
 Not possible to ensure safety by analysis or
verification alone
 Analysis and verification may be cost-prohibitive
 Different standard hierarchy
 Intrinsically safe
 Prevents or minimizes occurrence of hazards
 Controls the hazard
 Warns of presence of hazard

Safety Design Mechanisms
 Lockout device
 Prevents event from occurring when hazard is
present
 Lockin device
 Maintains an event or condition
 Interlock device
 Ensures operations occur in the correct sequence
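
In software, the three devices above can be approximated as guards on state changes. The sketch below assumes a hypothetical irradiation chamber with a door and a beam; the `Chamber`, `Interlock`, and `SafetyViolation` names are illustrative only.

```python
class SafetyViolation(RuntimeError):
    """Raised when a safety device refuses an action."""


class Chamber:
    """Hypothetical chamber with a door and a beam."""

    def __init__(self):
        self.door_closed = False
        self.beam_on = False

    def close_door(self):
        self.door_closed = True

    def open_door(self):
        # Lockin: hold the door-closed condition for as long as the beam is on.
        if self.beam_on:
            raise SafetyViolation("lockin: door stays closed while beam is on")
        self.door_closed = False

    def start_beam(self):
        # Lockout: prevent the event while the hazard (open door) is present.
        if not self.door_closed:
            raise SafetyViolation("lockout: beam blocked while door is open")
        self.beam_on = True

    def stop_beam(self):
        self.beam_on = False


class Interlock:
    """Accepts steps only in one fixed order."""

    def __init__(self, steps):
        self._steps = tuple(steps)
        self._position = 0

    def perform(self, step):
        if self._position < len(self._steps):
            expected = self._steps[self._position]
        else:
            expected = None  # sequence already complete
        if step != expected:
            raise SafetyViolation(f"interlock: expected {expected!r}, got {step!r}")
        self._position += 1
```

For example, `Interlock(["close_door", "start_beam"])` rejects `perform("start_beam")` until `perform("close_door")` has succeeded.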

Safety Design Principles
 Provide leverage for certification
 Avoid complexity where possible
 Reduce risk by reducing hazard likelihood, or
severity, or both
 Modularize to separate safety-critical
functions from non-critical functions
 Execute safety-critical functions under
separate authority
 Fail on a single-point failure

Safety Design Principles (cont’d)
 Start out in safe state, and take affirmative
actions to reach higher risk states
 Check critical flags as close as possible to
actions they protect
 Avoid complements: absence of “armed” is not
“safe”
 Use “true” values to indicate safety … “false”
values can result from common hardware
failures
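
One way to read these principles in code: give "safe" and "armed" their own explicit values, treat everything else as invalid, and test the flag right at the protected action. The sketch below is an assumption-laden illustration; the `Device` class and the byte patterns `0xA5` and `0x3C` are hypothetical, chosen only so that a stuck-at-zero or stuck-at-one byte matches neither state.

```python
SAFE = 0xA5   # assumed pattern; neither all zeros nor all ones
ARMED = 0x3C  # assumed pattern; neither all zeros nor all ones


def decode(status: int) -> str:
    """Classify a status byte; anything unrecognized is invalid, never 'safe'."""
    if status == SAFE:
        return "safe"
    if status == ARMED:
        return "armed"
    return "invalid"


class Device:
    def __init__(self):
        self.status = SAFE  # start out in the safe state

    def arm(self):
        self.status = ARMED  # affirmative action to reach the higher-risk state

    def fire(self) -> bool:
        # Check the critical flag immediately before the action it protects.
        if decode(self.status) != "armed":
            return False
        return True  # the protected action would happen here
```

With this encoding a cleared register (`0x00`) or a floating bus (`0xFF`) decodes as "invalid", so neither can be mistaken for "safe" or "armed".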

Safety Design Principles (cont’d)
 Detection of unsafe states
 Watchdog timer
 Independent monitors
 Asserts and exception handlers
 Use backward recovery (return system to safe
state) instead of forward recovery (plow
ahead)
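
A watchdog timer paired with backward recovery can be sketched as follows. This is a minimal software-only illustration under stated assumptions: real watchdogs are usually hardware timers, and the `Watchdog`, `Controller`, and `supervise` names, the injectable `clock`, and the state dictionary are all hypothetical.

```python
import time


class Watchdog:
    """Expires unless kick() is called at least once per timeout period."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout = timeout_s
        self._clock = clock  # injectable so the sketch can be tested without waiting
        self._last_kick = clock()

    def kick(self):
        self._last_kick = self._clock()

    def expired(self):
        return self._clock() - self._last_kick > self._timeout


class Controller:
    def __init__(self, safe_state):
        self._safe_state = dict(safe_state)
        self.state = dict(safe_state)  # start out in the safe state

    def recover(self):
        # Backward recovery: return to the known safe state.
        self.state = dict(self._safe_state)


def supervise(watchdog, controller):
    """Run by an independent monitor; returns True if recovery was triggered."""
    if watchdog.expired():
        controller.recover()
        watchdog.kick()
        return True
    return False
```

The monitored task calls `kick()` on every healthy cycle. If it hangs, `supervise` notices the missed deadline and restores the safe state rather than letting the controller continue from an unknown one.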

Human Factors
 Define partnership between human and
computer
 Avoid complacency
 Avoid confusion
 Avoid passive monitoring

Conclusion
 Select suite of techniques and tools spanning
entire software development process
 Apply them conscientiously, consistently,
and thoroughly
 Consider implementation tradeoffs
 Low catastrophe, high cost alternatives
 Moderate catastrophe, moderate cost
alternatives
 High catastrophe, low cost alternatives

Take Home Messages
 Safety is a system issue – in the large sense
 Software engineering techniques can
contribute to system safety – in both a narrow
and broad context
 Acceptable risk is king, and determining and
executing it is hard
