A presentation about the steps required for verifying and validating safety-critical systems, as well as the test approach used. It goes beyond the bare processes, and also covers the required safety culture and the kind of people needed. The presentation contains examples of real-life IEC 61508 SIL 4 systems used on storm surge barriers.
5. Some people live on the edge…
How would you feel if you were getting
ready to launch and knew you were
sitting on top of two million parts -- all
built by the lowest bidder on a
government contract?
John Glenn
7. We even accept loss...
• Lost/misdirected luggage: chance of failure 10⁻² per suitcase
• Airplane: chance of crash 10⁻⁸ per flight hour
• Storm surge barrier: chance of failure 10⁻⁷ per usage
• Nuclear power plant: As Low As Reasonably Practicable (ALARP)
9. To put things in perspective…
• Getting killed in traffic: 10⁻² per year
• Having a drunk pilot: 10⁻² per flight
• Hurting yourself when using a chainsaw: 10⁻³ per use
• Believing you are possessed by Satan: 10⁻⁴ per lifetime
• Dating a supermodel: 10⁻⁵ per lifetime
• Drowning in a bathtub: 10⁻⁷ per lifetime
• Being hit by falling airplane parts: 10⁻⁸ per lifetime
• Being killed by lightning: 10⁻⁹ per lifetime
• Your house being hit by a meteor: 10⁻¹⁵ per lifetime
12. ...and the odds are against us…
• Capers Jones: at least 2 high-severity errors per 10 KLOC
• Industry consensus is that software will never be more reliable than
– 10⁻⁵ per usage
– 10⁻⁹ per operating hour
13. The value of testing
Program testing can be used to show the
presence of bugs, but never to show
their absence!
Edsger W. Dijkstra
17. IEC 61508: A process for safety critical functions
18. Process or personal commitment?
• The Romans put the architect under the arches when removing the scaffolding
• Boeing and Airbus put all lead engineers on the first test flight
• Dijkstra put his “rekenmeisjes” (the girls who did his calculations) on the opposite dock when launching ships
19. It is about keeping your back straight…
• Thomas Andrews, Jr.
• Naval architect in charge of RMS Titanic
• He recognized the regulations were insufficient for a ship the size of Titanic
• Decisions “forced upon him” by the client:
– Limiting the extent of the double hull
– Limiting the number of lifeboats
• He was on the maiden voyage to spot improvements
• He knowingly went down with the ship, saving as many people as he could
22. Design Principles
• Keep it simple...
• Risk analysis drives design (decisions)
• Safety first (production later)
• Fail-to-safe
• There shall be no single source of (catastrophic) failure
23. A simple design of a storm surge barrier
Relay (€10.00 apiece)
Water detector (€17.50)
Design documentation (sponsored by Heineken)
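In code terms, the naive design on this slide boils down to something like the sketch below. This is purely illustrative: the function names are hypothetical and the real barrier logic is far more involved (and, as the editor's notes point out, this simplicity is also a perfect single point of failure).

```c
#include <stdbool.h>

extern bool water_detector_high(void);  /* the €17.50 water detector */
extern void relay_close_barrier(void);  /* the €10.00 relay          */

/* Naive control loop: one sensor, one relay, no redundancy. */
void control_loop(void)
{
    for (;;) {
        if (water_detector_high()) {
            relay_close_barrier();
        }
    }
}
```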
27. Typical risks identified
• Components making the wrong decisions
• Power failure
• Hardware failure of PLCs/servers
• Network failure
• Ships hitting the water sensors
• Human maintenance error
28. Risk ≠ system crash
• Wrong functional behaviour
• Data accuracy
• Lack of response speed
• Understandability of the GUI
• Tolerance towards illogical inputs
32. Eliminating all risk isn't the goal…
No matter how good the environment analysis is:
• Some scenarios will be missed
• Some scenarios are too expensive to prevent:
– Accept the risk
– Communicate it to stakeholders
33. Risks can be contradictory…
Availability of the service vs. safety of the installation
35. 9/11...
• Really tested our “test abort” procedure
• Introduced a fundamentally new risk to ATC systems
• Changed the ATC system dramatically
• Doubled our test cases overnight
42. Challenge: time and resource limitations
• A 64-bit input isn't that uncommon
• 2⁶⁴ is the global rice production of 1,000 years, measured in individual grains
• Fully testing all binary inputs of a 64-bit stimulus-response system takes about 2 centuries (see the check below)
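A back-of-the-envelope check of that claim, assuming a test rate of roughly 3×10⁹ stimuli per second (the rate is my assumption, not a figure from the talk):

```c
#include <stdio.h>

int main(void)
{
    const double inputs = 18446744073709551616.0;  /* 2^64 possible inputs */
    const double rate   = 3e9;                     /* assumed tests/second */
    const double year   = 365.25 * 24.0 * 3600.0;  /* seconds per year     */

    /* Prints roughly 195 years, i.e. about two centuries. */
    printf("%.0f years\n", inputs / (rate * year));
    return 0;
}
```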
43. Goals of testing safety-critical systems
• Verify contractually agreed functionality
• Verify correct functional safety behaviour
• Verify safety behaviour during degraded and failure conditions
46. Design Validation and Verification
• Peer reviews by
– The system architect
– A 2nd designer
– Programmers
– The test manager for system testing
• Fault Tree Analysis / Failure Mode and Effect Analysis
• Performance modeling
• Static verification / dynamic simulation by Twente University
47. Programming (in C/C++)
• Coding standard:
– Based on “Safer C” by Les Hatton
– May only use a safe subset of the compiler (see the sketch below)
– Verified by Lint and 5 other tools
• Code is peer reviewed by a 2nd developer
• Certified and calibrated compiler
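The editor's notes mention that the code shown in the original slides contained a catastrophic deadlock (it waited forever when the diesels didn't report ready). A minimal sketch of the defensive style such a coding standard enforces, using hypothetical names rather than the actual barrier code:

```c
#include <stdbool.h>

extern bool diesels_ready(void);     /* hypothetical status poll   */
extern void sleep_ms(unsigned ms);   /* hypothetical delay routine */
extern void enter_safe_state(void);  /* the fail-to-safe action    */

/* Never wait forever on an external condition: bound the wait,
 * and fail to safe instead of deadlocking when it expires. */
bool wait_for_diesels(void)
{
    for (unsigned tries = 0u; tries < 600u; tries++) {  /* at most 60 s */
        if (diesels_ready()) {
            return true;
        }
        sleep_ms(100u);
    }
    enter_safe_state();
    return false;
}
```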
48. Unit tests
• Focus on conformance to specifications
• Required coverage: 100% with respect to:
– Code paths
– Input equivalence classes
• Boundary value analysis (see the sketch after this list)
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7
– Creates 100 MB/hour of logs and measurement data
• Upon bug detection
– 3 strikes is out: after 3 implementation errors, the unit is rebuilt by another developer
– 2 strikes is out: needing a 2nd rebuild implies a redesign by another designer
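A minimal sketch of what equivalence-class and boundary-value cases look like in practice. The function under test and its 300 cm threshold are hypothetical, not taken from the barrier software:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical unit under test: should the barrier close at this
 * water level (cm above normal)? Spec: close at >= 300. */
static bool must_close(int level_cm) { return level_cm >= 300; }

int main(void)
{
    /* Equivalence classes: clearly safe vs. clearly dangerous */
    assert(!must_close(0));
    assert( must_close(500));

    /* Boundary values around the 300 cm threshold */
    assert(!must_close(299));
    assert( must_close(300));
    assert( must_close(301));

    /* Extremes of the input domain */
    assert(!must_close(-100));
    assert( must_close(10000));
    return 0;
}
```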
50. Integration testing
• Focus on
– Functional behaviour of a chain of components
– Failure scenarios based on the risk analysis
• Required coverage
– 100% coverage of input classes
• Probabilistic testing (sketched below)
• Execution:
– Fully automated scripts, running 24x7 at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
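One way to set up probabilistic testing so it stays compatible with root-cause analysis is to log the random seed, making every failing run exactly reproducible. A sketch under those assumptions (the scenario driver is a stub, not the real component chain):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stub standing in for the real component-chain driver. */
static int run_scenario(unsigned water_cm, unsigned wind_kts)
{
    (void)water_cm; (void)wind_kts;
    return 0;  /* the real driver stimulates the chain and checks responses */
}

int main(int argc, char **argv)
{
    /* Log the seed so any failure can be replayed exactly. */
    unsigned seed = (argc > 1) ? (unsigned)atoi(argv[1]) : 12345u;
    srand(seed);
    printf("seed=%u\n", seed);

    for (unsigned i = 0u; i < 1000000u; i++) {
        unsigned water = (unsigned)(rand() % 1000);  /* 0..999 cm  */
        unsigned wind  = (unsigned)(rand() % 150);   /* 0..149 kts */
        if (run_scenario(water, wind) != 0) {
            printf("FAIL at iteration %u (water=%u, wind=%u)\n", i, water, wind);
            return 1;
        }
    }
    return 0;
}
```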
51. Redundancy is a nasty beast
• You do get correct functional behaviour from your system as a whole
• But it is nearly impossible to see whether all your individual components are working correctly
52. System testing
• Focus on
– Functional behaviour
– Failure scenarios based on the risk analysis
• Required coverage
– 100% complete environment (simulation)
– 100% coverage of input classes
• Execution:
– Fully automated scripts, running 24x7 at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
53. Acceptance testing
• Three stages:
1. Functional acceptance
2. Failure behaviour: all top-50 (FMECA) risks tested
3. A year of operational verification
• Execution:
– Tests performed on the working storm surge barrier
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
54. Endurance testing
• Look for the “one in a million times” problem
• Challenge:
– Software is deterministic
– Execution is not (timing, system load, bit errors)
• Have an automated script run it over and over again (sketched below)
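A minimal POSIX sketch of such a harness. The scenario runner is a stub, and the millisecond jitter is my own addition to illustrate why rerunning deterministic software can still surface new failures:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int run_once(void) { return 0; }  /* stub for one scenario run */

int main(void)
{
    srand((unsigned)time(NULL));
    for (unsigned long i = 0; i < 10000000UL; i++) {
        /* The code is deterministic, but timing and load are not:
         * vary the start moment slightly on every iteration. */
        struct timespec jitter = { 0, (rand() % 1000) * 1000L };  /* 0..1 ms */
        nanosleep(&jitter, NULL);
        if (run_once() != 0) {
            fprintf(stderr, "failure on run %lu\n", i);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
```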
55. GUI Acceptance testing
• Looking for
– Quality in use for interactive systems
– Understandability of the GUI
• Structured investigation of the performance of the human-system interaction
• Looking for “abuse” by the users
• Looking at real-life handling of emergency operations
56. Avalanche testing
• Tests the capabilities of alarming and control
• Usually starts with one simple trigger
• Generally followed by millions of alarms
• Generally brings your network and systems to the breaking point
57. Crash and recovery procedure testing
• Validation of system behaviour after a massive crash and restart
• Usually identifies many issues with emergency procedures
• Sometimes identifies issues around the power supply
• Usually identifies some (combinations of) systems incapable of unattended recovery...
59. A risk analysis of testing itself
• There should always be a way out of a test procedure
• Some things are too dangerous to test
• Some tests introduce more risks than they try to mitigate
60. Root-cause analysis
• A painful process, by design
• Extremely thorough
• Assumes that the error found is a symptom of an underlying collection of (process) flaws
• Searches for the underlying causes of the error, and looks for possible similar errors that might have followed a similar path
68. It requires a specific breed of people
The fates of developers and testers are linked to their safety-critical systems for eternity
69. Conclusions
• Stop reading newspapers
• Safety-critical testing is a lot of work, making sure nothing happens
• Technically it isn't that much different; we're just more rigorous and use a specific breed of people....
Editor's Notes
Copyright CIBIT Adviseurs|Opleiders 2005. Jaap van Ekris, safety-critical systems. Fields of work: nuclear power plants, air traffic control, storm surge barriers. Errors cost many human lives.
Glenn's advantage was that it only had to work once... And those were the sixties (you could still get away with that back then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
When I started my career, my mentor told me: “From now on, your goal is to stay off the front page of the newspapers.” I can tell you it is hard, but so far I’ve succeeded.
Please note that these failure rates include electromechanical failure as well! Electrocution by a light switch: chance of 10⁻⁵ per usage, which is exactly the chance of dating a supermodel as well.
Please note that pilots have a redundant counterpart….
But we still (unknowingly) live in that world.....
FTA and FMEA are opposites, and good cross-checks of each other (NASA). Although NASA doesn't have a flawless track record….
Aqueduct of Segovia, Iberian Peninsula. Built by the Romans between 50 AD and 125 AD. The architects over-engineered it so heavily that it is still standing 2,000 years later. People have to realize that the commitment of people to get it right the first time is essential. At Eurocontrol, we mentioned a projected death toll on every bug.
Goal: it may fail only once every 10,000 years.
You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know what each component contributes. You take the most important 10, or 100, and take targeted measures.
Typical security: hard on the outside, soft as butter on the inside.
The perfect “single point of failure”.
Once we take deadlocks and redundancy into account, our picture looks like this: not really simple anymore……
There is a bug in this one: this code is NOT fail-safe because it has a potential catastrophic deadlock (when the Diesels don’t report Ready).....
Please be reminded: the presented code has a deadlock!
Do you know the difference between validation and verification? Validation = meets external expectations, does what it is supposed to do. Verification = meets internal expectations, conforms to specs.
Funny example: printing screen....
Most beautiful example: UPSes using too much power to charge, killing all the fuses.... Current example: found out that the identity management server was a single point of failure.... Eurocontrol example: the control unit wasn’t ready for the CWPs, and after that got overloaded.
This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
Our successes are unknown, our failures make the headlines…. When a system fails in production, it is actual blood on our hands. At Eurocontrol, each bug had a body count attached to it.....