Presentation about the steps required for verifying and validating safety-critical systems, and the test approach used. It goes beyond the bare processes and also covers the safety culture and the kind of people required. The presentation includes examples from real-life IEC 61508 SIL 4 systems used on storm surge barriers.
4. Agenda
• The Goal
• The requirements
• The challenge
• Go with the process flow
– Development Process
– System design
– Testing Techniques
• Trends
• Reality
8. Some people live on the edge…
How would you feel if you were getting
ready to launch and knew you were
sitting on top of two million parts
-- all built by the lowest bidder on a
government contract.
John Glenn
12. Until it is too late…
• February 1st, 1953
• Spring tide and heavy
winds broke the dykes
• Killed 1,836 people
and 30,000 animals
13. The battle against flood risk…
• Cost: €2,500,000,000
• The largest moving
structure on the
planet
• Defends
– 500 km² of land
– 80,000 people
• Partially controlled
by software
14. Nothing is flawless, by design…
No matter how good the
design is:
• Some scenarios will be
missed
• Some scenarios are
too expensive to
prevent:
– Accept risk
– Communicate to stakeholders
15. When is software good enough?
• Dutch Law on
storm surge
barriers
• Equalizes risk
of dying due
to unnatural
causes across
the Netherlands
16. Risks have to be balanced…
Availability of the service VS. Safety of the service
17. Oosterschelde Storm Surge Barrier
• Chance of
– Failure to close: 10⁻⁷
per usage
– Unexpected closure:
10⁻⁴ per year
18. To put things in perspective…
• Having a drunk pilot: 10⁻² per flight
• Hurting yourself when using a chainsaw: 10⁻³ per use
• Dating a supermodel: 10⁻⁵ in a lifetime
• Drowning in a bathtub: 10⁻⁷ in a lifetime
• Being hit by falling airplane parts: 10⁻⁸ in a lifetime
• Being killed by lightning: 10⁻⁹ in a lifetime
• Winning the lottery: 10⁻¹⁰ in a lifetime
• Your house being hit by a meteor: 10⁻¹⁵ in a lifetime
• Winning the lottery twice: 10⁻²⁰ in a lifetime
24. The industry statistics are against us…
• Capers Jones: at least 2 high-severity
errors per 10 KLoC
• Industry consensus is that software
will never be more reliable than
– 10⁻⁵ per usage
– 10⁻⁹ per operating hour
26. The value of testing
Program testing can be used to show the
presence of bugs, but never to show
their absence!
Edsger W. Dijkstra
27. Is just testing enough?
• A 64-bit input isn’t that
uncommon
• 2⁶⁴ is the global rice
production in 1,000 years,
measured in individual
grains
• Fully testing all binary
inputs of a simple 64-bit
stimulus–response system
once takes 2 centuries
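A quick back-of-the-envelope check of that claim. This is a sketch: the test rate of 3 billion cycles per second is an assumption, chosen to roughly match the "2 centuries" figure on the slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

inputs = 2 ** 64          # every possible 64-bit stimulus
rate = 3e9                # assumed: 3 billion stimulus/response cycles per second
years = inputs / rate / SECONDS_PER_YEAR  # comes out to roughly two centuries
```

Even at a billion tests per second the exhaustive run would take centuries longer, which is the point: exhaustive testing is not an option.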
31. IEC 61508: A process for safety critical functions
32. SYSTEM DESIGN
What do safety critical systems look like and what are their most important drivers?
33. Design Principles
• Risk analysis drives design (decisions)
• Safety first (production later)
• Fail-to-safe
• There shall be no single source of
(catastrophic) failure
39. Typical risks identified
• Components making the wrong decisions
• Power failure
• Hardware failure of PLCs/servers
• Network failure
• Ship hitting water sensors
• Human maintenance error
40. Risk ≠ system crash
• Understandability of
the GUI
• Incorrect functional
behaviour
• Data accuracy
• Lack of response speed
• Tolerance of
illogical inputs
• Resistance to hackers
53. Design Validation and Verification
• Peer reviews by
– System architect
– 2nd designer
– Programmers
– Test manager (system testing)
• Fault Tree Analysis / Failure Mode and Effect
Analysis
• Performance modeling
• Static verification / dynamic simulation
(by Twente University)
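The fault-tree side of this analysis boils down to simple probability arithmetic over AND and OR gates. A minimal sketch with illustrative numbers (not the project's actual figures), assuming independent basic events:

```python
# Basic event probabilities (illustrative assumptions):
p_primary_plc = 1e-4   # primary PLC fails on demand
p_backup_plc = 1e-4    # redundant backup PLC fails on demand
p_power = 1e-7         # shared power feed fails

# AND gate: both redundant PLCs must fail -> multiply the probabilities
p_both_plcs = p_primary_plc * p_backup_plc

# OR gate: top event occurs if either branch fails -> 1 - product of survivals
p_top = 1 - (1 - p_both_plcs) * (1 - p_power)
```

Note how the redundant pair contributes 10⁻⁸ while the single shared power feed contributes 10⁻⁷, illustrating the "no single source of catastrophic failure" principle: the non-redundant element dominates the top-event probability.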
54. Programming (in C/C++)
• Coding standard:
– Based on “Safer C”, by Les Hatton
– May only use safe subset of the compiler
– Verified by Lint and 5 other tools
• Code is peer reviewed by 2nd developer
• Certified and calibrated compiler
55. Unit tests
• Focus on conformance to specifications
• Required coverage: 100% with respect to:
– Code paths
– Input equivalence classes
• Boundary Value analysis
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7
– Creates 100 MB/hour of logs and measurement data
• Upon bug detection
– Three strikes and out: after 3 implementation errors, the unit is rebuilt by another developer
– Two strikes and out: needing a 2nd rebuild implies a redesign by another designer
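Boundary value analysis, mentioned above, can be sketched as follows. The water-level validator and its ±500 cm range are hypothetical, invented purely for illustration:

```python
def accept_water_level(level_cm):
    """Hypothetical check: treat sensor readings between -500 and +500 cm as valid."""
    return -500 <= level_cm <= 500

# Exercise each edge of the valid range: just below, on, and just above
# every boundary (the classic boundary value triples).
cases = [(-501, False), (-500, True), (-499, True),
         (499, True), (500, True), (501, False)]
verdicts = [accept_water_level(level) == expected for level, expected in cases]
```

Off-by-one errors in comparisons (`<` vs `<=`) are exactly the defect class these boundary cases are designed to catch.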
57. Integration testing
• Focus on
– Functional behaviour of chain of components
– Failure scenarios based on risk analysis
• Required coverage
– 100% coverage on input classes
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7 at 10× speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
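"Probabilistic testing" here means drawing stimuli at random from the defined input classes rather than enumerating them. A minimal sketch; the command names and value range are invented for illustration:

```python
import random

# Assumed input equivalence classes for a component under integration test.
COMMANDS = ["open", "close", "hold"]

def random_stimulus(rng):
    """Draw one randomized stimulus: a command plus a water-level value."""
    return rng.choice(COMMANDS), rng.uniform(-500.0, 500.0)

rng = random.Random(2024)            # fixed seed keeps the campaign reproducible
stimuli = [random_stimulus(rng) for _ in range(1000)]
commands_seen = {cmd for cmd, _ in stimuli}   # check every class was covered
```

Seeding the generator matters: when a run fails, the exact input sequence can be replayed for the root cause analysis.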
58. Redundancy is a nasty beast
• You do get functional
behaviour of your entire
system
• It is nearly impossible to
see if all components
are working correctly
• Is EVERYTHING working
OK, or is it the safety net?
59. System testing
• Focus on
– Functional behaviour
– Failure scenarios based on risk analysis
• Required coverage
– 100% complete environment (simulation)
– 100% coverage on input classes
• Execution:
– Fully automated scripts, running 24x7 at 10× speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
60. Endurance testing
• Look for the “one in a
million times” problem
• Challenge:
– Software is deterministic
– Execution is not (timing,
transmission errors,
system load)
• Have an automated
script run it over and
over again
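An endurance harness is essentially a loop around one scenario, with the nondeterminism (timing, load) left free to vary between runs. A sketch, with the actual scenario body stubbed out:

```python
import random

def run_closure_scenario(rng):
    """Stand-in for one automated run; a real harness would drive the
    system with this timing jitter and check its logged behaviour.
    This stub always passes, whereas a real run occasionally would not."""
    jitter_ms = rng.uniform(0.0, 5.0)   # modelled nondeterministic timing
    # ... drive the scenario with this jitter, then verify the outcome ...
    return jitter_ms <= 5.0             # pass/fail verdict for this run

def endurance_test(runs, seed=42):
    rng = random.Random(seed)           # seeded, so the campaign is replayable
    return sum(1 for _ in range(runs) if not run_closure_scenario(rng))

failures = endurance_test(10_000)
```

The value is in the volume: a defect that appears once per million executions will never show up in a manual test session, but an automated loop running 24x7 will eventually hit it.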
61. Results of Endurance Tests
[Chart: Reliability Growth of Function M, Project S — chance of failure (logarithmic scale, 10⁰ down to 10⁻⁵) per platform version 4.35, 4.36, 4.37]
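Turning a long fault-free endurance run into a reliability claim requires statistics. One common sketch is the "rule of three" (an assumption here, not necessarily the method used on this project): after n failure-free runs, the 95% upper confidence bound on the per-run failure probability is approximately 3/n.

```python
def failure_bound_95(failure_free_runs):
    """Rule of three: approximate 95% upper confidence bound on the
    per-run failure probability after a given number of clean runs."""
    return 3.0 / failure_free_runs

# To substantiate a 10^-5 per-usage claim this way, you need about
# 300,000 consecutive failure-free runs.
runs_needed = 3.0 / 1e-5
```

This is why endurance campaigns run 24x7 at 10× speed: the number of clean runs needed grows inversely with the failure probability you want to demonstrate.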
62. Acceptance testing
• Acceptance testing
1. Functional acceptance
2. Failure behaviour, all top 50 (FMECA) risks tested
3. A year of operational verification
• Execution:
– Tests performed on a working storm surge barrier
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
63. A risk limit to testing
• Some things are too
dangerous to test
• Some tests introduce
more risks than they
try to mitigate
• There should always be
a safe way out of a test
procedure
65. GUI Acceptance testing
• Looking for
– quality in use for interactive
systems
– Understandability of the
GUI
• Structural investigation of
the performance of the
man-machine interactions
• Looking for “abuse” by the
users
• Looking at real-life handling
of emergency operations
66. Avalanche testing
• Tests the capabilities of
alarm handling and control
• Usually starts with one
simple trigger
• Generally followed by
millions of alarms
• Generally brings your
network and systems
to the breaking point
67. Crash and recovery procedure testing
• Validation of system
behaviour after massive
crash and restart
• Usually identifies many
issues about emergency
procedures
• Sometimes identifies issues
around power supply
• Usually identifies some
(combination of) systems
incapable of unattended
recovery...
69. Production has its challenges…
• Are equipment and
processes optimally
arranged?
• Are the humans up to
their task?
• Does everything
perform as expected?
77. Requires true commitment to results…
• Romans put the architect
under the arches when
removing the scaffolding
• Boeing and Airbus put all
lead-engineers on the first
test-flight
• Dijkstra put his
“rekenmeisjes” (his human
computers) on the opposite
dock when launching ships
78. It is about standing your ground…
• Thomas Andrews, Jr.
• Naval architect in charge of RMS Titanic
• He recognized regulations were
insufficient for a ship the size of the Titanic
• Decisions “forced upon him” by the client:
– Limit the range of double hulls
– Limit the number of lifeboats
• He was on the maiden voyage to spot
improvements
• He knowingly went down with the ship,
saving as many as he could
79. It requires a specific breed of people
The fates of developers and
testers are tied to their safety-
critical systems for
eternity
81. Conclusion
• Stop reading newspapers
• Safety-critical testing is a
lot of work, making sure
nothing happens
• Technically it isn’t that
much different; we’re just
more rigorous and use a
specific breed of
people....