Presentation about the steps required for verifying and validating safety-critical systems, and the test approach used. It goes beyond the bare processes and also covers the safety culture and the kind of people required. The presentation includes examples from real-life IEC 61508 SIL 4 systems used on storm surge barriers.
4. Agenda
• The Goal
• The requirements
• The challenge
• Go with the process flow
– Development Process
– System design
– Testing Techniques
• Trends
• Reality
8. Some people live on the edge…
How would you feel if you were getting
ready to launch and knew you were
sitting on top of two million parts
-- all built by the lowest bidder on a
government contract.
John Glenn
12. Until it is too late…
• February 1st, 1953
• Spring tide and heavy
winds broke the dykes
• Killed 1,836 people
and 30,000 animals
13. The battle against flood risk…
• Cost: €2,500,000,000
• The largest moving
structure on the
planet
• Defends
– 500 km² of land
– 80,000 people
• Partially controlled
by software
14. Nothing is flawless, by design…
No matter how good the
design is:
• Some scenarios will be
missed
• Some scenarios are
too expensive to
prevent:
– Accept risk
– Communicate to stakeholders
15. When is software good enough?
• Dutch Law on
storm surge
barriers
• Equalizes risk
of dying due
to unnatural
causes across
the Netherlands
16. Risks have to be balanced…
Availability of the service VS. Safety of the service
17. Oosterschelde Storm Surge Barrier
• Chance of
– Failure to close: 10⁻⁷
per usage
– Unexpected closure:
10⁻⁴ per year
18. To put things in perspective…
• Having a drunk pilot: 10⁻² per flight
• Hurting yourself when using a chainsaw: 10⁻³ per use
• Dating a supermodel: 10⁻⁵ in a lifetime
• Drowning in a bathtub: 10⁻⁷ in a lifetime
• Being hit by falling airplane parts: 10⁻⁸ in a lifetime
• Being killed by lightning: 10⁻⁹ in a lifetime
• Winning the lottery: 10⁻¹⁰ in a lifetime
• Your house being hit by a meteor: 10⁻¹⁵ in a lifetime
• Winning the lottery twice: 10⁻²⁰ in a lifetime
24. The industry statistics are against us…
• Capers Jones: at least 2 high-severity
errors per 10 KLoC
• Industry consensus is that software
will never be more reliable than
– 10⁻⁵ per usage
– 10⁻⁹ per operating hour
26. The value of testing
Program testing can be used to show the
presence of bugs, but never to show
their absence!
Edsger W. Dijkstra
27. Is just testing enough?
• A 64-bit input isn’t that
uncommon
• 2⁶⁴ is the global rice
production in 1,000 years,
measured in individual
grains
• Fully testing all binary
inputs of a simple 64-bit
stimulus–response system
once takes 2 centuries
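A quick back-of-the-envelope check of that claim. This is a sketch: the test rate of 3 billion cycles per second is an assumption, chosen to roughly match the "2 centuries" figure on the slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

inputs = 2 ** 64          # every possible 64-bit stimulus
rate = 3e9                # assumed: 3 billion stimulus/response cycles per second
years = inputs / rate / SECONDS_PER_YEAR  # comes out to roughly two centuries
```

Even at a billion tests per second the exhaustive run would take centuries longer, which is the point: exhaustive testing is not an option.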
31. IEC 61508: A process for safety critical functions
32. SYSTEM DESIGN
What do safety critical systems look like and what are their most important drivers?
33. Design Principles
• Risk analysis drives design (decisions)
• Safety first (production later)
• Fail-to-safe
• There shall be no single source of
(catastrophic) failure
39. Typical risks identified
• Components making the wrong decisions
• Power failure
• Hardware failure of PLCs/servers
• Network failure
• Ship hitting water sensors
• Human maintenance error
40. Risk ≠ system crash
• Understandability of
the GUI
• Incorrect functional
behaviour
• Data accuracy
• Lack of response speed
• Tolerance of
illogical inputs
• Resistance to hackers
53. Design Validation and Verification
• Peer reviews by
– System architect
– 2nd designer
– Programmers
– Test manager (system testing)
• Fault Tree Analysis / Failure Mode and Effect
Analysis
• Performance modeling
• Static verification / dynamic simulation
(by Twente University)
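The fault-tree side of this analysis boils down to simple probability arithmetic over AND and OR gates. A minimal sketch with illustrative numbers (not the project's actual figures), assuming independent basic events:

```python
# Basic event probabilities (illustrative assumptions):
p_primary_plc = 1e-4   # primary PLC fails on demand
p_backup_plc = 1e-4    # redundant backup PLC fails on demand
p_power = 1e-7         # shared power feed fails

# AND gate: both redundant PLCs must fail -> multiply the probabilities
p_both_plcs = p_primary_plc * p_backup_plc

# OR gate: top event occurs if either branch fails -> 1 - product of survivals
p_top = 1 - (1 - p_both_plcs) * (1 - p_power)
```

Note how the redundant pair contributes 10⁻⁸ while the single shared power feed contributes 10⁻⁷, illustrating the "no single source of catastrophic failure" principle: the non-redundant element dominates the top-event probability.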
54. Programming (in C/C++)
• Coding standard:
– Based on “Safer C”, by Les Hatton
– May only use safe subset of the compiler
– Verified by Lint and 5 other tools
• Code is peer reviewed by 2nd developer
• Certified and calibrated compiler
55. Unit tests
• Focus on conformance to specifications
• Required coverage: 100% with respect to:
– Code paths
– Input equivalence classes
• Boundary Value analysis
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7
– Creates 100 MB/hour of logs and measurement data
• Upon bug detection
– Three strikes and out: after 3 implementation errors, the unit is rebuilt by another developer
– Two strikes and out: needing a 2nd rebuild implies a redesign by another designer
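Boundary value analysis, mentioned above, can be sketched as follows. The water-level validator and its ±500 cm range are hypothetical, invented purely for illustration:

```python
def accept_water_level(level_cm):
    """Hypothetical check: treat sensor readings between -500 and +500 cm as valid."""
    return -500 <= level_cm <= 500

# Exercise each edge of the valid range: just below, on, and just above
# every boundary (the classic boundary value triples).
cases = [(-501, False), (-500, True), (-499, True),
         (499, True), (500, True), (501, False)]
verdicts = [accept_water_level(level) == expected for level, expected in cases]
```

Off-by-one errors in comparisons (`<` vs `<=`) are exactly the defect class these boundary cases are designed to catch.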
57. Integration testing
• Focus on
– Functional behaviour of chain of components
– Failure scenarios based on risk analysis
• Required coverage
– 100% coverage on input classes
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7 at 10× speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
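"Probabilistic testing" here means drawing stimuli at random from the defined input classes rather than enumerating them. A minimal sketch; the command names and value range are invented for illustration:

```python
import random

# Assumed input equivalence classes for a component under integration test.
COMMANDS = ["open", "close", "hold"]

def random_stimulus(rng):
    """Draw one randomized stimulus: a command plus a water-level value."""
    return rng.choice(COMMANDS), rng.uniform(-500.0, 500.0)

rng = random.Random(2024)            # fixed seed keeps the campaign reproducible
stimuli = [random_stimulus(rng) for _ in range(1000)]
commands_seen = {cmd for cmd, _ in stimuli}   # check every class was covered
```

Seeding the generator matters: when a run fails, the exact input sequence can be replayed for the root cause analysis.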
58. Redundancy is a nasty beast
• You do get functional
behaviour of your entire
system
• It is nearly impossible to
see if all components
are working correctly
• Is EVERYTHING working
OK, or is it the safety net?
59. System testing
• Focus on
– Functional behaviour
– Failure scenarios based on risk analysis
• Required coverage
– 100% complete environment (simulation)
– 100% coverage on input classes
• Execution:
– Fully automated scripts, running 24x7 at 10× speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
60. Endurance testing
• Look for the “one in a
million times” problem
• Challenge:
– Software is deterministic
– Execution is not (timing,
transmission errors,
system load)
• Have an automated
script run it over and
over again
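An endurance harness is essentially a loop around one scenario, with the nondeterminism (timing, load) left free to vary between runs. A sketch, with the actual scenario body stubbed out:

```python
import random

def run_closure_scenario(rng):
    """Stand-in for one automated run; a real harness would drive the
    system with this timing jitter and check its logged behaviour.
    This stub always passes, whereas a real run occasionally would not."""
    jitter_ms = rng.uniform(0.0, 5.0)   # modelled nondeterministic timing
    # ... drive the scenario with this jitter, then verify the outcome ...
    return jitter_ms <= 5.0             # pass/fail verdict for this run

def endurance_test(runs, seed=42):
    rng = random.Random(seed)           # seeded, so the campaign is replayable
    return sum(1 for _ in range(runs) if not run_closure_scenario(rng))

failures = endurance_test(10_000)
```

The value is in the volume: a defect that appears once per million executions will never show up in a manual test session, but an automated loop running 24x7 will eventually hit it.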
61. Results of Endurance Tests
[Chart: Reliability Growth of Function M, Project S — chance of failure (logarithmic scale, 10⁰ down to 10⁻⁵) per platform version 4.35, 4.36, 4.37]
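Turning a long fault-free endurance run into a reliability claim requires statistics. One common sketch is the "rule of three" (an assumption here, not necessarily the method used on this project): after n failure-free runs, the 95% upper confidence bound on the per-run failure probability is approximately 3/n.

```python
def failure_bound_95(failure_free_runs):
    """Rule of three: approximate 95% upper confidence bound on the
    per-run failure probability after a given number of clean runs."""
    return 3.0 / failure_free_runs

# To substantiate a 10^-5 per-usage claim this way, you need about
# 300,000 consecutive failure-free runs.
runs_needed = 3.0 / 1e-5
```

This is why endurance campaigns run 24x7 at 10× speed: the number of clean runs needed grows inversely with the failure probability you want to demonstrate.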
62. Acceptance testing
• Acceptance testing
1. Functional acceptance
2. Failure behaviour, all top 50 (FMECA) risks tested
3. A year of operational verification
• Execution:
– Tests performed on a working storm surge barrier
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Every bug gets a root cause analysis
63. A risk limit to testing
• Some things are too
dangerous to test
• Some tests introduce
more risks than they
try to mitigate
• There should always be
a safe way out of a test
procedure
65. GUI Acceptance testing
• Looking for
– quality in use for interactive
systems
– Understandability of the
GUI
• Structural investigation of
the performance of the
man-machine interactions
• Looking for “abuse” by the
users
• Looking at real-life handling
of emergency operations
66. Avalanche testing
• Tests the capabilities of
alarm handling and control
• Usually starts with one
simple trigger
• Generally followed by
millions of alarms
• Generally brings your
network and systems
to the breaking point
67. Crash and recovery procedure testing
• Validation of system
behaviour after massive
crash and restart
• Usually identifies many
issues about emergency
procedures
• Sometimes identifies issues
around power supply
• Usually identifies some
(combination of) systems
incapable of unattended
recovery...
69. Production has its challenges…
• Are equipment and
processes optimally
arranged?
• Are the humans up to
their task?
• Does everything
perform as expected?
77. Requires true commitment to results…
• Romans put the architect
under the arches when
removing the scaffolding
• Boeing and Airbus put all
lead-engineers on the first
test-flight
• Dijkstra put his
“rekenmeisjes” (his human
computers) on the opposite
dock when launching ships
78. It is about standing your ground…
• Thomas Andrews, Jr.
• Naval architect in charge of RMS Titanic
• He recognized regulations were
insufficient for a ship the size of the Titanic
• Decisions “forced upon him” by the client:
– Limit the range of double hulls
– Limit the number of lifeboats
• He was on the maiden voyage to spot
improvements
• He knowingly went down with the ship,
saving as many as he could
79. It requires a specific breed of people
The fates of developers and
testers are tied to their safety-
critical systems for
eternity
81. Conclusion
• Stop reading newspapers
• Safety-critical testing is a
lot of work, making sure
nothing happens
• Technically it isn’t that
much different; we’re just
more rigorous and use a
specific breed of
people....