A presentation about the steps required for verifying and validating safety-critical systems, as well as the test approach used. It goes beyond the bare processes, and also covers the required safety culture and the kind of people needed. The presentation contains examples of real-life IEC 61508 SIL 4 systems used on storm surge barriers.
5. Some people live on the edge…
How would you feel if you were getting
ready to launch and knew you were
sitting on top of two million parts -- all
built by the lowest bidder on a
government contract?
John Glenn
7. We even accept loss...
• Lost/misdirected luggage: chance of failure 10⁻² per suitcase
• Airplane: chance of crash 10⁻⁸ per flight hour
• Storm surge barrier: chance of failure 10⁻⁷ per usage
• Nuclear power plant: As Low As Reasonably Practicable (ALARP)
9. To put things in perspective…
• Getting killed in traffic: 10⁻² per year
• Having a drunk pilot: 10⁻² per flight
• Hurting yourself when using a chainsaw: 10⁻³ per use
• Believing you are possessed by Satan: 10⁻⁴ per lifetime
• Dating a supermodel: 10⁻⁵ per lifetime
• Drowning in a bathtub: 10⁻⁷ per lifetime
• Being hit by falling airplane parts: 10⁻⁸ per lifetime
• Being killed by lightning: 10⁻⁹ per lifetime
• Your house being hit by a meteor: 10⁻¹⁵ per lifetime
12. ...and the odds are against us…
• Capers Jones: at least 2 high-severity errors per 10 KLOC
• Industry consensus is that software will never be more reliable than
– 10⁻⁵ per usage
– 10⁻⁹ per operating hour
13. The value of testing
Program testing can be used to show the
presence of bugs, but never to show
their absence!
Edsger W. Dijkstra
17. IEC 61508: A process for safety critical functions
18. Process or personal commitment?
• The Romans put the architect under the arches when removing the scaffolding
• Boeing and Airbus put all lead engineers on the first test flight
• Dijkstra put his “rekenmeisjes” (the girls who did his calculations) on the opposite dock when launching ships
19. It is about keeping your back straight…
• Thomas Andrews, Jr.
• Naval architect in charge of RMS Titanic
• He recognized the regulations were insufficient for a ship the size of Titanic
• Decisions “forced upon him” by the client:
– Limiting the extent of the double hull
– Limiting the number of lifeboats
• He was on the maiden voyage to spot improvements
• He knowingly went down with the ship, saving as many people as he could
22. Design Principles
• Keep it simple...
• Risk analysis drives design (decisions)
• Safety first (production later)
• Fail-to-safe
• There shall be no single source of (catastrophic) failure
23. A simple design of a storm surge barrier
Relay (€10.00 apiece)
Water detector (€17.50)
Design documentation (sponsored by Heineken)
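In code terms, the naive design on this slide boils down to something like the sketch below. This is purely illustrative: the function names are hypothetical and the real barrier logic is far more involved (and, as the editor's notes point out, this simplicity is also a perfect single point of failure).

```c
#include <stdbool.h>

extern bool water_detector_high(void);  /* the €17.50 water detector */
extern void relay_close_barrier(void);  /* the €10.00 relay          */

/* Naive control loop: one sensor, one relay, no redundancy. */
void control_loop(void)
{
    for (;;) {
        if (water_detector_high()) {
            relay_close_barrier();
        }
    }
}
```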
27. Typical risks identified
• Components making the wrong decisions
• Power failure
• Hardware failure of PLCs/servers
• Network failure
• Ships hitting the water sensors
• Human maintenance error
28. Risk ≠ system crash
• Wrong functional behaviour
• Data accuracy
• Lack of response speed
• Understandability of the GUI
• Tolerance towards illogical inputs
32. Eliminating all risk isn't the goal…
No matter how good the environment analysis is:
• Some scenarios will be missed
• Some scenarios are too expensive to prevent:
– Accept the risk
– Communicate it to stakeholders
33. Risks can be contradictory…
Availability of the service vs. safety of the installation
35. 9/11...
• Really tested our “test abort” procedure
• Introduced a fundamentally new risk to ATC systems
• Changed the ATC system dramatically
• Doubled our test cases overnight
42. Challenge: time and resource limitations
• A 64-bit input isn't that uncommon
• 2⁶⁴ is the global rice production of 1,000 years, measured in individual grains
• Fully testing all binary inputs of a 64-bit stimulus-response system takes about 2 centuries (see the check below)
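A back-of-the-envelope check of that claim, assuming a test rate of roughly 3×10⁹ stimuli per second (the rate is my assumption, not a figure from the talk):

```c
#include <stdio.h>

int main(void)
{
    const double inputs = 18446744073709551616.0;  /* 2^64 possible inputs */
    const double rate   = 3e9;                     /* assumed tests/second */
    const double year   = 365.25 * 24.0 * 3600.0;  /* seconds per year     */

    /* Prints roughly 195 years, i.e. about two centuries. */
    printf("%.0f years\n", inputs / (rate * year));
    return 0;
}
```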
43. Goals of testing safety-critical systems
• Verify contractually agreed functionality
• Verify correct functional safety behaviour
• Verify safety behaviour during degraded and failure conditions
46. Design Validation and Verification
• Peer reviews by
– The system architect
– A 2nd designer
– Programmers
– The test manager for system testing
• Fault Tree Analysis / Failure Mode and Effect Analysis
• Performance modeling
• Static verification / dynamic simulation by Twente University
47. Programming (in C/C++)
• Coding standard:
– Based on “Safer C” by Les Hatton
– May only use a safe subset of the compiler (see the sketch below)
– Verified by Lint and 5 other tools
• Code is peer reviewed by a 2nd developer
• Certified and calibrated compiler
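The editor's notes mention that the code shown in the original slides contained a catastrophic deadlock (it waited forever when the diesels didn't report ready). A minimal sketch of the defensive style such a coding standard enforces, using hypothetical names rather than the actual barrier code:

```c
#include <stdbool.h>

extern bool diesels_ready(void);     /* hypothetical status poll   */
extern void sleep_ms(unsigned ms);   /* hypothetical delay routine */
extern void enter_safe_state(void);  /* the fail-to-safe action    */

/* Never wait forever on an external condition: bound the wait,
 * and fail to safe instead of deadlocking when it expires. */
bool wait_for_diesels(void)
{
    for (unsigned tries = 0u; tries < 600u; tries++) {  /* at most 60 s */
        if (diesels_ready()) {
            return true;
        }
        sleep_ms(100u);
    }
    enter_safe_state();
    return false;
}
```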
48. Unit tests
• Focus on conformance to specifications
• Required coverage: 100% with respect to:
– Code paths
– Input equivalence classes
• Boundary value analysis (see the sketch after this list)
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7
– Creates 100 MB/hour of logs and measurement data
• Upon bug detection
– 3 strikes is out: after 3 implementation errors, the unit is rebuilt by another developer
– 2 strikes is out: needing a 2nd rebuild implies a redesign by another designer
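A minimal sketch of what equivalence-class and boundary-value cases look like in practice. The function under test and its 300 cm threshold are hypothetical, not taken from the barrier software:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical unit under test: should the barrier close at this
 * water level (cm above normal)? Spec: close at >= 300. */
static bool must_close(int level_cm) { return level_cm >= 300; }

int main(void)
{
    /* Equivalence classes: clearly safe vs. clearly dangerous */
    assert(!must_close(0));
    assert( must_close(500));

    /* Boundary values around the 300 cm threshold */
    assert(!must_close(299));
    assert( must_close(300));
    assert( must_close(301));

    /* Extremes of the input domain */
    assert(!must_close(-100));
    assert( must_close(10000));
    return 0;
}
```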
50. Integration testing
• Focus on
– Functional behaviour of a chain of components
– Failure scenarios based on the risk analysis
• Required coverage
– 100% coverage of input classes
• Probabilistic testing (sketched below)
• Execution:
– Fully automated scripts, running 24x7 at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
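One way to set up probabilistic testing so it stays compatible with root-cause analysis is to log the random seed, making every failing run exactly reproducible. A sketch under those assumptions (the scenario driver is a stub, not the real component chain):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stub standing in for the real component-chain driver. */
static int run_scenario(unsigned water_cm, unsigned wind_kts)
{
    (void)water_cm; (void)wind_kts;
    return 0;  /* the real driver stimulates the chain and checks responses */
}

int main(int argc, char **argv)
{
    /* Log the seed so any failure can be replayed exactly. */
    unsigned seed = (argc > 1) ? (unsigned)atoi(argv[1]) : 12345u;
    srand(seed);
    printf("seed=%u\n", seed);

    for (unsigned i = 0u; i < 1000000u; i++) {
        unsigned water = (unsigned)(rand() % 1000);  /* 0..999 cm  */
        unsigned wind  = (unsigned)(rand() % 150);   /* 0..149 kts */
        if (run_scenario(water, wind) != 0) {
            printf("FAIL at iteration %u (water=%u, wind=%u)\n", i, water, wind);
            return 1;
        }
    }
    return 0;
}
```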
51. Redundancy is a nasty beast
• You do get correct functional behaviour from your system as a whole
• But it is nearly impossible to see whether all your individual components are working correctly
52. System testing
• Focus on
– Functional behaviour
– Failure scenarios based on the risk analysis
• Required coverage
– 100% complete environment (simulation)
– 100% coverage of input classes
• Execution:
– Fully automated scripts, running 24x7 at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
53. Acceptance testing
• Three stages:
1. Functional acceptance
2. Failure behaviour: all top-50 (FMECA) risks tested
3. A year of operational verification
• Execution:
– Tests performed on the working storm surge barrier
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug: root-cause analysis
54. Endurance testing
• Look for the “one in a million times” problem
• Challenge:
– Software is deterministic
– Execution is not (timing, system load, bit errors)
• Have an automated script run it over and over again (sketched below)
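A minimal POSIX sketch of such a harness. The scenario runner is a stub, and the millisecond jitter is my own addition to illustrate why rerunning deterministic software can still surface new failures:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int run_once(void) { return 0; }  /* stub for one scenario run */

int main(void)
{
    srand((unsigned)time(NULL));
    for (unsigned long i = 0; i < 10000000UL; i++) {
        /* The code is deterministic, but timing and load are not:
         * vary the start moment slightly on every iteration. */
        struct timespec jitter = { 0, (rand() % 1000) * 1000L };  /* 0..1 ms */
        nanosleep(&jitter, NULL);
        if (run_once() != 0) {
            fprintf(stderr, "failure on run %lu\n", i);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
```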
55. GUI Acceptance testing
• Looking for
– Quality in use for interactive systems
– Understandability of the GUI
• Structured investigation of the performance of the human-system interaction
• Looking for “abuse” by the users
• Looking at real-life handling of emergency operations
56. Avalanche testing
• Tests the capabilities of alarming and control
• Usually starts with one simple trigger
• Generally followed by millions of alarms
• Generally brings your network and systems to the breaking point
57. Crash and recovery procedure testing
• Validation of system behaviour after a massive crash and restart
• Usually identifies many issues with emergency procedures
• Sometimes identifies issues around the power supply
• Usually identifies some (combinations of) systems incapable of unattended recovery...
59. A risk analysis of testing itself
• There should always be a way out of a test procedure
• Some things are too dangerous to test
• Some tests introduce more risks than they try to mitigate
60. Root-cause analysis
• A painful process, by design
• Extremely thorough
• Assumes that the error found is a symptom of an underlying collection of (process) flaws
• Searches for the underlying causes of the error, and looks for possible similar errors that might have followed a similar path
68. It requires a specific breed of people
The fates of developers and testers are linked to their safety-critical systems for eternity
69. Conclusions
• Stop reading newspapers
• Safety-critical testing is a lot of work, making sure nothing happens
• Technically it isn't that much different; we're just more rigorous and use a specific breed of people....
Editor's Notes
Copyright CIBIT Adviseurs|Opleiders 2005. Jaap van Ekris, safety-critical systems. Fields of work: nuclear power plants, air traffic control, storm surge barriers. Errors cost many human lives.
Glenn's advantage was that it only had to work once... And those were the sixties (you could still get away with that back then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
When I started my career, my mentor told me: “From now on, your goal is to stay off the front page of the newspapers.” I can tell you it is hard, but so far I’ve succeeded.
Please note that these failure rates include electromechanical failure as well! Electrocution by a light switch: chance of 10⁻⁵ per usage, which is exactly the chance of dating a supermodel as well.
Please note that pilots have a redundant counterpart….
But we still (unknowingly) live in that world.....
FTA and FMEA are opposites, and good cross-checks of each other (NASA). Although NASA doesn't have a flawless track record….
Aqueduct of Segovia, Iberian Peninsula. Built by the Romans between 50 AD and 125 AD. The architects over-engineered it so heavily that it is still standing 2,000 years later. People have to realize that the commitment of people to get it right the first time is essential. At Eurocontrol, we mentioned a projected death toll on every bug.
Goal: it may fail only once every 10,000 years.
You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know what each component contributes. You take the most important 10, or 100, and take targeted measures.
Typical security: hard on the outside, soft as butter on the inside.
The perfect “single point of failure”.
Once we take deadlocks and redundancy into account, our picture looks like this: not really simple anymore……
There is a bug in this one: this code is NOT fail-safe because it has a potential catastrophic deadlock (when the Diesels don’t report Ready).....
Please be reminded: the presented code has a deadlock!
Do you know the difference between validation and verification? Validation = meets external expectations, does what it is supposed to do. Verification = meets internal expectations, conforms to specs.
Funny example: printing screen....
Most beautiful example: UPSes using too much power to charge, killing all the fuses.... Current example: found out that the identity management server was a single point of failure.... Eurocontrol example: the control unit wasn’t ready for the CWPs, and after that got overloaded.
This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
Our successes are unknown, our failures make the headlines…. When a system fails in production, it is actual blood on our hands. At Eurocontrol, each bug had a body count attached to it.....