Testing Safety Critical Systems (10-02-2014, VU amsterdam)



Presentation about the steps required for Verifying and Validating safety critical systems, as well as the test approach used. It goes beyond the simple processes, and also talks about the required safety culture and people required. The presentation contains examples of real-life IEC 61508 SIL 4 systems used on stormsurge barriers...

Published in: Technology
  • I have spent the last 15 years working on highly mission- and safety-critical systems
  • Glenn's advantage was that it only had to work once... And those were the sixties (you could still get away with that back then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • Surfers sunbathing in front of one of the most dangerous nuclear power plants in the world: San Onofre
  • It is carved in stone: nobody asks if we can make the storm surge barriers less safe when the crime rate in Amsterdam goes up…
  • We have to explicitly balance the availability of a service and its safety. Otherwise we would just permanently close the barrier and be happy about it!
  • Please note that: the first bullet is the reason we have autopilots (copilots are good drinking buddies). And that people still are anxious to fly, but don't get up in the morning thinking they are going to score a date with a supermodel
  • Won the jackpot of 10 million in 2012 and the jackpot of 3 million in 2013. Odds: winning once: 1 in 14 million; winning twice: 1 in 195 trillion
  • But we still (unknowingly) live in that world.....
  • Glenn's advantage was that it only had to work once... Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know what each component contributes. Then you take the most important 10, or 100, and take targeted measures
  • Three Mile Island nuclear disaster, 28 March 1979. Lack of understanding of the situation led to loss of control of the #2 reactor, which led to a partial meltdown. Some say it killed approx. 330 people
  • Tickles security: hard on the outside, soft as butter on the inside
  • There is a bug in this one: this code is NOT fail-safe because it has a potential catastrophic deadlock (when the Diesels don’t report Ready).....
  • Please be reminded: the presented code has a deadlock!
  • Do you know the difference between validation and verification?Validation = meets external expectations, does what it is supposed to doVerification = meets internal expectations, conforming to specs
  • Note: this is NOT related to the StormSurge Barrier!
  • You look for safety-critical situations, sometimes without a safety net. Chernobyl was a test of a pump, without a safety net
  • Three Mile Island is a good example
  • Most beautiful example: UPSes using too much power to charge, blowing all the fuses.... Current example: finding out that the identity management server was a single point of failure.... Eurocontrol example: the control unit wasn't ready for the CWPs, and after that it got overloaded
  • T5 problems: People couldn’t find their way in the installation…
  • This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
  • Reality isn’t nice and clean. Reality is messy, chaotic, stressed and nobody completely understands it….Reality is much more fun 
  • The Agile movement is right: people before processes! A process can't compensate for idiots inside it. Aqueduct of Segovia, Iberian peninsula. Built by the Romans between 50 AD and 125 AD. The architects over-engineered their structure so heavily that it is still standing 2000 years later. People have to realize that the commitment of people to get it right the first time is essential. At Eurocontrol, we mentioned a projected death toll on every bug
  • Our successes are unknown, our failures make the headlines…. When a system fails in production, it is actual blood on our hands. At Eurocontrol, each bug had a body count attached to it.....

    1. Testing Safety Critical Systems — Theory and Experiences J.vanEkris@Delta-Pi.nl http://www.slideshare.net/Jaap_van_Ekris/
    2. My Job — "Your life's goal will be to stay out of the newspapers." Gerard Duin (KEMA) Worked at
    3. My Projects
    4. Agenda • The Goal • The requirements • The challenge • Go with the process flow – Development Process – System design – Testing Techniques • Trends • Reality
    5. Goals of testing safety critical systems • Verify contractually agreed functionality • Verify correct functional safety behaviour • Verify safety behaviour during degraded and failure conditions
    6. THE REQUIREMENTS — What is so different about safety critical systems?
    7. Some people live on the edge… "How would you feel if you were getting ready to launch and knew you were sitting on top of two million parts -- all built by the lowest bidder on a government contract." John Glenn
    8. Actually, we all do…
    9. We might have become overprotective…
    10. The public is mostly unaware of risk…
    11. Until it is too late… • February 1st, 1953 • Spring tide and heavy winds broke the dykes • Killed 1,836 people and 30,000 animals
    12. The battle against flood risk… • Cost €2,500,000,000 • The largest moving structure on the planet • Defends – 500 km2 of land – 80,000 people • Partially controlled by software
    13. Nothing is flawless, by design… No matter how good the design is: • Some scenarios will be missed • Some scenarios are too expensive to prevent: – Accept the risk – Communicate it to stakeholders
    14. When is software good enough? • Dutch law on storm surge barriers • Equalizes the risk of dying due to unnatural causes across the Netherlands
    15. Risks have to be balanced… Availability of the service VS. safety of the service
    16. Oosterschelde Storm Surge Barrier • Chance of – Failure to close: 10^-7 per usage – Unexpected closure: 10^-4 per year
    17. To put things in perspective… • Having a drunk pilot: 10^-2 per flight • Hurting yourself when using a chainsaw: 10^-3 per use • Dating a supermodel: 10^-5 in a lifetime • Drowning in a bathtub: 10^-7 in a lifetime • Being hit by falling airplane parts: 10^-8 in a lifetime • Being killed by lightning: 10^-9 in a lifetime • Winning the lottery: 10^-10 in a lifetime • Your house being hit by a meteor: 10^-15 in a lifetime • Winning the lottery twice: 10^-20 in a lifetime
    18. Small chances do happen…
    19. The risk balance does change over time...
    20. 9/11... • Identified a fundamental (new) risk to ATC systems • Changed the ATC system dramatically • Doubled our safety critical scenarios
    21. Are software risks acceptable?
    22. Software plays a significant role...
    23. The industry statistics are against us… • Capers Jones: at least 2 high-severity errors per 10 KLoc • Industry consensus is that software will never be more reliable than – 10^-5 per usage – 10^-9 per operating hour
    24. THE CHALLENGE — Why is testing safety critical systems so hard?
    25. The value of testing: "Program testing can be used to show the presence of bugs, but never to show their absence!" Edsger W. Dijkstra
    26. Is just testing enough? • A 64-bit input isn't that uncommon • 2^64 is the global rice production of 1000 years, measured in individual grains • Fully testing all binary inputs on a simple 64-bit stimulus-response system once takes 2 centuries
    27. THE SOFTWARE DEVELOPMENT PROCESS — Quality and reliability start at conception, not at testing…
    28. IEC 61508: Safety Integrity Level and acceptable risk
    29. IEC 61508: Risk distribution
    30. IEC 61508: A process for safety critical functions
    31. SYSTEM DESIGN — What do safety critical systems look like and what are their most important drivers?
    32. Design Principles • Keep it simple... • Risk analysis drives design (decisions) • Safety first (production later) • Fail-to-safe • There shall be no single source of (catastrophic) failure
    33. A simple design of a storm surge barrier • Relay (€10.00/piece) • Water detector (€17.50) • Design documentation (sponsored by Heineken)
    34. Risk analysis • Broken cable – Chance: medium; Causes: digging, seagulls; Effect: catastrophic • Relay failure – Chance: small; Cause: aging; Effect: catastrophic • Water detector fails – Chance: huge; Causes: rust, driftwood, seagulls (eating, shitting); Effect: catastrophic • Measurement errors – Chance: colossal; Causes: waves, wind; Effect: false positive
    35. System Architecture
    36. Risk analysis
    37. Typical risks identified • Components making the wrong decisions • Power failure • Hardware failure of PLCs/servers • Network failure • Ship hitting water sensors • Human maintenance error
    38. Risk ≠ system crash • Understandability of the GUI • Wrongful functional behaviour • Data accuracy • Lack of response speed • Tolerance towards illogical inputs • Resistance to hackers
    39. Usability of an MMI is key to safety
    40. Systems do misbehave...
    41. Systems can be late…
    42. Systems aren't your only problem
    43. StuurX: Component architecture design
    44. StuurX::Functionality, initial global design — state machine: Init; Waterlevel < 3 meter: stay in Wacht (wait); Waterlevel > 3 meter: Start_D ("Start" signal to Diesels) → W_O_D ("Diesels ready") → Sluit_? ("Close Barrier")
    45. StuurX::Functionality, final global design
    46. StuurX::Functionality, Wait_For_Diesels, detailed design
    47. VERIFICATION — What is getting tested, and how?
    48. Design completion...
    49. An example of safety critical components
    50. IEC 61508 SIL4: Required verification activities
    51. Design Validation and Verification • Peer reviews by – the system architect – a 2nd designer – programmers – the test manager for system testing • Fault Tree Analysis / Failure Mode and Effect Analysis • Performance modeling • Static verification / dynamic simulation by Twente University
    52. Programming (in C/C++) • Coding standard: – Based on "Safer C" by Les Hatton – May only use a safe subset of the compiler – Verified by Lint and 5 other tools • Code is peer reviewed by a 2nd developer • Certified and calibrated compiler
    53. Unit tests • Focus on conformance to specifications • Required coverage: 100% with respect to: – Code paths – Input equivalence classes • Boundary value analysis • Probabilistic testing • Execution: – Fully automated scripts, running 24x7 – Creates 100 MB/hour of logs and measurement data • Upon bug detection – 3 strikes out → after 3 implementation errors the component is rebuilt by another developer – 2 strikes out → the need for a 2nd rebuild implies a redesign by another designer
    54. Representative testing is difficult
    55. Integration testing • Focus on – Functional behaviour of a chain of components – Failure scenarios based on risk analysis • Required coverage – 100% coverage of input classes • Probabilistic testing • Execution: – Fully automated scripts, running 24x7 at 10x speed – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug gets a root-cause analysis
    56. Redundancy is a nasty beast • You do get functional behaviour of your entire system • It is nearly impossible to see if all components are working correctly • Is EVERYTHING working OK, or is it the safety net?
    57. System testing • Focus on – Functional behaviour – Failure scenarios based on risk analysis • Required coverage – 100% complete environment (simulation) – 100% coverage of input classes • Execution: – Fully automated scripts, running 24x7 at 10x speed – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug gets a root-cause analysis
    58. Endurance testing • Look for the "one in a million times" problem • Challenge: – Software is deterministic – execution is not (timing, transmission errors, system load) • Have an automated script run it over and over again
    59. Results of Endurance Tests — chart: reliability growth of Function M, Project S; chance of failure (logarithmic scale, from 1 down to 10^-5) across platform versions 4.35, 4.36 and 4.37
    60. Acceptance testing 1. Functional acceptance 2. Failure behaviour, all top-50 (FMECA) risks tested 3. A year of operational verification • Execution: – Tests performed on a working storm surge barrier – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug gets a root-cause analysis
    61. A risk limit to testing • Some things are too dangerous to test • Some tests introduce more risks than they try to mitigate • There should always be a safe way out of a test procedure
    62. Testing safety critical functions is dangerous...
    63. GUI acceptance testing • Looking for – quality in use for interactive systems – understandability of the GUI • Structural investigation of the performance of the man-machine interactions • Looking for "abuse" by the users • Looking at real-life handling of emergency operations
    64. Avalanche testing • Tests the capabilities of alarming and control • Usually starts with one simple trigger • Generally followed by millions of alarms • Generally brings your network and systems to the breaking point
    65. Crash and recovery procedure testing • Validation of system behaviour after a massive crash and restart • Usually identifies many issues with emergency procedures • Sometimes identifies issues around power supply • Usually identifies some (combination of) systems incapable of unattended recovery...
    66. Production has its challenges… • Are equipment and processes optimally arranged? • Are the humans up to their task? • Does everything perform as expected?
    67. TRENDS — What is the newest and hottest?
    68. Model Driven Design
    69. A real-life example
    70. A root-cause analysis of this flaw
    71. REALITY — What are the real-life challenges of a test manager of safety critical systems?
    72. The difference between theory and reality
    73. Working together…
    74. Requires true commitment to results… • The Romans put the architect under the arches when removing the scaffolding • Boeing and Airbus put all lead engineers on the first test flight • Dijkstra put his "rekenmeisjes" (human computers) on the opposite dock when launching ships
    75. It is about keeping your back straight… • Thomas Andrews, Jr. • Naval architect in charge of RMS Titanic • He recognized that regulations were insufficient for a ship the size of Titanic • Decisions "forced upon him" by the client: – Limit the range of the double hulls – Limit the number of lifeboats • He was on the maiden voyage to spot improvements • He knowingly went down with the ship, saving as many as he could
    76. It requires a specific breed of people — the fates of developers and testers are linked to safety critical systems into eternity
    77. Conclusion • Stop reading newspapers • Safety critical testing is a lot of work, making sure nothing happens • Technically it isn't that much different; we're just more rigorous and use a specific breed of people....
    78. Questions?