2011-05-02 - VU Amsterdam - Testing safety critical systems



Presentation about the steps required for verifying and validating safety critical systems, as well as the test approach used. Contains examples of real-life IEC 61508 SIL 4 systems.


Speaker notes
  • Copyright CIBIT Adviseurs|Opleiders 2005, Jaap van Ekris. Safety-critical systems. Fields: nuclear power plants, air traffic control, storm surge barriers. Failures cost many human lives
  • Glenn’s advantage was that it only had to work once...... And that was the 1960s (you could still get away with that then), and astronauts still had guts Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • When I started my career, my mentor told me: “From now on, your goal is to stay off the front page of the newspapers.” I can tell you it is hard, but so far I’ve succeeded ... ALMOST: T5
  • But we still (unknowingly) live in that world..... 10 June 2011
  • Please note that these failure rates include electromechanical failure as well! Electrocution by a light switch: chance of 10^-5 per usage
  • Glenn’s advantage was that it only had to work once...... Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • FTA and FMEA are opposites, and good cross-checks on each other (NASA). Although NASA does not have a flawless track record….
  • Goal: may fail only once every 10,000 years
  • You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know the contribution of each component. You take the most important 10, or 100, and take targeted measures
  • Tickles security: hard on the outside, butter-soft on the inside
  • Once we take deadlocks and redundancy into account, our picture looks like this: no longer quite so simple……
  • There is a bug in this one: this code is NOT fail-safe because it has a potential catastrophic deadlock (when the diesels don’t report ready).....
  • Please be reminded: the presented code has a deadlock!
  • Do you know the difference between validation and verification? Validation = meets external expectations: does what it is supposed to do. Verification = meets internal expectations: conforms to specs
  • Funny example: printing a screen as colossal bitmaps on a PostScript printer, instead of as small vector drawings....
  • Most beautiful example: UPSes drawing too much power while charging, blowing all the fuses.... Current example: found out that an identity-management server was a single point of failure...
  • This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
  • Dijkstra put mathematicians aboard ships, just to remind them of the danger: a practice still used by Boeing and Airbus (maiden flight). Testers, like John Glenn actually was, put their lives on the line each and every time. At Eurocontrol, each bug had a body count attached to it..... When a system fails in production, it is actual blood on our hands. I lose about a colleague a year. Quit when you think it is routine.....

    1. Testing Safety Critical Systems: Theory and Experiences
       • 2 May 2011
       • Jaap van Ekris
    2. Jaap van Ekris
    3. Some people live on the edge…
       • “How would you feel if you were getting ready to launch and knew you were sitting on top of two million parts -- all built by the lowest bidder on a government contract.”
       • John Glenn
    4. Actually, we all do…
    5. Agenda
       • The challenge
       • Process and Organization
       • System design
       • Verification Techniques
       • Trends
       • Reality
    6. THE CHALLENGE
       • Why is testing safety critical systems so hard?
    7. Software is dangerous...
       • Capers Jones: at least 2 high-severity errors per 10 KLoc
       • Industry consensus is that software will never be more reliable than
         • 10^-5 per usage
         • 10^-9 per operating hour
    8. We even accept loss...
       • Lost/misdirected luggage: chance of failure 10^-2 per suitcase
       • Airplane: chance of failure 10^-8 per flight hour
       • Storm Surge Barrier: chance of failure 10^-7 per usage
       • Nuclear power plant: As Low As Reasonably Practicable (ALARP)
    9. The value of testing
       • “Program testing can be used to show the presence of bugs, but never to show their absence!”
       • Edsger W. Dijkstra
    10. PROCESS AND ORGANIZATION
       • Who does what in safety critical software development?
    11. IEC 61508: Safety Integrity Level and acceptable risk
    12. IEC 61508: Risk distribution
    13. IEC 61508: A process for safety critical functions
    14. SYSTEM DESIGN
       • What do safety critical systems look like?
    15. A short introduction into storm surge barriers…
    16. Design Principles
       • Keep it simple...
       • Risk analysis drives design (decisions)
       • Safety first (production later)
       • Fail-to-safe
       • There shall be no single source of (catastrophic) failure
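The fail-to-safe principle can be illustrated with a minimal sketch. The names (`reading_t`, `should_close`) and the 3-metre threshold are assumptions for illustration, not the barrier's actual code: the point is that a missing or invalid input resolves to the safe action, not the production-friendly one.

```c
#include <stdbool.h>

/* Hypothetical sensor reading: validity flag plus water level in centimetres. */
typedef struct {
    bool valid;    /* false if the sensor failed, timed out, or reads out of range */
    int  level_cm; /* measured water level */
} reading_t;

#define CLOSE_THRESHOLD_CM 300 /* assumed 3-metre threshold from the design slides */

/* Fail-to-safe decision: any doubt about the input resolves to the safe
 * action. Returning true means "close the barrier". */
static bool should_close(reading_t r)
{
    if (!r.valid)
        return true; /* sensor failure: fail to safe, not to production */
    return r.level_cm >= CLOSE_THRESHOLD_CM;
}
```

A broken water detector thus closes the barrier, trading false alarms for the elimination of a single catastrophic failure path.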
    17. A simple design of a storm surge barrier
       • Relay (€10.00/piece)
       • Water detector (€17.50)
       • Design documentation (sponsored by Heineken)
    18. Risk analysis
       • Relay failure: chance small; cause: aging; effect: catastrophic
       • Water detector fails: chance huge; causes: rust, driftwood, seagulls (eating, shitting); effect: catastrophic
       • Measurement errors: chance colossal; causes: waves, wind; effect: false positive
       • Broken cable: chance medium; causes: digging, seagulls; effect: catastrophic
    19. System Architecture
    20. Risk analysis
    21. Typical risks identified
       • Components making the wrong decisions
       • Power failure
       • Hardware failure of PLCs/servers
       • Network failure
       • Ship hitting water sensors
       • Human maintenance error
    22. Risk ≠ system crash
       • Wrongful functional behaviour
       • Data accuracy
       • Lack of response speed
       • Understandability of the GUI
       • Tolerance towards illogical inputs
    23. Systems do misbehave...
    24. Risks can be external as well
    25. Eliminating risk isn’t the goal…
       • No matter how good the environment analysis has been:
       • Some scenarios will be missed
       • Some scenarios are too expensive to prevent:
         • Accept risk
         • Communicate to stakeholders
    26. Risk reality does change over time...
    27. 9/11...
       • Really tested our “test abort” procedure
       • Introduced a fundamentally new risk to ATC systems
       • Changed the ATC system dramatically
       • Doubled our test cases overnight
    28. Stuur X: Component architecture design
    29. Stuur X::Functionality, initial global design
       • State machine: Init → Start_D (“Start” signal to diesels) → Wacht (wait; transitions on water level below/above 3 metres) → W_O_D (wait for “Diesels ready”) → Sluit_? (“Close Barrier”)
    30. Stuur X::Functionality, final global design
    31. Stuur X::Functionality, Wait_For_Diesels, detailed design
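The speaker notes warn that the initial design deadlocks when the diesels never report ready. A minimal sketch of the wait state with a timeout guard shows the usual fail-safe fix: bound the wait and fall through to an alarm state. State names, tick-based timing, and the timeout value are illustrative assumptions, not the actual PLC code.

```c
#include <stdbool.h>

typedef enum { INIT, START_DIESELS, WAIT_FOR_DIESELS, CLOSE_BARRIER, ALARM } state_t;

#define DIESEL_TIMEOUT_TICKS 600 /* assumed upper bound on the wait */

/* One step of the controller. 'diesels_ready' is the input signal;
 * '*ticks_waiting' counts how long we have been in WAIT_FOR_DIESELS. */
state_t step(state_t s, bool diesels_ready, int *ticks_waiting)
{
    switch (s) {
    case INIT:
        return START_DIESELS;
    case START_DIESELS:              /* "Start" signal to the diesels */
        *ticks_waiting = 0;
        return WAIT_FOR_DIESELS;
    case WAIT_FOR_DIESELS:
        if (diesels_ready)
            return CLOSE_BARRIER;    /* normal path */
        if (++*ticks_waiting >= DIESEL_TIMEOUT_TICKS)
            return ALARM;            /* bounded wait: no deadlock */
        return WAIT_FOR_DIESELS;
    default:
        return s;                    /* CLOSE_BARRIER and ALARM are terminal here */
    }
}
```

Without the timeout branch, a silent diesel pins the controller in WAIT_FOR_DIESELS forever, which is exactly the catastrophic deadlock the notes describe.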
    32. VERIFICATION
       • What is getting tested, and how?
    33. The end is nigh...
    34. Challenge: time and resource limitations
       • 64-bit input isn’t that uncommon
       • 2^64 is the global rice production in 1000 years, measured in individual grains
       • Fully testing all binary inputs on a 64-bit stimulus-response system takes 2 centuries
    35. Goals of testing safety critical systems
       • Verify correct functional safety behaviour
       • Verify safety behaviour during degraded and failure conditions
    36. An example of safety critical components
    37. IEC 61508 SIL4: Required verification activities
    38. Design Validation and Verification
       • Peer reviews by
         • System architect
         • 2nd designer
         • Programmers
         • Test manager, system testing
       • Fault Tree Analysis / Failure Mode and Effect Analysis
       • Performance modelling
       • Static verification / dynamic simulation by Twente University
    39. Programming (in C/C++)
       • Coding standard:
         • Based on “Safer C”, by Les Hatton
         • May only use a safe subset of the compiler
         • Verified by Lint and 5 other tools
       • Code is peer reviewed by a 2nd developer
       • Certified and calibrated compiler
    40. Unit tests
       • Focus on conformance to specifications
       • Required coverage: 100% with respect to:
         • Code paths
         • Input equivalence classes
       • Boundary value analysis
       • Probabilistic testing
       • Execution:
         • Fully automated scripts, running 24x7
         • Creates 100 MB/hour of logs and measurement data
       • Upon bug detection:
         • 3 strikes is out: after 3 implementation errors, the unit is built by another developer
         • 2 strikes is out: the need for a 2nd rebuild implies a redesign by another designer
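Boundary value analysis over input equivalence classes, as required above, can be sketched for a hypothetical unit: a decision function with an assumed 3-metre threshold. Each class is probed, plus the values immediately around the boundary, since off-by-one errors cluster exactly there.

```c
#include <stdbool.h>

/* Hypothetical unit under test: does the measured water level (in
 * centimetres) demand closing the barrier? Threshold assumed at 3 m. */
static bool level_demands_close(int level_cm)
{
    return level_cm >= 300;
}

/* Boundary value analysis: one probe well inside each equivalence class,
 * plus the values immediately around the 300 cm boundary. */
static bool run_boundary_cases(void)
{
    return !level_demands_close(0)      /* well inside the "below" class */
        && !level_demands_close(299)    /* boundary - 1 */
        &&  level_demands_close(300)    /* the boundary itself */
        &&  level_demands_close(301)    /* boundary + 1 */
        &&  level_demands_close(10000); /* well inside the "above" class */
}
```

Five cases give 100% coverage of both equivalence classes and both sides of the boundary, which is the coverage criterion the slide states for unit tests.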
    41. Representative testing is difficult
    42. Integration testing
       • Focus on
         • Functional behaviour of a chain of components
         • Failure scenarios based on risk analysis
       • Required coverage
         • 100% coverage on input classes
       • Probabilistic testing
       • Execution:
         • Fully automated scripts, running 24x7, at 10x speed
         • Creates 250 MB/hour of logs and measurement data
       • Upon detection
         • Each bug: root-cause analysis
    43. Redundancy is a nasty beast
       • You do get functional behaviour of your entire system
       • It is nearly impossible to see if all your components are working correctly
    44. System testing
       • Focus on
         • Functional behaviour
         • Failure scenarios based on risk analysis
       • Required coverage
         • 100% complete environment (simulation)
         • 100% coverage on input classes
       • Execution:
         • Fully automated scripts, running 24x7, at 10x speed
         • Creates 250 MB/hour of logs and measurement data
       • Upon detection
         • Each bug: root-cause analysis
    45. Acceptance testing
       • Acceptance testing
         • Functional acceptance
         • Failure behaviour, all top-50 (FMECA) risks tested
         • A year of operational verification
       • Execution:
         • Tests performed on a working storm surge barrier
         • Creates 250 MB/hour of logs and measurement data
       • Upon detection
         • Each bug: root-cause analysis
    46. GUI Acceptance testing
       • Looking for
         • Quality in use for interactive systems
         • Understandability of the GUI
       • Structural investigation of the performance of system-human interactions
       • Looking for “abuse” by the users
       • Looking at real-life handling of emergency operations
    47. Avalanche testing
       • Tests the capabilities of the alarm chain
       • Usually starts with one simple trigger
       • Generally followed by millions of alarms
       • Generally brings your network and systems to the breaking point
    48. Crash and recovery procedure testing
       • Validation of system behaviour after a massive crash and restart
       • Usually identifies many issues with emergency procedures
       • Sometimes identifies issues around power supply
       • Usually identifies some (combinations of) systems incapable of unattended recovery...
    49. Testing safety critical functions is dangerous...
    50. A risk analysis of testing
       • There should always be a way out of a test procedure
       • Some things are too dangerous to test
       • Some tests introduce more risks than they mitigate
    51. Root-cause analysis
       • A painful process, by design
       • Is extremely thorough
       • Assumes that the error found is a symptom of an underlying collection of (process) flaws
       • Searches for the underlying causes of the error, and looks for similar errors that might have followed a similar path
    52. Failed gates of a potential deadlock
    53. TRENDS
       • What is the newest and hottest?
    54. Model Driven Design
    55. A real-life example
    56. A root-cause analysis of this flaw
    57. REALITY
       • What are the real-life challenges of a test manager of safety critical systems?
    58. Testing in reality
    59. It requires a specific breed of people
       • The fates of developers and testers are linked to safety critical systems for eternity
    60. Conclusions
       • Stop reading newspapers
       • Safety Critical Testing is a lot of work, making sure nothing happens
       • Technically it isn’t that much different; we’re just more rigorous and use a specific breed of people....
    61. Safeguarding life, property and the environment www.dnv.com