2011-05-02 - VU Amsterdam - Testing safety critical systems

Presentation about the steps required for Verifying and Validating safety critical systems, as well as the test approach used. Contains examples of real-life IEC 61508 SIL 4 systems.


Speaker notes
  • Copyright CIBIT Adviseurs|Opleiders 2005, Jaap van Ekris. Safety-critical systems. Field of work: nuclear power plants, air traffic control, storm surge barriers. Failures cost many human lives
  • Glenn's advantage was that it only had to work once...... And those were the sixties (you could still get away with that back then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • When I started my career, my mentor told me: “From now on, your goal is to stay off the front page of the newspapers.” I can tell you it is hard, but so far I’ve succeeded ... ALMOST: T5
  • But we still live (unknowingly) in that world.....
  • Please note that these failure rates include electromechanical failure as well! Electrocution by a light switch: chance of 10⁻⁵ per usage
  • Glenn's advantage was that it only had to work once...... Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • FTA and FMEA are opposites, and good cross-checks of each other (NASA). Although NASA does not have a flawless track record….
  • Goal: may fail only once in 10,000 years
  • You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know the contribution of each component. Then you take the most important 10, or 100, and take targeted measures
  • Typical security: hard on the outside, soft as butter on the inside
  • Once we take deadlocks and redundancy into account, our picture looks like this: not really simple anymore……
  • There is a bug in this one: this code is NOT fail-safe, because it has a potential catastrophic deadlock (when the Diesels don't report Ready); a minimal sketch follows below
  • Please be reminded: the presented code has a deadlock!
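    As an illustration of the deadlock these notes describe, a minimal C sketch of the pattern, with a fail-to-safe variant for contrast (the function and helper names are hypothetical, not the actual barrier code):

        #include <stdbool.h>

        /* Hypothetical interface to the Diesel controllers. */
        extern bool diesels_report_ready(void);
        extern void close_barrier(void);
        extern void sleep_one_second(void);   /* hypothetical helper */

        /* NOT fail-safe: if the Diesels never report Ready, this loop
         * spins forever and the barrier is never closed. */
        void wait_for_diesels_broken(void)
        {
            while (!diesels_report_ready()) {
                /* potential catastrophic deadlock */
            }
            close_barrier();
        }

        /* Fail-to-safe variant: bound the wait, then fall through to
         * the safe action even without confirmation. */
        void wait_for_diesels_failsafe(void)
        {
            for (int t = 0; t < 120; ++t) {   /* assumed 120 s budget */
                if (diesels_report_ready())
                    break;
                sleep_one_second();
            }
            close_barrier();                  /* safe state regardless */
        }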
  • Do you know the difference between validation and verification? Validation = meets external expectations: it does what it is supposed to do. Verification = meets internal expectations: it conforms to the specs
  • Funny example: printing the screen as colossal bitmaps on a PostScript printer, instead of as small vector drawings....
  • Most beautiful example: UPSes drawing too much power while charging, blowing all the fuses.... Current example: found out that the identity management server was a single point of failure...
  • This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
  • Dijkstra put mathematicians in the line of the ships being launched, just to remind them of the danger: a practice still used by Boeing and Airbus (the maiden flight). Testers, as John Glenn actually was, put their lives on the line each and every time. At Eurocontrol, each bug had a body count attached to it..... When a system fails in production, it is actual blood on our hands. I lose about a colleague a year. Quit when you think it is routine.....

Transcript

  • 1. Testing Safety Critical Systems: Theory and Experiences
    • 2 May 2011
    • Jaap van Ekris
  • 2. Jaap van Ekris
  • 3. Some people live on the edge…
    • How would you feel if you were getting ready to launch and knew you were sitting on top of two million parts -- all built by the lowest bidder on a government contract.
    • John Glenn
  • 4. Actually, we all do…
  • 5. Agenda
    • The challenge
    • Process and Organization
    • System design
    • Verification Techniques
    • Trends
    • Reality
  • 6. THE CHALLENGE
    • Why is testing safety critical systems so hard?
  • 7. Software is dangerous...
    • Capers Jones: at least 2 high-severity errors per 10 KLOC
    • Industry consensus is that software will never be more reliable than
      • 10⁻⁵ per usage
      • 10⁻⁹ per operating hour
  • 8. We even accept loss...
    • Lost/misdirected luggage: Chance of failure 10⁻² per suitcase
    • Airplane: Chance of failure 10⁻⁸ per flight hour
    • Storm Surge Barrier: Chance of failure 10⁻⁷ per usage
    • Nuclear power plant: As Low As Reasonably Practicable (ALARP)
  • 9. The value of testing
    • Program testing can be used to show the presence of bugs, but never to show their absence!
    • Edsger W. Dijkstra
  • 10. PROCESS AND ORGANIZATION
    • Who does what in safety critical software development?
  • 11. IEC 61508: Safety Integrity Level and acceptable risk
  • 12. IEC 61508: Risk distribution
  • 13. IEC 61508: A process for safety critical functions
  • 14. SYSTEM DESIGN
    • What do safety critical systems look like?
  • 15. A short introduction into storm surge barriers…
  • 16. Design Principles
    • Keep it simple...
    • Risk analysis drives design (decisions)
    • Safety first (production later)
    • Fail-to-safe
    • There shall be no single source of (catastrophic) failure
  • 17. A simple design of a storm surge barrier. Relay (€10.00/piece), water detector (€17.50), design documentation (sponsored by Heineken)
  • 18. Risk analysis
    • Relay failure. Chance: small; Cause: aging; Effect: catastrophic
    • Water detector fails. Chance: huge; Causes: rust, driftwood, seagulls (eating, shitting); Effect: catastrophic
    • Measurement errors. Chance: colossal; Causes: waves, wind; Effect: false positive
    • Broken cable. Chance: medium; Causes: digging, seagulls; Effect: catastrophic
  • 19. System Architecture
  • 20. Risk analysis
  • 21. Typical risks identified
    • Components making the wrong decisions
    • Power failure
    • Hardware failure of PLCs/servers
    • Network failure
    • Ship hitting water sensors
    • Human maintenance error
  • 22. Risk ≠ system crash
    • Wrongful functional behaviour
    • Data accuracy
    • Lack of response speed
    • Understandability of the GUI
    • Tolerance towards illogical inputs
  • 23. Systems do misbehave...
  • 24. Risks can be external as well
  • 25. Eliminating risk isn’t the goal…
    • No matter how well the environment analysis has been:
    • Some scenarios will be missed
    • Some scenarios are too expensive to prevent:
      • Accept risk
      • Communicate to stakeholders
  • 26. Risk reality does change over time...
  • 27. 9/11...
    • Really tested our “test abort” procedure
    • Introduced a fundamental new risk to ATC systems
    • Changed the ATC system dramatically
    • Doubled our test cases overnight
  • 28. Stuur X : Component architecture design
  • 29. Stuur X :: Functionality, initial global design (state diagram: Init; Wacht, waiting while the water level < 3 m; Start_D, “Start” signal to the Diesels once the water level > 3 m; W_O_D, waiting for “Diesels ready”; Sluit_?, “Close Barrier”)
  • 30. Stuur X :: Functionality, final global design
  • 31. Stuur X :: Functionality, Wait_For_Diesels, detailed design (sketched in C below)
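    Under one plausible reading of the slide-29 diagram, a minimal C sketch of this controller (the state names come from the slide; the sensor and actuator helpers are hypothetical):

        #include <stdbool.h>

        typedef enum { INIT, START_D, WACHT, W_O_D, SLUIT } state_t;

        /* Hypothetical sensor/actuator interface. */
        extern double water_level_m(void);
        extern void   signal_start_diesels(void);
        extern bool   diesels_ready(void);
        extern void   close_barrier(void);

        /* One step of the controller, called periodically. */
        void stuur_x_step(state_t *s)
        {
            switch (*s) {
            case INIT:       /* initialise, then wait for high water */
                *s = WACHT;
                break;
            case WACHT:      /* water level > 3 m triggers the start */
                if (water_level_m() > 3.0)
                    *s = START_D;
                break;
            case START_D:    /* "Start" signal to the Diesels */
                signal_start_diesels();
                *s = W_O_D;
                break;
            case W_O_D:      /* wait for "Diesels ready": unbounded! */
                if (diesels_ready())
                    *s = SLUIT;
                break;
            case SLUIT:      /* "Close Barrier" */
                close_barrier();
                break;
            }
        }

    Note that W_O_D waits unboundedly: exactly the deadlock the speaker notes warn about when the Diesels never report Ready.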
  • 32. VERIFICATION
    • What is getting tested, and how?
  • 33. The end is nigh...
  • 34. Challenge: time and resource limitations
    • 64 bits of input isn’t that uncommon
    • 2⁶⁴ is the global rice production of 1,000 years, measured in individual grains
    • Fully testing all binary inputs on a 64-bit stimulus-response system takes 2 centuries (checked below)
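    That two-centuries figure is straightforward arithmetic; a back-of-the-envelope check in C, assuming an optimistic one test per clock cycle on a 3 GHz machine:

        #include <stdio.h>

        int main(void)
        {
            const double tests       = 18446744073709551616.0;  /* 2^64 */
            const double tests_per_s = 3.0e9;    /* one per 3 GHz cycle */
            const double secs_per_yr = 365.25 * 24.0 * 3600.0;
            printf("%.0f years\n", tests / tests_per_s / secs_per_yr);
            return 0;   /* prints about 195 years: roughly 2 centuries */
        }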
  • 35. Goals of testing safety critical systems
    • Verify correct functional safety-behaviour
    • Verify safety-behaviour during degraded and failure conditions
  • 36. An example of safety critical components
  • 37. IEC 61508 SIL4: Required verification activities
  • 38. Design Validation and Verification
    • Peer reviews by
      • System architect
      • System architect
      • 2nd designer
      • Programmers
      • Test manager (system testing)
    • Fault Tree Analysis / Failure Mode and Effect Analysis
    • Performance modeling
    • Static verification / dynamic simulation (by Twente University)
  • 39. Programming (in C/C++)
    • Coding standard:
      • Based on “Safer C”, by Les Hatton
      • May only use a safe subset of the compiler (illustrated below)
      • Verified by Lint and 5 other tools
    • Code is peer-reviewed by a 2nd developer
    • Certified and calibrated compiler
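    The deck does not list the individual coding rules; purely as an illustration, a typical “Safer C”-style rule forbids unbounded constructs in favour of bounded ones with an explicit failure path:

        #include <stdbool.h>

        extern bool sensor_busy(void);   /* hypothetical status poll */

        /* A safe subset typically bans unbounded loops such as
         *     while (sensor_busy()) { }
         * because termination cannot be guaranteed. The bounded
         * equivalent makes the failure path explicit: */
        bool wait_sensor(int max_polls)
        {
            for (int i = 0; i < max_polls; ++i) {
                if (!sensor_busy())
                    return true;   /* sensor freed up in time */
            }
            return false;   /* timed out: caller must fail to safe */
        }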
  • 40. Unit tests
    • Focus on conformance to specifications
    • Required coverage: 100% with respect to:
      • Code paths
      • Input equivalence classes
    • Boundary-value analysis (sketched after this list)
    • Probabilistic testing
    • Execution:
      • Fully automated scripts, running 24x7
      • Creates 100 MB/hour of logs and measurement data
    • Upon bug detection
      • 3 strikes is out → after 3 implementation errors, the unit is built by another developer
      • 2 strikes is out → needing a 2nd rebuild implies a redesign by another designer
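    A minimal sketch of boundary-value analysis and equivalence classes, using the 3-metre threshold from the earlier design (an assert-based harness for illustration; the function under test is hypothetical, not the project's code):

        #include <assert.h>
        #include <stdbool.h>

        /* Unit under test: the start decision from the earlier design. */
        static bool must_start_diesels(double water_level_m)
        {
            return water_level_m > 3.0;
        }

        int main(void)
        {
            /* Boundary values around the 3 m threshold... */
            assert(!must_start_diesels(2.999));  /* just below: stay idle */
            assert(!must_start_diesels(3.0));    /* on the boundary */
            assert( must_start_diesels(3.001));  /* just above: start */

            /* ...plus one representative per equivalence class. */
            assert(!must_start_diesels(0.0));    /* calm water */
            assert( must_start_diesels(5.0));    /* storm surge */
            return 0;
        }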
  • 41. Representative testing is difficult
  • 42. Integration testing
    • Focus on
      • Functional behaviour of chain of components
      • Failure scenarios based on risk analysis
    • Required coverage
      • 100% coverage on input classes
    • Probabilistic testing
    • Execution:
      • Fully automated scripts, running 24x7 at 10× speed
      • Creates 250 MB/hour of logs and measurement data
    • Upon detection
      • Each bug → root-cause analysis
  • 43. Redundancy is a nasty beast
    • You do get functional behaviour of your entire system
    • It is nearly impossible to see whether all your components are working correctly (illustrated below)
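    The deck gives no code for this; as an illustration, a 2-out-of-3 majority voter masks a single broken channel, so black-box functional behaviour alone cannot reveal the failure:

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustration only: a 2-out-of-3 majority voter. */
        static bool vote_2oo3(bool a, bool b, bool c)
        {
            return (a && b) || (a && c) || (b && c);
        }

        int main(void)
        {
            bool demand = true;   /* e.g. a "close barrier" demand */

            /* Channel b is stuck-at-false; a and c are healthy. */
            bool a = demand, b = false, c = demand;

            /* The voted output is still correct, so the system level
             * gives no hint that channel b has failed. */
            printf("voted output: %d\n", vote_2oo3(a, b, c));   /* 1 */
            return 0;
        }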
  • 44. System testing
    • Focus on
      • Functional behaviour
      • Failure scenarios based on risk analysis
    • Required coverage
      • 100% complete environment (simulation)
      • 100% coverage on input classes
    • Execution:
      • Fully automated scripts, running 24x7 at 10× speed
      • Creates 250 MB/hour of logs and measurement data
    • Upon detection
      • Each bug → root-cause analysis
  • 45. Acceptance testing
    • Acceptance testing
      • Functional acceptance
      • Failure behaviour, all top 50 (FMECA) risks tested
      • A year of operational verification
    • Execution:
      • Tests performed on a working storm surge barrier
      • Creates 250 MB/hour of logs and measurement data
    • Upon detection
      • Each bug → root-cause analysis
  • 46. GUI Acceptance testing
    • Looking for
      • Quality in use for interactive systems
      • Understandability of the GUI
    • Structured investigation of the performance of human-system interactions
    • Looking for “abuse” by the users
    • Looking at real-life handling of emergency operations
  • 47. Avalanche testing
    • To test the capabilities of alarm handling
    • Usually starts with one simple trigger
    • Generally followed by millions of alarms
    • Generally brings your network and systems to the breaking point
  • 48. Crash and recovery procedure testing
    • Validation of system behaviour after a massive crash and restart
    • Usually identifies many issues about emergency procedures
    • Sometimes identifies issues around power supply
    • Usually identifies some (combinations of) systems incapable of unattended recovery...
  • 49. Testing safety critical functions is dangerous...
  • 50. A risk analysis to testing
    • There should always be a way out of a test procedure
    • Some things are too dangerous to test
    • Some tests introduce more risks than they try to mitigate
  • 51. Root-cause analysis
    • A painful process, by design
    • Is extremely thorough
    • Assumes that the error found is a symptom of an underlying collection of (process) flaws
    • Searches for the underlying causes of the error, and looks for similar errors that might have followed a similar path
  • 52. Failed gates of a potential deadlock
  • 53. TRENDS
    • What is the newest and hottest?
  • 54. Model Driven Design
  • 55. A real-life example
  • 56. A root-cause analysis of this flaw
  • 57. REALITY
    • What are the real-life challenges of a test manager of safety critical systems?
  • 58. Testing in reality
  • 59. It requires a specific breed of people
    • The fates of developers and testers are linked to safety critical systems into eternity
  • 60. Conclusions
    • Stop reading newspapers
    • Safety Critical Testing is a lot of work, making sure nothing happens
    • Technically it isn’t that much different; we’re just more rigorous and use a specific breed of people....
  • 61. Safeguarding life, property and the environment www.dnv.com