Testing safety critical systems: Practice and Theory (14-05-2013, VU Amsterdam)


Presentation about the steps required for verifying and validating safety-critical systems, as well as the test approach used. It goes beyond the simple processes and also discusses the required safety culture and the people required. The presentation contains examples of real-life IEC 61508 SIL 4 systems used on storm surge barriers.


Usage Rights: © All Rights Reserved

  • Copyright CIBIT Adviseurs|Opleiders 2005, Jaap van Ekris. Safety-critical systems. Field of work: nuclear power plants, air traffic control, storm surge barriers. Errors cost many human lives.
  • Glenn's advantage was that it only had to work once... And that was the sixties (you could still get away with that then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • When I started my career, my mentor told me: “From now on, your goal is to stay off the front pages of the newspapers.” I can tell you it is hard, but so far I’ve succeeded.
  • Please note that these failure rates include electromechanical failure as well! Electrocution by a light switch: chance of 10^-5 per usage, which is exactly the chance of dating a supermodel as well. 25 April 2013
  • Please note that pilots have a redundant counterpart…
  • But we still (unknowingly) live in that world...
  • Glenn's advantage was that it only had to work once... Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
  • FTA and FMEA are opposites, and good cross-checks of each other (NASA). Although NASA does not have a flawless track record…
  • Aqueduct of Segovia, Iberian peninsula. Built by the Romans between 50 AD and 125 AD. The architects over-engineered their work so heavily that it is still standing 2000 years later. People do have to realize that the commitment of people to get it right the first time is essential. At Eurocontrol, we mentioned a projected death toll with every bug.
  • Goal: may fail only once every 10,000 years
  • You start with your primary concern. The process is simple: you break your problem down until you have those 2 million parts, and you know what the contribution of each component is. You take the most important 10, or 100, and take targeted measures.
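The decomposition described in this note can be sketched as ranking component contributions to the top-level failure probability and then aiming measures at the biggest ones. This is an invented illustration: the component names and probabilities below are not taken from the barrier's actual risk model.

```python
# Hypothetical sketch of the decomposition above: estimate each
# component's contribution to top-level failure, then focus targeted
# measures on the largest contributors. All names/numbers are invented.

def top_contributors(contributions, n=3):
    """Return the n components with the largest failure contribution."""
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:n]

contributions = {
    "relay": 1e-6,
    "water_detector": 5e-4,   # dominant contributor in this toy model
    "cable": 1e-5,
    "power_supply": 2e-4,
}

for name, p in top_contributors(contributions):
    print(name, p)            # water_detector first: it gets measures first
```

In a real fault tree the contributions come out of the tree's gate structure rather than a flat dictionary, but the "sort and take the top 10 or 100" step is the same.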
  • Tickles security: hard on the outside, butter-soft on the inside
  • The perfect “single point of failure”
  • Once we take deadlocks and redundancy into account, our picture looks like this: not really simple any more…
  • There is a bug in this one: this code is NOT fail-safe, because it has a potential catastrophic deadlock (when the Diesels don’t report Ready)...
  • Please be reminded: the presented code has a deadlock!
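A common repair for the deadlock mentioned in these notes is to bound the wait with a watchdog timeout, so the code fails to safe instead of blocking forever when the Diesels never report Ready. This is a minimal sketch, not the actual StuurX code: `diesels_ready`, the timeout value, and the return codes are all assumptions.

```python
import time

def wait_for_diesels(diesels_ready, timeout_s=120.0, poll_s=0.1):
    """Wait until diesels_ready() is True, but fail to safe on timeout.

    Per the notes, the presented code blocks forever when the Diesels
    never report Ready; bounding the wait removes that deadlock.
    timeout_s, poll_s and the return codes are assumptions of this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if diesels_ready():
            return "CLOSE_BARRIER"   # normal path: proceed to close
        time.sleep(poll_s)
    return "FAIL_SAFE"               # Diesels never ready: alarm, don't hang
```

The fail-safe branch would typically raise an alarm and fall back to a degraded closing procedure rather than keep waiting.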
  • Do you know the difference between validation and verification? Validation = meets external expectations: does what it is supposed to do. Verification = meets internal expectations: conforms to specs.
  • Funny example: printing screen....
  • Most beautiful example: UPSes using too much power to charge, killing all fuses... Current example: found out that the identity management server was a single point of failure... Eurocontrol example: the control unit wasn’t ready for the CWPs, and after that got overloaded.
  • This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
  • Our successes are unknown; our failures make the headlines… When a system fails in production, it is actual blood on our hands. At Eurocontrol, each bug had a body count attached to it...

Testing safety critical systems: Practice and Theory (14-05-2013, VU Amsterdam), Presentation Transcript

  • Testing Safety Critical Systems: Theory and Experiences. J.vanEkris@Delta-Pi.nl
  • Jaap van Ekris
  • Agenda: The challenge • Process and Organization • System design • Verification Techniques • Trends • Reality
  • THE CHALLENGE: Why is testing safety critical systems so hard?
  • Some people live on the edge… “How would you feel if you were getting ready to launch and knew you were sitting on top of two million parts -- all built by the lowest bidder on a government contract?” John Glenn
  • Actually, we all do…
  • We even accept loss... • Lost/misdirected luggage: chance of failure 10^-2 per suitcase • Airplane: chance of crash 10^-8 per flight hour • Storm surge barrier: chance of failure 10^-7 per usage • Nuclear power plant: As Low As Reasonably Practicable (ALARP)
  • Are the software risks acceptable?
  • To put things in perspective… • Getting killed in traffic: 10^-2 per year • Having a drunk pilot: 10^-2 per flight • Hurting yourself when using a chainsaw: 10^-3 per use • Being considered possessed by Satan: 10^-4 per lifetime • Dating a supermodel: 10^-5 in a lifetime • Drowning in a bathtub: 10^-7 in a lifetime • Being hit by falling airplane parts: 10^-8 in a lifetime • Being killed by lightning: 10^-9 per lifetime • Your house being hit by a meteor: 10^-15 per lifetime
  • We might have become overprotective…
  • Nonetheless software is dangerous...
  • and the odds are against us… • Capers Jones: at least 2 high-severity errors per 10 KLoc • Industry consensus is that software will never be more reliable than – 10^-5 per usage – 10^-9 per operating hour
  • The value of testing: “Program testing can be used to show the presence of bugs, but never to show their absence!” Edsger W. Dijkstra
  • PROCESS AND ORGANIZATION: Who does what in safety critical software development?
  • IEC 61508: Safety Integrity Level and acceptable risk
  • IEC 61508: Risk distribution
  • IEC 61508: A process for safety critical functions
  • Process or personal commitment? • Romans put the architect under the arches when removing the scaffolding • Boeing and Airbus put all lead engineers on the first test flight • Dijkstra put his “rekenmeisjes” (computing girls) on the opposite dock when launching ships
  • It is about keeping your back straight… • Thomas Andrews, Jr. • Naval architect in charge of RMS Titanic • He recognized regulations were insufficient for a ship the size of Titanic • Decisions “forced upon him” by the client: – Limit the range of double hulls – Limit the number of lifeboats • He was on the maiden voyage to spot improvements • He knowingly went down with the ship, saving as many as he could
  • SYSTEM DESIGN: What do safety critical systems look like?
  • An introduction into storm surge barriers…
  • Design Principles • Keep it simple... • Risk analysis drives design (decisions) • Safety first (production later) • Fail-to-safe • There shall be no single source of (catastrophic) failure
  • A simple design of a storm surge barrier: relay (€10.00/piece), water detector (€17.50), design documentation (sponsored by Heineken)
  • Risk analysis • Relay failure – Chance: small – Cause: aging – Effect: catastrophic • Water detector fails – Chance: huge – Causes: rust, driftwood, seagulls (eating, shitting) – Effect: catastrophic • Measurement errors – Chance: colossal – Causes: waves, wind – Effect: false positive • Broken cable – Chance: medium – Causes: digging, seagulls – Effect: catastrophic
  • System Architecture
  • Risk analysis
  • Typical risks identified • Components making the wrong decisions • Power failure • Hardware failure of PLCs/servers • Network failure • Ship hitting water sensors • Human maintenance error
  • Risk ≠ system crash • Wrongful functional behaviour • Data accuracy • Lack of response speed • Understandability of the GUI • Tolerance towards illogical inputs
  • Systems do misbehave...
  • Can be late…
  • Risks can be external as well
  • Eliminating risk isn’t the goal… No matter how good the environment analysis has been: • Some scenarios will be missed • Some scenarios are too expensive to prevent: – Accept risk – Communicate to stakeholders
  • Risks can be contradictory… Availability of the service vs. safety of the installation
  • Risk reality does change over time...
  • 9/11... • Really tested our “test abort” procedure • Introduced a fundamentally new risk to ATC systems • Changed the ATC system dramatically • Doubled our test cases overnight
  • StuurX: Component architecture design
  • StuurX: Functionality, initial global design. State diagram: Init; Start_D (“Start” signal to Diesels); Wacht (waiting, transitions on water level above/below 3 meter); W_O_D (“Diesels ready”); Sluit_? (“Close Barrier”).
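The initial global design on this slide can be sketched as a dictionary-driven state machine. The state names follow the slide; the event names and transitions are assumptions reconstructed from the diagram labels. Note that, exactly as the speaker notes warn, W_O_D has no way out when the Diesels never report ready.

```python
# Sketch of the initial global design: state names from the slide
# (Init, Start_D, Wacht, W_O_D, Sluit_?); events/transitions assumed.

TRANSITIONS = {
    ("Init", "water_above_3m"): "Start_D",  # rising water: start the Diesels
    ("Start_D", "start_sent"): "Wacht",     # "Start" signal to Diesels sent
    ("Wacht", "water_below_3m"): "Init",    # water receded: stand down
    ("Wacht", "water_above_3m"): "W_O_D",   # still high: wait for Diesels
    ("W_O_D", "diesels_ready"): "Sluit_?",  # "Diesels ready": close barrier
}

def step(state, event):
    """Advance the machine; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "Init"
for event in ["water_above_3m", "start_sent", "water_above_3m", "diesels_ready"]:
    state = step(state, event)
print(state)  # Sluit_?
```

The table-driven form makes the missing escape transitions easy to spot by inspection, which is what the later, final design has to add.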
  • StuurX: Functionality, final global design
  • StuurX: Functionality, Wait_For_Diesels, detailed design
  • VERIFICATION: What is getting tested, and how?
  • The end is nigh...
  • Challenge: time and resource limitations • 64 bits of input isn’t that uncommon • 2^64 is the global rice production of 1000 years, measured in individual grains • Fully testing all binary inputs on a 64-bit stimulus-response system takes 2 centuries
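The infeasibility claim on this slide is easy to check with back-of-the-envelope arithmetic. The test rate of 10^9 tests per second is an assumption; the slide's "2 centuries" implies an even faster test rig.

```python
# Back-of-the-envelope check of the slide's claim that exhaustively
# testing a 64-bit stimulus-response system is infeasible.
# Assumed rate: one billion tests per second.

SECONDS_PER_YEAR = 60 * 60 * 24 * 365
tests = 2 ** 64
rate = 10 ** 9

years = tests / rate / SECONDS_PER_YEAR
print(round(years))  # 585 -- centuries, even at a billion tests per second
```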
  • Goals of testing safety critical systems• Verify contractually agreed functionality• Verify correct functional safety-behaviour• Verify safety-behaviour during degraded andfailure conditions
  • An example of safety critical components
  • IEC 61508 SIL4: Required verification activities
  • Design Validation and Verification • Peer reviews by – System architect – 2nd designer – Programmers – Test manager system testing • Fault Tree Analysis / Failure Mode and Effect Analysis • Performance modeling • Static verification / dynamic simulation (by Twente University)
  • Programming (in C/C++) • Coding standard: – Based on “Safer C”, by Les Hatton – May only use a safe subset of the compiler – Verified by Lint and 5 other tools • Code is peer reviewed by a 2nd developer • Certified and calibrated compiler
  • Unit tests • Focus on conformance to specifications • Required coverage: 100% with respect to: – Code paths – Input equivalence classes • Boundary value analysis • Probabilistic testing • Execution: – Fully automated scripts, running 24x7 – Creates 100 MB/hour of logs and measurement data • Upon bug detection: – 3 strikes is out → after 3 implementation errors it is built by another developer – 2 strikes is out → need for a 2nd rebuild implies a redesign by another designer
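The boundary value analysis listed above can be sketched as generating, for each input equivalence class, the cases just inside, on, and just outside both boundaries. The water-level class of 300..500 cm is an invented example, not taken from the barrier's specification.

```python
# Sketch of boundary value analysis: for a closed interval [lo, hi],
# test just below, on, and just above each boundary.

def boundary_values(lo, hi, step=1):
    """Classic boundary cases for the equivalence class [lo, hi]."""
    return [lo - step, lo, lo + step, hi - step, hi, hi + step]

# Hypothetical equivalence class: "close barrier" range of 300..500 cm.
print(boundary_values(300, 500))  # [299, 300, 301, 499, 500, 501]
```

Combined with one representative value per class, this is how the "100% coverage of input equivalence classes" requirement turns into a concrete, enumerable test set.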
  • Representative testing is difficult
  • Integration testing • Focus on – Functional behaviour of the chain of components – Failure scenarios based on risk analysis • Required coverage – 100% coverage on input classes • Probabilistic testing • Execution: – Fully automated scripts, running 24x7, at 10x speed – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug → root-cause analysis
  • Redundancy is a nasty beast • You do get functional behaviour of your entire system • It is nearly impossible to see if all your components are working correctly
  • System testing • Focus on – Functional behaviour – Failure scenarios based on risk analysis • Required coverage – 100% complete environment (simulation) – 100% coverage on input classes • Execution: – Fully automated scripts, running 24x7, at 10x speed – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug → root-cause analysis
  • Acceptance testing 1. Functional acceptance 2. Failure behaviour, all top-50 (FMECA) risks tested 3. A year of operational verification • Execution: – Tests performed on a working storm surge barrier – Creates 250 MB/hour of logs and measurement data • Upon detection – Each bug → root-cause analysis
  • Endurance testing • Look for the “one in a million times” problem • Challenge: – Software is deterministic – Execution is not (timing, system load, bit errors) • Have an automated script run it over and over again
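The endurance idea above, running the same deterministic test over and over to catch nondeterministic execution, can be sketched as a harness that records any run deviating from the expected result. `flaky` below is an invented stand-in for a system whose timing-dependent behaviour occasionally differs.

```python
# Sketch of an endurance harness: rerun one test many times and record
# every run whose outcome deviates, to surface "one in a million" issues.

def endurance_test(run_once, runs=1_000_000, expected="OK"):
    """Return the indices of runs where run_once() deviated."""
    return [i for i in range(runs) if run_once() != expected]

# Demo with an invented system that misbehaves on one specific run.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return "FAIL" if calls["n"] == 123_456 else "OK"

print(endurance_test(flaky))  # [123455]
```

The payoff is the failing run index plus its logs, which is what makes a timing- or load-dependent failure reproducible enough for root-cause analysis.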
  • GUI acceptance testing • Looking for – Quality in use for interactive systems – Understandability of the GUI • Structural investigation of the performance of the system-human interactions • Looking for “abuse” by the users • Looking at real-life handling of emergency operations
  • Avalanche testing • Tests the capabilities of alarming and control • Usually starts with one simple trigger • Generally followed by millions of alarms • Generally brings your network and systems to the breaking point
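The avalanche scenario above, one trigger fanning out into a flood of alarms through dependent components, can be sketched as a cascade counter. The fan-out model and numbers are invented for illustration; a real avalanche test would feed the generated alarms into the actual alarming pipeline.

```python
# Sketch of the avalanche scenario: one trigger fails, every dependent
# component alarms, and the wave fans out level by level.

def avalanche(dependents, trigger, depth=5):
    """Count alarms raised over `depth` levels of cascading failures."""
    alarms, frontier = 0, [trigger]
    for _ in range(depth):
        frontier = [d for comp in frontier for d in dependents(comp)]
        alarms += len(frontier)
    return alarms

# With 10 dependents per component, one trigger yields 111,110 alarms.
fanout10 = lambda comp: [f"{comp}.{i}" for i in range(10)]
print(avalanche(fanout10, "sensor"))  # 111110
```

The geometric growth is the point: a modest fan-out per component is enough to take the alarm network to its breaking point within a few cascade levels.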
  • Crash and recovery procedure testing • Validation of system behaviour after a massive crash and restart • Usually identifies many issues about emergency procedures • Sometimes identifies issues around power supply • Usually identifies some (combinations of) systems incapable of unattended recovery...
  • Testing safety critical functions is dangerous...
  • A risk analysis of testing itself • There should always be a way out of a test procedure • Some things are too dangerous to test • Some tests introduce more risks than they try to mitigate
  • Root-cause analysis • A painful process, by design • Is extremely thorough • Assumes that the error found is a symptom of an underlying collection of (process) flaws • Searches for the underlying causes of the error, and looks for possible similar errors that might have followed a similar path
  • Failed gates of a potential deadlock
  • TRENDS: What is the newest and hottest?
  • Model Driven Design
  • A real-life example
  • A root-cause analysis of this flaw
  • REALITY: What are the real-life challenges of a test manager of safety critical systems?
  • Testing in reality
  • It requires a specific breed of people. The fates of developers and testers are linked to safety critical systems into eternity
  • Conclusions • Stop reading newspapers • Safety critical testing is a lot of work, making sure nothing happens • Technically it isn’t that much different; we’re just more rigorous and use a specific breed of people...