A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis


Talk from ISSRE 2010

  • Why is this done? Benefits: determine the components contributing most to the reliability of the software architecture; allocate testing effort and set testing goals for units; evaluate design alternatives and improve the architecture; obtain a more reliable system backed by quantitative numbers.
  • Report on experiences and methods used; lessons learned; what needs to be improved (from our perspective).
  • 3 MLOC of C++ (COM, ATL); 9 subsystems, more than 100 components; manages industrial processes (e.g., power generation, paper production, oil and gas refining); distributed system with controllers, servers, networks, and field devices; operator workplace for controlling the process: monitoring sensor readings, manipulating actuators.
  • Also serves as the agenda for the rest of the talk.
  • Larger font, less text.
  • Selected the Littlewood/Verrall model from IEEE Std. 1633: industry affinity (SCADA) and a good fit in initial tests. Assumptions: times between failures are exponentially distributed; repair may introduce new faults; repair time is 0; the failure rate λ_i is a random variable with a Gamma distribution. We were able to fit the whole dataset, without filtering data, at the 5% significance level with the quadratic Littlewood/Verrall model (LV-Q). Difficulties: failure reports are often not mapped to components in bug tracking systems; it is difficult to select a model because too many models are available; statistical validity is hard to establish.
  • Failure data from the bug tracker, filtered for critical/high-severity bugs. Quadratic model: programmers have good intentions in fixing the code. Done for each subsystem; result: 9 failure probabilities.
  • Installed and configured the system; defined 2 load profiles and configured load drivers; configured an ABB tool to log subsystem transitions; executed the load drivers for each profile (2 days); processed the logs (2 GB) with a script; added initial and final states; calculated transition probabilities; validated the model by comparing it with the architectural documentation and interviewing PCS experts.
  • Q: transition probability matrix (obtained by eliminating the failure state); S: steady-state probabilities; R: system reliability (probability of reaching the success state).
  • Units obfuscated for confidentiality reasons. Subsystem 8 has the highest failure probability; subsystem 1 has the highest sensitivity with respect to system reliability; subsystem 6 is used by many subsystems but contributes only to a limited extent to system reliability.
  • Explain the distribution. Many variation points, limited step-by-step guidance; time-consuming data collection for non-experts; best suited for small changes to existing systems; needs to be tailored to the available data.
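The reliability growth step in the notes (exponential times between failures, Gamma-distributed failure rate λ_i, quadratic variant LV-Q) can be illustrated with a minimal sketch. It assumes the commonly used quadratic form ψ(i) = β0 + β1·i² for LV-Q, treats ψ(i) as the rate parameter of the Gamma density g(λ_i) = ψ(i)^α λ_i^(α−1) exp(−ψ(i) λ_i) / Γ(α), and does not reproduce the CASRE 3.0 curve fitting: α, β0, and β1 below are hypothetical values, not results from the study. Integrating exp(−λ_i t) against that Gamma density gives the closed form used for the predictive reliability.

```python
import math


def psi_quadratic(i: int, beta0: float, beta1: float) -> float:
    """Assumed LV-Q growth function: psi(i) = beta0 + beta1 * i**2."""
    return beta0 + beta1 * i * i


def lv_reliability(t: float, i: int, alpha: float, beta0: float, beta1: float) -> float:
    """P(T_i > t) under Littlewood/Verrall.

    T_i ~ Exp(lambda_i) and lambda_i ~ Gamma(shape=alpha, rate=psi(i)), so
    integrating exp(-lambda_i * t) over the Gamma density gives
    (psi(i) / (psi(i) + t)) ** alpha.
    """
    psi = psi_quadratic(i, beta0, beta1)
    return (psi / (psi + t)) ** alpha


def lv_failure_probability(t: float, i: int, alpha: float, beta0: float, beta1: float) -> float:
    """Probability that the i-th time between failures is at most t."""
    return 1.0 - lv_reliability(t, i, alpha, beta0, beta1)


def lv_mean_time_to_failure(i: int, alpha: float, beta0: float, beta1: float) -> float:
    """E[T_i] = psi(i) / (alpha - 1); infinite when alpha <= 1."""
    if alpha <= 1.0:
        return math.inf
    return psi_quadratic(i, beta0, beta1) / (alpha - 1.0)


if __name__ == "__main__":
    # Hypothetical parameters, as if fitted to one subsystem's filtered bug list.
    alpha, beta0, beta1 = 2.0, 10.0, 0.5
    next_failure = 11  # index of the next, not yet observed, failure
    print(lv_failure_probability(24.0, next_failure, alpha, beta0, beta1))
```

Repeating this per subsystem with its own fitted parameters and a common time horizon t yields one failure probability per subsystem, which is the input the Markov model needs. Because ψ(i) grows with i, later inter-failure times are stochastically longer, which is how the model encodes reliability growth.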
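The profiling steps in the notes (process the logs, add initial and final states, calculate transition probabilities) amount to counting observed subsystem-to-subsystem transitions and normalizing each row. The sketch below assumes each trace has already been reduced to a list of subsystem names in call order; the actual log format of the ABB tool is not described in the talk, and the state names `START` and `END` as well as the helper `to_matrix` are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "START", "END"  # hypothetical names for the added initial/final states


def estimate_transition_probabilities(traces):
    """Maximum-likelihood estimate of P(src -> dst) from subsystem traces.

    Each trace is a list of subsystem names in the order they were entered.
    An initial state START and a final state END are added to every trace.
    Consecutive repeats of a subsystem are counted as self-transitions.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        path = [START, *trace, END]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[src] = {dst: n / total for dst, n in outgoing.items()}
    return probs


def to_matrix(probs, states):
    """Dense matrix in the given state order; missing transitions become 0.0."""
    return [[probs.get(s, {}).get(t, 0.0) for t in states] for s in states]
```

For example, the traces `[["A", "B"], ["A", "C"], ["A", "B"]]` give P(A → B) = 2/3 and P(A → C) = 1/3. Ordering the states with `START` first and `END` last produces a matrix whose final row is all zeros, the shape a Cheung-style model expects for its exit state.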

    1. © ABB Group January 30, 2015 | Slide 1: A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis. Heiko Koziolek, Bastian Schlich, Carlos Bilich, ABB Corporate Research, 2010-11-01
    2. Architecture-based Software Reliability Analysis (ABSRA): What?
       Typical questions of software architects concerning reliability:
       - “What is the reliability (probability of failure) of my system?”
       - “How do individual components contribute to the system reliability?”
       - “Which architectural alternative is best for reliability?”
       - “Where shall I introduce fault-tolerance mechanisms?”
       - “How should I distribute my limited testing effort among components?”
       Additional questions from ABB:
       - “How much more reliable is a new architecture than a former one?”
       - “Does ABSRA work on large-scale systems?”
    3. Architecture-based Software Reliability Analysis (ABSRA): How? Figure: software components, their control flow, and component reliabilities (R=0.995, R=0.982, R=0.937) are combined and transformed into a Markov model; solving the model yields the predicted system reliability (R = 0.9923), which is then used to improve the architecture.
    4. Related work: existing empirical studies
       “… very little effort has been devoted to the validation of architecture-based software reliability techniques.” [Gokhale2007, IEEE Transactions on Dependable and Secure Computing, Vol. 4, No. 1]
       Source | Name | Year | Lang. | LOC | # Components
       [Gokhale2004, Perf. Eval.] | SHARPE | 1998 | C | 35,000 | 30
       [Goseva2001, ISSRE] | ESA | 2001 | C | 10,000 | 3
       [Goseva2005, ISSRE] | GCC | 2005 | C | 350,000 | 13
       [Wang2005, JSS] | SMS | 2006 | C/C++ | 13,000 | 15
       [Goseva2006, ISSRE] | IDN | 2006 | C | 11,000 | 6
       Our paper | ABB | 2010 | C++ | >3,000,000 | 8 (>100)
    5. System under study: process control system
    6. System under study: process control system topology. Figure: plant/office network, network isolation device, remote workplaces, firewall, Internet, redundant network, workplaces, controllers, servers, fieldbus, remote I/O and field devices.
    7. System under study: process control system, subsystems within the servers
    8. Which steps are required for ABSRA? Estimate component failure probabilities; estimate transition probabilities; construct the Markov model; exploit the results.
    9. Estimate component failure probabilities: existing methods
       - Code metrics [Nagappan2006]: validity debated
       - Reliability growth modeling [IEEE Std 1633-2008]: requires component failure reports
       - Random/statistical testing [Miller1992]: does not scale, difficult to apply to components
       - Fault injection [Gokhale2004]: does not determine the current reliability
       - Explicit failure modeling [Cheung2008]: accuracy unknown
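    10. Reliability growth modeling: general principle (Littlewood/Verrall model). The time between failures $T_i$ is exponentially distributed with failure rate $\lambda_i$, which is itself a Gamma-distributed random variable:
       $g(\lambda_i; \alpha, \psi(i)) = \frac{[\psi(i)]^{\alpha}\,\lambda_i^{\alpha-1}\,\exp(-\psi(i)\,\lambda_i)}{\Gamma(\alpha)}, \quad \lambda_i > 0$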
    11. Reliability growth modeling: using the Littlewood/Verrall model on one subsystem
       - Filtered subsystem bug list
       - Release dates
       - Curve fitting in CASRE 3.0 (http://www.openchannelsoftware.com/projects/CASRE_3.0/)
    12. Reliability growth modeling: result. R1= ... R8= ... R4= ... R3= ... R5= ... R6= ... R7= ... R2= ...
    13. Which steps are required for ABSRA? Estimate component failure probabilities; estimate transition probabilities; construct the Markov model; exploit the results.
    14. Estimate component transition probabilities: existing methods
       - Exploiting design documents [Gokhale2007]: only static dependencies in the software architecture
       - Profiling [Goseva2005]: complicated filtering of data required
       - Manual code instrumentation: can be time-consuming
    15. Estimate component transition probabilities: profiling with proprietary tools. Figure: set up and ran the system; example trace from profiling; self-coded script.
    16. Which steps are required for ABSRA? Estimate component failure probabilities; estimate transition probabilities; construct the Markov model; exploit the results.
    17. Construct the Markov model: existing state-based methods [Littlewood1979] [Cheung1980] [Laprie1984] [Kubat1989] [Gokhale1998] [Ledoux1999] [Gokhale1998-2] [Goseva-Popstojanova2001]
    18. Cheung model: adding failure and end states, computing the reliability [Cheung1980]
    19. Which steps are required for ABSRA? Estimate component failure probabilities; estimate transition probabilities; construct the Markov model; exploit the results.
    20. Exploit the results: possibilities
       - Estimate system reliability [Cheung1980]: reliability experienced by customers is hard to validate
       - Conduct sensitivity analysis [Gokhale2002]: study system reliability for varying component failure rates
       - Assess costs of bugs [Cheung1980]: quantify the effect of an error in a component
       - Evaluate design alternatives [Goseva2001]: values for new components need to be guessed
       - Allocate test budgets efficiently [Pietrantuono2010]: test critical components more often
    21. Sensitivity analysis: impact of varying subsystem failure rates (http://www.prismmodelchecker.org/)
    22. Evaluation: cost estimations in person hours (best/worst case)
    23. Conclusions: lessons learned
       - Getting failure and transition probabilities is hard: time-consuming, error-prone, limited automation. The main obstacle for ABSRA is data collection.
       - Currently rather simple models: no technologies, concurrency, or hardware.
       - Difficult to evaluate architecture alternatives: limited decision support from the predictions.
       - Lack of empirical studies in the literature: predominantly small systems, often dubious techniques for estimating failure rates; replicated case studies are needed.
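The Cheung model (slides 17 and 18) combines the component reliabilities with the transition probabilities. A minimal sketch, assuming the formulation as usually stated: over the component states (with the absorbing correct-output and failure states left out), Q(i, j) = R_i · P(i, j), S = (I − Q)^(−1), and the system reliability is R = S(1, n) · R_n, where state 1 is the entry and state n the exit. Only one column of S is needed, so the sketch solves a single linear system with plain Gauss-Jordan elimination instead of inverting the matrix; the function names are illustrative.

```python
def _solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            raise ValueError("I - Q is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def cheung_reliability(reliabilities, transitions):
    """System reliability R = S(1, n) * R_n of Cheung's model.

    reliabilities[i]: probability that component i executes correctly.
    transitions[i][j]: probability that control moves from i to j.
    State 0 is the entry, state n-1 the exit (its row must be all zeros).
    """
    n = len(reliabilities)
    # Q(i, j) = R_i * P(i, j); the absorbing states C (correct) and F (failure) are left out.
    i_minus_q = [
        [(1.0 if i == j else 0.0) - reliabilities[i] * transitions[i][j] for j in range(n)]
        for i in range(n)
    ]
    # Column n-1 of S = (I - Q)^-1 is the solution s of (I - Q) s = e_{n-1}.
    unit = [0.0] * n
    unit[n - 1] = 1.0
    s_col = _solve(i_minus_q, unit)
    return s_col[0] * reliabilities[n - 1]
```

For a simple chain 0 → 1 → 2 with reliabilities 0.9, 0.8, and 0.5, the result is the product 0.36, as expected. When the transition matrix comes from profiling with added initial and final states, those two states can be given reliability 1.0 so that they do not affect the result.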
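The sensitivity analysis (slide 21) studies how system reliability reacts when one subsystem's failure rate varies. The talk does not state which sensitivity measure was used, so the sketch below assumes two simple ones: a backward finite-difference derivative of system reliability with respect to each component reliability, and a sweep over a component's failure probability. To stay self-contained, it swaps in fixed-point iteration on x_i = R_i · Σ_j P(i, j) · x_j (with x_exit = R_exit) instead of the matrix inverse of Cheung's formulation; for an absorbing model both give the same value. It does not use the PRISM model checker linked on the slide.

```python
def system_reliability(rel, trans, tol=1e-12, max_iter=100_000):
    """Cheung-style system reliability via fixed-point iteration.

    rel[i]: reliability of component i; trans[i][j]: transition probability.
    State 0 is the entry, state n-1 the exit. x[i] is the probability of
    reaching the exit from state i and completing it without failure.
    """
    n = len(rel)
    x = [0.0] * n
    for _ in range(max_iter):
        new = [rel[i] * sum(trans[i][j] * x[j] for j in range(n)) for i in range(n)]
        new[n - 1] = rel[n - 1]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new[0]
        x = new
    raise RuntimeError("iteration did not converge")


def sensitivities(rel, trans, h=1e-4):
    """Backward finite difference dR_sys/dR_k for every component k."""
    base = system_reliability(rel, trans)
    result = []
    for k in range(len(rel)):
        lowered = list(rel)
        lowered[k] = rel[k] - h  # make component k slightly less reliable
        result.append((base - system_reliability(lowered, trans)) / h)
    return result


def sweep(rel, trans, k, failure_probs):
    """System reliability while component k's failure probability varies."""
    out = []
    for p in failure_probs:
        varied = list(rel)
        varied[k] = 1.0 - p
        out.append(system_reliability(varied, trans))
    return out
```

A backward difference is used because component reliabilities of 1.0 cannot be increased further. Ranking components by these derivatives gives a simple way to identify a subsystem with the highest sensitivity, as reported for subsystem 1 in the speaker notes, although the study's actual procedure may differ.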
