Software Faults, Failures and Their Mitigations | Turing100@Persistent

39,363 views

Published on

Prof Kishor Trivedi, Duke University, USA talks about Software Faults, Failures and Their Mitigations at the 6th Session of Turing100@Persistent

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
39,363
On SlideShare
0
From Embeds
0
Number of Embeds
363
Actions
Shares
0
Downloads
75
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Software Faults, Failures and Their Mitigations | Turing100@Persistent

  1. 1. Persistent Systems January 5, 2013 Software Faults, Failures and Their Mitigations Kishor TrivediDuke High Availability Assurance Lab (DHAAL) Dept. of Electrical & Computer Engineering Duke University Durham, NC 27708 kst@ee.duke.edu www.ee.duke.edu/~kst Copyright © 2013 by K.S. Trivedi
  2. 2. Duke UniversityResearch Triangle Park(RTP) Duke UNC-CH NC state North USA Carolina 2 Copyright © 2013 by K.S. Trivedi
  3. 3. Duke University• U.S. News & World Report in its 2012 edition, – ranked the universitys undergraduate program 8th among all universities in USA Copyright © 2013 by K.S. Trivedi 3
  4. 4. Duke UniversityNCAA Men’s Basketball Champions 2010 (also 1998,1999 & 2001) 4 Copyright © 2013 by K.S. Trivedi
  5. 5. Trivedi’s research triangle Stochastic modeling methods & numerical Theory solution methods: Large Fault trees, Stochastic Petri Nets, Large/stiff Markov & non-Markov models Fluid stochastic models Performability & Markov reward models Software aging and rejuvenation Security, Survivability, ResilienceSoftware Books: quantification Machine LearningPackages Applications Blue, Red, Reliability/availability/performance Avionics (Boeing), Space (NASA/JPL), White HARP (NASA), SAVE (IBM), Power systems (GE), IRAP (Boeing) Automobile systems (GM) SHARPE, SPNP, SREPT Computer systems (EMC,SUN,HP,TCS) Telco systems (AT&T, Lucent, Avaya) Computer Networks (Motorola) Virtualized Data center (NEC) Cloud computing (IBM, NEC, Cisco) 5 Copyright © 2013 by K.S. Trivedi Aging & Rejuvenation (Huawei) Software
  6. 6. Probability and Statistics with Reliability, Queuing,and Computer Science Applications, Second edition, John Wiley, 2001 (Bluebook) [First edition by Prentice-Hall, 1982] TextbooksPerformance and Reliability Analysis of Computer Systems: An Example-Based Approach Using theSHARPE Software Package, Kluwer (now Springer), 1996 (Redbook) Queuing Networks and Markov Chains, John Wiley, second edition, 2006 (White book) 6 Copyright © 2013 by K.S. Trivedi
  7. 7. DHAAL & Industry A Success Story• Reliability Prediction of Boeing 787 Current Return Network for FAA Certification• Security Quantification (DARPA SITAR)/NSF• Survivability Quantification for Lucent POTS and Siemens Smartgrid• Cloud performance, availability, power with IBM and Cisco• Reliability/Availability Prediction of SIP protocol on IBM WebSphere• Software Aging and Rejuvenation: state of the art, theory, measurements, and implementation (IBM x-series); Huawei• Cloud computing security (Measurements, quantification, and implementation) - NATO Science for Peace and Security• NASA-JPL Failures data analytics• NEC collaboration for Performability Management (VMs allocation, VMs and VMM rejuvenation, etc) in Virtualized Data Center• Data analytics (statistical and machine learning techniques used to interpret huge volume of data) - WiPro Technologies• Software Reliability/Availability/Performability analysis – Short courses, seminars, and consulting - Tata consulting services7 Copyright © 2013 by K.S. Trivedi 7
  8. 8. Outline• Motivation• A Real System• Software Fault Classification• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  9. 9. Pervasive Dependence on Computer Systems Need for High Reliability/Availability CommunicationHealth & Medicine Avionics Banking Entertainmen t Copyright © 2013 by K.S. Trivedi
  10. 10. Basic Definitions• Steady-state availability (Ass) or just availability  Long-term probability that the system is available when requested: MTTF Ass = MTTF + MTTR  MTTF is the system mean time to failure, a complex combination of component MTTFs  MTTR is the system mean time to recovery - may consist of many phases Copyright © 2013 by K.S. Trivedi
  11. 11. Basic Definitions• Downtime in minutes per year (un)availability is usually presented in terms of annual downtime. – Downtime = 8760×60 ×(1- Ass) minutes. – 5 NINES (Ass = 0.99999)  5.26 minutes annual downtime Copyright © 2013 by K.S. Trivedi
  12. 12. Number of Nines– Reality Check• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week – Approx. 80 hours/year=4800 minutes/year – Ass=(8760-80)/8760=0.9908 – That is, between 2 NINES and 3 NINES!• This study assumes planned and unplanned downtime, together Copyright © 2013 by K.S. Trivedi
  13. 13. Achieving High Availability is a Challenge• Black Sept. 2011, In the same week!!!!: – Microsoft Cloud service outage (2.5 hours) – Google Docs service outage (1 hour) • A memory leak due to a software update• Sept. 2012 GoDaddy (4 hours) – 5 millions of websites affected• Oct. 2012 Amazon – 10/15/2012 Webservices – 6 hours (Memory leak) – 10/27/2012 EC2 – > 2 hours Copyright © 2013 by K.S. Trivedi
  14. 14. Downtown Costs per Hour • Brokerage operations $6,450,000 • Credit card authorization $2,600,000 • eBay (1 outage 22 hours) $225,000 • Amazon.com $180,000 • Package shipping services $150,000 • Home shopping channel $113,000 • Catalog sales center $90,000 • Airline reservation center $89,000 • Cellular service activation $41,000 • On-line network fees $25,000 • ATM service fees $14,000Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. ”...based on a survey done by Contingency Planning Research." Copyright © 2013 by K.S. Trivedi
  15. 15. High Reliability/Availability• Hardware fault tolerance, fault management, reliability/availability modeling/assurance relatively well developed• System outages more due to software faults Key Challenge:• Software reliability is one of the weakest links in system reliability/availabilityby K.S. Trivedi Copyright © 2013
  16. 16. Software is the problemim Gray’s paper titled “W do computers hytop and what can be done about it?” arted to pointed out this trend in 1985, followed by his paperA census of tandem system availability between 1985 and 1990” 2005 Across different industries…. 1985 Copyright © 2013 by K.S. Trivedi 16
  17. 17. Increasing SW Failure Rate? Planetary Missions Flight Software: A. Nikora of JPL T interval he between the first and last launch: 8.76 years. T interval between he successive launches ranges from: 23 to 790 days. Similar results for ground software Mars Pathfinder CASSINI Mars Mars Stardust Mars Genesis Mars Deep Mars Global Climate Polar Odyssey Exploration Impact ReconnaissanceSurveyor Orbiter Copyright © 2013 by K.S. Trivedi Lander Rover Orbiter Mission Name (in launch order)
  18. 18. High Reliability/Availability: Software is the problem• Fault avoidance – good software engineering practices – difficult for large/complex software systems – Impossible to fully test and verify if software is fault-free “Testing shows the presence, not the absence, of bugs” - E. W. Dijkstra• Yet there are stringent requirements for failure-free operation Copyright © 2013 by K.S. Trivedi
  19. 19. High Reliability/Availability: Software is the problem (2)Software fault tolerance is a potentialsolution to improve software reliability in lieu ofvirtually impossible fault-free software Copyright © 2013 by K.S. Trivedi
  20. 20. Software Fault Tolerance Classical TechniquesDesign diversity – N-version programming – Recovery blockExpensive  not used much in practice!Yet there are stringent requirements for failure-free operationChallenge: Affordable Software Fault Tolerance Copyright © 2013 by K.S. Trivedi
  21. 21. Outline• Motivation• A Real System• Software Fault Classification• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  22. 22. High availability SIP Application Server Configuration on IBM WebSphereP RDC 2008 andISSRE 2010papers Copyright © 2013 by K.S. Trivedi
  23. 23. High availability SIP Application Server configuration on WebSphereHardware configuration: – Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)Software configuration: – 2 copies of SIP/Proxy servers (1 sufficient for performance) – 12 copies of WAS (6 sufficient for performance) – Each WAS instance forms a redundancy pair (replication domain) with WAS installed on another node on a different chassis• The system has hardware redundancy and software redundancy Copyright © 2013 by K.S. Trivedi
  24. 24. High availability SIP Application Server configuration on WebSphereSoftware Fault Tolerance – Identical copies of SIP proxy used as backups (hot spares) – Identical copies of WebSphere Applications Server (WAS) used as backups (hot spares) – Type of software redundancy – (not design diversity) but replication of identical software copies – Normal recovery • restart software, reboot node or fail-over to a software replica; only when all else fails, a “software repair” is invoked Copyright © 2013 by K.S. Trivedi
  25. 25. Escalated levels of Recovery Single Process Restart (SPR) A real example The flowchart depicts the Avaya Servers and actions taken for recovery IF 3 SPR Media Gateways after a failure is detected.NO within 60 seconds Try the simplest recovery method first, then a more YES System Warm complex etc. Restart (SWR) IF 3NO YES IF 3 SCR NO SWR System Cold within within Restart (SCR) 15 min 15 min YES Avaya IF 3 YES Communication ACMSR OS Reboot Manager within Software reloads 15 min (ACMSR) Copyright © 2013 by K.S. Trivedi NO
  26. 26. Software Fault Tolerance: New ThinkingRetry, restart, reboot! – Known to help in dealing with hardware transients – Do they help in dealing with failures caused by software bugs? – If yes, why? Copyright © 2013 by K.S. Trivedi
  27. 27. A CartoonWhy is this true………at least for computers? Copyright © 2013 by K.S. Trivedi
  28. 28. Software Fault Tolerance: New ThinkingFailover to an identical software replica (that is not a diverse version) – Does it help? – If yes, why?Twenty years ago this would be considered crazy! Copyright © 2013 by K.S. Trivedi
  29. 29. Outline• Motivation• A Real System• Software Fault Classification – Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  30. 30. Software Faultsmain threats to high reliability, availability & safety Copyright © 2012 byby K.S. Trivedi Copyright © 2013 K.S. Trivedi
  31. 31. IFIP Working Group 10.4 (Laprie) • Failure occurs when the delivered service no longer complies with the desired output. • Error is that part of the system state which is liable to lead to subsequent failure. • Fault is adjudged or hypothesized cause of an error.Faults are the cause of errors that may lead to failures Fault Error Failure 31 Copyright © 2013 by K.S. Trivedi
  32. 32. Need to Classify bug types• We submit that a software fault tolerance approach based on retry, restart, reboot or fail- over to an identical software replica (not a diverse version) work because of a significant number of software failures are caused by Mandelbugs as opposed to the traditional software bugs now called Bohrbugs Copyright © 2013 by K.S. Trivedi
  33. 33. Need to Classify bug types• In recent years, researchers have reported the phenomenon of “software aging” (i.e., degraded performance and/or increased failure rate of long-running software systems).• Puzzle: How can performance and failure rate change if the software code is not modified?!⇒ Study software fault types and their relationships Copyright © 2013 by K.S. Trivedi
  34. 34. Jim Gray’s Definitions• The terms “Bohrbug” and “Heisenbug” were first used in print by Jim Gray in 1985.• “Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring.”• “Most production software faults are soft. If the program state is reinitialized and the failed operation is retried, J. Gray the operation will not fail a second time. … The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers.” (Gray, 1985) Copyright © 2013 by K.S. Trivedi
  35. 35. Bruce Lindsay’s Definition Based on Gray’s paper, researchers have often equated Heisenbugs with soft faults. However, when Bruce Lindsay originally coined the term in the 1960s (while working with Jim Gray), he had a more narrow definition in mind.• “Heisenbugs as originally defined … B. Lindsay, photo by T. Upton are bugs in which clearly the system behavior is incorrect, and when you try to look to see why it’s incorrect, the problem goes away.” (Lindsay, 2004)• The term alludes to the physicist Werner Heisenberg and his Uncertainty Principle. Copyright © 2013 by K.S. Trivedi
  36. 36. Heisenbug – Our Definition• Heisenbug := A fault that stops causing a failure or that manifests differently when one attempts to probe or isolate it.• How can probing affect the bug? 1. Some debuggers initialize unused memory to default values, thus preventing failures due to improper initialization. 2. Trying to investigate a failure can influence process scheduling in such a way that a scheduling- related failure does not occur again. Copyright © 2013 by K.S. Trivedi
  37. 37. A Classification of Software Faults• Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity. Example: A bug causing a failure whenever the user enters a negative date of birth Since they are easily found, Bohrbugs tend to be detected and fixed during the software testing phase. The term alludes to the physicist Niels Bohr and his rather simple atomic model. Copyright © 2013 by K.S. Trivedi
  38. 38. Mandelbug – Definition• Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by a it are not systematically reproducible.• Example: A bug whose activation is scheduling-dependent The residual faults in a thoroughly-tested piece of software are mainly Mandelbugs. The term alludes to the mathematician Benoît Mandelbrot and his research in fractal geometry. Copyright © 2013 by K.S. Trivedi
  39. 39. Mandelbug: “Complexity” (1) 39• The explanation of the possible sources of complexity is based on the “chain of threats” linking faults with errors and failures:• First source of complexity: Time lag between fault activation and failure occurrence, e.g., because several different error states have to be traversed in the error propagation.• Example: The result of an erroneous calculation may at first be kept in the system memory and cause a failure only later, when it is being accessed and used. Copyright © 2013 by K.S. Trivedi
  40. 40. Mandelbug: “Complexity” (2) 40• Second source of complexity: Fault activation and/or error propagation depend on interactions between conditions occurring inside the application and conditions that accrue within the system- internal environment of the application.• Example: A fault causing failures due to side-effects of other applications Copyright © 2013 by K.S. Trivedi
  41. 41. Mandelbugs: Consequences 41• Mandelbugs are difficult to detect and remove during the software testing phase.• An operation that failed due to a Mandelbug may execute correctly upon retry even if the fault has not been removed; changing the environment may suffice.• Potential recovery techniques: – “Microreboot” of individual components – Application restart – System reboot – Failover to a standby component (replicate) – Manual recovery Copyright © 2013 by K.S. Trivedi
  42. 42. Examples of Types of Bugs in IT System• Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.• The projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sector.42 Copyright © 2013 by K.S. Trivedi
  43. 43. Examples of Types of Bugs in IT System (cont.)• Exemple of Mandelbug in a large telecom system – Slow response times of the front end screens. – The problem was hard to analyze since the screens would freeze at random points in the day. • As the days went by the frequency of incidents of these freezes kept increasing.• Class of MandelBug encountered – The users would wait for some time and their operations would resume. – The IT operations team rebooted the servers and the operations could resume for a few hours.43 Copyright © 2013 by K.S. Trivedi
  44. 44. Examples of Types of Bugs in IT System (cont.)• Reason of the Problem – Whenever a front end screen was invoked, a temporary file was created at the centralized server. • This file was never cleaned up even after a screen was closed. – As a result, tens of thousands of small files kept accumulating on the disk causing sluggish behavior.• Solution of the problem – A cleanup utility was written to move these files periodically to another file system and later delete them.44 Copyright © 2013 by K.S. Trivedi
  45. 45. Examples of Types of Bugs in IT System (cont.)• Exemple of Mandelbug in a government tax information system. – All organizations should submit the income tax deducted at source (TDS) records for all of their employees. – Sporadically when a large corporation uploaded its file, every once in a while the application server would crash.• Class of MandelBug encountered – The IT operations staff would then increase the JVM heap size and restart the JVM, which would allow the file to be uploaded without any problem.45 Copyright © 2013 by K.S. Trivedi
  46. 46. Examples of Types of Bugs in IT System (cont.)• Reason of the Problem – The probability of a failure occurrence increased after each JVM restart, as the heap got consumed more and more.• Solution of the problem – Reconfigurate system parameters to resume operations successfully.46 Copyright © 2013 by K.S. Trivedi
  47. 47. Aging-related Bug – Definition • Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance. Example:  A bug causing memory leaks in the application Note that the aging phenomenon requires a delay between fault activation and failure occurrence. Note also that the software appears to age due to such a bug; there is no physical deterioration Copyright © 2013 by K.S. Trivedi
  48. 48. Relationships Bohrbug and Mandelbug are complementary antonyms. Aging-related bugs are a subtype of Mandelbugs Mandelbugs Aging Related Bugs Aging-Related Bugs - Bohrbugs Copyright © 2013 by K.S. Trivedi
  49. 49. Important Questions about these Bugs• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs – How do these fractions vary • over time • over projects, languages, application types,… – Need Measurements – Current NASA/JPL Project with Allen Nikora & Michael Grottke; preliminary results from one NASA software project: • 52% Bohrbugs • 35% Mandelbugs (non-aging-related) • 4% Aging-related bugs • 7% Operator related • 2% Unclassified – Very similar results for Linux, MySQL, Apache AXIS, httpd• What are the methods of mitigation for the different fault types Copyright © 2013 by K.S. Trivedi
  50. 50. Trends in SW Fault Type Proportions Planetary Missions Flight Software• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions analyzed)• Result: The proportion of Bohrbugs seems to settle at around the same value. Such a convergence to similar values is less obvious for the other fault types. Copyright © 2013 by K.S. Trivedi
  51. 51. Outline• Motivation• A Real System• Software Fault Classification• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  52. 52. Environmental diversityA new thinking to deal with software faults and failures Copyright © 2012 byby K.S. Trivedi Copyright © 2013 K.S. Trivedi
  53. 53. Software Fault Tolerance: New ThinkingNew thinking: Environmental Diversity as opposed to Design DiversityOur claim is that this works since failures due to Mandelbugs are not negligible, we have an affordable software fault tolerance technique that we callEnvironmental Diversity Copyright © 2013 by K.S. Trivedi
  54. 54. What is environmental diversity?• The underlying idea of Environmental diversity – Retry a previously faulty operation and it works – Why? – because of the environment where the operation was executed has changed enough to avoid the fault activation.• The environment is understood as – OS resources, other applications running concurrently and sharing the same resources, interleaving of operations, concurrency, or synchronization. Copyright © 2013 by K.S. Trivedi
  55. 55. What is environmental diversity? • The execution of an application depends on the environment Restart the applicationAp1 Ap2 Ap3 Ap4 Ap4 Ap6 Ap3 Ap5 Operating System Operating System Hardware HardwareEnvironment at time t1 Environment at time t1+n Copyright © 2013 by K.S. Trivedi
  56. 56. Environmental Diversity• Restart an application, reboot a node or failover to an identical standby replica work because of the environmental diversity that will be underlying these actions; – By environment here we mean the resources of the OS, other applications running concurrently and sharing system resources, interleaving of operations, concurrency, synchronization etc.• Environmental Diversity uses time redundancy over expensive design diversity • [Adams] Restart • [Jalote et al.] Rollback, rollforward • [Patriot] Occasional reboot, “switch off and on” • [Avaya Swift] restart process; failover to a replica • [IBM SIP] escalated levels: restart, reboot, failover… • [IBM Director-X-series] Rejuvenation Copyright © 2013 by K.S. Trivedi
  57. 57. Outline• Motivation• A Real System• Software Fault Classification• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  58. 58. Methods of Mitigation Copyright © 2012 byby K.S. Trivedi Copyright © 2013 K.S. Trivedi
  59. 59. MitigationCopyright © 2013 by K.S. Trivedi
  60. 60. Bohrbugs: Remove Find and fix the bugs during testing Failure data collected during testing Calibrate a software reliability growth model (SRGM) using failure data; this model is then used for prediction Many SRGMs exist (JM,NHPP,HGRGM, etc.)  Books by Lyu, Musa, Cai  Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals of Software Engineering, 1999 Measurements  Empirical (statistical) models Copyright © 2013 by K.S. Trivedi
  61. 61. MitigationCopyright © 2013 by K.S. Trivedi
  62. 62. OS Availability Model (IBM BladeCenter) Fix (Failed due to a Bohrbug) Reboot (Failure due to a Mandelbug) Copyright © 2013 by K.S. Trivedi
  63. 63. Availability model of a Proxy or a WAS (IBM SIP on websphere) • Failure detection – By WLM – By Node Agent – Manual detection • Recovery – Node Agent • Auto process restart – Manual recovery • Process restart • Node reboot • Repair Application server and proxy server Copyright © 2013 by K.S. Trivedi
  64. 64. Outline• Motivation• A Real System• Software Fault Classification• Environmental Diversity• Methods of Mitigation• Software Aging and Rejuvenation• Conclusions Copyright © 2013 by K.S. Trivedi
  65. 65. Aging Related Bugs: Replicate, Restart, Reboot, Rejuvenate Copyright © 2012 byby K.S. Trivedi Copyright © 2013 K.S. Trivedi
  66. 66. Software AgingAging phenomenon Error conditions accumulating over time Performance degradation, system failure Performance degradation, system failureMain causes of Software AgingMemory leak, fragmentation, Unterminated threads, Data corruption, Round-off errors, Unreleased file-locks, etcObserved system OS, Middle-ware, Netscape, Internet Explorer etc Copyright © 2013 by K.S. Trivedi
  67. 67. Software Aging - Definition“Software Aging” phenomenon Long-running software tends to show an increasing failure rate. Not related to application program becoming obsolete due to changing requirements/maintenance. Software appears to age; no real deterioration Copyright © 2013 by K.S. Trivedi
  68. 68. Software Aging - Examples• Cisco Catalyst Switch [Matias Jr.]• File system aging [Smith & Seltzer]• Gradual service degradation in the AT&T transaction processing system [Avritzer et al.]• Error accumulation in Patriot missile system’s software [Marshall]• Resources exhaustion in Apache [Li et al., Grottke et al.]• Physical memory degradation in a SOAP-based Server [Silva et al.]• Software aging in Linux [Cotroneo et al.]• Crash/hang failures in general purpose applications after a long runtime Copyright © 2013 by K.S. Trivedi
  69. 69. Measurements Showing Resource Exhaustion or Depletion Real Memory Free File Table SizeAMethodology for Detection and Estimation of Software Aging,S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi.Pro c. o f IEEE I Symp. o n So ftware Re liability Eng ine e ring , Nov. 1998. ntl. Copyright © 2013 by K.S. Trivedi
  70. 70. Software Fault Types & Their Mitigation Copyright © 2013 by K.S. Trivedi
  71. 71. Software rejuvenationSoftware rejuvenation is a cost effective solution forimproving software reliability by avoiding/postponingunanticipated software failures/crashes.It allows proactive recovery to be carried eitherautomatically or at the discretion of theuser/administratorRejuvenation of the environment, not of software Copyright © 2013 by K.S. Trivedi
  72. 72. Software RejuvenationCounteracts the software aging phenomenon Frees up OS resources; Removes error accumulationCommon techniques for cleaning Garbage collection, defragmentation, flushing kernel and file server tables etc.Challenge: Rejuvenation scheduling/granularity Copyright © 2013 by K.S. Trivedi
  73. 73. SW Rejuvenation: The Genesis“Software Rejuvenation: Analysis, Module and Applications”, Y. Huang, C. Kintala, N. Koletis, N. Fulton, in FTCS 1995 An insight into operational software, that no-one had before (at least, formally). It changed • How practitioners looked at making software more dependable • Windfall of performance and dependability modeling problems for academicians • Ideas to build better, real-world systems as Internet evolved • Led to recognition of “software aging” phenomenon • Brought about Phds, tenureships, publications, patents, awards, tools, systems, funding for many many people around the world. Copyright © 2013 by K.S. Trivedi
  74. 74. Software Rejuvenation ExamplesAT&T billing applications [Huang et al.]Patriot missile system software - switch off/on every 8 hours [Marshall]On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]Microsoft IIS 5.0 process recycling toolProcess restart in Apache [Li et al.]ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS reports]For more examples: "Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi, Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4rd International Workshop on Software Aging and Rejuvenation (WoSAR 2012) . Held in conjunction with The 23nd annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas, USA, 2012. Copyright © 2013 by K.S. Trivedi
  75. 75. Software Rejuvenation –Trade-off• Advantages – Reduces costs of sudden aging-related failures – Can be applied at the discretion of the user/administrator• Disadvantages – Direct costs of carrying out rejuvenation – Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions etc) Important research issue: Find optimal times to perform rejuvenation! Copyright © 2013 by K.S. Trivedi
  76. 76. Software rejuvenation - ApproachesTwo approaches based on WHEN: Time-Based rejuvenation approaches Rejuvenation applied regularly and at predetermined time intervals. Widely used in real environments – Web servers (Apache) – ISS two-months reboot – Telecommunication systems Copyright © 2013 by K.S. Trivedi
  77. 77. Software rejuvenation - ApproachesTwo approaches based on WHEN: Measurement (Inspection)-based rejuvenation approaches • Threshold based or predictive • System metrics continually monitored • Rejuvenation triggered when the crash is imminent based on the observation/prediction • Reduce potentially useless rejuvenation actions and downtime in the process [Silva et al.] Copyright © 2013 by K.S. Trivedi
  78. 78. Software rejuvenation Copyright © 2013 by K.S. Trivedi
  79. 79. Software Rejuvenation – ApproachesTwo approaches based on HOW: Use analytical model to optimize rejuvenation schedule • Lucent Bell Labs [Huang et al., ‘95] • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05] • Others [IPDS’98, PNPM’99] Copyright © 2013 by K.S. Trivedi
  80. 80. Software Rejuvenation – ApproachesTwo approaches based on HOW: Use measurements of resource degradation to determine/predict optimal rejuvenation schedule • Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05] • Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10] Copyright © 2013 by K.S. Trivedi
  81. 81. Failure ratePreventive maintenance is useful only if failure rate is increasingIf the time to failure distribution is exponential then failure rate is ConstantNeed to assume (and establish) that TTF is IFR Copyright © 2013 by K.S. Trivedi
  82. 82. Analytic ModelsSingle node models – CTMC model – SMP modelCluster systems – IBM Cluster model (Time-based, condition-based) Copyright © 2013 by K.S. Trivedi
  83. 83. Analytic Models Software Aging and RejuvenationA simple and useful model of increasing failure rate: Failure probable Robust state Failed state state Time to failure: Hypo-exponential distribution Increasing failure rate aging Copyright © 2013 by K.S. Trivedi
  84. 84. Analytic Models CTMC model [Huang95] Failed state Robust state Sf S0 r1 r3 r1 Failed state r2 λ Rejuvenation Failure probable state Robust state r4 Sf λ Sp Sr state Sp S0 r2 Failure probable state Model w/o rejuvenation Model with rejuvenationFrom this Continuous-time Markov chain modelCan find closed-form expression for the optimal rejuvenation trigger rate (r4) Copyright © 2013 by K.S. Trivedi
  85. 85. Analytic Models Semi-Markov model [Dohi00]Relax the assumption of exponentially distributed sojourn times (time- independent transition rates)Hence have a semi Markov model 0 Completion of Completion of Repair Rejuvenation State change 2 1 3Can find closed-form expression for the optimal (deterministic) time to System Failure Rejuvenation rejuvenation trigger Copyright © 2013 by K.S. Trivedi
  86. 86. Rejuvenation in Cluster Systems Cluster System[Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itselfEasier scaling to larger systems, high levels of availability/performance and low management costs Copyright © 2013 by K.S. Trivedi
  87. 87. Rejuvenation for Cluster Systems MotivationRejuvenation using the fail-over mechanismsLong-terms benefits in terms of availability/performanceContinuous operation (possibly at a degraded level) Practically zero downtime Copyright © 2013 by K.S. Trivedi
  88. 88. Rejuvenation for Cluster Systems MotivationLess disruptive and lower overhead than unplanned outagesTransparent to user/applicationMost current industry initiatives reactiveTwo approaches Simple time-based (periodic) Condition-based (only from the “failure-impending” state Copyright © 2013 by K.S. Trivedi
  89. 89. Rejuvenation for Cluster Systems SRN ModelsRejuvenation using the fail-over mechanisms in a rolling fashionModeling using SRNs (Stochastic Reward Nets)Analysis for 2 rejuvenation policies Simple time-based policy • All nodes rejuvenated successively at the end of each rejuvenation interval Condition-based policy • Nodes rejuvenated only from the “failure-probable” state Copyright © 2013 by K.S. Trivedi
  90. 90. SRN ModelBasic Cluster Model Copyright © 2013 by K.S. Trivedi
  91. 91. SRN ModelSimple Time-Based Rejuvenation Copyright © 2013 by K.S. Trivedi
  92. 92. Model ParametersTransition Mean timeTfprob 240 hoursTnodefail 720 hoursTnoderepair 30 minsTsysrepair 4 hoursTrejuv 10 minscostnodefail $5000/hourcostnoderejuv $250/hour Copyright © 2013 by K.S. Trivedi
  93. 93. Model Measures Measures ComputedUnavailability (#Psysfail == 1) ? 1 : 0Cost #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail Copyright © 2013 by K.S. Trivedi
  94. 94. Results Simple Time-Based Rejuvenation 8/ configuration 1 8/ configuration 2As rejuv. int. increases, rejuvenation is performed less frequentlyWhen rejuv int is close to zero, the system is rejuvenating very frequently resulting in high cost/downtimeWhen rejuv. int. goes beyond optimal value, system failures become frequentresulting in high cost/downtime Copyright © 2013 by K.S. Trivedi
  95. 95. Measuring Performance VariablesObjective Detection and validation of agingPeriodically monitor and collect data on the attributes responsible for the “health” of the systemQuantify the effect of aging on system resources Proposed metric – Estimated time to exhaustion Proposed metric – Evaluation function using PCA approach Copyright © 2013 by K.S. Trivedi
  96. 96. Measuring Performance VariablesApproaches Time-based (workload-independent) estimation [Garg98] Workload-based estimation [Vaidyanathan99] ARMA/ARX models [Li02] ALT and ADT techniques [Matias06] Non-parametric Algorithms [Dohi00] Non-linear models [Hoffman07] Principal component Analysis (PCA) and System identification [Jia] Pattern recognition [Vaidyanathan & Gross] Threshold-based approaches [Silva09] Machine Learning Approaches [Alonso10] Copyright © 2013 by K.S. Trivedi
  97. 97. Data Collection Experimental Setup 97 SNMP-based resource monitoring tool: Data related to OS resource usage (memory, process table, file table etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts etc. ) collected for over 3 months at 10 min intervalsCopyright © 2013 by K.S. Trivedi
  98. 98. Time Plots Non-parametric Regression Smoothing 98Real Memory Free File Table Size Trend detection: Seasonal Kendall test for trend Copyright © 2013 by K.S. Trivedi
  99. 99. IBM xSeries Software Rejuvenation Agent (SRA)Implemented in a high-availability clustered environmentMonitors consumable resources, estimate time to exhaustion and generates alerts if within user notification horizon Copyright © 2013 by K.S. Trivedi
  100. 100. IBM xSeries Software Rejuvenation Agent (SRA)IBM Director system management tool – Provides GUI to configure SRA – Acts upon alertsTwo versions – Periodic rejuvenation – Prediction-based rejuvenation Copyright © 2013 by K.S. Trivedi
  101. 101. SummaryIt is possible to enhance software availability during operation exploiting environmental diversityMultiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and if all else fails repair (patch) Copyright © 2013 by K.S. Trivedi
  102. 102. SummarySoftware aging not anecdotal – real life scientific phenomenonRejuvenation implemented in several special purpose applications and many general purpose cluster systems Copyright © 2013 by K.S. Trivedi
  103. 103. Key References• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, In Proc. FTCS-25, June 1995.• A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998.• Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS 1999.• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, March 2001.• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE- TDSC, April-June 2005.• Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, IEEE Trans. Reliability, Sept. 2006. Copyright © 2013 by K.S. Trivedi
  104. 104. Key References• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, Feb. 2007.• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, MATIAS JR, R., TRIVEDI, K., Maciel, P. , ISSRE, 2010.• Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias, Jr., K. S. Trivedi and Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010.• An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN, 2010.• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade. International Journal of System Assurance Engineering and Management, 2011.• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim, M. Grottke, M. Nambiar , Proc. PRDC 2011• O. Kyas. (2001). Network Troubleshooting, Palo Alto California, Agilent Technologies (book)• M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE 1996• R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based Telecommunications System, ISSRE 1992 Copyright © 2013 by K.S. Trivedi

×