Software Design For Reliability (DfR) Seminar

   An Overview of
 Software Reliability




                  Bob Mueller
              bobm@opsalacarte.com
               www.opsalacarte.com
Software Quality
       and
Software Reliability

Related Disciplines,
 Yet Very Different
Definition of Software Quality

*ISO 9126 Quality Model — Factors and their Criteria:

   Functionality       suitability, accuracy, interoperability, security

   Usability           understandability, learnability, operability,
                       attractiveness

   Reliability         maturity, fault tolerance, recoverability

   Efficiency          time behavior, resource utilization

   Maintainability     analysability, changeability, stability, testability

   Portability         adaptability, installability, co-existence,
                       replaceability

Software Quality: The level to which the software characteristics
conform to all the specifications.

(v0.5)                                          Ops A La Carte ©                                             3
Most Common Misconception

*ISO 9126 Quality Model (Reliability is only one of the six quality factors
shown on the previous slide)

What organizations believe they are doing
------------------
"We have a strong SW quality program. We don't need to add SW reliability
practices."

What the organizations are really doing
------------------
Implementing only a sparse set of SW quality practices.

What is missing
---------------
Implementing sufficient SW reliability practices to satisfy customer
expectations.
Software Design For Reliability (DfR) Seminar




   Background on
 Software Reliability
Software Reliability Can Be Measured

 Software Reliability is 20 years behind HW reliability
 ◈ Ramifications of failure
      Education on the consumer side
      Many consumers simply expect unreliable SW
 ◈ Education on the manufacturer's side
      Manufacturers don't know the new, innovative methods
      Manufacturers don't determine how users will actually use the product
 ◈ Software engineers are more free-spirited than HW engineers
 ◈ Entry cost for a SW development team is lower than for HW
Reliability vs. Cost

[Chart: COST vs. RELIABILITY — the TOTAL COST CURVE is the sum of the
rising RELIABILITY PROGRAM COSTS and the falling HW WARRANTY COSTS, with
a minimum at the OPTIMUM COST POINT.]

The SW impact on HW warranty costs is minimal at best.
Reliability vs. Cost,        continued

◈ SW has no associated manufacturing costs, so warranty costs and
     savings are almost entirely allocated to HW
◈ If there are no cost savings associated with improving software
     reliability, why not leave it as is and focus on improving HW
     reliability to save money?
          One study found that the root causes of typical embedded
           system failures were SW, not HW, by a ratio of 10:1.
          Customers buy systems, not just HW.
◈ The benefits of a SW Reliability Program are not direct cost
     savings, but rather:
          Increased SW/FW staff availability and reduced operational
           schedules, resulting from a smaller corrective maintenance
           load.
          Increased customer goodwill based on improved customer
           satisfaction.
Defining
Software Reliability
Software Reliability Definitions

                        The customer perception of
         the software’s ability to deliver the expected functionality
                         in the target environment
                                without failing.

 ◈ Examine the key points

 ◈ Practical rewording of the definition


                     Software reliability is a measure of
           the software failures that are visible to a customer and
          prevent a system from delivering essential functionality.


Software Reliability Can Be Measured

 ◈ Measurements are a required foundation
          Differs from quality, which is not defined by measurements
          All measurements and metrics are based on run-time failures


 ◈ Only customer-visible failures are targeted
          Only defects that produce customer-visible failures affect reliability
          Corollaries
              ◘ Defects that do not trigger run-time failures do NOT affect reliability
                 – badly formatted or commented code
                 – defects in dead code
              ◘ Not all defects that are triggered at run-time produce customer-visible
                 failures
                   –   corruption of any unused region of memory


 ◈ SW Reliability evolved from HW Reliability
          SW Reliability focuses only on design reliability
          HW Reliability has no counterpart to this

Software Reliability Is Based On Usage

 ◈ SW failure characteristics are derived from the usage profile of a
         particular customer or set of customers
          Each usage profile triggers a different set of run-time SW faults and failures


 ◈ Example
          Examine product usage by 2 different customers
              ◘ Customer A’s usage profile only exercises the sections of SW that produce
                very few failures.
              ◘ Customer B’s usage profile overlaps with Customer A’s usage profile, but
                additionally exercises other sections of SW that produce many, frequent
                failures.
          Customer assessment of the product’s software reliability
              ◘ Customer A’s assessment - the SW reliability is high
              ◘ Customer B’s assessment - the SW reliability is low
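
The two-customer example above can be sketched as a small simulation. This is a hypothetical illustration, not from the slides: the feature names and per-operation failure probabilities below are invented for the sketch.

```python
import random

# Invented per-operation failure probabilities for three SW sections.
FAILURE_PROB = {"core": 0.001, "reporting": 0.002, "import": 0.08}

def observed_failure_rate(usage_profile, operations=10_000, seed=42):
    """Estimate the failure rate seen by a customer whose usage profile
    maps each feature to the fraction of operations that exercise it."""
    rng = random.Random(seed)
    features = list(usage_profile)
    weights = [usage_profile[f] for f in features]
    failures = 0
    for _ in range(operations):
        feature = rng.choices(features, weights)[0]
        if rng.random() < FAILURE_PROB[feature]:
            failures += 1
    return failures / operations

# Customer A exercises only the stable sections of the SW.
rate_a = observed_failure_rate({"core": 0.9, "reporting": 0.1})
# Customer B overlaps with A but also exercises the failure-prone import path.
rate_b = observed_failure_rate({"core": 0.5, "reporting": 0.1, "import": 0.4})
assert rate_b > rate_a  # B perceives the same product as far less reliable
```

The same code base yields two different reliability assessments purely because the usage profiles trigger different sets of run-time faults.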




Reliability ≠ Correctness

 ◈ Correctness is a measure of the degree of intended functionality
         implemented by the SW
          Correctness measures the completeness of requirements and the accuracy of
           defining a SW model based on these requirements



 ◈ Reliability is a measure of the behavior (i.e., failures) that
         prevents the software from delivering the implemented
         functionality




Defects,
Faults,
  and
Failures
Terminology

 ◈ Defect
           A flaw in the requirements, design, or source code that produces
            implementation logic that will trigger a fault

             ◘ Defects of omission
                – Not all requirements were used in creating a design model
                – The design satisfies all requirements but is incomplete
                – The source code did not implement all the design
                – The source code has missing or incomplete logic
             ◘ Defects of commission
                – Incorrect requirements are specified
                – Requirements are incorrectly translated into a design model
                – The design is incorrectly translated into source code
                – The source code logic is flawed
          Defects are static and can be detected and removed without
           executing the source code
          Defects that cannot trigger a SW failure are not tracked or measured
             ◘ Ex: quality defects, such as test case and soft maintenance
                defects, and defects in “dead code”


Terminology (continued)

 ◈ Fault
          The result of triggering a SW defect by executing the associated
           implementation logic
              ◘ Faults are NOT always visible to the customer
              ◘ A fault can be the transitional state that results in a failure
              ◘ Trivially simple defects (e.g., display spelling errors) do not
                have intermediate fault states

 ◈ Failure
          A customer (or operational system) observation or detection that
           is perceived as an unacceptable departure of operation from the
           designed SW behavior
              ◘ Failures MUST be observable by the customer or an
                operational system
              ◘ Failures are the visible, run-time symptoms of faults
              ◘ Not all failures result in system outages

[Diagram: Defect → Fault → Failure progression]
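
The defect → fault → failure chain can be illustrated with a minimal sketch; the `average` and `show_dashboard` functions below are invented for this illustration.

```python
def average(values):
    # DEFECT: the guard should also reject an empty list, but the flawed
    # logic only checks for None -- a static flaw in the source code.
    if values is None:
        return 0.0
    return sum(values) / len(values)   # FAULT triggered when values == []

def show_dashboard(values):
    try:
        return f"avg={average(values):.2f}"
    except ZeroDivisionError:
        # The fault surfaces here; without this handler it would become a
        # customer-visible FAILURE (a crash on the dashboard).
        return "avg=n/a"

# The defect exists in every build, but this path never triggers a fault:
assert show_dashboard([2, 4]) == "avg=3.00"
# Executing the flawed logic triggers the fault; the handler keeps it
# from becoming a visible failure:
assert show_dashboard([]) == "avg=n/a"
```

The defect is static (detectable by inspection without running the code), the fault occurs only at run time, and whether the customer ever sees a failure depends on how the fault is handled.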



Basic Failure Classification

 ◈ High-level SW failure classification based on complexity
         and time-sensitivity of triggering the associated defect:
           Bohr Bugs
           Heisen Bugs
           Aging Bugs

 ◈ Bohr Bugs
          Named after the “Bohr” atom
              ◘ Connotation: Deterministic failures that are straightforward to isolate
          Failures are easily reproducible, even after a system restart/reboot
          Most frequent failure category detected during development, testing and early
            deployment
          These are considered “trivial” defects since every execution of the associated
            logic results in a failure




Basic Failure Classification (continued)

 ◈ Heisen Bugs
          Named after the Heisenberg uncertainty principle
              ◘ Connotation: Failures that are difficult to isolate to a root cause
          Intermittent failures that are rarely triggered and difficult to reproduce
          Unlikely to reoccur following a system restart/reboot
          Common root causes:
              ◘ Synchronization boundaries between SW components
              ◘ Improper or insufficient exception handling
              ◘ Interdependent timing of multiple events
          Rarely detected when the SW is not mature (i.e., during early development and
            testing phases)
          The best methods to deal with these “tough” defects are by
              ◘ Identification using SW failure analysis
              ◘ Impact mitigation using fault tolerant code
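
One common Heisenbug root cause, a synchronization gap between SW components, and its fault-tolerant mitigation can be sketched as follows. This is a minimal illustration, not from the slides.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Heisenbug root cause: the read-modify-write is not atomic
        # (load, add, store), so failures depend on thread interleaving
        # and are rarely reproducible.
        self.value += 1

    def increment_safe(self):
        with self._lock:          # mitigation: serialize the update
            self.value += 1

def run(increment, threads=8, per_thread=10_000):
    c = Counter()
    workers = [threading.Thread(
        target=lambda: [increment(c) for _ in range(per_thread)])
        for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return c.value

# The synchronized version is deterministic regardless of interleaving.
assert run(Counter.increment_safe) == 80_000
```

The unsafe variant may or may not lose updates on a given run, which is exactly why such defects evade reproduction and system testing.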




Basic Failure Classification (continued)

 ◈ Aging Bugs
          Attributed to the results of continuous, long-term operations or use
              ◘ Connotation: Failures resulting from accumulation of erroneous conditions
          Transient failures occur after extended run-time or functional cycles where the
           contributing faults have occurred numerous times
          Preceding faults may lead to system performance degradation before a failure
           occurs
          Extremely unlikely to reoccur following a system restart/reboot due to the
           longevity requirement
          Common root causes:
              ◘ Deterioration in the availability of OS resources (e.g., depletion of device
                  handles, memory leaks, heap fragmentation)
              ◘   Data corruption
              ◘   Application race conditions
              ◘   Accumulation of numerical round-off errors
              ◘   Gradual data accumulation for sampling or queue build-up
          The best methods to deal with these “tough” defects are by
              ◘ Identification using SW failure analysis
              ◘ Impact mitigation using fault tolerant code
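
A minimal sketch of an aging bug, using an invented sampling queue that is appended to on every cycle but never drained (the gradual data-accumulation root cause from the list above):

```python
from collections import deque

class UnboundedSampler:
    def __init__(self):
        self.samples = []
    def record(self, value):
        self.samples.append(value)          # AGING DEFECT: no retention limit

class BoundedSampler:
    def __init__(self, limit=1_000):
        self.samples = deque(maxlen=limit)  # mitigation: bounded retention
    def record(self, value):
        self.samples.append(value)

old, new = UnboundedSampler(), BoundedSampler(limit=1_000)
for cycle in range(100_000):   # extended run time exposes the accumulation
    old.record(cycle)
    new.record(cycle)

assert len(old.samples) == 100_000   # footprint grows with uptime
assert len(new.samples) == 1_000     # stable footprint, no restart needed
```

Each individual append is harmless, which is why the defect escapes short test cycles; only long-term operation degrades and eventually fails the system.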
What Is
  Reliable
Software ??
Reliable Software Characteristics

 ◈ Operates within the reliability specification that satisfies customer
         expectations
           Measured in terms of failure rate and availability level
           The goal is rarely “defect free” or “ultra-high reliability”



 ◈ “Gracefully” handles erroneous inputs from users, other systems,
         and transient hardware faults
           Attempts to prevent state or output data corruption from “erroneous” inputs



 ◈ Quickly detects, reports and recovers from SW and transient HW
         faults
            SW makes the system behave as continuously monitoring, “self-diagnosing,”
             and “self-healing”
           Prevents as many run-time faults as possible from becoming system-level
            failures
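
A minimal sketch of “graceful” erroneous-input handling, using an invented `Thermostat` example: validate before use, report the fault, and preserve the last known-good state instead of corrupting it.

```python
import logging

class Thermostat:
    VALID_RANGE = (5.0, 35.0)      # assumed domain limits for the sketch

    def __init__(self):
        self.setpoint = 20.0       # last known-good state

    def set_target(self, raw):
        try:
            value = float(raw)
        except (TypeError, ValueError):
            logging.warning("rejected non-numeric setpoint: %r", raw)
            return self.setpoint   # recover: keep prior good state
        lo, hi = self.VALID_RANGE
        if not (lo <= value <= hi):
            logging.warning("rejected out-of-range setpoint: %s", value)
            return self.setpoint
        self.setpoint = value
        return self.setpoint

t = Thermostat()
assert t.set_target("22.5") == 22.5   # valid input accepted
assert t.set_target("oops") == 22.5   # erroneous input: state preserved
assert t.set_target(999) == 22.5      # transient bad reading: state preserved
```

The fault is detected, reported, and recovered from without ever becoming a customer-visible failure.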



Common Paths to Software Reliability

 ◈ Traditional SW Reliability Programs - Predictions
          Program directed by a separate team of reliability engineers
          Development process viewed as a SW-generating, black box
              ◘ Develop prediction models to estimate the number of faults in the SW
          Reliability techniques used to identify defects and produce SW reliability metrics
              ◘ Traditional HW failure analysis techniques, e.g., FMEAs or FTAs
              ◘ Defect estimation and tracking

 ◈ SW Process Control
          Based on the assumption of a correlation between development process
            maturity and latent defect density in the final SW
             ◘ Ex: CMM Level 3 organizations can develop SW with 3.5 defects/KSLOC
          If the current process level does not yield the desired SW reliability, audits and
            stricter process controls are implemented


 ◈ Quality Through SW Testing
          Most prevalent approach for implementing SW reliability
          Assumes reliability is increased by expanding the types of system tests (e.g.,
            integration, performance and loading) and increasing the duration of testing
          Measured by counting and classifying defects
Common Paths to Software Reliability                                             (continued)


 ◈ These approaches generally do not provide a complete solution
          Reliability prediction models are not well-understood
          SW engineers find it difficult to apply HW failure analysis techniques to detailed
           SW designs
          Only 20% of the SW defects identified by quality processes during development
           (e.g., code inspections) affect reliability
          System testing is an inefficient mechanism for finding run-time failures
              ◘ Generally identifies no more than 50% of run-time failures
          Quality processes for tracking defects do not produce SW reliability information
           such as defect density and failure rates



 ◈ Net Effect:
          SW engineers still end up spending more than 50% of their time debugging,
           instead of focusing on designing or implementing source code




Design for Reliability
       (DfR)
Software Defect Distributions


   Average distribution of SW defects by lifecycle phase:
          20%     Requirements
          30%     Design
          35%     Coding
          10%     Bad Defect Fixes (introduction of secondary defects)
          5%      Customer Documentation



   Average distribution of SW defects at the time of field deployment:
         (based on 1st year field defect report data)
          1%      Severity 1 (catastrophic)
          20%     Severity 2 (major)
          35%     Severity 3 (minor)
          44%     Severity 4 (annoyance)




Typical Defect Tracking (System Test)

  System Test    Severity #1   Severity #2   Severity #3   Severity #4   Total Defects
     Build         Defects       Defects       Defects       Defects        Found
                    Found         Found         Found         Found

  SysBuild-01         7             9             16            22            54
  SysBuild-02         5             5             14            26            50
  SysBuild-03         4             6              8            16            34
      •               •             •              •             •             •
      •               •             •              •             •             •
  SysBuild-07         0             1              4             6            11
Defect Origin and Discovery

  Typical Behavior

    Defect Origin:     Requirements | Design | Coding | Testing | Maintenance
    Defect Discovery:  Requirements | Design | Coding | Testing | Maintenance
                       (discovery clustered late, in Testing and
                       Maintenance — Surprise!)

  Goal of Best Practices on Defect Discovery

    Defect Origin:     Requirements | Design | Coding | Testing | Maintenance
    Defect Discovery:  Requirements | Design | Coding | Testing | Maintenance
                       (discovery aligned with the phase of origin)
Defect Removal Efficiencies

 ◈ Defect removal efficiency is a key reliability measure

      Removal efficiency = Defects found / Defects present

 ◈ "Defects present" is the critical parameter; it is estimated from
   inspections, testing, and field data

      Requirements → Design → Coding                : Inspection Efficiency
      Unit Testing → System & Subsystem Testing     : Testing Efficiency
      Through Field Deployment                      : Overall Efficiency

 Example:

      Origin                        Defects Found   Metric                  Removal Efficiency

      Inspections                        90         Inspection Efficiency    43% = (90 / 210)
      Unit Testing                       25         Testing Efficiency       38% = (80 / 210)
      System & Subsystem Testing         55         Overall Efficiency       81% = (170 / 210)
      Field Deployment                   40
      TOTAL                             210
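
The example's removal efficiencies can be recomputed directly. In this sketch, "defects present" is approximated by the 210 total defects found across all phases, as in the slide.

```python
# Defect counts from the example above.
defects_found = {
    "inspections": 90,
    "unit_testing": 25,
    "system_testing": 55,
    "field": 40,
}
present = sum(defects_found.values())                      # 210

def efficiency(found):
    """Removal efficiency as a rounded percentage."""
    return round(100 * found / present)

inspection = efficiency(defects_found["inspections"])      # 43%
testing = efficiency(defects_found["unit_testing"]
                     + defects_found["system_testing"])    # 38%
overall = efficiency(present - defects_found["field"])     # 81%

assert (inspection, testing, overall) == (43, 38, 81)
```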

Reliability Defect Tracking (All Phases)

                  Total      Total                Reqmts     Design     Code       Unit Test   System Test
  Activity        Failures   Critical   Defect    Critical   Critical   Critical   Critical    Critical
                  Found      Failures   Density   Defects    Defects    Defects    Failures    Failures
                             Found                Found      Found      Found      Found       Found

  Reqmts             75         12        16%        12
  Design            123         45        37%         4         41
  Code              158         72        46%         4          6         62
  Unit Test          78         25        35%         1          4         18          2

  Development
  Totals            434        154                   21         51         80          2
  DRE
  (development)                                     57%        80%        78%       100%

  System Test       189         53        68%         1         11         31          6            2
  DRE
  (after system
  testing)                                          55%        66%        56%        25%          100%
Defect Removal Technique Impact

  Design Inspections / Reviews     n     n     n     n     Y     n     Y     Y
  Code Inspections / Reviews       n     n     n     Y     n     n     Y     Y
  Formal SQA Processes             n     Y     n     n     n     Y     n     Y
  Formal Testing                   n     n     Y     n     n     Y     n     Y

  Median Defect Efficiency        40%   45%   53%   57%   60%   65%   85%   99%
Typical Defect Reduction Goals

[Chart: defects found per build (y-axis 0–200) across system-test builds
SysBld #1 through SysBld #5, during the System Test phase.]
Design for Reliability

[Chart: defects found per phase (y-axis 0–200) across Req, Design, Code,
Unit Test, SysBld #1–#5, and Field Failures — spanning Development,
System Test, and Deployment. Goal is to Predict Defect Totals for Next
Phase.]
Software Reliability
     Practices
Goals of Reliability Practices

     Reliability practices split the development lifecycle into 2 opposing
                                     phases:

          Pre-deployment — Focus on Fault Intolerance

            Fault Avoidance Techniques
            Prevent defects from being introduced

            Fault Removal Techniques
            Detect and repair faults

            Goal: Increase reliability by eliminating critical defects,
            thereby reducing the failure rate

          Post-deployment — Focus on Fault Tolerance

            Fault Tolerance Techniques
            Allow a system to operate predictably in the presence of faults

            System Restoration Techniques
            Quickly restore the operational state of a system in the
            simplest manner possible

            Goal: Increase availability by reducing or avoiding the
            effects of faults
Software Reliability Practices

 Analysis:
   Formal Scenario/Checklist Analysis
   FRACAS
   FMECA
   FTA
   Petri Nets
   Change Impact Analysis
   Common Cause Failure Analysis
   Sneak Analysis

 Design:
   Formal Interface Specification
   Defensive Programming
   Fault Tolerance
   Modular Design
   Error Detection and Correction
   Critical Functionality Isolation
   Design by Contract
   Reliability Allocation
   Design Diversity

 Verification:
   Boundary Value Analysis
   Equivalence Class Partitioning
   Reliability Growth Testing
   Fault Injection Testing
   Static/Dynamic Code Analysis
   Coverage Testing
   Usage Profile Testing
   Cleanroom
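
Two of the verification practices in the table, boundary value analysis and equivalence class partitioning, can be sketched against an invented percentage validator:

```python
def is_valid_percentage(value):
    """Hypothetical function under test: accepts integers 0..100."""
    return isinstance(value, int) and 0 <= value <= 100

# Boundary value analysis: test at, just inside, and just outside each edge.
boundary_cases = {-1: False, 0: True, 1: True, 99: True, 100: True, 101: False}
for value, expected in boundary_cases.items():
    assert is_valid_percentage(value) == expected

# Equivalence class partitioning: one representative per input class.
assert is_valid_percentage(50) is True       # class: in-range integers
assert is_valid_percentage(500) is False     # class: out-of-range integers
assert is_valid_percentage("50") is False    # class: wrong type
```

Both techniques aim to maximize defect detection with a small, systematically chosen test set rather than exhaustive input coverage.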




Design and Code Inspections

 ◈ The original rationale for inspections (current payback):
         “Inspections require less time and resources to detect and repair defects than
            traditional testing and debugging”

          Work done at Nortel Technologies in 1991 demonstrated that 65% to 90% of
           operational defects were detected by inspections at 1/4 to 2/3 the cost of testing


 ◈ Soft maintenance rationale (future payback):
           Data collected from 130 inspection sessions shows the long-term software
            maintenance benefits of inspections, with findings classified as follows:



    True Defects – The code behavior was wrong and an
          execution-affecting change was made to
          resolve it.

    False Positives – Any issue not requiring a code or
          document change.

    Soft Maintenance Changes – Any other issue that
          resulted in a code or document change, e.g.,
          code restructuring or addition of code
          comments.

Spectrum of Inspection Methodologies
          Method / Originator     Team Size    # of Sessions    Detection Method   Collection Meeting               Post-Process Feedback
          Fagan                   Large        1                Ad hoc             Yes (group oriented)             None
          Bisant                  Small        1                Ad hoc             Yes (group oriented)             None
          Gilb                    Large        1                Checklist          Yes (group oriented)             Root Cause Analysis
          Meetingless Inspection  Large        1                Unspecified        No (individual oriented)         None
          ADR                     Small        >1               Scenario           Yes (group oriented)             None
          Britcher                Unspecified  4 (parallel)     Scenario           Yes (group oriented)             None
          Phased Inspection       Small        >1 (sequential)  Checklist (comp)   No (mtg only to reconcile data)  None
          N-fold                  Small        >1 (parallel)    Ad hoc             Yes (group oriented)             None
          Code Reading            Small        1                Ad hoc             No (mtg optional)                None
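As a sketch, the table rows can be encoded as records and queried. The data is transcribed from the table above; the class and field names are my own.

```python
from dataclasses import dataclass

@dataclass
class InspectionMethod:
    name: str
    team_size: str
    sessions: str
    detection: str
    collection_meeting: bool  # True = group collection meeting required

# Rows transcribed from the inspection-methodology table above.
METHODS = [
    InspectionMethod("Fagan", "Large", "1", "Ad hoc", True),
    InspectionMethod("Bisant", "Small", "1", "Ad hoc", True),
    InspectionMethod("Gilb", "Large", "1", "Checklist", True),
    InspectionMethod("Meetingless Inspection", "Large", "1", "Unspecified", False),
    InspectionMethod("ADR", "Small", ">1", "Scenario", True),
    InspectionMethod("Britcher", "Unspecified", "4 parallel", "Scenario", True),
    InspectionMethod("Phased Inspection", "Small", ">1 sequential", "Checklist", False),
    InspectionMethod("N-fold", "Small", ">1 parallel", "Ad hoc", True),
    InspectionMethod("Code Reading", "Small", "1", "Ad hoc", False),
]

# e.g., methods that do not require a group collection meeting:
no_meeting = [m.name for m in METHODS if not m.collection_meeting]
print(no_meeting)  # ['Meetingless Inspection', 'Phased Inspection', 'Code Reading']
```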


         WOW! No wonder inspections are not well understood: there are too many methodologies.
                             AND THERE ARE MORE OPTIONS…
(v0.5)                                        Ops A La Carte ©                                         37
Spectrum of Technical Review Methodologies

     Inspections are just one of the many classes of Technical Review Methodologies.


 • Informal                                                   • Formal

 • Individual initiative                                      • Team-oriented

 • Small time commitment                                            • Multiple meetings and pre-meeting preparation

 • General Feedback                                           • Compliance with Standards

 • Defect Detection                                           • Satisfies Specifications




(least formal)  Ad hoc Review  →  Peer Desk Check (Passaround Check)  →  Pairs Programming  →  Walkthrough  →  Team Review  →  Inspection  (most formal)
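The ordering can be made explicit in a minimal sketch; only the least-to-most-formal ordering comes from the slide, and the helper name is my own.

```python
# Review types ordered least to most formal (ordering from the slide).
SPECTRUM = [
    "Ad hoc Review",
    "Peer Desk Check (Passaround Check)",
    "Pairs Programming",
    "Walkthrough",
    "Team Review",
    "Inspection",
]

def more_formal_than(a, b):
    """True if review type `a` sits further toward the formal end than `b`."""
    return SPECTRUM.index(a) > SPECTRUM.index(b)

print(more_formal_than("Inspection", "Walkthrough"))  # True
```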
(v0.5)                                 Ops A La Carte ©                                      38
Why Isn’t Software Reliability Prevalent ??

    “Those are very good ideas. We would like to implement them and
     we know we should try. However, there just isn’t enough time.”


◈ These arguments are erroneous: they all assume that testing is the most
     effective defect-detection methodology
           Teams assume that results from inspections/reviews are generally poor
          Engineers believe that testers will do a more thorough and efficient job of
           testing than any effort they implement (inspections and unit testing)
          Managers believe progress can be demonstrated faster and better once the SW
           is in the system test phase



◈ Remember the story of the lumberjack who was too busy chopping to sharpen his ax:


             “If you don’t have time to do it correctly the first time,
                  then you must have time to do it over later!”

(v0.5)                                     Ops A La Carte ©                              39
Software DfR Tools by Phase

Phase: Concept
   Activities: Define SW reliability requirements
   Tools: Benchmarking; Internal Goal Setting; Gap Analysis

Phase: Design (Architecture & High Level Design)
   Activities: Modeling & Predictions
   Tools: SW Failure Analysis; SW Fault Tolerance; Human Factors Analysis

Phase: Design (Low Level Design)
   Activities: Identify core, critical, and vulnerable sections of the design; static detection of design defects
   Tools: Human Factors Analysis; Derating Analysis; Worst Case Analysis

Phase: Coding
   Activities: Static detection of coding defects
   Tools: FRACAS; RCA

Phase: Unit Testing
   Activities: Dynamic detection of design and coding defects
   Tools: FRACAS; RCA

Phase: Integration and System Testing
   Activities: SW Statistical Testing
   Tools: FRACAS; RCA; SW Reliability Testing

Phase: Operations and Maintenance
   Activities: Continuous assessment of product reliability
   Tools: FRACAS; RCA
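The phase-to-tools mapping above can be encoded as a simple lookup; the phase and tool names are transcribed from the table, while the dictionary and function names are my own.

```python
# Phase-to-tools mapping transcribed from the DfR table above.
DFR_TOOLS = {
    "Concept": ["Benchmarking", "Internal Goal Setting", "Gap Analysis"],
    "Architecture & High Level Design": ["SW Failure Analysis",
                                         "SW Fault Tolerance",
                                         "Human Factors Analysis"],
    "Low Level Design": ["Human Factors Analysis", "Derating Analysis",
                         "Worst Case Analysis"],
    "Coding": ["FRACAS", "RCA"],
    "Unit Testing": ["FRACAS", "RCA"],
    "Integration and System Testing": ["FRACAS", "RCA",
                                       "SW Reliability Testing"],
    "Operations and Maintenance": ["FRACAS", "RCA"],
}

def tools_for(phase):
    """Return the DfR tools for a lifecycle phase (empty list if unknown)."""
    return DFR_TOOLS.get(phase, [])

print(tools_for("Coding"))  # ['FRACAS', 'RCA']
```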

(v0.5)                                                     Ops A La Carte ©                                             40
Questions?




(v0.5)     Ops A La Carte ©   41

More Related Content

Featured

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

Software Reliability Overview

  • 1. Software Design For Reliability (DfR) Seminar An Overview of Software Reliability Bob Mueller bobm@opsalacarte.com www.opsalacarte.com
  • 2. Software Quality and Software Reliability Related Disciplines, Yet Very Different
  • 3. Definition of Software Quality FACTORS CRITERIA suitability Functionality accuracy interoperability security understandability Usability learnability operability attractiveness Software Software Quality Quality maturity Reliability fault tolerance The level to which the *ISO9126 Quality Model recoverability software characteristics conform to all the time behavior specifications. Efficiency resource utilization analysability changeability Portability stability testability adaptability installability Maintainability co-existence replaceability (v0.5) Ops A La Carte © 3
  • 4. Most Common Misconception FACTORS CRITERIA What organizations suitability believe they are doing Functionality accuracy ------------------ interoperability We have a strong SW security quality program. We don’t need to add SW reliability understandability Usability learnability practices. operability attractiveness Software Quality maturity Reliability fault tolerance *ISO9126 What is missing recoverability Quality Model --------------- Implementing sufficient SW reliability practices time behavior Efficiency resource utilization to satisfy customer expectations analysability changeability Portability stability testability What the organizations are really doing adaptability ------------------ installability Implementing only a Maintainability co-existence sparse set of SW quality replaceability practices (v0.5) Ops A La Carte © 4
  • 5. Software Design For Reliability (DfR) Seminar Background on Software Reliability
  • 6. Software Reliability Can Be Measured Software Reliability is 20 years behind HW reliability ◈Ramifications of failure  Education on the consumer side  Many consumers just expect unreliable s/w ◈Education on the manufacturer’s side  Mfgs don’t know new innovative methods  Mfgs don’t figure out how users will use product ◈Software engineers are more free-spirited than HW ◈Entry cost for a SW devel. team less than for HW (v0.5) Ops A La Carte © 6
  • 7. Reliability vs. Cost TOTAL COST OPTIMUM CURVE COST POINT RELIABILITY PROGRAM COSTS COST HW WARRANTY COSTS RELIABILITY The SW impact on HW warranty costs is minimal at best (v0.5) Ops A La Carte © 7
  • 8. Reliability vs. Cost, continued ◈SW has no associated manufacturing costs, so warranty costs and saving are almost entirely allocated to HW ◈If there are no cost savings associated with improving software reliability, why not leave it as is and focus on improving HW reliability to save money?  One study found that the root causes of typical embedded system failures were SW, not HW, by a ratio of 10:1.  Customers buy systems, not just HW. ◈The benefits for a SW Reliability Program are not in direct cost savings, rather in:  Increased SW/FW staff availability with reduced operational schedules resulting from fewer corrective maintenance content.  Increased customer goodwill based on improved customer (v0.5) satisfaction. Ops A La Carte © 8
  • 10. Software Reliability Definitions The customer perception of the software’s ability to deliver the expected functionality in the target environment without failing. ◈ Examine the key points ◈ Practical rewording of the definition Software reliability is a measure of the software failures that are visible to a customer and prevent a system from delivering essential functionality. (v0.5) Ops A La Carte © 10
  • 11. Software Reliability Can Be Measured ◈ Measurements are a required foundation  Differs from quality which is not defined by measurements  All measurements and metrics are based on run-time failures ◈ Only customer-visible failures are targeted  Only defects that produce customer-visible failures affect reliability  Corollaries ◘ Defects that do not trigger run-time failures do NOT affect reliability – badly formatted or commented code – defects in dead code ◘ Not all defects that are triggered at run-time produce customer-visible failures – corruption of any unused region of memory ◈ SW Reliability evolved from HW Reliability  SW Reliability focuses only on design reliability  HW Reliability has no counterpart to this (v0.5) Ops A La Carte © 11
  • 12. Software Reliability Is Based On Usage ◈ SW failure characteristics are derived from the usage profile of a particular customer or set of customers  Each usage profile triggers a different set of run-time SW faults and failures ◈ Example  Examine product usage by 2 different customers ◘ Customer A’s usage profile only exercises the sections of SW that produce very few failures. ◘ Customer B’s usage profile overlaps with Customer A’s usage profile, but additionally exercises other sections of SW that produce many, frequent failures.  Customer assessment of the product’s software reliability ◘ Customer A’s assessment - the SW reliability is high ◘ Customer B’s assessment - the SW reliability is low (v0.5) Ops A La Carte © 12
  • 13. Reliability ≠ Correctness ◈ Correctness is a measure of the degree of intended functionality implemented by the SW  Correctness measures the completeness of requirements and the accuracy of defining a SW model based on these requirements ◈ Reliability is a measure of the behavior (i.e., failures) that prevents the software from delivering the implemented functionality (v0.5) Ops A La Carte © 13
  • 15. Terminology ◈ Defect  A flaw in the requirements, design or source code that produces implementation logic that will trigger a fault Defect ◘ Defects of omission – Not all requirements were used in creating a design model – The design satisfies all requirements but is incomplete – The source code did not implement all the design – The source code has missing or incomplete logic ◘ Defects of commission – Incorrect requirements are specified – Requirements are incorrectly translated into a design model – The design is incorrectly translated into source code – The source code logic is flawed  Defects are static and can be detected and removed without executing the source code  Defects that cannot trigger a SW failure are not tracked or measured ◘ Ex: quality defects, such as test case and soft maintenance defects, and defects in “dead code” (v0.5) Ops A La Carte © 15
  • 16. Terminology (continued) ◈ Fault  The result of triggering a SW defect by executing the associated Defect implementation logic ◘ Faults are NOT always visible to the customer ◘ A fault can be the transitional state that results in a failure Fault ◘ Trivially simple defects (e.g., display spelling errors) do not have intermediate fault states ◈ Failure Defect  A customer (or operational system) observation or detection that is perceived as an unacceptable departure of operation from the designed SW behavior ◘ Failures MUST be observable by the customer or an Fault operational system ◘ Failures are the visible, run-time symptoms of faults ◘ Not all failures result in system outages Failure (v0.5) Ops A La Carte © 16
  • 17. Basic Failure Classification ◈ High-level SW failure classification based on complexity and time-sensitivity of triggering the associated defect:  Bohr Bugs  Heisen Bugs  Aging Bugs ◈ Bohr Bugs  Named after the “Bohr” atom ◘ Connotation: Deterministic failures that are straight-forward to isolate  Failures are easily reproducible, even after a system restart/reboot  Most frequent failure category detected during development, testing and early deployment  These are considered “trivial” defects since every execution of the associated logic results in a failure (v0.5) Ops A La Carte © 17
  • 18. Basic Failure Classification (continued) ◈ Heisen Bugs  Named after the Heisenberg uncertainty principle ◘ Connotation: Failures that are difficult to isolate to a root cause  Intermittent failures that are rarely triggered and difficult to reproducible.  Unlikely to reoccur following a system restart/reboot  Common root causes: ◘ Synchronization boundaries between SW components ◘ Improper or insufficient exception handling ◘ Interdependent timing of multiple events  Rarely detected when the SW is not mature (i.e., during early development and testing phases)  The best methods to deal with these “tough” defects are by ◘ Identification using SW failure analysis ◘ Impact mitigation using fault tolerant code (v0.5) Ops A La Carte © 18
  • 19. Basic Failure Classification (continued) ◈ Aging Bugs  Attributed to the results of continuous, long-term operations or use ◘ Connotation: Failures resulting from accumulation of erroneous conditions  Transient failures occur after extended run-time or functional cycles where the contributing faults have occurred numerous times  Preceding faults may lead to system performance degradation before a failure occurs  Extremely unlikely to reoccur following a system restart/reboot due to the longevity requirement  Common root causes: ◘ Deterioration in the availability of OS resources (e.g., depletion of device handles, memory leaks, heap fragmentation) ◘ Data corruption ◘ Application race conditions ◘ Accumulation of numerical round-off errors ◘ Gradual data accumulation for sampling or queue build-up  The best methods to deal with these “tough” defects are by ◘ Identification using SW failure analysis ◘ Impact mitigation using fault tolerant code (v0.5) Ops A La Carte © 19
  • 20. What Is Reliable Software ??
  • 21. Reliable Software Characteristics ◈ Operates within the reliability specification that satisfies customer expectations  Measured in terms of failure rate and availability level  The goal is rarely “defect free” or “ultra-high reliability” ◈ “Gracefully” handles erroneous inputs from users, other systems, and transient hardware faults  Attempts to prevent state or output data corruption from “erroneous” inputs ◈ Quickly detects, reports and recovers from SW and transient HW faults  SW provides system behave as continuously monitoring, self-diagnosing” and “self-healing”  Prevents as many run-time faults as possible from becoming system-level failures (v0.5) Ops A La Carte © 21
  • 22. Common Paths to Software Reliability ◈ Traditional SW Reliability Programs - Predictions  Program directed by a separate team of reliability engineers  Development process viewed as a SW-generating, black box ◘ Develop prediction models to estimate the number of faults in the SW  Reliability techniques used to identify defects and produce SW reliability metrics ◘ Traditional HW failure analysis techniques, e.g., FMEAs or FTAs ◘ Defect estimation and tracking ◈ SW Process Control  Based on the assumption of a correlation between development process maturity and latent defect density in the final SW ◘ Ex: CMM Level 3 organizations can develop SW with 3.5 defects/KSLOC  If the current process level does not yield the desired SW reliability, audits and stricter process controls are implemented ◈ Quality Through SW Testing  Most prevalent approach for implementing SW reliability  Assumes reliability is increased by expanding the types of system tests (e.g., integration, performance and loading) and increasing the duration of testing  Measured by counting and classifying defects (v0.5) Ops A La Carte © 22
  • 23. Common Paths to Software Reliability (continued) ◈ These approaches generally do not provide a complete solution  Reliability prediction models are not well-understood  SW engineers find it difficult to apply HW failure analysis techniques to detailed SW designs  Only 20% of the SW defects identified by quality processes during development (e.g., code inspections) affect reliability  System testing is an inefficient mechanism for finding run-time failures ◘ Generally identifies no more than 50% of run-time failures  Quality processes for tracking defects do not produce SW reliability information such as defect density and failure rates ◈ Net Effect:  SW engineers still end up spending more than 50% of their time debugging, instead of focusing on designing or implementing source code (v0.5) Ops A La Carte © 23
  • 25. Software Defect Distributions Average distribution of SW defects by lifecycle phase:  20% Requirements  30% Design  35% Coding  10% Bad Defect Fixes (introduction of secondary defects)  5% Customer Documentation Average distribution of SW defects at the time of field deployment: (based on 1st year field defect report data)  1% Severity 1 (catastrophic)  20% Severity 2 (major)  35% Severity 3 (minor)  44% Severity 4 (annoyance) (v0.5) Ops A La Carte © 25
  • 26. Typical Defect Tracking (System Test) Severity #1 Severity #2 Severity #3 Severity #4 System Test Total Defects Defects Defects Defects Defects Build Found Found Found Found Found SysBuild-01 7 9 16 22 54 SysBuild-02 5 5 14 26 50 SysBuild-03 4 6 8 16 34 • • • • • • • • • • • • • • • • • • SysBuild-7 0 1 4 6 11 (v0.5) Ops A La Carte © 26
  • 27. Defect Origin and Discovery Typical Behavior Requirements Design Coding Testing Maintenance Defect Origin Defect Requirements Design Coding Testing Maintenance Discovery Surprise! Goal of Best Practices on Defect Discovery Defect Requirements Design Coding Testing Maintenance Origin Defect Requirements Design Coding Testing Maintenance Discovery (v0.5) Ops A La Carte © 27
  • 28. Defect Removal Efficiencies ◈ Defect removal efficiency is a key reliability measure Defects found Removal efficiency = Defects present ◈ “Defects present” is the critical parameter that is based on inspections, testing and field data System & Subsystem Field Requirements Design Coding Unit Testing Testing Stages Deployment Inspection Efficiency Overall Testing Efficiency Efficiency Example: Origin Defects Found Metric Removal Efficiency Inspections 90 Inspection Efficiency 43% = (90 / 210) Unit Testing 25 Testing Efficiency 38% = (80 / 210) System & Subsystem Overall Efficiency 81% = (170 / 210) 55 Testing Field Deployment 40 TOTAL 210 (v0.5) Ops A La Carte © 28
  • 29. Reliability Defect Tracking (All Phases) Total Reqmts Design Code Unit Test System Test Total Critical Defect Critical Critical Critical Critical Critical Activity Failures Failures Density Defects Defects Defects Failures Failures Found Found Found Found Found Found Found Reqmts 75 12 16% 12 Design 123 45 37% 4 41 Code 158 72 46% 4 6 62 Unit Test 78 25 35% 1 4 18 2 Development 434 154 21 51 80 2 Totals DRE 57% 80% 78% 100% (development) System Test 189 53 68% 1 11 31 6 2 DRE (after system 55% 66% 56% 25% 100% testing) (v0.5) Ops A La Carte © 29
  • 30. Defect Removal Technique Impact Design Inspections / n n n n Y n Y Y Reviews Code Inspections / n n n Y n n Y Y Reviews Formal SQA Processes n Y n n n Y n Y Formal Testing n n Y n n Y n Y Median Defect 40% 45% 53% 57% 60% 65% 85% 99% Efficiency (v0.5) Ops A La Carte © 30
  • 31. Typical Defect Reduction Goals 200 150 100 50 SysBld SysBld SysBld SysBld SysBld #1 #2 #3 #4 #5 System Test (v0.5) Ops A La Carte © 31
  • 32. Design for Reliability 200 150 Goal is to Predict Defect Totals for Next Phase 100 50 Req Design Code Unit SysBld SysBld SysBld SysBld SysBld Field Test #1 #2 #3 #4 #5 Failures Development System Test Deployment (v0.5) Ops A La Carte © 32
  • 33. Software Reliability Practices
  • 34. Goals of Reliability Practices Reliability Practices split the development lifecycle into 2 opposing phases: Pre-deployment Post-deployment Focus on Fault Intolerance Focus on Fault Tolerance Fault Tolerance Techniques Fault Avoidance Techniques Allow a system to operate Prevents defects from being predictably in the presence of faults introduced System Restoration Techniques Fault Removal Techniques Quickly restore the operational state Detects and repairs faults of a system in the simplest manner possible Goal: Increase reliability by Goal: Increase availability by eliminating critical defects that reducing or avoiding the effects reduce the failure rate of faults (v0.5) Ops A La Carte © 34
  • 35. Software Reliability Practices Analysis Design Verification  Formal  Formal Interface  Boundary Value Analysis Scenario/Checklist Specification  Equivalence Class Analysis  Defensive Programming Partitioning  FRACAS  Fault Tolerance  Reliability Growth Testing  FMECA  Modular Design  Fault Injection Testing  FTA  Error Detection and  Static/Dynamic Code  Petri Nets Correction Analysis  Change Impact Analysis  Critical Functionality  Coverage Testing Isolation  Common Cause Failure  Usage Profile Testing Analysis  Design by Contract  Cleanroom  Sneak Analysis  Reliability Allocation  Design Diversity (v0.5) Ops A La Carte © 35
  • 36. Design and Code Inspections ◈ The original rationale for inspections (current payback): “Inspections require less time and resources to detect and repair defects than traditional testing and debugging”  Work done at Nortel Technologies in 1991 demonstrated that 65% to 90% of operational defects were detected by inspections at 1/4 to 2/3 the cost of testing ◈ Soft maintenance rationale (future payback):  Data collected on 130 inspection sessions findings on the long-term, software maintenance benefits of inspections as follows: True Defects - The code behavior was wrong and an execution affecting change was made to resolve it. False Positives – Any issue not requiring a code or document change. Soft Maintenance Changes – Any other issue that resulted in a code or document change, e.g., code restructuring or addition of code comments. (v0.5) Ops A La Carte © 36
• 37. Spectrum of Inspection Methodologies

Method / Originator    | Team Size   | # of Sessions  | Detection Method | Collection Meeting              | Post-Process Feedback
Fagan                  | Large       | 1              | Ad hoc           | Yes (group oriented)            | None
Bisant                 | Small       | 1              | Ad hoc           | Yes (group oriented)            | None
Gilb                   | Large       | 1              | Checklist        | Yes (group oriented)            | Root Cause Analysis
Meetingless Inspection | Large       | 1              | Unspecified      | No (individual oriented)        | None
ADR                    | Small       | >1             | Scenario         | Yes (group oriented)            | None
Britcher               | Unspecified | 4, parallel    | Scenario         | Yes (group oriented)            | None
Phased Inspection      | Small       | >1, sequential | Checklist (comp) | No (mtg only to reconcile data) | None
N-fold                 | Small       | >1, parallel   | Ad hoc           | Yes (group oriented)            | None
Code Reading           | Small       | 1              | Ad hoc           | No (mtg optional)               | None

WOW! No wonder inspections are not well understood: there are too many methodologies. AND, THERE ARE MORE OPTIONS…
(v0.5) Ops A La Carte © 37
• 38. Spectrum of Technical Review Methodologies
Inspections are just one of the many classes of technical review methodologies.

Informal end of the spectrum:
• Individual initiative
• Small time commitment
• General feedback
• Defect detection

Formal end of the spectrum:
• Team-oriented
• Multiple meetings and pre-meeting preparation
• Compliance with standards
• Satisfies specifications

The spectrum, roughly in order of increasing formality: Ad hoc Review – Peer Desk Check (Passaround Check) – Pairs Programming – Walkthrough – Team Review – Inspection
(v0.5) Ops A La Carte © 38
• 39. Why Isn’t Software Reliability Prevalent?
“Those are very good ideas. We would like to implement them, and we know we should try. However, there just isn’t enough time.”
◈ The erroneous arguments all assume testing is the most effective defect detection methodology:
 Results from inspections/reviews are generally poor
 Engineers believe that testers will do a more thorough and efficient job of testing than any effort they implement themselves (inspections and unit testing)
 Managers believe progress can be demonstrated faster and better once the SW is in the system test phase
◈ Remember, just like the story of the lumberjack and his ax: “If you don’t have time to do it correctly the first time, then you must have time to do it over later!”
(v0.5) Ops A La Carte © 39
• 40. Software DfR Tools by Phase

Phase: Concept
 Activities: Define SW reliability requirements
 Tools: ◈Benchmarking ◈Internal Goal Setting ◈Gap Analysis

Phase: Architecture & High Level Design
 Activities: Modeling & Predictions
 Tools: ◈SW Failure Analysis ◈SW Fault Tolerance ◈Human Factors Analysis

Phase: Low Level Design
 Activities: Identify core, critical and vulnerable sections of the design; Static detection of design defects
 Tools: ◈Human Factors Analysis ◈Derating Analysis ◈Worst Case Analysis

Phase: Coding
 Activities: Static detection of coding defects
 Tools: ◈FRACAS ◈RCA

Phase: Unit Testing
 Activities: Dynamic detection of design and coding defects
 Tools: ◈FRACAS ◈RCA

Phase: Integration and System Testing
 Activities: SW Statistical Testing
 Tools: ◈FRACAS ◈RCA ◈SW Reliability Testing

Phase: Operations and Maintenance
 Activities: Continuous assessment of product reliability
 Tools: ◈FRACAS ◈RCA

(v0.5) Ops A La Carte © 40
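The Unit Testing phase above pairs naturally with the Boundary Value Analysis practice from slide 35: test inputs are derived at and just beyond the edges of each valid range, where defects cluster. The function under test and the case generator below are illustrative assumptions, not material from the seminar.

```python
def clamp(value, lo, hi):
    """Toy function under test: constrain value to the range [lo, hi]."""
    return max(lo, min(hi, value))

def boundary_value_cases(lo, hi):
    """Boundary value analysis sketch: probe at, just inside, and
    just outside each boundary of the valid range."""
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]

# Each case exercises clamp at or around the edges of its range [0, 10]
results = [clamp(v, 0, 10) for v in boundary_value_cases(0, 10)]
```

Equivalence class partitioning would complement this by adding one representative from each interior region (e.g., a mid-range value and one deep out-of-range value per side), keeping the test set small while covering every behaviorally distinct input class.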
  • 41. Questions? (v0.5) Ops A La Carte © 41