Weighted Defect Removal Effectiveness: Method and Value

This is a de-blued version of my presentation at the Rational Software Conference 2009. An accompanying video (http://www.youtube.com/watch?v=0ZU28Dma6zw&feature=channel_page) demonstrates one method for generating these values with IBM Rational ClearQuest.

Speaker notes:
  • This presentation is focused on only one metric: a modified form of T.C. Jones' Defect Removal Efficiency (see http://www.research.ibm.com/journal/sj/171/ibmsj1701E.pdf).
    Outline:
    - The quality questions: How good is our software? How good is our testing?
    - Jones' original and simplified formulae for DRE and Cumulative DRE.
    - How DRE answers the quality questions; published benchmarks from industry.
    - Problems with DRE's simplifying assumption of defect equivalence.
    - A simple method for applying defect valence (DRE-W).
    - Does weighting make a difference? Examples from actual projects: DRE vs. DRE-W.
    - Demo: how to calculate DRE and DRE-W with ClearQuest.
    - What _not_ to do with these numbers.
    Benefits: the desired learning outcome is for each attendee to:
    - adopt an attitude that testing effectiveness is measurable
    - understand the method and limitations of DRE
    - be able to calculate DRE and weighted DRE (DRE-W)
    - appreciate how DRE and DRE-W differ in results
    - see how easy it is to generate these metrics from ClearQuest (CQ)
    - take away a set of instructions, and a hyperlink, to calculate these measures from their own CQ data
    Background: I've championed DRE in two companies. What managers and teams do with the results is far more important than the numbers. I developed DRE-W to help counter measurement errors caused by QA over-reporting of cosmetic defects.
  • How do we know how good we are at designing, executing, and interpreting tests? Can you give me a simple, easily understandable measure to answer that question? How good is our product when it's still under development? Is it good enough for release? How many found and undiscovered defects still exist? We can't test quality into a product, so are these two questions related at all? If the questions aren't related, how could a measure for one tell us anything about the other? And yet it does, because these are quality questions. Transition: ultimately, quality is in the eyes of the customer!
  • So we count how many defects we found during testing. Then we count how many defects the customer reported after our testing. Apply this simple formula, and that's how good our overall testing was! This all began in 1976, when Michael Fagan's team of hardware engineers burned up some hardware during a test. I'll take some liberties with the story, but where do you think the name "smoke test" comes from? It wasn't exactly possible to retest the burning rubble, so the team went back and looked at the blueprints. They found out why it toasted, but also a number of other problems with the design, including problems they would not have found in the planned tests. So Fagan became an advocate for inspection before testing (it saves hardware) and applied a statistic called Error Detection Efficiency. The idea of inspection before testing was further developed by Capers Jones and became the Cumulative Defect Removal Efficiency formula you see on this slide. This is the simple method; we'll talk about the simplifications later. But software that fails a "smoke test" isn't a heap of smoldering rubble, so it can be fixed and retested, again and again. This means we no longer have to wait on our customers' bug reports to learn how good we have been at testing. Transition: we don't have to wait on the customer to measure quality. "Once upon a time, hardware cost more than software. So when hardware had mistakes in it, a team sat down to figure out why, and they found more mistakes" (paraphrasing Michael Fagan). "Error detection efficiency = errors found by an inspection x 100 / total errors in the product before inspection" -- Michael Fagan, IBM Systems Journal, 1976
  • We just have to wait on the next test. For any level of defect detection (which may be inspection or test, by developers, QA, or users), we can apply the same basic formula. We need to remember that the value of the present test will not be known until the next test. Some of you may be saying, "Hey, we do that, but we don't call it DRE." Transition: so "Defect Removal Efficiency" is a specialized term, but others have rephrased it to meet their needs.
  • Capers Jones uses the term Defect Removal Efficiency when considering standardized units of software, specifically function point size categorizations. Other names fit when we measure without reference to the size of the project, when we want to talk in terms business people understand, when we are really focused on what we find rather than what we fix, or when we measure relative to some other measure. Transition: regardless of what we call it, how good are we, as an industry, at stopping defects?
  • Some forms of testing are better at finding defects than others. This is a sampling from a table published in CrossTalk. The light blue centers and large rounded rectangles represent the "normal" DRE for a given type of test, and the bar below, the users' normal range, shows even more of the variability in the efficiency of user acceptance testing. But these are based on normalized projects. How many of us have ever worked on a "normal" software project? Are we all at the same level of capability maturity? Are all of our teams the same size? Have we always applied the same series of test types? When we look at the overall effectiveness, we get a better idea of the variability involved (http://www.stsc.hill.af.mil/crosstalk/2005/04/0504Jones.html). DRE varies significantly by capability maturity and size of project. Transition: how much does it vary?
  • This is a simple slide, but take a moment to let it speak to you as I try to interpret it. Most projects sent software to production with 10-20% of the bugs still undetected. These are results published in April of this year! Remember the question: how good was our testing? Not perfect! We are going to send bugs to production! How many defects did you log before your last delivery? What if your cumulative removal efficiency was 90%? How many defects did you miss? Transition: before getting too worked up about this, let's look at some of the assumptions behind this very useful number. After three decades of experience, could we possibly provide better answers to the quality questions? Remember that CDRE is calculated using the simple method. What makes it simple? (Curved linear representation of data from T. Capers Jones' table from the April 2009 ITMPI webinar, "Software Defect Removal.")
  • Jones' original paper on DRE was not focused on the statistic, and it wasn't at all concerned with QA. He was interested in ways to improve programmer productivity! So CDRE was just a means to a very different end than the one we have been considering. Jones documented his assumptions for us, which you see on this slide. The industry DRE numbers you just examined empirically refute the first assumption; this is a good reason to measure after each set of tests rather than just cumulatively. Bug injection during fixes is also measured by Jones and others; it varies, but sometimes one bug is created for every three destroyed (need source!). "All defects, regardless of source or of origin (whether design problems, coding problems, or some other) are lumped together and counted as the single variable, defects." "The defect removal efficiencies of all reviews, inspections, tests, and other defect removal operations are lumped together and counted as the single variable, cumulative defect removal efficiency." Transition: let's discuss this last assumption, which we want to play with.
  • Spend just a few seconds getting comfortable with this bar chart. Team Swan is above the waterline and Team Dolphin is below the line. Each stage of testing is indicated in sequence. Defects are shown with an X where detected. Colors indicate severity levels; the end of a color bar indicates the defect is removed. The blue, inconsequential defect was never found or removed. Transition: now that we understand the symbols, let's read from left to right.
  • (The chart layout is the same as on the previous slide.) Q: Can we determine anything about the effectiveness of my testing or the quality of our product? The tests are useful; they caught some bugs. How good is my software? It looks pretty good for the Dolphins, not so good for the Swans. Remember, DRE is always retrospective. Transition: so let's look at the next stage of testing.
  • At the end of acceptance testing, how good WAS my development testing? How good is my software? The Swans are feeling better about their software; the Dolphins are getting nervous. Transition: and what did our customers tell us after we released the product?
  • How good was acceptance testing? The user-reported defects indicate that acceptance testing eliminated half of the remaining defects, so its DRE is 50%. Remember: each WIP DRE is relative only to the defects remaining. How good is my software? Well, you see... Transition: so what was our cumulative DRE?
  • Three of a total of four reported defects were detected and removed before the production release, so... How good was my testing? 75% effective (well below industry norms). Transition: look closely at what the CDRE seems to tell us: our test methods are equally effective? Do you believe that?
  • The Swans removed all but one cosmetic defect before release. The Dolphins allowed only one defect to slip, but the customer was not happy! The original purpose of Fagan and Jones was to look at practices, not measures. Transition: Capers Jones himself was aware of the effect of simplifying DRE by considering all defects equivalent.
  • Recall that quality is determined by the customer, and the customer cares about severity. Okay, we care about severity, but how do we factor that into a nice, simple statistic? Transition: so what if all defects are not treated equally? How might we account for severity?
  • Assign a weight to each of the severity levels in use. (Severity scales can be inverted, so be explicit about which end is worst.) Severity for test and severity for use are not the same: a defect in production has business impact, while a defect in test has none, so we would have to guess. In effect, I removed one of Jones' simplifications and substituted one of my own. A richer alternative is quantified potential business impact -- business impact, not ODC impact, in Orthogonal Defect Classification terms. The use of triggers to answer the quality questions has been published by Chaar, et al. Transition: assuming just this simple 1-through-5 weighting, what difference would we see in our game tests?
  • The Swans and Dolphins were not equally effective at discovering weighted defects. (Step through the calculation.) Transition: and our acceptance test results would also differ.
  • Transition: What if we were to apply weighting to the Cumulative DRE?
  • Knowing that one team sent a cosmetic bug to your customers and the other sent a critical bug, are you comfortable with the idea that the Swans were better at testing than the Dolphins? Transition: OK, so how can we calculate DRE and weighted DRE with ClearQuest?
  • "A study at the Software Engineering Laboratory found that code reading detected about 80 percent more faults per hour than testing (Basili and Selby 1987). Another organization found that it cost six times as much to detect design defects by using testing as by using inspections (Ackerman, Buchwald, and Lewski 1989). A later study at IBM found that only 3.5 staff hours were needed to find each error when using code inspections, whereas 15-25 hours were needed to find each error through testing (Kaplan 1995)." -- Steve McConnell, Code Complete, 2nd ed., 2004. Transition: let's look at how these numbers answer the quality questions.
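The simple-method formulas walked through in these notes are easy to automate outside ClearQuest as well. Below is a minimal Python sketch (function names are mine, not from the presentation); the counts mirror the Swan team in the example charts:

```python
# Simple-method DRE calculations, as described in the notes above.
# Counts mirror Team Swan: 2 defects found in development,
# 1 in acceptance testing, 1 reported from production.

def wip_dre(found_this_stage: int, found_later: int) -> float:
    """Work-in-process DRE: the share of defects present at the start of a
    stage that the stage removed. Known only once later stages report."""
    total = found_this_stage + found_later
    return found_this_stage / total if total else 0.0

def cumulative_dre(found_before_release: int, found_after_release: int) -> float:
    """Jones' simple-method CDRE: pre-release defects over lifetime defects."""
    total = found_before_release + found_after_release
    return found_before_release / total if total else 0.0

print(f"{wip_dre(2, 1):.0%}")          # development WIP DRE: 2/3 = 67%
print(f"{wip_dre(1, 1):.0%}")          # acceptance WIP DRE: 1/2 = 50%
print(f"{cumulative_dre(3, 1):.0%}")   # CDRE: 3/4 = 75%
```

Note that `wip_dre` and `cumulative_dre` are the same ratio applied to different windows of the defect data, which is exactly why the WIP numbers "become" DRE once production counts arrive.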
    1. IBM Rational Software Conference 2009 -- Overview: Asking quality questions; Defect removal efficiency (DRE, CDRE); Weighted DRE (DREw, CDREw); Demo; Answering quality questions
    2. Asking Quality Questions: How good was our testing? How good is our software?
    3. Cumulative Defect Removal Efficiency (Simple Method): "Cumulative defect removal efficiency = defects found before release / defects found before and after release. By this formula, if 100 defects are found in a program during its entire life -- in both development and in production -- and 90 of the defects are found before release, then the cumulative defect removal efficiency is considered to be 90 percent." -- T.C. Jones, IBM Systems Journal, 1978
    4. Work-In-Process Defect Removal Efficiency: defects found in the prior test / defects found in the prior and current tests. How good was my testing? WIP DRE is retrospective.
    5. A DRE by Any Other Name: Defect Removal Effectiveness; Defect Fix Percentage; Defect Detection Effectiveness; Defect Detection Percentage; Defect Detection Rate
    6. What are Actual DREs? -- data from table by Capers Jones, CrossTalk, 2008
    7. What are Actual CDREs? Ranges shown: <80%, 80-85%, 85-90%, 90-95%, 95-99%, >99% -- based on Capers Jones data published 2008 by ITMPI
    8. Jones' Simplifying Assumptions: All detection methods are equivalent; All fixes are good and singular; All defect causes are equivalent; All defects are equivalent -- T.C. Jones, IBM Systems Journal, 1978
    9. Defect Detection (chart): In Development, Team Swan finds its Critical and Major defects, while Team Dolphin finds its Minor and Cosmetic defects. Neither team finds the Inconsequential defect.
    10. Work In Process Calculations (chart): In Acceptance, Swan finds its Minor defect (Development WIP DRE 2/3 = 67%) and Dolphin finds its Major defect (Development WIP DRE 2/3 = 67%).
    11. WIP DRE becomes DRE (chart): With Production counts, Swan's Cosmetic defect surfaces in Production (Acceptance WIP DRE 1/2 = 50%) and Dolphin's Critical defect surfaces in Production (Acceptance WIP DRE 1/2 = 50%). With Production counts, WIP DRE becomes DRE.
    12. Cumulative Defect Removal Efficiency (CDRE) (chart): Each team removed 3 of its 4 reported defects before release -- CDRE 75% for both.
    13. Are these test results equivalent? (Swan shipped only a Cosmetic defect; Dolphin shipped a Critical defect.)
    14. Severity Weighting: "Obviously, it is important to measure defect severity levels as well as recording numbers of defects." -- T. Capers Jones, 2008
    15. Weighted Defect Removal Effectiveness (DREw): Critical x 5, Major x 4, Minor x 3, Cosmetic x 2, Inconsequential x 1. Keep It Simple! (or use quantified potential business impact)
    16. DREw in Development (chart): Swan 9/12 = DREw 75%; Dolphin 5/9 = DREw 56%.
    17. DREw in Acceptance (chart): Swan 3/5 = DREw 60%; Dolphin 4/9 = DREw 44%.
    18. Cumulative DREw (CDREw) (chart): Swan 12/14 = CDREw 86%; Dolphin 9/14 = CDREw 64%.
    19. Why Measure Work-In-Process Testing? Consistent WIP DRE lends predictive value for product reliability from a stable process; consistent (WIP) DREw lends predictive value for product releasability from a stable process.
    20. Answering Quality Questions: How good was our testing? How good is our software? (Defect counts by severity -- Critical, Major, Minor, Cosmetic, Weighted Total -- across stages: Dev, Int, QA, Alpha, Beta, Prod)
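The weighted variant from slides 15-18 can be sketched the same way. This Python snippet (names are mine) uses the slides' 1-through-5 severity weights and the Swan/Dolphin outcomes:

```python
# Weighted defect removal effectiveness (CDREw) with the 1-5 severity
# weights from slide 15. Team data mirrors the Swan/Dolphin charts.

WEIGHTS = {"critical": 5, "major": 4, "minor": 3, "cosmetic": 2, "inconsequential": 1}

def weight(defects):
    """Total severity weight of a list of defects named by severity level."""
    return sum(WEIGHTS[severity] for severity in defects)

def cdre_w(before_release, after_release):
    """Cumulative weighted DRE: weighted defects removed before release
    over the weighted total reported in the product's lifetime."""
    total = weight(before_release) + weight(after_release)
    return weight(before_release) / total if total else 0.0

# Swan caught Critical, Major, and Minor before release; Cosmetic escaped.
swan = cdre_w(["critical", "major", "minor"], ["cosmetic"])
# Dolphin caught Major, Minor, and Cosmetic; the Critical defect escaped.
dolphin = cdre_w(["major", "minor", "cosmetic"], ["critical"])
print(f"Swan CDREw {swan:.0%}, Dolphin CDREw {dolphin:.0%}")  # 86% vs 64%
```

Both teams score an identical 75% on unweighted CDRE, so the spread here (86% vs 64%) is the whole argument for weighting in miniature.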
