EuroSTAR Software Testing Conference 2010 presentation on Big Bugs That Got Away by Ken Johnston. See more at: http://conference.eurostarsoftwaretesting.com/past-presentations/
Ken Johnston - Big Bugs That Got Away - EuroSTAR 2010
1. What We Can Learn from Big Bugs that Got Away
Ken Johnston, Group Manager, Office Internet Platforms & Operation
EuroSTAR 2010
2. I Want to know more about YOU
•Who wandered in here by accident
•Who is at EuroSTAR for the first time
•How long have you been in Software Testing
•Have you ever missed a bug
•Have you ever heard…
8. Session Overview
•About you, me and setting the tone
•Bug Wallowing 1 –A self reflective journey
•Bug Wallowing 2 –Group Therapy
•Root Cause Analysis 101
▫Sentinel Events
▫Pattern Analysis
▫Formal RCA program overview
•Bug Wallowing 3
•Five Whys
•Bug Wallowing 4
•Fishbone
•Bug Wallowing 5
•Crafting a good bug story
9. Learning Objectives
1.Be armed to deal with the question, “How did test miss this bug?”
2.Learn a little about formal RCA and the use of the 5 Whys and Fishbone tools
3.Have a number of highly instructive bug stories from within your organization that you can take home
10. Def. –Roll in something: to lie down and roll around in something
13. Repeat After Me
•I did not design the bug.
•I did not code the bug.
•I found crashing bugs, data corruption bugs, fit and finish bugs.
•I found hundreds of bugs.
14. Repeat After Me
•So what if I missed a bug.
•I didn’t write the bug in the first place.
15. Activity Share your Bug Story
•Take the next 10 minutes
•Groups of 2 or 3
•Think of a bug that got away
•Minimum One Bug story each
•Questions to ask
▫How long after ship did you see this
▫How big was the impact
▫How did it get missed
▫What did you change because of this bug
20. Time to Share
•What did you come up with?
•Why do we wallow?
•Why do we RCA bugs?
•My List
▫To learn from mistakes
▫To systematically identify areas for improvement
▫To prevent repetition of mistakes
▫Bugs are stories and organizations are driven by the stories they tell
22. Root Cause Analysis 300 Level
•Two approaches to RCA
▫Sentinel Event
▫Pattern Analysis
•Formal RCA Program
▫Data Collection
▫Data Analysis and Assessment
▫Corrective Actions
•The Pit and the Pendulum
▫Risks of RCA
▫Benefits of RCA
Based upon Ch. 11
PDF available to EuroSTAR attendees
http://defectprevention.org
23. RCA –Sentinel Event Bugs
•How do you know it’s a Sentinel Event Bug?
•If you make the front page of the http://wsj.com
•Production Outage
▫I have a lot of these stories
•Security vulnerabilities
•The last bug taken before ship
▫“How could we have missed this!”
•Any big bug that got away
•Nothing to do with the X-Men
24. RCA Pattern Analysis
•Pattern Analysis requires a lot of bugs
•Pattern Analysis can be done over time
•Pattern Analysis is best served within a formal RCA Program.
▫Cut some of the slides from this presentation
▫The full set of slides can be found in the appendix on the EuroSTAR conference website
25. Phases of an RCA Program
1.Event Identification
2.Data Collection
3.Data Analysis and Assessment
4.Corrective Action
5.Inform and Apply
6.Follow-up, measurement and reporting
26. Phase 2: Data Collection Exercise
•Data Channels 5 Minute Discussion in Groups
▫What are the sources of data in my organization
▫Which are practical
▫Which are the most costly to implement
▫Which are most likely to yield results
▫Do you have time to implement these
28. Phase 2: Data Collection Time to Share
•What sources did you come up with?
29. Phase 2: Data Collection (Sources of Data)
•Defect and Test Case Management tracking system
•Source code repository and Test code coverage data
•Voice of the Customer
▫Product support and Customer or marketing data
▫Individual surveys and interviews
•Findings from previous RCA Studies
•Crash data through Windows Error Reporting
•Services have tickets and data center telemetry
▫Heuristic Data of live site now vs. historic
More about WER @ https://winqual.microsoft.com/
30. Phase 2: Data Collection (Tracking System)
•Prepare a list of Sentinel Events
•Gather and Prepare the Preliminary Data
•Route Single Event through Process
•Create an RCA Tracking Database
Data Elements of RCA Tracking System
•Event or Study ID, Title & Dates
•Related Defect links
•Failure areas and Source Code
•Timeline of events before and after (vital for services)
•Team Contacts and Owners
•RCA Analysts and Contacts
•Expert Groups and Contacts
•Cause of defect and corrective action
•Survey Data and Results on effectiveness of corrective action
•Log Events in RCA system
•Analyze events
•NOTE: Meta Data better suited for lists, documents and shares
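A minimal sketch of how the data elements listed above could map onto one record in an RCA tracking database. The class name, field names, and sample values are all hypothetical; the slides only name the data elements, not a schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class RcaEvent:
    """One row in a hypothetical RCA tracking database."""
    event_id: str                                             # Event or Study ID
    title: str
    opened: date
    defect_links: List[str] = field(default_factory=list)     # related defects
    failure_areas: List[str] = field(default_factory=list)    # failure areas / source code
    timeline: List[str] = field(default_factory=list)         # events before and after
    owners: List[str] = field(default_factory=list)           # team contacts and owners
    analysts: List[str] = field(default_factory=list)         # RCA analysts
    experts: List[str] = field(default_factory=list)          # expert groups
    cause: Optional[str] = None                                # cause of defect
    corrective_action: Optional[str] = None
    survey_results: Optional[str] = None                       # effectiveness follow-up

    def is_analyzed(self) -> bool:
        """Treat an event as analyzed once cause and corrective action are both logged."""
        return bool(self.cause) and bool(self.corrective_action)


# Made-up sample values, for illustration only.
event = RcaEvent(
    event_id="RCA-001",
    title="Example sentinel event",
    opened=date(2010, 1, 15),
    defect_links=["BUG-1234"],
)
print(event.is_analyzed())  # False: no cause or corrective action logged yet
```

Keeping cause and corrective action optional lets an event be logged first and analyzed later, which matches the "Log Events" then "Analyze events" order on the slide.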
31. Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)
Good article from ASQ – http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html
32. Phase III: Data Analysis and Assessment (the Five Whys)
•Brief History - http://en.wikipedia.org/wiki/5_Whys
▫Developed by Sakichi Toyoda
▫First used in Toyota Motor Corporation
▫Common tool within Kaizen, Lean Manufacturing & Six Sigma
•What is it
▫Simply put -ask why 5 times to get to the root cause of a problem
•Fun Example from -http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫why was the website down? The CPU utilization on all our front-end servers went to 100%
▫why did the CPU usage spike? A new bit of code contained an infinite loop!
▫why did that code get written? So-and-so made a mistake
▫why did his mistake get checked in? He didn't write a unit test for the feature
▫why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
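The chain above can be written down as data, which makes it easy to log in an RCA system. This is only a sketch: the `root_cause` and `render` helpers are hypothetical names, and taking the last answer as the root cause is the simplifying assumption behind the technique.

```python
from typing import List, Tuple

# Each step is a (question, answer) pair; the answer prompts the next "why".
WhyChain = List[Tuple[str, str]]

five_whys: WhyChain = [
    ("Why was the website down?",
     "The CPU utilization on all our front-end servers went to 100%"),
    ("Why did the CPU usage spike?",
     "A new bit of code contained an infinite loop"),
    ("Why did that code get written?",
     "So-and-so made a mistake"),
    ("Why did his mistake get checked in?",
     "He didn't write a unit test for the feature"),
    ("Why didn't he write a unit test?",
     "He's a new employee, and he was not properly trained in TDD"),
]


def root_cause(chain: WhyChain) -> str:
    """Return the answer to the last 'why', which the technique treats as the root cause."""
    if not chain:
        raise ValueError("a Five Whys chain needs at least one question")
    return chain[-1][1]


def render(chain: WhyChain) -> str:
    """Format the chain as an indented outline, one line per 'why'."""
    lines = []
    for depth, (question, answer) in enumerate(chain, start=1):
        indent = "  " * (depth - 1)
        lines.append(f"{indent}{depth}. {question} -> {answer}")
    return "\n".join(lines)


print(render(five_whys))
print("Root cause:", root_cause(five_whys))
```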
33. Def. –indulge in something excessively: to take pleasure or be immersed in something in a self-indulgent way
37. Time to Share
•Time for about 2 examples
•What about the 5 Whys worked for you
•Where did it fall short?
38. Phase III: Data Analysis and Assessment (the Five Whys)
•Criticism of five whys
▫Not reproducible across individuals
▫Investigators have been shown to tend to stop at symptoms rather than the root cause
▫Relies upon the investigator's knowledge
39. Phase III: Data Analysis and Assessment (Fishbone Diagram)
•Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram
▫Developed by Kaoru Ishikawa in the 1960s
▫One of the 7 basic quality management tools
•Can use with 5 whys
▫Put each why off the first tree point
▫Ask why for each one of these issues
▫Keep going until you find one or more root causes
•Some industries have common causes mapped to the fishbone
▫Original 4 Ms –Machine, Method, Material, Manpower
▫The 8 Ps (Used in Service Industry) –People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product
▫Ken’s List –People & Training, Tools, Inspection and supervision, Pressure or Stress, Process & Accountability, Recognition & Awareness
40. [Fishbone diagram] Effect: Brownout across 3 largest datacenters
Branches: Pressure or Stress; Recognition & Awareness; Process & Accountability; Tools; Inspection & Supervision; People & Training
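A fishbone diagram is, structurally, one effect with a list of causes under each category branch. The sketch below models that with the six categories from Ken's List; the `Fishbone` class and its methods are hypothetical, and the cause text is a placeholder, not a finding from the actual brownout.

```python
from typing import Dict, List

# Branch categories taken from Ken's List on the slides.
KENS_LIST = [
    "People & Training",
    "Tools",
    "Inspection & Supervision",
    "Pressure or Stress",
    "Process & Accountability",
    "Recognition & Awareness",
]


class Fishbone:
    """One effect (the fish head) with causes grouped under category branches."""

    def __init__(self, effect: str, categories: List[str]) -> None:
        self.effect = effect
        self.branches: Dict[str, List[str]] = {c: [] for c in categories}

    def add_cause(self, category: str, cause: str) -> None:
        """Attach a cause to an existing branch; unknown categories are rejected."""
        if category not in self.branches:
            raise KeyError(f"unknown category: {category}")
        self.branches[category].append(cause)

    def empty_branches(self) -> List[str]:
        """Branches with no causes yet, i.e. areas the group has not examined."""
        return [c for c, causes in self.branches.items() if not causes]


diagram = Fishbone("Brownout across 3 largest datacenters", KENS_LIST)
diagram.add_cause("Tools", "placeholder cause (hypothetical)")
print(diagram.empty_branches())
```

Listing the empty branches is a cheap way to check that each category was at least considered before the analysis is closed.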
41. •Deployment tool changes
▫Warn but do not prevent multi-DC deployments
▫Automatically generate rollback script
▫Cross service monitors will cancel and roll back a bad deployment automatically
•Process changes
▫Deployment code review
▫Deployment checklist
▫Audits and Fire drills
Audited all alerts, escalation aliases and contact #s
Fire drill email and phone
•New Tools
▫Per-Alert fault injection
•Recognition
▫SWAT DRI team for most senior DRIs
42. Fishbone Exercise
•Take 5-10 minutes
•Have a handout for you
•Use the same bug from the five whys exercise
44. Time to Share
•Time to share
▫Who did the same bug as the five whys?
▫Who did a different bug?
•What about the fishbone worked for you?
•Where did it fall short?
45. Phase III: Data Analysis and Assessment (the Fishbone)
•Criticism of Fishbone
▫Requires a lot of experts for each branch
▫Cumbersome
46. Phase V: Inform and Apply
•Host a Management Review
▫Managers will like RCA more than bugs
▫You are eliminating a problem not just finding it
•Implementation is a project, treat it that way
▫Assign Owners
▫Build and Maintain Schedule
▫Create a Feedback Loop
▫Establish a Monthly Status Report
▫Track and correct the corrective action
47. Phase VI: Follow-up, Measurement, and Reporting
•More than Just
•Six Sigma type approaches
•Longitudinal Analysis
▫Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫Study Over Time
•Develop failure types and risk areas/components
•Inspect similar products/areas for baseline
•Gather and inspect process data
•Examine Data for Trends
•Report out
48. Def. –have huge amount of something: to have an ample or excessive supply of something
50. Risks of Root Cause Analysis
•Begins with inadequate data
•Go after too much data too early
•Draws incorrect conclusion or makes invalid recommendations
▫Anyone experience this before
•Focus on the wrong set of defects
•Ends at the wrong level –too early or late
•Investment is not always predictable
▫Can be high cost with low ROI
•Over focus on data can detract from the story
51. Benefits of Structured RCA Study
•Can start as small pilots
•Uses an identical process regardless of type, age or scope of defect
•Avoids repeat failures
•Can be the shortest path to determining and correcting causes of failure
•Lowers Maintenance Costs
•Builds a culture of
▫Accountability
▫Continuous Improvement
52. Achieve Balance
•Full-blown RCA with large pattern analysis rarely meets ROI goals.
•Limit the scope
▫Few Data Sources
▫Beware of the RCA Tax
•Focus on Sentinel Events
▫Provides opportunity for clear, visible wins
▫If it’s a bug that got away you’ll be doing a Post Mortem anyway
▫Sentinel events provide an opportunity to change the dialogue
55. So why a focus on Bugs that got away
•Bugs that got away are Sentinel Events
•They are great stories
▫There is never an end to bugs
•Bug Stories are Organizational Knowledge
•Tribal Knowledge drives organizations
•Stories are powerful change enablers
57. Gloves on the Boardroom Table
•The Heart of Change
▫Requires an emotional component
▫What is more emotional than “How could test miss this bug!”
•Not all change stories involve yelling
•Visual and tactile help too
▫Handout of “Gloves on the boardroom table”
▫john@SAGEKotter.com
“I love your idea. And you have my permission.”
58. Organizational Development
•I worked in Engineering Excellence
▫We were a Performance Improvement organization
▫Enterprise Change Management
•Let me bring in some OD concepts
59. Knowledge Management (KM)
comprises a range of practices used in an organization to identify, create, represent, distribute and enable adoption of insights and experiences.
Such insights and experiences comprise knowledge, either embodied in individuals or embedded in organizational processes or practice.
http://en.wikipedia.org/wiki/Knowledge_management
62. Tribal Knowledge
Institutional memory is a collective set of facts, concepts, experiences and know-how held by a group of people.
http://en.wikipedia.org/wiki/Institutional_memory
63. Organizational Storytelling
The study of organizational storytelling, sometimes called “Narrative Knowledge,” attempts to recount events in the form of a story within the context of an organization.
http://en.wikipedia.org/wiki/Organizational_Storytelling
64. So, what is a bug story?
that should…
be part of the Organizational Narrative Knowledge
65. Springboard Story
•Very simple, very quick, very brief
▫Think elevator ride
•Non-threatening
•Enables listener to visualize
•Catalyzes understanding
•Spark new stories in the mind
•Do not transfer large amounts of information
66. Story Telling Tips
•Brains are not computers
▫Brain Movies –“The brain assembles perceptions by the simultaneous interaction of whole concepts, whole images.”
•The Central Movie –a country or organization
▫Universal Principles –freedom, democracy, constitutional government
▫Long-term goals –education, “life, liberty, pursuit of happiness”
▫Operating methods –free markets, due process, federal and state governments
•Capture the Audience
▫“One time there was this bug we missed…”
•3D Story Telling pg85-87
▫Details (facts, information)
▫Dialogue (characters)
▫Drama (a bug that got away?)
Brain Movies, The Central Movie, and 3D Story Telling from “The Leader’s Voice”
67. Our Last Exercise!
•Your own bug story in 10 minutes
▫Take 10 minutes outlining your story
▫Goal is a 1-2 minute story
Think short and tight
•Remember to
▫Hook the audience
▫3D Storytelling –Details, Dialogue, Drama
▫RCA –what change do you want to convey?
68. My Bug Story -Template
•Title
•The Hook
•Details –Who, what, when, product/project
•Dialogue –Yelling, Crying, Funny?
•Drama –What is the tension? Anyone Fired?
•What were the Root Causes
•What did you change and why?
70. Time to Share
•3 volunteers to come up and tell their bug story
71. Resources
•“The Leader’s Guide to Storytelling” by Steve Denning
▫Resources –http://www.stevedenning.com/launchgifts.html
▫Audio Interview -The knowledge-based organization: Using stories to embody and transfer knowledge
http://www.storytellingwithchildren.com/2008/01/12/steve-denning-the-knowledge-based-organization/
•“The Leader’s Voice” by Crossland & Clark
▫http://roncrossland.com/
•Defect Prevention Chapter 11 RCA
▫http://defectprevention.org
•“The Heart of Change” by Dr. John P. Kotter
▫Gloves story can be found on pages 11-12 http://www.linkageinc.com/pdfs/disl/KotterPG.pdf
75. Root Cause Analysis 300 Level
•Two approaches to RCA
▫Sentinel Event
▫Pattern Analysis
•Formal RCA Program
▫When to do an RCA Study
▫Staffing for Success
▫Phases of an RCA Study
•The Pit and the Pendulum
▫Risks of RCA
▫Benefits of RCA
Based upon Ch. 11
http://defectprevention.org
76. RCA Sentinel Event
A sentinel event is defined by the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) as any unanticipated event in a healthcare setting resulting in death or serious physical or psychological injury to a person or persons,
http://en.wikipedia.org/wiki/Sentinel_event
77. RCA –The Sentinel Event of Bugs
•Home Page of http://wsj.com
•Production Outage
▫I have a lot of these stories
•Security vulnerabilities
•The last bug taken before ship
•“How could we have missed this!”
•Big Bugs that Got Away
78. RCA –Office 14 Sentinel Bug Process
•Why SharePoint as the repository
▫Attachments
▫Collaborating
▫Workflow
▫Reporting Dash
▫Wiki
▫Exchange contacts
▫Offline
•Simple Light Weight Approach
•Focus on recall class bugs from O14 Beta 1
▫Will need the answers anyway to get through triage
▫Usually logged in the bug but not easy to find or learn from
▫No consistent process across teams
•Develop a common template in Word
•Track on a SharePoint site with some meta data
79. Office 14 Root Cause “Template”
•Tenets/Best Practices
•History/Summary
•Bugs
▫Bug number(s)
▫Bug description
•Root Cause Questions
▫Would this get found in our Test Focus/Pass for this area?
▫When did it get broken?
▫Was ownership confused?
▫Would we have assumed that another team would have also seen it?
▫Would it have been reasonable to assume that the fix that caused the regression would have broken this?
▫Would a code review have likely identified the issue?
▫Was there a partner team(s) involved?
▫Were there multiple PRs involved?
▫Was the feature "Hot" coming into the close of the milestone?
•Engineering Recommendations:
▫Recommendation(s)/Owners
▫1.
▫2.
▫3.
80. O14 Example Beta1 End Game
•Word: Japanese Indented Bullets when saved lose their indents
▫Repro:
Set Japanese to be your primary editing language
Create a bulleted list with indents
Save/Close/Re-open
Result: indents are gone
Expect: no loss of indents
▫Happens with all docs created with that setting in 12 and 14
81. O14 Example RCA Recommendations
•Engineering Recommendation:
▫Automate this case and use the code change to inform other automation needed for this area (lists, styles, paragraph props)
▫Ensure that ICTs dogfood the product
▫Make new push for testers to use international settings more frequently, with an eye on Beta2 languages and risks associated with each language equivalence class –we’ll most likely drive a Mini-pass on all our features with this setting for Beta2
▫Add this area to testing executed during regression checks on all style-related fixes.
82. RCA Sentinel Bug Approach
•Big Bugs that got away are Sentinel Events
•One bug is indicative of other risks
•The more big bugs the more patterns
•Nothing to do with X-Men
83. Formal RCA Program(Sentinel Events and Pattern Analysis)
•Started at any time during SDLC
•Often launched after a single expensive bug
▫Security vulnerabilities
▫Production Outage
I have a lot of these stories
•Can be Resource Intensive -so be deliberate
84. Staffing for Success –RCA Study Analyst
•A Single Analyst or a Team
▫Could be you after today
•Senior with wide range of development process knowledge
•Component Level and System Level analysis
•Work with all types -Development, Testing, Program Management, Operations, Support
▫May include marketing and field personnel
•Skills
▫Defect and low-level code analysis
▫Efficiency Diagnosis
▫RCA Analysis and event understanding
▫Algorithm and metric development
▫Data analysis and presentation
85. Phases of an RCA Program
1.Event Identification
2.Data Collection
3.Data Analysis and Assessment
4.Corrective Action
5.Inform and Apply
6.Follow-up, measurement and reporting
86. Phase I: Event Identification
•The Sentinel Event
▫Bug that got away and customer found
▫Does not need to be a defect
▫One or multiple
•Often too many bugs to pick from
▫For an RCA program first establish criteria for a sentinel event
87. Phase I: Event Identification (Sentinel Event Criteria)
•Not all bugs will yield a true “root” cause
•Focus on most severe/undesirable event
▫“I remember this one bug…”
•Risk based assessment criteria
▫Severity
▫Risk of recurrence
▫Cost –actual and opportunity
[Diagram: Sentinel Event Data Channel Loop] Identify Sentinel Event Criteria; Identify Data Channels; Route Single Event through Process; Prepare Data & Map Fields (defect tracking system query); Log Event in RCA Tracking Database; Event to Analyze
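One way to turn the risk-based criteria above (severity, risk of recurrence, actual and opportunity cost) into a repeatable filter is a simple score. Everything specific here is an assumption made for illustration: the 1 to 5 scales, the formula, the threshold of 15, and the sample bugs. A real program would agree on its own criteria first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    bug_id: str
    severity: int          # 1 (minor) .. 5 (critical), hypothetical scale
    recurrence_risk: int   # 1 (unlikely) .. 5 (near certain)
    actual_cost: int       # 1 .. 5
    opportunity_cost: int  # 1 .. 5


def sentinel_score(c: Candidate) -> int:
    """Severity times recurrence risk, plus both cost ratings (illustrative formula)."""
    return c.severity * c.recurrence_risk + c.actual_cost + c.opportunity_cost


def pick_sentinel_events(candidates: List[Candidate], threshold: int = 15) -> List[Candidate]:
    """Keep candidates at or above the threshold, highest score first."""
    qualifying = [c for c in candidates if sentinel_score(c) >= threshold]
    return sorted(qualifying, key=sentinel_score, reverse=True)


# Made-up candidates, for illustration only.
bugs = [
    Candidate("BUG-A", severity=5, recurrence_risk=4, actual_cost=3, opportunity_cost=2),
    Candidate("BUG-B", severity=2, recurrence_risk=2, actual_cost=1, opportunity_cost=1),
    Candidate("BUG-C", severity=4, recurrence_risk=3, actual_cost=2, opportunity_cost=2),
]
for bug in pick_sentinel_events(bugs):
    print(bug.bug_id, sentinel_score(bug))
```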
88. Phase I: Event Identification (Data Channel –Sources of Data)
•Defect and Test Case Management tracking system
•Source code repository and Test code coverage data
•Voice of the Customer
▫Product support and Customer or marketing data
▫Individual surveys and interviews
•Findings from previous RCA Studies
•Crash data through Windows Error Reporting
•Services have tickets and data center telemetry
▫Client and Cloud testing session tomorrow
More about WER @ https://winqual.microsoft.com/
89. Phase I: Event Identification (Tracking System)
•Prepare a list of Sentinel Events
•Gather and Prepare the Preliminary Data
•Route Single Event through Process
•Create an RCA Tracking Database
Data Elements of RCA Tracking System
•Event or Study ID, Title & Dates
•Related Defect links
•Failure areas and Source Code
•Timeline of events before and after (vital for services)
•Team Contacts and Owners
•RCA Analysts and Contacts
•Expert Groups and Contacts
•Cause of defect and corrective action
•Survey Data and Results on effectiveness of corrective action
•Log Events in RCA system
•Analyze events
•NOTE: Meta Data better suited for lists, documents and shares
90. Phase II: Data Collection
•Use Common Sense and Trust Gut Feel
▫“Hey did you hear about the bug…”
▫“I heard BillG was doing a demo when…”
•Use a survey to gather additional data
▫Was this noticed and ignored
▫Is this a common error type
▫Could this have been prevented
•Gather common data on several sentinel events
91. Phase II: Data Collection
•Windows Customized (Visual Studio Team System)
▫Part of Defect Tracking System
▫Connect to source code
▫Attachments
▫Collaborating
▫Workflow
•Windows Vista ran a full RCA program
•Windows 7 moved to ezRCA
▫Cut many of the other data sources
▫Focus on meta data around bugs
92. Windows “ezRCA” Approach
Windows ezRCA Program
The Goal
Reduce Defects Throughout the Product Cycle
The Questions
•What type of defect?
•What phase was the defect introduced?
•What was the extent of the fix?
•How long did it take to fix the defect?
The Source
•Product Studio Extension (Per Bug Report)
Leverage Points
•Distributed Workflow
•Quick and Easy Data Collection
•Aggregate Analysis and Trend Charts
•Subcomponent-Level Data Also Available
•Focus on Individual Improvement
93. Windows EZ RCA Diagnosis
As is / New
•Diagnosis is currently required for all bugs and defaults to NA
•This field should only be activated if the bug is resolved “Fixed” or “Won’t Fix”
•There should be no default value
•Change/combine Hardware & No HW to Hardware Issue
NOTE: Items in RED are new or changed
Assignment Error
Build Error
Concurrency Error
Data Checking Error
Data Corruption
Doc Error
Environment Error
Error Handling Problem
Hardware Issue
Ignored Failure
Incorrect Program State
Interface Error
Missing Method/Function
Logic Error
Not Applicable
Other
Resource Issue
Simple Coding Error
System Error
User Misunderstanding
94. Windows ezRCA Values
•Initial classification of root causes
•Root cause helps us identify the nature of the kinds of mistakes we are making
•This will be a required field for Developers when resolving a bug that is ‘Fixed’ or ‘Won’t Fix’
•This will be a single-select dropdown list and developers will be expected to select the item that is most applicable
•This field is not intended to replace deep RCA studies and more information will likely be required based on analysis of this data
•For gathering further information, use the Prevention Tab, Test Follow-up Tab, and Bug Analysis Tabs in Product Studio or Soapbox (NOTE: Much of this will be consolidated in the future)
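The field rules above (required only when a bug is resolved ‘Fixed’ or ‘Won’t Fix’, single-select, no default) can be sketched as a small validation function. The diagnosis values come from the slide; the function name, the error strings, and the idea of validating on resolve are assumptions, not the actual Product Studio behavior.

```python
from typing import Optional

# Diagnosis values as listed on the "Windows EZ RCA Diagnosis" slide.
DIAGNOSIS_VALUES = frozenset([
    "Assignment Error", "Build Error", "Concurrency Error", "Data Checking Error",
    "Data Corruption", "Doc Error", "Environment Error", "Error Handling Problem",
    "Hardware Issue", "Ignored Failure", "Incorrect Program State", "Interface Error",
    "Missing Method/Function", "Logic Error", "Not Applicable", "Other",
    "Resource Issue", "Simple Coding Error", "System Error", "User Misunderstanding",
])

# The field is only active for these two resolutions.
RESOLUTIONS_REQUIRING_DIAGNOSIS = frozenset(["Fixed", "Won't Fix"])


def validate_diagnosis(resolution: str, diagnosis: Optional[str]) -> Optional[str]:
    """Return an error message, or None when the resolution/diagnosis pair is acceptable."""
    if resolution in RESOLUTIONS_REQUIRING_DIAGNOSIS:
        if diagnosis is None:
            # No default value: the developer has to pick one explicitly.
            return "Diagnosis is required for this resolution"
        if diagnosis not in DIAGNOSIS_VALUES:
            return f"Unknown diagnosis: {diagnosis!r}"
        return None
    if diagnosis is not None:
        return "Diagnosis should only be set for Fixed or Won't Fix bugs"
    return None


print(validate_diagnosis("Fixed", None))
print(validate_diagnosis("Fixed", "Logic Error"))
```

Passing a single optional string rather than a list mirrors the single-select dropdown described on the slide.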
95. Windows Additional RCA data
•Symptom and Prevention categorization
•Link to more info
•Anonymous submission
96. ezRCA Pivot Points
ezRCA
•Data on Lots of Bugs
•Few Questions & Answers
•Quick, Easy
•Fully Distributed
Traditional RCA
•Data on Select Fixed Bugs
•Detailed Analysis of Defect
•Multiple-Data Sources
•Significant Investment
•Can be Resource-Limited
97. Phase II: Data Collection Keys to Success
•For Sentinel Events an open template is fine
•For ezRCA, extend the bug tracking system with ezDataCollection
▫Keep system light weight
▫Limit required fields
▫Provide opportunity to expand within bug
•For Formal RCA will need multiple data sources and extensible schema
•Recommend you start with Sentinel Events and progress to a formal program
98. Keep going with formal RCA
•Some tools you can use with Sentinel Events and ezRCA
•What good tester doesn’t make you wallow in the details?
99. Phase III: Data Analysis and Assessment
•Analysis Performed by
▫RCA Team
▫Research Team
▫Related experts
•Log all outputs in RCA System
•Be judicious with experts' time
100. Phase III: Data Analysis and Assessment (the Five Whys and the Fish Bone)
Good article from ASQ –
http://www.asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html
101. Phase III: Data Analysis and Assessment (the Five Whys)
•Brief History - http://en.wikipedia.org/wiki/5_Whys
▫Developed by Sakichi Toyoda
▫First used in Toyota (Kaizen), Six Sigma tool
•What is it
▫Simply put -ask why 5 times to get to the root cause of a problem
•Fun Example from -http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫why was the website down? The CPU utilization on all our front-end servers went to 100%
▫why did the CPU usage spike? A new bit of code contained an infinite loop!
▫why did that code get written? So-and-so made a mistake
▫why did his mistake get checked in? He didn't write a unit test for the feature
▫why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
•Criticism of five whys
▫Not reproducible across individuals
▫Investigators have been shown to tend to stop at symptoms rather than the root cause
▫Relies upon the investigator's knowledge
102. Phase III: Data Analysis and Assessment (the Five Whys)
•Brief History - http://en.wikipedia.org/wiki/5_Whys
▫Developed by Sakichi Toyoda
▫First used in Toyota Motor Corporation
▫Common tool within Kaizen, Lean Manufacturing & Six Sigma
•What is it
▫Simply put -ask why 5 times to get to the root cause of a problem
•Fun Example from -http://startuplessonslearned.blogspot.com/2008/11/five-whys.html
▫why was the website down? The CPU utilization on all our front-end servers went to 100%
▫why did the CPU usage spike? A new bit of code contained an infinite loop!
▫why did that code get written? So-and-so made a mistake
▫why did his mistake get checked in? He didn't write a unit test for the feature
▫why didn't he write a unit test? He's a new employee, and he was not properly trained in TDD
103. Phase III: Data Analysis and Assessment (Fishbone Diagram)
•Brief History - http://en.wikipedia.org/wiki/Ishikawa_diagram
▫Developed by Kaoru Ishikawa in the 1960s
▫One of the 7 basic quality management tools
•Can use with 5 Whys
▫Put each why off the first tree point
▫Ask why for each one of these issues
▫Keep going until you find one or more root causes
•Some industries have common causes mapped to the fishbone
▫Original 4 Ms –Machine, Method, Material, Manpower
▫The 8 Ps (Used in Service Industry) –People, Process, Policies, Procedures, Price, Promotion, Place/Plant, Product
▫Ken’s List –People, Process, Tools, Accountability, Training, Recognition and awareness, Inspection and supervision, Pressure or Stress
104. Trending Per-Subcomponent
•Trends Matter
▫Uptick Warrants More Investigation?
▫Perform a Traditional RCA for That Set of Events
•Profile
▫The State of the Code
▫Personal Improvements
▫Identify Key Events
Last 5 Weeks
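A rough sketch of the per-subcomponent trend check described on this slide: compare the latest week against the average of the earlier weeks and flag an uptick. The 1.5x factor, the function names, and the weekly counts are all made up for illustration; the slides do not say how an uptick is defined.

```python
from statistics import mean
from typing import Dict, List


def has_uptick(weekly_counts: List[int], factor: float = 1.5) -> bool:
    """True when the latest week exceeds `factor` times the mean of the earlier weeks."""
    if len(weekly_counts) < 2:
        return False  # not enough history to compare against
    *history, latest = weekly_counts
    return latest > factor * mean(history)


def subcomponents_to_investigate(trends: Dict[str, List[int]], factor: float = 1.5) -> List[str]:
    """Names of subcomponents whose latest weekly bug count shows an uptick."""
    return sorted(name for name, counts in trends.items() if has_uptick(counts, factor))


# Hypothetical bug counts per subcomponent for the last 5 weeks (oldest first).
trends = {
    "parser": [2, 3, 2, 2, 6],
    "renderer": [4, 4, 5, 4, 4],
    "installer": [0, 0, 0, 0, 1],
}
print(subcomponents_to_investigate(trends))  # ['installer', 'parser']
```

A flagged subcomponent is only a prompt for a closer look, for example a traditional RCA on that set of events, not a conclusion in itself.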
105. Analysis is not yet at solutions
•Five Whys and Fishbone Diagram help get to root causes
•Data and trending can provide timely alerts and catch regressions
•Root causes are then analyzed for corrective actions
106. Phase III: Analysis is not the solution (Fishbone Diagram)
•Five Whys and Fishbone Diagram are tools to get to root causes
•Data and trending of bugs can provide timely alerts and catch regressions
•Root causes are then analyzed for corrective actions
107. Phase IV: Corrective Actions
•Identify Trends and Group Them into Corrective Themes
▫May be solutions related to Fishbone Diagram mapping buckets
•Meet with the experts again
▫Remember my warning not to burn out your experts
•Determine Prioritization Factors and Costing for Corrective Actions
▫Consider Return on Investment (ROI)
Should have captured direct cost and opportunity cost during Data Collection
▫Speed to implement
▫Likelihood of solution being highly effective
▫Simplicity of solution
▫Is the solution automatable or process driven
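The prioritization factors above can be combined into a simple ranking. This is a sketch under stated assumptions: the weights, the scoring formula, and the two sample actions are invented for illustration, and costs are assumed to be positive numbers captured during Data Collection.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CorrectiveAction:
    name: str
    expected_savings: float  # direct plus opportunity cost avoided
    cost: float              # cost to implement, must be > 0
    weeks_to_implement: int  # speed to implement, must be > 0
    effectiveness: float     # likelihood of being highly effective, 0.0 .. 1.0
    simplicity: int          # 1 (complex) .. 5 (simple)
    automatable: bool        # automatable (True) or process driven (False)


def roi(action: CorrectiveAction) -> float:
    """Return on investment as net savings divided by cost."""
    return (action.expected_savings - action.cost) / action.cost


def priority(action: CorrectiveAction) -> float:
    """Illustrative score: ROI weighted by effectiveness, plus small bonuses."""
    score = roi(action) * action.effectiveness
    score += action.simplicity / 5           # simpler solutions score higher
    score += 1.0 / action.weeks_to_implement  # faster solutions score higher
    if action.automatable:
        score += 0.5
    return score


def rank(actions: List[CorrectiveAction]) -> List[CorrectiveAction]:
    """Highest-priority corrective action first."""
    return sorted(actions, key=priority, reverse=True)


# Made-up numbers, for illustration only.
actions = [
    CorrectiveAction("fault injection tool", 50000, 25000, 10, 0.8, 2, True),
    CorrectiveAction("deployment checklist", 10000, 2000, 1, 0.5, 5, False),
]
for action in rank(actions):
    print(action.name, round(priority(action), 2))
```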
108. Bug Wallow #3: Our Corrective Actions
•Email and Provisioning used Production Data
•Both sanitized the data
•Both impacted production
•What did we change?
▫Stress Tests have no Internet Access
▫Sanitized Date Diff feature
109. Phase V: Inform and Apply
•Host a Management Review
▫Managers will like RCA more than bugs
▫You are eliminating a problem not just finding it
•Implementation is a project, treat it that way
▫Assign Owners
▫Build and Maintain Schedule
▫Create a Feedback Loop
▫Establish a Monthly Status Report
▫Track and correct the corrective action
110. Phase VI: Follow-up, Measurement, and Reporting
•More than Just
•Six Sigma type approaches
•Longitudinal Analysis
▫Draws from Longitudinal Data Analysis - http://gseacademic.harvard.edu/alda/
▫Study Over Time
•Develop failure types and risk areas/components
•Inspect similar products/areas for baseline
•Gather and inspect process data
•Examine Data for Trends
•Report out
111. Flatonium2007
•Need to insert video
•20 new machines added to the data center
•5 machines put into production early
•Machines needed to be Nuked-N-Paved (NNP)
•Oops
113. Risks of Root Cause Analysis
•Begins with inadequate data
•Go after too much data too early
•Draws incorrect conclusion or makes invalid recommendations
▫Anyone experience this before
•Focus on the wrong set of defects
•Ends at the wrong level –too early or late
•Investment is not always predictable
▫Can be high cost with low ROI
•Over focus on data can detract from the story
114. Benefits of Structured RCA Study
•Can start as small pilots
•Uses an identical process regardless of type, age or scope of defect
•Avoids repeat failures
•Can be the shortest path to determining and correcting causes of failure
•Lowers Maintenance Costs
•Builds a culture of
▫Accountability
▫Continuous Improvement