Katarzyna Balcerzak Root Cause Analysis

Learning from mistakes: root cause analysis in practice.

Published in: Internet

  1. 1. ROOT CAUSE ANALYSIS @kasia_balcerzak LEARN BY FAILING
  2. 2. WHO AM I? ➤ Quality Engineer @ Spartez ➤ Testing traveler in Atlassian products. ➤ Root Cause Analysis Facilitator. ➤ Teaches developers how to test.
  3. 3. What can we do to find similar bugs? What is the root of bugs?
  4. 4. OH S**T…
  5. 5. ➤ CREATE USER ➤ ASSIGN USER ➤ PROCESSING EVENTS QUEUE
  6. 6. PROCESSING EVENTS QUEUE t1 t2 t2 t1
  7. 7. WHOSE FAULT WAS IT?
  8. 8. PLAY THE BALL, NOT THE PLAYER!
  9. 9. create timeline
  10. 10. GATHER FACTS TO BE OBJECTIVE IN ANALYSIS.
  11. 11. ➤ DEVELOPER FINDS DIFFERENCE IN DEFAULT QUEUE IMPLEMENTATION ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO ➤ SUPPORT TICKET FROM CANARY ➤ CHANGE RELEASED TO PRODUCTION ➤ ENTERPRISE CUSTOMERS REPORT ISSUES, REOPENED SUPPORT CASE ➤ VERSION ROLLED BACK
  12. 12. ➤ meet the team ➤ create timeline
  13. 13. INVITE PEOPLE WHO… ➤ introduced the problem ➤ detected the problem ➤ communicated the problem ➤ investigated the problem ➤ are otherwise interested
  14. 14. ➤ EVERYONE IS TRULY AN EXPERT ➤ THERE ARE STORIES TO UNCOVER ➤ EACH STORY IS ESSENTIAL
  15. 15. ➤ DEVELOPER FINDS DIFFERENCE IN DEFAULT QUEUE IMPLEMENTATION ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO ➤ SUPPORT TICKET FROM CANARY ➤ CHANGE RELEASED TO PRODUCTION ➤ ENTERPRISE CUSTOMERS REPORT ISSUES, REOPENED SUPPORT CASE ➤ VERSION ROLLED BACK
  16. 16. ➤ DEVELOPER FINDS DIFFERENCE IN DEFAULT QUEUE IMPLEMENTATION ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO ➤ SUPPORT TICKET FROM CANARY ➤ CHANGE RELEASED TO PRODUCTION ➤ ENTERPRISE CUSTOMERS REPORT ISSUES, REOPENED SUPPORT CASE ➤ VERSION ROLLED BACK
  17. 17. ➤ DEVELOPER FINDS DIFFERENCE IN DEFAULT QUEUE IMPLEMENTATION ➤ SUPPORT TICKET CAN’T BE REPRODUCED; MARKED AS NOT RELATED TO CHANGE ➤ TESTING REPLACED WITH DOG FOODING AND MONITORING ➤ CHANGE RELEASED TO CANARY ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO ➤ TESTING IGNORED; DIDN’T PERFORM IMPACT ANALYSIS ➤ SUPPORT TICKET FROM CANARY ➤ CHANGE RELEASED TO PRODUCTION ➤ ENTERPRISE CUSTOMERS REPORT ISSUES, REOPENED SUPPORT CASE ➤ VERSION ROLLED BACK
  18. 18. GOOD? BAD? LUCKY? ➤ Developer quickly identified the problem with the queue. ➤ Audit log helped rerun failed operations. ➤ No error message was shown, which made it easy to ignore the initial support ticket.
  19. 19. ➤ meet the team ➤ find causal factors ➤ create timeline
  20. 20. WHAT CONDITIONS ENABLED THE PROBLEM?
  21. 21. CAUSAL FACTOR != ROOT CAUSE
  22. 22. ➤ INCIDENT: CAR ACCIDENT ➤ CAUSAL FACTOR: DRIVER DID NOT KEEP CAR ON ROAD ➤ ROOT CAUSE: HIGH SPEED WITH ICE ON THE ROAD
  23. 23. ➤ DEVELOPER FINDS DIFFERENCE IN DEFAULT QUEUE IMPLEMENTATION ➤ SUPPORT TICKET CAN’T BE REPRODUCED; MARKED AS NOT RELATED TO CHANGE ➤ TESTING REPLACED WITH DOG FOODING AND MONITORING ➤ CHANGE RELEASED TO CANARY ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO ➤ TESTING IGNORED; DIDN’T PERFORM IMPACT ANALYSIS ➤ SUPPORT TICKET FROM CANARY ➤ CHANGE RELEASED TO PRODUCTION ➤ ENTERPRISE CUSTOMERS REPORT ISSUES, REOPENED SUPPORT CASE ➤ VERSION ROLLED BACK ➤ EVENT QUEUE LIBRARY UPGRADED: LIFO OVER FIFO
  24. 24. ➤ meet the team ➤ find causal factors ➤ find roots ➤ create timeline
  25. 25. Event queue library updated: LIFO over FIFO. ➤ Why? Developers didn’t know how an external library could impact the application. ➤ Why? There weren’t any tests for external libraries. ➤ Why? Libraries’ responsibilities were unclear. ➤ Why? Expectations of libraries aren’t written down as tests when libraries are introduced. ➤ Why? Adding contract tests for external libraries is not common practice and lacks a procedure.
  26. 26. ➤ meet the team ➤ define causal factors ➤ define roots ➤ create timeline ➤ create actions
  27. 27. PREVENTIVE ACTIONS / CORRECTIVE ACTIONS ➤ CONTRACT TESTS FOR NEW LIBRARIES ➤ CONTRACT TESTS FOR ALL EXTERNAL LIBRARIES ➤ RISK ANALYSIS PERFORMED WHEN UPGRADING LIBRARY ➤ DISPLAY ERROR MESSAGE TO USER
  28. 28. ➤ Create realistic actions ➤ Establish mutual agreement ➤ Assign owners
  29. 29. Does it end now?
  30. 30. ➤ ANALYSE ROOT CAUSE ➤ DEFINE ACTIONS ➤ APPROVE ACTIONS ➤ REVIEW RESULTS
  31. 31. TRY AGAIN. FAIL AGAIN. TEST BETTER. Kasia Samuel Beckett*
  32. 32. ROOT CAUSE ANALYSIS FOR DUMMIES ➤ Bad thing happens… ➤ Resolve it. ➤ Schedule time for analysis and book a whiteboard. ➤ Invite interested people. ➤ Work up the chain to find the root cause. ➤ List improvements to avoid future failures of the system. ➤ Identify owners. Assign tasks. Set time expectations. Get agreement. ➤ Publish post-mortem. ➤ Follow up on actions and share your learnings.
  33. 33. THANKS! @kasia_balcerzak
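
Slides 25 and 27 propose contract tests for external libraries. Below is a minimal sketch in Python of what such a test might look like for the incident described in slides 5, 6 and 11. The deck does not name the actual library or language, so Python's standard `queue.Queue` stands in for the event queue library; `make_event_queue`, `drain` and the event names are hypothetical and used only for illustration.

```python
import queue

# Hypothetical event names, borrowed from the "CREATE USER / ASSIGN USER" slide.
EVENTS = ["create_user", "assign_user"]


def make_event_queue():
    """Hypothetical factory: the single place where the application picks its queue.

    queue.Queue is an assumption standing in for the external library's default.
    """
    return queue.Queue()


def drain(q, events):
    """Enqueue all events, then return them in the order the queue releases them."""
    for event in events:
        q.put(event)
    return [q.get() for _ in events]


def test_event_queue_is_fifo():
    """Contract: events leave the queue in the order they entered it."""
    assert drain(make_event_queue(), EVENTS) == EVENTS


def test_lifo_queue_would_break_the_contract():
    """Failure mode from the timeline: a LIFO queue releases assign_user first."""
    assert drain(queue.LifoQueue(), EVENTS) == ["assign_user", "create_user"]
```

The test functions follow the naming convention that runners such as pytest collect. If a library upgrade silently switched the default from FIFO to LIFO, the first test would be expected to fail before release rather than surfacing as a support ticket.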
