The Art of Debugging
Avishai Ish-Shalom
github.com/avishai-ish-shalom | @nukemberg | avishai.is@wix.com
Randomly rebooting stuff
until it works
01
Please don’t
▪ Randomly reboot stuff
▪ Randomly look at metrics
▪ Blindly believe the monitor/metrics
▪ Give the loudest screamer all the attention
▪ Overemphasize the importance of the latest data
Correlation does not mean causation
▪ Some problems are transient
▪ Some causes are transient, but the system doesn’t recover
▪ Especially when EVERYONE is doing things together
● 1 out of 1000 drivers is drunk
● The breathalyzer detects all drunks but has 5% false positives
● Drivers are stopped at random
A driver is pulled over and the breathalyzer shows he’s drunk.
What’s the probability he’s really drunk?
If you answered 0.95, you have fallen for
THE BASE RATE FALLACY
Correct answer: ~0.02
Explanation
In a random sample of 1000 drivers, 1 would be drunk and 49.95 (999 x 0.05) would falsely test as drunk.
Rate of false positives (~50/1000) >> rate of drunk drivers (P(drunk) = 1/1000)
So P(drunk | positive test) = 1 / (1 + 49.95) ≈ 0.02
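The ~0.02 answer can be checked with Bayes' rule. The sketch below is only an illustration: the object and parameter names are made up for this example, and the numbers are the ones from the breathalyzer slide.

```scala
object BaseRate {
  /** P(condition | positive test) computed with Bayes' rule. */
  def posterior(prior: Double, sensitivity: Double, falsePositiveRate: Double): Double = {
    val truePositives  = prior * sensitivity               // drunk and flagged
    val falsePositives = (1.0 - prior) * falsePositiveRate // sober but flagged
    truePositives / (truePositives + falsePositives)
  }

  def main(args: Array[String]): Unit = {
    // 1 in 1000 drivers is drunk, every drunk is detected, 5% false positives.
    val p = posterior(prior = 0.001, sensitivity = 1.0, falsePositiveRate = 0.05)
    println(s"P(drunk | positive test) = $p") // roughly 0.02
  }
}
```

Raising the prior (for example, stopping only drivers who swerve) is what moves the posterior, not a better breathalyzer alone.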
Errors are “normal”
▪ Any big enough system has statistical errors/variations
▪ If you randomly search for errors, you will find some somewhere!
▪ And they will probably be unrelated
▪ The base rate of errors >> rate of incidents
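A rough way to see why random searching always turns something up: assume each signal you look at has some small, independent chance of looking anomalous for reasons unrelated to the incident. The names and the independence assumption below are illustrative only.

```scala
object SpuriousErrors {
  /** Expected number of signals that look anomalous purely by chance. */
  def expectedFalseAlarms(signals: Int, falseAlarmRate: Double): Double =
    signals * falseAlarmRate

  /** Chance that at least one signal looks anomalous, assuming independence. */
  def probabilityOfAtLeastOne(signals: Int, falseAlarmRate: Double): Double =
    1.0 - math.pow(1.0 - falseAlarmRate, signals.toDouble)

  def main(args: Array[String]): Unit = {
    // Hypothetical numbers: 1000 metrics, each with a 5% chance of looking odd.
    println(expectedFalseAlarms(1000, 0.05))
    println(probabilityOfAtLeastOne(20, 0.05))
  }
}
```

Even 20 randomly chosen dashboards are more likely than not to show at least one "anomaly" under these assumptions.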
Without theory,
data is useless and
misleading
Debugging 101
02
The methodology:
0. Triage
1. Define and narrow down the symptoms
2. Build a (mental) model of the system
3. Deconstruct, create/revise a theory
4. Corroborate the theory
5. Reconstruct and validate
6. Rinse and repeat
0. Triage
▪ What is the business impact?
▪ Is it actually a problem?
▪ Should we handle this?
▪ Should we handle this now?
1. Define and narrow down the symptoms
▪ Can you recreate the issue?
▪ Isolate the offending conditions
▪ Operational definition of the problem
When? Where? To whom? Under which conditions?
E.g.: HTTP GET /bla returns 500, from the US only, with any cookie, user-agent, etc.
Examples
■ GET /bla returns 500 from all countries, for any headers, in 5% of cases
■ p99 of transaction X has been over 500ms in every time interval for the past hour; it should be under 100ms
■ Transaction Y for user XXX returns empty records; it should return 100 records
2. The Mental Model
▪ We have one for anything we interact with
▪ Implicit assumptions about how things work
▪ Sometimes wrong
We need to make the model explicit!
The Mental Model (example)
How does an oven work?
■ Temperature knob
■ Timer knob
The Mental Model (example)
Motorcycle
counter-steering
When things don’t make sense
3 options:
▪ We are missing data
▪ The data we have is wrong
▪ Our mental model is wrong
3. Deconstruct, create/revise a theory
▪ Disassemble the system into sub-systems
▪ How are they connected? What is the input/output of each one?
▪ Which sub-system(s) is the cause of the problem?
▪ Or maybe the connection is the problem?
Draw a diagram!
4. Corroborate the theory
▪ Define what metrics/experiments you need to prove/disprove
the theory*
▪ Define the expected results
▪ Get the data
* More on that later
Confirmation bias ahead!
Confirmation bias
“Confirmation bias is the tendency to search for, interpret,
favor, and recall information in a way that confirms one's
preexisting beliefs or hypotheses”
Always decide what to check before you see the results!
5. Reconstruct and validate
▪ Validate system invariants (e.g. Little’s Law)
▪ Compare the data to expected results
▪ Reconstruct the system from sub-systems
▪ Remember the integration points!
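As one example of validating an invariant, Little's Law says the mean number of requests in the system equals throughput times mean latency (L = λW). The sketch below compares a measured in-flight count against that product; the function names and the 20% tolerance are assumptions for illustration.

```scala
object LittlesLaw {
  /** Expected mean number of requests in the system: L = lambda * W. */
  def expectedInFlight(throughputPerSec: Double, meanLatencySec: Double): Double =
    throughputPerSec * meanLatencySec

  /** True when the measured concurrency is within a relative tolerance of lambda * W. */
  def isConsistent(measuredInFlight: Double,
                   throughputPerSec: Double,
                   meanLatencySec: Double,
                   tolerance: Double = 0.2): Boolean = {
    val expected = expectedInFlight(throughputPerSec, meanLatencySec)
    math.abs(measuredInFlight - expected) <= tolerance * expected
  }
}
```

If 200 requests/s at a 50ms mean latency should keep about 10 requests in flight but the service reports 40, either a measurement is wrong or the model of the system is.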
6. Rinse and repeat
▪ Problem found? Excellent!
▪ Problem narrowed down to a sub-system? You are now debugging it. Go back to #1
▪ Problem not found? Revise your theory/model and go back to #2
Knowing what to
look at
03
What are we looking for?
▪ Should we look for outliers? High values? Low values?
▪ Averages? Sums? Percentiles?
▪ Error counts?
What are we looking at?
▪ How was this measured? Client side? Server side?
▪ What aggregation method?
▪ Metrics artifacts?
Choose your metric!
▪ Looking for outliers? - percentiles, max, most deviant
▪ Utilization? Totals? - averages, sums
▪ Errors? - error counters
▪ Baselines? - compare with previous timeframe, average, median
▪ Trends? - moving averages (low pass filters)
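For the trends case, a simple moving average is the most basic low-pass filter: it smooths out short spikes so the slower trend is visible. A minimal sketch, with an illustrative function name and window size chosen by the caller:

```scala
object Trends {
  /** Simple moving average; yields xs.length - window + 1 points (none if xs is shorter than the window). */
  def movingAverage(xs: Seq[Double], window: Int): Seq[Double] = {
    require(window > 0, "window must be positive")
    xs.sliding(window)
      .filter(_.length == window) // drop the partial group produced for short inputs
      .map(_.sum / window)
      .toVector
  }
}
```

A larger window smooths more but also reacts later to a real change, so the window should be short relative to the trend you are trying to see.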
Looking at latency
▪ Percentiles for effect on clients
▪ Sums/Percentiles for used resources
▪ Averages for Little’s Law and the USE method
▪ Bounds (timeouts, max, min)
▪ Always check throughput, frequency
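The difference between percentiles and averages is easy to see on a small sample. The sketch below uses the nearest-rank percentile definition (one of several in common use) and made-up latencies; all names are illustrative.

```scala
object Latency {
  /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p > 0 && p <= 100, "need samples and 0 < p <= 100")
    val sorted = samples.sorted
    val rank   = math.ceil(p * sorted.length / 100.0).toInt
    sorted(rank - 1)
  }

  def mean(samples: Seq[Double]): Double = samples.sum / samples.length

  def main(args: Array[String]): Unit = {
    // Hypothetical data: 98 fast requests (10ms) and 2 slow ones (1000ms).
    val samples = Seq.fill(98)(10.0) ++ Seq.fill(2)(1000.0)
    println(s"mean=${mean(samples)} p50=${percentile(samples, 50)} p99=${percentile(samples, 99)}")
  }
}
```

The mean sits between the two groups and describes no actual request, while p50 and p99 show what typical and unlucky clients experienced.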
Looking at errors
▪ Compare with baseline
▪ Compare with normal throughput
▪ Statistical error? -> is there a statistically significant difference?
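One common way to ask whether an error rate really differs from the baseline is a two-proportion z-test. This is a sketch of that standard test, not necessarily what the talk had in mind; the names and the 1.96 threshold (about a 5% two-sided significance level) are assumptions.

```scala
object ErrorRates {
  /** z statistic of a two-proportion z-test with pooled variance. NaN when both samples have no errors. */
  def zScore(baselineErrors: Long, baselineTotal: Long,
             currentErrors: Long, currentTotal: Long): Double = {
    val p1     = baselineErrors.toDouble / baselineTotal
    val p2     = currentErrors.toDouble / currentTotal
    val pooled = (baselineErrors + currentErrors).toDouble / (baselineTotal + currentTotal)
    val stdErr = math.sqrt(pooled * (1.0 - pooled) * (1.0 / baselineTotal + 1.0 / currentTotal))
    (p2 - p1) / stdErr
  }

  /** |z| above the threshold suggests the difference is unlikely to be noise. */
  def isSignificant(z: Double, threshold: Double = 1.96): Boolean =
    math.abs(z) > threshold
}
```

With 10,000 requests in each window, going from 50 to 55 errors is well within noise, while going from 50 to 120 is not.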
High cardinality data
▪ Top K
▪ Most deviant
▪ Unique counts
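For high-cardinality dimensions such as user IDs or URLs, Top K and unique counts can be sketched in a few lines. This in-memory version is only an illustration with made-up names; real systems usually rely on their metrics store or on approximate algorithms.

```scala
object HighCardinality {
  /** The k most frequent keys with their counts, most frequent first (order among ties is unspecified). */
  def topK[A](items: Seq[A], k: Int): Seq[(A, Int)] =
    items
      .groupBy(identity)
      .map { case (key, group) => key -> group.size }
      .toSeq
      .sortBy { case (_, count) => -count }
      .take(k)

  /** Number of distinct values seen. */
  def uniqueCount[A](items: Seq[A]): Int = items.distinct.size
}
```

For example, running `topK` over the user IDs attached to failed requests quickly shows whether a handful of users account for most of the errors.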
The USE method
Utilization - RPM, % busy
Saturation - Queue length, load shed
Errors - error count
The USE method - Brendan Gregg
Q&A
Bonus round (a war story)
OMG, it’s broken!
When in doubt, reboot
▪ Recurring issue, service rebooted instead of debugged
▪ “Known” issue, but nobody understood why
Amazing correlation!
Once again, with methodology
▪ All of the errors were “business errors”, unrelated.
▪ Use the profiler, Luke!
▪ Analyze the results
▪ Construct a theory
Getting data
import java.util.UUID
import scala.annotation.tailrec

case class DomainDTO(domainName: String, redirectDomain: Option[String], siteGuid: Option[UUID])

// Follows the redirect chain until it reaches a domain with no redirect.
// Nothing tracks visited domains, so a redirect loop never terminates.
@tailrec
private def isConnectedToASite(userDomains: List[DomainDTO])(domain: DomainDTO): Boolean =
  domain.redirectDomain match {
    case Some(redirectToDomain) =>
      val redirectToInfoOpt = userDomains.find(_.domainName == redirectToDomain)
      redirectToInfoOpt match {
        case Some(redirectToInfo) => isConnectedToASite(userDomains)(redirectToInfo)
        case None                 => false
      }
    case None => domain.siteGuid.isDefined
  }
But but but…
▪ We don’t have redirect loops!
▪ And even if we did, why wasn’t the CPU released after the timeout?
We needed to revise our mental model!
Validating the theory
▪ Queried 400k records
▪ Wrote a program that traversed redirects and detected loops
▪ Redirect loops found! (~30)
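The talk does not show the program that found the loops. A minimal sketch of such a check, assuming the records fit in memory and reusing the shape of `DomainDTO` from the earlier slide, could track visited domain names while following redirects. The object and function names are hypothetical.

```scala
import java.util.UUID
import scala.annotation.tailrec

object RedirectLoops {
  // Same fields as the DomainDTO shown in the talk.
  final case class DomainDTO(domainName: String, redirectDomain: Option[String], siteGuid: Option[UUID])

  /** True when following redirects from `start` revisits a domain already seen on the way. */
  def hasRedirectLoop(byName: Map[String, DomainDTO], start: DomainDTO): Boolean = {
    @tailrec
    def walk(current: DomainDTO, visited: Set[String]): Boolean =
      if (visited.contains(current.domainName)) true
      else current.redirectDomain.flatMap(byName.get) match {
        case Some(next) => walk(next, visited + current.domainName)
        case None       => false // chain ends: no redirect, or target unknown
      }
    walk(start, Set.empty)
  }

  /** Names of all domains whose redirect chain runs into a cycle. */
  def domainsInLoops(domains: Seq[DomainDTO]): Seq[String] = {
    val byName = domains.map(d => d.domainName -> d).toMap
    domains.filter(hasRedirectLoop(byName, _)).map(_.domainName)
  }
}
```

The same visited set, added to `isConnectedToASite`, would make that function terminate on looping data instead of spinning forever.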
Reproducing the error
Recap
▪ The methodology proved itself (again)
▪ Correlation is not causation
▪ In data we trust (unless we have a good reason not to)
Thank You