The (ir)rational
incident response
The psychology behind production incidents
November 2021
Boris Cherkasky
About me
➔ Backend engineer and Production advocate @Riskified
➔ I 🤍 Observability
➔ Scuba Diver
@cherkaskyb on twitter / linkedin / medium
Agenda
01 The psychology of an incident response
02 Intro to cognitive biases and heuristics
03 Biases in production
Riskified by the numbers
650+ Global team, nearly 50% in R&D
180+ Countries across the globe
$60B+ Online volume reviewed in 2020
50+ Publicly held companies among our clients
98%+ Client retention for the past 2 years
As of August 2021
The Anatomy of production incidents
[Chart: Certainty over Time]
Cognitive biases and heuristics 101
X Not it
Heuristics /
Cognitive Biases
Mental “shortcuts”
A Radio Commercial
Loan at an interest rate of 0.5%
lower than the bank’s
The Anchoring Bias
Real life,
production,
Heuristics
and biases
Processes Monitoring
Alerting
Design
Optimal
decision
making
Business process
[Diagram: a business process running from Step 1 and Step 2 through Step 7 to a Final Step, fed by External Data sources 1 and 2 and Internal Data sources 1 and 2. A second build of the slide omits Internal Data source 2 and adds the labels “No service” and “Inaccurate results”.]
Analysis
Paralysis
Prioritize your SLIs
Mitigating the Analysis Paralysis
Latency
Availability
Data integrity
Accuracy
A math lesson
Alert #8973: Latency error:
avg(avg_over_time(latency))
+ 2 * stddev(avg_over_time(latency)) > 18
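One way to read the alert expression above, as a minimal sketch in Python rather than in the monitoring system's own query language. It assumes avg_over_time averages each instance's latency over a time window, and that avg and stddev then aggregate across instances, which is how functions with these names behave in Prometheus-style query languages. The slide does not name the system, the window, or the unit of 18, and the instance names and samples below are hypothetical.

```python
from statistics import fmean, pstdev

THRESHOLD = 18  # taken from the slide; its unit is not stated there


def alert_value(windows):
    """Return mean + 2 * stddev of the per-instance average latencies.

    `windows` maps a (hypothetical) instance name to the latency samples
    it reported inside the evaluation window.
    """
    per_instance = [fmean(samples) for samples in windows.values()]
    # Population standard deviation across instances (an assumption).
    return fmean(per_instance) + 2 * pstdev(per_instance)


def alert_fires(windows, threshold=THRESHOLD):
    return alert_value(windows) > threshold


# Hypothetical data: three instances, three samples each.
healthy = {"api-1": [10, 11, 12], "api-2": [11, 12, 13], "api-3": [12, 13, 14]}
one_slow = {"api-1": [10, 11, 12], "api-2": [11, 12, 13], "api-3": [29, 30, 31]}

print(alert_fires(healthy))   # False
print(alert_fires(one_slow))  # True
```

In the second data set the fleet-wide average latency is still below 18. The alert fires because one instance is far from the others, which is not what a responder would guess from the title "Latency error".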
Curse of
Knowledge
Alerts and metrics should be set
by “the common” responder,
mentored by the expert
Mitigating Curse of Knowledge
When complex alerts can’t be
avoided - document, explain, train,
level UP your organization
[Diagram, shown in four builds: API gateway, Monolith, Monolithic DB, Shared Storage, and two Configuration boxes. From the second build on, a “403 - Forbidden” response appears.]
Simulation
heuristic
Mitigating the Simulation heuristic
Set the responder on the correct path as
soon as possible, with minimal friction
Minimize the time
to start triage
[Charts: CPU usage and Available Memory, each plotted from 7:30 to 8:00, with the values 2% (CPU usage) and 200MB (Available Memory) called out.]
Confirmation
bias
Show simple and standardized data
Mitigating the confirmation bias
● Don’t work alone
● Draw a concrete line between the observed
facts, your hypothesis, and the existing state
(outcome/outage)
Mitigating the confirmation bias
Cheatsheet
● Keep anything production simple
● Specific alerts, standardized dashboards
● Prioritise SLIs (SLI pyramid)
● Normalize production status with the “average” responder
Boris Cherkasky
cherkaskyb@gmail.com
@cherkaskyb / Twitter / medium / linkedIn
Thank You
For Your Time!

THE (IR)RATIONAL INCIDENT RESPONSE: HOW PSYCHOLOGICAL BIASES AFFECT INCIDENT RESPONSE, BORIS CHERKASKY, Riskified Tech

Editor's Notes

  1. Hey everyone! I wanna take you to one of the most memorable evenings I’ve had as an on-call engineer. It was a Tuesday. I’m at home, watching a movie with a warm pizza in my hand, when I get an alert from our production system. That alert turned a warm Tuesday night into a four-hour-long nightmare. At 1 AM, after everything is back to normal, I look back at the incident, and my only thought is - this could have been solved in 20 minutes. My greatest lesson from that Tuesday was that even the most experienced responder can make wrong decisions. Those wrong decisions, and what causes them, are the topic of this talk.
  2. So thank you for having me. I’m Boris Cherkasky, and I’ve been breaking, fixing, and monitoring production systems for the last four years. I’m a backend engineer and production advocate at Riskified, and I’m generally fascinated by observability. I love scuba diving, and I write a small tech blog that you can find under my handle, cherkaskyb, on most social networks and on Medium.
  3. This talk is a journey into how our minds work, and how they “play tricks” on us during production incidents. We’ll start with some anatomy of the incident response process, then we’ll do a short introduction to cognitive biases and heuristics, and the majority of the talk will focus on real-life incident examples. In those examples we’ll cover how cognitive biases manifest in production incidents, and how we can mitigate them effectively.
  4. A few words about Riskified. Riskified enables top brands to fulfill their maximal e-commerce potential by leveraging AI to help with fraud prevention and other financial funnel optimisations. About 60 billion dollars’ worth of orders from all around the world goes through our systems. Let’s get back to the previously mentioned Tuesday. The incident started with total uncertainty - I got an alert from one of our most reliable systems, one that had malfunctioned only once in the last four years. I later learned that this uncertainty is what allows psychological biases to thrive and affect our decision making.
  5. Let’s make a hypothetical chart of certainty over time during my incident, and I’ll walk you through the decisions made in those four hours. I first decided to go to a dashboard and not to the logs - that’s a decision. It happens to be a good one - I see some irregularities, and my certainty grows. I then get a message from a colleague who reports some partial information that contradicts what I am seeing, and our certainty plummets. We scratch our heads for a minute and open a more specific dashboard. It puts us on the path that one of our dependencies is impaired, and our certainty grows. We think we know what the root cause is, and we DECIDE to restart an instance - it doesn’t help, and our certainty plummets once again. And this goes on until the incident is resolved four hours later - a ping-pong game between certainty and uncertainty. Because of this uncertainty, at each decision point we are prone to cognitive biases that are supposed to ease our decision-making process but in fact might cause us to make the wrong decisions.
  6. Let’s begin our journey by defining what heuristics and biases are.
  7. Since I’m gonna be talking about psychology, I am first legally obligated to mention that I am not a trained psychiatrist.
  8. When talking about decision making under uncertainty, two terms come to mind - heuristics and cognitive biases. To simplify this talk I’ll refer to both as “biases”; the actual difference between them is not critical here. Biases are mental patterns - shortcuts our mind takes to simplify the complex task of decision making. A good analogy for a bias is branch prediction in the CPU - it’s a “calculated shortcut” the CPU takes. It often works, but when it doesn’t, a bad call was made and we need to roll back to the correct state.
  9. Let’s start off with an example. A radio commercial might state: “Boris Insurance Inc. offers a loan at an interest rate 0.5% lower than the bank’s.” Even that one small sentence has biases in it, put there on purpose to help you make the complex “loan decision”.
  10. In this example it’s the anchoring bias - a cognitive bias where an individual’s decisions are influenced by a particular reference point, or “anchor”. Our mind is now anchored to the interest rate the bank is offering - every decision we make is anchored to this reference point, regardless of whether it’s a good reference point or not. And as you can probably guess, it’s probably not.
  11. But this was a commercial. What happens in production? What biases are we prone to there?
  12. Before we dive into this, we have to understand that biases will come to life at our weakest point - an incident. There’s no way around it. What we can do is limit the effect, or volume, of those biases by preparing for them early on in our development process: designing “bias-proof” systems, maintaining a “bias-aware” environment with each change we deliver <PAUSE> using effective alerting and monitoring, and creating bias-reducing response procedures. By applying the measures I’m gonna discuss, you can benefit from faster incident resolution, lower frustration within the response team, and maybe even get back to your pizza while it’s still warm.
  13. The incidents you’re about to see were managed by trained professionals - do not try them at home (or at work). Each example I’m about to cover is a real-life incident where our responders got blindsided by a bias that affected their behaviour.
  14. The first incident is one where one of our most critical data sources failed. The backup didn’t work well enough, and our whole business process came to a halt. I’m sure you have a business process similar to ours in your system too - one that is composed of steps, each of which can fail. One of the responders suggested we disable the data source and run without it. <CLICK>
  15. In the “fog of war” it sounded like a solid idea, since it would bring the service back to life. The first time we almost turned it off, 30 minutes into the incident, one of the analysts mentioned it would breach the SLA for one of our top customers. The second time we almost turned it off, 50 minutes into the incident, it was the head of engineering who mentioned it would cause downstream pressure on other systems. The more this idea floated around the room, the more stakeholders spoke up about the impact it would cause - from accuracy and latency to legal and operational. This. Was. Just. Frustrating. No one could make the call, while in the background the whole business was impaired. About two hours into the incident, the idea was escalated to the Chief of Operations, who made the decision - we can’t turn it off - and we waited for the underlying issue to resolve.
  16. To clarify - we had two options. First - wait for the underlying cause to resolve, and have no service until then. Second - turn off the failing service, which would cause multiple issues around accuracy, SLA, and more. Both are bad. But NOT EQUALLY; we just didn’t know which one was worse, and needed the Chief of Operations to decide. Our response team was paralysed; the decision couldn’t be made.
  17. This is analysis paralysis. It manifested in our inability to make the call to turn off the data source. Analysis paralysis is a psychological effect where the more knowledge we have, the harder it is to make a decision - all the alternatives and outcomes are being weighed without ever coming to a decision. In our incident, the whole business was impaired, therefore the response team was huge, with additional stakeholders flooding the room. It took two hours, with more than 10 people involved, to make that decision. The response team wasn’t independent in its decision making. So, how can we give our response team its independence back?
  18. It’s not always possible, but if we start at the requirements and design phase, we can define an order of importance for our system’s SLIs - a pyramid. When the priority is explicitly defined, the SLIs at the top of the pyramid will be sacrificed to secure the SLIs at the base of the pyramid, and the response team is free to independently make fast decisions to mitigate degradations.
  19. In our case, the pyramid would have stated that the most important SLI was accuracy, so we’d have known we can’t sacrifice accuracy for availability. This example shows how a process of “bias proofing” in the design phase can later help mitigate those effects in production incidents. We’ve touched on what we can do in the design phase; let’s now see how we deliver features and define alerts on them.
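The pyramid can also be written down as data, so a responder can check a proposed mitigation against it without escalating. The following is a minimal sketch only: the ordering is an assumption read from the slide (accuracy at the base, latency at the top), and `SLI_PYRAMID`, `rank` and `is_allowed` are hypothetical names, not part of any real tool.

```python
from typing import List

# SLI priority pyramid, listed from the base (protect first) to the top
# (sacrifice first). The exact order is an assumption for illustration.
SLI_PYRAMID = ["accuracy", "data_integrity", "availability", "latency"]


def rank(sli: str) -> int:
    """Position in the pyramid: 0 is the base, larger numbers are nearer the top."""
    return SLI_PYRAMID.index(sli)


def is_allowed(sacrificed: List[str], protected: List[str]) -> bool:
    """A mitigation is allowed only if every SLI it sacrifices sits above
    every SLI it protects in the pyramid."""
    if not sacrificed:
        return True   # nothing is given up, so there is nothing to object to
    if not protected:
        return False  # giving something up while protecting nothing
    lowest_sacrificed = min(rank(s) for s in sacrificed)
    highest_protected = max(rank(p) for p in protected)
    return lowest_sacrificed > highest_protected


# The two options from the incident:
wait_for_fix = is_allowed(["availability", "latency"], ["accuracy"])
disable_source = is_allowed(["accuracy"], ["availability"])
print(wait_for_fix, disable_source)
```

With this ordering, waiting for the underlying fix is allowed and disabling the data source is not, which matches the decision that took two hours to reach.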
  20. This example is about a single alert. This one. <point up> I’ll give you a few seconds to look at it; it’s pseudo-PromQL (the Prometheus query language). Now, let me have your attention again. You’re now as confused as I was. It’s 10 in the evening and all I know is that there is an “issue with the service’s latency”, and I’m in what appears to be a math lesson! It’s been 10 years since I last saw standard deviations, and I have no clue what 2 standard deviations are. By a show of hands, how many of you know what two standard deviations are? <CLICK>
  21. I started this incident with Google and Wikipedia, trying to understand what this alert means. How did we get here? How did this alert find its way to production when I have no idea what to do with it?
  22. My surprising midnight math lesson is the result of a bias called the “curse of knowledge”. It has 3 main effects. The first effect is the tendency to assume that knowledge one possesses is common knowledge - the author of my alert thought it’s basic knowledge that 2 standard deviations above the mean in a normal distribution is roughly the 98th percentile, so my incident is, in fact, a roughly p98 latency increase. The second effect is the inability to roll back to your “unknowing” state - this is why teaching is so hard, and one of the reasons getting new shifters to be experienced and confident is complex. The third effect is that predicting another person's actions is highly biased towards one's own knowledge of the issue - this is why writing runbooks is hard. Runbooks are documented recipes for mitigating an incident. The curse of knowledge is why many runbooks have “missing steps” and implicit knowledge. How can we mitigate this? We obviously want all our responders to be experts! Knowledge is a good thing!
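For responders who, like me, last saw standard deviations years ago, here is a quick check of the statistics behind the alert. It is a minimal sketch, assuming latency is roughly normally distributed (real latency often is not); `share_below_two_sigma` and `alert_fires` are hypothetical helper names, and the threshold of 18 is taken from the alert on the slide.

```python
from statistics import NormalDist, mean, pstdev


def share_below_two_sigma() -> float:
    """Fraction of a normal distribution that lies below mean + 2 standard deviations."""
    return NormalDist().cdf(2.0)


def alert_fires(latency_averages, threshold=18.0):
    """Plain-Python reading of the pseudo-PromQL condition:
    avg(...) + 2 * stddev(...) > threshold.
    pstdev is the population standard deviation."""
    return mean(latency_averages) + 2 * pstdev(latency_averages) > threshold


print(round(share_below_two_sigma(), 4))  # about 0.9772
print(alert_fires([10, 12, 16, 18]))      # spread-out averages push it past 18
print(alert_fires([10, 11, 12, 13]))      # tightly grouped averages stay below 18
```

Under the normality assumption, "mean plus two standard deviations" sits at about the 97.7th percentile, so the alert is a roundabout way of saying that high-percentile latency crossed the threshold.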
  23. To mitigate the curse of knowledge, make sure your alerting and monitoring layer is built by the “average” responder - in other words, normalize the expertise level you need to the average you have. Have your experts review their work and train the team, but avoid having that “one monitoring person” in your systems. If complex monitoring and alerting can’t be avoided, document them thoroughly - again, by the “average” responder. Train, and level up your organization. Don’t let your responders learn during incidents. We’ve now touched on how writing alerts and monitoring can be affected by biases; in the next example we’ll dive even deeper into monitoring.
  24. The next incident manifested two biases, and was one of my most painful production incidents. We’re gonna talk about those biases one by one. Let’s first talk about the system at hand. We started implementing a new generation of services in a microservice architecture; to do so, we needed configurations stored in our monolithic main database. So we created an ETL process that exposed those configurations in a shared storage for all relevant services to use. One of the configurations there <PAUSE> was a highly sensitive map of which features are enabled for each of our customers.
  25. The incident started with elevated error rates on the API layer - we were rejecting API calls - some customers were being refused key features of our product. The alerts originated from the API layer, and knowing the process, I started simulating what could be the cause of this behaviour. I suspected a performance degradation in the shared storage. We decided to manually re-run the ETL, and indeed it solved the issue. For an hour. I don’t know if you’ve ever experienced a P1 incident that you thought you’d solved coming back to haunt you, after you’ve already notified the business and management that everything is back to normal. It’s a really uncomfortable feeling, one that made me doubt my engineering skills.
  26. I gathered some of my teammates and we started the investigation again, and another hour in, my teammates found the root cause - a bug in the replication process in the ETL.
  27. My teammates found it, but not me. For that whole hour, while they were going through the code and logs, I was digging into that shared storage, proving (mostly to myself) why it was indeed in a degraded state. I couldn’t hide my surprise! A BUG in a component that had been working smoothly for more than a year, with no scale or any other change. It was literally among the last things on my list of possible root causes, somewhere around cosmic radiation. This incident was a grueling process of checking each component in this flow one by one. The errors originated from the actual SLI that was degraded, but the issue was far upstream. Why did I dig so deep into that datastore? Why couldn’t I see I was on the wrong path?
  28. I was deeply affected by the simulation bias. The simulation bias states that one's judgments are biased towards information that is easily imagined or mentally simulated. And I was simulating the datastore as the cause. It’s important to mention that simulation is subjective - what I can simulate, others maybe can’t. This is why I wasn’t able to simulate a bug, and our data engineers probably wouldn’t have simulated a database performance issue. Simulation causes high friction with the production system, and in my case, focus on the wrong elements. So, what can we do to control what our responders simulate? This sounds like a challenging task.
  29. The problem with the response process was the high surface area between the alert and the system. The alert was at the end of a long chain of components, and each one needed to be checked to find the issue. It took many steps to get to the root cause, and the time spent in that process is time spent simulating wrong paths. Firstly, we need to set the responder on the correct path as soon as possible, and we should aim to do so with minimal friction. So alerts and monitors should be VERY specific, and set on every dependency and key SLI; we’d better have 20 simple, specific alerts than 1 catch-all alert. If the alert had been on the ETL process, chances are I’d have started by digging into its logs, rather than working my way back from the API through the shared datastore. Secondly, we need to minimize the time to start triage. Time spent without data is time spent simulating. If possible, incident insights should be pushed to the responder, instead of waiting for the responder to pull the information they need. That means charts and logs related to the incident can be automatically added to the incident at hand (most incident management tools support such integration to some extent). So much for simulation, and for reducing the friction and surface area of the alert. We’re now ready to talk about the final bias, which also attacked me during this incident.
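The “surface area” of an alert can be made concrete by counting how many components a responder has to walk back through before reaching the faulty one. This is a minimal sketch, assuming a simple linear dependency chain modelled on this incident; `CHAIN`, `components_to_check` and `first_failing` are hypothetical names for illustration only.

```python
from typing import Dict, List, Optional

# Dependency chain from the incident, listed from upstream to downstream.
CHAIN = ["monolithic_db", "etl", "shared_storage", "api"]


def components_to_check(alerting_component: str) -> List[str]:
    """Components a responder may need to inspect, starting at the component
    that raised the alert and walking upstream."""
    idx = CHAIN.index(alerting_component)
    return list(reversed(CHAIN[: idx + 1]))


def first_failing(health: Dict[str, bool]) -> Optional[str]:
    """With one specific check per component, return the most upstream
    unhealthy component, or None if everything is healthy."""
    for component in CHAIN:
        if not health.get(component, True):
            return component
    return None


print(components_to_check("api"))  # the whole chain is in scope
print(components_to_check("etl"))  # a much shorter walk
print(first_failing({"etl": False, "api": False}))
```

An alert raised only at the API puts all four components in scope; an alert on the ETL itself leaves two, and a per-component check points straight at the ETL even though the API is failing too.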
  30. I know what you’re thinking - I am an experienced responder, I should base my decisions on concrete data. I'm not gonna lie - I did; I had data to support my hypothesis. How was I able to “prove” that the datastore was the issue when it wasn’t? I went straight to the source, to the datastore metrics. There it was, my smoking gun - the CPU usage increased, and available memory decreased!
  31. In fact, my dangerous increase in CPU was only 2%. And the memory? A drop of only 200MB on a large instance. The axes in the dashboards were dynamic, but I rushed into action and missed that. This misinterpreted data was enough to convince me that the database was indeed the issue, and to send me on a wild goose chase while my teammates were actually narrowing in on the issue. Why was it so easy for me, an experienced responder with vast <PAUSE> daily mileage with observability tools, to misinterpret what I was seeing?
  32. It comes down to the confirmation bias - we seek information that reinforces our existing positions. We come to a conclusion first and then try to find information that fits it. We ignore contradicting information, and interpret ambiguous information in our favor. When you think you know what the incident is, it’s easy to find patterns that reinforce it. In my case, a 2% increase in CPU confirmed a wrong hypothesis, and literally made me useless in this incident. Now that we know what that bias is, how can we mitigate it?
  33. First, let’s talk about how we show our data. Keep it simple! Show <PAUSE> simple <PAUSE> data. Complex data is easy to “manipulate” or misinterpret. That includes sensible legends, colors and scales - errors should be red, throughput probably green. In my case, if I’d shown the CPU usage percentage with a static scale of 0 to 100, there would have been no visible change at all, not to mention a spike.
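To see why the scale matters, compare how much of the chart's height the same change occupies on a dynamic axis and on a static 0 to 100 axis. This is a minimal sketch: the 40% baseline and the 10% padding are assumptions for illustration (only the 2-point increase comes from the incident), and `auto_scaled_axis` and `visible_fraction` are hypothetical helpers, not the behaviour of any specific dashboard tool.

```python
def auto_scaled_axis(values, padding=0.1):
    """Axis limits that hug the data, roughly as a dynamic scale would
    (the padding value is an assumption)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid a zero-height axis for flat data
    return lo - padding * span, hi + padding * span


def visible_fraction(before, after, axis_min, axis_max):
    """Share of the chart height taken up by the change from before to after."""
    return abs(after - before) / (axis_max - axis_min)


cpu_before, cpu_after = 40.0, 42.0  # assumed baseline; a 2-point increase

dyn_min, dyn_max = auto_scaled_axis([cpu_before, cpu_after])
dynamic = visible_fraction(cpu_before, cpu_after, dyn_min, dyn_max)
static = visible_fraction(cpu_before, cpu_after, 0.0, 100.0)
print(f"dynamic axis: {dynamic:.0%} of chart height, static axis: {static:.0%}")
```

On the dynamic axis the 2-point change fills most of the chart and looks like a spike; on the static axis it takes up about 2% of the height and is barely visible.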
  34. One more thing about simplicity - show data with as few dimensions as possible - dimensions are complex! The same goes for multiple axes on a single chart, elaborate coloring schemes, and heatmaps. I’m not saying NOT to use those, but be very aware when you do! Next, standardize your data! When the majority of dashboards are similar, looking into any dashboard during an incident will feel familiar to the responder, thus reducing the chance of misinterpretation. In my case, I rarely used CloudWatch for metrics, therefore I wasn’t fully aware that the scale there is dynamic.
  35. That’s about data; now a bit about process. Don’t <PAUSE> work <PAUSE> alone. Incident response is a team effort - show meaningful data that reinforces your positions to your teammates - convince the “unconfirmed” responder that you are correct. Draw a concrete line between the observed facts, your hypothesis, and the existing production state. Chances are that any of my teammates who had seen the CPU and memory charts would have smiled and pointed out my mistake.
  36. These are all the examples I have for you today. To wrap things up, I’d like to show a short cheat sheet that can help handle some of the biases we’ve talked about: keep anything production simple; specific alerts; standardized dashboards; normalize production status with the “average” responder; prioritise SLIs (the SLI pyramid).
  37. I write about the connection between software and psychology on my blog from time to time, and I’ll repost those on Twitter and LinkedIn, so if you’ve found this talk interesting, be sure to check it out. That’s all I have for you today. It’s been a pleasure, thank you for your time!