Debugging under fire: Keeping your head when systems have lost their mind

B
Debugging under fire
Keeping your head when

systems have lost their mind
CTO
bryan@joyent.com
Bryan Cantrill
@bcantrill
The genesis of an outage
“Please don’t be me, please don’t be me”
“…doesn’t begin to describe it”
“WHEE!”
“Fat-finger”?
• Not just a “fat-finger”; even this relatively simple failure reflected
deeper complexities:















• Outage was instructive — and lucky — on many levels…
It could have been much worse!
• The (open source!) software stack that we have developed to
run our public cloud, Triton, is a complicated distributed system
• Compute nodes are PXE booted from the headnode with a
RAM-resident platform image
• It seemed entire conceivable that the services needed to boot
compute nodes would not be able to start because a compute
node could not boot…
• This was a condition we had tested, but at nowhere near the
scale — this was a failure that we hadn’t anticipated!
How did we get here?
• Software is increasingly delivered as part of a service
• Software configuration, deployment and management is
increasingly automated
• But automation is not total: humans are still in the loop, even
if only developing software
• Semi-automated systems are fraught with peril: the arrogance
and power of automation — but with human fallibility
Human fallibility in semi-automated systems
Human fallibility in semi-automated systems
Whither microservices?
• Microservices have yielded simpler components — but more
complicated systems
• …and open source has allowed us to deploy many more
kinds of software components, increasing complexity again
• As abstractions become more robust, failures become rare,
but arguably more acute: service outage is more likely due to
cascading failure in which there is not one bug but several
• That these failures may be in discrete software services
makes understanding the system very difficult…
The Microservices Complexity Paradox
The Microservices Complexity Paradox
an active shooter
Modern software failure modes
An even more apt metaphor
A mechanical distributed system
But… but… alerts and monitoring!


“It is a difficult thing to look at a winking light on a board,
or hear a peeping alarm — let alone several of them —
and immediately draw any sort of rational picture of
something happening”
— Nuclear Regulatory Commission’s Special Report

on incident at Three Mile Island
The debugging imperative
• We suffer from many of the same problems as nuclear power
in the 1970s: we are delivering systems that we think can’t fail
• In particular, distributed systems are vulnerable to software
defects — we must be able to debug them in production
• What does it mean to develop software to be debugged?
• Prompts a deeper question: how do we debug, anyway?
Debugging in the abstract
• Debugging is the process by which we understand
pathological behavior in a software system
• It is not unlike the process by which we understand the
behavior of a natural system — a process we call science
• Reasoning about the natural world can be very difficult:
experiments are expensive and even observations can be
very difficult
• Physical science is hypothesis-centric
The exceptionalism of software
• Software is entirely synthetic — it is mathematical machine!
• The conclusions of software debugging are often
mathematical in their unequivocal power!
• Software is so distilled and pure — experiments are so cheap
and observation so limitless — that we can structure our
reasoning about it differently
• We can understand software by simply observing it
The art of debugging
• The art of debugging isn’t to guess the answer — it is to be
able to ask the right questions to know how to answer them
• Answered questions are facts, not hypotheses
• Facts form constraints on future questions and hypotheses
• As facts beget questions which beget observations and more
facts, hypotheses become more tightly constrained — like a
cordon being cinched around the truth
The craft of debuggable software
• The essence of debugging is asking and answering questions
— and the craft of writing debuggable software is allowing the
software to be able to answer questions about itself
• This takes many forms:
• Designing for postmortem debuggability
• Designing for in situ instrumentation
• Designing for post hoc debugging
A culture of debugging
• Debugging must be viewed as the process by which systems
are understood and improved, not merely as the process by
which bugs are made to go away!
• Too often, we have found that beneath innocent wisps of
smoke lurk raging coal infernos
• Engineers must be empowered to understand anomalies!
• Engineers must be empowered to take the extra time to build
for debuggability — we must be secure in the knowledge that
this pays later dividends!
Debugging during an outage
• When systems are down, there is a natural tension: do we
optimize for recovery or understanding?
• “Can we resume service without losing information?”
• “What degree of service can we resume with minimal loss
of information?”
• Overemphasizing recovery with respect to understanding may
leave the problem undebugged or (worse) exacerbate the
problem with a destructive but unrelated action
The peril of overemphasizing recovery
• Recovery in lieu of understanding normalizes broken software
• If it becomes culturally engrained, the dubious principle of
software recovery has toxic corollaries, e.g.:
• Software should tolerate bad input (viz. “npm isntall”)
• Software should “recover” from fatal failures (uncaught
exceptions, segmentation violations, etc.)
• Software should not assert the correctness of its state
• These anti-patterns impede debuggability!
Debugging after an outage
• After an outage, we must debug to complete understanding
• In mature systems, we can expect cascading failures —
which can be exhausting to fully unwind
• It will be (very!) tempting after an outage to simply move on,
but every service failure (outage-inducing or not) represents
an opportunity to advance understanding
• Software engineers must be encouraged to understand their
own failures to encourage designing for debuggability
Enshrining debuggability
• Designing for debuggability effects true software robustness:
differentiating operational failure from programmatic ones
• Operational failures should be handled; programmatic failures
should be debugged
• Ironically, the more software is designed for debuggability the
less you will need to debug it — and the more you will
leverage it to debug the software that surrounds it
Debugging under fire
• It will always be stressful to debug a service that is down
• When a service is down, we must balance the need to restore
service with the need to debug it
• Missteps can be costly; taking time to huddle and think can
yield a better, safer path to recovery and root-cause
• In massive outages, parallelize by having teams take different
avenues of investigation
• Viewing outages as opportunities for understanding allows us
to develop software cultures that value debuggability!
Hungry for more?
• If you are the kind of software engineer who values
debuggability — and loves debugging — Joyent is hiring!
• If you have not yet hit your Cantrillian LD50, I will be joining
Brigit Kromhout, Andrew Clay Shafer, Matt Stratton as “Old
Geeks Shout At Cloud”
• Thank you!
1 of 29

Recommended

Debugging (Docker) containers in production by
Debugging (Docker) containers in productionDebugging (Docker) containers in production
Debugging (Docker) containers in productionbcantrill
5.1K views42 slides
Zebras all the way down: The engineering challenges of the data path by
Zebras all the way down: The engineering challenges of the data pathZebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data pathbcantrill
17.2K views22 slides
Debugging microservices in production by
Debugging microservices in productionDebugging microservices in production
Debugging microservices in productionbcantrill
7K views45 slides
Oral tradition in software engineering: Passing the craft across generations by
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsbcantrill
4.2K views22 slides
Platform as reflection of values: Joyent, node.js, and beyond by
Platform as reflection of values: Joyent, node.js, and beyondPlatform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyondbcantrill
13.2K views39 slides
The State of Cloud 2016: The whirlwind of creative destruction by
The State of Cloud 2016: The whirlwind of creative destructionThe State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destructionbcantrill
6.5K views19 slides

More Related Content

What's hot

The Hurricane's Butterfly: Debugging pathologically performing systems by
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsbcantrill
5.7K views27 slides
Visualizing Systems with Statemaps by
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
4.8K views31 slides
Principles of Technology Leadership by
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadershipbcantrill
5.4K views39 slides
The Internet-of-things: Architecting for the deluge of data by
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of databcantrill
3.8K views31 slides
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail by
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of FailSteve Poole
264 views42 slides
OSCON 2013: "Case Study: What to do when your project outgrows your company" by
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"The Linux Foundation
2.1K views42 slides

What's hot(20)

The Hurricane's Butterfly: Debugging pathologically performing systems by bcantrill
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
bcantrill5.7K views
Visualizing Systems with Statemaps by bcantrill
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
bcantrill4.8K views
Principles of Technology Leadership by bcantrill
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
bcantrill5.4K views
The Internet-of-things: Architecting for the deluge of data by bcantrill
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of data
bcantrill3.8K views
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail by Steve Poole
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Steve Poole264 views
OSCON 2013: "Case Study: What to do when your project outgrows your company" by The Linux Foundation
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"
Microservices for Mortals by Bert Ertman at Codemotion Dubai by Codemotion Dubai
 Microservices for Mortals by Bert Ertman at Codemotion Dubai Microservices for Mortals by Bert Ertman at Codemotion Dubai
Microservices for Mortals by Bert Ertman at Codemotion Dubai
Codemotion Dubai373 views
Graphel: A Purely Functional Approach to Digital Interaction by mtrimpe
Graphel: A Purely Functional Approach to Digital InteractionGraphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital Interaction
mtrimpe714 views
The economies of scaling software - Abdel Remani by jaxconf
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
jaxconf346 views
Cerebro general overiew eng by CineSoft
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew eng
CineSoft13.5K views
The Reactive Principles: Design Principles For Cloud Native Applications by Jonas Bonér
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native Applications
Jonas Bonér1.3K views
Xen Project Contributor Training Part2 : Processes and Conventions v1.1 by The Linux Foundation
Xen Project Contributor Training Part2 : Processes and Conventions v1.1Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
The Linux Foundation75.6K views
Tiger oracle by d0nn9n
Tiger oracleTiger oracle
Tiger oracle
d0nn9n301 views
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v... by Clinton Wolfe
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Clinton Wolfe140 views
VMUG - My Journey to Full Stack Engineering by Chris Wahl
VMUG - My Journey to Full Stack EngineeringVMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack Engineering
Chris Wahl1.5K views
Accidental Architecture 0.9 by Mark Cathcart
Accidental Architecture 0.9Accidental Architecture 0.9
Accidental Architecture 0.9
Mark Cathcart718 views

Similar to Debugging under fire: Keeping your head when systems have lost their mind

DevOps Days Vancouver 2014 Slides by
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesAlex Cruise
2.6K views46 slides
No silver bullet by
No silver bulletNo silver bullet
No silver bulletGhufran Hasan
2.1K views22 slides
Empirical Methods in Software Engineering - an Overview by
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
585 views165 slides
Mastering Microservices 2022 - Debugging distributed systems by
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsBert Jan Schrijver
425 views33 slides
I Think About My Biggest Weakness by
I Think About My Biggest WeaknessI Think About My Biggest Weakness
I Think About My Biggest WeaknessGina Buck
2 views159 slides
dist_systems.pdf by
dist_systems.pdfdist_systems.pdf
dist_systems.pdfCherenetToma
2 views13 slides

Similar to Debugging under fire: Keeping your head when systems have lost their mind(20)

DevOps Days Vancouver 2014 Slides by Alex Cruise
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
Alex Cruise2.6K views
Empirical Methods in Software Engineering - an Overview by alessio_ferrari
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
alessio_ferrari585 views
Mastering Microservices 2022 - Debugging distributed systems by Bert Jan Schrijver
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systems
Bert Jan Schrijver425 views
I Think About My Biggest Weakness by Gina Buck
I Think About My Biggest WeaknessI Think About My Biggest Weakness
I Think About My Biggest Weakness
Gina Buck2 views
The 7 quests of resilient software design by Uwe Friedrichsen
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
Uwe Friedrichsen8.2K views
Scaling a Web Site - OSCON Tutorial by duleepa
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
duleepa7.2K views
Non-Functional Requirements by David Simons
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
David Simons2.4K views
GOTO night April 2022 - Debugging distributed systems by Bert Jan Schrijver
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems by Bert Jan Schrijver
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
Bert Jan Schrijver137 views
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb by Daniel Zivkovic
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
Daniel Zivkovic458 views
No Silver Bullet - Essence and Accidents of Software Engineering by Aditi Abhang
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software Engineering
Aditi Abhang116 views
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey by MongoDB
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB329 views
RedisConf17 - Observability and the Glorious Future by Redis Labs
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
Redis Labs405 views
Software Defects and SW Reliability Assessment by Kristine Hejna
Software Defects and SW Reliability AssessmentSoftware Defects and SW Reliability Assessment
Software Defects and SW Reliability Assessment
Kristine Hejna192 views
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my! by VMware Tanzu
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
VMware Tanzu4.7K views
devops, microservices, and platforms, oh my! by Andrew Shafer
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
Andrew Shafer2.7K views

More from bcantrill

Predicting the Present by
Predicting the PresentPredicting the Present
Predicting the Presentbcantrill
30 views17 slides
Sharpening the Axe: The Primacy of Toolmaking by
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmakingbcantrill
248 views23 slides
Coming of Age: Developing young technologists without robbing them of their y... by
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...bcantrill
370 views21 slides
I have come to bury the BIOS, not to open it: The need for holistic systems by
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsbcantrill
1.6K views20 slides
Towards Holistic Systems by
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systemsbcantrill
5.7K views13 slides
The Coming Firmware Revolution by
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolutionbcantrill
1.2K views12 slides

More from bcantrill(20)

Predicting the Present by bcantrill
Predicting the PresentPredicting the Present
Predicting the Present
bcantrill30 views
Sharpening the Axe: The Primacy of Toolmaking by bcantrill
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
bcantrill248 views
Coming of Age: Developing young technologists without robbing them of their y... by bcantrill
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
bcantrill370 views
I have come to bury the BIOS, not to open it: The need for holistic systems by bcantrill
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
bcantrill1.6K views
Towards Holistic Systems by bcantrill
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
bcantrill5.7K views
The Coming Firmware Revolution by bcantrill
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
bcantrill1.2K views
Hardware/software Co-design: The Coming Golden Age by bcantrill
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
bcantrill1.9K views
Tockilator: Deducing Tock execution flows from Ibex Verilator traces by bcantrill
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
bcantrill475 views
No Moore Left to Give: Enterprise Computing After Moore's Law by bcantrill
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
bcantrill4.1K views
Andreessen's Corollary: Ethical Dilemmas in Software Engineering by bcantrill
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
bcantrill2.2K views
Platform values, Rust, and the implications for system software by bcantrill
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
bcantrill6.8K views
Is it time to rewrite the operating system in Rust? by bcantrill
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
bcantrill27.4K views
dtrace.conf(16): DTrace state of the union by bcantrill
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
bcantrill836 views
Papers We Love: ARC after dark by bcantrill
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
bcantrill3.1K views
Down Memory Lane: Two Decades with the Slab Allocator by bcantrill
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocator
bcantrill3.2K views
The Container Revolution: Reflections after the first decade by bcantrill
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decade
bcantrill5.4K views
Papers We Love: Jails and Zones by bcantrill
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zones
bcantrill4.2K views
Why it’s (past) time to run containers on bare metal by bcantrill
Why it’s (past) time to run containers on bare metalWhy it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metal
bcantrill3.5K views
Run containers on bare metal already! by bcantrill
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!
bcantrill2.5K views
A crime against common sense by bcantrill
A crime against common senseA crime against common sense
A crime against common sense
bcantrill3.9K views

Recently uploaded

Introduction to Maven by
Introduction to MavenIntroduction to Maven
Introduction to MavenJohn Valentino
6 views10 slides
Page Object Model by
Page Object ModelPage Object Model
Page Object Modelartembondar5
6 views5 slides
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptxanimuscrm
15 views19 slides
JioEngage_Presentation.pptx by
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptxadmin125455
6 views4 slides
What is API by
What is APIWhat is API
What is APIartembondar5
10 views15 slides
nintendo_64.pptx by
nintendo_64.pptxnintendo_64.pptx
nintendo_64.pptxpaiga02016
5 views7 slides

Recently uploaded(20)

2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx by animuscrm
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
2023-November-Schneider Electric-Meetup-BCN Admin Group.pptx
animuscrm15 views
JioEngage_Presentation.pptx by admin125455
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptx
admin1254556 views
Introduction to Git Source Control by John Valentino
Introduction to Git Source ControlIntroduction to Git Source Control
Introduction to Git Source Control
John Valentino6 views
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P... by NimaTorabi2
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
NimaTorabi215 views
predicting-m3-devopsconMunich-2023-v2.pptx by Tier1 app
predicting-m3-devopsconMunich-2023-v2.pptxpredicting-m3-devopsconMunich-2023-v2.pptx
predicting-m3-devopsconMunich-2023-v2.pptx
Tier1 app9 views
Top-5-production-devconMunich-2023.pptx by Tier1 app
Top-5-production-devconMunich-2023.pptxTop-5-production-devconMunich-2023.pptx
Top-5-production-devconMunich-2023.pptx
Tier1 app8 views
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated... by TomHalpin9
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
Dev-HRE-Ops - Addressing the _Last Mile DevOps Challenge_ in Highly Regulated...
TomHalpin96 views
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra... by Marc Müller
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra....NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
Marc Müller41 views

Debugging under fire: Keeping your head when systems have lost their mind

  • 1. Debugging under fire Keeping your head when
 systems have lost their mind CTO bryan@joyent.com Bryan Cantrill @bcantrill
  • 2. The genesis of an outage
  • 3. “Please don’t be me, please don’t be me”
  • 4. “…doesn’t begin to describe it”
  • 6. “Fat-finger”? • Not just a “fat-finger”; even this relatively simple failure reflected deeper complexities:
 
 
 
 
 
 
 
 • Outage was instructive — and lucky — on many levels…
  • 7. It could have been much worse! • The (open source!) software stack that we have developed to run our public cloud, Triton, is a complicated distributed system • Compute nodes are PXE booted from the headnode with a RAM-resident platform image • It seemed entire conceivable that the services needed to boot compute nodes would not be able to start because a compute node could not boot… • This was a condition we had tested, but at nowhere near the scale — this was a failure that we hadn’t anticipated!
  • 8. How did we get here? • Software is increasingly delivered as part of a service • Software configuration, deployment and management is increasingly automated • But automation is not total: humans are still in the loop, even if only developing software • Semi-automated systems are fraught with peril: the arrogance and power of automation — but with human fallibility
  • 9. Human fallibility in semi-automated systems
  • 10. Human fallibility in semi-automated systems
  • 11. Whither microservices? • Microservices have yielded simpler components — but more complicated systems • …and open source has allowed us to deploy many more kinds of software components, increasing complexity again • As abstractions become more robust, failures become rare, but arguably more acute: service outage is more likely due to cascading failure in which there is not one bug but several • That these failures may be in discrete software services makes understanding the system very difficult…
  • 13. The Microservices Complexity Paradox an active shooter
  • 15. An even more apt metaphor
  • 17. But… but… alerts and monitoring! 
 “It is a difficult thing to look at a winking light on a board, or hear a peeping alarm — let alone several of them — and immediately draw any sort of rational picture of something happening” — Nuclear Regulatory Commission’s Special Report
 on incident at Three Mile Island
  • 18. The debugging imperative • We suffer from many of the same problems as nuclear power in the 1970s: we are delivering systems that we think can’t fail • In particular, distributed systems are vulnerable to software defects — we must be able to debug them in production • What does it mean to develop software to be debugged? • Prompts a deeper question: how do we debug, anyway?
  • 19. Debugging in the abstract • Debugging is the process by which we understand pathological behavior in a software system • It is not unlike the process by which we understand the behavior of a natural system — a process we call science • Reasoning about the natural world can be very difficult: experiments are expensive and even observations can be very difficult • Physical science is hypothesis-centric
  • 20. The exceptionalism of software • Software is entirely synthetic — it is mathematical machine! • The conclusions of software debugging are often mathematical in their unequivocal power! • Software is so distilled and pure — experiments are so cheap and observation so limitless — that we can structure our reasoning about it differently • We can understand software by simply observing it
  • 21. The art of debugging • The art of debugging isn’t to guess the answer — it is to be able to ask the right questions to know how to answer them • Answered questions are facts, not hypotheses • Facts form constraints on future questions and hypotheses • As facts beget questions which beget observations and more facts, hypotheses become more tightly constrained — like a cordon being cinched around the truth
  • 22. The craft of debuggable software • The essence of debugging is asking and answering questions — and the craft of writing debuggable software is allowing the software to be able to answer questions about itself • This takes many forms: • Designing for postmortem debuggability • Designing for in situ instrumentation • Designing for post hoc debugging
  • 23. A culture of debugging • Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away! • Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos • Engineers must be empowered to understand anomalies! • Engineers must be empowered to take the extra time to build for debuggability — we must be secure in the knowledge that this pays later dividends!
  • 24. Debugging during an outage • When systems are down, there is a natural tension: do we optimize for recovery or understanding? • “Can we resume service without losing information?” • “What degree of service can we resume with minimal loss of information?” • Overemphasizing recovery with respect to understanding may leave the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action
  • 25. The peril of overemphasizing recovery • Recovery in lieu of understanding normalizes broken software • If it becomes culturally engrained, the dubious principle of software recovery has toxic corollaries, e.g.: • Software should tolerate bad input (viz. “npm isntall”) • Software should “recover” from fatal failures (uncaught exceptions, segmentation violations, etc.) • Software should not assert the correctness of its state • These anti-patterns impede debuggability!
  • 26. Debugging after an outage • After an outage, we must debug to complete understanding • In mature systems, we can expect cascading failures — which can be exhausting to fully unwind • It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding • Software engineers must be encouraged to understand their own failures to encourage designing for debuggability
  • 27. Enshrining debuggability • Designing for debuggability effects true software robustness: differentiating operational failure from programmatic ones • Operational failures should be handled; programmatic failures should be debugged • Ironically, the more software is designed for debuggability the less you will need to debug it — and the more you will leverage it to debug the software that surrounds it
  • 28. Debugging under fire • It will always be stressful to debug a service that is down • When a service is down, we must balance the need to restore service with the need to debug it • Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause • In massive outages, parallelize by having teams take different avenues of investigation • Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!
  • 29. Hungry for more? • If you are the kind of software engineer who values debuggability — and loves debugging — Joyent is hiring! • If you have not yet hit your Cantrillian LD50, I will be joining Brigit Kromhout, Andrew Clay Shafer, Matt Stratton as “Old Geeks Shout At Cloud” • Thank you!