Debugging under fire
Keeping your head when
systems have lost their mind
Bryan Cantrill
CTO
bryan@joyent.com
@bcantrill
The genesis of an outage
“Please don’t be me, please don’t be me”
“…doesn’t begin to describe it”
“WHEE!”
“Fat-finger”?
• Not just a “fat-finger”; even this relatively simple failure reflected
deeper complexities:
• Outage was instructive — and lucky — on many levels…
It could have been much worse!
• The (open source!) software stack that we have developed to
run our public cloud, Triton, is a complicated distributed system
• Compute nodes are PXE booted from the headnode with a
RAM-resident platform image
• It seemed entirely conceivable that the services needed to boot
compute nodes would not be able to start because a compute
node could not boot…
• This was a condition we had tested, but at nowhere near the
scale — this was a failure that we hadn’t anticipated!
How did we get here?
• Software is increasingly delivered as part of a service
• Software configuration, deployment and management are
increasingly automated
• But automation is not total: humans are still in the loop, even
if only developing software
• Semi-automated systems are fraught with peril: the arrogance
and power of automation — but with human fallibility
Human fallibility in semi-automated systems
Whither microservices?
• Microservices have yielded simpler components — but more
complicated systems
• …and open source has allowed us to deploy many more
kinds of software components, increasing complexity again
• As abstractions become more robust, failures become rare,
but arguably more acute: service outage is more likely due to
cascading failure in which there is not one bug but several
• That these failures may be in discrete software services
makes understanding the system very difficult…
The Microservices Complexity Paradox
an active shooter
Modern software failure modes
An even more apt metaphor
A mechanical distributed system
But… but… alerts and monitoring!
“It is a difficult thing to look at a winking light on a board,
or hear a peeping alarm — let alone several of them —
and immediately draw any sort of rational picture of
something happening”
— Nuclear Regulatory Commission’s Special Report
on incident at Three Mile Island
The debugging imperative
• We suffer from many of the same problems as nuclear power
in the 1970s: we are delivering systems that we think can’t fail
• In particular, distributed systems are vulnerable to software
defects — we must be able to debug them in production
• What does it mean to develop software to be debugged?
• Prompts a deeper question: how do we debug, anyway?
Debugging in the abstract
• Debugging is the process by which we understand
pathological behavior in a software system
• It is not unlike the process by which we understand the
behavior of a natural system — a process we call science
• Reasoning about the natural world can be very difficult:
experiments are expensive and even observations can be
very difficult
• Physical science is hypothesis-centric
The exceptionalism of software
• Software is entirely synthetic: it is a mathematical machine!
• The conclusions of software debugging are often
mathematical in their unequivocal power!
• Software is so distilled and pure — experiments are so cheap
and observation so limitless — that we can structure our
reasoning about it differently
• We can understand software by simply observing it
The art of debugging
• The art of debugging isn’t to guess the answer; it is to be
able to ask the right questions and to know how to answer them
• Answered questions are facts, not hypotheses
• Facts form constraints on future questions and hypotheses
• As facts beget questions which beget observations and more
facts, hypotheses become more tightly constrained — like a
cordon being cinched around the truth
The craft of debuggable software
• The essence of debugging is asking and answering questions
— and the craft of writing debuggable software is allowing the
software to be able to answer questions about itself
• This takes many forms:
• Designing for postmortem debuggability
• Designing for in situ instrumentation
• Designing for post hoc debugging
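One way to picture software that can answer questions about itself is the minimal Python sketch below. It is an illustration only, not code from the talk or from Triton: the class name Introspectable and its methods note and snapshot are hypothetical. The component keeps counters and a bounded record of recent events in memory, so the same state can be queried while the service is running or written out when it fails.

```python
import collections
import json
import time


class Introspectable:
    """Hypothetical component that records what it has been doing."""

    def __init__(self, history: int = 100) -> None:
        # Bounded buffer: old events fall off, so memory use stays fixed.
        self._events = collections.deque(maxlen=history)
        self._counters = collections.Counter()

    def note(self, event: str, **detail) -> None:
        """Record one event with a timestamp and any extra detail."""
        self._counters[event] += 1
        self._events.append({"time": time.time(), "event": event, **detail})

    def snapshot(self) -> str:
        """Return counters and recent events as JSON.

        The same output could serve a live query or be saved on fatal failure.
        """
        state = {"counters": dict(self._counters), "recent": list(self._events)}
        return json.dumps(state, sort_keys=True)


if __name__ == "__main__":
    worker = Introspectable(history=2)
    worker.note("connect", peer="a")
    worker.note("error", code=5)
    print(worker.snapshot())
```

The design choice in this sketch is that recording is cheap and always on, so the question can be asked after the fact rather than only when someone thought to enable logging.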
A culture of debugging
• Debugging must be viewed as the process by which systems
are understood and improved, not merely as the process by
which bugs are made to go away!
• Too often, we have found that beneath innocent wisps of
smoke lurk raging coal infernos
• Engineers must be empowered to understand anomalies!
• Engineers must be empowered to take the extra time to build
for debuggability — we must be secure in the knowledge that
this pays later dividends!
Debugging during an outage
• When systems are down, there is a natural tension: do we
optimize for recovery or understanding?
• “Can we resume service without losing information?”
• “What degree of service can we resume with minimal loss
of information?”
• Overemphasizing recovery with respect to understanding may
leave the problem undebugged or (worse) exacerbate the
problem with a destructive but unrelated action
The peril of overemphasizing recovery
• Recovery in lieu of understanding normalizes broken software
• If it becomes culturally engrained, the dubious principle of
software recovery has toxic corollaries, e.g.:
• Software should tolerate bad input (viz. “npm isntall”)
• Software should “recover” from fatal failures (uncaught
exceptions, segmentation violations, etc.)
• Software should not assert the correctness of its state
• These anti-patterns impede debuggability!
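As a rough illustration of the opposite habits, the Python sketch below rejects bad input instead of guessing at it and checks its own state instead of carrying on. Everything in it is hypothetical (the command set, parse_command, BoundedCounter); it is a sketch of the idea, not how npm or any Joyent software behaves.

```python
COMMANDS = frozenset({"install", "test", "publish"})  # hypothetical command set


def parse_command(word: str) -> str:
    """Reject unknown input instead of guessing what the user meant."""
    if word not in COMMANDS:
        raise ValueError(f"unknown command: {word!r}")
    return word


class BoundedCounter:
    """Hypothetical stateful component that checks its own invariant."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.value = 0

    def check(self) -> None:
        # Raise explicitly rather than use `assert`, so the check still
        # runs when Python is started with -O.
        if not 0 <= self.value <= self.limit:
            raise AssertionError(f"value {self.value} outside [0, {self.limit}]")

    def increment(self) -> None:
        self.check()
        if self.value == self.limit:
            raise OverflowError("counter is full")  # expected, reportable condition
        self.value += 1
        self.check()
```

Nothing here catches AssertionError: a failed invariant means the program is wrong, and stopping at that point preserves the state that explains why.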
Debugging after an outage
• After an outage, we must debug to complete understanding
• In mature systems, we can expect cascading failures —
which can be exhausting to fully unwind
• It will be (very!) tempting after an outage to simply move on,
but every service failure (outage-inducing or not) represents
an opportunity to advance understanding
• Software engineers must be encouraged to understand their
own failures to encourage designing for debuggability
Enshrining debuggability
• Designing for debuggability effects true software robustness:
differentiating operational failures from programmatic ones
• Operational failures should be handled; programmatic failures
should be debugged
• Ironically, the more software is designed for debuggability the
less you will need to debug it — and the more you will
leverage it to debug the software that surrounds it
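The distinction between the two kinds of failure can be sketched in Python as follows. The names OperationalError, load_config and start are hypothetical and chosen only for illustration. Malformed configuration is something a correct program must expect, so it is handled; a caller passing the wrong type is a bug, so nothing catches it and it surfaces with its traceback intact.

```python
import json


class OperationalError(Exception):
    """Hypothetical marker for failures a correct program must expect."""


def load_config(text: str) -> dict:
    """Parse configuration text, turning bad input into OperationalError."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Malformed input is operational: report it to the caller.
        raise OperationalError(f"config is not valid JSON: {err}") from err
    if not isinstance(config, dict) or not isinstance(config.get("port"), int):
        raise OperationalError("config must be an object with an integer 'port'")
    return config


def start(text: str) -> str:
    """Handle operational failures only; programmatic ones propagate."""
    try:
        config = load_config(text)
    except OperationalError as err:
        return f"cannot start: {err}"
    return f"listening on port {config['port']}"


if __name__ == "__main__":
    print(start('{"port": 8080}'))
    print(start("not json"))
```

Calling load_config(None) is a programmer error: json.loads raises TypeError, and because start only catches OperationalError, that failure is left visible to be debugged rather than handled.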
Debugging under fire
• It will always be stressful to debug a service that is down
• When a service is down, we must balance the need to restore
service with the need to debug it
• Missteps can be costly; taking time to huddle and think can
yield a better, safer path to recovery and root-cause
• In massive outages, parallelize by having teams take different
avenues of investigation
• Viewing outages as opportunities for understanding allows us
to develop software cultures that value debuggability!
Hungry for more?
• If you are the kind of software engineer who values
debuggability — and loves debugging — Joyent is hiring!
• If you have not yet hit your Cantrillian LD50, I will be joining
Bridget Kromhout, Andrew Clay Shafer and Matt Stratton as “Old
Geeks Shout At Cloud”
• Thank you!

More Related Content

What's hot

The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
bcantrill
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
bcantrill
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
bcantrill
 
The Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of data
bcantrill
 
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Steve Poole
 
OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"
The Linux Foundation
 
Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013
The Linux Foundation
 
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 Microservices for Mortals by Bert Ertman at Codemotion Dubai Microservices for Mortals by Bert Ertman at Codemotion Dubai
Microservices for Mortals by Bert Ertman at Codemotion Dubai
Codemotion Dubai
 
Internet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit IIInternet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit II
Arti Parab Academics
 
Internet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit VInternet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit V
Arti Parab Academics
 
Graphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital InteractionGraphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital Interactionmtrimpe
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
jaxconf
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew engCineSoft
 
The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native Applications
Jonas Bonér
 
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
The Linux Foundation
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracled0nn9n
 
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Clinton Wolfe
 
VMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack EngineeringVMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack Engineering
Chris Wahl
 
Accidental Architecture 0.9
Accidental Architecture 0.9Accidental Architecture 0.9
Accidental Architecture 0.9Mark Cathcart
 

What's hot (20)

The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
 
The Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of data
 
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
 
OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"
 
Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013
 
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 Microservices for Mortals by Bert Ertman at Codemotion Dubai Microservices for Mortals by Bert Ertman at Codemotion Dubai
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 
Internet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit IIInternet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit II
 
Internet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit VInternet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit V
 
Graphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital InteractionGraphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital Interaction
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew eng
 
The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native Applications
 
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
 
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
 
Elatt Presentation
Elatt PresentationElatt Presentation
Elatt Presentation
 
VMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack EngineeringVMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack Engineering
 
Accidental Architecture 0.9
Accidental Architecture 0.9Accidental Architecture 0.9
Accidental Architecture 0.9
 

Similar to Debugging under fire: Keeping your head when systems have lost their mind

DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
Alex Cruise
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
Ghufran Hasan
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
alessio_ferrari
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systems
Bert Jan Schrijver
 
dist_systems.pdf
dist_systems.pdfdist_systems.pdf
dist_systems.pdf
CherenetToma
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
Uwe Friedrichsen
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
duleepa
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
David Simons
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
Bert Jan Schrijver
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systems
Bert Jan Schrijver
 
Debugging distributed systems
Debugging distributed systemsDebugging distributed systems
Debugging distributed systems
Bert Jan Schrijver
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
Daniel Zivkovic
 
Software engineering
Software engineeringSoftware engineering
Software engineering
DivyaSharma458
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software Engineering
Aditi Abhang
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
Redis Labs
 
Software Defects and SW Reliability Assessment
Software Defects and SW Reliability AssessmentSoftware Defects and SW Reliability Assessment
Software Defects and SW Reliability Assessment
Kristine Hejna
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
Andrew Shafer
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
VMware Tanzu
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
Alberto Acerbis
 

Similar to Debugging under fire: Keeping your head when systems have lost their mind (20)

DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systems
 
dist_systems.pdf
dist_systems.pdfdist_systems.pdf
dist_systems.pdf
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systems
 
Debugging distributed systems
Debugging distributed systemsDebugging distributed systems
Debugging distributed systems
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software Engineering
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
 
Software Defects and SW Reliability Assessment
Software Defects and SW Reliability AssessmentSoftware Defects and SW Reliability Assessment
Software Defects and SW Reliability Assessment
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 

More from bcantrill

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Present
bcantrill
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
bcantrill
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
bcantrill
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
bcantrill
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
bcantrill
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
bcantrill
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
bcantrill
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
bcantrill
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
bcantrill
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
bcantrill
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
bcantrill
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
bcantrill
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
bcantrill
 
Down Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocator
bcantrill
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decade
bcantrill
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zones
bcantrill
 
Why it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metalWhy it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metal
bcantrill
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!
bcantrill
 
A crime against common sense
A crime against common senseA crime against common sense
A crime against common sense
bcantrill
 

More from bcantrill (20)

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Present
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
 
Down Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocator
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decade
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zones
 
Why it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metalWhy it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metal
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!
 
A crime against common sense
A crime against common senseA crime against common sense
A crime against common sense
 


Debugging under fire: Keeping your head when systems have lost their mind

  • 1. Debugging under fire: Keeping your head when systems have lost their mind. Bryan Cantrill, CTO, bryan@joyent.com, @bcantrill
  • 2. The genesis of an outage
  • 3. “Please don’t be me, please don’t be me”
  • 4. “…doesn’t begin to describe it”
  • 6. “Fat-finger”? • Not just a “fat-finger”; even this relatively simple failure reflected deeper complexities: • Outage was instructive, and lucky, on many levels…
  • 7. It could have been much worse! • The (open source!) software stack that we have developed to run our public cloud, Triton, is a complicated distributed system • Compute nodes are PXE booted from the headnode with a RAM-resident platform image • It seemed entirely conceivable that the services needed to boot compute nodes would not be able to start because a compute node could not boot… • This was a condition we had tested, but at nowhere near the scale; this was a failure that we hadn’t anticipated!
  • 8. How did we get here? • Software is increasingly delivered as part of a service • Software configuration, deployment and management is increasingly automated • But automation is not total: humans are still in the loop, even if only developing software • Semi-automated systems are fraught with peril: the arrogance and power of automation — but with human fallibility
  • 9. Human fallibility in semi-automated systems
  • 10. Human fallibility in semi-automated systems
  • 11. Whither microservices? • Microservices have yielded simpler components — but more complicated systems • …and open source has allowed us to deploy many more kinds of software components, increasing complexity again • As abstractions become more robust, failures become rare, but arguably more acute: service outage is more likely due to cascading failure in which there is not one bug but several • That these failures may be in discrete software services makes understanding the system very difficult…
  • 13. The Microservices Complexity Paradox an active shooter
  • 15. An even more apt metaphor
  • 17. But… but… alerts and monitoring! “It is a difficult thing to look at a winking light on a board, or hear a peeping alarm, let alone several of them, and immediately draw any sort of rational picture of something happening” (Nuclear Regulatory Commission’s Special Report on incident at Three Mile Island)
  • 18. The debugging imperative • We suffer from many of the same problems as nuclear power in the 1970s: we are delivering systems that we think can’t fail • In particular, distributed systems are vulnerable to software defects — we must be able to debug them in production • What does it mean to develop software to be debugged? • Prompts a deeper question: how do we debug, anyway?
  • 19. Debugging in the abstract • Debugging is the process by which we understand pathological behavior in a software system • It is not unlike the process by which we understand the behavior of a natural system — a process we call science • Reasoning about the natural world can be very difficult: experiments are expensive and even observations can be very difficult • Physical science is hypothesis-centric
  • 20. The exceptionalism of software • Software is entirely synthetic: it is a mathematical machine! • The conclusions of software debugging are often mathematical in their unequivocal power! • Software is so distilled and pure (experiments are so cheap and observation so limitless) that we can structure our reasoning about it differently • We can understand software by simply observing it
  • 21. The art of debugging • The art of debugging isn’t to guess the answer; it is to be able to ask the right questions and to know how to answer them • Answered questions are facts, not hypotheses • Facts form constraints on future questions and hypotheses • As facts beget questions which beget observations and more facts, hypotheses become more tightly constrained, like a cordon being cinched around the truth
  • 22. The craft of debuggable software • The essence of debugging is asking and answering questions — and the craft of writing debuggable software is allowing the software to be able to answer questions about itself • This takes many forms: • Designing for postmortem debuggability • Designing for in situ instrumentation • Designing for post hoc debugging
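One small way to let software answer questions about itself is to keep a bounded, in-memory record of recent events that can be queried while the program runs or dumped after a failure. The sketch below is only an illustration under stated assumptions: the talk names no language or tool, Python is chosen here for brevity, and `FlightRecorder`, `record`, and `query` are hypothetical names invented for this example.

```python
import time
from collections import deque


class FlightRecorder:
    """Bounded in-memory log of recent events (hypothetical helper)."""

    def __init__(self, capacity: int = 1000) -> None:
        # A deque with maxlen silently discards the oldest entry when full,
        # so memory use stays bounded however long the process runs.
        self._events = deque(maxlen=capacity)

    def record(self, component: str, message: str, **fields) -> None:
        """Store one structured event with a timestamp."""
        self._events.append(
            {"time": time.time(), "component": component,
             "message": message, **fields}
        )

    def query(self, component=None) -> list:
        """Return retained events, optionally filtered by component."""
        return [
            event for event in self._events
            if component is None or event["component"] == component
        ]


recorder = FlightRecorder(capacity=2)
recorder.record("db", "connection opened", host="10.0.0.5")
recorder.record("api", "request received", path="/vms")
recorder.record("db", "query timed out", elapsed_ms=5000)

# Only the two most recent events are retained.
recent_db_events = recorder.query("db")
```

Because the events are structured rather than free-form text, a question such as "what did the database layer do just before the failure?" becomes a filter over facts rather than a guess.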
  • 23. A culture of debugging • Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away! • Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos • Engineers must be empowered to understand anomalies! • Engineers must be empowered to take the extra time to build for debuggability — we must be secure in the knowledge that this pays later dividends!
  • 24. Debugging during an outage • When systems are down, there is a natural tension: do we optimize for recovery or understanding? • “Can we resume service without losing information?” • “What degree of service can we resume with minimal loss of information?” • Overemphasizing recovery with respect to understanding may leave the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action
  • 25. The peril of overemphasizing recovery • Recovery in lieu of understanding normalizes broken software • If it becomes culturally engrained, the dubious principle of software recovery has toxic corollaries, e.g.: • Software should tolerate bad input (viz. “npm isntall”) • Software should “recover” from fatal failures (uncaught exceptions, segmentation violations, etc.) • Software should not assert the correctness of its state • These anti-patterns impede debuggability!
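The opposite of these anti-patterns can be sketched briefly: reject unknown input outright rather than guessing at intent, and check internal state so that a defect stops the program where the state first goes wrong. This is a minimal illustration, not anything shown in the talk; Python is an assumption, and `KNOWN_COMMANDS`, `run_command`, and `InFlight` are hypothetical names.

```python
KNOWN_COMMANDS = {"install", "uninstall", "update"}  # hypothetical command set


def run_command(name: str) -> str:
    """Dispatch a subcommand, accepting only exact, known names."""
    if name not in KNOWN_COMMANDS:
        # No fuzzy matching: a typo such as "isntall" is an error, not a hint.
        raise ValueError(f"unknown command: {name!r}")
    return f"running {name}"


class InFlight:
    """Counts outstanding requests and checks its own invariant."""

    def __init__(self) -> None:
        self.count = 0

    def start(self) -> None:
        self.count += 1

    def finish(self) -> None:
        # Calling finish() without a matching start() is a programmer error.
        # Raise explicitly (rather than using `assert`) so the check still
        # runs when Python is started with -O, which strips assert statements.
        if self.count <= 0:
            raise AssertionError("finish() called without matching start()")
        self.count -= 1
```

The design choice is that neither failure is caught and smoothed over: the mistyped command and the unbalanced `finish()` both surface immediately, with the state that caused them still intact for inspection.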
  • 26. Debugging after an outage • After an outage, we must debug to complete understanding • In mature systems, we can expect cascading failures — which can be exhausting to fully unwind • It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding • Software engineers must be encouraged to understand their own failures to encourage designing for debuggability
  • 27. Enshrining debuggability • Designing for debuggability effects true software robustness: differentiating operational failures from programmatic ones • Operational failures should be handled; programmatic failures should be debugged • Ironically, the more software is designed for debuggability the less you will need to debug it, and the more you will leverage it to debug the software that surrounds it
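The distinction between operational and programmatic failures can be made concrete with a short sketch. It assumes Python and a JSON configuration format purely for illustration; `OperationalError`, `load_config`, and `handle` are hypothetical names, not part of the talk or of any library.

```python
import json


class OperationalError(Exception):
    """A failure expected in normal operation (hypothetical type)."""


def load_config(text: str) -> dict:
    """Parse and validate configuration text supplied from outside."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Operational: malformed external input. Report it clearly.
        raise OperationalError(f"invalid JSON: {err}") from err
    if not isinstance(config, dict) or not isinstance(config.get("port"), int):
        raise OperationalError("config must be an object with an integer 'port'")
    return config


def handle(text: str) -> str:
    """Handle operational failures; let everything else propagate."""
    try:
        config = load_config(text)
    except OperationalError as err:
        return f"rejected: {err}"
    # Deliberately no blanket `except Exception`: a TypeError or similar here
    # indicates a bug in this program and should surface with its traceback.
    return f"port {config['port']}"
```

For example, `handle('{"port": 8080}')` returns `"port 8080"` and `handle("not json")` returns a string beginning with `"rejected:"`, while `handle(None)` raises `TypeError`, because passing a non-string is a mistake in the calling code rather than a condition to recover from.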
  • 28. Debugging under fire • It will always be stressful to debug a service that is down • When a service is down, we must balance the need to restore service with the need to debug it • Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause • In massive outages, parallelize by having teams take different avenues of investigation • Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!
  • 29. Hungry for more? • If you are the kind of software engineer who values debuggability, and loves debugging, Joyent is hiring! • If you have not yet hit your Cantrillian LD50, I will be joining Bridget Kromhout, Andrew Clay Shafer, and Matt Stratton as “Old Geeks Shout At Cloud” • Thank you!