Servers are
doomed to fail
Jaana B. Dogan
jbd@google.com
@rakyll
Serverless is also
doomed to fail
Systems are
doomed to fail
Is failure OK?
Is failure an
unexpected case?
Failure is not an exception.
Systems change all
the time.
“I haven’t touched the code
for a century, it should just
work.”
Said no one ever.
Failure is expected.
Yes, it is.
monitoring
debugging
postmortem
Monitoring tells you whether
something is broken.
“99.99% of the requests
should return in 100ms.”
Debugging
Debugging is
collaborative.
Debugging comes in flavors.
Logs Traces Metrics
...
Postmortems
Blameless?
Focus on identifying
problems.
Collaboration
Design for
collaboration.
Design
for failure
Set SLOs, plan for
instrumentation, plan
for debugging.
Cross-stack
debugging
Accountability
across the stack with
high-cardinality data.
speakerdeck.com/rakyll/rpc-metrics-at-google
Correlation
Jump from one kind of
monitoring/debugging
data to another.
On-call
debugging
Jump from distributed
tracing data to on-call
information.
who to page?
Dynamic
collection
Capability to enable
more collection in
production when
needed.
Continuous
collection
Continuously collect
signals, generate
fleet-wide analysis
reports.
Introspection
Introspection pages
served by the services
themselves.
monitoring
debugging
postmortem
Thank you
Jaana B. Dogan
Google
jbd@google.com
