Detangling complex systems
Liz Fong-Jones
@lizthegrey
#DevOpsDays DFW
August 20, 2019
with compassion & production excellence
w/ illustrations by @emilywithcurls!
@lizthegrey at #DevOpsDays DFW
Production is increasingly complex.
Especially for hybrid systems.
What does uptime mean?
Is it measured in servers?
Is it measured in complaints?
How about juggling everything else?
Our strategies need to evolve.
Don't "buy" DevOps.
When we order the alphabet soup...
Noisy alerts. Grumpy engineers.
Walls of meaningless dashboards.
Incidents take forever to fix.
Everyone bugs the "expert".
Deploys are unpredictable.
There's no time to do projects...
and when there's time, there's no plan.
The team is struggling to hold on.
What are we missing?
We forgot who operates systems.
Tools aren't magical.
Invest in people, culture, & process.
Enter the art of Production Excellence.
Make systems more reliable & friendly.
ProdEx takes planning.
Measure and act on what matters.
Involve everyone.
Build everyone's confidence.
Encourage asking questions.
How do we get started?
Know when it's too broken.
& be able to debug together when it is.
Eliminate (unnecessary) complexity.
Our systems are always failing.
What if we measure "too broken"?
We need Service Level Indicators.
SLIs and SLOs are common language.
Think in terms of events in context.
Is this event good or bad?
Are users grumpy? Ask your PM.
What threshold buckets events?
HTTP Code 200? Latency < 300ms?
How many eligible events did we see?
Availability: Good / Eligible Events
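A minimal sketch of that ratio, using the thresholds from the earlier slides (HTTP 200, latency under 300 ms). The `Event` record and its `eligible` flag are hypothetical names for illustration; real event data will look different.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds taken from the slides: HTTP 200 and latency under 300 ms.
GOOD_STATUS = 200
LATENCY_THRESHOLD_MS = 300.0


@dataclass
class Event:
    """One request, as a hypothetical event record."""
    status: int
    latency_ms: float
    eligible: bool = True  # assumption: e.g. False for health checks


def is_good(event: Event) -> bool:
    """An event is good if it succeeded and was fast enough."""
    return event.status == GOOD_STATUS and event.latency_ms < LATENCY_THRESHOLD_MS


def availability(events: list[Event]) -> float:
    """Good events divided by eligible events; 1.0 if nothing was eligible."""
    eligible = [e for e in events if e.eligible]
    if not eligible:
        return 1.0
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


events = [
    Event(200, 120.0),
    Event(200, 450.0),                 # too slow: bad
    Event(500, 80.0),                  # server error: bad
    Event(200, 90.0),
    Event(200, 30.0, eligible=False),  # excluded from the SLI
]
print(f"availability = {availability(events):.2%}")
```

With these five sample events, four are eligible and two of those are good, so the SLI comes out at 50%.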
Set a target Service Level Objective.
Use a window and target percentage.
99.9% of events good in past 30 days.
A good SLO barely keeps users happy.
Drive alerting with SLOs.
Error budget: allowed unavailability
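To make "allowed unavailability" concrete, here is a rough sketch of the arithmetic for the 99.9% over 30 days example. The traffic figures are made up, and the function names are illustrative rather than from any particular tool.

```python
SLO_TARGET = 0.999   # from the slides: 99.9% of events good
WINDOW_DAYS = 30     # from the slides: past 30 days


def error_budget_events(eligible_events: int, slo_target: float = SLO_TARGET) -> float:
    """Number of bad events the SLO tolerates within the window."""
    return (1.0 - slo_target) * eligible_events


def budget_remaining(eligible_events: int, bad_events: int,
                     slo_target: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_events(eligible_events, slo_target)
    if budget == 0:
        return 0.0
    return 1.0 - bad_events / budget


# Hypothetical traffic: 1,000,000 eligible events in the window, 250 of them bad.
budget = error_budget_events(1_000_000)
remaining = budget_remaining(1_000_000, 250)
print(f"budget = {budget:.0f} bad events, remaining = {remaining:.0%}")

# Time-based intuition: 0.1% of a 30-day window, expressed in minutes.
outage_minutes = (1.0 - SLO_TARGET) * WINDOW_DAYS * 24 * 60
print(f"{outage_minutes:.1f} minutes of full outage")
```

At this target, a million events leave room for about 1,000 bad ones, and a total outage would use up the whole budget in roughly 43 minutes.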
How long until I run out?
Page if it's hours.
Ticket if it's days.
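One way to sketch the "how long until I run out?" rule. The slides only say "hours" versus "days", so the 24-hour and 7-day cutoffs below are assumptions, as are the function names.

```python
from typing import Optional

# Cutoffs are assumptions for illustration; tune them to your own on-call policy.
PAGE_WITHIN_HOURS = 24.0
TICKET_WITHIN_HOURS = 24.0 * 7


def hours_until_exhausted(remaining_bad_events: float,
                          bad_events_per_hour: float) -> Optional[float]:
    """Hours until the budget hits zero at the current burn rate.

    Returns None when the budget is not being burned at all.
    """
    if bad_events_per_hour <= 0:
        return None
    return max(remaining_bad_events, 0.0) / bad_events_per_hour


def alert_action(remaining_bad_events: float, bad_events_per_hour: float) -> str:
    """Return 'page', 'ticket', or 'none' based on time to exhaustion."""
    hours = hours_until_exhausted(remaining_bad_events, bad_events_per_hour)
    if hours is None or hours > TICKET_WITHIN_HOURS:
        return "none"
    if hours <= PAGE_WITHIN_HOURS:
        return "page"
    return "ticket"


# Hypothetical: 750 bad events of budget left, at three different burn rates.
for rate in (150.0, 10.0, 1.0):
    print(f"{rate:6.1f} bad/hour -> {alert_action(750, rate)}")
```

At 150 bad events per hour the budget is gone in 5 hours, so someone gets paged; at 10 per hour it lasts about three days, which is ticket territory; at 1 per hour nobody needs to be interrupted.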
Data-driven business decisions.
Is it safe to do this risky experiment?
Should we invest in more reliability?
Perfect SLO > Good SLO >>> No SLO
Measure what you can today.
Iterate to meet user needs.
Only alert on what matters.
SLIs & SLOs are only half the picture...
Our outages are never identical.
Failure modes can't be predicted.
Support debugging novel cases.
In production.
Allow forming & testing hypotheses.
Dive into data to ask new questions.
Our services must be observable.
Can you examine events in context?
Can you explain the variance?
Can you mitigate impact & debug later?
SLOs and Observability go together.
But they alone don't create collaboration.
Heroism isn't sustainable.
Debugging is not a solo activity.
Debugging is for everyone.
Collaboration is interpersonal.
Lean on your team.
We learn better when we document.
Fix hero culture. Share knowledge.
Reward curiosity and teamwork.
Learn from the past.
Reward your future self.
Outages don't repeat, but they rhyme.
Risk analysis helps us plan.
Quantify risks by frequency & impact.
Which risks are most significant?
Address risks that threaten the SLO.
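A rough sketch of ranking risks against the error budget. The cost model (incidents per year × minutes per incident × fraction of users affected) is a common simplification and an assumption here, not something the slides spell out; the risks and their numbers are invented for illustration.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999                     # from the slides
MINUTES_PER_YEAR = 365 * 24 * 60
# Yearly error budget in "bad minutes": about 525.6 at 99.9%.
ANNUAL_BUDGET_MIN = (1.0 - SLO_TARGET) * MINUTES_PER_YEAR


@dataclass
class Risk:
    name: str
    incidents_per_year: float    # frequency
    minutes_per_incident: float  # impact: how long each incident lasts
    fraction_affected: float     # impact: share of users or events hit

    @property
    def expected_bad_minutes(self) -> float:
        """Expected yearly cost of this risk, in bad minutes."""
        return (self.incidents_per_year
                * self.minutes_per_incident
                * self.fraction_affected)


# Hypothetical risks; every number below is made up.
risks = [
    Risk("cert expiry", 0.5, 240, 1.0),
    Risk("bad deploy", 12, 45, 0.5),
    Risk("noisy neighbor", 50, 10, 0.02),
    Risk("database failover", 2, 120, 1.0),
]

ranked = sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)
total = sum(r.expected_bad_minutes for r in ranked)

for r in ranked:
    share = r.expected_bad_minutes / ANNUAL_BUDGET_MIN
    print(f"{r.name:20s} {r.expected_bad_minutes:7.1f} min/yr  {share:6.1%} of budget")
print(f"total {total:.1f} min/yr vs budget {ANNUAL_BUDGET_MIN:.1f} min/yr")
```

In this made-up example the risks add up to more than the yearly budget, and two of them account for most of it. Those are the ones worth a business case; the smallest one is chrome polishing.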
Make the business case to fix them.
And prioritize completing the work.
Don't waste time chrome polishing.
Lack of observability is systemic risk.
So is lack of collaboration.
Success doesn't demand heroism.
Season the alphabet soup with ProdEx
Production Excellence brings teams closer together.
Measure. Debug. Collaborate. Fix.
lizthegrey.com; @lizthegrey
