2. Oh, The Places We’ll Go!
Today, I want to propose a general framework for how to think about operational insight products and features. I’m hoping that this framework is applicable to anyone who performs operations in production. After I propose thinking about operational insight this way, I’ll demonstrate some applications of it within our own operational environments at Netflix.
3. The template we were supposed to use had me start with a slide with the speaker bio, but I want to start with something more relevant and interesting to you: The Korean War, and specifically dogfights during the war.
4. John Boyd
John Boyd was an Air Force pilot at the time; he studied dogfights and came to the conclusion that every fighter pilot went through the same four steps:
5. Observe your environment, orient (figure out what it means), decide what to do, act on that decision, and go back to observing. The pilot who did that faster than their opponent got to go home. This loop is known as OODA (observe, orient, decide, act). It has been used as a general framework for dealing with conflict against other humans, but I’d like to suggest it has much broader applicability.
11. Observe
Orient
Decide
Act
OODA
“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia
12. This Is What We Do
Because even when not dealing with human opponents, anyone dealing with any aspect of operations — dealing with availability events, making decisions about promoting software in production, or … well, making decisions in general — does this all. the. time.
13. For example, this pair of graphs represents the two KPIs by which we know if we have a serious, high-level problem. The top one is the rate of calls into our customer service group; the second one is the rate at which people are actually streaming. Both are over the last seven days. When these dip …
15. We know we have a problem. We don’t exactly know what’s causing it, or what we’ll do to fix it. We’ll need to understand more about the problem to come to a decision, and then execute on that decision — OODA.
19. OODA KPI
Speed Effort Reliability
So if we do OODA all the time, how can we think about doing OODA better or worse? There are three facets we can look at: how fast we execute the loop, how much effort it takes to execute the loop, and how reliably we execute the loop (make the right decision, execute it well).
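One way to make these three facets concrete is to record each pass through the loop and summarize it. This is a minimal sketch, not a description of any tooling we run; the `LoopRecord` fields and the `ooda_kpis` function are hypothetical names chosen for illustration, and the sample numbers are made up.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class LoopRecord:
    """One pass through the loop, e.g. one incident or one deploy decision."""
    minutes_to_resolve: float   # speed: wall-clock time from observe to act
    person_minutes: float       # effort: human time spent across the loop
    right_first_time: bool      # reliability: right decision, executed well


def ooda_kpis(records):
    """Summarize speed, effort, and reliability over a set of loop records."""
    if not records:
        raise ValueError("need at least one record")
    return {
        "median_minutes": median(r.minutes_to_resolve for r in records),
        "median_person_minutes": median(r.person_minutes for r in records),
        "reliability": sum(r.right_first_time for r in records) / len(records),
    }


# Made-up records for illustration only.
records = [
    LoopRecord(30, 45, True),
    LoopRecord(10, 10, True),
    LoopRecord(90, 240, False),
    LoopRecord(20, 25, True),
]
print(ooda_kpis(records))
```

Medians are used here rather than means so that one very long incident does not dominate the summary; that is a design choice for the sketch, not a recommendation from the talk.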
23. Winning
Speed
Effort
Reliability
So if we were trying to optimize the loop, we’d try to raise speed, drop effort, and raise reliability. Speed and reliability are directly relevant to the consistency of your product’s delivery; effort is more relevant to whether your people are doing high-value work, and whether they’re likely to continue to be happy working for you.
29. Implications …
for Observation (aka measurement, telemetry, metrics)
• Make It Easy
• Make It Scalable
• Make It Pluggable
• (Eventually) Ruthlessly Cull
“What decision will this help me make?”
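The culling question above can be applied mechanically if each metric carries a note about the decision it supports. The sketch below assumes such an inventory exists; the `Metric` fields, the metric names, and the 90-day idle cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    name: str
    decision: Optional[str]      # the decision this metric supports, if any
    days_since_last_read: int    # how long since a dashboard or alert used it


def cull_candidates(metrics, max_idle_days=90):
    """Return names of metrics that support no decision or that nobody reads."""
    return [
        m.name
        for m in metrics
        if not m.decision or m.days_since_last_read > max_idle_days
    ]


# Hypothetical inventory for illustration only.
metrics = [
    Metric("streams_started_per_sec", "Do we have a customer-facing outage?", 0),
    Metric("pct_even_ip_addresses", None, 400),
    Metric("cluster_cpu_utilization", "Do I have enough instances?", 1),
]
print(cull_candidates(metrics))
```

A list of candidates, rather than automatic deletion, keeps a human in the loop for the first few rounds of culling.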
30. A Joke
I’d like to tell a very, very long joke. It started at Velocity 2011, when I heard someone at a presentation say, “Monitor all the things, because you never know what you might find useful one of these days.”
31. This is a graph representing about 380K datapoints, collected once every five minutes since June 2011. It’s a bit mysterious, I know.
32. [Graph: y-axis runs from 48 to 52]
It may help you to see that the lower and upper bounds of this graph are 48 and 52.
33. % of servers in major region with an even IP address
This graph represents the percent of our cloud instances in a given production region which had an even IP address.
We can (and should, and I hope we do) laugh about this graph, but I’d bet you your monitoring system is chock full of similarly useless data; I know mine is. It impacts the cost of the system, but it also literally makes your job harder, and your customers’ jobs too if you’re responsible for the telemetry system, because there’s much, much more chaff to wade through.
39. Implications …
for Orientation (aka graphing, visualization)
• First-class product
• Different decisions require different viz
• Low cognitive load better than:
  • High refresh rates
  • Deep data density
47. Implications …
for Decisions (aka alerting, real-time analytics, etc)
• You already have (some of) this
• Incremental improvement
• Sky’s the limit
  • For benefits
  • For cost
Alerts are a basic, primitive decision. Build on that.
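To illustrate "build on that", here is a hedged sketch of one incremental step: from a fixed-threshold alert to one that fires only on a sustained drop relative to a baseline. The function names, the 20% drop, and the three-sample streak are assumptions made for the example, not values from our systems.

```python
def threshold_alert(value, limit):
    """The primitive decision: is this one number below a fixed limit?"""
    return value < limit


def sustained_drop_alert(samples, baseline, max_drop=0.2, min_consecutive=3):
    """A small step up: fire only when the metric stays well below a
    baseline for several consecutive samples, which filters brief dips."""
    streak = 0
    for value in samples:
        if value < baseline * (1 - max_drop):
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # the dip ended, so start counting again
    return False


# Made-up stream-start counts against a baseline of 1000.
print(sustained_drop_alert([990, 700, 980, 760, 750, 740], baseline=1000))
print(sustained_drop_alert([990, 700, 980, 990], baseline=1000))
```

The second function costs a few more lines than the first, and further steps (seasonal baselines, automatic responses) can be added the same way, one at a time.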
54. Implications …
for Action
1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
4. Start with humans
5. If IFTTT, deprecate humans
Repeatable machine processes TROUNCE one-off human bureaucracy
If you’re thinking of creating a runbook, AUTOMATE IT.
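As a rough illustration of turning a runbook into something a machine can repeat, the sketch below models each runbook step as a check plus a fix. The `run_runbook` function, the step names, and the dictionary standing in for a server are all hypothetical; the "fix" functions are placeholders for real commands or API calls.

```python
def run_runbook(steps, context):
    """Run each (name, check, fix) step in order; stop at the first failure.

    check(context) returns True when the step's condition already holds.
    fix(context) tries to make it hold. Returns a log of what happened.
    """
    log = []
    for name, check, fix in steps:
        if check(context):
            log.append((name, "ok"))
            continue
        fix(context)
        if check(context):
            log.append((name, "fixed"))
        else:
            log.append((name, "failed"))
            break  # hand off to a human rather than guessing further
    return log


# Hypothetical runbook: "disk is filling up on a server".
def disk_ok(ctx):
    return ctx["disk_used_pct"] < 80


def rotate_logs(ctx):
    ctx["disk_used_pct"] -= 30  # stand-in for a real cleanup command


def in_rotation(ctx):
    return ctx["in_load_balancer"]


def re_register(ctx):
    ctx["in_load_balancer"] = True  # stand-in for a real API call


steps = [
    ("free disk space", disk_ok, rotate_logs),
    ("serve traffic", in_rotation, re_register),
]
server = {"disk_used_pct": 95, "in_load_balancer": True}
print(run_runbook(steps, server))
```

Because every step re-checks its condition after the fix, running the same runbook twice is harmless, which is what makes it repeatable.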
55. Decision: Do I Have Enough Instances?
So let’s talk about a basic capacity quandary: Do I have enough instances in my cluster?
56. I showed this graph earlier. Our work volume is highly diurnal. We could, if we wanted, make our clusters big enough to support peak workload and just accept the waste when the workload decreases; instead, we use Amazon’s auto-scaling group feature to automatically scale the clusters up and down in response to demand. So instead of trying to give users better telemetry on utilization and making it easier for them to see whether they need to increase capacity, we just automate that decision (allowing them, of course, to override it whenever they want to).
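The decision being automated here can be sketched in a few lines. This is not Amazon’s auto-scaling API or our actual policy; `desired_instances`, the per-instance capacity, the 60% target utilization, and the size limits are assumptions made purely to show the shape of the calculation, including the human override.

```python
import math


def desired_instances(requests_per_sec, rps_per_instance,
                      target_utilization=0.6, min_size=2, max_size=100,
                      override=None):
    """Pick a cluster size that keeps each instance near the target load.

    An explicit override wins, mirroring the rule that users can always
    overrule the automated decision.
    """
    if override is not None:
        return override
    usable_rps = rps_per_instance * target_utilization
    needed = math.ceil(requests_per_sec / usable_rps)
    return max(min_size, min(max_size, needed))


# A made-up diurnal day: quiet overnight, busy in the evening.
for hour, load in [(4, 1000), (12, 6100), (20, 23900)]:
    print(hour, desired_instances(load, rps_per_instance=500))
```

The minimum and maximum sizes bound the automation so that a bad input cannot shrink the cluster to nothing or grow it without limit.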
58. Decision: Is My Canary Good?
We use a deployment pattern called canary, where we compare the new version of the software to the baseline, also in production, and seek to answer a very simple question: Is our canary at least as good as our baseline system?
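A very simple form of the "at least as good as baseline" question can be sketched as a per-metric comparison. This is an illustrative stand-in, not our canary analysis; the `canary_is_good` function, the 10% tolerance, the metric names, and the sample values are all hypothetical, and it assumes every metric is one where lower is better.

```python
from statistics import mean


def canary_is_good(baseline, canary, tolerance=0.10):
    """Compare canary to baseline on every metric where lower is better.

    baseline and canary map a metric name to a list of samples. The canary
    passes only if its mean is within `tolerance` of the baseline mean for
    every metric. Returns (passed, list of failing metric names).
    """
    failing = [
        name
        for name, base_samples in baseline.items()
        if mean(canary[name]) > mean(base_samples) * (1 + tolerance)
    ]
    return (not failing, failing)


# Made-up samples for illustration only.
baseline = {"error_rate": [0.010, 0.012, 0.011], "latency_ms": [120, 118, 125]}
canary = {"error_rate": [0.011, 0.010, 0.012], "latency_ms": [180, 175, 190]}
print(canary_is_good(baseline, canary))
```

Returning the failing metric names alongside the verdict gives a human something to orient on when the automated answer is "no".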
87. Outlier Detection
In an environment where you have a bunch of potentially undifferentiated resources that should all behave approximately similarly, it becomes easy (and, in a sufficiently large ecosystem, necessary) to notice outliers. If your cost for culling the outliers is low, you can also do it automatically. If not, you can at least alert that One Of These Things Is No Longer Like The Others.
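As a sketch of what noticing an outlier among similar servers could look like, the example below flags any server whose value sits far from the group median, measured in median absolute deviations. This is a common, simple technique that I am substituting for illustration; it is not a description of the analytics system discussed in this talk. The function name, the threshold of 5, and the data are hypothetical.

```python
from statistics import median


def find_outliers(values_by_server, threshold=5.0):
    """Flag servers whose value is far from the group median.

    Distance is measured in units of the median absolute deviation (MAD),
    which a single misbehaving server cannot distort much.
    """
    center = median(values_by_server.values())
    mad = median(abs(v - center) for v in values_by_server.values())
    if mad == 0:
        # Most servers agree exactly; anything different stands out.
        return sorted(s for s, v in values_by_server.items() if v != center)
    return sorted(
        s for s, v in values_by_server.items()
        if abs(v - center) / mad > threshold
    )


# Hypothetical one-minute error rates for nine servers, A through I.
error_rates = {"A": 9.7, "B": 1.1, "C": 0.9, "D": 1.0, "E": 1.2,
               "F": 1.0, "G": 0.8, "H": 1.1, "I": 1.0}
print(find_outliers(error_rates))
```

The output of a function like this can feed either branch described above: terminate the server automatically when that is cheap, or raise an alert when it is not.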
88. Would You Like to Play a Game?
Can I have a volunteer from the audience to run an experiment with me?
89. Spot the Outlier
So for training, imagine I’m giving you this information about nine servers, named A through I. Each row is a minute’s data for these servers; let’s say it’s load average, or error rates. I’m going to ask you to point out the server (that is, the column) that looks materially different from the others. This should be a relatively easy case, of course. Can you pick the server?
90. OK. Now, I’m going to time you doing the same with more interesting data.
Didn’t work so well? OK, let’s make it easier to orient and understand the numbers you’re looking at by showing this to you graphed.
92. It probably is easier, isn’t it? Can you easily point out the outlier?
OK, one last test. On the next slide, I’m going to show you some information (you can assume it’s true) and I want you to tell me which is the outlier, OK?
93. The Outlier Is “A”
That was … much easier, wasn’t it?
This is what happens when we let computers do this work. We could have spent more time and effort to give you a more powerful visualization that would have made it easier to notice the outlier, but we instead built an analytics system that determines outliers automatically. It doesn’t make the work easier for you; it does the work for you.
95. Just The Stats
4-Week View
739 Server Terminations
We can use this for anything: pieces of content, or devices, or ISPs. Right now, we’ve been using it for about ten clusters of servers, and in the last four weeks we have automatically identified, and terminated, 739 outliers.
101. In a Nutshell
Observe
Orient
Decide
Act
Need This First
http://bit.ly/nflx-atlas-2013
http://metrics20.org
Understand the decision
http://bit.ly/nflx-qcon-aca-2014
Make it easier for humans
Make machines do it
Higher speed
Lower effort
Higher reliability
So really, that’s all there is: Start with observation; you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.