Operational Insight Framework

Operational Insight
June 15, 2015
Roy Rapoport
@royrapoport / linkedin.com/in/royrapoport / rrapoport@netflix.com

Oh, The Places
We’ll Go!
Today, I want to propose a general framework for thinking about operational insight products and features. I hope this framework is applicable to anyone who performs operations in production. After laying out the framework, I’ll demonstrate some applications of it within our own operational environments at Netflix.

The template we were supposed to use had me start with a slide with the speaker bio, but I want to start with something more relevant and interesting to you: The Korean War, and specifically dogfights during the war.

John Boyd
John Boyd was an Air Force pilot at the time; he studied dogfights and came to the conclusion that every fighter pilot went through the same four steps:

Observe your environment, orient (figure out what it means), decide what to do, act on that decision, and go back to observing. The pilot who did that faster than their opponent got to go home. This loop (Observe, Orient, Decide, Act, or OODA) has been used as a general framework for dealing with conflict against other humans, but I’d like to suggest it has much broader applicability.

Observe
Orient
Decide
Act
OODA
“This approach favors agility over raw power in dealing with human
opponents in any endeavor” - Wikipedia

This Is What We
Do
Because even when not dealing with human opponents, anyone dealing with any aspect of operations — dealing with availability events, making decisions about promoting software in production, or … well, making decisions in general — does this all. the. time.

For example, this pair of graphs represents the two KPIs by which we know if we have a serious, high-level problem. The top one is the rate of calls into our customer service group; the second one is the rate at which people are actually streaming. Both cover the last seven days. When these dip …

We know we have a problem. We don’t exactly know what’s causing it, or what we’ll do to fix it. We’ll need to understand more about the problem to come to a decision, and then execute on that decision — OODA.

OODA KPI
So if we do OODA all the time, how can we think about doing OODA better or worse? There are three facets we can look at: How fast we execute the loop, how much effort it takes to execute the loop, and how reliably we execute the loop (make the right
decision, execute it well).

OODA KPI
Speed Effort Reliability

Winning
Speed Effort Reliability
So if we were trying to optimize the loop, we’d try to raise speed, drop effort, and raise reliability. Speed and reliability are directly relevant to the consistency of your product’s delivery; effort is more relevant to whether your people are doing high-value work, and whether they’re likely to continue to be happy working for you.

Implications …
for Observation (aka measurement, telemetry, metrics)

Implications …
• Make It Easy
• Make It Scalable
• Make It Pluggable
• (Eventually) Ruthlessly Cull
“What decision will this help me make?”

A Joke
I’d like to tell a very, very long joke. It started at Velocity 2011, when I heard someone at a presentation say “monitor all the things, because you never know what you might find useful one of these days.”

This is a graph representing about 380K datapoints, collected once every five minutes since June 2011. It’s a bit mysterious, I know.

It may help you to see that the lower and upper bounds of this graph are 48 and 52.

% of servers in major region
with an even IP address
This graph represents the percent of our cloud instances in a given production region which had an even IP address.
We can (and should, and I hope we do) laugh about this graph, but I’d bet you your monitoring system is chock full of similarly useless data; I know mine is. It impacts the cost of the system, but it also literally makes your job (and your customers’ jobs, if you’re responsible for the telemetry system) harder, because there’s much, much more chaff to wade through.

Implications …
for Orientation (aka graphing, visualization)

Implications …
• First-class product
• Different decisions require different viz
• Low cognitive load better than
• High refresh rates
• Deep data density

Implications …
for Decisions (aka alerting, real-time analytics, etc)
Alerts are a basic, primitive decision. Build on that.

Implications …
• You already have (some of) this
• Incremental improvement
• Sky’s the limit
• For benefits
• For cost

Implications …
for Action
If you’re thinking of creating a runbook, AUTOMATE IT.

Implications …
for Action
1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
Repeatable machine processes TROUNCE one-off human bureaucracy
4. Start with humans
5. If IFTTT, deprecate humans

Decision:
Do I Have Enough
Instances?
So let’s talk about a basic capacity quandary: Do I have enough instances in my cluster?

I showed this graph earlier. Our work volume is highly diurnal. We could, if we wanted, make our cluster sizes big enough to support peak workload and just accept the waste when the workload decreases; instead, we use Amazon’s auto-scaling group feature to automatically scale the clusters up and down in response to demand. So instead of trying to give users better telemetry on utilization, and making it easier for them to see whether they need to increase capacity, we just automate that decision (allowing them, of course, to override it whenever they want to).
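
As a rough illustration of what "automating that decision" can mean, here is a minimal sketch of a sizing rule. It is not Amazon's auto-scaling API and not the policy described in this talk; the function name `desired_instances`, the `target_per_instance` parameter, the size bounds, and all the numbers are assumptions made up for the example.

```python
def desired_instances(current_load, target_per_instance,
                      min_size=1, max_size=1000, override=None):
    """Return how many instances the cluster should run.

    current_load        -- work arriving right now (e.g. requests per second)
    target_per_instance -- load one instance should carry at a comfortable utilization
    override            -- a human-chosen size that always wins, if given
    """
    if override is not None:
        # Humans can overrule the automated decision whenever they want to.
        return override
    # Integer ceiling division: enough instances to carry the load.
    needed = -(-current_load // target_per_instance)
    # Never shrink below the floor or grow past the ceiling.
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # Hypothetical diurnal demand: trough, ramp, peak.
    for load in (1200, 4800, 9000):
        print(load, desired_instances(load, target_per_instance=60))
```

The point of the sketch is the shape of the decision: one input (demand), one output (cluster size), and an explicit escape hatch for a human override.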

Decision:
Is My Canary Good?
We use a deployment pattern called canary, where we compare the new version of the software to the baseline, also in production, and seek to answer a very simple question: Is our canary at least as good as our baseline system?
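
To make the question "is the canary at least as good as the baseline?" concrete, here is a minimal sketch of one way such a comparison could be scored. It is an illustration under stated assumptions, not the automated canary analysis system described in this talk: the functions `canary_score` and `canary_is_good`, the 10% tolerance, the 0.95 pass threshold, the metric names, and the sample values are all hypothetical, and every metric is assumed to be "lower is better".

```python
from statistics import mean


def canary_score(baseline, canary, tolerance=0.10):
    """Fraction of metrics on which the canary is no more than
    `tolerance` (as a ratio) worse than the baseline.

    Both arguments map a metric name to a list of samples.
    Assumes lower values are better for every metric.
    """
    passed = 0
    for name, base_samples in baseline.items():
        base = mean(base_samples)
        can = mean(canary[name])
        if can <= base * (1 + tolerance):
            passed += 1
    return passed / len(baseline)


def canary_is_good(baseline, canary, min_score=0.95):
    """The decision itself: promote only if enough metrics pass."""
    return canary_score(baseline, canary) >= min_score


if __name__ == "__main__":
    # Made-up samples for illustration only.
    baseline = {"cpu": [40, 42, 41], "errors": [2, 3, 2], "latency_ms": [100, 105, 98]}
    healthy = {"cpu": [41, 43, 42], "errors": [2, 2, 3], "latency_ms": [101, 104, 99]}
    broken = {"cpu": [41, 43, 42], "errors": [9, 10, 8], "latency_ms": [101, 104, 99]}
    print(canary_is_good(baseline, healthy))
    print(canary_is_good(baseline, broken))
```

A real analysis would need to handle metrics where higher is better, noisy samples, and unequal traffic between the two groups; the sketch only shows how a per-metric comparison can be turned into a single yes/no decision.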

Been there.
Done that.
Manually. Artisanally.
• Started in the Data Center
• Manual, dashboard-driven

[Graphs: CPU, Requests, Errors]

Been there.
Done that.
Manually.
• Context vs Precision
• No …
• Repeatability
• Trending
• Manual effort is manual

So Now What?
• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters

Pretty Pictures
[Diagram sequence. Components: Version Control System, Build & Deployment System, Automated Canary Analysis, Customers. Fleet: 1000 servers @ 1.0.1, with 1.0.2 appearing first on 1 server, then on 10 servers, then on 1000 servers.]

Just The Stats
4-Week View
6309 canary analysis cycles
16% canaries failed

Decision:
Do I Have an Outlier?

Outlier Detection
In an environment where you have a bunch of potentially-undifferentiated resources that should all behave approximately similarly, it becomes easy — and necessary, in a sufficiently large ecosystem — to notice outliers. If your cost for culling the outliers is low, you
can also do it automatically. If not, you can at least alert that One Of These Things Is No Longer Like The Others.
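
One simple way to express "one of these things is no longer like the others" is to compare each server with the group median. The sketch below uses a modified z-score based on the median absolute deviation (MAD); it is a stand-in chosen for illustration, not the detection method used in the system described in this talk. The function name `find_outliers`, the 3.5 threshold, and the sample error counts are assumptions.

```python
from statistics import median


def find_outliers(values_by_server, threshold=3.5):
    """Return the servers whose value sits far from the group median.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is much
    less distorted by the outlier itself than a mean-based score would be.
    """
    values = list(values_by_server.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the group is identical: anything different stands out.
        return [s for s, v in values_by_server.items() if v != med]
    return [s for s, v in values_by_server.items()
            if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    # Made-up error counts for one minute across nine hypothetical servers.
    errors = {"A": 97, "B": 11, "C": 9, "D": 10, "E": 12,
              "F": 10, "G": 8, "H": 11, "I": 10}
    print(find_outliers(errors))
```

If terminating an instance is cheap, the returned list can feed an automated cull; if not, it can feed an alert instead.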

Would You Like to Play a
Game?
Can I have a volunteer from the audience to run an experiment with me?

Spot the Outlier
So for training, imagine I’m giving you this information about nine servers, named A through I. Each row is a minute’s data for these servers — let’s say it’s load average, or error rates. I’m going to ask you to point out the server — or column — that looks materially
different from the others. This should be a relatively easy case, of course. Can you pick the server?

OK. Now, I’m going to time you doing the same with more interesting data.
Didn’t work so well? OK, let’s make it easier to orient and understand the numbers you’re looking at by showing this to you graphed.

It probably is easier, isn’t it? Can you easily point out the outlier?
OK, one last test. At the next slide, I’m going to show you some information (you can assume it’s true) and I want you to tell me which is the outlier, OK?

The
Outlier Is
“A”
That was … much easier, wasn’t it?
This is what happens when we let computers do this work. We could have spent more time and effort on a more powerful visualization that would have made it easier for you to notice the outlier. Instead, we built an analytics system that determines outliers automatically: it doesn’t make the work easier for you, it does the work for you.

Just The Stats
4-Week View
739 Server Terminations
We can use this for anything: pieces of content, or devices, or ISPs. So far, we’ve been using it for about ten clusters of servers, and in the last four weeks we have automatically identified (and terminated) 739 outliers.

In a Nutshell
Observe
Orient
Decide
Act
So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work
hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.

In a Nutshell
Observe
Orient
Decide
Act
Need This First
http://bit.ly/nflx-atlas-2013
http://metrics20.org
Understand the decision
http://bit.ly/nflx-qcon-aca-2014
Make it easier for humans
Make machines do it
Higher speed
Lower effort
Higher reliability

Questions, Attributions, Feedback
@royrapoport
rsr@netflix.com
linkedin.com/in/royrapoport
