Operational Insight
June 15, 2015
Roy Rapoport
@royrapoport / linkedin.com/in/royrapoport / rrapoport@netflix.com
Oh, The Places We’ll Go!
John Boyd
Observe
Orient
Decide
Act
OODA
“This approach favors agility over raw power in dealing with human
opponents in any endeavor” - Wikipedia
This Is What We Do
OODA KPI
Speed Effort Reliability
Winning
Speed Effort Reliability
Implications …
for Observation (aka measurement, telemetry, metrics)
• Make It Easy
• Make It Scalable
• Make it pluggable
• (Eventually) Ruthlessly Cull
“What decision will this help me make?”
A Joke
52
48
% of servers in major region with an even IP address
Implications …
for Orientation (aka graphing, visualization)
• First-class product
• Different decisions require different viz
• Low cognitive load better than
  • High refresh rates
  • Deep data density
Better Like This …
Or Better Like That …
Implications …
for Decisions (aka alerting, real-time analytics, etc)
• You already have (some of) this
• Incremental improvement
• Sky’s the limit
  • For benefits
  • For cost
Implications …
for Action
1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
4. Start with humans
5. If IFTTT, deprecate humans
Repeatable machine processes TROUNCE one-off human
bureaucracy
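Item 5 can be read as: once a decision reduces to an if-this-then-that rule, a machine can take the action without a human in the loop. A minimal Python sketch of that idea follows. The `Rule` type, the metric names, the thresholds, and the action strings are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    """One if-this-then-that rule: a condition on metrics and the action to take."""
    name: str
    condition: Callable[[Metrics], bool]
    action: Callable[[Metrics], str]


def run_rules(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Evaluate every rule against the observed metrics; return the actions taken."""
    return [rule.action(metrics) for rule in rules if rule.condition(metrics)]


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule(
        name="scale-up",
        condition=lambda m: m["cpu_percent"] > 80.0,
        action=lambda m: "add instances",
    ),
    Rule(
        name="rollback",
        condition=lambda m: m["error_rate"] > 0.05,
        action=lambda m: "roll back deployment",
    ),
]

if __name__ == "__main__":
    observed = {"cpu_percent": 91.0, "error_rate": 0.01}
    print(run_rules(RULES, observed))
```

Keeping the rules as data makes each one repeatable and reviewable, which is the contrast with one-off human steps drawn in the list above.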
Decision:
Do I Have Enough Instances?
Decision:
Is My Canary Good?
Been there. Done that. Manually. Artisanally.
• Started in the Data Center
• Manual, dashboard-driven
CPU, Requests, Errors
Been there. Done that. Manually.
• Context vs Precision
• No …
  • Repeatability
  • Trending
• Manual effort is manual
So Now What?
• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters
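To make "Automate Analysis" concrete, here is a minimal Python sketch that compares a canary's metrics with the baseline's and turns the comparison into a pass/fail decision. It is not Netflix's Automated Canary Analysis. The metric names (borrowed from the CPU, requests, and errors panels mentioned earlier), the sample values, the 20% tolerance, and the scoring scheme are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, List

Samples = Dict[str, List[float]]


def canary_score(baseline: Samples, canary: Samples, tolerance: float = 0.20) -> float:
    """Return the fraction of metrics whose canary mean is within tolerance of baseline."""
    if not baseline:
        raise ValueError("baseline must contain at least one metric")
    passed = 0
    for metric, base_values in baseline.items():
        base_mean = mean(base_values)
        canary_mean = mean(canary[metric])
        # Relative deviation from the baseline; guard against a zero baseline.
        if base_mean == 0:
            deviation = 0.0 if canary_mean == 0 else float("inf")
        else:
            deviation = abs(canary_mean - base_mean) / abs(base_mean)
        if deviation <= tolerance:
            passed += 1
    return passed / len(baseline)


def is_canary_good(baseline: Samples, canary: Samples, pass_score: float = 1.0) -> bool:
    """Decide: the canary is good only if its score reaches pass_score."""
    return canary_score(baseline, canary) >= pass_score


# Hypothetical samples, for illustration only.
baseline = {"cpu": [40, 42, 41], "requests": [1000, 990, 1010], "errors": [5, 6, 5]}
good = {"cpu": [41, 43, 42], "requests": [995, 1005, 1000], "errors": [5, 5, 6]}
bad = {"cpu": [41, 43, 42], "requests": [995, 1005, 1000], "errors": [15, 14, 16]}

if __name__ == "__main__":
    print(is_canary_good(baseline, good))
    print(is_canary_good(baseline, bad))
```

A single score per canary also gives something that can be trended over time, which the manual, dashboard-driven approach lacked.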
Pretty Pictures
[Diagram, built up across several slides. Components: Version Control System; Build & Deployment System; Automated Canary Analysis; Customers. Server groups shown: 1000 servers @ 1.0.1, then 1 server @ 1.0.2, then 10 servers @ 1.0.2, then 1000 servers @ 1.0.2.]
Just The Stats
4-Week View
6309 canary analysis cycles
16% canaries failed
Decision:
Do I Have an Outlier?
Outlier Detection
Would You Like to Play a Game?
Spot the Outlier
The Outlier Is “A”
Just The Stats
4-Week View
739 Server Terminations
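One way a machine could answer "Do I have an outlier?" is to compare each server with its peers. The sketch below uses the median absolute deviation (MAD) with a modified z-score. This is a common robust technique chosen for illustration, not the method described in the talk; the server names, the error-rate values, and the 3.5 cutoff are assumptions.

```python
from statistics import median
from typing import Dict, List


def find_outliers(values: Dict[str, float], cutoff: float = 3.5) -> List[str]:
    """Return the keys whose modified z-score exceeds the cutoff."""
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    if mad == 0:
        # At least half the peers sit exactly on the median; flag anything that differs.
        return [k for k, v in values.items() if v != center]
    # 0.6745 scales the MAD so the score is comparable to a z-score on normal data.
    return [k for k, v in values.items() if 0.6745 * abs(v - center) / mad > cutoff]


# Hypothetical per-server error rates, for illustration only.
error_rates = {"A": 9.5, "B": 1.1, "C": 0.9, "D": 1.0, "E": 1.2}

if __name__ == "__main__":
    print(find_outliers(error_rates))
```

The median and MAD are used instead of the mean and standard deviation so that the outlier itself does not distort the yardstick it is measured against.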
In a Nutshell
Observe
Orient
Decide
Act
Need This First
http://bit.ly/nflx-atlas-2013
http://metrics20.org
Understand the decision
http://bit.ly/nflx-qcon-aca-2014
Make it easier for humans
Make machines do it
Higher speed
Lower effort
Higher reliability
Questions, Attributions, Feedback
@royrapoport
rsr@netflix.com
linkedin.com/in/royrapoport

Operational Insight: Concepts and Examples (w/o Presenter Notes)