Canary Analyze All the Things

Roy Rapoport
@royrapoport
June 12, 2014
Significant contributions by Chris Sanden, @chris_sanden

Oh, the Places We’ll Go!
• Introductions
• Proposed Use Case and Definition
• Continuous Improvement / MVP Model
• Issues, Solutions
• Cloud Considerations
• The Road at Netflix

A Word About Me …
• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
• Systems Engineering, Service Delivery in IT/Ops
• Troubleshooter and Builder of Python Things[tm] in Product Engineering
• Current role: Insight Engineering in Product Engineering
• Real-Time Operational Insight

A Word About Netflix…
Just the Stats
• 16 years
• 2000+ employees
• 48 million users
• 5x10^9 hours/quarter

Freedom and Responsibility Culture
• Optimize speed of innovation; constrain availability; cost will be what cost will be
• Hire smart (experienced) people; get out of their way
• Anti-process bias

Technology and Operations
• Service Oriented Architecture
• Decentralized Operations. You:
• Build
• Test
• Deploy
• Set up alerting and monitoring
• Wake up at 2AM

Proposed Use Case and Definition

So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat
{"response": "meow"}
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog
{"response": "woof"}
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader

You Need Better Testing!
Well, yeah
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully

Detour
Rate of Change vs Availability
[Figure: Availability (nines, 0 to 6) plotted against Rate of Change (1 to 1000, log scale); the second build adds the labels Operations and Engineering.]

You Need Better Deployments!
Canary Analysis!
• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.
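
The definition above can be read as a loop over progressively larger canary groups with a checkpoint after each step. The following is a minimal Python sketch under stated assumptions: `deploy`, `analyze`, and `rollback` are hypothetical callables standing in for a real deployment system and analysis step, and the stage sizes are only illustrative.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    analyze: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll a change out in stages, stopping at the first failed checkpoint.

    All three callables are hypothetical placeholders for this sketch.
    """
    for canary_count in stages:
        deploy(canary_count)           # grow the canary group to this size
        if not analyze(canary_count):  # checkpoint: go/no-go decision
            rollback()                 # no-go: undo the canary deployment
            return False
    return True                        # every checkpoint said "go"


if __name__ == "__main__":
    log = []
    ok = staged_rollout(
        stages=[1, 10, 1000],
        deploy=lambda n: log.append(f"deploy {n}"),
        analyze=lambda n: n < 1000,    # pretend the last checkpoint fails
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

Keeping the decision (`analyze`) separate from the execution (`deploy`, `rollback`) means either side can start out manual and be automated later.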

Canary Analysis Is Not
• A replacement for any sort of software testing
• A/B Testing
• Releasing 100% to production and hoping for the best

One Possible Process
[Diagram components: Version Control System, Build & Deployment System, Automated Canary Analysis, server groups, Customers. The server groups change across the builds:]
• 1000 servers @ 1.0.1
• 1000 servers @ 1.0.1 plus 1 server @ 1.0.2
• 1000 servers @ 1.0.1 plus 10 servers @ 1.0.2
• 1000 servers @ 1.0.1 plus 1000 servers @ 1.0.2
• 1000 servers @ 1.0.2 only
• 1000 servers @ 1.0.1 plus 1000 servers @ 1.0.2 (last slide of the sequence)

Continuous Improvement / MVP Model

Are We There Yet?
• We’re not
• You’re probably not either

Minimally …
• Observability
• Partial traffic routing
• Decision-making

Better Yet …
• Focus on the Goal
• Current Baseline Matters
26% fewer errors in canary
• Observability segregation

Hold On a Minute!
Mission Accomplished
30% fewer requests handled in canary
• Absolute numbers are relatively unimportant
• Relative numbers matter
• Error rate
• RPS per CPU cycle
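
To make the point about relative numbers concrete, here is a small Python sketch. The `GroupStats` shape, the function names, and the sample figures are all assumptions made for illustration, and requests per CPU-second is used as a stand-in for "RPS per CPU cycle".

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Totals for one server group over the same time window (hypothetical shape)."""
    requests: int
    errors: int
    cpu_seconds: float


def error_rate(stats: GroupStats) -> float:
    """Errors per request, so groups handling different traffic stay comparable."""
    return stats.errors / stats.requests if stats.requests else 0.0


def requests_per_cpu_second(stats: GroupStats) -> float:
    """Requests handled per CPU-second consumed (stand-in for RPS per CPU cycle)."""
    return stats.requests / stats.cpu_seconds if stats.cpu_seconds else 0.0


def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of canary versus baseline; -0.25 means 25% lower."""
    if baseline == 0:
        raise ValueError("baseline value is zero; relative change is undefined")
    return (canary - baseline) / baseline


if __name__ == "__main__":
    # Made-up numbers: the canary handles 30% fewer requests in absolute terms.
    baseline = GroupStats(requests=10_000, errors=100, cpu_seconds=500.0)
    canary = GroupStats(requests=7_000, errors=70, cpu_seconds=350.0)

    print("requests:", relative_change(canary.requests, baseline.requests))
    print("error rate:", relative_change(error_rate(canary), error_rate(baseline)))
    print("requests per CPU-second:",
          relative_change(requests_per_cpu_second(canary),
                          requests_per_cpu_second(baseline)))
```

With these made-up figures the absolute request count looks alarming, while the error rate and the work done per CPU-second match the baseline.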

So You’ve Got Your Graphs
Requests Rate Comparison
[Figure: requests graph]
          Type       RAM     Cores  Cost
Baseline  m3.medium  3.75GB  3      $.11/hr
Canary    m1.small   1.7GB   1      $.06/hr

Automating …
• Decision
• Execution

A Quick Recap
• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution

Issues, Solutions

To Save You Some Time …
Not all metrics are created equal
• Focus on System and Application Metrics
• Weight by category (system, latency, etc.)
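
One way to picture "weight by category" is a weighted average of per-category scores. This is only a sketch under assumptions: the category names, the 0.0 to 1.0 score scale, and the weights are hypothetical and not taken from the talk.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine per-category scores (0.0 to 1.0) into one weighted average.

    Categories without an explicit weight default to a weight of 1.0.
    """
    if not scores:
        raise ValueError("no category scores to combine")
    total_weight = sum(weights.get(category, 1.0) for category in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(
        score * weights.get(category, 1.0) for category, score in scores.items()
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical per-category results for one canary run.
    scores = {"system": 1.0, "latency": 0.5, "errors": 1.0}
    weights = {"system": 1.0, "latency": 2.0, "errors": 3.0}
    print(weighted_score(scores, weights))
```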

Outliers are out, lying
• Use a group of servers
• Balance fidelity with customer impact
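
A short Python sketch of why a group of servers helps: summarizing the group with a median keeps one misbehaving instance from dominating the comparison. The median is a technique chosen here for illustration, not something the slides specify, and the sample values are made up.

```python
from statistics import mean, median
from typing import Sequence


def group_summary(per_server_values: Sequence[float]) -> float:
    """Summarize a group by its median so a single odd server cannot dominate."""
    if not per_server_values:
        raise ValueError("need at least one server in the group")
    return median(per_server_values)


if __name__ == "__main__":
    # Hypothetical error rates from four canary servers; the last one is an outlier.
    canary_error_rates = [0.010, 0.012, 0.011, 0.300]
    print("mean:  ", mean(canary_error_rates))
    print("median:", group_summary(canary_error_rates))
```

A larger group gives a more trustworthy summary but exposes more customers to the new code, which is the fidelity versus impact trade-off named above.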

Exercise without warmup can result in injury
• Repeat canary analysis frequently
• Both traffic and startup time are factors
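
As a sketch of the warmup point, the analysis can ignore readings taken before an instance has been up long enough and has served enough traffic. The `Sample` fields and both thresholds are hypothetical; real values would depend on the application.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    age_seconds: float    # time since the instance started
    requests_served: int  # cumulative requests handled by the instance so far
    value: float          # the metric reading, e.g. latency in milliseconds


def warm_samples(
    samples: Iterable[Sample],
    min_age_seconds: float,
    min_requests: int,
) -> List[float]:
    """Keep only readings taken after both the time and traffic thresholds are met."""
    return [
        s.value
        for s in samples
        if s.age_seconds >= min_age_seconds and s.requests_served >= min_requests
    ]


if __name__ == "__main__":
    readings = [
        Sample(age_seconds=10, requests_served=5, value=900.0),     # just started
        Sample(age_seconds=400, requests_served=50, value=300.0),   # old enough, little traffic
        Sample(age_seconds=700, requests_served=5000, value=120.0), # warmed up
    ]
    print(warm_samples(readings, min_age_seconds=300, min_requests=1000))
```

Re-running the analysis at intervals, rather than once, gives later runs a chance to see the canary after it has warmed up.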

Vive la différence!
• Hot-OK, Cold-OK
• Let Application Owners Choose

Signal is better than no1$#[NO CARRIER]
• Ignore weak signals
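
One possible reading of "ignore weak signals" is to skip metrics that have too few data points or differ too little to matter. The sketch below assumes that reading; the function name, the result labels, and both thresholds are hypothetical.

```python
from statistics import mean
from typing import Sequence


def classify_metric(
    canary: Sequence[float],
    baseline: Sequence[float],
    min_samples: int = 30,
    min_relative_diff: float = 0.05,
) -> str:
    """Label one metric as 'higher', 'lower', 'same', or 'insufficient data'."""
    if len(canary) < min_samples or len(baseline) < min_samples:
        return "insufficient data"   # too few points to trust
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        return "insufficient data"   # relative difference is undefined
    diff = (mean(canary) - baseline_mean) / abs(baseline_mean)
    if abs(diff) < min_relative_diff:
        return "same"                # difference too small to act on
    return "higher" if diff > 0 else "lower"


if __name__ == "__main__":
    print(classify_metric([1.0] * 5, [1.0] * 5))
    print(classify_metric([1.02] * 30, [1.0] * 30))
    print(classify_metric([2.0] * 30, [1.0] * 30))
```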

Cloud Considerations

Good News
• Software-Defined Everything
• Incremental Pricing

Bad News
• Capacity Management
• Unpredictable Inconsistency

The Road at Netflix

Numbers
• 752 services in production
• In-house telemetry platform
• A few metrics

Been there. Done that. Manually. Artisanally.
• Started in the Data Center
• Manual, dashboard-driven
[Dashboard graphs: CPU, Requests, Errors]
• Context vs Precision
• No …
• Repeatability
• Trending
• Manual effort is manual

So Now What?
• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters

For Our Next Trick …
• Configuration GUI
• Deployment System Integration
• ACA (Automated Canary Analysis) All The Things
• OpenConnect firmware updates
• Client software changes
• Configuration changes in production

Summary
• Canary Analysis makes your changes
• Safer
• Faster
• Easier
• Most people can start doing it
• Everyone can do it better

Questions, Attributions, Feedback
• https://www.flickr.com/photos/cseeman
• https://www.flickr.com/photos/ransomtech
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/andresthor/
• https://www.flickr.com/photos/pkdesigns
@royrapoport
