Canary Analyze All the Things
Roy Rapoport 
@royrapoport 
June 12, 2014 
Significant contributions by Chris Sanden, @chris_sanden 
1
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/canary-analysis-deployment-pattern
Presented at QCon New York 
www.qconnewyork.com 
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
2
A Word About Me … 
• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
• Systems Engineering, Service Delivery in IT/Ops
• Troubleshooter and Builder of Python Things[tm] in Product Engineering
• Current role: Insight Engineering in Product Engineering
• Real-Time Operational Insight
3
A Word About Netflix… 
Just the Stats 
• 16 years
• 2000+ employees
• 48 million users
• 5x10^9 hours/quarter
4
A Word About Netflix… 
Freedom and Responsibility Culture 
• Optimize speed of innovation
Constrain availability
Cost will be what cost will be
• Hire smart (experienced) people
Get out of their way
• Anti-process bias
5
A Word About Netflix… 
Technology and Operations 
• Service Oriented Architecture
• Decentralized Operations. You
• Build
• Test
• Deploy
• Set up alerting and monitoring
• Wake up at 2AM
6
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
7
Why Canary Analysis? 
8
So You’ve Just Done a Release 
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat
{"response": "meow"}
9
So You’ve Just Done a Release 
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog
{"response": "woof"}
10
So You’ve Just Done a Release 
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader
11
You Need Better Testing! 
Well, yeah 
12
You Need Better Testing! 
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully
13
Detour
Rate of Change vs Availability
[Chart: Availability (nines), 0 to 6, plotted against Rate of Change, 1 to 1000; labeled Operations and Engineering.]
14
You Need Better Testing! Deployments!
Canary Analysis 
• A deployment process where 
• a new change (in behavior, code, or both) 
• is rolled out into production gradually, 
• with checkpoints along the way to examine the new (canary) systems 
• (optionally versus the old (baseline) systems) 
• and make go/no-go decisions. 
15
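The definition above (gradual rollout, checkpoints, go/no-go decisions) can be sketched as a small loop. This is a minimal illustration, not the process used at Netflix: `staged_rollout`, `deploy`, and `canary_ok` are hypothetical names, and rolling back by deploying zero canary servers is an assumption made for the example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    canary_ok: Callable[[int], bool],
) -> bool:
    """Deploy to each stage size in turn; stop and roll back on the first no-go."""
    for size in stages:
        deploy(size)             # grow the canary group to `size` servers
        if not canary_ok(size):  # checkpoint: examine the canary systems
            deploy(0)            # no-go: take the new version out of service
            return False
    return True                  # every checkpoint said go


# Toy wiring (hypothetical): record what would be deployed,
# and make the 10-server checkpoint fail.
deployed: List[int] = []
result = staged_rollout([1, 10, 1000], deployed.append, lambda size: size < 10)
```

In this toy run the rollout reaches 10 servers, fails its checkpoint, and is rolled back before the 1000-server stage is ever attempted.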
Canary Analysis Is Not 
• A replacement for any sort of software testing
• A/B Testing
• Releasing 100% to production and hoping for the best
16
One Possible Process
[Diagram: a commit enters the Version Control System; the Build & Deployment System builds and deploys version 1.0.2 in stages (1 server, 10 servers, 1000 servers), with Automated Canary Analysis giving a “go” between stages. Customers are shown with 1000 servers @ 1.0.1.]
17
One Possible Process
[Diagram: the same pipeline after Automated Canary Analysis returns “go”, showing 1000 servers @ 1.0.1 and 1000 servers @ 1.0.2.]
18
One Possible Process
[Diagram: the same pipeline after Automated Canary Analysis returns “no go”, showing 1000 servers @ 1.0.1 and 1000 servers @ 1.0.2.]
19
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
20
Are We There Yet? 
• We’re not 
• You’re probably not either 
21
Minimally … 
• Observability 
• Partial traffic routing 
• Decision-making 
22
Better Yet … 
• Focus on the Goal 
• Current Baseline Matters 
• Observability segregation 
26% fewer errors in canary 
23
Hold On a Minute! 
26% fewer errors in canary 
Mission 
Accomplished 
24
Hold On a Minute! 
26% fewer errors in canary 
Mission 
Accomplished 
30% fewer requests handled in canary 
25
Hold On a Minute! 
26
Hold On a Minute! 
• Absolute numbers are relatively unimportant
• Relative numbers matter 
• Error rate 
• RPS per CPU cycle 
27
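A worked example shows why “26% fewer errors” is not good news on its own once the canary also handled 30% fewer requests. The baseline counts below (100 errors in 1000 requests) are hypothetical numbers chosen for the arithmetic; only the two percentages come from the slides.

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors per request: a relative number that is comparable across groups."""
    return errors / requests


def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of the canary value relative to the baseline value."""
    return (canary - baseline) / baseline


# Hypothetical baseline counts, picked only to make the arithmetic easy to follow.
baseline_errors, baseline_requests = 100, 1000

# From the slides: 26% fewer errors and 30% fewer requests handled in the canary.
canary_errors, canary_requests = 74, 700

baseline_rate = error_rate(baseline_errors, baseline_requests)
canary_rate = error_rate(canary_errors, canary_requests)
change = relative_change(canary_rate, baseline_rate)

print(f"baseline error rate: {baseline_rate:.4f}")
print(f"canary error rate:   {canary_rate:.4f}")
print(f"relative change:     {change:+.1%}")
```

With these numbers the canary's error rate is higher than the baseline's even though its absolute error count is lower, which is the point of comparing rates rather than counts.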
So You’ve Got Your Graphs
[Graph: Requests Rate Comparison (requests)]

          Type       RAM     Cores  Cost
Baseline  m3.medium  3.75GB  3      $.11/hr
Canary    m1.small   1.7GB   1      $.06/hr
28
So You’ve Got Your Graphs 
29
Automating … 
• Decision 
• Execution 
30
A Quick Recap 
• Observe 
• Segregate metrics 
• Partial deploy 
• Compare to Baseline 
• Absolutes are never right 
• Automate decision 
• Automate execution 
31
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
32
To Save You Some Time … 
Not all metrics are created equal
Focus on System and Application Metrics
Weight by category (system, latency, etc)
33
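One way to read “weight by category” is as a weighted average of per-category scores. The sketch below assumes each category has already been reduced to a score between 0 (bad) and 1 (good); the `errors` category, the weights, and the function name are illustrative assumptions, not values from the talk.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-category scores, using only the categories present."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-category scores in [0, 1] and hypothetical weights.
scores = {"system": 1.0, "latency": 0.5, "errors": 0.0}
weights = {"system": 1.0, "latency": 2.0, "errors": 3.0}

overall = weighted_score(scores, weights)  # (1.0*1 + 0.5*2 + 0.0*3) / 6
print(f"overall canary score: {overall:.2f}")
```

Giving a category a larger weight means a poor result there pulls the overall score down more than an equally poor result in a lightly weighted category.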
To Save You Some Time … 
Outliers are out, lying
Use a group of servers
Balance fidelity with customer impact
34
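The effect of one outlying server on a group can be illustrated with made-up latency numbers. Summarizing the group by its median is an assumption chosen for this sketch; the slides say only to use a group of servers, not which statistic to apply.

```python
from statistics import mean, median

# Hypothetical per-server latencies in milliseconds; one canary server is an outlier.
canary_latency_ms = [102.0, 98.0, 101.0, 950.0, 99.0]
baseline_latency_ms = [100.0, 97.0, 103.0, 99.0, 101.0]

# The mean is dragged far from the typical server by the single outlier...
canary_mean = mean(canary_latency_ms)
# ...while the median of the group stays close to what most servers report.
canary_median = median(canary_latency_ms)
baseline_median = median(baseline_latency_ms)

print(f"canary mean:     {canary_mean:.1f} ms")
print(f"canary median:   {canary_median:.1f} ms")
print(f"baseline median: {baseline_median:.1f} ms")
```

A single-server canary would have no such protection: if that one server happened to be the outlier, the comparison would be dominated by it.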
To Save You Some Time … 
Exercise without warmup can result in injury
Repeat canary analysis frequently
Both traffic and startup time are factors
35
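One simple way to account for warmup is to drop the samples taken before a warmup window has elapsed and analyze only the rest. The sample layout (seconds since start, value), the 60-second window, and the latency figures are all assumptions for illustration; this handles startup time only, not the traffic factor the slide also mentions.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (seconds since instance start, metric value)


def after_warmup(samples: List[Sample], warmup_seconds: int) -> List[float]:
    """Keep only the values recorded at or after the end of the warmup window."""
    return [value for seconds, value in samples if seconds >= warmup_seconds]


# Hypothetical latency samples (ms) from a freshly started canary instance.
samples: List[Sample] = [(0, 900.0), (30, 400.0), (60, 110.0), (90, 105.0), (120, 100.0)]

steady_state = after_warmup(samples, warmup_seconds=60)
print(steady_state)
```

Comparing the cold-start samples against a long-running baseline would make almost any canary look bad, so only the post-warmup values are passed on to the comparison.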
To Save You Some Time … 
vive la différence!
Hot-OK, Cold-OK
Let Application Owners Choose
36
To Save You Some Time … 
Signal is better than no1$#[NO CARRIER]
Ignore weak signals
37
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
38
Good News 
• Software-Defined Everything 
• Incremental Pricing 
39
Bad News 
• Capacity Management 
• Unpredictable Inconsistency 
40
Oh, the Places We’ll Go! 
• Introductions 
• Proposed Use Case and Definition 
• Continuous Improvement / MVP Model 
• Issues, Solutions 
• Cloud Considerations 
• The Road at Netflix 
41
Numbers 
• 752 services in production 
• In-house telemetry platform 
• A few metrics 
42
Been there. 
Done that. 
Manually. Artisanally 
• Started in the Data Center 
• Manual, dashboard-driven 
43
Been there. 
Done that. 
Manually. 
[Graphs: Errors, Requests, CPU]
44
Been there. 
Done that. 
Manually. 
• Context vs Precision 
• No … 
• Repeatability 
• Trending 
• Manual effort is manual 
48
So Now What? 
• Automate Analysis 
• Took Some Effort 
• Approach and analytics 
• Presentation matters 
49
Automated Canary Analysis 
50
For Our Next Trick … 
• Configuration GUI 
• Deployment System Integration 
• ACA All The Things 
• OpenConnect firmware updates 
• Client software changes 
• Configuration changes in production 
55
Summary 
• Canary Analysis makes your changes 
• Safer 
• Faster 
• Easier 
• Most people can start doing it 
• Everyone can do it better 
56
Questions, Attributions, Feedback
http://bit.ly/qcon-netflix?
• https://www.flickr.com/photos/cseeman
• https://www.flickr.com/photos/ransomtech
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/andresthor/
• https://www.flickr.com/photos/dougbrown47
• https://www.flickr.com/photos/pkdesigns
@royrapoport
rsr@netflix.com
57

Canary Analyze All The Things: How We Learned to Keep Calm and Release Often
