Canary Analyze All The Things: How We Learned to Keep Calm and Release Often
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1ph8Rq1.

Roy Rapoport discusses canary analysis deployment and observability patterns that he believes are generally useful, and talks about the difference between manual and automated canary analysis. Filmed at qconnewyork.com.

Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.

Published in: Technology

Transcript

  • 1. Canary Analyze All the Things. Roy Rapoport, @royrapoport. June 12, 2014. Significant contributions by Chris Sanden, @chris_sanden
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/canary-analysis-deployment-pattern InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 5. A Word About Me … • About 20 years in technology • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1809 days (4y:11m:14d) • At Netflix: • Systems Engineering, Service Delivery in IT/Ops • Troubleshooter and Builder of Python Things[tm] in Product Engineering • Current role: Insight Engineering in Product Engineering • Real-Time Operational Insight
  • 6. A Word About Netflix… Just the Stats • 16 years • 2000+ employees • 48 million users • 5x10^9 hours/quarter
  • 7. A Word About Netflix… Freedom and Responsibility Culture • Optimize speed of innovation; constrain availability; cost will be what cost will be • Hire smart (experienced) people; get out of their way • Anti-process bias
  • 8. A Word About Netflix… Technology and Operations • Service Oriented Architecture • Decentralized Operations. You: • Build • Test • Deploy • Set up alerting and monitoring • Wake up at 2AM
  • 9. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 10. Why Canary Analysis?
  • 11. So You’ve Just Done a Release > curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat {"response": "meow"}
  • 12. So You’ve Just Done a Release > curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog {"response": "woof"}
  • 13. So You’ve Just Done a Release > curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox {"response": "wa-pa-pa-pa-pa-pa-pow"} The correct answer to “what does the fox say?” is left as an exercise for the reader
  • 14. You Need Better Testing! Well, yeah
  • 15. You Need Better Testing! “I’m going to push to production, though I’m pretty sure it’s going to kill the system” - Said no one, ever* (* Hopefully)
  • 16. Detour: Rate of Change vs Availability [Figure: Availability (nines), axis 0 to 6, against Rate of Change, axis 1, 10, 100, 1000; labels: Operations, Engineering]
  • 17. You Need Better Testing! Deployments! Canary Analysis • A deployment process where • a new change (in behavior, code, or both) • is rolled out into production gradually, • with checkpoints along the way to examine the new (canary) systems • (optionally versus the old (baseline) systems) • and make go/no-go decisions.
  • 18. Canary Analysis Is Not • A replacement for any sort of software testing • A/B Testing • Releasing 100% to production and hoping for the best
  • 19. One Possible Process [Diagram elements: Version Control System (commit); Build & Deployment System (build, deploy); Automated Canary Analysis (go); Customers; server groups: 1000 servers @ 1.0.1, 1 server @ 1.0.2, 10 servers @ 1.0.2, 1000 servers @ 1.0.2]
  • 20. One Possible Process [Diagram elements: Version Control System; Build & Deployment System; Automated Canary Analysis (go); Customers; server groups: 1000 servers @ 1.0.1, 1000 servers @ 1.0.2]
  • 21. One Possible Process [Diagram elements: Version Control System; Build & Deployment System; Automated Canary Analysis (go, no); Customers; server groups: 1000 servers @ 1.0.1, 1000 servers @ 1.0.2]
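
The staged process on slides 19 to 21 could be sketched roughly as below. This is a minimal illustration under stated assumptions, not Netflix's deployment system: the stage sizes (1, 10, 1000 servers) come from the slides, while `run_canary_rollout` and the `deploy`, `analyze`, and `rollback` callables are hypothetical names introduced only for this example.

```python
from typing import Callable, Sequence


def run_canary_rollout(
    version: str,
    stages: Sequence[int],
    deploy: Callable[[str, int], None],
    analyze: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Roll `version` out stage by stage with a checkpoint after each stage.

    Returns True if every checkpoint said "go", False after the first "no-go".
    All three callables are hypothetical hooks supplied by the caller.
    """
    for size in stages:
        deploy(version, size)        # grow the canary group to `size` servers
        if not analyze(version):     # checkpoint: go / no-go decision
            rollback(version)        # no-go: take the new version back out
            return False
    return True


# Usage with stand-in hooks that only record what would happen.
log = []
ok = run_canary_rollout(
    "1.0.2",
    stages=[1, 10, 1000],
    deploy=lambda v, n: log.append(f"deploy {v} to {n} server(s)"),
    analyze=lambda v: True,
    rollback=lambda v: log.append(f"rollback {v}"),
)
print(ok, log)
```

Keeping the decision (`analyze`) separate from the execution (`deploy`, `rollback`) lets either side be manual at first and automated later.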
  • 22. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 23. Are We There Yet? • We’re not • You’re probably not either
  • 24. Minimally … • Observability • Partial traffic routing • Decision-making
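
Partial traffic routing can be done in many ways (load balancer weights, DNS, instance counts). As one hedged sketch, the function below hashes a request identifier so that a fixed fraction of requests goes to the canary. The name `routes_to_canary` and the use of a request ID as the routing key are assumptions for illustration; the talk does not say how Netflix routes canary traffic.

```python
import hashlib


def routes_to_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically send roughly `canary_fraction` of requests to the canary.

    The same request_id always gets the same answer, which keeps the
    split stable across retries.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < canary_fraction * 2**64


# Usage: count how many of 10,000 hypothetical request IDs land on the canary.
hits = sum(routes_to_canary(f"req-{i}", 0.10) for i in range(10_000))
print(f"{hits} of 10000 requests routed to canary")
```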
  • 25. Better Yet … • Focus on the Goal • Current Baseline Matters • Observability segregation [Callout: 26% fewer errors in canary]
  • 26. Hold On a Minute! 26% fewer errors in canary. Mission Accomplished
  • 27. Hold On a Minute! 26% fewer errors in canary. Mission Accomplished. 30% fewer requests handled in canary
  • 28. Hold On a Minute!
  • 29. Hold On a Minute! • Absolute numbers are relatively unimportant • Relative numbers matter • Error rate • RPS per CPU cycle
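
To see why the absolute numbers on slides 26 and 27 mislead, it helps to work the arithmetic. The sketch below uses made-up baseline counts (100,000 requests, 1,000 errors), which are assumptions; only the two percentages come from the slides. With 26% fewer errors but 30% fewer requests, the canary's error rate comes out about 5.7% higher than the baseline's, not lower.

```python
def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of the canary value relative to the baseline value."""
    return canary / baseline - 1.0


# Hypothetical baseline counts; only the ratios below come from the slides.
baseline_requests = 100_000
baseline_errors = 1_000

canary_requests = baseline_requests * (1 - 0.30)   # 30% fewer requests handled
canary_errors = baseline_errors * (1 - 0.26)       # 26% fewer errors

baseline_rate = baseline_errors / baseline_requests
canary_rate = canary_errors / canary_requests

print(f"baseline error rate: {baseline_rate:.4%}")
print(f"canary error rate:   {canary_rate:.4%}")
print(f"relative change:     {relative_change(canary_rate, baseline_rate):+.1%}")
```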
  • 30. So You’ve Got Your Graphs [Graph: Requests Rate Comparison; label: requests]
                  Type       RAM     Cores  Cost
        Baseline  m3.medium  3.75GB  3      $.11/hr
        Canary    m1.small   1.7GB   1      $.06/hr
  • 31. So You’ve Got Your Graphs
  • 32. Automating … • Decision • Execution
  • 33. A Quick Recap • Observe • Segregate metrics • Partial deploy • Compare to Baseline • Absolutes are never right • Automate decision • Automate execution
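
The "compare to baseline" and "automate decision" steps in the recap could look something like the sketch below. It is an assumption-laden illustration, not the analysis Netflix uses: it treats every metric as a normalized, lower-is-better rate, and the function name `canary_decision` and the 10% tolerance are hypothetical.

```python
from typing import Dict, List, Tuple


def canary_decision(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    tolerance: float = 0.10,
) -> Tuple[str, List[str]]:
    """Return ("go", []) or ("no-go", [names of failing metrics]).

    Each metric is assumed to be a relative, lower-is-better value
    (for example errors per request). A metric fails when the canary
    is more than `tolerance` worse than the baseline.
    """
    failing = [
        name
        for name, base_value in baseline.items()
        if canary[name] > base_value * (1 + tolerance)
    ]
    return ("no-go", failing) if failing else ("go", [])


# Usage with made-up numbers: latency is 25% worse, so the answer is no-go.
baseline_metrics = {"error_rate": 0.010, "latency_ms": 120.0}
canary_metrics = {"error_rate": 0.0106, "latency_ms": 150.0}
print(canary_decision(baseline_metrics, canary_metrics))
```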
  • 34. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 35. To Save You Some Time … Not all metrics are created equal. Focus on System and Application Metrics. Weight by category (system, latency, etc)
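
Weighting by category might be sketched as a weighted average of per-category scores. The categories, the 0 to 100 score scale, and the weights below are all assumptions chosen for illustration; the slide only says that categories such as system and latency are weighted.

```python
from typing import Dict


def weighted_score(category_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-category scores into one score using per-category weights."""
    total_weight = sum(weights[c] for c in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(category_scores[c] * weights[c] for c in category_scores) / total_weight


# Usage: hypothetical scores where errors count three times as much as system metrics.
scores = {"system": 90.0, "latency": 60.0, "errors": 100.0}
weights = {"system": 1.0, "latency": 2.0, "errors": 3.0}
print(weighted_score(scores, weights))
```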
  • 36. To Save You Some Time … Outliers are out, lying. Use a group of servers. Balance fidelity with customer impact
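
One way to keep a single misbehaving server from deciding the outcome is to summarize a group of canary servers with a robust statistic. The sketch below uses the median, which is a choice made for this example and not something the talk specifies; the latency values are invented.

```python
from statistics import mean, median
from typing import Sequence


def group_value(per_server: Sequence[float]) -> float:
    """Summarize one metric across a group of servers with the median,
    so that a single outlier server has little influence."""
    return median(per_server)


# Usage: four healthy canary servers and one outlier (latency in ms).
canary_latency_ms = [101, 99, 103, 100, 950]
print("mean:  ", mean(canary_latency_ms))
print("median:", group_value(canary_latency_ms))
```

The mean is dragged far above any healthy server by the one outlier, while the median stays close to the typical value.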
  • 37. To Save You Some Time … Exercise without warmup can result in injury. Repeat canary analysis frequently. Both traffic and startup time are factors
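
A simple way to account for warmup is to leave out samples taken shortly after an instance starts. The sketch below handles only the startup-time factor; the slide also names traffic as a factor, which this example does not model. The function name `drop_warmup`, the sample layout, and the 300-second window are assumptions.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def drop_warmup(samples: List[Sample], started_at: float, warmup_seconds: float) -> List[Sample]:
    """Keep only samples taken at least `warmup_seconds` after instance start."""
    return [(ts, value) for ts, value in samples if ts - started_at >= warmup_seconds]


# Usage: latency samples from a freshly started instance (made-up values).
samples = [(0, 900.0), (60, 400.0), (300, 110.0), (360, 105.0)]
print(drop_warmup(samples, started_at=0, warmup_seconds=300))
```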
  • 38. To Save You Some Time … vive la différence! Hot-OK, Cold-OK. Let Application Owners Choose
  • 39. To Save You Some Time … Signal is better than no1$#[NO CARRIER] Ignore weak signals
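
The slide does not define what makes a signal weak. Under one possible reading, a metric with too few data points is too noisy to compare, and the sketch below filters such metrics out before analysis. The threshold of 100 samples and the name `strong_signals` are assumptions for this example only.

```python
from typing import Dict, List


def strong_signals(sample_counts: Dict[str, int], min_samples: int = 100) -> List[str]:
    """Return the metric names that have enough data points to be compared."""
    return sorted(name for name, count in sample_counts.items() if count >= min_samples)


# Usage: a rarely hit endpoint produces too few samples and is ignored.
counts = {"api_errors": 5_000, "api_latency": 5_000, "admin_errors": 7}
print(strong_signals(counts))
```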
  • 40. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 41. Good News • Software-Defined Everything • Incremental Pricing
  • 42. Bad News • Capacity Management • Unpredictable Inconsistency
  • 43. Oh, the Places We’ll Go! • Introductions • Proposed Use Case and Definition • Continuous Improvement / MVP Model • Issues, Solutions • Cloud Considerations • The Road at Netflix
  • 44. Numbers • 752 services in production • In-house telemetry platform • A few metrics
  • 45. Been there. Done that. Manually. Artisanally • Started in the Data Center • Manual, dashboard-driven
  • 46. Been there. Done that. Manually. [Dashboard graphs: Errors, Requests, CPU]
  • 47. Been there. Done that. Manually.
  • 48. Been there. Done that. Manually.
  • 49. Been there. Done that. Manually.
  • 50. Been there. Done that. Manually. • Context vs Precision • No … • Repeatability • Trending • Manual effort is manual
  • 51. So Now What? • Automate Analysis • Took Some Effort • Approach and analytics • Presentation matters
  • 52. Automated Canary Analysis
  • 53. Automated Canary Analysis
  • 54. Automated Canary Analysis
  • 55. Automated Canary Analysis
  • 56. Automated Canary Analysis
  • 57. For Our Next Trick … • Configuration GUI • Deployment System Integration • ACA All The Things • OpenConnect firmware updates • Client software changes • Configuration changes in production
  • 58. Summary • Canary Analysis makes your changes • Safer • Faster • Easier • Most people can start doing it • Everyone can do it better
  • 59. Questions, Attributions, Feedback http://bit.ly/qcon-netflix? • https://www.flickr.com/photos/cseeman • https://www.flickr.com/photos/ransomtech • https://www.flickr.com/photos/dougbrown47 • https://www.flickr.com/photos/andresthor/ • https://www.flickr.com/photos/dougbrown47 • https://www.flickr.com/photos/pkdesigns @royrapoport rsr@netflix.com