Netflix Built Its Own Monitoring System - and Why You Probably Shouldn't

2,545 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1O6hOj9.

Roy Rapoport shares some of the lessons Netflix learned building a monitoring system, the challenges, pitfalls and opportunities encountered along the way. Filmed at qconlondon.com.

Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics". He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.

Published in: Technology

Netflix Built Its Own Monitoring System - and Why You Probably Shouldn't

  1. 1. Netflix Built Its Own Monitoring System (And You Probably Shouldn’t) Roy Rapoport rsr@netflix.com @royrapoport 6 March 2015
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /netflix-monitoring-system
  3. 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Not So Much About Telemetry • I telemetry • Architecture track Open Space, 11:30AM, Fleming 3rd Floor
  5. 5. The Knights Who Say NIH
  6. 6. Agenda • Introductions • On Judgment • Your Problem • Your (no, really) Solution • Mitigation and Anecdotes • (Not) building your own monitoring system
  7. 7. Introductions: Me • About 23 years in technology • Systems engineering, networking, software development, QA, release management • Time at Netflix: 2076 days (5y:8m:7d) • At Netflix: • Systems Engineering, Service Delivery in IT • Troubleshooter and Builder of Python Things in Product Engineering • Now: Engineering Manager, Insight Engineering
  8. 8. Introductions: Netflix • Optimize speed of innovation • Constrain availability • Cost is what it is • Hire smart people,
 get out of their way • Anti-process bias “Freedom and Responsibility”
  9. 9. Judgment
  10. 10. You Have a Problem (Your job would likely be boring otherwise) • Are you the first • To have it? • To care? • Are you sure? One that looks nice And not too expensive
  11. 11. You Have a Problem (Your job would likely be boring otherwise) • You’re not the first, or only • Good news! • Then what?
  12. 12. Adventures in IT-Land • (import disclaimer) • Not developers • Cautious about ongoing support load • Not well-trusted
  13. 13. Adventures in IT-Land
  14. 14. A Little Bit of … • Time, courage, knowledge, pride • Cynicism, hubris, fear
  15. 15. Technical Reasons for Rejection (Or: It’s Not You, It’s … Actually, It’s You) • Financial Cost • Technical incompatibility
  16. 16. Overqualified!
  17. 17. • https://www.flickr.com/photos/54945394@N00
  18. 18. A Moment for Pedantry Or: Requirements for “Not Invented Here”
  19. 19. The Knights Who Say IbPWAU
  20. 20. A Question of Trust • Technical: I don’t trust your product • Organizational: I don’t trust you
  21. 21. I Don’t Trust You To Care About Me as a Customer • You’re selling me something • I’m not your only customer • I’m not an important customer • You don’t care about your customers
  22. 22. I Don’t Trust You To build a good product • Past performance … • “Good for me” • Because you said so, that’s why!
  23. 23. I Don’t Trust You To build it fast enough • Unpredictable velocity • When best-case is too slow • Or maybe ever (OSS)
  24. 24. What Now?
  25. 25. Eventual Consistency • Fork n’ merge • THE model for OSS • Works better for incremental changes • Requires alignment of goals
  26. 26. Eventual Consistency No Fork Required • Start With a New Idea • Eventually merge concepts
  27. 27. Eventual Consistency Example Mainline Cloud Orchestration 2011
  28. 28. Eventual Consistency Example Mainline Cloud Orchestration 2011 2013
  29. 29. Eventual Consistency Example Mainline Cloud Orchestration 2011 2013 Insight Engineering CD Automation
  30. 30. Eventual Consistency Example Mainline Cloud Orchestration 2011 2013 Insight Engineering CD Automation 2014 Mainline CD Automation
  31. 31. Eventual Consistency Example Mainline Cloud Orchestration 2011 2013 Insight Engineering CD Automation 2014 Mainline CD Automation 2015
  32. 32. Eventual Consistency Example Mainline Cloud Orchestration 2011 2013 2014 Mainline CD Automation 2015 Insight Engineering CD Automation
  33. 33. Composability • Want this anyway • Map scope to options’ scopes
  34. 34. Composability: Example Netflix’s Atlas Telemetry Platform Global Query Endpoint
  35. 35. Composability: Example Netflix’s Atlas Telemetry Platform Global Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Boundary
  36. 36. Composability: Example Netflix’s Atlas Telemetry Platform Global Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Memory Epic Cloudwatch
  37. 37. Composability: Example Netflix’s Atlas Telemetry Platform Global Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Memory Cloudwatch
  38. 38. Composability: Example Netflix’s Atlas Telemetry Platform Global Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Regional Query Endpoint Memory Cloudwatch OpenTSDB InfluxDB
  39. 39. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis API API Mainline Deployment Automation Platform
  40. 40. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis API Email Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  41. 41. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis API Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  42. 42. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Edge Systems Canary Analysis Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  43. 43. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  44. 44. Composability: Example Deployments and Automated Canary Analysis at Netflix Edge Systems Deployment Automation Platform Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  45. 45. Composability: Example Deployments and Automated Canary Analysis at Netflix Insight Engineering Canary Analysis Mainline Deployment Automation Platform
  46. 46. One More Reason“Think of the glory. Think of your reputation. Think how great it'll look on your next resume.” - Lois McMaster Bujold
  47. 47. Judgment
  48. 48. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT
  49. 49. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products
  50. 50. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products • Ridiculous scale
  51. 51. The Grand Example Netflix’s Monitoring Platform • Prior system owned by IT • No great OSS products • Ridiculous scale • Seriously, how hard can it be?
  52. 52. The Grand Example Netflix’s Monitoring Platform • Took longer than expected • Ongoing maintenance • UI only recent priority
  53. 53. The Grand Example Netflix’s Monitoring Platform • Scales efficientlyish • impedance match with dev lifestyle • Nicely pluggable* • Aggressivish OSS efforts * Ask me about Real-Time Analytics!
  54. 54. The Grand Example Netflix’s Monitoring Platform • Still the right solution • Worried about Sunk Cost Fallacy • Most shouldn’t do this
  55. 55. Can You Repeat That? Or: What’s Your Point? Or: I was Tweeting. Did I miss something? • What’s important to you? • Is this a technical decision? Really? • Honest and non-judgmental • Any mitigation? • Don’t build your own monitoring system. Seriously.
  56. 56. Name This Group • United States • Europe • China • Russia • India • Japan • Blue Origin • SpaceX • Virgin Galactic
  57. 57. 11:30am Frasier Room (3rd Floor) @royrapoport rsr@netflix.com
  58. 58. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/netflix- monitoring-system

×