Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Netflix Edge Engineering Open House Presentations - June 9, 2016

2,817 views

Published on

Netflix's Edge Engineering team is responsible for handling all device traffic for to support the user experience, including sign-up, discovery and the triggering of the playback experience. Developing and maintaining this set of massive scale services is no small task and its success is the difference between millions of happy streamers or millions of missed opportunities.

This video captures the presentations delivered at the first ever Edge Engineering Open House at Netflix. This video covers the primary aspects of our charter, including the evolution of our API and Playback services as well as building a robust developer experience for the internal consumers of our APIs.

Published in: Technology

Netflix Edge Engineering Open House Presentations - June 9, 2016

  1. 1. Daniel Jacobson @daniel_jacobson Satish Gudiboina @sgudiboina Suudhan Rangarajan @suudhan Vasanth Asokan @vasanthasokan Edge Engineering Open House - June 9, 2016
  2. 2. 190 Countries (not China and a few others) 81+ Million Subscribers
  3. 3. 1000+ Different Device Types
  4. 4. Over 42 Billion Hours Streamed in 2015
  5. 5. Streaming Hours Per Year in Billions
  6. 6. Streaming Hours Per Year in Billions
  7. 7. Over 42 Billion Hours Streamed in 2015
  8. 8. Over 42 Billion Successes!
  9. 9. Of Course, There Are Failures Too…
  10. 10. Two Primary Drivers Behind Our Successes
  11. 11. People Desire to Watch Netflix Two Primary Drivers Behind Our Successes
  12. 12. People Desire to Watch Netflix Systems Scale to Meet Desires Two Primary Drivers Behind Our Successes
  13. 13. Sign-Up
  14. 14. Sign-Up Discovery / Browse
  15. 15. Sign-Up Discovery / Browse Playback
  16. 16. Edge Engineering provides data and functionality to support these three experiences
  17. 17. Designing APIs Enabling Playback Scaling Routing Insights DX Resiliency Tools Edge Engineering provides data and functionality to support these three experiences
  18. 18. D E V I C E S
  19. 19. D E V I C E S R O U T I N G
  20. 20. D E V I C E S R O U T I N G
  21. 21. D E V I C E S R O U T I N G A P I API API API API API API
  22. 22. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2Playback Lifecycle S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2 S2S2Playback Data S2S2DRMMetadata
  23. 23. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2Metadata S2S2Playback Data S2S2DRM Owned by Edge Engineering Playback Lifecycle
  24. 24. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2 S2S2Playback Data S2S2DRMMetadata Playback Lifecycle
  25. 25. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2 S2S2Playback Data S2S2DRMMetadata Playback Lifecycle
  26. 26. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2 S2S2Playback Data S2S2DRMMetadata Playback Lifecycle
  27. 27. D E V I C E S R O U T I N G A P I API API API API API API S E R V I C E S S2S2Recs S2S2Member S2S2Ratings S2S2S2S2Authn/z S2S2A/B S2S2Search S2S2Identity S2S2 S2S2Playback Data S2S2DRMMetadata Playback Lifecycle
  28. 28. API API API API API API S2S2S2S2Authn/z S2S2Playback Data S2S2DRM I N S I G H T S T O O L S D X Playback Lifecycle
  29. 29. 42 Billion Hours 2015
  30. 30. 200 Billion Hours 2015 Future 42 Billion Hours
  31. 31. The rest of Netflix’s AWS Cloud Footprint by %
  32. 32. Talking About the Future of Edge Engineering Satish Gudiboina API and Upcoming Re-Architecture Suudhan Rangarajan Playback Experience Vasanth Asokan Developer Tools, Velocity and Experience
  33. 33. The Netflix API Platform for Server- Side Scripting Current and The Future Satish Gudiboina
  34. 34. The Netflix API
  35. 35. Streaming Hours Per Year in Billions
  36. 36. Scale is multi-faceted Growing number of users ( → RPS) Growing number of device types Growing number of A/B tests Growing number of languages Growing number of countries
  37. 37. What we need to build for Velocity Resiliency Other requirements: Performance Great developer experience Operational insights Tooling
  38. 38. SERVICELAYER Js (mostly) java Client A Client B Client C Client A Client Y Client Z ... ... Netflix Microservices script script script script ... script script script script Network boundary API Server JVM Today’s architecture Resiliency with Hystrix
  39. 39. Developer Velocity: Decoupled deployments of versions n+3 Day 1 Day 2 Day 3 Day 4 Day 5 API device 1 device 2 device 3 device 4 i+4 i+1 i+2i+3 i n+2 n+1 n k+1 k j j+1 l
  40. 40. Changing risk profile Growing number of users ( → RPS) Growing number of devices Growing number of A/B tests Growing number of languages Growing number of countries Growing number and complexity of scripts (scripts → apps)
  41. 41. SERVICELAYER Js (mostly) java Client A Client B Client C Client A Client Y Client Z ... ... Netflix Microservices script script ... script script Network boundary API Server JVM Today’s system (T-3yrs) few, small scripts fewer uploads
  42. 42. SERVICELAYERJs (mostly) java Client A Client B Client C Client A Client Y Client Z ... ... Netflix Microservices script script script script ... script script script script Network boundary API Server JVM Today’s system (T) scripts scripts hundreds of more complex scripts, 10-50 uploads per day
  43. 43. What we need Velocity Resiliency?
  44. 44. Lack of process isolation is a growing risk.
  45. 45. Moving toward our ideal API: What will change Scripts will run in containers Scripts will call API remotely
  46. 46. SERVICELAYERJs (mostly) java Client A Client B Client C Client A Client Y Client Z ... ... Netflix Microservices node script node script ... node script node script Network boundary API Server JVM The (near) future node.js process isolation node for device teams
  47. 47. Why containers? Process isolation Fast startup Consistent developer experience across environments
  48. 48. Isolated failures: scripts don’t affect each other API device 1 device 2 device 3 device 4 Temporaril y unavailabl e!
  49. 49. Independent autoscaling API device 1 device 2 device 3 device 4
  50. 50. Fast startup New API server: minutes New container: seconds Fast rollout, fast rollback, fast MTTR
  51. 51. The Netflix API
  52. 52. Edge Developer Experience Translating developer productivity to Netflix customer delight
  53. 53. Developer Experience?
  54. 54. DEVELOP (rapidly) DEPLOY (reliably) OPERATE (effectively) Experimentation driven innovation ~700 apps, dozens of pushes a day 15+ client teams, ~200 developers ~50 direct services, 100s of AB tests, dozens of new features The Innovation Funnel API Devices Netflix Services Client Adaptor Applications
  55. 55. Why care about DevEx? Developer Productivity Product Innovation Tools Automation Insights Customer Satisfaction
  56. 56. App Development and Management DEVELOP (rapidly) DEPLOY (reliably) OPERATE (effectively)
  57. 57. SERVICELAYER Netflix Microservices app WAN Boundary API SERVER JVM js java Developer Ergonomics app ... app app CLIENTLIBRARIES Large / Complex SERVICELAYER
  58. 58. REMOTESERVICELAYER app API SERVER JVM Developer Ergonomics ... app ... app app CLIENTLIBRARIES js javajs DOCKER CONTAINERS WAN Boundary Netflix Microservices
  59. 59. Setup Canary SupportProd Push Pre-Prod Metrics Tracing Lifecycle Alerts Build Bootstrap API Discovery REPL Unit Test SDK Debug Logging Profiling Audits Security Custom Routing Dependency Management Client Application Development Critical Component! Dx Developer Experience
  60. 60. $ newt init Just bring your Javascript business logic NeWT: Netflix Workflow Toolkit Continuous Integration Deployment Pipelines Autoscaling Dashboards Alerting Logging Lifecycle Management Audits and Analytics Container tooling Canaries Dependency Management
  61. 61. Titus ATLAS NeWT: Netflix Workflow Toolkit
  62. 62. Edge PaaS UI
  63. 63. $ newt auto-deploy -d nodeJS project Docker Machine node-inspector Debugger File watcher / live reload trigger File watcher agent NeWT: Local Container Development Local Container docker build / run
  64. 64. $ newt auto-deploy -d Docker Machine NeWT: Local Container Development Local Container Cloud Microservices Cloud Proxy Terminate security DiscoveryAgent Service Discovery Local System Cloud
  65. 65. App Operations and Insights DEVELOP (rapidly) DEPLOY (reliably) OPERATE (effectively)
  66. 66. • Low Latency, High throughput, Highly Efficient • Handle bursty or large scale loads • Extensible programming model 600 jobs in production, 8M messages/sec at peak, 100Gbps network throughput Mantis - Stream Processing Platform
  67. 67. Monitoring facets of aggregate application health, globally Aggregate Insights
  68. 68. Aggregate Insights
  69. 69. Analyze in real-time, requests matching a precise set of conditions Surgical Insights
  70. 70. Surgical Insights - Real-time Stream Queries
  71. 71. Surgical Insights - Real-time Stream Queries
  72. 72. Surgical Insights - Real-time Stream Queries
  73. 73. Monitoring server side calling pattern and internal application profile Session Tracing
  74. 74. Session Tracing
  75. 75. Session Tracing - Request Profile
  76. 76. Session Tracing - Per Node Profile
  77. 77. Automatic monitoring of high cardinality data across multiple dimensions Real-time Anomaly Detection
  78. 78. Real-time Anomaly Detection
  79. 79. • Scaling developer productivity with business growth •Provide fully managed PaaS experience to client developers • Shift Left Insights to power smart development • Curated, blended visualizations that simplify devops In conclusion...
  80. 80. Tech Soup
  81. 81. Scaling Playback Services Suudhan Rangarajan Senior Software Engineer, Playback Features @suudhan
  82. 82. Playback Lifecycle DECIDE COLLECT & LEARN AUTHORIZE
  83. 83. Decide MANIFEST (Tracks and URLs)
  84. 84. Authorize LICENS E ❏ Content usage / resolution policies ❏ Plan / device limits enforcement ❏ DRM / License generation
  85. 85. Collect & Learn Bookmarks & Hours Watched Streaming Errors and Metrics Quality Of Experience metrics 4
  86. 86. Lets look at Play Decisions DECIDE MANIFEST AUTHORIZE COLLECT & LEARN LICENS E SESSIO N
  87. 87. Huge number of Streams Resolutions - 720p, 1080p, 4K etc Codecs - H.264,HEVC etc Bitrates - 230, 780, 3000 etc Channels - Stereo, Surround Sound Languages - English, French etc Types - Subtitles, Closed Captions, Forced Narratives Languages - English, French etc
  88. 88. Streams to Tracks - H.264 Main Profile - English 5.1 Audio - No Subtitle - HEVC Dash Profile - French 2.0 Audio - English CC - HDR Dash Profile - Spanish AAC Audio - English Forced Narrative
  89. 89. Decide & Filter MANIFEST SERVICE
  90. 90. Many Many Dimensions PLAYBAC K MANIFEST USER PREFERENCES TITLE METADATA COUNTRY DEVICE NETWORK
  91. 91. Big Opportunity Rich playback experiences Tremendous increase in scale Customer growth
  92. 92. Challenge: Efficient Scaling Targeting sub-linear growth # of Requests Cloud Costs
  93. 93. Predictable Viewing Patterns Key Insight
  94. 94. Key Insight CONTENT RANK
  95. 95. Also.. Manifest Request for one title
  96. 96. Current: Completely Real-time Real-time manifest generation
  97. 97. With Caching Real-time manifest generation 80% Cached 20% Real-time
  98. 98. Challenges How do we determine the optimal combination of attributes to cache on?
  99. 99. Challenges Cache Considerations: ● When to populate? ● When to bust? ● How to scale for cache- miss or failures?
  100. 100. Potential Win 10x increase in requests with only 4x increase in costs
  101. 101. Optimize computation Can we re-imagine our service processing to dramatically increase throughput?
  102. 102. Anatomy of a Playback Manifest Request Metadata Access 27 % 36% Tracks Generation 16% Streams Filtering 21% Serialization
  103. 103. Potential Win 10x increase in requests with just 2x increase in service costs
  104. 104. Two-pronged Strategy to Scaling Cache Manifests Re-architect code to reduce processing time
  105. 105. Scaling Problems Across Services Decide Authorize Collect & Learn Playback Features Playback Access Playback Data Systems
  106. 106. Thanks! @suudhan Come Talk to Us!
  107. 107. Image Attribution All Images used are under creative commons or public domain license: ● Video icon - http://simpleicon.com/wp-content/uploads/video-camera-1.png ● Speaker icon - https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Speaker_Icon.svg/1024px- Speaker_Icon.svg.png ● Subtitle icon - https://thenounproject.com/term/subtitles/78795/ ● Uptrend image - https://pixabay.com/en/chart-line-line-chart-diagram-trend-148256/ ● Funnel image - https://commons.wikimedia.org/wiki/File:Funnel_Mech.svg ● Business Intelligence image - https://pixabay.com/static/uploads/photo/2015/04/14/23/17/it-business- 722950_960_720.png ● Key icon - https://pixabay.com/static/uploads/photo/2014/04/03/10/55/key-311738_960_720.png ● Person icon- https://pixabay.com/static/uploads/photo/2015/12/22/04/00/photo-1103596_960_720.png ● Mobile icon- https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Mobile_phone_font_awesome.svg/1024px- Mobile_phone_font_awesome.svg.png ● Globe image - https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px- Simple_Globe.svg.png ● Devices icon- https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px- Simple_Globe.svg.png ● wifi icon - https://pixabay.com/static/uploads/photo/2016/01/03/11/32/wireless-signal-1119306_960_720.png ● cell tower - https://pixabay.com/static/uploads/photo/2012/04/13/00/23/tower-31235_960_720.png

×