Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Netflix Play API - An Evolutionary Architecture

287 views

Published on

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2EAvTfs.

Suudhan Rangarajan dives deep into how Netflix used a set of three core foundational principles to iteratively develop their architecture. He specifically talks about what patterns they observed in their previous architectures and how they arrived at a list of practices to create an Evolutionary Architecture. Filmed at qconsf.com.

Suudhan Rangarajan works on the Playback API team at Netflix, responsible for ensuring that customers receive the best possible playback experience every time they click play. Prior to Netflix, he worked on the Audio/Video decoding pipeline in Adobe Flash and Adobe Primetime products helping many partners create a great video streaming client.

Published in: Technology

Netflix Play API - An Evolutionary Architecture

  1. 1. Headline Suudhan Rangarajan (@suudhan) Senior Software Engineer Netflix Play API Why we built an Evolutionary Architecture
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-play-api
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Headline Suudhan Rangarajan (@suudhan) Senior Software Engineer Netflix Play API Why we built an Evolutionary Architecture
  5. 5. Previous Architecture Workflow Sign-up Content Discovery Playback API Service ← Services hosted in AWS →Devices Domain specific Microservices API Proxy Service
  6. 6. Signup Workflow ← Services hosted in AWS →Devices Signup API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  7. 7. Content Discovery Workflow ← Services hosted in AWS →Devices Discovery API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  8. 8. Playback Workflow ← Services hosted in AWS →Devices Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  9. 9. Previous Architecture ← Services hosted in AWS →Devices Signup API Discovery API Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  10. 10. Identity Type 1/2 Decisions Evolvability
  11. 11. Identity Type 1/2 Decisions Evolvability
  12. 12. Start with WHY: Ask why your service exists
  13. 13. Lead the Internet TV revolution to entertain billions of people across the world P Maximize customer engagement from signup to streaming P Enable acquisition, discovery, playback functionality 24/7
  14. 14. API Identity: Deliver Acquisition, Discovery and Playback functions with high availability
  15. 15. Single Responsibility Principle: Be wary of multiple-identities rolled up into a single service
  16. 16. One API Service Signup API Discovery API Play API Signup API Discovery API Play API API Service Per function Previous Architecture Current Architecture
  17. 17. Lead the Internet TV revolution to entertain billions of people across the world P Maximize user engagement of Netflix customer from signup to streaming P Enable non-member, discovery, playback functionality 24/7 P Deliver Playback Lifecycle 24/7
  18. 18. Decide best playback experience Track events to measure playback experience Authorize playback experience Play API Devices API Proxy Service
  19. 19. Decide best playback experience Track events to measure playback experience Authorize playback experience Devices API Proxy Service High Coupling, Low Evolvability
  20. 20. Play API Identity: Orchestrate Playback Lifecycle with stable abstractions
  21. 21. Guiding Principle: We believe in a simple singular identity for our services. The identity relates to and complements the identities of the company, organization, team and its peer services
  22. 22. Identity Type 1/2 Decisions Evolvability
  23. 23. “Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation [...] We can call these Type 1 decisions…” Quote from Jeff Bezos
  24. 24. “...But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long [...] Type 2 decisions can and should be made quickly by high judgment individuals or small groups.” Quote from Jeff Bezos
  25. 25. Three Type 1 Decisions to Consider Synchronous & Asynchronous Data ArchitectureAppropriate Coupling
  26. 26. Two types of Shared Libraries Play API Service Utilities cache Metrics Shared Libraries with common functions Client Libraries used for inter-service communications Client 1 Client 2 Client 3
  27. 27. “Thick” shared libraries with 100s of dependent libraries (e.g. utilities jar) Previous Architecture 1) Binary Coupling
  28. 28. Hundreds of shared libraries spanning services across network boundaries Previous Architecture Binary coupling => Distributed Monolith Utilities Utilities Utilities Service1 Service2 Service3
  29. 29. “The evils of too much coupling between services are far worse than the problems caused by code duplication” - Sam Newman (Building Microservices)
  30. 30. Play API Service Playback Decision Service Playback Decision Client Previous Architecture
  31. 31. Requests Per Second of API Service Increase in Latencies from the API Service Execution of Fallback via Play Decision Client Clients with heavy Fallbacks
  32. 32. Play API Service Playback Decision Service Playback Decision Client Previous Architecture 2) Operational Coupling
  33. 33. “Operational Coupling” might be an ok choice, if some services/teams are not yet ready to own and operate a highly available service.
  34. 34. Many of the client libraries had the potential to bring down the API Service Previous Architecture Operational Coupling impacts Availability Play API Service
  35. 35. Play API Service Playback Decisions Serviceclient Java Java Previous Architecture 3) Language Coupling
  36. 36. Play API Service client REST over HTTP 1.1 ● Unidirectional (Request/ Response type APIs) Previous Architecture Playback Decisions Service Jersey Framework Communication Protocol
  37. 37. Requirements Operationally “thin” Clients No or limited shared libraries Auto-generated clients for Polyglot support Bi-Directional Communication
  38. 38. ● At Netflix, most use-cases were modelled as Request/Response ○ REST was a simple and easy way of communicating between services; so choice of REST was more incidental rather than intentional ● Most of the services were not following RESTful principles. ○ The URL didn’t represent a unique resource, instead the parameters passed in the call determined the response - effectively made them a RPC call ● So we were agnostic to REST vs RPC as long as it meets our requirements REST vs RPC
  39. 39. Previous Architecture Current Architecture Play API Service Playback Decisions Playback Authorize Playback Events Playback Decisions Playback Authorize Playback Events 1) Operationally Coupled Clients 2) High Binary Coupling 3) Only Java 4) Unidirectional communication Play API Service 1) Minimal Operational Coupling 2) Limited Binary Coupling 3) Beyond Java 4) Beyond Request/ Response gRPC/ HTTP2REST/ HTTP1
  40. 40. Consider “thin” auto-generated clients with bi-directional communication and minimize code reuse across service boundaries Type 1 Decision: Appropriate Coupling
  41. 41. Three Type 1 Decisions to Consider Synchronous vs Asynchronous Data ArchitectureAppropriate Coupling
  42. 42. PlayData getPlayData(string customerId, string titleId, string deviceId){ CustomerInfo custInfo = getCustomerInfo(customerId); DeviceInfo deviceInfo = getDeviceInfo(deviceId); PlayData playdata = decidePlayData(custInfo, deviceInfo, titleId); return playdata; }
  43. 43. Request Handler Thread pool Client Thread pool Typical Synchronous Architecture
  44. 44. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Customer Service Device Service Play Data Decision Service
  45. 45. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Customer Service Device Service Play Data Decision Service Blocking Request Handler Blocking Client I/O
  46. 46. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Blocking Request Handler Blocking Client I/O Works for Simple Request/Response Works for Limited Clients
  47. 47. Beyond Request/Response One Request - One Response Request Play-data for Title X Receive Play-data for Title X One Request - Stream Response Request Play-data for Titles X,Y,Z Receive Play-data for Title X Receive Play-data for Title Y Receive Play-data for Title Z Stream Request - One Response Request Play-data for Title X Request Play-data for Title Y Request Play-data for Title Z Receive Play-data for Titles X,Y,Z Stream Request - Stream Response Request Play-data for Title X Request Play-data for Title Y Receive Play-data for Title X Get Play-data for Title Z Receive Play-data for Title Y Receive Play-data for Title Z
  48. 48. Request/Response Event Loop Outgoing Event Loop per client Worker Threads Asynchronous Architecture
  49. 49. PlayData getPlayData(string customerId, string titleId, string deviceId){ Zip(getCustomerInfo(customerId), getDeviceInfo(deviceId), (custInfo, deviceInfo) -> return decidePlayData(custInfo, deviceInfo, titleId) ); }
  50. 50. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service setup
  51. 51. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service getCustomerInfo
  52. 52. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service getDeviceInfo
  53. 53. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service zip
  54. 54. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service decidePlayData
  55. 55. ● All context is passed as messages from one processing unit to another. ● If we need to follow and reason about a request, we need to build tools to capture and reassemble the order of execution units ● None of the calls can block Workflow spans multiple threads
  56. 56. Request/Response Event Loop Outgoing Event Loop per client Worker Threads Asynchronous Architecture Asynchronous Request Handler Non-Blocking I/O
  57. 57. Synchrony Ask: Do you really have a need beyond Request/Response?
  58. 58. Network Event Loop Outgoing Event Loop per client Dedicated thread Synchronous Execution + Asynchronous I/O Blocking Request Handler Non-Blocking I/O Current Architecture getPlayData getCustomerInfo decidePlayData Return getDeviceInfo
  59. 59. If most of your APIs fit the Request/Response pattern, consider a synchronous request handler, with nonblocking I/O Type 1 Decision: Synchronous vs Asynchronous
  60. 60. Three Type 1 Decisions to Consider Synchronous vs Asynchronous Data ArchitectureAppropriate Coupling
  61. 61. Without an intentional Data Architecture, Data becomes its own monolith
  62. 62. Previous Architecture What a Data Monolith looks like Data Source Data Source Data Source Service 1 Service 2 Service 3 Service 4
  63. 63. 4 GB 1 GB 2 GB 400 MB 600 MB API Service ← Multiple Data sources loaded in memory → ←MemoryLoad→ Previous Architecture What a Data Monolith looks like
  64. 64. 4 GB 1 GB 2 GB 400 MB 600 MB API Service Very small percentage of data actually accessed Previous Architecture What a Data Monolith looks like
  65. 65. API Service Each Data Source models gets coupled across classes and libraries Previous Architecture What a Data Monolith looks like
  66. 66. API Service Unpredictable Performance Characteristics Data Update CPU Utilization Previous Architecture What a Data Monolith looks like
  67. 67. What a Data Monolith looks like API Service Potential to bring down the service Data Update Netflix was down Previous Architecture
  68. 68. "All problems in computer science can be solved by another level of indirection." David Wheeler (World’s first Comp Sci PhD)
  69. 69. Current Architecture Data Source Data Source Data Source Data Source Data Source Data Loader Data ServicePlay API Service Data Store Materialized View
  70. 70. Current Architecture Data Source Data Source Data Source Data Source Data Source Data Loader Data Service Uses only the data it needs Predictable Operational Characteristics Reduced Dependency chain Data StorePlay API Service Materialized View
  71. 71. Isolate Data from the Service. At the very least, ensure that data sources are accessed via a layer of abstraction, so that it leaves room for extension later Type 1 Decision: Data Architecture
  72. 72. Three Type 1 Decisions to Consider Synchrony Data ArchitectureAppropriate Coupling
  73. 73. For Type 2 decisions, choose a path, experiment and iterate
  74. 74. Guiding Principle: Identify your Type 1 and Type 2 decisions; Spend 80% of your time debating and aligning on Type 1 Decisions
  75. 75. Identity Type 1/2 Decisions Evolvability
  76. 76. An Evolutionary Architecture supports guided and incremental change as first principle among multiple dimensions - ThoughtWorks
  77. 77. Choosing a microservices architecture with appropriate coupling allows us to evolve across multiple dimensions
  78. 78. How evolvable are the Type 1 decisions Change Play API Current Architecture Previous Architecture Asynchronous? Polyglot services? Bidirectional APIs? Additional Data Sources? Known Unknowns
  79. 79. Potential Type 1 decisions in the future? Change Play API Current Architecture Previous Architecture Containers? Serverless? ? ? And we fully expect that there will be Unknown Unknowns
  80. 80. As we evolve, how to ensure we are not breaking our original goals?
  81. 81. Use Fitness Functions to guide change
  82. 82. High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability
  83. 83. High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability 1 2 3 4
  84. 84. Why Simplicity over Reliability? Increase in Operational Complexity Reliable Fallback when service is down
  85. 85. Why Scalability over Throughput? New instances were added Increase in Errors due to cache warming
  86. 86. Why Observability over Latency? Decrease in latency by using a fully async executor Cost of Async: Loss in Observability
  87. 87. Four 9s of availability Thin Clients P99 latencyResilience to failures Merge to Deploy Time 1 2 3
  88. 88. Guiding Principle: Define Fitness functions to act as your guide for architectural evolution
  89. 89. Previous Architecture Current Architecture Operational Coupling Binary Coupling Only Java Synchronous communication Data Monolith Operational Isolation No Binary Coupling Beyond Java Asynchronous communication Explicit Data Architecture Guided Fitness Functions Multiple Identities Singular Identities
  90. 90. ● No incidents in a year ● 4.5 deployments per week ● Just two rollbacks!
  91. 91. Identity Type 1/2 Decisions Evolvability Build a Evolutionary Architecture
  92. 92. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-play-api

×