Edge Architecture, IEEE International Conference on Cloud Engineering


Published in: Technology

  1. 1. Netflix’s Global Cloud Edge Architecture Mikey Cohen mikey@netflix.com Edge Engineering Platform Netflix
  2. 2. Over 44 million subscribers in over 40 countries
  3. 3. Netflix accounts for over 30% of peak internet traffic in North America
  4. 4. One billion hours ~ 100,000 years per month...
  5. 5. Netflix supports over 1000 device types
  6. 6. Edge Services ● Front door to Netflix ● Edge Routing - Zuul ● API - Edge Server ● Playback services
  7. 7. How does Netflix Streaming work?* * A simplified view
  8. 8. How does Netflix Streaming work? Netflix Services in Amazon Cloud Your CE Device CDN
  9. 9. Device Under the Hood (diagram labels: Netflix Services in Amazon Cloud; Your CE Device; CDN; User Interface; Netflix Streaming Platform; DRM; encoding; CE integration)
  10. 10. User Interface loaded, data retrieved from Netflix Edge Service (diagram as on slide 9, plus Edge Services)
  12. 12. User Interface Loaded
  13. 13. Movie Authorization (diagram adds an "Authorize" label)
  15. 15. Obtaining License (diagram adds a "License" label)
  16. 16. Movie starts streaming (diagram adds a "PlayData" label)
  18. 18. Periodic “bookmark” calls note place in movie (diagram adds a "bookmark" label)
  19. 19. Edge Services - What we are talking about today
  20. 20. Edge’s lofty mission ● High Availability ● Good performance ● Data broker between many services and devices in a global, high volume, rapidly innovating, highly dynamic service ● Clients and services are constantly changing
  21. 21. Edge stats ● Billions of incoming requests per day ○ Over 10X outgoing service calls per request ● About 10 device changes per day ● Daily service pushes ● Daily routing changes
  22. 22. Architecture Goals ● Infrastructure ○ Availability ○ Resiliency ○ Scalability ● Application ○ Platform diversity ○ Rapid innovation ○ A/B Testing ● Delivery ○ Automation ○ Insights
  23. 23. Netflix’s Global Cloud Architecture
  24. 24. High Level Regional Edge Architecture (diagram labels: ELB; Zuul; Edge Service; Playback Service; Website Service; Netflix Services)
  25. 25. Zuul (same diagram labels as slide 24)
  26. 26. What is Zuul? ● Open source framework for dynamically reading, writing, and executing filters that act on incoming HTTP requests ● Dynamically compiled filters written in Groovy ○ Any JVM language supported ● Filters share state through a request scoped context
  27. 27. How we use Zuul ● Authentication ● Insights ● Stress Testing ● Canary Testing ● Dynamic Routing ● Service Migration ● Load Shedding ● Security ● Static Response handling ● Active/Active traffic management
  28. 28. Zuul Filter Characteristics ● Type ● Execution Order ● Criteria ● Action
  29. 29. Zuul Filter Lifecycle (diagram labels: HTTP Request; "pre" filters; "routing" filter(s); Origin Server; "post" filters; HTTP Response; "error" filters; "custom" filters. Slides 30-37 repeat this diagram.)
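The filter characteristics (type, execution order, criteria, action) and the lifecycle above can be illustrated with a minimal sketch. This is not Zuul's implementation: it is plain Python, the `Filter` class and `run_filters` function are hypothetical names, the request-scoped context is modeled as a dict, and error handling is simplified to "run the error filters and stop".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    name: str
    type: str                               # "pre", "routing", "post", or "error"
    order: int                              # lower values run first within a type
    should_filter: Callable[[dict], bool]   # criteria
    run: Callable[[dict], None]             # action


def run_filters(filters: List[Filter], ctx: dict) -> dict:
    """Run pre -> routing -> post filters; on an exception, run the error filters."""

    def run_type(filter_type: str) -> None:
        selected = sorted((f for f in filters if f.type == filter_type),
                          key=lambda f: f.order)
        for f in selected:
            if f.should_filter(ctx):
                f.run(ctx)
                ctx.setdefault("trace", []).append(f.name)

    try:
        for phase in ("pre", "routing", "post"):
            run_type(phase)
    except Exception as exc:   # simplified: any failure hands over to error filters
        ctx["error"] = str(exc)
        run_type("error")
    return ctx


# Filters are registered in arbitrary order; type and order decide execution.
filters = [
    Filter("add_header", "post", 1, lambda c: True, lambda c: c.update(header="x-edge")),
    Filter("authenticate", "pre", 1, lambda c: True, lambda c: c.update(user="demo")),
    Filter("to_origin", "routing", 1, lambda c: "user" in c, lambda c: c.update(status=200)),
]
result = run_filters(filters, {"path": "/home"})
```

The shared dict plays the role of the request-scoped context through which filters exchange state.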
  38. 38. Example Filter File: DeviceDelayFilter.groovy

        import com.netflix.zuul.ZuulFilter
        import com.netflix.zuul.context.RequestContext

        class DeviceDelayFilter extends ZuulFilter {

            static Random rand = new Random()

            @Override
            String filterType() {
                return 'pre'    // run before the request is routed to the origin
            }

            @Override
            int filterOrder() {
                return 5        // position among the other "pre" filters
            }

            @Override
            boolean shouldFilter() {
                // Null-safe check: a missing deviceType parameter yields false
                return RequestContext.getCurrentContext().getRequest().
                        getParameter("deviceType")?.equals("BrokenDevice") ?: false
            }

            @Override
            Object run() {
                // Sleep for a random time in [0, 20) seconds (the argument is in milliseconds)
                sleep(rand.nextInt(20000))
                return null
            }
        }
  39. 39. Filter deployment
  40. 40. Active/Active
  41. 41. Multiple Active Regions (diagram labels, shown twice: Zuul; API; Cassandra; Services)
  42. 42. Multiple Active Regions - NM vs GE (diagram adds two DNS labels)
  43. 43. Multiple Active Regions - Cassandra Replication across regions
  44. 44. DNS Misrouting
  46. 46. Geo lookup resolves IP in west (diagram adds a GEO label)
  47. 47. Zuul east routes to Zuul west
  48. 48. Response is from west
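The misrouting slides above describe a request that DNS sends to the east region even though a geo lookup places the client in the west, so Zuul east forwards it to Zuul west. A minimal sketch of that decision, assuming a prefix-based lookup table and hypothetical names (`GEO_TABLE`, `geo_lookup`, `zuul_handle`); the cross-region proxy call is replaced by a local function call:

```python
from typing import Optional

# Hypothetical mapping from documentation-only IP prefixes to a home region.
GEO_TABLE = {"198.51.100.": "us-west", "203.0.113.": "us-east"}


def geo_lookup(client_ip: str) -> Optional[str]:
    """Return the region a client belongs to, or None when unknown."""
    for prefix, region in GEO_TABLE.items():
        if client_ip.startswith(prefix):
            return region
    return None


def serve(region: str) -> str:
    """Stand-in for the API and services of one region."""
    return f"response from {region}"


def zuul_handle(local_region: str, client_ip: str) -> str:
    """Serve locally unless the geo lookup says the client belongs elsewhere."""
    home = geo_lookup(client_ip)
    if home is not None and home != local_region:
        return serve(home)      # stands in for proxying to the other region's Zuul
    return serve(local_region)


misrouted = zuul_handle("us-east", "198.51.100.7")
```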
  49. 49. Regional Failure
  50. 50. Catastrophe in US-East
  51. 51. East Coast is Down
  52. 52. Switch DNS to point to US-West
  53. 53. East traffic flows to West
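The failover sequence above (region fails, DNS is switched, east traffic flows west) can be sketched as a table update. This is an illustration only: `RegionalDns`, its methods, and the record keys are hypothetical, and a real DNS change also involves TTLs and propagation delays that are not modeled here.

```python
class RegionalDns:
    """Toy DNS table mapping a client geography to the region that serves it."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, client_geo):
        return self.records[client_geo]

    def fail_over(self, failed_region, healthy_region):
        """Point every record that targets the failed region at the healthy one."""
        for geo, region in self.records.items():
            if region == failed_region:
                self.records[geo] = healthy_region


dns = RegionalDns({"east-coast": "us-east", "west-coast": "us-west"})
before = dns.resolve("east-coast")
dns.fail_over("us-east", "us-west")
after = dns.resolve("east-coast")
```

This only works because both regions are active and the data is replicated across them, as the earlier slides note.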
  54. 54. Edge Server (API)
  55. 55. The Edge Service - Netflix’s API Tier (same diagram labels as slide 24: ELB; Zuul; Edge Service; Playback Service; Website Service; Netflix Services)
  56. 56. What’s wrong with REST for Netflix?
  57. 57. REST ● One Size Fits all ● One Data Format Fits All ● REST tends to be atomic ● Average 25 REST requests to build up a page.
  58. 58. Netflix’s Groovy Scripting Layer
  59. 59. Edge Scripting Tier ● Device teams write scripts for their device ○ control content, format, endpoints ● Code injected directly into Edge Service at runtime ○ Scripts are in production in about 30 seconds
  60. 60. Edge Server Architecture (diagram labels: Endpoint Code (Groovy); Endpoint Controller; Endpoint Manager; RxJava Async Service Layer API; Hystrix (Fault tolerance); JVM; Netflix Services)
  61. 61. Pushing a Script (diagram adds: UI Engineer; /ps3/home script)
  63. 63. Controller pulls new script / compiles
  64. 64. Script Activated (diagram adds: UI Engineer; Activate)
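The push, compile, and activate steps above can be sketched as follows. The deck's scripts are Groovy running inside the Edge Service; this stand-in uses Python's built-in `compile` and `exec`, and every name (`EndpointManager`, `push`, `activate`, `dispatch`, the `handle` entry point, the service names) is hypothetical. The example script also shows the motivation from the REST slides: one device-specific endpoint combines several service calls into a single response.

```python
class EndpointManager:
    """Holds compiled endpoint scripts; a pushed script serves traffic only once activated."""

    def __init__(self):
        self._staged = {}
        self._active = {}

    def push(self, path, source):
        """Compile the script and stage it without affecting live traffic."""
        namespace = {}
        exec(compile(source, path, "exec"), namespace)
        self._staged[path] = namespace["handle"]

    def activate(self, path):
        """Swap the staged script in as the live handler for this path."""
        self._active[path] = self._staged.pop(path)

    def dispatch(self, path, services, user_id):
        return self._active[path](services, user_id)


# A device team's script: several service calls, one tailored response.
PS3_HOME_V1 = '''
def handle(services, user_id):
    return {"user": services["profile"](user_id),
            "rows": services["catalog"](user_id)[:2]}
'''

PS3_HOME_V2 = '''
def handle(services, user_id):
    return {"user": services["profile"](user_id),
            "rows": services["catalog"](user_id)}
'''

services = {
    "profile": lambda uid: f"user-{uid}",
    "catalog": lambda uid: ["Title A", "Title B", "Title C"],
}

manager = EndpointManager()
manager.push("/ps3/home", PS3_HOME_V1)
manager.activate("/ps3/home")
v1_response = manager.dispatch("/ps3/home", services, 7)

manager.push("/ps3/home", PS3_HOME_V2)              # staged, not yet live
still_v1 = manager.dispatch("/ps3/home", services, 7)
manager.activate("/ps3/home")
v2_response = manager.dispatch("/ps3/home", services, 7)
```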
  65. 65. Service Layer
  66. 66. Service Layer (same diagram labels as slide 60)
  67. 67. Purpose of the Service Layer ● Interface to business logic (our API) ● Shield data consumers from service changes ● Combine and expose business data in a logical and consistent manner ● All Service Layer methods are async using RxJava ○ Hides concurrency and underlying implementation
  68. 68. RxJava
  69. 69. RxJava (same diagram labels as slide 60)
  70. 70. RxJava ● Why? ○ How do you expose an async service as an API? ○ Solution to compose async flows and sequences of data ○ Rich set of operators to filter and interact with data
  71. 71. How RxJava Helps ● Need to hide concurrency from script writers ○ Minimize the “bad things” that on-box consumers of our API can do ○ Hide the internal implementation ■ Change concurrency of any given call ■ Switch to non-blocking IO
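The idea of an async service layer that hides concurrency from script writers can be sketched with Python's `asyncio` standing in for RxJava. This is an analogy, not RxJava's API: the service functions, their data, and `home_page` are hypothetical. The endpoint code only composes results; whether calls run concurrently is decided inside the layer.

```python
import asyncio

# Hypothetical backing data for the example services.
RATINGS = {"Title A": 4, "Title B": 2, "Title C": 5}


async def get_profile(user_id):
    await asyncio.sleep(0.01)           # simulated network call
    return {"id": user_id, "name": "demo"}


async def get_recommendations(user_id):
    await asyncio.sleep(0.01)
    return ["Title A", "Title B", "Title C"]


async def get_rating(title):
    await asyncio.sleep(0.01)
    return RATINGS[title]


async def home_page(user_id):
    """Compose several async calls, then filter and shape the combined data."""
    # Independent calls run concurrently; the caller never manages threads.
    profile, titles = await asyncio.gather(get_profile(user_id),
                                           get_recommendations(user_id))
    ratings = await asyncio.gather(*(get_rating(t) for t in titles))
    rows = [{"title": t, "rating": r} for t, r in zip(titles, ratings) if r >= 3]
    return {"name": profile["name"], "rows": rows}


page = asyncio.run(home_page(7))
```

Because callers see only awaitable results, the layer could change how any call is executed without changing `home_page`.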
  72. 72. Hystrix Service Resiliency
  73. 73. Hystrix (same diagram labels as slide 60)
  74. 74. How Hystrix helps ● Latency and Fault Tolerance ○ Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery. ○ Thread and semaphore isolation with circuit breakers. ● Realtime Operations ○ Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet. ○ Be alerted, make decisions, effect change and see results in seconds. ● Concurrency ○ Parallel execution. Concurrency-aware request caching. Automated batching through request collapsing.
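The "fail fast", "fallback", and "rapid recovery" points above can be illustrated with a minimal circuit breaker. This is not Hystrix: it counts consecutive failures, which is simpler than a rolling-window error rate, and the class name, parameters, and injectable clock are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, serves fallbacks while open, retries after a pause."""

    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, command, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()           # open: fail fast, do not touch the service
            # otherwise half-open: let one trial call through
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None               # success closes the circuit
        return result


now = [0.0]                                  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])
attempts = []


def failing_service():
    attempts.append(1)
    raise RuntimeError("dependency down")


responses = [breaker.call(failing_service, lambda: "cached") for _ in range(3)]
now[0] = 11.0                                # the pause has elapsed
recovered = breaker.call(lambda: "fresh", lambda: "cached")
```

The third call never reaches the failing dependency, which is the property that stops one slow service from tying up its callers.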
  75. 75. Hystrix Dashboard Example
  76. 76. DELIVERY
  77. 77. Edge Delivery ● Continuous deployment ● Automated system integrity analysis ● Tools for facilitating delivery
  78. 78. Automated Deployment Pipeline
  79. 79. Edge Cluster Organization (diagram labels: ELB; ZUUL; ZUUL-CANARY; ZUUL-DEBUG; ZUUL-SQUEEZE; MAIN ORIGIN; CANARY ORIGIN; DEBUG ORIGIN; SQUEEZE ORIGIN)
  80. 80. Most Requests to Main Origin
  81. 81. Some requests to Canary
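Sending most requests to the main origin and a small share to the canary can be sketched as a deterministic bucketing function. The deck does not say how the split is chosen; hashing a request identifier is an assumption, and `choose_origin` and its parameters are hypothetical names.

```python
import hashlib


def choose_origin(request_id: str, canary_percent: float) -> str:
    """Route roughly `canary_percent` percent of requests to the canary origin."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000     # 0..9999
    return "canary" if bucket < canary_percent * 100 else "main"


request_ids = [f"req-{i}" for i in range(10000)]
canary_share = sum(choose_origin(r, 5.0) == "canary" for r in request_ids) / len(request_ids)
```

Hashing keeps the decision stable for a given identifier, so a retried request lands on the same origin.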
  82. 82. Canary Analysis
  83. 83. Canary Analysis Detail
  84. 84. Response Validation (same diagram labels as slide 79)
  85. 85. Fork response to Main and Canary
  86. 86. Validate response integrity
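Response validation as outlined above sends the same request to the main and canary origins and compares what comes back. A minimal sketch, assuming responses are dicts and that some fields (here `server` and `timestamp`) legitimately differ between origins; the function and field names are hypothetical:

```python
def validate_response(request, main_origin, canary_origin,
                      ignore_keys=("server", "timestamp")):
    """Fork one request to both origins; return the main response and any differences."""
    main = main_origin(request)
    canary = canary_origin(request)

    def comparable(response):
        return {k: v for k, v in response.items() if k not in ignore_keys}

    a, b = comparable(main), comparable(canary)
    diffs = {k: (a.get(k), b.get(k))
             for k in sorted(set(a) | set(b))
             if a.get(k) != b.get(k)}
    return main, diffs        # the client is always answered by the main origin


def main_origin(request):
    return {"status": 200, "rows": 40, "server": "main-1"}


def good_canary(request):
    return {"status": 200, "rows": 40, "server": "canary-1"}


def bad_canary(request):
    return {"status": 200, "rows": 0, "server": "canary-1"}


_, clean_diffs = validate_response("/ps3/home", main_origin, good_canary)
_, broken_diffs = validate_response("/ps3/home", main_origin, bad_canary)
```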
  87. 87. Targeted Debugging (same diagram labels as slide 79; slides 88-89 repeat this diagram)
  90. 90. Squeezing the Origin (slide 91 repeats this diagram)
  92. 92. Finding service Capacity
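Squeeze testing, as the slides above suggest, increases load on one origin until it stops meeting its target, which gives a per-instance capacity figure. The stepping strategy, the latency criterion, and the toy latency model below are assumptions for illustration; `squeeze` and `toy_latency_ms` are hypothetical names.

```python
def squeeze(measure_latency_ms, max_latency_ms,
            start_rps=100, step_rps=100, limit_rps=10000):
    """Raise the request rate step by step; return the highest rate that met the target."""
    capacity = 0
    rps = start_rps
    while rps <= limit_rps:
        if measure_latency_ms(rps) > max_latency_ms:
            break                  # the instance is saturated at this rate
        capacity = rps
        rps += step_rps
    return capacity


def toy_latency_ms(rps):
    """Made-up model: flat latency up to 700 RPS, then growing linearly."""
    return 20.0 + max(0, rps - 700) * 0.5


instance_capacity = squeeze(toy_latency_ms, max_latency_ms=100.0)
```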
  93. 93. Scryer - Predictive auto-scaling ● Why? ○ Reactive doesn’t work in all cases ○ Reacting is sometimes too late ■ Sunday morning cartoons ○ Reactive overreacts ■ Super Bowl, World Cup, Outages ■ Fixed size scaling ○ All in all - more reliable and saves money
  94. 94. Daily Traffic Patterns
  95. 95. Scryer Predictions
  96. 96. How does Scryer work? ● Traffic shape analysis ○ Monday vs Monday ○ Sunday vs Sunday, etc ○ FFT based smoothing
  97. 97. Filtering out Noise
  98. 98. Ignoring outages
  99. 99. Accounting for regular spiky traffic
  100. 100. Iteratively apply FFT
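The smoothing steps above (filter out noise, ignore outages, apply the transform iteratively) can be sketched with a low-pass filter in the frequency domain. The deck does not give Scryer's algorithm in detail, so this is an assumed interpretation: keep only the lowest frequencies, replace samples that deviate from the smooth curve by more than a threshold, and repeat. A plain discrete Fourier transform is used instead of an FFT to keep the example dependency-free; all function names and parameters are hypothetical.

```python
import cmath
import math


def dft(xs):
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(xs))
            for k in range(n)]


def idft(cs):
    n = len(cs)
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(cs)).real / n
            for t in range(n)]


def low_pass(xs, keep):
    """Zero every frequency above `keep` (counting mirrored bins), then invert."""
    n = len(xs)
    spectrum = dft(xs)
    return idft([c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)])


def smooth_iteratively(xs, keep, threshold, rounds=3):
    """Replace outliers with the smoothed value and smooth again."""
    cleaned = list(xs)
    for _ in range(rounds):
        smooth = low_pass(cleaned, keep)
        cleaned = [s if abs(x - s) > threshold else x for x, s in zip(cleaned, smooth)]
    return low_pass(cleaned, keep)


# One synthetic "day" of traffic in 48 half-hour slots, with one anomalous spike.
clean = [100 + 50 * math.sin(2 * math.pi * t / 48) for t in range(48)]
noisy = list(clean)
noisy[10] += 400

single_pass = low_pass(noisy, keep=2)
iterated = smooth_iteratively(noisy, keep=2, threshold=100)
```

A single pass still leaks part of the spike into the smoothed curve; replacing the outlier and smoothing again removes most of that residue.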
  101. 101. Other Scryer Factors ● Traffic volume analysis ○ At least 4 weeks of data ○ Linear regression based on time of day ○ Correct the prediction based on today’s trend. ● Instance factors ○ Instance startup time ○ Instance capacity (obtained by squeeze testing) ● Scale (up/down) actions scheduled based on prediction
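The remaining factors above combine into a simple calculation: fit a trend to past traffic, predict the next value, divide by per-instance capacity, and schedule the scaling action early enough to cover startup time. The sketch below assumes the regression is fit over the same time slot across previous weeks; that reading, the headroom parameter, and all names are assumptions rather than Scryer's actual method.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_rps(history):
    """history: RPS for one time slot over past weeks, oldest first (at least 4 weeks)."""
    slope, intercept = linear_fit(list(range(len(history))), history)
    return slope * len(history) + intercept


def instances_needed(predicted_rps, rps_per_instance, headroom=0.0):
    """Instance count for the predicted load; capacity comes from squeeze testing."""
    return math.ceil(predicted_rps * (1 + headroom) / rps_per_instance)


def schedule_minute(target_minute, startup_minutes):
    """Start scaling early so instances are ready when the traffic arrives."""
    return target_minute - startup_minutes


history = [1000, 1100, 1200, 1300]          # four weeks of one time slot (made-up data)
predicted = predict_rps(history)
count = instances_needed(predicted, rps_per_instance=800)
start_at = schedule_minute(target_minute=540, startup_minutes=15)
```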
  102. 102. The Future
  103. 103. Future - Large Projects on Edge ● Async, non-blocking servers ● Service layer redesign ● Internal Insights ● Global Insights
  104. 104. Edge Architecture Today (diagram labels: ELB (x3); Zuul (x3); API Service; Streaming Service; Website Service; Netflix Services)
  105. 105. Future Edge Architecture (diagram labels: ELB (x2); Zuul; API/Edge Service; Playback Services; Website; Netflix Services. Slides 106-109 repeat this diagram.)
  110. 110. Global Insights (diagram labels: User Interface; Zuul; API/Edge Service; Playback Services; Netflix Services; Client Data; Event Stream; Insight Engine)
  111. 111. User Interface Designs
  112. 112. Netflix in the Cloud - 5 years later Lessons learned
  113. 113. What Did We Learn?
  114. 114. Failure is Assured!
  115. 115. Building for Failure ● Code failure - Continuous delivery ● Service failure - Fallbacks and redundancy ● Instance and zone failure - Redundancy ● Cloud infrastructure failure - Multiple active regions ● Human failure - Automation
  116. 116. Drawbacks of the cloud ● The cause of some failures is difficult to detect ○ Huge variability in instance performance that is almost impossible to explain ○ Network barriers ○ Multi-tenancy ○ Firewalls ● Very limited access to information and limited ability to fix issues
  117. 117. Software focus: Cloud’s greatest strength ● Scale our business ● Automate processes ● Radically experiment ● Remain resilient ● Move quickly
  118. 118. Netflix Culture - Our secret sauce ● Freedom and responsibility ● Highly aligned teams ● Aversion to process ● Design for necessity ● Design for failure ● Engineering teams operating their services
  119. 119. Netflix OSS ● Zuul - Smart edge router ● RxJava - Functional reactive libraries ● Hystrix - SOA resiliency ● + a lot more!
  120. 120. For more Info on Netflix Cloud Technology: Read our Technology Blog : http://techblog.netflix.com/ Check out our Open Source Cloud Projects : http://netflix.github.io
