(ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

The Netflix service supports more than 50 million subscribers in over 40 countries around the world. These subscribers use more than 1,000 different device types to connect to Netflix, resulting in massive amounts of traffic to the service. In our distributed environment, the gateway service that receives this customer traffic needs to be able to scale in a variety of ways while simultaneously protecting our subscribers from failures elsewhere in the architecture. This talk will detail how the Netflix front door operates, leveraging systems like Hystrix, Zuul, and Scryer to maximize the AWS infrastructure and to create a great streaming experience.

  1. 1. November 12, 2014 | Las Vegas, Nevada | Daniel Jacobson, Netflix | Ben Schmaus, Netflix
  2. 2. Edge Engineering | Daniel Jacobson: @daniel_jacobson, danieljacobson/linkedin, danieljacobson.com/slideshare | Ben Schmaus: @schmaus, schma.us/in, schma.us/slides
  3. 3. What does Edge Engineering do? • Broker data between services and devices • Control playback flow • Ensure resiliency • Scale our systems • Enable high velocity product innovation • Provide detailed, real-time health insights “The Edge... the only people who really know where it is are the ones who have gone over.” --Hunter S. Thompson
  5. 5. Also at re:Invent: APP-310, Scheduling using Apache Mesos in the Cloud, 9:00 on Friday
  6. 6. [Architecture diagram with four tiers: devices, routing, origin, services. The origin tier shows several API clusters, each labeled RxJava, Hystrix, and Scripting; the services tier shows services S1 through S13.]
  7. 7. [Architecture diagram, shown on slides 7 and 8, with the same four tiers. The origin tier shows Playback, Website, Logging, and API clusters; the API cluster is labeled RxJava, Hystrix, and Scripting.]
  9. 9. Routing Traffic “There is no Dana, only Zuul!”
  10. 10. Zuul: Gatekeeper for the Netflix Streaming Application
  11. 11. Zuul • Multi-Region Resiliency • Dynamic Routing • Squeeze Testing • Insights • Load Shedding • Security • Authentication
  12. 12. [Architecture diagram with four tiers: devices, routing, origin, services. The origin tier shows several API clusters, each labeled RxJava, Hystrix, and Scripting; the services tier shows services S1 through S13.]
  13. 13. [Architecture diagram: the origin tier shows a PROD cluster and a DEBUG cluster, each labeled RxJava, Hystrix, and Scripting.]
  14. 14. [Architecture diagram: the origin tier shows a PROD cluster and a SQUEEZE cluster, each labeled RxJava, Hystrix, and Scripting.]
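The slides show routing sending some traffic to a DEBUG or SQUEEZE cluster alongside PROD, but not how the split is decided. The following is a minimal sketch of one way a gateway could divert a fixed share of requests, assuming a hash of the request id is used as the selector. The cluster names, the `pick_cluster` function, and the `squeeze_percent` parameter are hypothetical and are not taken from Zuul.

```python
import hashlib

# Hypothetical cluster names; the slides only label them PROD and SQUEEZE.
PROD, SQUEEZE = "prod", "squeeze"


def pick_cluster(request_id: str, squeeze_percent: float) -> str:
    """Route a stable slice of requests to the squeeze cluster.

    Hashing the request id (instead of drawing a random number) keeps the
    decision deterministic: the same id always lands on the same cluster.
    """
    if not 0.0 <= squeeze_percent <= 100.0:
        raise ValueError("squeeze_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto the range [0, 100).
    bucket = int.from_bytes(digest[:4], "big") / 2**32 * 100.0
    return SQUEEZE if bucket < squeeze_percent else PROD
```

Raising `squeeze_percent` step by step pushes more load onto the squeeze cluster, which is one way to find out how much traffic a single cluster can take before it degrades.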
  15. 15. Systems are healthy. Traffic from the east goes to US-EAST. Traffic from the west goes to US-WEST.
  16. 16. Systems failure in US-EAST.
  17. 17. US-EAST Zuul routes traffic to US-WEST Zuul (until DNS gets resolved).
  18. 18. DNS gets resolved. Requests from the east go to US-WEST.
  19. 19. Systems recover in US-EAST. DNS set to return to normal.
  20. 20. DNS gets resolved. Both regions return to normal.
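The failover sequence above comes down to one routing decision: serve from the home region when it is healthy, otherwise send the request to another healthy region. A minimal sketch of that decision follows. The `HOME_REGION` table and the `choose_region` function are hypothetical names for illustration, and the sketch leaves out the DNS change and the Zuul-to-Zuul proxying that the slides describe.

```python
from typing import Set

# Hypothetical mapping from where traffic originates to its home region.
HOME_REGION = {"east": "US-EAST", "west": "US-WEST"}


def choose_region(origin: str, healthy: Set[str]) -> str:
    """Return the region that should serve a request from `origin`."""
    home = HOME_REGION[origin]
    if home in healthy:
        return home
    # Home region is down: pick another healthy region.
    # Sorting makes the choice deterministic when several are available.
    for region in sorted(healthy):
        return region
    raise RuntimeError("no healthy region available")
```

With both regions healthy, `choose_region("east", {"US-EAST", "US-WEST"})` stays in US-EAST; with US-EAST removed from the healthy set, the same call returns US-WEST.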
  21. 21. Resiliency in Distributed Systems Preventing cascading failures
  22. 22. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” --Leslie Lamport
  23. 23. Dependency Relationships
  24. 24. 5,000,000,000 Incoming Requests Per Day Netflix API
  25. 25. 30 Dependent Services Netflix API
  26. 26. ~600 Dependency Jars Netflix API
  27. 27. 40,000,000,000 Outbound Calls Per Day to Dependent Services Netflix API
  28. 28. 1 Thing is common across all dependencies…
  29. 29. 0 Dependent Services have a 100% SLA
  30. 30. 99.99%^30 ≈ 99.7%; 0.3% of 5B = 15M failures per day; 2+ Hours of Downtime Per Month
  32. 32. 99.9%^30 ≈ 97%; 3% of 5B = 150M failures per day; 20+ Hours of Downtime Per Month
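The arithmetic on these slides compounds a per-service availability across the 30 dependent services and applies the result to 5 billion requests per day. The sketch below reproduces it, assuming the services fail independently and that a month has 30 days; both are simplifying assumptions, and the function names are illustrative.

```python
def compound_availability(per_service: float, dependencies: int) -> float:
    """Availability when every one of `dependencies` independent services must succeed."""
    return per_service ** dependencies


def failures_per_day(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day at the given availability."""
    return (1.0 - availability) * requests_per_day


def downtime_hours_per_month(availability: float, days: int = 30) -> float:
    """Unavailability expressed as hours over a month of `days` days."""
    return (1.0 - availability) * days * 24


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        combined = compound_availability(sla, 30)
        print(
            f"{sla:.2%} per service -> {combined:.1%} combined, "
            f"{failures_per_day(combined, 5_000_000_000):,.0f} failures/day, "
            f"{downtime_hours_per_month(combined):.1f} hours down/month"
        )
```

The slides round the combined availability to 99.7% and 97% before computing failures, so the unrounded figures printed here differ slightly from 15M and 150M.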
  33. 33. [Architecture diagram, shown on slides 33 through 36, with four tiers: devices, routing, origin, services. The origin tier shows several API clusters, each labeled RxJava, Hystrix, and Scripting; the services tier shows services S1 through S13.]
  37. 37. Call Volume and Health / Last 10 Seconds Call Volume / Last 2 Minutes
  38. 38. • Successful Requests • Short-Circuited Requests, Delivering Fallbacks • Timeouts, Delivering Fallbacks • Full Queues, Delivering Fallbacks • Exceptions, Delivering Fallbacks
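  39. 39. Error Rate: (Short-Circuited + Timeouts + Full Queues + Exceptions) / (Successful + Short-Circuited + Timeouts + Full Queues + Exceptions) = Error Rate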
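As a small worked example, assume the error rate is the share of requests that ended in any of the four fallback outcomes listed on slide 38 (short-circuited, timed out, rejected by a full queue, or failed with an exception). The `Counts` class and its field names are illustrative and are not Hystrix's own metric names.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Request outcomes over one reporting window (field names are illustrative)."""
    success: int
    short_circuited: int
    timeout: int
    full_queue: int
    exception: int


def error_rate(c: Counts) -> float:
    """Fraction of requests in the window that were served by a fallback."""
    failures = c.short_circuited + c.timeout + c.full_queue + c.exception
    total = c.success + failures
    # An idle window has no requests, so report 0.0 instead of dividing by zero.
    return failures / total if total else 0.0
```

For a window with 90 successes and 10 requests spread across the four fallback outcomes, `error_rate` returns 0.1, that is, a 10% error rate.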
  40. 40. [Architecture diagram, shown on slides 40 through 42, with four tiers: devices, routing, origin, services. The origin tier shows several API clusters, each labeled RxJava, Hystrix, and Scripting; the services tier shows services S1 through S13. Slide 42 adds a "Fallback" label.]
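The short-circuit and fallback behavior named on these slides follows the circuit breaker pattern. The sketch below is a hand-rolled illustration of that pattern, not Hystrix's API or implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are all assumptions made to keep the example small and testable.

```python
import time


class CircuitBreaker:
    """Serve a fallback when the primary call keeps failing."""

    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: short-circuit without touching the dependency.
                return fallback()
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the pattern is that a failing dependency costs the caller a cheap fallback instead of a blocked thread, which is how one slow service is kept from dragging down the whole API tier.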
  43. 43. Demo May the demo gods be with us…
  44. 44. Scaling Systems Preventing failures due to capacity issues
  45. 45. “The possibilities are numerous once we decide to act and not react” --George Bernard Shaw
  46. 46. Reactive Auto Scaling • Reacts to real-time conditions • Responds to spikes/dips in metrics – Load average – Requests per second • Excellent for many scaling scenarios – Much better than static cluster sizing
  47. 47. Reactive Auto Scaling: Challenges • Policies can be inefficient • Outages can trigger scale-down events • Excess capacity at peak and trough
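A reactive policy of the kind described on these two slides can be sketched as a threshold rule on a live metric. The thresholds, the step size, and the `reactive_desired_capacity` function below are illustrative assumptions, not Netflix's or AWS's actual policy settings.

```python
def reactive_desired_capacity(
    current: int,
    load_average: float,
    scale_up_at: float = 0.75,
    scale_down_at: float = 0.35,
    step_percent: int = 10,
    min_size: int = 1,
    max_size: int = 1000,
) -> int:
    """Return the new cluster size given the current size and load average."""
    # Change capacity by a percentage of the current size, at least one instance.
    delta = max(1, current * step_percent // 100)
    if load_average > scale_up_at:
        desired = current + delta
    elif load_average < scale_down_at:
        desired = current - delta
    else:
        desired = current
    # Never leave the configured bounds.
    return max(min_size, min(max_size, desired))
```

The rule also shows why outages can trigger scale-down events: if an upstream failure cuts traffic, the load average falls below `scale_down_at` and the policy removes instances just before the traffic returns.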
  48. 48. Scryer: Predictive Auto Scaling. Not yet…
  49. 49. Typical Traffic Patterns Over Five Days
  50. 50. Predicted RPS Compared to Actual RPS
  51. 51. Scaling Plan for Predicted Workload
  52. 52. What is Scryer Doing? • Evaluates needs based on historical data – Week over week, month over month metrics • Adjusts instance minimums based on algorithms – Constant feedback loops – Evaluated routinely through squeeze tests • Relies on Auto Scaling for unpredicted spikes in traffic
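The slides do not describe Scryer's prediction algorithms, so the following is only a toy illustration of the general idea: forecast load for a time slot from the same slot in earlier weeks, then turn the forecast into an instance minimum. The averaging, the headroom figure, and the function names are all assumptions; the per-instance capacity is the kind of number a squeeze test might supply.

```python
import math
from statistics import mean
from typing import Sequence


def predict_rps(same_slot_history: Sequence[float]) -> float:
    """Forecast requests per second from the same time slot in prior weeks."""
    if not same_slot_history:
        raise ValueError("need at least one historical observation")
    return mean(same_slot_history)


def instance_minimum(
    predicted_rps: float, rps_per_instance: int, headroom_percent: int = 20
) -> int:
    """Instances needed to serve the forecast with some spare capacity."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    needed = predicted_rps * (100 + headroom_percent) / (100 * rps_per_instance)
    # At least one instance, rounded up so the forecast is always covered.
    return max(1, math.ceil(needed))
```

Setting only the minimum, as the slide describes, leaves reactive auto scaling free to add capacity on top when traffic exceeds the forecast.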
  53. 53. Results
  54. 54. Results: Load Average [chart comparing Reactive and Predictive]
  56. 56. Results: Response Latencies [chart comparing Reactive and Predictive]
  58. 58. Results: Outage Recovery
  59. 59. Results: AWS Costs
  60. 60. Key Takeaways
  61. 61. https://www.github.com/Netflix
  62. 62. Netflix talks at re:Invent (Talk | Time | Title)
        PFC-305 | Wednesday, 1:15pm | Embracing Failure: Fault Injection and Service Reliability
        BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
        PFC-306 | Wednesday, 2:15pm | Performance Tuning EC2
        DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
        ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
        PFC-304 | Wednesday, 4:30pm | Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
        ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
        APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud
  63. 63. http://bit.ly/awsevals http://schma.us/in http://schma.us/slides
