Keeping Movies Running Amid Thunderstorms!

How does Netflix strive to deliver an uninterrupted service? This talk, delivered for the first time in November 2011, covers some engineering design concepts that help us deliver features at a rapid pace while assuring high availability.

Transcript

  • 1. Keeping Movies Running Amid Thunderstorms: Fault-tolerant Systems @ Netflix. Sid Anand (@r39132), QCon SF 2011
  • 2. Backgrounder: Netflix Then and Now
  • 3. Netflix Then and Now
    Netflix prior to circa 2009: Users watched DVDs at home (peak days: Friday, Saturday, Sunday). Users returned DVDs and updated their queues (peak days: Sunday, Monday). We shipped the next DVDs (peak days: Monday, Tuesday). Scheduled site downtimes on alternate Wednesdays.
    Netflix post circa 2009: Users watch streaming at home (peak days: Friday, Saturday, Sunday). Off-peak days see many orders of magnitude more traffic than prior to 2009. User expectation is that streaming is always available. No scheduled site downtimes. Fault tolerance is a top design concern.
  • 4. Netflix DC Architecture: A Simple System
  • 5. Netflix's DC Architecture: Components. 1 Netscaler H/W load balancer; ~20 "WWW" Apache+Tomcat servers; 3 Oracle DBs & 1 MySQL DB; cache servers; the Cinematch recommendation system.
  • 6. Netflix's DC Architecture: Types of Production Issues. Java garbage collection problems would result in slower WWW pages. Deadlocks in our multi-threaded Java application would cause web page loading to time out. Transaction locking in the DB would result in similar web page loading timeouts. Under-optimized SQL or DB would cause slower web pages (e.g. the DB optimizer picks a sub-optimal execution plan).
  • 7. Netflix's DC Architecture. Architecture pros: As serious as these sound, they were typically single-system failure scenarios, and single-system failures are relatively easy to resolve. Architecture cons: Not horizontally scalable; we're constrained by what can fit on a single box. Not conducive to high-velocity development and deployment.
  • 8. Netflix's Cloud Architecture: A Less Simple System
  • 9. Netflix's Cloud Architecture: Components. Many (~100) applications, organized in clusters. Clusters can be at different levels in the call stack. Clusters can call each other. (Diagram: ELBs in front of NES, with NMTS, NBES, and IAAS below, and a Discovery service alongside.)
  • 10. Netflix's Cloud Architecture: Levels. NES: Netflix Edge Services. NMTS: Netflix Mid-tier Services. NBES: Netflix Back-end Services. IAAS: AWS IAAS services. Discovery: helps services discover NMTS and NBES services.
  • 11. Netflix's Cloud Architecture: Components (NES), Overview. Any service that browsers and streaming devices connect to over the internet. They sit behind AWS Elastic Load Balancers (a.k.a. ELB). They call clusters at lower levels.
  • 12. Netflix's Cloud Architecture: Components (NES), Examples. API Servers: support the video browsing experience; also allow users to modify their Q. Streaming Control Servers: support streaming video playback; authenticate your Wii, PS3, etc.; download DRM to the Wii, PS3, etc.; return a list of CDN URLs to the Wii, PS3, etc.
  • 13. Netflix's Cloud Architecture: Components (NMTS), Overview. Can call services at the same or lower levels: other NMTS, NBES, IAAS, but not NES. Exposed through our Discovery service.
  • 14. Netflix's Cloud Architecture: Components (NMTS), Examples. Netflix Queue Servers: modify items in the users' movie queue. Viewing History Servers: record and track all streaming movie watching. SIMS Servers: compute and serve user-to-user and movie-to-movie similarities.
  • 15. Netflix's Cloud Architecture: Components (NBES), Overview. A back-end, usually 3rd-party, open-source service. A leaf in the call tree; cannot call anything else.
  • 16. Netflix's Cloud Architecture: Components (NBES), Examples. Cassandra clusters: our new cloud database is Cassandra and stores all sorts of data to support application needs. Zookeeper clusters: our distributed lock service and sequence generator. Memcached clusters: typically cache things that we store in S3 but need to access quickly or often.
  • 17. Netflix's Cloud Architecture: Components (IAAS), Examples. AWS S3: large-sized data (e.g. video encodes, application logs, etc.) is stored here, not Cassandra. AWS SQS: Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS).
  • 18. Netflix's Cloud Architecture: Types of Production Issues. A user-issued call will pass through multiple levels during normal operation. We are now exposed to multi-system coincident failures, a.k.a. coordinated failures.
  • 19. Netflix's Cloud Architecture. Architecture pros: Horizontally scalable at every level. Should give us maximum availability. Supports high-velocity development and deployment. Architecture cons: A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation; latency can be a concern. We are now exposed to multi-system coincident failures, a.k.a. coordinated failures. A lot of moving parts.
  • 20. Issue 1: Capacity Planning
  • 21. Issue 1. Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances. If either of these services expects a large increase in traffic, it needs to let the owner of Service A know. Service A can then scale up ahead of the traffic increase. Disaster avoided?
  • 22. Issue 1. A given application owner may need to contact 20 other application owners each time he expects to get a large increase in traffic: too much human coordination. A few options: some service owners vastly over-provision for their application (not cost effective); auto-scaling (we want to generalize the model first proved by our Streaming Control Server, a.k.a. NCCP, team).
  • 23. ELB AutoScaling Interlude: How to use an ELB. An Elastic Load Balancer (ELB) routes traffic to your EC2 instances (e.g. nccp-wii-11111111.us-east-1.elb.amazonaws.com). Netflix maps a CNAME to this ELB, e.g. nccp.wii.netflix.com. Netflix then registers the API Service's EC2 instances with this ELB. The ELB periodically polls the attached EC2 instances to ensure the instances are healthy.
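A minimal sketch of the registration step using the AWS SDK for Java (1.x); the load balancer name and instance IDs below are placeholders, and the CNAME mapping itself happens in DNS, outside the SDK:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClient;
import com.amazonaws.services.elasticloadbalancing.model.Instance;
import com.amazonaws.services.elasticloadbalancing.model.RegisterInstancesWithLoadBalancerRequest;

public class ElbRegistrationExample {
    public static void main(String[] args) {
        AmazonElasticLoadBalancingClient elb =
                new AmazonElasticLoadBalancingClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // Register two (placeholder) EC2 instances with the ELB fronting the NCCP Wii endpoint.
        // Once registered, the ELB health-checks them and routes nccp.wii.netflix.com traffic to them.
        elb.registerInstancesWithLoadBalancer(new RegisterInstancesWithLoadBalancerRequest()
                .withLoadBalancerName("nccp-wii")                 // placeholder ELB name
                .withInstances(new Instance().withInstanceId("i-11111111"),
                               new Instance().withInstanceId("i-22222222")));
    }
}
```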
  • 24. ELB AutoScaling Interlude: Taking this a bit further. The NCCP servers can publish metrics to AWS CloudWatch. We can set up an alarm in CloudWatch on a metric (e.g. CPU). We can associate an auto-scale policy with that alarm (e.g. if CPU > 60%, add 3 more instances). When a metric goes above a limit, an alarm is triggered, causing auto-scaling, which grows our pool.
  • 25. ELB AutoScaling Interlude (diagram). NCCP EC2 instances publish CPU data to CloudWatch (alarms); CloudWatch alarms trigger Auto Scaling Service policies; EC2 instances are added/removed.
  • 26. ELB AutoScaling Interlude. Scale-out event: average CPU > 60% for 5 minutes. Scale-in event: average CPU < 30% for 5 minutes. Cool-down period: 10 minutes. Auto-scale alerts: DLAutoScaleEvents.
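Wired together with the AWS SDK for Java (1.x), the scale-out half of this configuration might look roughly like the sketch below; the credentials, ASG name, and alarm/policy names are placeholders, and a mirror-image policy and alarm (average CPU < 30%, negative adjustment) would cover the scale-in event:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class NccpScaleOutPolicy {
    public static void main(String[] args) {
        BasicAWSCredentials creds = new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient(creds);
        AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient(creds);

        // Policy: when triggered, add 3 instances to the ASG, then cool down for 10 minutes.
        String policyArn = autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("nccp-wii-asg")          // placeholder ASG name
                .withPolicyName("nccp-wii-scale-out")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(3)
                .withCooldown(600)).getPolicyARN();

        // Alarm: average CPU > 60% over one 5-minute period fires the policy above.
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("nccp-wii-high-cpu")
                .withNamespace("AWS/EC2")
                .withMetricName("CPUUtilization")
                .withDimensions(new Dimension()
                        .withName("AutoScalingGroupName").withValue("nccp-wii-asg"))
                .withStatistic(Statistic.Average)
                .withPeriod(300)
                .withEvaluationPeriods(1)
                .withThreshold(60.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions(policyArn));
    }
}
```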
  • 27. @r39132
  • 28. Issue 1, Summary. We would like to have auto-scaling at all levels.
  • 29. Issue 2: Thundering Herds to NMTS
  • 30. Issue 2. Step 1: Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances. Step 2a: Service Y overwhelms Service A. Step 3: Services X & Y experience read and connection timeouts against an overwhelmed Service A. Step 4: Service A's tier gets 2 more machines.
  • 31. Issue 2. Step 5: New requests + retries cause request storms (a.k.a. thundering herds). Step 6: If Service A can be grown to exceed retry-storm steady-state traffic volume, we can exit this vicious cycle; else, more timeouts, and the vicious cycle continues.
  • 32. Issue 2. Step 1: Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances. Step 2b: Service A experiences slowness. Step 3: Services X & Y experience read and connection timeouts against a slower Service A. Step 4 (optional): If the slowness can be fixed by adding more machines to Service A's tier, then do so.
  • 33. Issue 2. Step 5: New requests + retries cause request storms (a.k.a. thundering herds). Step 6: If Service A can be grown to exceed retry-storm steady-state traffic volume, we can exit this vicious cycle; else, more timeouts, and the vicious cycle continues.
  • 34. Issue 2: Potential Causes of Thundering Herd. Service Y sends more traffic to Service A without checking if Service A has available capacity. Service A slows down. Service Y's timeouts against Service A are set too low. Service Y's retries against Service A are too aggressive. Natural organic growth in traffic hits a tipping point in the system (in Service A in this case).
  • 35. Solutions to Issue 2: Thundering Herds to NMTS
  • 36. Solutions to Issue 2: The Platform Solution. Every service at Netflix sits on the platform.jar. The platform.jar offers 2 components of interest here: the NIWS library, the client-side of Netflix Inter-Web Service calls, which handles retry, failover, thundering-herd prevention, & fast failure; and the BaseServer library, a set of Tomcat servlet filters that protect the underlying application servlet stack (in this context, it throttles traffic).
  • 37. Solutions to Issue 2: The Platform Solution, NIWS library. Fair retry logic, e.g. exponential bounded backoff. Takes 2 configuration params per client: Max_Num_of_Requests (a.k.a. MNR) and Sample_Interval_in_seconds (a.k.a. SI). Ensures that a client does not send more than MNR/SI requests/s, else throttles requests at the client.
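The NIWS internals are not shown in the deck, but the two behaviors described here, the MNR/SI client-side throttle and bounded exponential backoff between retries, are small enough to sketch; the class and names below are illustrative, not the actual platform API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative client-side guard: at most MNR requests per SI-second window, with bounded backoff on retries. */
public class ThrottlingRetryClient {
    private final int maxRequests;            // MNR
    private final long sampleIntervalMs;      // SI, converted to milliseconds
    private final AtomicInteger sent = new AtomicInteger();
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());

    public ThrottlingRetryClient(int maxRequests, int sampleIntervalSec) {
        this.maxRequests = maxRequests;
        this.sampleIntervalMs = sampleIntervalSec * 1000L;
    }

    /** Throttle at the client: refuse to send more than MNR requests per sampling interval. */
    private boolean acquirePermit() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= sampleIntervalMs && windowStart.compareAndSet(start, now)) {
            sent.set(0);                       // start a new sampling window
        }
        return sent.incrementAndGet() <= maxRequests;
    }

    /** Run the call with up to maxRetries retries, backing off 100ms, 200ms, 400ms... capped at 2s. */
    public <T> T execute(Callable<T> call, int maxRetries) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (!acquirePermit()) {
                throw new IllegalStateException("client-side throttle: over " + maxRequests + " requests per interval");
            }
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                long backoff = Math.min(2000L, 100L << attempt);  // bounded exponential backoff
                Thread.sleep(backoff);
            }
        }
        throw last;
    }
}
```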
  • 38. Solutions to Issue 2: The Platform Solution, BaseServer. As an additional fail-safe, the server can set throttles that are not client-specific (i.e. the limits apply to total inbound traffic, regardless of client). Takes 1 configuration parameter: Max_Num_of_Concurrent_Requests (a.k.a. MNCR). Ensures that a server does not handle more than MNCR requests at any instant. If the traffic exceeds the limits, reject excess calls at the server (i.e. 503s).
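A sketch of this server-side fail-safe as a plain Tomcat servlet filter: cap in-flight requests at MNCR and shed the excess with 503s. The real BaseServer filter chain is not public; the filter and its limit value below are illustrative.

```java
import java.io.IOException;
import java.util.concurrent.Semaphore;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

/** Illustrative server-wide throttle: allow at most MNCR concurrent requests, reject the rest with 503. */
public class ConcurrencyThrottleFilter implements Filter {
    private static final int MAX_CONCURRENT_REQUESTS = 200;   // MNCR, illustrative value
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT_REQUESTS);

    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (!permits.tryAcquire()) {
            // Over the limit: reject immediately rather than queueing and eventually timing out.
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        try {
            chain.doFilter(req, res);
        } finally {
            permits.release();
        }
    }
}
```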
  • 39. Solutions to Issue 2: The Platform Solution, Graceful Degradation. Any client that is throttled at either the NIWS Throttle Layer or the BaseServer Throttle Layer needs to implement graceful degradation. Netflix's web-scale traffic falls into 2 categories. Users get a personalized set of movies to pick from (i.e. via the API Edge Server path); GD: show popular movies, not personalized movies. Users can start watching a movie (i.e. via the NCCP Edge Server path); GD: a tougher problem to solve; when device leases expire, we honor them if we are unable to generate a new one for them.
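For the API path, graceful degradation boils down to catching the throttle and answering with a non-personalized response. A minimal sketch, in which the personalization client and the exception type are hypothetical stand-ins for the real platform classes:

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative fallback: serve popular titles when the personalization tier throttles or times out. */
public class MovieRowService {

    /** Hypothetical exception thrown when a downstream call is throttled or times out. */
    public static class DownstreamUnavailableException extends RuntimeException { }

    /** Hypothetical client for the personalization mid-tier service. */
    public interface PersonalizationClient {
        List<String> personalizedTitles(String userId) throws DownstreamUnavailableException;
    }

    private final PersonalizationClient personalization;
    private final List<String> popularTitles = Arrays.asList("Popular Title 1", "Popular Title 2");

    public MovieRowService(PersonalizationClient personalization) {
        this.personalization = personalization;
    }

    public List<String> titlesFor(String userId) {
        try {
            return personalization.personalizedTitles(userId);
        } catch (DownstreamUnavailableException e) {
            // Graceful degradation: the row is still rendered, just not personalized.
            return popularTitles;
        }
    }
}
```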
  • 40. Solutions to Issue 2. This all sounds great! But what if developers do not use these built-in features of the platform or neglect to set their configuration appropriately? (i.e. the default RPS limit in the NIWS client is Integer.MAX_VALUE)
  • 41. Solutions to Issue 2: We have a little help.
  • 42. Simian Army: Prevention is the best medicine.
  • 43. Simian Army: Chaos Monkey. Simulates hard failures in AWS by killing a few instances per ASG (Auto Scale Group), similar to how EC2 instances can be killed by AWS with little warning. Tests clients' ability to gracefully deal with broken connections, interrupted calls, etc. Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances; if not, the Chaos Monkey will win!
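Chaos Monkey's code isn't part of the deck, but the core loop it describes is simple: pick one random instance per auto-scale group and terminate it, trusting the ASG to reincarnate it. A sketch, with the EC2/ASG calls hidden behind a hypothetical CloudApi interface:

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative Chaos Monkey: terminate one random instance per auto-scale group. */
public class ChaosMonkey {

    /** Hypothetical wrapper around the EC2 / Auto Scaling APIs. */
    public interface CloudApi {
        Map<String, List<String>> instancesByAutoScaleGroup();
        void terminate(String instanceId);
    }

    private final CloudApi cloud;
    private final Random random = new Random();

    public ChaosMonkey(CloudApi cloud) {
        this.cloud = cloud;
    }

    /** Run periodically: each ASG loses one instance and must prove it can replace it. */
    public void unleash() {
        for (Map.Entry<String, List<String>> group : cloud.instancesByAutoScaleGroup().entrySet()) {
            List<String> instances = group.getValue();
            if (instances.isEmpty()) continue;
            String victim = instances.get(random.nextInt(instances.size()));
            System.out.println("Chaos Monkey terminating " + victim + " in " + group.getKey());
            cloud.terminate(victim);
        }
    }
}
```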
  • 44. Simian Army: Latency Monkey. Simulates soft failures, i.e. a service gets slower. Injects random delays on the NIWS (client-side) or BaseServer (server-side) end of a client-server interaction in production. Tests the ability of applications to detect and recover (i.e. graceful degradation) from the harder problem of delays, which leads to thundering herds and timeouts.
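Server-side latency injection can be pictured as one more filter in the chain that, for a fraction of requests, sleeps for a random interval before handing off. A sketch of the idea (the injection rate and delay bound are illustrative, not Latency Monkey's actual configuration):

```java
import java.io.IOException;
import java.util.Random;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

/** Illustrative latency injection: delay a fraction of requests by a random amount. */
public class LatencyInjectionFilter implements Filter {
    private static final double INJECTION_RATE = 0.10;   // delay 10% of requests (illustrative)
    private static final int MAX_DELAY_MS = 3000;         // up to 3s of added latency (illustrative)
    private final Random random = new Random();

    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (random.nextDouble() < INJECTION_RATE) {
            try {
                Thread.sleep(random.nextInt(MAX_DELAY_MS));   // simulate a slow downstream hop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, res);
    }
}
```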
  • 45. Simian Army: Does this solve all of our issues?
  • 46. Simian Army. The infinite cloud is infinite when your needs are moderate! To ensure fairness among tenants, AWS meters or limits every resource. Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit: a few hours!
  • 47. Simian Army. Limits Monkey: checks once a day whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS! Conformity & Janitor Monkeys: find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc.) to increase head-room; this buys us more time before we run out of resources and also saves us $$$$.
  • 48. Simian Army. The Simian Army fills the gap created by an absence of process and a need to ensure fault-tolerance and efficient operation of our systems.
  • 49. Fast Rollback: Fault-tolerant Deployment
  • 50. Fast Rollback. What is the point of having fault-tolerant layers if the deployment of a bug can take them down?
  • 51-57. Fast Rollback (progressive build): Optimism causes outages. Production traffic is unique. Keep the old version running. Switch traffic to the new version. Monitor results. Revert traffic quickly.
  • 58-64. Fast Rollback (diagram build): api-frontend starts out routing to api-usprod-v007; api-usprod-v008 is brought up alongside it; in the final frame only api-usprod-v008 sits behind api-frontend, i.e. traffic has been switched to the new version.
  • 65-70. Fast Rollback (diagram build, rollback case): api-frontend with api-usprod-v007; api-usprod-v008 is brought up alongside it; in the final frame only api-usprod-v007 remains behind api-frontend, i.e. traffic has been reverted to the old version.
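The mechanics behind these diagrams: both versions stay deployed as separate server groups behind the same frontend, so switching and reverting traffic are just attach/detach operations. A sketch with a hypothetical TrafficRouter standing in for the ELB/DNS plumbing:

```java
/**
 * Illustrative red/black push: the new version comes up alongside the old one,
 * traffic is switched over, and the old version stays warm so a revert is just
 * another traffic switch. Group names mirror the slides (api-usprod-v007/v008).
 */
public class FastRollbackDeployer {

    /** Hypothetical abstraction over attaching/detaching server groups from the frontend. */
    public interface TrafficRouter {
        void attach(String serverGroup);          // start routing traffic to this group
        void detach(String serverGroup);          // stop routing traffic, but leave the instances running
        boolean looksHealthy(String serverGroup); // "monitor results": error rates, latency, etc.
    }

    private final TrafficRouter router;

    public FastRollbackDeployer(TrafficRouter router) {
        this.router = router;
    }

    public void push(String oldGroup, String newGroup) {
        router.attach(newGroup);       // e.g. "api-usprod-v008": old and new briefly serve together
        router.detach(oldGroup);       // switch traffic to the new version

        if (!router.looksHealthy(newGroup)) {
            router.attach(oldGroup);   // revert traffic quickly: the old version never went away
            router.detach(newGroup);
        }
        // The losing group is torn down later, once confidence in the outcome is high.
    }
}
```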
  • 71. Acknowledgements. Platform Engineering: Sudhir Tonse, Pradeep Kamath. Engineering Tools: Joe Sondow. Streaming Server: Ranjit Mavinkurve.
  • 72. Questions? Sid Anand (@r39132)