AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)

571 views

Published on

Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements with service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization and then delve deep into the architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
571
On SlideShare
0
From Embeds
0
Number of Embeds
21
Actions
Shares
0
Downloads
74
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Coburn Watson, Director Performance & Reliability November 29, 2016 #NetflixEverywhere From Resilience to Ubiquity ARC204
  2. 2. Disappointment
  3. 3. Outrage
  4. 4. Withdrawal
  5. 5. Synchronized Frustration
  6. 6. Failure is inevitable
  7. 7. Failure-Driven Architecture Never fail the same way twice
  8. 8. #NetflixEverywhere Failure-Driven Architecture Never fail the same way twice
  9. 9. cwatson@netflix.com • Chaos Engineering • Core Site Reliability Engineering • Cloud Capacity Planning • Cloud Networking • Performance and OS Engineering • Traffic Management Director – Performance & Reliability
  10. 10. Bringing movies & TV shows from all over the world to people all over the world • Streaming, on demand, subscription • Global & regional licensing • Hollywood, independent, international • Striving for global ubiquity
  11. 11. 86,000,000 Members $2B+ Revenue per Quarter 35% of Internet Traffic at Peak in US < 3k Employees
  12. 12. 2007 Jan – Windows 2008 May – Roku Oct – LG, Samsung Blu-ray Oct – Apple Mac Nov – XBox 360 2009 Jun – LG DTV Nov –Sony PS3 (disc) Nov – Sony Bravia • DTV & Blu-ray Device Ubiquity
  13. 13. 2011 Android First e-readers • Kindle Fire, Nook Device Ubiquity 2010 • Nintendo Wii (disc) • Apple iPad • Apple iPhone • Apple TV • Sony PS3 (no disc) • Nintendo Wii (no disc) • Windows Phone 7
  14. 14. 2010 - Canada 2011 - Latin America 2012 - UK, Ireland, Nordics 2013 – Netherlands 2014 - Austria, Belgium, France, Germany, Luxembourg, Switzerland 2015 - Australia, New Zealand, Japan, Spain, Italy, Portugal Geographic Ubiquity
  15. 15. English Spanish (Latin American) Portuguese (Brazilian) Dutch French Language Ubiquity - Subs, Dubs, UI German Japanese Spanish (Castilian) Italian Portuguese (European)
  16. 16. August 2008
  17. 17. • No automation, virtualization, standardization • Manual, error prone, slow • Big iron & monoliths DC2 2009
  18. 18. Undifferentiated Heavy Lifting
  19. 19. US-East-1 Amazon Web Services 2010 • Scale & elasticity • Virtual, programmable • Global footprint
  20. 20. Microservices Database Cache Traffic Architectural Pillars
  21. 21. Microservices
  22. 22. Edge ELB Zuul NCCP API Middle Tier & Platform
  23. 23. FIT Fault-Injection Test Framework Microservice Failure
  24. 24. Database Architectural Pillars
  25. 25. NoSQL but… Not web scale Throttling Modest scale 100s of play starts/second 10,000s of requests/second 10s of billions of records Initially SimpleDB
  26. 26. Why Cassandra? • NoSQL at scale • Open source • Multi-region • Multi-directional • CAP Choices – Availability – Partition tolerance – Eventual consistency* Scalable, Durable, Global
  27. 27. Single Region, Multiple AZs 1. Client writes to any node Zone A Zone B Zone C Zone B Zone C Client Zone A Local Quorum (Typical) 100ms 2. Coordinator replicates to nodes 3. Nodes ack to coordinator 4. Coordinator acks to client 5. Write to commit log
  28. 28. Not quite fast enough
  29. 29. Cache
  30. 30. Ephemeral Volatile Cache (EVCache) Clustered memcached optimized for AWS
  31. 31. EVCache Server Memcached Prana (Sidecar) Monitoring & Other Processes Eureka Client Application Client Library EVCache Client - Shards, consistent hashing - TTLs & LRU EVCache Architecture
  32. 32. Zone A Client Application Client Library EVCache Client Zone B Client Application Client Library Zone C Client Application Client Library . . .. . . Reads EVCache Client EVCache Client . . . Client Application Client Library EVCache Client
  33. 33. Zone A Zone B Zone C . . . Writes Client Application Client Library EVCache Client Client Application Client Library EVCache Client Client Application Client Library EVCache Client . . . . . .
  34. 34. 1. Read from cache S S S S. . . DB DB DB DB. . . Fronting Microservices . . . Client Application Client Library EVCache Client Service Client 2. On cache miss call service 3. Service calls DB & responds 4. Service updates cache
  35. 35. Linear Scaling • 25 million requests/sec • 2 trillion requests per day globally • Hundreds of billions of objects • Tens of thousands of memcached instances • Milliseconds of latency per request
  36. 36. US-East-1 Canada International Expansion 2011 US Latin America
  37. 37. US-East-1 EU-West-1 Cloud Islands 2012
  38. 38. Traffic
  39. 39. US-East-1 EU-West-1 UK/IE, Nordics, Netherlands Latin America DNS Geo Mapping Canada US
  40. 40. US-East-1US-West-2 EU-West-1 Active-Active 2013 - 2014 Survive a large-scale regional service outage
  41. 41. Active-Active Data Replication
  42. 42. Region BRegion A Zone A Zone B Zone C Zone B Zone C Zone A Zone A Zone B Zone C Zone C Client Client Zone AZone B Multi-Region Writes 500ms Bi-directional Nightly compare & repair Local Quorum (Typical)
  43. 43. EVCache Replication Repl Writer Amazon SQS Application Client EVCache Replication Repl Writer 1. Set or delete 2. Send metadata 3. Poll msg 6. Set or delete Application Client Amazon SQS 7. Read EVCache Cross-Region Replication Region BRegion A
  44. 44. Active-Active Traffic Management
  45. 45. ELB US-West-2 ELB US-East-1 ELB EU-West-1 DNS api-global.netflix.com UltraDNS Amazon Route 53
  46. 46. DNS api-global.netflix.com • Remove state from geo bucket ELB US-West-2 ELB US-East-1 ELB EU-West-1
  47. 47. api-global.netflix.com DNS • Remove state from geo bucket • Add state to geo bucket • Log event • For each endpoint ELB US-West-2 ELB US-East-1 ELB EU-West-1
  48. 48. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  49. 49. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  50. 50. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  51. 51. Active-Active Failover US-EAST-1, US-WEST-2
  52. 52. The Internet DNS Corse-grained routing Zuul Proxy Back Channel Fine-grained routing Vizceral
  53. 53. Taking it Global
  54. 54. January 6, 2016
  55. 55. Geographic Ubiquity
  56. 56. Before Global English Spanish (Latin American) Portuguese (Brazilian) Dutch French German Japanese Spanish (Castilian) Italian Portuguese (European) Global Chinese Korean Arabic Polish Turkish Language Ubiquity
  57. 57. March 18, 2016 Daredevil Season 2 All episodes, all devices, all countries Simultaneously Content Ubiquity
  58. 58. Ubiquitous, Resilient Architecture
  59. 59. US-East-1US-West-2 EU-West-1 Reliably and efficiently serve any customer from any region Netflix Global 2015
  60. 60. US-East-1US-West-2 EU-West-1
  61. 61. US-East-1US-West-2 EU-West-1
  62. 62. US-East-1US-West-2 EU-West-1
  63. 63. Ubiquitous Data
  64. 64. EVCache Replication Repl Writer Application Client Kafka Kafka • Low latency • Multiple readers • > 1M replications/sec
  65. 65. US Ring US Ring EU Ring EU-West-1US-East-1 1. Extend US ring to EU region & run repairs Client 2. Dual Write 3. Forklift
  66. 66. EU-West-1US-West-2 Global Ring Global Ring US-East-1 Global Ring
  67. 67. Ubiquitous Traffic Management
  68. 68. us-east-1-na • East US • East CA • MX us-west-2 • APAC • West US • West CA eu-west-1 • Europe • Mid East • Africa us-east-1-sa • LatAm • Not MX Virtual DNS Regions
  69. 69. Vizceral Evacuation of US-EAST-1 Traffic 1. Proxy US-East-1 traffic to EU-West-1 and US-West-2 via Zuul 2. DNS flip when capacity at scale Restoration of Steady State 1. DNS flip to default configuration, active proxying 2. Reduce proxying to 0%; restored
  70. 70. Multi-Region Failover
  71. 71. January 6, 2016
  72. 72. “Going global is just like having a baby.” - Reed Hastings, Netflix CEO
  73. 73. What’s next? • Global latency routing • ML-based monitoring • Fast Failover • Improved Capacity Utilization • Integrate DB & caching • Automated Chaos Experiments #NetflixEverywhere
  74. 74. Takeaways
  75. 75. Never fail the same way twice Christmas Eve 2012 Today
  76. 76. Know your resiliency patterns Pattern Properties DC SPoF, infrastructure heavy lifting Cloud (one region) Multiple DCs, one control plane Islands Regional containment Isthmus Regional ELB bypass Active-active Regional failover Global Ubiquity, resiliency, efficiency
  77. 77. Invest in architectural pillars Microservices Database Caching Traffic
  78. 78. #NetflixEverywhere Think globally, act globally
  79. 79. netflix.github.io
  80. 80. netflix.github.io
  81. 81. Netflix Tech Blog techblog.netflix.com
  82. 82. Thank you!
  83. 83. Resources • Netflix Performance and Reliability YouTube Channel • http://techblog.netflix.com • Netflix OSS @netflixOSS
  84. 84. Remember to complete your evaluations!

×