How Netflix Leverages Multiple Regions to Increase
Availability: Isthmus and Active-Active Case Study
Ruslan Meshenberg
November 13, 2013

Netflix sought to increase availability beyond the capabilities of a single region. How is that possible, you ask? We walk you through the journey that Netflix underwent to redesign and operate our service to achieve this lofty goal. Using the principles of isolation and redundancy, our destination is a fully redundant active-active deployment architecture where end users can be served out of multiple AWS regions. If one region fails, another can quickly take its place. Along the way, we'll explore our Isthmus architecture, a stepping stone toward full active-active deployment where only the edge services are redundantly deployed to multiple regions. We'll cover real-world challenges we had to overcome, like near-real-time data replication, and the operational tools and best practices we needed to develop to make it a success. Discover a whole new breed of monkeys we created to test multiregional resiliency scenarios.
Failure
Assumptions
[2×2 chart; axes: Speed (slowly changing → rapid change) and Scale (small → large)]
• Telcos – slowly changing, large scale: hardware will fail
• Enterprise IT – slowly changing, small scale: everything works
• Startups – rapid change, small scale: software will fail
• Web scale – rapid change, large scale: everything is broken
Incidents – Impact and Mitigation
• PR (public relations / media impact): X incidents; Y incidents mitigated by active-active and game day practicing
• CS (high customer service calls): XX incidents; YY incidents mitigated by better tools and practices
• Metrics impact – feature disable (affects A/B test results): XXX incidents; YYY incidents mitigated by better data tagging
• No impact – fast retry or automated failover: XXXX incidents
Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey (see the sketch below)
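Chaos Monkey's job is deliberately simple: pick a random instance from a production group and terminate it, forcing services to prove they survive instance loss. A minimal sketch of that idea against the AWS SDK for Java follows; the candidate-selection and scheduling logic are assumptions, not the Simian Army implementation (which lives on the Netflix GitHub site).

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.List;
import java.util.Random;

/** Minimal chaos-monkey-style sketch: terminate one randomly chosen
 *  instance from a candidate group. Candidate discovery and scheduling
 *  (business hours only, opt-outs) are left out as assumptions. */
public class MiniChaosMonkey {
    private final AmazonEC2 ec2 = new AmazonEC2Client(); // default credential chain
    private final Random random = new Random();

    public void terminateRandomInstance(List<String> candidateInstanceIds) {
        if (candidateInstanceIds.isEmpty()) {
            return; // nothing to break today
        }
        String victim = candidateInstanceIds.get(random.nextInt(candidateInstanceIds.size()));
        // The termination is the test: the service must keep serving without it.
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
        System.out.println("Chaos: terminated " + victim);
    }
}
```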
Does a Zone Fail?
• Rarely, but happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
Does a Region Fail?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely, a region-wide configuration issue
• Test with Chaos Kong
Everything Fails… Eventually
• The good news is you can do something about it
• Keep your services running by embracing isolation and redundancy
Cloud Native
A New Engineering Challenge
Construct a highly agile and highly available service from ephemeral and assumed broken components
Isolation
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect functionality / operations
Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability Zones and regions (see the sketch below)
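One concrete form of zone redundancy is a single Auto Scaling group stretched across three Availability Zones, so losing any one zone still leaves two-thirds of capacity in service. A sketch with the AWS SDK for Java follows; the group name, sizes, and launch configuration are illustrative assumptions.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;

/** Sketch: one Auto Scaling group stretched across three Availability
 *  Zones, so a single-zone failure leaves two-thirds of capacity up.
 *  Names and sizes are illustrative assumptions. */
public class ZoneRedundantGroup {
    public static void main(String[] args) {
        AmazonAutoScaling autoscaling = new AmazonAutoScalingClient(); // default credentials
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                .withAutoScalingGroupName("api-us-east-1-v001")
                .withLaunchConfigurationName("api-launch-config") // assumed to exist already
                .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
                .withMinSize(6)    // at least two instances per zone
                .withMaxSize(30)); // headroom for zone evacuation and traffic spikes
        // Repeat per region to get the regional redundancy the deck builds toward.
    }
}
```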
History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
• “The ELB issue affected less than 7% of running ELBs and prevented other ELBs from scaling.”
Isthmus – Normal Operation
[Diagram: client traffic enters via the US-East ELB, or via the US-West-2 ELB where Zuul infrastructure tunnels requests back to the US-East backend; Cassandra replicas span Zones A, B, and C in US-East]
Isthmus – Failover
[Diagram: with the US-East ELB impaired, client traffic shifts to the US-West-2 ELB; Zuul tunnels the requests back to the US-East zones, whose Cassandra replicas continue serving]
Isthmus
Zuul – Overview
[Diagram: Zuul sits in front of multiple Elastic Load Balancing endpoints, routing each request to the appropriate backend pool]
Zuul – Details
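Zuul is Netflix's edge routing layer: every request passes through a chain of filters that can inspect, decorate, and route it. Below is a minimal sketch of a Zuul 1.x route-type filter; the health check and host names are illustrative assumptions, not Netflix's production routing rules.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;
import java.net.URL;

/** Sketch of a Zuul 1.x "route"-type filter that pins each request to
 *  a backend region. The health check and host names are illustrative
 *  assumptions. */
public class RegionRoutingFilter extends ZuulFilter {
    @Override public String filterType()    { return "route"; }
    @Override public int filterOrder()      { return 10; }
    @Override public boolean shouldFilter() { return true; }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        // Hypothetical policy: prefer us-east-1, fall back to us-west-2.
        String region = isEastHealthy() ? "us-east-1" : "us-west-2";
        try {
            ctx.setRouteHost(new URL("https://api-" + region + ".example.com"));
        } catch (java.net.MalformedURLException e) {
            throw new RuntimeException(e);
        }
        return null; // Zuul ignores the return value
    }

    private boolean isEastHealthy() {
        return true; // placeholder for a real health signal
    }
}
```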
Denominator – Abstracting the DNS Layer
[Diagram: Denominator provides one interface over Amazon Route 53, DynECT DNS, and UltraDNS; DNS entries point at regional load balancers (ELBs) in each region, fronting Zones A, B, and C with Cassandra replicas]
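Because each DNS vendor exposes a different API, Denominator wraps them behind one interface so a failover can be scripted once and run against whichever provider is authoritative. The sketch below illustrates that pattern with a hypothetical DnsProvider interface rather than Denominator's actual API; all names and the TTL are assumptions.

```java
import java.util.Arrays;
import java.util.List;

/** Failover by repointing the service CNAME at another region's load
 *  balancer on every configured provider. DnsProvider is a hypothetical
 *  abstraction in the spirit of Denominator, not its real API. */
public class DnsFailover {

    /** One interface, many vendor implementations (Route 53, DynECT, UltraDNS...). */
    interface DnsProvider {
        void updateCname(String fqdn, String target, int ttlSeconds);
    }

    private final List<DnsProvider> providers;

    public DnsFailover(List<DnsProvider> providers) {
        this.providers = providers;
    }

    public void failoverTo(String regionalElbDnsName) {
        for (DnsProvider dns : providers) {
            // A short TTL keeps resolvers from caching the failed region for long.
            dns.updateCname("api.example.com", regionalElbDnsName, 60);
        }
    }

    public static void main(String[] args) {
        DnsProvider logOnly = (fqdn, target, ttl) ->
                System.out.println(fqdn + " -> " + target + " (ttl " + ttl + "s)");
        new DnsFailover(Arrays.asList(logOnly))
                .failoverTo("us-west-2-elb.example.amazonaws.com"); // hypothetical name
    }
}
```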
Isthmus – Only for Elastic Load Balancing Failures

• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
Active-Active – Full Regional Resiliency
[Diagram: regional load balancers in both regions serve traffic; each region runs Zones A, B, and C with Cassandra replicas, kept in sync by cross-region replication]
Active-Active – Failover
[Diagram: when one region fails, DNS shifts its traffic to the surviving region's load balancers; Zones A, B, and C there absorb the full load against their Cassandra replicas]
Active-Active Architecture
Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
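Engineered eventual consistency means each region writes locally and lets Cassandra replicate asynchronously across regions, rather than hoping concurrent writers never collide. A sketch of that setup with the DataStax Java driver (2.x-era API) follows; the keyspace, table, datacenter names, and replication factors are illustrative assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

/** Sketch: a keyspace replicated across two regions with writes
 *  acknowledged by a single local replica (CL.ONE), as in the
 *  benchmark slide. Names and replication factors are assumptions. */
public class MultiRegionCassandra {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // a local-region node
                .build();
        Session session = cluster.connect();

        // Three replicas per region; cross-region replication is asynchronous.
        session.execute("CREATE KEYSPACE IF NOT EXISTS viewing_history WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west-2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS viewing_history.events "
                + "(user_id bigint, ts timeuuid, PRIMARY KEY (user_id, ts))");

        // CL.ONE: return once one local replica acks; the other region
        // converges eventually via replication, not via hope.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO viewing_history.events (user_id, ts) VALUES (42, now())");
        write.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(write);

        cluster.close();
    }
}
```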
Highly Available NoSQL Storage
A highly scalable, available, and
durable deployment pattern based on
Apache Cassandra
Benchmarking Global Cassandra
Write intensive test of cross-region replication capacity
• 16 × hi1.4xlarge SSD nodes per zone = 96 total; 192 TB of SSD in six locations, up and running Cassandra in 20 minutes
• Test load: 1 million writes, CL.ONE (wait for one replica to ack), in US-East-1 (Virginia)
• Validation load: 1 million reads after 500 ms, CL.ONE, with no data loss, in US-West-2 (Oregon)
• Interregion traffic up to 9 Gbit/s, 83 ms; 18 TB backups from S3
Propagating EVCache Invalidations
1. App server sets data in EVCache (US-East-1).
2. EVCache client writes the event to the EVCACHE_REGION_REPLICATION SQS queue and to the EVCache replication metadata.
3. Drainer reads events from SQS in batches.
4. Drainer calls the Writer with key, write time, TTL, and value after checking that this is the latest event for the key in the current batch; the call goes cross-region through ELB over HTTPS.
5. Writer checks the write time to ensure this is the latest operation for the key.
6. Writer deletes the value for the key in the US-West-2 EVCache.
7. Writer returns the keys that were successful.
8. Drainer deletes the successful keys from SQS.
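The drainer/writer pair above is essentially a batched consume, dedupe, apply, acknowledge loop over SQS. A compressed sketch of the drainer side using the AWS SDK for Java follows; the event format, key parsing, and writer client are placeholders standing in for the EVCache replication service's internals.

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.HashMap;
import java.util.Map;

/** Sketch of the drainer loop (steps 3, 4, and 8 above). The event
 *  body format, key extraction, and RemoteWriter are placeholders. */
public class EvCacheReplicationDrainer {
    private final AmazonSQS sqs = new AmazonSQSClient(); // default credential chain
    private final String queueUrl =
            "https://sqs.us-east-1.amazonaws.com/123456789012/EVCACHE_REGION_REPLICATION"; // hypothetical

    /** Placeholder for the cross-region writer endpoint (called via ELB over HTTPS). */
    interface RemoteWriter { boolean replicate(String eventBody); }

    public void drainOnce(RemoteWriter writer) {
        // Step 3: read invalidation events from SQS in batches.
        ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl).withMaxNumberOfMessages(10);
        Map<String, Message> latestPerKey = new HashMap<String, Message>();
        for (Message m : sqs.receiveMessage(req).getMessages()) {
            // Step 4 (first half): keep only the latest event per key in this batch.
            latestPerKey.put(keyOf(m), m); // assumes per-key events arrive oldest first
        }
        for (Message m : latestPerKey.values()) {
            // Step 4 (second half): send key, write time, TTL, and value to the
            // remote Writer; steps 5-7 happen on the writer side.
            if (writer.replicate(m.getBody())) {
                // Step 8: delete successfully applied events from SQS.
                sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
            }
        }
    }

    private String keyOf(Message m) {
        return m.getBody().split(":", 2)[0]; // placeholder: "key:payload" encoding
    }
}
```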
Archaius – Region-isolated Configuration
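Archaius serves each region its own dynamic property values, so an operator can flip a flag in one region without touching the others. A small sketch follows, assuming region-suffixed cascaded property files; the file and property names are hypothetical.

```java
import com.netflix.config.ConfigurationManager;
import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

/** Sketch: region-isolated dynamic configuration with Archaius. The
 *  property and file names are hypothetical; the cascade (base file,
 *  then region-specific overrides) is what isolates regions. */
public class RegionConfig {
    public static void main(String[] args) throws Exception {
        // Hypothetically loads app.properties, then an override such as
        // app-us-west-2.properties for the region in the deployment context.
        ConfigurationManager.getDeploymentContext().setDeploymentRegion("us-west-2");
        ConfigurationManager.loadCascadedPropertiesFromResources("app");

        DynamicBooleanProperty trafficEnabled = DynamicPropertyFactory.getInstance()
                .getBooleanProperty("app.traffic.enabled", true);

        // Re-evaluated on every call: flipping the flag in this region's
        // config takes effect at runtime, without touching other regions.
        if (trafficEnabled.get()) {
            System.out.println("Serving traffic in this region");
        }
    }
}
```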
Running Isthmus and Active-Active
Multiregional Monkeys
• Detect failure to deploy
• Differences in configuration
• Resource differences
Multiregional Monitoring and Alerting
• Per region metrics
• Global aggregation
• Anomaly detection
Failover and Fallback
• DNS (denominator) changes
• For fallback, ensure data consistency
• Some challenges
– Cold cache
– Autoscaling
• Automate, automate, automate
Validating the Whole Thing Works
Dev-Ops in N Regions
• Best practices: avoiding peak times for deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery (toy canary check below)
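An automated canary boils down to a statistical comparison: route a sliver of traffic to the new build and fail the deployment if its key metrics look worse than the baseline fleet's. A toy version of the decision rule follows; the 20% tolerance and minimum sample size are assumptions, not Netflix's canary analysis.

```java
/** Toy canary gate: compare the canary's error rate with the baseline
 *  fleet's and signal a rollback when it is meaningfully worse. The
 *  20% tolerance and 1,000-request floor are assumptions. */
public class CanaryCheck {

    public static boolean canaryHealthy(long canaryErrors, long canaryRequests,
                                        long baselineErrors, long baselineRequests) {
        if (canaryRequests < 1000 || baselineRequests == 0) {
            return true; // not enough signal yet; keep observing
        }
        double canaryRate = (double) canaryErrors / canaryRequests;
        double baselineRate = (double) baselineErrors / baselineRequests;
        // Unhealthy if the canary errors more than 20% above baseline.
        return canaryRate <= baselineRate * 1.2;
    }

    public static void main(String[] args) {
        System.out.println(canaryHealthy(10, 5000, 40, 20000)); // true: 0.20% vs 0.20% baseline
        System.out.println(canaryHealthy(30, 5000, 40, 20000)); // false: 0.60% vs 0.20% baseline
    }
}
```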
Hyperscale Architecture
[Diagram: DNS automation drives Amazon Route 53, DynECT DNS, and UltraDNS; each region's load balancers front Zones A, B, and C with Cassandra replicas replicating across regions]
Does It Work?
Building Blocks Available on Netflix Github site
Topic | Session # | When
What an Enterprise Can Learn from Netflix, a Cloud-native Company | ENT203 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Maximizing Audience Engagement in Media Delivery | MED303 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Scaling your Analytics with Amazon Elastic MapReduce | BDT301 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Automated Media Workflows in the Cloud | MED304 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale | BDT302 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Encryption and Key Management in AWS | SEC304 | Friday, Nov 15, 9:00 AM - 10:00 AM
Your Linux AMI: Optimization and Performance | CPN302 | Friday, Nov 15, 11:30 AM - 12:30 PM
Takeaways
Embrace isolation and redundancy for availability
NetflixOSS helps everyone to become cloud native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
@rusmeshenberg @NetflixOSS
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
