ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012


This talk describes a set of architectural patterns that support highly available services that are also scalable, low cost, low latency and allow agile continuous deployment development practices. The building blocks for these patterns have been released at netflix.github.com as open source projects for others to use.

Transcript of "ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012"

  1. Source: http://ir.netflix.com
  2. (I’m skipping all the cloud intro etc. Netflix runs in the cloud; if you hadn’t figured that out already, you aren’t paying attention and should go to the other Netflix talks at AWS re:Invent or read slideshare.net/netflix.)
  3. In production at Netflix: 2009, 2009, 2010, 2010, 2010, 2010, 2010, 2011
  4. Architecture applies to any cloud or datacenter. Illustrated today using real-world examples.
  5. Diagram labels: Customer Device (PC, PS3, TV…); Consumer Electronics; AWS Cloud Services; CDN Edge Locations. Browse: Web Site or Discovery API, User Data, Personalization. Play: Streaming API, DRM, QoS Logging. Watch: OpenConnect CDN Boxes, CDN Management and Steering, Content Encoding.
  6. Each icon is three to a few hundred instances across three AWS zones. Diagram labels: Cassandra, memcached, Web service, S3 bucket, Start Here, Personalization movie group chooser.
  7. Deployed in Three Balanced Availability Zones. Diagram labels: Load Balancers; Zone A, Zone B, Zone C, each with Cassandra and Evcache Replicas.
  8. Triple Replicated Persistence. Diagram labels: Load Balancers; Zone A, Zone B, Zone C, each with Cassandra and Evcache Replicas.
  9. Isolated Regions. Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  10. Failure Mode | Probability | Mitigation Plan
      Application Failure | High | Automatic degraded response
      AWS Region Failure | Low | Wait for region to recover
      AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
      Datacenter Failure | Medium | Migrate more functions to cloud
      Data store failure | Low | Restore from S3 backups
      S3 failure | Low | Restore from remote archive
  11. Run what you wrote. Rapid detection. Rapid response.
  12. http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
  13. http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
  14. http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
  15. http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html Diagram labels: Eureka (services metadata); AWS (instances, ASGs, etc.); AppDynamics (request flow); Edda; Monkeys.
  16. http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
  17. Classify and name the types of things that might go wrong in the platform or infrastructure
  18. Zone Network Outage. Zone Power Outage. Zone Dependent Service Outage. Dependent Service could be @NetflixOSS platform or underlying infrastructure. Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  19. Regional Network Outage. Control Plane Overload. Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  20. Cascading Capacity Overload. Capacity demand migrates to services in another zone that don’t scale up fast enough to take the load. Migrating demand across regions may just spread the problem further… Platform and Infrastructure Software Bugs and Global Configuration Errors: “Oops…” Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  21. Hardening the cloud. Lessons Learned at Scale. Why Netflix Stays Up (Mostly)
  22. http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  23. http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
  24. http://aws.amazon.com/message/67457/ http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
  25. Zone Power Outage. @NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error. Applications not using zone-aware routing kept trying to talk to dead instances and timing out. Effect: higher latency and errors. Mitigation: fixed configuration, and made zone-aware routing the default. Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  26. Zone Enable DNS Command Queue; Per-Zone Control Plane Command Queues. Diagram labels: US-East Load Balancers and EU-West Load Balancers; in each region, Zone A, Zone B, Zone C, each with Cassandra Replicas.
  27. A highly scalable, available and durable deployment pattern
  28. Single function Cassandra Cluster, Managed by Priam, Between 6 and 72 nodes. Stateless Data Access REST Service with Astyanax Cassandra Client. Many Different Single-Function REST Clients. Optional Datacenter Update Flow. Appdynamics Service Flow Visualization. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones.
  29. Linux Base AMI (CentOS or Ubuntu). Optional Apache frontend, memcached, non-java apps. Java (JDK 6 or 7). AppDynamics appagent monitoring. Tomcat: Application war file, base servlet, platform, client interface jars, Astyanax. Healthcheck, status servlets, JMX interface, Servo autoscale. Monitoring: AppDynamics machineagent, Epic/Atlas. Log rotation to S3. GC and thread dump logging.
  30. http://github.com/netflix
  31. Linux Base AMI (CentOS or Ubuntu). Tomcat and Priam on JDK: Healthcheck, Status. Java (JDK 7). AppDynamics appagent monitoring. Cassandra Server. Monitoring: AppDynamics machineagent, Epic/Atlas. Local Ephemeral Disk Space: 2TB of SSD or 1.6TB disk holding Commit log and SSTables. GC and thread dump logging.
  32. http://github.com/netflix
  33. Diagram labels: Cassandra (repeated once per node), S3 Backup, Archive.
  34. @NetflixOSS
  35. http://techblog.netflix.com
  36. Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon. Projects: Priam (Cassandra as a Service); Exhibitor (Zookeeper as a Service); Servo and Autoscaling Scripts; Astyanax (Cassandra client for Java); Curator (Zookeeper Patterns); Honu (Log4j streaming to Hadoop); CassJMeter (Cassandra test suite); EVCache (Memcached as a Service); Circuit Breaker - Hystrix (Robust service pattern); Cassandra Multi-region EC2 datastore support; Eureka / Discovery (Service Directory); Asgard (AutoScaleGroup based AWS console); Aegisthus (Hadoop ETL for Cassandra); Archaius (Dynamics Properties Service); Chaos Monkey (Robustness verification); Edda (Queryable config history); Explorers; Latency Monkey (Server-side latency/error injection); Governator (Library lifecycle and dependency injection); Janitor Monkey; Odin (Workflow orchestration); REST Client + mid-tier LB; Bakeries and AMI; Blitz4j (Async logging); Configuration REST endpoints; Build dynaslaves.
  37. http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft
  38. We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
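
Slide 25 credits zone-aware routing with keeping callers away from dead instances that the service directory still listed as live. The following is a minimal Python sketch of that idea, not the Eureka or Netflix client implementation: the `Instance` record, the `choose_instance` function, and the host and zone names are all hypothetical, and a real client would load-balance across candidates rather than pick one deterministically.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical service-directory entry (not the Eureka data model)."""
    host: str
    zone: str
    up: bool  # what the directory believes, which may be stale


def choose_instance(instances: List[Instance], local_zone: str) -> Optional[Instance]:
    """Prefer an instance in the caller's own zone; fall back to other zones."""
    live = [i for i in instances if i.up]
    same_zone = [i for i in live if i.zone == local_zone]
    candidates = same_zone or live
    # Deterministic pick keeps the sketch easy to follow.
    return min(candidates, key=lambda i: i.host) if candidates else None


# Assumed scenario: zone-a has lost power, but the directory has not
# marked its instances down (the configuration error on slide 25).
stale_registry = [
    Instance("a1", "zone-a", up=True),  # actually dead
    Instance("b1", "zone-b", up=True),
    Instance("c1", "zone-c", up=True),
]

# A caller in zone-b stays inside zone-b and never contacts the dead host.
picked = choose_instance(stale_registry, "zone-b")
print(picked.host)
```

Without the same-zone preference, the caller in zone-b would treat all three entries as equal candidates and could be routed to the dead `a1`, which is the timeout behaviour the slide describes.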
