ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012
Source: http://ir.netflix.com
(I’m skipping all the cloud intro etc. Netflix runs in the cloud; if you hadn’t figured that out already, you aren’t paying attention and should go to the other Netflix talks at AWS re:Invent or read slideshare.net/netflix)
In production at Netflix
[Timeline figure: item labels not recoverable; year markers shown are 2009 (two items), 2010 (five items) and 2011 (one item)]
Architecture applies to any cloud or datacenter
  Illustrated today using real world examples
[Diagram: Netflix streaming architecture]
Side labels: Consumer Electronics, AWS Cloud Services, CDN Edge Locations
Customer Device (PC, PS3, TV…)
  Browse: Web Site or Discovery API
  Play: Streaming API
  Watch: OpenConnect CDN Boxes
Services: User Data, Personalization, DRM, QoS Logging, CDN Management and Steering, Content Encoding
[Diagram: service dependency map; labels: Start Here, Personalization movie group chooser]
Each icon is three to a few hundred instances across three AWS zones
Icon types: Cassandra, memcached, Web service, S3 bucket
Deployed in Three Balanced Availability Zones
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each holding Cassandra and Evcache Replicas]
Triple Replicated Persistence
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each holding Cassandra and Evcache Replicas]
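With one replica in each of three zones, a single zone can be lost because the two survivors still form a majority. The sketch below is only an illustration of that arithmetic, assuming quorum reads and writes (the slides do not state the consistency level in use). `quorum` and `survives_zone_loss` are hypothetical helper names, not part of Cassandra or any Netflix library.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def survives_zone_loss(replicas_per_zone: dict[str, int], lost_zone: str) -> bool:
    """True if the replicas left after losing one zone still form a quorum."""
    total = sum(replicas_per_zone.values())
    remaining = total - replicas_per_zone.get(lost_zone, 0)
    return remaining >= quorum(total)


# Assumed layout: one replica of each row per zone, as in the diagram.
layout = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
print(quorum(3))                             # 2
print(survives_zone_loss(layout, "zone-b"))  # True
```

By the same arithmetic, placing two of the three replicas in one zone would not survive the loss of that zone, which is one reason to keep the zones balanced.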
Isolated Regions
[Diagram: US-East Load Balancers and EU-West Load Balancers, each region with Zone A, Zone B and Zone C holding Cassandra Replicas]
Failure Mode          Probability   Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
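The "continue to run on 2 out of 3 zones" mitigation only works if the surviving zones have room for the displaced traffic. A rough sketch of that arithmetic follows, assuming demand is spread evenly across zones and redistributes evenly after a failure. Both assumptions and both function names are illustrative and do not come from the talk.

```python
def utilization_after_zone_loss(zones: int, utilization: float, lost: int = 1) -> float:
    """Per-zone utilization once the traffic of `lost` zones moves to the survivors."""
    if not 0 < lost < zones:
        raise ValueError("lost must be between 1 and zones - 1")
    return utilization * zones / (zones - lost)


def max_safe_utilization(zones: int, lost: int = 1, ceiling: float = 1.0) -> float:
    """Highest steady-state utilization that keeps survivors at or below `ceiling`."""
    if not 0 < lost < zones:
        raise ValueError("lost must be between 1 and zones - 1")
    return ceiling * (zones - lost) / zones


# Three zones at 50% utilization: losing one pushes the other two to 75%.
print(utilization_after_zone_loss(zones=3, utilization=0.5))
# Upper bound on steady-state utilization if one of three zones may vanish.
print(round(max_safe_utilization(zones=3), 3))
```

Under these assumptions a three-zone deployment has to stay below roughly two thirds utilization per zone to absorb the loss of one zone without scaling up first.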
Run what you wrote
 Rapid Detection
 Rapid Response
http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
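The "automatic degraded response" mitigation for application failures is commonly built as a circuit breaker with a fallback. The following is a minimal sketch of that general pattern, not the Hystrix API: the `CircuitBreaker` class, its parameters and the example service names are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Return a fallback instead of calling a dependency that keeps failing."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()           # open: skip the dependency entirely
            # Reset window elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()               # degraded response instead of an error
        self.failures = 0
        return result


def personalized_rows() -> list[str]:
    # Hypothetical dependency that is currently unavailable.
    raise TimeoutError("personalization service timed out")


def popular_rows() -> list[str]:
    # Hypothetical static fallback content.
    return ["Popular on Netflix"]


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
print(breaker.call(personalized_rows, popular_rows))
```

The caller always gets an answer; once the failure threshold is reached, the failing dependency stops receiving traffic until the reset window has passed.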



[Diagram labels: Eureka (services metadata); AWS (instances, ASGs, etc.); AppDynamics (request flow); Edda; Monkeys]
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Classify and name the types of things that
might go wrong in the platform or infrastructure
[Diagram: US-East and EU-West Load Balancers, each region with Cassandra Replicas in Zones A, B and C, annotated with zone-level failures]
Zone Network Outage
Zone Power Outage
Zone Dependent Service Outage
Dependent Service could be @NetflixOSS platform or underlying infrastructure
[Diagram: US-East and EU-West Load Balancers, each region with Cassandra Replicas in Zones A, B and C, annotated with region-level failures]
Regional Network Outage
Control Plane Overload
[Diagram: US-East and EU-West Load Balancers, each region with Cassandra Replicas in Zones A, B and C, annotated with the failures below]
Cascading Capacity Overload: Capacity demand migrates to services in another zone that don’t scale up fast enough to take the load
Platform and Infrastructure Software Bugs and Global Configuration Errors: “Oops…”
Migrating demand across regions may just spread the problem further…
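The cascading overload case can be illustrated with a toy model: when one zone is lost, its demand lands on the survivors, and they stay overloaded until scale-up catches up. The function below is a sketch under simplifying assumptions (equal zones, even redistribution, a fixed capacity increment per time step); the name and parameters are hypothetical.

```python
def steps_overloaded(capacity: float, demand: float, zones: int = 3,
                     scale_per_step: float = 0.0, max_steps: int = 60) -> int:
    """Count time steps a surviving zone spends over capacity after one zone is lost.

    capacity and demand are per zone, in the same unit (for example requests/s).
    """
    if zones < 2:
        raise ValueError("need at least two zones to fail over")
    shifted = demand * zones / (zones - 1)   # per-zone demand after the failure
    steps = 0
    while capacity < shifted and steps < max_steps:
        capacity += scale_per_step           # autoscaling adds capacity each step
        steps += 1
    return steps


# Enough headroom up front: no overloaded steps.
print(steps_overloaded(capacity=130, demand=80))
# Tight headroom and slow scale-up: overloaded until capacity reaches 120.
print(steps_overloaded(capacity=100, demand=80, scale_per_step=5))
```

In this model the overload window shrinks either by holding more spare capacity before the failure or by scaling faster after it. A window that never closes (no scaling) hits `max_steps`.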
Hardening the cloud
 Lessons Learned at Scale
Why Netflix Stays Up (Mostly)
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
http://aws.amazon.com/message/67457/
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
@NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error
[Diagram: US-East and EU-West Load Balancers, each region with Cassandra Replicas in Zones A, B and C; annotated "Zone Power Outage"]
Applications not using zone-aware routing kept trying to talk to dead instances and timing out
Effect: higher latency and errors
Mitigation: Fixed configuration, and made zone-aware routing the default
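Zone-aware routing can be sketched as "prefer healthy instances in the caller's own zone, and leave the zone only when none exist". The code below is an illustrative assumption about what such a policy looks like. It is not the Eureka client API, and `Instance` and `choose_instance` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    zone: str
    up: bool = True   # status as reported by the service directory


def choose_instance(instances: list[Instance], local_zone: str, rng=random) -> Instance:
    """Pick a healthy instance, preferring the caller's own zone."""
    healthy = [i for i in instances if i.up]
    if not healthy:
        raise LookupError("no healthy instances registered")
    same_zone = [i for i in healthy if i.zone == local_zone]
    # Cross zones only when the local zone has nothing healthy to offer.
    return rng.choice(same_zone or healthy)


# Zone A has lost power, but the directory still lists its instances as up.
registry = [
    Instance("a1", "zone-a"),
    Instance("a2", "zone-a"),
    Instance("b1", "zone-b"),
    Instance("c1", "zone-c", up=False),
]
print(choose_instance(registry, "zone-b").host)  # b1
```

With this policy, a stale directory entry for a dead zone hurts only callers that have no healthy local instance. Callers in the surviving zones keep talking to their own zone and never reach the dead instances.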
[Diagram: US-East and EU-West Load Balancers, each region with Cassandra Replicas in Zones A, B and C; labels: Zone Enable DNS, Command Queue, Per-Zone Control Plane Command Queues]
A highly scalable, available and durable deployment pattern
[Diagram: AppDynamics Service Flow Visualization]
Many Different Single-Function REST Clients
Stateless Data Access REST Service (Astyanax Cassandra Client)
Single function Cassandra Cluster (Managed by Priam; Between 6 and 72 nodes)
Optional Datacenter Update Flow
Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones
[Diagram: contents of a service instance]
Linux Base AMI (CentOS or Ubuntu)
Optional Apache frontend, memcached, non-java apps
Java (JDK 6 or 7)
  AppDynamics appagent monitoring
  GC and thread dump logging
Tomcat
  Application war file, base servlet, platform, client interface jars, Astyanax
  Healthcheck, status servlets, JMX interface, Servo autoscale
Monitoring: Log rotation to S3, AppDynamics machineagent, Epic/Atlas
http://github.com/netflix
[Diagram: contents of a Cassandra instance]
Linux Base AMI (CentOS or Ubuntu)
Tomcat and Priam on JDK: Healthcheck, Status
Java (JDK 7)
  AppDynamics appagent monitoring
  GC and thread dump logging
Cassandra Server
  Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
Monitoring: AppDynamics machineagent, Epic/Atlas
http://github.com/netflix
[Diagram: ring of Cassandra nodes, with labels S3 Backup and Archive]
@NetflixOSS
http://techblog.netflix.com
Legend (color coding not recoverable): Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon

Priam - Cassandra as a Service
Astyanax - Cassandra client for Java
CassJMeter - Cassandra test suite
Cassandra Multi-region EC2 datastore support
Aegisthus - Hadoop ETL for Cassandra
Explorers
Governator - Library lifecycle and dependency injection
Odin - Workflow orchestration
Blitz4j - Async logging

Exhibitor - Zookeeper as a Service
Curator - Zookeeper Patterns
EVCache - Memcached as a Service
Eureka / Discovery - Service Directory
Archaius - Dynamic Properties Service
Edda - Queryable config history
Server-side latency/error injection
REST Client + mid-tier LB
Configuration REST endpoints

Servo and Autoscaling Scripts
Honu - Log4j streaming to Hadoop
Circuit Breaker - Hystrix - Robust service pattern
Asgard - AutoScaleGroup based AWS console
Chaos Monkey - Robustness verification
Latency Monkey
Janitor Monkey
Bakeries and AMI
Build dynaslaves
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.

Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012

  • 3. (I’m skipping all the cloud intro etc. Netflix runs in the cloud, if you hadn’t figured that out already you aren’t paying attention and should go to the other Netflix talks at AWS Re:Invent or read slideshare.net/netflix)
  • 9. Architecture applies to any cloud or datacenter. Illustrated today using real-world examples.
  • 10. [Diagram] Legend: Consumer Electronics; AWS Cloud Services; CDN Edge Locations. Customer Device (PC, PS3, TV…) flows: Browse to Web Site or Discovery API; Play to Streaming API; Watch to OpenConnect CDN Boxes. Services: User Data; Personalization; DRM; QoS Logging; CDN Management and Steering; Content Encoding.
  • 11. Each icon is three to a few hundred instances across three AWS zones. [Diagram] Legend: Cassandra; memcached; Web service; S3 bucket. Labels: Start Here; Personalization movie group chooser.
  • 13. Deployed in Three Balanced Availability Zones. [Diagram] Load Balancers; Zone A, Zone B, and Zone C, each with Cassandra and Evcache Replicas.
  • 14. Triple Replicated Persistence. [Diagram] Load Balancers; Zone A, Zone B, and Zone C, each with Cassandra and Evcache Replicas.
  • 15. Isolated Regions. [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 16. Failure modes, probabilities, and mitigation plans:

        Failure Mode          Probability   Mitigation Plan
        Application Failure   High          Automatic degraded response
        AWS Region Failure    Low           Wait for region to recover
        AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
        Datacenter Failure    Medium        Migrate more functions to cloud
        Data store failure    Low           Restore from S3 backups
        S3 failure            Low           Restore from remote archive

  • 17. Run what you wrote. Rapid detection. Rapid response.
  • 21. http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html [Diagram] Eureka: Services metadata; AWS: Instances, ASGs, etc.; AppDynamics: Request flow; Edda; Monkeys.
  • 26. Classify and name the types of things that might go wrong in the platform or infrastructure
  • 27. Zone Network Outage; Zone Power Outage; Zone Dependent Service Outage. Dependent Service could be @NetflixOSS platform or underlying infrastructure. [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 29. Regional Network Outage; Control Plane Overload. [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 31. Cascading Capacity Overload. Capacity demand migrates to services in another zone that don’t scale up fast enough to take the load. Migrating demand across regions may just spread the problem further… Platform and Infrastructure Software Bugs and Global Configuration Errors: “Oops…” [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 33. Hardening the cloud: Lessons Learned at Scale. Why Netflix Stays Up (Mostly).
  • 38. Zone Power Outage. @NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error. Applications not using Zone-aware routing kept trying to talk to dead instances and timing out. Effect: higher latency and errors. Mitigation: Fixed configuration, and made zone aware routing the default. [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 40. Zone Enable DNS Command Queue Per-Zone Control Plane Command Queues. [Diagram] US-East Load Balancers and EU-West Load Balancers; each region has Zone A, Zone B, and Zone C, each with Cassandra Replicas.
  • 41. A highly scalable, available and durable deployment pattern
  • 42. Single function Cassandra Cluster, Managed by Priam, Between 6 and 72 nodes. Stateless Data Access REST Service; Astyanax Cassandra Client. Many Different Single-Function REST Clients. Optional Datacenter Update Flow. Appdynamics Service Flow Visualization. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones.
  • 43. Linux Base AMI (CentOS or Ubuntu); Optional Apache frontend, memcached, non-java apps; Java (JDK 6 or 7); Tomcat; Application war file, base servlet, platform, client interface jars, Astyanax; Healthcheck, status servlets, JMX interface, Servo autoscale; Monitoring; AppDynamics appagent monitoring; AppDynamics machineagent; Epic/Atlas; Log rotation to S3; GC and thread dump logging.
  • 46. Linux Base AMI (CentOS or Ubuntu); Java (JDK 7); Tomcat and Priam on JDK; Healthcheck, Status; Cassandra Server; Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables; Monitoring; AppDynamics appagent monitoring; AppDynamics machineagent; Epic/Atlas; GC and thread dump logging.
  • 48. [Diagram] Cassandra nodes; S3 Backup; Archive.
  • 51. Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon. Projects: Priam (Cassandra as a Service); Exhibitor (Zookeeper as a Service); Servo and Autoscaling Scripts; Astyanax (Cassandra client for Java); Curator (Zookeeper Patterns); Honu (Log4j streaming to Hadoop); CassJMeter (Cassandra test suite); EVCache (Memcached as a Service); Circuit Breaker - Hystrix (Robust service pattern); Cassandra Multi-region EC2 datastore support; Eureka / Discovery (Service Directory); Asgard (AutoScaleGroup based AWS console); Aegisthus (Hadoop ETL for Cassandra); Archaius (Dynamics Properties Service); Chaos Monkey (Robustness verification); Edda (Queryable config history); Explorers; Governator (Library lifecycle and dependency injection); Latency Monkey (Server-side latency/error injection); Janitor Monkey; Odin (Workflow orchestration); REST Client + mid-tier LB; Blitz4j (Async logging); Configuration REST endpoints; Bakeries and AMI; Build dynaslaves.
  • 53. http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft
  • 54. We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
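
Slides 16 and 38 describe continuing to run on two out of three zones and making zone-aware routing the default. The following is a minimal Python sketch of that idea under stated assumptions, not the Eureka or NetflixOSS client implementation: the `Instance` record, the `choose_instance` function, the zone names, and the host addresses are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    """Hypothetical registry entry; not the Eureka data model."""
    host: str
    zone: str
    healthy: bool = True


def choose_instance(instances: List[Instance], local_zone: str, request_id: int) -> Instance:
    """Pick a healthy instance, preferring the caller's own zone.

    Falls back to healthy instances in the other zones when the local
    zone has none, so traffic keeps flowing on the remaining zones.
    """
    healthy = [i for i in instances if i.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances in any zone")
    local = [i for i in healthy if i.zone == local_zone]
    pool = local or healthy
    # Simple round-robin over the chosen pool, keyed by a request counter.
    return pool[request_id % len(pool)]


# Example registry (made-up hosts): one instance per zone, zone "a" marked down.
registry = [
    Instance("10.0.1.10", "a", healthy=False),
    Instance("10.0.2.10", "b"),
    Instance("10.0.3.10", "c"),
]

# A caller in zone "b" stays in its own zone.
print(choose_instance(registry, "b", 0).zone)
# A caller in zone "a" is spread over the surviving zones.
print(choose_instance(registry, "a", 0).zone, choose_instance(registry, "a", 1).zone)
```

The sketch only works if the registry actually marks dead instances as unhealthy, which is the step slide 38 reports as having failed because of a configuration error.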