This talk describes a set of architectural patterns that support highly available services that are also scalable, low-cost, and low-latency, and that allow agile continuous-deployment development practices. The building blocks for these patterns have been released at netflix.github.com as open source projects for others to use.
3. (I’m skipping all the cloud intro, etc. Netflix runs in the cloud; if you hadn’t figured that out already you aren’t paying attention and should go to the other Netflix talks at AWS Re:Invent or read slideshare.net/netflix)
10. [Architecture diagram] Three tiers: Consumer Electronics (the Customer Device: PC, PS3, TV…), AWS Cloud Services, and CDN Edge Locations. Browse goes to the Web Site or Discovery API, backed by User Data and Personalization; Play goes to the Streaming API, backed by DRM, QoS Logging, and CDN Management and Steering; Watch streams from OpenConnect CDN Boxes at the edge, fed by Content Encoding.
11. Each icon is three to a few hundred instances across three AWS zones. Legend: Cassandra, memcached, Web service, S3 bucket. Start Here: Personalization movie group chooser.
13. Deployed in Three Balanced Availability Zones: Load Balancers route across Zone A, Zone B, and Zone C, each zone running Cassandra and Evcache Replicas.
14. Triple Replicated Persistence: Load Balancers route across Zone A, Zone B, and Zone C; each zone holds its own Cassandra and Evcache Replicas.
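The triple-replication scheme above can be sketched as a quorum write: with one replica per zone (replication factor 3), an operation succeeds once a majority of replicas respond, so losing any single zone loses neither data nor availability. A minimal illustrative sketch, not Netflix code; the zone names and functions are assumptions:

```python
# Sketch of quorum-style replication across three availability zones.
# With replication factor 3 and quorum = 2, any single-zone failure
# still leaves enough replicas to serve reads and writes.

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed zone names
QUORUM = len(ZONES) // 2 + 1  # 2 of 3

def quorum_write(replicas, key, value):
    """Attempt the write on every zone's replica; succeed on a majority."""
    acks = 0
    for zone in ZONES:
        replica = replicas.get(zone)
        if replica is not None:          # None models a down zone
            replica[key] = value
            acks += 1
    return acks >= QUORUM

def quorum_read(replicas, key):
    """Read from the live replicas and return the majority value."""
    values = [r[key] for r in replicas.values() if r is not None and key in r]
    if len(values) < QUORUM:
        raise RuntimeError("not enough replicas for a quorum read")
    return max(set(values), key=values.count)

# One zone down: writes and reads still succeed with 2 of 3 replicas.
replicas = {"us-east-1a": {}, "us-east-1b": {}, "us-east-1c": None}
assert quorum_write(replicas, "movie:42", "metadata")
assert quorum_read(replicas, "movie:42") == "metadata"
```

With two zones down, the same write returns False rather than pretending to succeed, which is the availability/durability trade the quorum makes.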
15. Isolated Regions: US-East Load Balancers and EU-West Load Balancers each front their own Zone A, Zone B, and Zone C, with Cassandra Replicas per zone in each region.
16. Failure Mode (Probability): Mitigation Plan
Application Failure (High): Automatic degraded response
AWS Region Failure (Low): Wait for region to recover
AWS Zone Failure (Medium): Continue to run on 2 out of 3 zones
Datacenter Failure (Medium): Migrate more functions to cloud
Data store failure (Low): Restore from S3 backups
S3 failure (Low): Restore from remote archive
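The "automatic degraded response" mitigation in the table can be sketched as a fallback: if a dependency such as personalization fails, serve a static default rather than an error page. A hypothetical sketch; the function names and default row are assumptions, not Netflix APIs:

```python
# Sketch of an automatic degraded response: when a dependency fails,
# return a precomputed static result instead of propagating the error.

DEFAULT_ROW = ["Top 10 overall"]  # static, non-personalized fallback

def personalized_rows(user_id, fetch):
    """Try the personalization service; degrade to a static row on failure."""
    try:
        return fetch(user_id)
    except Exception:
        return DEFAULT_ROW

def broken_fetch(user_id):
    raise TimeoutError("personalization service unavailable")

assert personalized_rows(7, broken_fetch) == ["Top 10 overall"]
assert personalized_rows(7, lambda uid: [f"Picks for {uid}"]) == ["Picks for 7"]
```

The user still gets a usable page; only the quality of the recommendations degrades.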
17. Run what you wrote. Rapid detection. Rapid response.
26. Classify and name the types of things that might go wrong in the platform or infrastructure.
27. Zone failure modes: Zone Network Outage, Zone Power Outage, and Zone-Dependent Service Outage. The dependent service could be the @NetflixOSS platform or underlying infrastructure. [Diagram: US-East and EU-West Load Balancers over Zones A/B/C with Cassandra Replicas.]
29. Regional failure modes: Regional Network Outage and Control Plane Overload. [Diagram: US-East and EU-West Load Balancers over Zones A/B/C with Cassandra Replicas.]
31. Cascading Capacity Overload: capacity demand migrates to services in another zone that don’t scale up fast enough to take the load, and migrating demand across regions may just spread the problem further… Related causes: Platform and Infrastructure Software Bugs and Global Configuration Errors. “Oops…” [Diagram: US-East and EU-West Load Balancers over Zones A/B/C with Cassandra Replicas.]
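The cascade described above can be illustrated with simple arithmetic: if every zone runs near capacity and one fails, the survivors inherit its traffic and may themselves tip over, spreading the failure. A toy model; the capacities and load figures are illustrative assumptions, not Netflix numbers:

```python
# Toy model of cascading capacity overload: when a zone fails, its load
# shifts to the survivors; any survivor pushed past capacity fails too.

def cascade(loads, capacity):
    """Repeatedly redistribute load from overloaded zones; return the
    surviving zones and their loads (empty dict if everything fails)."""
    alive = dict(loads)
    while True:
        failed = [z for z, load in alive.items() if load > capacity]
        if not failed or len(failed) == len(alive):
            return {} if len(failed) == len(alive) else alive
        shed = sum(alive.pop(z) for z in failed)
        share = shed / len(alive)
        for z in alive:
            alive[z] += share  # survivors absorb the shed traffic

# Three zones at 80% of a capacity of 100; zone C's hardware fails outright.
loads = {"A": 80.0, "B": 80.0, "C": 80.0}
loads.pop("C")
for z in loads:
    loads[z] += 40.0  # C's 80 units split between A and B
# A and B are now at 120 > 100: both overload and the whole region cascades.
assert cascade(loads, capacity=100) == {}
```

The same arithmetic is why shifting regional demand can spread the problem instead of containing it: the receiving region's headroom, not just its health, decides whether the failover helps.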
38. The @NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error. Applications not using zone-aware routing kept trying to talk to dead instances and timing out. Effect: higher latency and errors. Mitigation: fixed the configuration, and made zone-aware routing the default. [Diagram: Zone Power Outage; US-East and EU-West Load Balancers over Zones A/B/C with Cassandra Replicas.]
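Zone-aware routing, which the mitigation above made the default, can be sketched as: prefer UP instances in the caller's own zone, and cross zone boundaries only when the local zone has none. A simplified sketch; the data shapes and names are assumptions, not the Eureka or Ribbon API:

```python
# Sketch of zone-aware routing: route to healthy instances in the
# caller's zone first, falling back to other zones only when needed.

def choose_targets(instances, my_zone):
    """instances: list of (name, zone, status) tuples."""
    up = [i for i in instances if i[2] == "UP"]
    local = [i for i in up if i[1] == my_zone]
    return local if local else up  # cross-zone only as a fallback

registry = [
    ("api-1", "zone-a", "UP"),
    ("api-2", "zone-a", "DOWN"),   # dead instance, correctly marked down
    ("api-3", "zone-b", "UP"),
]

# A caller in zone-a talks only to its healthy local instance...
assert choose_targets(registry, "zone-a") == [("api-1", "zone-a", "UP")]
# ...and a caller with no local instances falls back to any UP instance.
assert [n for n, _, _ in choose_targets(registry, "zone-c")] == ["api-1", "api-3"]
```

The incident above combines both halves: the registry wrongly reported dead instances as UP, and clients without the zone filter sprayed requests at them across all zones.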
40. Per-Zone Control Plane Command Queues (Zone Command Queue, Enable DNS). [Diagram: US-East and EU-West Load Balancers over Zones A/B/C with Cassandra Replicas.]
42. Single-function Cassandra Cluster, managed by Priam, between 6 and 72 nodes. Many different single-function REST clients reach it through a Stateless Data Access REST Service built on the Astyanax Cassandra Client; an optional Datacenter Update Flow also feeds the cluster. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. Visualized with AppDynamics Service Flow Visualization.
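The 6-to-72-node clusters above place rows with a token ring: each node owns a token range, and a key's hash determines which nodes hold its replicas. A minimal sketch of the idea; the hash choice and node names are illustrative, not Priam or Astyanax internals:

```python
import hashlib

# Sketch of token-ring placement: hash a key onto a ring of node tokens
# and walk clockwise to find the owning replica(s).

def token(s):
    """Map a string onto a 0..2**32 ring with a stable hash."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def owners(key, node_tokens, rf=3):
    """Return the rf nodes that own `key`, walking the ring clockwise."""
    ring = sorted(node_tokens.items(), key=lambda kv: kv[1])
    t = token(key)
    start = next((i for i, (_, nt) in enumerate(ring) if nt >= t), 0)
    return [ring[(start + k) % len(ring)][0] for k in range(rf)]

nodes = {f"cass-{i}": token(f"cass-{i}") for i in range(6)}  # a 6-node cluster
replicas = owners("user:1234", nodes)
assert len(replicas) == 3 and len(set(replicas)) == 3
```

Because placement is a pure function of the key and the ring, a token-aware client can send each request straight to an owning node, which is what makes the data-access tier stateless.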
43. Linux Base AMI (CentOS or Ubuntu) running Java (JDK 6 or 7) and Tomcat, with an optional Apache frontend, memcached, and non-java apps. Tomcat holds the application war file, base servlet, platform and client interface jars, and Astyanax, plus Healthcheck, status servlets, JMX interface, and Servo autoscale. Monitoring: AppDynamics appagent and machineagent, Epic/Atlas, GC and thread dump logging, and log rotation to S3.
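The healthcheck servlet baked into every instance above can be sketched as an aggregator: probe each local dependency and report healthy only if all pass, so the load balancer can drop sick instances automatically. A sketch of the core logic; the dependency names and return shape are illustrative assumptions:

```python
# Sketch of a healthcheck endpoint's core logic: run each dependency
# probe and report 200 only when every one passes.

def healthcheck(probes):
    """probes: dict of name -> zero-arg callable returning True/False."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a crashing probe counts as unhealthy
    status = 200 if all(results.values()) else 500
    return status, results

status, detail = healthcheck({
    "tomcat": lambda: True,
    "cassandra": lambda: True,
    "memcached": lambda: False,   # a failing local dependency
})
assert status == 500 and detail["memcached"] is False
assert healthcheck({"tomcat": lambda: True})[0] == 200
```

Returning the per-probe detail alongside the status code supports the "rapid detection" goal: operators see which dependency failed, not just that the instance went dark.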
46. Linux Base AMI (CentOS or Ubuntu) running Java (JDK 7), with Tomcat and Priam on the JDK exposing Healthcheck and Status, alongside the Cassandra Server. Monitoring: AppDynamics appagent and machineagent, Epic/Atlas, GC and thread dump logging. Local Ephemeral Disk Space: 2TB of SSD or 1.6TB disk holding the Commit log and SSTables.
51. Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon.
Priam: Cassandra as a Service
Exhibitor: Zookeeper as a Service
Servo and Autoscaling Scripts
Astyanax: Cassandra client for Java
Curator: Zookeeper Patterns
Honu: Log4j streaming to Hadoop
CassJMeter: Cassandra test suite
EVCache: Memcached as a Service
Circuit Breaker - Hystrix: Robust service pattern
Cassandra Multi-region EC2 datastore support
Eureka / Discovery: Service Directory
Asgard: AutoScaleGroup based AWS console
Aegisthus: Hadoop ETL for Cassandra
Archaius: Dynamic Properties Service
Chaos Monkey: Robustness verification
Edda: Queryable config history
Explorers
Latency Monkey: Server-side latency/error injection
Governator: Library lifecycle and dependency injection
Janitor Monkey
Odin: Workflow orchestration
REST Client + mid-tier LB
Bakeries and AMI
Blitz4j: Async logging
Configuration REST endpoints
Build dynaslaves
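Hystrix, listed above as the robust-service circuit-breaker pattern, can be sketched independently of its Java API: after a run of failures the breaker opens and calls fail fast to a fallback; after a cooldown it lets one trial call through to decide whether to close again. A minimal sketch; the thresholds, state handling, and names are assumptions, not the Hystrix implementation:

```python
import time

# Minimal circuit breaker sketch: CLOSED passes calls through, OPEN fails
# fast, and after a cooldown one trial call decides whether to re-close.

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # OPEN: fail fast, don't touch fn
            self.opened_at = None        # HALF-OPEN: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()
        self.failures = 0                # success resets the failure count
        return result

now = [0.0]  # injectable fake clock so the sketch is deterministic
cb = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: now[0])

def flaky():
    raise RuntimeError("dependency down")

assert cb.call(flaky, lambda: "fallback") == "fallback"  # failure 1
assert cb.call(flaky, lambda: "fallback") == "fallback"  # failure 2: trips
assert cb.opened_at is not None                          # breaker is OPEN
now[0] = 31.0                                            # cooldown elapses
assert cb.call(lambda: "ok", lambda: "fallback") == "ok" # trial call closes it
```

Failing fast while open is the point: it sheds load from a struggling dependency instead of piling up timed-out threads, which is exactly the cascading-overload failure mode the earlier slides describe.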