Servers fail, who cares?



Presented at 2012 Cassandra Summit. Cassandra is a critical component of Netflix's streaming service. At this talk we will discuss the lessons we learned, and solutions we developed, for running Cassandra in an ephemeral AWS environment.



  • Outline of presentation: Jun 29 outage. Context: Cassandra and AWS; updated usage numbers; include architecture diagram with Cassandra called out. How clusters are constructed (blueprint diagrams): #1 AWS make-up (ASGs and AZs); #2 instance particulars; #3 Priam and S3. Resiliency: node, zone and region outages. Priam: bootstrapping, monitoring, backup and restore, open source. Monitoring: what we monitor; tools we use (Epic/Atlas and dashboards). Maintenance tasks: Jenkins. Things we monitor. Issues we have. Note on SSDs.
  • Minimum cluster size = 6
  • … Developer in house … Quickly find problems by looking into code; documentation/tools for troubleshooting are scarce. … Repairs … Affect the entire replication set; cause very high latency in an I/O-constrained environment. … Multi-tenant … Hard to track changes being made; shared resources mean that one service can affect another; individual usage only grows; moving services to a new cluster while the service is live is non-trivial. … Smaller per-node data … Instance-level operations (bootstrap, compact, etc.) are faster.
  • Extension of Epic, using preconfigured dashboards for each cluster. Add additional metrics as we learn which to monitor.

Servers fail, who cares? Presentation Transcript

  • 1. Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix – @eatupmartha #netflixcloud #cassandra12
  • 2. June 29, 2012
  • 3. [image slide]
  • 4. [image slide]
  • 5. [image slide] [1]
  • 6. From the Netflix tech blog: "Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]
  • 7. Topics: Cassandra at Netflix; constructing clusters in AWS with Priam; resiliency; observations on AWS, Cassandra and AWS/Cassandra; monitoring and maintenances; references
  • 8. Cassandra by the numbers:
    41 — number of production clusters
    13 — number of multi-region clusters
    4 — max regions for one cluster
    90 — total TB of data across all clusters
    621 — number of Cassandra nodes
    72 / 34 — largest Cassandra cluster (nodes / data in TB)
    80k / 250k — max reads/writes per second on a single cluster
    3* — size of Operations team (* We are hiring DevOps and Developers. Stop by our booth!)
  • 9. Netflix deployed on AWS [architecture diagram]: Content (S3 content metadata, DRM, CDN routing, encoding configuration — terabytes); Logs (S3, EMR, Hive & Pig, business intelligence — petabytes); Play, WWW, API, CS services (sign-up, search, movie/TV choosing, bookmarks, ratings, logging, international CS lookup, device diagnostics & actions, customer call log, social/Facebook); CDNs and ISPs (terabits); Customers
  • 10. Constructing clusters in AWS with Priam: Tomcat webapp for Cassandra administration; token management; full and incremental backups; JMX metrics collection; cassandra.yaml configuration; REST API for most nodetool commands; AWS Security Groups for multi-region clusters; open sourced, available on github [3]
  • 11. AWS terminology (A — Constructing a cluster in AWS):
    Autoscaling Groups (ASGs): do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc.)
    Amazon Machine Image (AMI): image loaded on to an AWS instance; all packages needed to run an application
    Security Group: defines access control between ASGs
    Instance; Availability Zone (AZ)
    [Slide shows nodetool ring output for a two-region cluster, with us-east and eu-west nodes (roughly 95–112 GB load each) alternating around the ring; each eu-west node owns 16.67% of the ring.]
  • 12. Cassandra configuration (B — Constructing a cluster in AWS):
    APP is not an AWS entity, but one that we use internally to denote a service; it is part of asgard [4], our open-sourced cloud application web interface. App = cass_cluster.
    ASG #1: Availability Zone = A; ASG #2: Availability Zone = B; ASG #3: Availability Zone = C. For each ASG: Region = us-east, Instance count = 6, Instance type = m2.4xlarge.
    Multi-region clusters have the same configuration in each region — just repeat what you see here!
    Full and incremental backups go to local-region S3 via Priam; external full backups to an alternate region are saved for 30 days.
  • 13. Putting it all together (C — Constructing a cluster in AWS):
    The AMI contains the OS, base Netflix packages, Cassandra and Priam (running in Tomcat). Priam runs on each node and will:
    * Assign tokens to each node, alternating availability zones (a, b, c) around the ring, to ensure data is written to multiple data centers and to survive the loss of a data center by losing only one node from each replication set
    * Perform nightly snapshot backups to S3
    * Perform incremental SSTable backups to S3
    * Bootstrap replacement nodes to use vacated tokens
    * Collect JMX metrics for our monitoring systems
    * Serve REST API calls to most nodetool functions
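The token-assignment scheme described on this slide can be sketched in a few lines. This is a simplified model, not Priam's actual code: it assumes RandomPartitioner's 2^127 token space, ignores the per-region token offset Priam applies, and hard-codes three zones.

```python
# Sketch: evenly spaced tokens with availability zones interleaved
# around the ring (assumption: RandomPartitioner's 2**127 ring).
RING = 2 ** 127

def assign_tokens(num_nodes):
    """Space tokens evenly around the ring."""
    return [i * RING // num_nodes for i in range(num_nodes)]

def interleave_zones(num_nodes, zones=("a", "b", "c")):
    """Alternate AZs (a, b, c, a, b, c, ...) so that any three
    consecutive nodes span all three zones."""
    return [zones[i % len(zones)] for i in range(num_nodes)]

tokens = assign_tokens(6)
zone_of = interleave_zones(6)
for t, z in zip(tokens, zone_of):
    print(z, t)
```

With RF = 3, this interleaving is what lets the cluster lose a whole zone and still keep two of the three replicas of every row.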
  • 14. Resiliency — instance: RF = AZ = 3; Cassandra bootstrapping works really well; replace nodes immediately; repair often
  • 15. Resiliency — one availability zone: RF = AZ = 3; alternating AZs ensures that each AZ has a full replica of the data; provision the cluster to run at 2/3 capacity; ride out a zone outage — do not move to another zone; bootstrap one node at a time; repair after recovery
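Why alternating AZs gives every zone a full replica can be checked with a small simulation. This uses a simplified SimpleStrategy-style walk (replicas on the next RF nodes clockwise); the real deployment relies on Cassandra's topology-aware placement, but with interleaved zones the resulting placement is the same.

```python
def replicas_for(ring_index, ring_zones, rf=3):
    """Simplified placement: a range owned by node i is replicated
    on the next rf nodes clockwise around the ring."""
    n = len(ring_zones)
    return [ring_zones[(ring_index + j) % n] for j in range(rf)]

# 6-node ring with AZs alternating a, b, c around the ring.
zones = ["a", "b", "c"] * 2

# Every replication set spans all three AZs, so each AZ holds one
# replica of every token range -- i.e. a full copy of the data.
for i in range(len(zones)):
    assert set(replicas_for(i, zones)) == {"a", "b", "c"}
print("every replication set spans all three AZs")
```

This is exactly why the cluster can "ride out" a zone outage: the surviving two zones together still hold two full replicas.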
  • 16. What happened on June 29th? During the outage: all Cassandra instances in us-east-1a were inaccessible; nodetool ring showed all of those nodes as DOWN; we monitored the other AZs to ensure availability. Recovery (power restored to us-east-1a): the majority of instances rejoined the cluster without issue; most of the remainder required a reboot to fix; the last few nodes had to be replaced, one at a time.
  • 17. Resiliency — multiple availability zones: outage; can no longer satisfy quorum. Restore from backup and repair.
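The quorum arithmetic behind the single-zone and multi-zone cases is simple to state. Assuming RF = AZ = 3 with one replica per zone (as the deck describes), losing one zone leaves two of three replicas, which still satisfies quorum; losing two zones does not.

```python
def quorum(rf):
    """Cassandra quorum: a majority of replicas."""
    return rf // 2 + 1

def quorum_available(rf, zones_down):
    # One replica per AZ (RF = AZ = 3), so a zone outage removes
    # exactly one replica from every replication set.
    live_replicas = rf - zones_down
    return live_replicas >= quorum(rf)

assert quorum(3) == 2
assert quorum_available(3, zones_down=1)      # one-zone outage: quorum holds
assert not quorum_available(3, zones_down=2)  # two zones down: quorum lost
```

This is why slide 15's advice is to provision for 2/3 capacity, and why slide 17's multi-zone outage forces a restore-and-repair instead.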
  • 18. Resiliency — region: on connectivity loss between regions, operate as island clusters until service is restored, then repair data between regions. If an entire region disappears, watch DVDs instead.
  • 19. Observations: AWS. Ephemeral drive performance is better than EBS; S3-backed AMIs help us weather EBS outages; instances seldom die on their own; use as many availability zones as you can afford; understand how AWS launches instances. I/O is constrained in most AWS instance types: repairs are very I/O intensive, and large size-tiered compactions can impact latency. SSDs [5] are game changers [6].
  • 20. Observations: Cassandra. A slow node is worse than a down node; a cold cache increases load and kills latency. Use whatever dials you can find in an emergency: remove the node from the coordinator list; compaction throttling; min/max compaction thresholds; enable/disable gossip. Leveled compaction performance is very promising; 1.1.x and 1.2.x should address some big issues.
  • 21. Monitoring. Actionable: hardware and network issues; cluster consistency. Cumulative trends. Informational: schema changes; log file errors/exceptions; recent restarts.
  • 22. Dashboards: identify anomalies
  • 23. Maintenances. Repair clusters regularly. Run off-line major compactions to avoid latency (SSDs will make this unnecessary). Always replace nodes when they fail; periodically replace all nodes in the cluster. Upgrade to new versions: binary (rpm) for major upgrades or emergencies; rolling AMI push over time.
  • 24. References
    1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
    2. Lessons Netflix learned from the AWS storm (techblog.netflix.com)
    3. github / Netflix / priam (github.com)
    4. github / Netflix / asgard (github.com)
    5. Announcing High I/O Instances for Amazon (aws.amazon.com)
    6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)