Servers fail, who cares?

Presented at the 2012 Cassandra Summit. Cassandra is a critical component of Netflix's streaming service. In this talk we will discuss the lessons we learned, and the solutions we developed, for running Cassandra in an ephemeral AWS environment.

  • Video of this presentation from the Cassandra Summit here:
    http://www.youtube.com/watch?v=9Vvc58oqox0

    All presentations are here and well worth your time:
    http://www.datastax.com/events/cassandrasummit2012/presentations

  • Outline of presentation:
    - Jun 29 outage
    - Context: Cassandra and AWS; updated usage numbers; include architecture diagram with Cassandra called out
    - How clusters are constructed: blueprint diagrams should include (1) AWS make-up (ASGs and AZs), (2) instance particulars, (3) Priam and S3
    - Resiliency: node, zone and region outages
    - Priam: bootstrapping, monitoring, backup and restore, open source
    - Monitoring: what we monitor; tools we use (Epic/Atlas and dashboards)
    - Maintenance tasks: Jenkins
    - Things we monitor; issues we have; note on SSDs
  • Minimum cluster size = 6
  • … Developer in house …
    - Quickly find problems by looking into code
    - Documentation/tools for troubleshooting are scarce
    … repairs …
    - Affect the entire replication set; cause very high latency in an I/O-constrained environment
    … multi-tenant …
    - Hard to track changes being made
    - Shared resources mean that one service can affect another
    - Individual usage only grows
    - Moving services to a new cluster while the service is live is non-trivial
    … smaller per-node data …
    - Instance-level operations (bootstrap, compact, etc.) are faster
  • Extension of Epic, using preconfigured dashboards for each cluster. Add additional metrics as we learn which to monitor.
  • Servers fail, who cares?

    1. Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix – @eatupmartha #netflixcloud #cassandra12
    2. June 29, 2012
    3. (image-only slide)
    4. (image-only slide)
    5. (image-only slide) [1]
    6. From the Netflix tech blog: "Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]
    7. Topics
       • Cassandra at Netflix
       • Constructing clusters in AWS with Priam
       • Resiliency
       • Observations on AWS, Cassandra, and AWS/Cassandra
       • Monitoring and maintenances
       • References
    8. Cassandra by the numbers
       41        Number of production clusters
       13        Number of multi-region clusters
       4         Max regions, one cluster
       90        Total TB of data across all clusters
       621       Number of Cassandra nodes
       72/34     Largest Cassandra cluster (nodes/data in TB)
       80k/250k  Max read/writes per second on a single cluster
       3*        Size of Operations team (* We are hiring DevOps and Developers. Stop by our booth!)
    9. Netflix Deployed on AWS (architecture diagram). Swim lanes: Content, Logs, Play, WWW, API, CS. Components include content management, encoding, EC2, EMR, S3 storage (terabytes to petabytes), Hive & Pig business intelligence, DRM, CDN routing, bookmarks, logging, sign-up, search, movie and TV choosing, ratings, social (Facebook), metadata, device configuration, international CS lookup, diagnostics & actions, customer call log, CS analytics, CDNs and ISPs (terabits), and customers.
    10. Constructing clusters in AWS with Priam
        Priam is a Tomcat webapp for Cassandra administration:
        • Token management
        • Full and incremental backups
        • JMX metrics collection
        • cassandra.yaml configuration
        • REST API for most nodetool commands
        • AWS Security Groups for multi-region clusters
        Open sourced, available on GitHub [3]
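Because Priam exposes its administration functions over REST, routine operations can be scripted from the node itself. The sketch below only illustrates the idea; the base URL and endpoint paths are assumptions, so check the Priam repository [3] for the actual API.

```python
# Sketch of scripting against Priam's REST layer from the local node.
# The base URL and endpoint paths are assumptions for illustration; see the
# Priam project on GitHub [3] for the real API.
import requests

PRIAM = "http://localhost:8080/Priam/REST/v1"  # assumed base URL

def get_token():
    """Ask the local Priam instance which ring token it assigned to this node."""
    return requests.get(f"{PRIAM}/cassconfig/get_token", timeout=10).text

def snapshot_backup():
    """Trigger a full snapshot backup to S3 (Priam normally runs this nightly)."""
    return requests.get(f"{PRIAM}/backup/do_snapshot", timeout=600).status_code

if __name__ == "__main__":
    print("token:", get_token())
```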
    11. Constructing a cluster in AWS (A): Autoscaling Groups
        ASGs do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc.).
        The example ring output (Address, DC, Rack, Status, State, Load, Owns, Token) shows a two-region cluster: eu-west nodes alternating through racks 1a/1b/1c, each owning 16.67%, and us-east nodes alternating through racks 1c/1d/1e showing 0.00%; all nodes Up/Normal with roughly 91-112 GB of load.
        AWS terminology:
        • Amazon Machine Image (AMI): image loaded on to an AWS instance; all packages needed to run an application
        • Security Group: defines access control between ASGs
        • Instance and Availability Zone (AZ): also labeled in the diagram
    12. Constructing a cluster in AWS (B): Cassandra configuration
        APP is not an AWS entity, but one that we use internally to denote a service; it is part of Asgard [4], our open-sourced cloud application web interface.
        App = cass_cluster, made up of three ASGs: ASG #1 (Availability Zone A), ASG #2 (Availability Zone B), ASG #3 (Availability Zone C), each with Region = us-east, Instance count = 6, Instance type = m2.4xlarge.
        Multi-region clusters have the same configuration in each region; just repeat what you see here.
        Full and incremental backups go to local-region S3 via Priam; external full backups go to an alternate region and are saved for 30 days.
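The blueprint above maps naturally onto scripted infrastructure. As a hedged illustration (using today's boto3 rather than anything shown in the talk, and with made-up launch-configuration and group names), the three fixed-size ASGs could be created like this:

```python
# Sketch: one fixed-size Auto Scaling group per Availability Zone, mirroring the
# blueprint above. Group and launch-configuration names are made up for the example.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

for az in ("us-east-1a", "us-east-1b", "us-east-1c"):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass_cluster-{az}",       # one ASG per AZ
        LaunchConfigurationName="cass_cluster-lc-v001",  # AMI + m2.4xlarge + security group
        AvailabilityZones=[az],                          # pin the group to a single AZ
        MinSize=6, MaxSize=6, DesiredCapacity=6,         # fixed size; no actual autoscaling
    )
```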
    13. Constructing a cluster in AWS (C): putting it all together
        The AMI contains the OS, base Netflix packages, Cassandra, and Priam. Priam runs inside Tomcat on each node, alongside Cassandra, and will:
        • Assign tokens to each node, alternating (1) the availability zones around the ring (2)
        • Perform nightly snapshot backups to S3
        • Perform incremental SSTable backups to S3
        • Bootstrap replacement nodes to use vacated tokens
        • Collect JMX metrics for our monitoring systems
        • Serve REST API calls to most nodetool functions
        (1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
        (2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.
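A minimal sketch of the token-assignment idea in (1) and (2): spread tokens evenly around the ring and alternate zones so that an RF=3 replication set always spans all three AZs. The 2^127 token space assumes RandomPartitioner; the zone names and node count are illustrative.

```python
# Sketch of Priam-style token assignment: evenly spaced tokens with availability
# zones alternating around the ring, so an RF=3 replication set always spans
# all three zones. RandomPartitioner's 2**127 token space is assumed.
RING_SIZE = 2 ** 127
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
NODES_PER_ZONE = 2   # 6-node example cluster

def ring_layout():
    total = len(ZONES) * NODES_PER_ZONE
    spacing = RING_SIZE // total
    return [
        {"zone": ZONES[i % len(ZONES)],  # a, b, c, a, b, c, ...
         "token": i * spacing}
        for i in range(total)
    ]

for node in ring_layout():
    print(node["zone"], node["token"])
```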
    14. Resiliency - Instance
        • RF = AZ = 3
        • Cassandra bootstrapping works really well
        • Replace nodes immediately
        • Repair often
    15. Resiliency – One availability zone
        • RF = AZ = 3
        • Alternating AZs ensures that each AZ has a full replica of the data
        • Provision the cluster to run at 2/3 capacity
        • Ride out a zone outage; do not move to another zone
        • Bootstrap one node at a time
        • Repair after recovery
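Why 2/3? With RF = AZ = 3 and one replica per zone, losing a zone leaves two thirds of the nodes to carry the full load, so each node needs roughly a third of headroom in steady state. A quick back-of-the-envelope check (the utilization number is assumed):

```python
# Why "provision to run at 2/3 capacity": with one replica per zone, losing one
# of three zones leaves 2/3 of the nodes to absorb the full workload.
zones = 3
steady_state_util = 0.65                      # assumed per-node utilization target

surviving_fraction = (zones - 1) / zones      # 2/3 of the nodes remain
outage_util = steady_state_util / surviving_fraction
print(f"per-node utilization during a zone outage: {outage_util:.0%}")
# 0.65 / (2/3) is roughly 98%; running hotter than ~2/3 would overload the survivors.
```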
    16. What happened on June 29th?
        During the outage:
        • All Cassandra instances in us-east-1a were inaccessible
        • nodetool ring showed all of those nodes as DOWN
        • Monitored other AZs to ensure availability
        Recovery (power restored to us-east-1a):
        • The majority of instances rejoined the cluster without issue
        • Most of the remainder required a reboot to fix
        • The remaining nodes needed to be replaced, one at a time
    17. Resiliency – Multiple availability zones
        • Outage; can no longer satisfy quorum
        • Restore from backup and repair
    18. Resiliency - Region
        • Connectivity loss between regions: operate as island clusters until service is restored
        • Repair data between regions
        • If an entire region disappears, watch DVDs instead
    19. Observations: AWS
        • Ephemeral drive performance is better than EBS
        • S3-backed AMIs help us weather EBS outages
        • Instances seldom die on their own
        • Use as many availability zones as you can afford
        • Understand how AWS launches instances
        • I/O is constrained in most AWS instance types
          - Repairs are very I/O intensive
          - Large size-tiered compactions can impact latency
        • SSDs [5] are game changers [6]
    20. Observations: Cassandra
        • A slow node is worse than a down node
        • A cold cache increases load and kills latency
        • Use whatever dials you can find in an emergency:
          - Remove the node from the coordinator list
          - Compaction throttling
          - Min/max compaction thresholds
          - Enable/disable gossip
        • Leveled compaction performance is very promising
        • 1.1.x and 1.2.x should address some big issues
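For reference, the compaction and gossip "dials" above correspond to standard nodetool commands; the wrapper below is only a sketch, and the keyspace/column family names are placeholders.

```python
# Sketch: turning the emergency "dials" above via standard nodetool commands.
# Keyspace and column family names are placeholders.
import subprocess

def nodetool(*args):
    """Run a nodetool command against the local node and return its output."""
    return subprocess.run(["nodetool", *args], check=True,
                          capture_output=True, text=True).stdout

nodetool("setcompactionthroughput", "8")                    # throttle compaction I/O (MB/s)
nodetool("setcompactionthreshold", "my_keyspace", "my_cf",  # size-tiered min/max thresholds
         "4", "8")
nodetool("disablegossip")   # take a slow node out of the ring so clients stop using it
nodetool("enablegossip")    # bring it back once it has recovered
```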
    21. Monitoring
        • Actionable: hardware and network issues; cluster consistency
        • Cumulative trends
        • Informational: schema changes; log file errors/exceptions; recent restarts
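An actionable cluster-consistency check can be as simple as flagging any node that nodetool ring does not report as Up/Normal. The sketch below illustrates the idea; it is not the Epic/Atlas tooling referenced in the talk.

```python
# Sketch of an actionable cluster-consistency check: alert on any node that
# `nodetool ring` does not report as Up/Normal. Illustrative only.
import subprocess

def nodes_not_up_normal():
    ring = subprocess.run(["nodetool", "ring"], check=True,
                          capture_output=True, text=True).stdout
    bad = []
    for line in ring.splitlines():
        fields = line.split()
        # Data rows look like: <address> <dc> <rack> <status> <state> <load> ...
        if len(fields) >= 5 and fields[3] in ("Up", "Down"):
            if fields[3:5] != ["Up", "Normal"]:
                bad.append(fields[0])
    return bad

if __name__ == "__main__":
    problems = nodes_not_up_normal()
    if problems:
        print("ALERT: nodes not Up/Normal:", ", ".join(problems))
```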
    22. Dashboards - identify anomalies
    23. Maintenances
        • Repair clusters regularly
        • Run off-line major compactions to avoid latency (SSDs will make this unnecessary)
        • Always replace nodes when they fail
        • Periodically replace all nodes in the cluster
        • Upgrade to new versions
          - Binary (rpm) for major upgrades or emergencies
          - Rolling AMI push over time
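Per the speaker notes, these maintenance tasks are driven from Jenkins. A scheduled job that repairs the cluster one node at a time might look roughly like this sketch; the host names, keyspace, and pause between nodes are assumptions.

```python
# Sketch of a scheduled maintenance job (e.g., run from Jenkins) that repairs the
# cluster one node at a time. Host names, keyspace, and pause are placeholders.
import subprocess
import time

HOSTS = ["cass-node-1", "cass-node-2", "cass-node-3"]   # placeholder node list

def repair(host, keyspace="my_keyspace"):
    """Run a primary-range repair on one node and wait for it to finish."""
    subprocess.run(["nodetool", "-h", host, "repair", "-pr", keyspace], check=True)

for host in HOSTS:
    repair(host)
    time.sleep(300)   # let compactions settle before moving to the next node
```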
    24. References
        1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
        2. Lessons Netflix learned from the AWS Storm (techblog.netflix.com)
        3. github / Netflix / priam (github.com)
        4. github / Netflix / asgard (github.com)
        5. Announcing High I/O Instances for Amazon (aws.amazon.com)
        6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)
