Servers fail, who cares?

Presented at the 2012 Cassandra Summit. Cassandra is a critical component of Netflix's streaming service. In this talk we discuss the lessons we learned, and the solutions we developed, for running Cassandra in an ephemeral AWS environment.

Published in: Technology, Sports
Notes
  • Video of this presentation from the Cassandra Summit here:
    http://www.youtube.com/watch?v=9Vvc58oqox0

    All presentations are here and well worth your time:
    http://www.datastax.com/events/cassandrasummit2012/presentations
  • Outline of presentation:
    - Jun 29 outage
    - Context: Cassandra and AWS; updated usage numbers; include architecture diagram with Cassandra called out
    - How clusters are constructed; blueprint diagrams should include #1 AWS make-up (ASGs and AZs), #2 instance particulars, #3 Priam and S3
    - Resiliency: node, zone and region outages
    - Priam: bootstrapping, monitoring, backup and restore, open source
    - Monitoring: what we monitor; tools we use (Epic/Atlas and dashboards)
    - Maintenance tasks (Jenkins)
    - Things we monitor; issues we have
    - Note on SSDs
  • Minimum cluster size = 6
  • … Developer in house …
    - Quickly find problems by looking into code
    - Documentation/tools for troubleshooting are scarce
    … repairs …
    - Affect entire replication set, cause very high latency in I/O constrained environment
    … multi-tenant …
    - Hard to track changes being made
    - Shared resources mean that one service can affect another one
    - Individual usage only grows
    - Moving services to a new cluster with the service live is non-trivial
    … smaller per-node data …
    - Instance level operations (bootstrap, compact, etc) are faster
  • Extension of Epic, using preconfigured dashboards for each cluster; add additional metrics as we learn which to monitor
  • Servers fail, who cares?

    1. 1. Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix – @eatupmartha #netflixcloud #cassandra12
    2. 2. June 29, 2012 2
    3. 3. 3
    4. 4. 4
    5. 5. [1] 5
    6. 6. From the Netflix tech blog: “Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability. [2]”
    7. 7. Topics
        • Cassandra at Netflix
        • Constructing clusters in AWS with Priam
        • Resiliency
        • Observations on AWS, Cassandra and AWS/Cassandra
        • Monitoring and maintenances
        • References
    8. 8. Cassandra by the numbers
        41        Number of production clusters
        13        Number of multi-region clusters
        4         Max regions, one cluster
        90        Total TB of data across all clusters
        621       Number of Cassandra nodes
        72/34     Largest Cassandra cluster (nodes/data in TB)
        80k/250k  Max read/writes per second on a single cluster
        3*        Size of Operations team
        * We are hiring DevOps and Developers. Stop by our booth!
    9. 9. Netflix Deployed on AWS
        [Architecture diagram: Content, Logs, Play, WWW, API and CS service groups deployed on AWS (S3, EC2, EMR, CDNs), covering features such as sign-up, metadata, search, movie choosing, bookmarks, ratings, logging, device configuration/diagnostics and CS lookup, at terabyte-to-petabyte scale.]
    10. 10. Constructing clusters in AWS with Priam
        • Tomcat webapp for Cassandra administration
        • Token management
        • Full and incremental backups
        • JMX metrics collection
        • cassandra.yaml configuration
        • REST API for most nodetool commands
        • AWS Security Groups for multi-region clusters
        • Open sourced, available on github [3]
    11. 11. Constructing a cluster in AWS (A): AWS terminology
        • Autoscaling Groups (ASGs): do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc.)
        • Amazon machine image (AMI): image loaded on to an AWS instance; all packages needed to run an application
        • Security Group: defines access control between ASGs
        • Instance, Availability Zone (AZ), Region
        [nodetool ring output: us-east and eu-west nodes alternating across racks 1a–1e, each Up/Normal with roughly 91–112 GB of data]
    12. 12. Constructing a cluster in AWS (B): Cassandra configuration
        • APP is not an AWS entity, but one that we use internally to denote a service (App = cass_cluster). This is part of asgard [4], our open-sourced cloud application web interface.
        • Three ASGs, one per availability zone (A, B, C): Region = us-east, Instance count = 6, Instance type = m2.4xlarge
        • Multi-region clusters have the same configuration in each region. Just repeat what you see here!
        • Full and incremental backups to local-region S3 via Priam; external full backups to an alternate region, saved for 30 days
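    Netflix drives this layout through Asgard, but the same shape can be expressed directly against the AWS APIs. A rough boto3 sketch of the one-ASG-per-AZ layout (boto3 did not exist in 2012, and the AMI id, security group and names below are placeholders, not real Netflix values):

        # Rough sketch: one ASG per availability zone, 6 nodes each, m2.4xlarge.
        import boto3

        autoscaling = boto3.client("autoscaling", region_name="us-east-1")

        autoscaling.create_launch_configuration(
            LaunchConfigurationName="cass_cluster-lc",
            ImageId="ami-00000000",              # baked AMI: OS + base packages + Cassandra + Priam
            InstanceType="m2.4xlarge",
            SecurityGroups=["cass_cluster-sg"],  # one security group per cluster
        )

        for zone in ["us-east-1a", "us-east-1b", "us-east-1c"]:
            autoscaling.create_auto_scaling_group(
                AutoScalingGroupName=f"cass_cluster-{zone}",
                LaunchConfigurationName="cass_cluster-lc",
                AvailabilityZones=[zone],        # each ASG is pinned to a single AZ
                MinSize=6, MaxSize=6, DesiredCapacity=6,
            )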
    13. 13. Constructing a cluster in AWS (C): Putting it all together
        • AMI contains OS, base Netflix packages, Cassandra and Priam
        • Priam runs on each node and will:
          - Assign tokens to each node, alternating (1) the AZs around the ring (2)
          - Perform nightly snapshot backups to S3
          - Perform incremental SSTable backups to S3
          - Bootstrap replacement nodes to use vacated tokens
          - Collect JMX metrics for our monitoring systems
          - REST API calls to most nodetool functions
        • (1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers
        • (2) Survive the loss of a data center by ensuring that we only lose one node from each replication set
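    The token assignment described above boils down to spacing N tokens evenly around the RandomPartitioner ring and handing them out so that consecutive tokens land in different availability zones. A minimal Python sketch of that idea (the helper names are mine, not Priam's; Priam also offsets tokens per region):

        # Sketch of alternating-AZ token assignment for a RandomPartitioner ring.
        RING_SIZE = 2 ** 127          # RandomPartitioner token space
        ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]
        NODES_PER_ZONE = 2            # 6-node cluster, RF=AZ=3

        def initial_tokens(node_count, offset=0):
            """Evenly spaced tokens; 'offset' keeps multi-region rings from colliding."""
            return [(i * RING_SIZE // node_count + offset) % RING_SIZE
                    for i in range(node_count)]

        def assign(zones, nodes_per_zone):
            """Walk the ring and alternate zones so each replication set spans all AZs."""
            node_count = len(zones) * nodes_per_zone
            layout = []
            for i, token in enumerate(initial_tokens(node_count)):
                zone = zones[i % len(zones)]   # a, b, c, a, b, c, ...
                layout.append((zone, token))
            return layout

        if __name__ == "__main__":
            for zone, token in assign(ZONES, NODES_PER_ZONE):
                print(zone, token)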
    14. 14. Resiliency - Instance
        • RF=AZ=3
        • Cassandra bootstrapping works really well
        • Replace nodes immediately
        • Repair often
    15. 15. Resiliency – One availability zone
        • RF=AZ=3
        • Alternating AZs ensures that each AZ has a full replica of data
        • Provision cluster to run at 2/3 capacity
        • Ride out a zone outage; do not move to another zone
        • Bootstrap one node at a time
        • Repair after recovery
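    The arithmetic behind riding out a zone outage: with RF=3 and one replica per AZ, a quorum needs 2 of 3 replicas, so losing a full zone still leaves enough. A small sanity check (my own illustration, not from the talk):

        # Quorum arithmetic for RF=AZ=3: losing one zone leaves 2 of 3 replicas,
        # which still satisfies quorum reads/writes.
        def quorum(rf: int) -> int:
            return rf // 2 + 1

        rf = 3                                  # one replica per availability zone
        replicas_left = rf - 1                  # one zone down

        print(quorum(rf))                       # 2
        print(replicas_left >= quorum(rf))      # True: cluster stays available
        print((rf - 2) >= quorum(rf))           # False: losing two zones breaks quorum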
    16. 16. What happened on June 29th?
        • During outage
          - All Cassandra instances in us-east-1a were inaccessible
          - nodetool ring showed all nodes as DOWN
          - Monitoring other AZs to ensure availability
        • Recovery – power restored to us-east-1a
          - Majority of instances rejoined the cluster without issue
          - Majority of remainder required a reboot to fix
          - Remainder of nodes needed to be replaced, one at a time
    17. 17. Resiliency – Multiple availability zones
        • Outage; can no longer satisfy quorum
        • Restore from backup and repair
    18. 18. Resiliency - Region
        • Connectivity loss between regions – operate as island clusters until service restored
        • Repair data between regions
        • If an entire region disappears, watch DVDs instead
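    The talk does not spell out consistency levels, but operating each region as an island typically means reading and writing at a region-local consistency level, then letting repair reconcile the regions afterwards. A sketch with the DataStax Python driver (an anachronism here; Netflix used Astyanax at the time, and the contact points, keyspace and column names are placeholders):

        # Sketch: region-local consistency lets each region keep serving requests
        # when inter-region connectivity is lost.
        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # local-region seed nodes (placeholders)
        session = cluster.connect("bookmarks")        # illustrative keyspace

        stmt = SimpleStatement(
            "UPDATE bookmark SET position_ms = %s WHERE customer_id = %s",
            consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # only local replicas must ack
        )
        session.execute(stmt, (1234, "customer-42"))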
    19. 19. Observations: AWS
        • Ephemeral drive performance is better than EBS
        • S3-backed AMIs help us weather EBS outages
        • Instances seldom die on their own
        • Use as many availability zones as you can afford
        • Understand how AWS launches instances
        • I/O is constrained in most AWS instance types
          - Repairs are very I/O intensive
          - Large size-tiered compactions can impact latency
        • SSDs [5] are game changers [6]
    20. 20. Observations: Cassandra
        • A slow node is worse than a down node
        • Cold cache increases load and kills latency
        • Use whatever dials you can find in an emergency
          - Remove node from coordinator list
          - Compaction throttling
          - Min/max compaction thresholds
          - Enable/disable gossip
        • Leveled compaction performance is very promising
        • 1.1.x and 1.2.x should address some big issues
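    Most of these dials are reachable from nodetool, so they can be scripted into a runbook when latency spikes. A hedged sketch (the host and the threshold values are made up; the nodetool subcommands themselves are standard):

        # Sketch: turning the emergency dials via nodetool from a runbook script.
        import subprocess

        HOST = "10.0.0.1"   # placeholder node address (JMX reachable)

        def nodetool(*args):
            subprocess.check_call(["nodetool", "-h", HOST, *args])

        # Throttle compaction so it stops competing with reads for I/O (MB/s).
        nodetool("setcompactionthroughput", "8")

        # Raise the min/max SSTable count before size-tiered compaction kicks in.
        nodetool("setcompactionthreshold", "my_keyspace", "my_cf", "8", "64")

        # Take the node out of the ring's view temporarily, then bring it back.
        nodetool("disablegossip")
        # ... let it cool down ...
        nodetool("enablegossip")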
    21. 21. Monitoring
        • Actionable
          - Hardware and network issues
          - Cluster consistency
        • Cumulative trends
        • Informational
          - Schema changes
          - Log file errors/exceptions
          - Recent restarts
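    A minimal example of the "actionable" bucket: flag any node that nodetool ring does not report as Up/Normal. The parsing is simplified and the alert() hook is a stand-in; Netflix's real pipeline was Epic/Atlas.

        # Sketch: scrape `nodetool ring` and alert on any node that is not Up/Normal.
        import subprocess

        def alert(message: str) -> None:
            print("ALERT:", message)              # placeholder for a real paging/metrics hook

        def check_ring(host: str = "127.0.0.1") -> None:
            output = subprocess.check_output(["nodetool", "-h", host, "ring"], text=True)
            for line in output.splitlines():
                fields = line.split()
                # Data lines start with an IP address; skip headers and blank lines.
                if not fields or fields[0].count(".") != 3:
                    continue
                if "Down" in fields or "Leaving" in fields or "Joining" in fields:
                    alert("node %s is not Up/Normal: %s" % (fields[0], line.strip()))

        if __name__ == "__main__":
            check_ring()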
    22. 22. Dashboards - identify anomalies 23
    23. 23. Maintenances
        • Repair clusters regularly
        • Run off-line major compactions to avoid latency
          - SSDs will make this unnecessary
        • Always replace nodes when they fail
        • Periodically replace all nodes in the cluster
        • Upgrade to new versions
          - Binary (rpm) for major upgrades or emergencies
          - Rolling AMI push over time
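    Regular repair was driven from Jenkins; the core of such a job is just a serialized walk over the nodes, repairing one primary range at a time. A bare-bones sketch (the node list and keyspace are placeholders; a real job would pull the list from the ASGs or Priam):

        # Sketch: rolling primary-range repair, one node at a time, as a scheduled job.
        import subprocess
        import time

        NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # placeholder host list
        KEYSPACE = "my_keyspace"                       # illustrative name

        for node in NODES:
            # -pr repairs only the node's primary range, so ranges aren't repaired RF times over.
            subprocess.check_call(["nodetool", "-h", node, "repair", "-pr", KEYSPACE])
            time.sleep(300)   # let compactions settle before hitting the next node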
    24. 24. References
        1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
        2. Lessons Netflix learned from AWS Storm (techblog.netflix.com)
        3. github / Netflix / priam (github.com)
        4. github / Netflix / asgard (github.com)
        5. Announcing High I/O Instances for Amazon (aws.amazon.com)
        6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)
