Cassandra Operations at Netflix



Slides from Netflix Cassandra Meetup on 3/27. Lessons learned and tools created at Netflix to manage Cassandra clusters in AWS.

Speaker notes
  • Keywords – Agenda
  • Centralized Cassandra team used as a resource for other teams
  • Minimum cluster size = 6
  • Don’t developers do everything? True for most services, but Cassandra is an exception. We needed a team focused on Cassandra so that services could adopt it quickly.
  • m2.4xlarge: 68.4 GB of memory; 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each); 1690 GB of instance storage; 64-bit platform; I/O performance: high. Ephemeral drives mean that we have to bootstrap new nodes.
  • Brief overview on this slide, go into detail on the next one
  • Things to cover on this slide: how AWS balances between AZs; what happens when an AZ goes away; how Priam alternates nodes around the ring, even in multi-region clusters.
  • (Vijay should have covered a lot of this.) Refer back to the previous slide. REST is useful for automation: we do not have to connect to nodes directly or use JMX. Priam only supports doubling the ring.
  • Node-, AZ- and cluster-level metrics. Time-series metrics with extensive history. Can compare multiple metrics on one graph. Also configured to send alerts.
  • Extension of Epic, using preconfigured dashboards for each cluster. We add additional metrics as we learn which to monitor.
  • Cluster-level monitoring, or things that we cannot easily derive from JMX or Epic.
  • Try to anticipate when a large minor compaction is going to happen. Freedom and responsibility has forced us to monitor schema changes. We want to understand every time Cassandra restarts. AWS very infrequently swaps out bad nodes; nodes usually become non-responsive.
  • … developer in house … Quickly find problems by looking into the code; documentation and tools for troubleshooting are scarce.
  • … repairs … Affect the entire replication set; cause very high latency in an I/O-constrained environment.
  • … multi-tenant … Hard to track changes being made; shared resources mean that one service can affect another; individual usage only grows; moving a service to a new cluster while it is live is non-trivial.
  • … smaller per-node data … Instance-level operations (bootstrap, compact, etc.) are faster.

    1. Cassandra Operations at Netflix (Gregg Ulrich)
    2. Agenda
       • Who we are
       • How much we use Cassandra
       • How we do it
       • What we learned
    3. Who we are
       • Cloud Database Engineering
         • Development – Cassandra and related tools
         • Architecture – data modeling and sizing
         • Operations – availability, performance and maintenance
       • Operations
         • 24x7 on-call support for all Cassandra clusters
         • Cassandra operations tools
         • Proactive problem hunting
         • Routine and non-routine maintenance
    4. How much we use Cassandra
       30          Number of production clusters
       12          Number of multi-region clusters
       3           Max regions for one cluster
       65          Total TB of data across all clusters
       472         Number of Cassandra nodes
       72 / 28     Largest Cassandra cluster (nodes / data in TB)
       50k / 250k  Max reads / writes per second on a single cluster
       3*          Size of Operations team (* open position for an additional engineer)
    5. I read that Netflix doesn’t have operations
       • Extension of Amazon’s PaaS
       • Decentralized Cassandra ops is expensive at scale
       • Immature product that changes rapidly (and drastically)
       • Easily apply best practices across all clusters
    6. How we configure Cassandra in AWS
       • Most services get their own Cassandra cluster
       • Mostly m2.4xlarge instances, but considering others
       • Cassandra and supporting tools baked into the AMI
       • Data stored on ephemeral drives
       • Data durability – all writes go to all availability zones
         • Alternate AZs in a replication set
         • RF = 3
    7. Minimum cluster configuration
       • Minimum production cluster configuration – 6 nodes
         • 3 auto-scaling groups
         • 2 instances per auto-scaling group
         • 1 availability zone per auto-scaling group
    8. Minimum cluster configuration, illustrated
       • ASG1 in AZ1, ASG2 in AZ2, ASG3 in AZ3
       • RF = 3, with tokens managed by Priam
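The alternating-AZ layout can be sketched in a few lines. This is an illustrative model, not Priam's actual code; it assumes evenly spaced tokens in RandomPartitioner's 2^127 token space:

```python
# Illustrative sketch (not Priam's code): 6 nodes on a ring with evenly
# spaced tokens, alternating the three availability zones so that any
# 3 consecutive nodes (an RF=3 replication set) span all 3 AZs.
NUM_NODES = 6
AZS = ["AZ1", "AZ2", "AZ3"]
MAX_TOKEN = 2**127  # RandomPartitioner token space

ring = [
    {"token": i * MAX_TOKEN // NUM_NODES, "az": AZS[i % len(AZS)]}
    for i in range(NUM_NODES)
]

# Every set of 3 consecutive nodes covers all three AZs, so losing an
# entire AZ still leaves two replicas of every row.
for i in range(NUM_NODES):
    replica_azs = {ring[(i + k) % NUM_NODES]["az"] for k in range(3)}
    assert replica_azs == {"AZ1", "AZ2", "AZ3"}
```

This is why the minimum is 6 nodes rather than 3: with 2 instances per AZ, the cluster can keep alternating AZs around the ring while still tolerating the loss of a whole zone.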
    9. Tools we use
       • Administration: Priam, Jenkins
       • Monitoring and alerting: Cassandra Explorer, Dashboards, Epic
    10. Tools we use – Priam
        • Open-sourced Tomcat webapp running on each instance
        • Multi-region token management via SimpleDB
        • Node replacement and ring expansion
        • Backup and restore
          • Full nightly snapshot backup to S3
          • Incremental backup of flushed SSTables to S3 every 30 seconds
        • Metrics collected via JMX
        • REST API to most nodetool functions
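The "doubling the ring" constraint has a simple token-math intuition: each new node takes the midpoint of an existing token range, so every old node hands exactly half its range to one new node and no data moves between old nodes. A hypothetical sketch, not Priam's implementation, again assuming RandomPartitioner:

```python
# Hypothetical sketch (not Priam's implementation) of ring doubling:
# each new node's token is the midpoint of an existing node's range.
MAX_TOKEN = 2**127  # RandomPartitioner token space

def double_ring(tokens):
    """Given sorted existing tokens, return the midpoint tokens that
    double the ring."""
    new_tokens = []
    for i, cur in enumerate(tokens):
        nxt = tokens[(i + 1) % len(tokens)]
        span = (nxt - cur) % MAX_TOKEN  # wraps around at the ring's end
        new_tokens.append((cur + span // 2) % MAX_TOKEN)
    return new_tokens

old = [i * MAX_TOKEN // 4 for i in range(4)]
new = double_ring(old)
# Interleaving old and new tokens yields an evenly spaced 8-node ring.
assert sorted(old + new) == [i * MAX_TOKEN // 8 for i in range(8)]
```

Expanding by any factor other than two would require recomputing all tokens and streaming data between existing nodes, which is why doubling is the supported path.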
    11. Tools we use – Cassandra Explorer
        • Kiosk mode – no alerting
        • High-level cluster status (thrift, gossip)
        • Warns on a small set of metrics
    12. Tools we use – Epic
        • Netflix-wide monitoring and alerting tool based on RRD
        • Priam proxies all JMX data to Epic
        • Very useful for finding specific issues
    13. Tools we use – Dashboards
        • Next-level cluster metrics
          • Throughput
          • Latency
          • Gossip status
          • Maintenance operations
          • Trouble indicators
        • Useful for finding anomalies
        • Most investigations start here
    14. Tools we use – Jenkins
        • Scheduling tool for additional monitors and maintenance tasks
        • Push-button automation for recurring tasks
        • Repairs, upgrades, and other tasks are performed only through Jenkins, to preserve a history of actions
        • On-call dashboard displays current issues and required maintenance
    15. Things we monitor
        • Cassandra: throughput, latency, compactions, repairs, pending threads, dropped operations, SSTable counts, Cassandra log files
        • System: disk space, load average, I/O errors, network errors, Java heap
    16. Other things we monitor
        • Compaction predictions
        • Backup failures
        • Recent restarts
        • Schema changes
        • The monitors themselves
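Compaction prediction can be approximated from SSTable counts: size-tiered compaction merges a bucket once min_threshold (default 4) similarly sized SSTables accumulate, so a bucket that is one SSTable short of the threshold is about to compact. A rough sketch with a simplified bucketing rule, not Cassandra's exact algorithm or Netflix's actual monitor:

```python
# Rough sketch of anticipating a large minor compaction under
# size-tiered compaction: find buckets of similarly sized SSTables
# that are one SSTable away from the merge threshold. The bucketing
# rule here is simplified, not Cassandra's exact algorithm.
def buckets_near_compaction(sstable_sizes_mb, min_threshold=4, ratio=0.5):
    """Group SSTables whose size is within `ratio` of a bucket's
    average; return buckets one SSTable away from triggering."""
    buckets = []
    for size in sorted(sstable_sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * (1 - ratio) <= size <= avg * (1 + ratio):
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) == min_threshold - 1]

# Three ~100 MB SSTables: one more flush in that tier triggers a merge.
assert buckets_near_compaction([100, 110, 95, 1000, 980, 5]) == [[95, 100, 110]]
```

Watching these near-threshold buckets is one way to "compact on our terms": the merge can be kicked off manually during a quiet period instead of letting Cassandra pick the moment.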
    17. What we learned
        • Having Cassandra developers in house is crucial
        • Repairs are incredibly expensive
        • Multi-tenant clusters are challenging
        • A down node is better than a slow node
        • Better to compact on our terms and not Cassandra’s
        • Sizing and tuning is difficult and often done live
        • Smaller per-node data size is better
    18. Q&A (and recommended viewing)
        • The Best of Times – Taft and Bakersfield are real places
        • South Park – later-season episodes like F-Word and Elementary School Musical
        • Caillou – my kids love this show; I don’t know why
        • Until the Light Takes Us – scary documentary on Norwegian Black Metal