Cassandra Operations at Netflix

Slides from Netflix Cassandra Meetup on 3/27. Lessons learned and tools created at Netflix to manage Cassandra clusters in AWS.

1. Cassandra Operations at Netflix – Gregg Ulrich

2. Agenda
   • Who we are
   • How much we use Cassandra
   • How we do it
   • What we learned

3. Who we are
   • Cloud Database Engineering
     • Development – Cassandra and related tools
     • Architecture – data modeling and sizing
     • Operations – availability, performance and maintenance
   • Operations
     • 24x7 on-call support for all Cassandra clusters
     • Cassandra operations tools
     • Proactive problem hunting
     • Routine and non-routine maintenances

4. How much we use Cassandra
   • 30 – Number of production clusters
   • 12 – Number of multi-region clusters
   • 3 – Max regions, one cluster
   • 65 – Total TB of data across all clusters
   • 472 – Number of Cassandra nodes
   • 72 / 28 – Largest Cassandra cluster (nodes / data in TB)
   • 50k / 250k – Max reads/writes per second on a single cluster
   • 3* – Size of Operations team (* open position for an additional engineer)

5. I read that Netflix doesn't have operations
   • Extension of Amazon's PaaS
   • Decentralized Cassandra ops is expensive at scale
   • Immature product that changes rapidly (and drastically)
   • Easily apply best practices across all clusters
6. How we configure Cassandra in AWS
   • Most services get their own Cassandra cluster
   • Mostly m2.4xlarge instances, but considering others
   • Cassandra and supporting tools baked into the AMI
   • Data stored on ephemeral drives
   • Data durability – all writes go to all availability zones
     • Alternate AZs in a replication set
     • RF = 3
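To make the RF = 3 layout concrete, here is a minimal sketch (not from the deck) of a keyspace that ends up with one replica per availability zone, assuming the Ec2Snitch, which maps the AWS region to the Cassandra datacenter name and each AZ to a rack. The keyspace name, datacenter name and the modern CQL syntax are illustrative.

```python
# Illustrative only: with Ec2Snitch the region (e.g. "us-east") is the
# datacenter and each availability zone is a rack, so NetworkTopologyStrategy
# with RF=3 places one replica in each of three AZs.
# Keyspace and datacenter names below are placeholders.
import subprocess

KEYSPACE_DDL = (
    "CREATE KEYSPACE IF NOT EXISTS example_service "
    "WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};"
)

# cqlsh ships with Cassandra; apply the DDL on any node in the cluster.
subprocess.run(["cqlsh", "-e", KEYSPACE_DDL], check=True)
```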
7. Minimum cluster configuration
   • Minimum production cluster configuration – 6 nodes
     • 3 auto-scaling groups
     • 2 instances per auto-scaling group
     • 1 availability zone per auto-scaling group
8. Minimum cluster configuration, illustrated
   [Diagram: three auto-scaling groups – ASG1 in AZ1, ASG2 in AZ2, ASG3 in AZ3 – forming a single RF=3 ring, with Priam running on the nodes]
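As a rough illustration of the 6-node minimum (not Netflix's actual provisioning path, which runs through their own cloud tooling), the sketch below creates three single-AZ auto-scaling groups pinned at two instances each; the region, zone names and launch configuration are assumptions.

```python
# Sketch only: three auto-scaling groups, one availability zone each, each
# held at exactly two instances, yielding the 6-node minimum cluster.
# The launch configuration, names and region are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

for zone in ("us-east-1a", "us-east-1b", "us-east-1c"):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass_example-{zone}",
        LaunchConfigurationName="cass_example-lc",  # assumed to exist already
        AvailabilityZones=[zone],                   # one AZ per ASG
        MinSize=2,
        MaxSize=2,
        DesiredCapacity=2,                          # two instances per ASG
    )
```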
9. Tools we use
   • Administration
     • Priam
     • Jenkins
   • Monitoring and alerting
     • Cassandra Explorer
     • Dashboards
     • Epic

10. Tools we use – Priam
    • Open-sourced Tomcat webapp running on each instance
    • Multi-region token management via SimpleDB
    • Node replacement and ring expansion
    • Backup and restore
      • Full nightly snapshot backup to S3
      • Incremental backup of flushed SSTables to S3 every 30 seconds
    • Metrics collected via JMX
    • REST API to most nodetool functions
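For a feel of what the incremental-backup piece does (this is not Priam's code), the sketch below ships SSTables that Cassandra drops into each table's backups directory off to S3 on a 30-second cycle, assuming incremental_backups is enabled in cassandra.yaml. The bucket name, data-directory layout and the absence of throttling, retries and multipart handling are all simplifications.

```python
# Simplified illustration of Priam-style incremental backups, not Priam itself.
# When incremental_backups is enabled, Cassandra hard-links newly flushed
# SSTables into <data_dir>/<keyspace>/<table>/backups/ (layout varies by
# version); this loop copies them to S3 and removes the local links.
import glob
import os
import time

import boto3

S3 = boto3.client("s3")
BUCKET = "example-cassandra-backups"      # placeholder bucket name
DATA_DIR = "/var/lib/cassandra/data"      # default Cassandra data directory

def ship_incrementals():
    for path in glob.glob(os.path.join(DATA_DIR, "*", "*", "backups", "*")):
        key = os.path.relpath(path, DATA_DIR)
        S3.upload_file(path, BUCKET, key)
        os.remove(path)                   # avoid re-uploading the same file

while True:
    ship_incrementals()
    time.sleep(30)
```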
11. Tools we use – Cassandra Explorer
    • Kiosk mode – no alerting
    • High level cluster status (thrift, gossip)
    • Warns on a small set of metrics

12. Tools we use – Epic
    • Netflix-wide monitoring and alerting tool based on RRD
    • Priam proxies all JMX data to Epic
    • Very useful for finding specific issues

13. Tools we use – Dashboards
    • Next level cluster metrics
      • Throughput
      • Latency
      • Gossip status
      • Maintenance operations
      • Trouble indicators
    • Useful for finding anomalies
    • Most investigations start here

14. Tools we use – Jenkins
    • Scheduling tool for additional monitors and maintenance tasks
    • Push button automation for recurring tasks
    • Repairs, upgrades, and other tasks are only performed through Jenkins to preserve history of actions
    • On-call dashboard displays current issues and maintenance required
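For a flavor of the kind of task scheduled through Jenkins, here is a minimal rolling-repair sketch (not the actual Netflix job): it walks a node list and runs a primary-range repair on one node at a time, so only one node pays the repair cost at any moment. The node list is a placeholder; a real job would discover nodes from the ring and report results back to the on-call dashboard.

```python
# Minimal rolling-repair sketch of the sort a Jenkins job might run.
# The node list is hypothetical; discovery, locking and result reporting
# are left out.
import subprocess

NODES = ["10.0.1.11", "10.0.2.12", "10.0.3.13"]   # placeholder node list

for host in NODES:
    print(f"repairing {host} ...")
    # -pr repairs only this node's primary token range, so one pass over all
    # nodes repairs the whole ring exactly once.
    subprocess.run(["nodetool", "-h", host, "repair", "-pr"], check=True)
```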
15. Things we monitor
    • Cassandra
      • Throughput
      • Latency
      • Compactions
      • Repairs
      • Pending threads
      • Dropped operations
      • SSTable counts
      • Cassandra log files
    • System
      • Disk space
      • Load average
      • I/O errors
      • Network errors
      • Java heap
16. Other things we monitor
    • Compaction predictions
    • Backup failures
    • Recent restarts
    • Schema changes
    • Monitors
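A toy version of two of the checks above (not the production monitors): disk usage on the data volume and the pending-compaction backlog. Thresholds, paths and the alerting hook are placeholders; the real checks run through Epic and Jenkins and also cover latency, dropped operations, SSTable counts and the rest.

```python
# Toy monitoring checks: data-volume fullness and pending compactions.
# Thresholds and the alert() hook are placeholders.
import shutil
import subprocess

DATA_DIR = "/var/lib/cassandra/data"   # default Cassandra data directory
DISK_ALERT_PCT = 80                    # placeholder threshold
PENDING_COMPACTIONS_ALERT = 50         # placeholder threshold

def alert(message):
    print("ALERT:", message)           # stand-in for paging / dashboards

# Disk space on the data volume
usage = shutil.disk_usage(DATA_DIR)
pct_used = 100.0 * usage.used / usage.total
if pct_used > DISK_ALERT_PCT:
    alert(f"data volume {pct_used:.0f}% full")

# Compaction backlog ("pending tasks: N"; exact output varies by version)
stats = subprocess.run(["nodetool", "compactionstats"],
                       capture_output=True, text=True, check=True).stdout
for line in stats.splitlines():
    if line.lower().startswith("pending tasks"):
        pending = int(line.split(":")[1].split()[0])
        if pending > PENDING_COMPACTIONS_ALERT:
            alert(f"{pending} pending compactions")
        break
```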
17. What we learned
    • Having Cassandra developers in house is crucial
    • Repairs are incredibly expensive
    • Multi-tenanted clusters are challenging
    • A down node is better than a slow node
    • Better to compact on our terms and not Cassandra's
    • Sizing and tuning is difficult and often done live
    • Smaller per-node data size is better

18. Q&A (and Recommended viewing)
    • The Best of Times – Taft and Bakersfield are real places
    • South Park – later season episodes like F-Word and Elementary School Musical
    • Caillou – My kids love this show; I don't know why
    • Until the Light Takes Us – Scary documentary on Norwegian Black Metal
