Watching Your Cassandra Cluster Melt
10/20/14
What is PagerDuty?
Cassandra at PagerDuty 
• Used to provide durable, consistent reads/writes in a critical pipeline of service applications 
• Scala, Cassandra, ZooKeeper 
• Receives ~25 requests per second 
• Each request turns into a handful of operations and is then processed asynchronously 
• Never lose an event. Never lose a message. 
• This has HUGE implications for our design and architecture.
Cassandra at PagerDuty 
• Cassandra 1.2 
• Thrift API 
• Using Hector/Cassie/Astyanax 
• Assigned tokens 
• Putting off migrating to vnodes (see the config sketch after this list) 
• It is not big data 
• Clusters are in the ~10s of GB 
• Data in the pipe is considered ephemeral
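For reference, the assigned-tokens versus vnodes choice comes down to a single pair of cassandra.yaml settings. A minimal sketch of the difference (the path and the values shown are illustrative, not our actual config): 

grep -E '^(initial_token|num_tokens)' /etc/cassandra/cassandra.yaml 
# initial_token: <one manually assigned token per node>   (what we run today) 
# num_tokens: 256                                          (the vnodes setting we keep putting off)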
Cassandra at PagerDuty 
[Diagram: three datacenters (DC-A, DC-B, DC-C); inter-DC latencies of ~20 ms, ~20 ms, and ~5 ms between the pairs] 
• Five (or ten) nodes in three regions 
• Quorum CL 
• RF = 5
Cassandra at PagerDuty 
• Operations cross the WAN and take an inter-DC latency hit. 
• Since we use it as our pipeline without much of a user-facing front, we’re not latency-sensitive, but throughput-sensitive. 
• We get consistent read/write operations. 
• Events aren’t lost. Messages aren’t repeated. 
• We get availability in the face of the loss of an entire DC/region (a back-of-the-envelope check follows this list).
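A back-of-the-envelope check of why QUORUM with RF = 5 across three DCs gives us both of those properties (standard quorum math, sketched as comments): 

# quorum(RF=5) = floor(5/2) + 1 = 3 
# quorum reads (R=3) + quorum writes (W=3) = 6 > RF=5, so every quorum read overlaps the latest quorum write 
# assuming replicas are spread so at most two of the five land in any one DC, 
# losing an entire DC still leaves 3 replicas: quorum remains reachable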
What Happened? 
• Everything fell apart: our critical pipeline began refusing new events and halted progress on existing ones. 
• This caused degraded performance and a three-hour outage in PagerDuty. 
• Unprecedented flush of in-flight data. 
• Gory details on the impact are on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
What Happened… 
• It was just a semi-regular day… 
• …no particular changes in traffic 
• …no particular changes in volume 
• We had an incident the day before 
• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines. 
• We used 'nodetool disablethrift' to mitigate load on nodes that couldn’t handle being coordinators (see the sketch after this list). 
• We even disabled nodes and found odd improvements with a smaller 3-of-5 cluster (any 3 of the 5). 
• The next day, we started a repair that had been forgone…
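The mitigation steps above boil down to a couple of nodetool commands run against the struggling nodes; a rough sketch (the host name is made up): 

nodetool -h cass-node-3 disablethrift     # stop the node from coordinating client (Thrift) requests 
nodetool -h cass-node-3 compactionstats   # watch whether it catches up on compactions 
nodetool -h cass-node-3 enablethrift      # put it back in rotation once it recovers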
What happened… 
[Graph: 1-minute system load across the cluster]
What we did… 
• We tried a few things to mitigate the damage: 
• Stopped less critical tenants 
• Disabled Thrift interfaces 
• Disabled nodes 
• No discernible effect. 
• Left with no choice, we blew away all the data and restarted Cassandra fresh. 
• This took only 10 minutes once we committed to doing it. 
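# wipe the commit log, saved caches, and data directories (i.e. the entire on-disk state)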
sudo rm -r /var/lib/cassandra/commitlog/* 
sudo rm -r /var/lib/cassandra/saved_caches/* 
sudo rm -r /var/lib/cassandra/data/* 
• Then everything was fine and dandy, like sour candy.
So, what happened…? 
WHAT WENT HORRIBLY WRONG? 
• Multi-tenancy in the Cassandra cluster. 
• The operational ease isn’t worth the lost transparency. 
• Underprovisioning 
• AWS m1.larges 
• 2 cores 
• 8 GB RAM ← definitely not enough (see the heap math after this list) 
• Poor monitoring and high-water marks 
• A twisted desire to get everything out of our little cluster
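For a sense of how little headroom that leaves: with the stock cassandra-env.sh heap sizing (as we understand the 1.2-era formula; check your own cassandra-env.sh), an 8 GB box ends up with roughly a 2 GB heap: 

# MAX_HEAP_SIZE = max( min(ram/2, 1024MB), min(ram/4, 8192MB) ) 
# on an 8 GB m1.large: max( min(4096, 1024), min(2048, 8192) ) = max(1024, 2048) = 2048 MB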
Why we didn’t see it coming… 
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER. 
• Everything was fine 99% of the time. 
• Read/write latencies close to the inter-DC latencies. 
• Despite load being relatively high sometimes. 
• Cassandra seems to have two modes: fine and catastrophe 
• We thought, “we don’t have much data, it should be able to handle this.” 
• We thought we must have misconfigured something, not that we needed to scale up…
What we should have seen… 
CONSTANT MEMORY PRESSURE 
[Two heap-usage graphs: constant memory pressure (“this is bad”) vs. a healthy node (“this is good”)]
What we should have seen… 
• Consistent memtable flushing: “Flushing CFS(…) to relieve memory pressure” in the logs 
• Slower repair/compaction times, likely related to the memory pressure 
• Widening disparity between median and p95 read/write latencies (see the checks sketched after this list)
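All of these warning signs were visible from the logs and standard tooling; a sketch of the checks we should have been alerting on (the log path and the placeholders are assumptions): 

grep -c "to relieve memory pressure" /var/log/cassandra/system.log    # count of emergency memtable flushes 
nodetool compactionstats                                              # pending compactions creeping up 
nodetool cfhistograms <keyspace> <column_family>                      # raw latency histograms, to compare median vs. p95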
What we changed… 
THE AFTERMATH WAS ROUGH… 
• Immediately replaced all nodes with m2.2xlarges 
• 4 cores 
• 32 GB RAM 
• No more multi-tenancy. 
• Required nasty service migrations 
• Began watching a lot of pending-task metrics (see the sketch after this list): 
• Blocked flush writers 
• Dropped messages
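Most of those pending-task metrics come straight out of nodetool tpstats; a minimal sketch of what we now watch (alerting thresholds omitted): 

nodetool tpstats 
# watch the Pending and "All time blocked" columns for FlushWriter (blocked flush writers), 
# and the dropped-message counts printed at the end of the output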
Lessons Learned 
• Cassandra’s performance degradation is steep when it hits. 
• Stay ahead of the scaling curve. 
• Jump on any warning signs. 
• Practice scaling. Be able to do it on short notice. 
• Cassandra performance deteriorates as the data set changes and as asynchronous, eventually consistent work piles up. 
• Just because your latencies were one way doesn’t mean they’re supposed to be that way. 
• Don’t build for multi-tenancy in your cluster.
PS. We’re hiring Cassandra people (enthusiast to expert) in our Realtime or Persistence teams. 
Thank you. 
http://www.pagerduty.com/company/work-with-us/ 
http://bit.ly/1ym8j9g
