10/20/14 
Watching Your Cassandra Cluster Melt
What is PagerDuty?
Cassandra at PagerDuty 
• Used to provide durable, consistent reads/writes in a critical pipeline of service applications
• Scala, Cassandra, Zookeeper. 
• Receives ~25 requests a sec 
• Each request is a handful of operations then processed asynchronously 
• Never lose an event. Never lose a message. 
• This has HUGE implications around our design and architecture.
Cassandra at PagerDuty 
• Cassandra 1.2 
• Thrift API 
• Using Hector/Cassie/Astyanax 
• Assigned tokens 
• Putting off migrating to vnodes 
• It is not big data 
• Clusters ~10s of GB 
• Data in the pipe is considered ephemeral
Cassandra at PagerDuty 
[Diagram: three datacenters (DC-A, DC-B, DC-C) with inter-DC latencies of ~20 ms, ~5 ms, and ~20 ms]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
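The quorum arithmetic behind "Quorum CL, RF = 5" can be sketched as follows. This is a minimal illustration, not the deck's code; the function names are hypothetical. The slides do not say how the five replicas are placed across the three regions, so the sketch only covers replica counts.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(rf: int, read_replicas: int, write_replicas: int) -> bool:
    """Read and write replica sets must overlap: R + W > RF."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    rf = 5
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, tolerated failures={tolerated_failures(rf)}")
    print("QUORUM/QUORUM overlap:", reads_see_latest_write(rf, q, q))
```

With RF = 5, a quorum is three replicas, so two can be down. Surviving the loss of a whole region would then require that no single region holds more than two of the five replicas (an assumption about placement, not something the slides state).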
Cassandra at PagerDuty 
• Operations cross the WAN and take the inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front, we’re not latency sensitive, but throughput sensitive.
• We get consistent read/write operations.
• Events aren’t lost. Messages aren’t repeated.
• We get availability in the face of the loss of an entire DC/region.
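One way to read "not latency sensitive, but throughput sensitive" is through Little's law: the number of operations in flight equals the arrival rate times the per-operation latency. The sketch below is illustrative only. The ~25 requests per second comes from the slides; the five operations per request and the 40 ms per quorum operation are assumed numbers, and the function names are hypothetical.

```python
def required_concurrency(requests_per_sec: float,
                         ops_per_request: float,
                         op_latency_sec: float) -> float:
    """Little's law: operations in flight = arrival rate * time in system."""
    return requests_per_sec * ops_per_request * op_latency_sec


def max_ops_per_sec(concurrency: float, op_latency_sec: float) -> float:
    """Throughput ceiling for a fixed number of concurrent operations."""
    return concurrency / op_latency_sec


if __name__ == "__main__":
    # 25 req/s is from the slides; 5 ops/request and 40 ms/op are assumptions.
    in_flight = required_concurrency(25, 5, 0.040)
    print(f"operations in flight: {in_flight:.1f}")
    print(f"serial ceiling at 40 ms/op: {max_ops_per_sec(1, 0.040):.0f} ops/s")
```

Under these assumptions, WAN latency only raises the number of operations that must be in flight at once; throughput holds as long as the cluster can sustain that concurrency.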
What Happened? 
• Everything fell apart: our critical pipeline began refusing new events and halted progress on existing ones.
• Caused degraded performance and a three-hour outage at PagerDuty
• Unprecedented flush of in-flight data
• Gory details on the impact can be found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
What Happened… 
• It was just a semi-regular day…
• …no particular changes in traffic
• …no particular changes in volume
• We had an incident the day before
• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.
• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a smaller cluster of 3 of the 5 nodes (any 3 of 5).
• The next day, we started a repair that had been forgone…
What happened… 
[Figure: 1-minute system load]
What we did… 
• Tried a few things to mitigate the damage 
• Stopped less critical tenants. 
• Disabled thrift interfaces 
• Disabled nodes 
• No discernible effect. 
• Left with no choice, we blew away all data and restarted Cassandra fresh
• This took only 10 minutes once we committed to doing it.
sudo rm -r /var/lib/cassandra/commitlog/* 
sudo rm -r /var/lib/cassandra/saved_caches/* 
sudo rm -r /var/lib/cassandra/data/* 
• Then everything was fine and dandy, like sour candy.
So, what happened…? 
WHAT WENT HORRIBLY WRONG? 
• Multi-tenancy in the Cassandra cluster. 
• Operational ease isn’t worth the loss of transparency.
• Underprovisioning 
• AWS m1.larges 
• 2 cores 
• 8 GB RAM (definitely not enough)
• Poor monitoring and high-water marks 
• A twisted desire to get everything out of our little cluster
Why we didn’t see it coming… 
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER. 
• Everything was fine 99% of the time. 
• Read/write latencies close to the inter-DC latencies. 
• Despite load being relatively high sometimes. 
• Cassandra seems to have two modes: fine and catastrophe 
• We thought, “we don’t have much data, it should be able to handle this.” 
• Thought we must have misconfigured something. We didn’t need to scale up…
What we should have seen… 
[Figure: constant memory pressure, with regions annotated “This is bad” and “This is good”]
What we should have seen… 
• Consistent memtable flushing 
• “Flushing CFS(…) to relieve memory pressure” 
• Slower repair/compaction times 
• Likely related to the memory pressure 
• Widening disparity between median and p95 read/write latencies
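The last warning sign, a widening gap between median and p95 latency, can be tracked as a simple ratio per time window. The sketch below is an illustration under stated assumptions: the helper names, the nearest-rank percentile method, the 2x threshold, and the sample latencies are all made up, not taken from the deck.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def tail_ratio(samples):
    """How many times slower the p95 request is than the median one."""
    return percentile(samples, 95) / percentile(samples, 50)


def is_tail_widening(windows, factor=2.0):
    """True if the newest window's p95/p50 ratio is at least `factor`
    times the oldest window's ratio (threshold is an assumption)."""
    ratios = [tail_ratio(w) for w in windows]
    return ratios[-1] >= factor * ratios[0]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds for two time windows.
    healthy = list(range(20, 40))
    strained = [20] * 18 + [200] * 2
    print("healthy ratio:", round(tail_ratio(healthy), 2))
    print("strained ratio:", round(tail_ratio(strained), 2))
    print("widening:", is_tail_widening([healthy, strained]))
```

The median alone would look identical or even better in the strained window, which is why the ratio is the more useful signal here.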
What we changed… 
THE AFTERMATH WAS ROUGH… 
• Immediately replaced all nodes with m2.2xlarges 
• 4 cores 
• 32 GB RAM 
• No more multi-tenancy. 
• Required nasty service migrations 
• Began watching a lot of pending task metrics. 
• Blocked flush writers
• Dropped messages
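Pending tasks, blocked flush writers, and dropped messages are all reported by `nodetool tpstats`. A rough sketch of turning that output into numbers to alert on is shown below. The column layout in the sample is an assumption based on Cassandra 1.2-era output and should be checked against your version; the sample values, the parser, and the `max_pending` threshold are hypothetical.

```python
# Illustrative sample only: made-up numbers in an assumed tpstats layout.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        1203456         0                 0
MutationStage                     2        57        9876543         0                 0
FlushWriter                       1         4           8123         1               312

Message type           Dropped
READ                         0
MUTATION                  1841
"""


def parse_tpstats(text):
    """Return (pools, dropped) dictionaries parsed from tpstats-style text."""
    pools, dropped = {}, {}
    section = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(parts) == 6:
            name, _active, pending, _completed, blocked, all_blocked = parts
            pools[name] = {
                "pending": int(pending),
                "blocked": int(blocked),
                "all_time_blocked": int(all_blocked),
            }
        elif section == "dropped" and len(parts) == 2:
            dropped[parts[0]] = int(parts[1])
    return pools, dropped


def warnings(pools, dropped, max_pending=10):
    """List human-readable warnings; max_pending is an arbitrary threshold."""
    found = []
    for name, stats in pools.items():
        if stats["pending"] > max_pending:
            found.append(f"{name}: {stats['pending']} pending tasks")
        if stats["blocked"] > 0:
            found.append(f"{name}: {stats['blocked']} currently blocked")
    for message_type, count in dropped.items():
        if count > 0:
            found.append(f"{message_type}: {count} dropped messages")
    return found


if __name__ == "__main__":
    for warning in warnings(*parse_tpstats(SAMPLE_TPSTATS)):
        print(warning)
```

In practice the text would come from running `nodetool tpstats` on each node on a schedule, with the counters shipped to whatever monitoring system is already in place.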
Lessons Learned 
• Cassandra’s performance degrades steeply.
• Stay ahead of the scaling curve.
• Jump on any warning signs
• Practice scaling. Be able to do it on short notice.
• Cassandra performance deteriorates with changes in the data set and asynchronous, eventual consistency.
• Just because your latencies were one way doesn’t mean they’re supposed to be that way.
• Don’t build for multi-tenancy in your cluster.
PS. We’re hiring Cassandra people (enthusiast to expert) in our Realtime or Persistence teams.
Thank you. 
http://www.pagerduty.com/company/work-with-us/ 
http://bit.ly/1ym8j9g

Apache Cassandra at Pager Duty 2014
