10/20/14 
Watching Your Cassandra Cluster Melt
What is PagerDuty?
Cassandra at PagerDuty 
• Used to provide durable, consistent reads and writes in a critical pipeline of service applications
• Scala, Cassandra, ZooKeeper.
• Receives ~25 requests per second
• Each request is a handful of operations, then processed asynchronously
• Never lose an event. Never lose a message. 
• This has HUGE implications around our design and architecture.
Cassandra at PagerDuty 
• Cassandra 1.2 
• Thrift API 
• Using Hector/Cassie/Astyanax 
• Assigned tokens 
• Putting off migrating to vnodes 
• It is not big data 
• Clusters ~10s of GB 
• Data in the pipe is considered ephemeral
Cassandra at PagerDuty 
[Diagram: three datacenters, DC-A, DC-B, and DC-C, linked with inter-DC latencies of ~20 ms, ~5 ms, and ~20 ms]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
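The arithmetic behind Quorum CL with RF = 5 can be sketched as below. This is a minimal illustration, not the deck's own code: the 2/2/1 replica placement across the three datacenters and the function names are assumptions, since the slides only state five (or ten) nodes in three regions.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM read or write."""
    return replication_factor // 2 + 1


RF = 5

# Hypothetical replica placement; the slides do not give the per-DC split.
replicas_per_dc = {"DC-A": 2, "DC-B": 2, "DC-C": 1}


def survives_dc_loss(placement: dict, rf: int) -> dict:
    """For each DC, report whether a quorum is still reachable if that DC is lost."""
    needed = quorum(rf)
    total = sum(placement.values())
    return {dc: total - count >= needed for dc, count in placement.items()}


if __name__ == "__main__":
    print("quorum size:", quorum(RF))
    # Quorum reads overlap quorum writes when R + W > RF.
    print("reads see latest write:", quorum(RF) + quorum(RF) > RF)
    print("quorum after losing a DC:", survives_dc_loss(replicas_per_dc, RF))
```

Under this assumed placement, any single datacenter can disappear and three of the five replicas remain, which is what a quorum needs.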
Cassandra at PagerDuty 
• Operations cross the WAN and take the inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front, we’re throughput sensitive rather than latency sensitive.
• We get consistent read/write operations.
• Events aren’t lost. Messages aren’t repeated.
• We get availability in the face of the loss of an entire DC/region.
What Happened? 
• Everything fell apart: our critical pipeline began refusing new events and halted progress on existing ones.
• Caused degraded performance and a three-hour outage at PagerDuty
• Unprecedented flush of in-flight data
• Gory details on the impact are on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
What Happened… 
• It was just a semi-regular day… 
• …no particular changes in traffic 
• …no particular changes in volume 
• We had an incident the day before 
• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.
• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3/5).
• The next day, we started a repair that had been forgone…
What happened… 
[Chart: 1-minute system load]
What we did… 
• Tried a few things to mitigate the damage 
• Stopped less critical tenants. 
• Disabled thrift interfaces 
• Disabled nodes 
• No discernible effect. 
• Left with no choice, we blew away all data and restarted Cassandra fresh 
• This took only 10 minutes once we committed to it.
sudo rm -r /var/lib/cassandra/commitlog/* 
sudo rm -r /var/lib/cassandra/saved_caches/* 
sudo rm -r /var/lib/cassandra/data/* 
• Then everything was fine and dandy, like sour candy.
So, what happened…? 
WHAT WENT HORRIBLY WRONG? 
• Multi-tenancy in the Cassandra cluster. 
• Operational ease isn’t worth the loss of transparency.
• Underprovisioning 
• AWS m1.larges 
• 2 cores 
• 8 GB RAM (definitely not enough)
• Poor monitoring and high-water marks 
• A twisted desire to get everything out of our little cluster
Why we didn’t see it coming… 
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER. 
• Everything was fine 99% of the time. 
• Read/write latencies stayed close to the inter-DC latencies, despite load being relatively high at times.
• Cassandra seems to have two modes: fine and catastrophe 
• We thought, “we don’t have much data, it should be able to handle this.” 
• Thought we must have misconfigured something. We didn’t need to scale up…
What we should have seen… 
CONSTANT MEMORY PRESSURE 
[Chart with two annotated regions: “This is bad” and “This is good”]
What we should have seen… 
• Consistent memtable flushing 
• “Flushing CFS(…) to relieve memory pressure” 
• Slower repair/compaction times 
• Likely related to the memory pressure 
• Widening disparity between median and p95 read/write latencies
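The last warning sign, a widening gap between median and p95 latency, can be tracked with a small calculation like the one below. This is a sketch under assumptions: the nearest-rank percentile method, the helper names, the sample numbers, and the 1.5x threshold are all illustrative and not taken from the slides.

```python
import statistics


def percentile(samples, pct: int):
    """Nearest-rank percentile (integer pct) of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


def tail_ratio(latencies_ms) -> float:
    """p95 divided by the median; grows as the tail pulls away from the middle."""
    return percentile(latencies_ms, 95) / statistics.median(latencies_ms)


def is_widening(ratios, factor: float = 1.5) -> bool:
    """True if the latest ratio exceeds the earliest by `factor` (hypothetical threshold)."""
    return len(ratios) >= 2 and ratios[-1] > ratios[0] * factor


if __name__ == "__main__":
    # Made-up latency samples in milliseconds for two time windows.
    healthy = [20] * 19 + [25]
    degraded = [20] * 18 + [200, 400]
    history = [tail_ratio(healthy), tail_ratio(degraded)]
    print(history, is_widening(history))
```

The median barely moves between the two windows, which is why watching the median alone would hide the problem.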
What we changed… 
THE AFTERMATH WAS ROUGH… 
• Immediately replaced all nodes with m2.2xlarges 
• 4 cores 
• 32 GB RAM 
• No more multi-tenancy. 
• Required nasty service migrations 
• Began watching a lot of pending task metrics. 
• Blocked flush writers
• Dropped messages
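One way to watch the metrics listed above is to scan the text printed by `nodetool tpstats`. The sketch below assumes the column layout used by Cassandra releases of that era (pool rows with Active, Pending, Completed, Blocked, and All time blocked, followed by a dropped-message table); the layout can differ between versions. The sample text, its numbers, the function names, and the pending threshold are invented for illustration.

```python
# Illustrative sample only; these are not real numbers from any cluster.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         103423         0                 0
MutationStage                     0        12         984213         0                 0
FlushWriter                       1         4           2310         1               187

Message type           Dropped
READ                         0
MUTATION                    42
"""


def parse_tpstats(text: str):
    """Return (pools, dropped) parsed from tpstats-style text."""
    pools, dropped = {}, {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            # Pool row: name, active, pending, completed, blocked, all time blocked.
            pools[parts[0]] = {
                "pending": int(parts[2]),
                "all_time_blocked": int(parts[5]),
            }
        elif len(parts) == 2 and parts[1].isdigit():
            # Dropped-message row: message type, count.
            dropped[parts[0]] = int(parts[1])
    return pools, dropped


def warning_signs(pools, dropped, max_pending: int = 10):
    """List (name, metric, value) tuples worth a closer look (hypothetical thresholds)."""
    signs = []
    for name, stats in pools.items():
        if stats["pending"] > max_pending:
            signs.append((name, "pending", stats["pending"]))
        if stats["all_time_blocked"] > 0:
            signs.append((name, "all_time_blocked", stats["all_time_blocked"]))
    for name, count in dropped.items():
        if count > 0:
            signs.append((name, "dropped", count))
    return signs


if __name__ == "__main__":
    for sign in warning_signs(*parse_tpstats(SAMPLE)):
        print(sign)
```

In practice the text would come from running the command on each node and the results would feed an alerting system; the parsing here is deliberately minimal.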
Lessons Learned 
• Cassandra has a steep performance degradation. 
• Stay ahead of the scaling curve. 
• Jump on any warning signs 
• Practice scaling. Be able to do it on quick notice. 
• Cassandra performance deteriorates with changes in the data set and with asynchronous, eventual consistency.
• Just because your latencies were one way doesn’t mean they’re supposed to be that way.
• Don’t build for multi-tenancy in your cluster.
PS. We’re hiring Cassandra people (enthusiast to expert) in our Realtime or Persistence teams.
Thank you. 
http://www.pagerduty.com/company/work-with-us/ 
http://bit.ly/1ym8j9g
