10/20/14 
Watching Your Cassandra Cluster Melt
What is PagerDuty?
Cassandra at PagerDuty 
• Used to provide durable, consistent reads and writes in a critical pipeline of
service applications
• Scala, Cassandra, Zookeeper. 
• Receives ~25 requests per second
• Each request is a handful of operations then processed asynchronously 
• Never lose an event. Never lose a message. 
• This has HUGE implications around our design and architecture.
Cassandra at PagerDuty 
• Cassandra 1.2 
• Thrift API 
• Using Hector/Cassie/Astyanax 
• Assigned tokens 
• Putting off migrating to vnodes 
• It is not big data 
• Clusters ~10s of GB 
• Data in the pipe is considered ephemeral
Cassandra at PagerDuty 
[Diagram: three data centers (DC-A, DC-B, DC-C) with inter-DC link latencies labeled ~20 ms, ~5 ms, and ~20 ms]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
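As a rough illustration of what quorum consistency with RF = 5 implies, the arithmetic can be sketched as below. A quorum is floor(RF / 2) + 1 replicas. The 2/2/1 placement of replicas across DC-A, DC-B, and DC-C is an assumption made for the example; the slides do not say how replicas are distributed.

```python
from typing import Dict


def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM read or write: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def survives_dc_loss(replicas_per_dc: Dict[str, int]) -> Dict[str, bool]:
    """For each data center, report whether a quorum remains if that DC is lost."""
    total = sum(replicas_per_dc.values())
    needed = quorum(total)
    return {dc: total - count >= needed for dc, count in replicas_per_dc.items()}


# Hypothetical placement of the five replicas (not stated in the slides).
placement = {"DC-A": 2, "DC-B": 2, "DC-C": 1}

print(quorum(5))                    # 3
print(survives_dc_loss(placement))  # every data center maps to True
```

Under this assumed placement, losing any single region still leaves at least three replicas, which is enough for quorum reads and writes to continue.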
Cassandra at PagerDuty 
• Operations cross the WAN and take the inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front, 
we’re not latency sensitive, but throughput sensitive. 
• We get consistent read/write operations. 
• Events aren’t lost. Messages aren’t repeated. 
• We get availability in the face of the loss of an entire DC/region.
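One way to read "throughput sensitive, not latency sensitive" is through Little's law: the average number of operations in flight equals arrival rate times time in the system. The sketch below is illustrative only. The ~25 requests per second comes from the slides; the five operations per request and the per-operation latencies are assumptions chosen for the example.

```python
def in_flight_operations(requests_per_sec: float,
                         ops_per_request: float,
                         op_latency_sec: float) -> float:
    """Little's law: average operations in flight = arrival rate x time in system."""
    return requests_per_sec * ops_per_request * op_latency_sec


# Assumed values: 5 operations per request, 20 ms per cross-DC operation.
steady = in_flight_operations(25, 5, 0.020)
# Hypothetical overload case: the same traffic with 2 s per operation.
degraded = in_flight_operations(25, 5, 2.0)

print(f"steady: ~{steady:.1f} in flight, degraded: ~{degraded:.0f} in flight")
```

With these assumed numbers, WAN latency alone keeps only a few operations in flight, while a cluster whose latency balloons under load accumulates a proportionally larger backlog at the same arrival rate.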
What Happened? 
• Everything fell apart and our critical pipeline began refusing new events and 
halted progress on existing ones. 
• Caused degraded performance and a three-hour outage at PagerDuty
• Unprecedented flush of in-flight data 
• Gory details on the impact found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
What Happened… 
• It was just a semi-regular day… 
• …no particular changes in traffic 
• …no particular changes in volume 
• We had an incident the day before 
• Repairs and compactions had been taking longer and longer. They 
were starting to overlap on machines. 
• We used ‘nodetool disablethrift’ to mitigate load on nodes that
couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a 
smaller 3/5 cluster (any 3/5). 
• The next day, we started a repair that had been forgone…
What happened… 
[Chart: 1-minute system load]
What we did… 
• Tried a few things to mitigate the damage 
• Stopped less critical tenants. 
• Disabled thrift interfaces 
• Disabled nodes 
• No discernible effect. 
• Left with no choice, we blew away all data and restarted Cassandra fresh 
• This only took 10 minutes after committing to do this. 
sudo rm -r /var/lib/cassandra/commitlog/* 
sudo rm -r /var/lib/cassandra/saved_caches/* 
sudo rm -r /var/lib/cassandra/data/* 
• Then everything was fine and dandy, like sour candy.
So, what happened…? 
WHAT WENT HORRIBLY WRONG? 
• Multi-tenancy in the Cassandra cluster. 
• Operational ease isn’t worth the loss of transparency.
• Underprovisioning 
• AWS m1.larges 
• 2 cores 
• 8 GB RAM (definitely not enough)
• Poor monitoring and high-water marks 
• A twisted desire to get everything out of our little cluster
Why we didn’t see it coming… 
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER. 
• Everything was fine 99% of the time. 
• Read/write latencies close to the inter-DC latencies. 
• Despite load being relatively high sometimes. 
• Cassandra seems to have two modes: fine and catastrophe 
• We thought, “we don’t have much data, it should be able to handle this.” 
• Thought we must have misconfigured something. We didn’t need to scale up…
What we should have seen… 
[Chart: constant memory pressure, with annotations marking one pattern as “This is bad” and another as “This is good”]
What we should have seen… 
• Consistent memtable flushing 
• “Flushing CFS(…) to relieve memory pressure” 
• Slower repair/compaction times 
• Likely related to the memory pressure 
• Widening disparity between median and p95 read/write latencies
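The two signals above (memory-pressure flushes in the logs, and p95 latency pulling away from the median) could be tracked with something like the following minimal sketch. The log lines and latency samples are made up for illustration, the function names are hypothetical, and the only string taken from the slides is the quoted phrase "to relieve memory pressure".

```python
import statistics
from typing import Iterable, Sequence

PRESSURE_MARKER = "to relieve memory pressure"  # phrase quoted on the slide


def count_pressure_flushes(log_lines: Iterable[str]) -> int:
    """Count log lines that report a flush triggered by memory pressure."""
    return sum(1 for line in log_lines if PRESSURE_MARKER in line)


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def tail_ratio(samples: Sequence[float]) -> float:
    """How many times larger p95 is than the median; near 1 means a tight tail."""
    return p95(samples) / statistics.median(samples)


# Illustrative inputs only, not real Cassandra output or measurements.
log_lines = [
    "INFO Flushing CFS(...) to relieve memory pressure",
    "INFO Compacted to [...]",
    "INFO Flushing CFS(...) to relieve memory pressure",
]
healthy_ms = [20] * 19 + [24]
strained_ms = [20] * 17 + [150, 180, 200]

print(count_pressure_flushes(log_lines))
print(tail_ratio(healthy_ms), tail_ratio(strained_ms))
```

Alerting on the trend of either number, rather than on an absolute threshold, is one way to catch the gradual drift described here; what counts as a worrying value is left to the operator.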
What we changed… 
THE AFTERMATH WAS ROUGH… 
• Immediately replaced all nodes with m2.2xlarges 
• 4 cores 
• 32 GB RAM 
• No more multi-tenancy. 
• Required nasty service migrations 
• Began watching a lot of pending task metrics. 
• Blocked flush writers
• Dropped messages
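Pending tasks, blocked flush writers, and dropped messages are the kind of counters that `nodetool tpstats` reports. A minimal sketch of turning such a report into warnings follows. The sample text is hand-written to resemble the tpstats column layout and its values are invented; the real layout varies by Cassandra version, and the function names and the pending threshold are assumptions.

```python
from typing import Dict, List, Tuple

# Hand-written sample resembling `nodetool tpstats` output; values are invented.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         7         104233         0                 0
FlushWriter                       1         3            912         1                58

Message type           Dropped
READ                         0
MUTATION                    41
"""


def parse_tpstats(text: str) -> Tuple[Dict[str, Dict[str, int]], Dict[str, int]]:
    """Split the report into per-pool counters and dropped-message counts."""
    pools: Dict[str, Dict[str, int]] = {}
    dropped: Dict[str, int] = {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(fields) == 6:
            name, _active, pending, _completed, blocked, all_time = fields
            pools[name] = {"pending": int(pending), "blocked": int(blocked),
                           "all_time_blocked": int(all_time)}
        elif section == "dropped" and len(fields) == 2:
            dropped[fields[0]] = int(fields[1])
    return pools, dropped


def warnings(text: str, max_pending: int = 5) -> List[str]:
    """List pools that are backed up or blocked, and message types being dropped."""
    pools, dropped = parse_tpstats(text)
    found = []
    for name, stats in pools.items():
        if stats["pending"] > max_pending:
            found.append(f"{name}: {stats['pending']} pending")
        if stats["blocked"] > 0 or stats["all_time_blocked"] > 0:
            found.append(f"{name}: blocked ({stats['all_time_blocked']} all time)")
    found.extend(f"{kind}: {n} dropped" for kind, n in dropped.items() if n > 0)
    return found


for warning in warnings(SAMPLE):
    print(warning)
```

In practice the text would come from running the command on each node and the results would be fed to whatever monitoring system is in use; that wiring is omitted here.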
Lessons Learned 
• Cassandra’s performance degrades steeply.
• Stay ahead of the scaling curve. 
• Jump on any warning signs 
• Practice scaling. Be able to do it on quick notice. 
• Cassandra performance deteriorates with changes in the data set and 
asynchronous, eventual consistency. 
• Just because your latencies were one way doesn’t mean they’re 
supposed to be that way. 
• Don’t build for multi-tenancy in your cluster.
PS. We’re hiring Cassandra people (enthusiast to expert) in our Realtime or Persistence 
teams. 
Thank you. 
http://www.pagerduty.com/company/work-with-us/ 
http://bit.ly/1ym8j9g
