Video: https://www.youtube.com/watch?v=wbUmIacfswU
Speaker: Owen Kim, Software Engineer
Company: PagerDuty
PagerDuty had the misfortune of watching its abused, underprovisioned Cassandra cluster collapse. This talk covers the lessons learned from that experience, including:
• Which of the many, many metrics we learned to watch
• What mistakes we made that led to this catastrophe
• How we have changed our usage to make our Cassandra cluster more stable
Owen Kim is a Software Engineer at PagerDuty and enjoys whiskey, riding his Honda Shadow 600 (named "Chie"), and discussing the finer points of narrative and expression in video games.
3. 10/20/14
WATCHING YOUR CASSANDRA CLUSTER MELT
Cassandra at PagerDuty
• Used to provide durable, consistent reads/writes in a critical pipeline of service applications
• Scala, Cassandra, ZooKeeper
• Receives ~25 requests a second
• Each request becomes a handful of operations that are then processed asynchronously
• Never lose an event. Never lose a message.
• This has HUGE implications around our design and architecture.
4. Cassandra at PagerDuty
• Cassandra 1.2
• Thrift API
• Using Hector/Cassie/Astyanax
• Assigned tokens
• Putting off migrating to vnodes
• It is not big data
• Clusters ~10s of GB
• Data in the pipe is considered ephemeral
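For context (this is not from the talk): "assigned tokens" means each node carries a single, hand-computed initial_token in cassandra.yaml, while the vnodes migration being put off would replace that with num_tokens. A minimal sketch of evenly spacing tokens for a five-node ring, assuming the Cassandra 1.2 default Murmur3Partitioner:
# One token per node, evenly spaced over the Murmur3 range [-2^63, 2^63)
python -c 'n = 5; print("\n".join(str((2**64 // n) * i - 2**63) for i in range(n)))'
# Goes into each node's cassandra.yaml as:  initial_token: <one value from above>
# The vnodes alternative would instead set:  num_tokens: 256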
5. Cassandra at PagerDuty
[Diagram: three datacenters, DC-A, DC-B, and DC-C, with inter-DC latencies of roughly 5–20 ms]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
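The exact keyspace definition isn't shown in the deck, but an RF of 5 spread over three DCs is typically expressed with NetworkTopologyStrategy. A hypothetical sketch only; the keyspace name, DC names, and the 2/2/1 split are assumptions, not PagerDuty's actual configuration:
cqlsh <<'EOF'
CREATE KEYSPACE pipeline
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC-A': 2, 'DC-B': 2, 'DC-C': 1};
EOF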
6. Cassandra at PagerDuty
• Operations cross the WAN and take an inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front, we’re not latency sensitive, but throughput sensitive.
• We get consistent read/write operations.
• Events aren’t lost. Messages aren’t repeated.
• We get availability in the face of the loss of an entire DC region.
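Those guarantees follow from quorum arithmetic. A quick sanity check (a sketch; it assumes no single region holds more than two of the five replicas):
# RF = 5, so QUORUM = floor(5/2) + 1 = 3
# quorum reads + quorum writes: 3 + 3 = 6 > 5, so every quorum read overlaps the latest quorum write
# losing an entire region removes at most 2 replicas, leaving 3 -> quorum is still reachable
python -c 'rf = 5; q = rf // 2 + 1; print(q, q + q > rf, rf - 2 >= q)'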
7. What Happened?
• Everything fell apart and our critical pipeline began refusing new events and halted progress on existing ones.
• Caused degraded performance and a three-hour outage in PagerDuty
• Unprecedented flush of in-flight data
• Gory details on the impact can be found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
8. What Happened…
• It was just a semi-regular day…
• …no particular changes in traffic
• …no particular changes in volume
• We had an incident the day before
• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.
• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3/5).
• The next day, we started a repair that had been foregone…
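For reference, these mitigations map to standard nodetool commands; a sketch, not PagerDuty's actual runbook:
nodetool compactionstats   # what is still compacting and how much is pending
nodetool tpstats           # per-thread-pool backlog, including the repair and compaction stages
nodetool disablethrift     # stop serving Thrift clients so the node is no longer used as a coordinator
nodetool enablethrift      # turn the Thrift interface back on once the node recovers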
10. What we did…
• Tried a few things to mitigate the damage
• Stopped less critical tenants.
• Disabled thrift interfaces
• Disabled nodes
• No discernible effect.
• Left with no choice, we blew away all data and restarted Cassandra fresh
• This only took 10 minutes after committing to do this.
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
• Then everything was fine and dandy, like sour candy.
11. So, what happened…?
WHAT WENT HORRIBLY WRONG?
• Multi-tenancy in the Cassandra cluster.
• Operational ease isn’t worth the transparency.
• Underprovisioning
• AWS m1.larges
• 2 cores
• 8 GB RAM ← definitely not enough.
• Poor monitoring and high-water marks
• A twisted desire to get everything out of our little cluster
12. Why we didn’t see it coming…
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER.
• Everything was fine 99% of the time.
• Read/write latencies close to the inter-DC latencies.
• Despite load being relatively high sometimes.
• Cassandra seems to have two modes: fine and catastrophe
• We thought, “we don’t have much data, it should be able to handle this.”
• Thought we must have misconfigured something. We didn’t need to scale up…
13. What we should have seen…
CONSTANT MEMORY PRESSURE
[Memory-usage graphs: one annotated “This is bad” (constant pressure), one annotated “This is good”]
14. What we should have seen…
• Consistent memtable flushing
• “Flushing CFS(…) to relieve memory pressure”
• Slower repair/compaction times
• Likely related to the memory pressure
• Widening disparity between median and p95 read/write latencies
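Each of these signals can be pulled straight from a node. A rough sketch (the log path assumes a stock package install):
grep -c "to relieve memory pressure" /var/log/cassandra/system.log   # emergency memtable flushes
nodetool compactionstats                                              # pending and long-running compactions
nodetool proxyhistograms                                              # coordinator read/write latency distribution (compare the median to the tail)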
15. What we changed…
THE AFTERMATH WAS ROUGH…
• Immediately replaced all nodes with m2.2xlarges
• 4 cores
• 32 GB RAM
• No more multi-tenancy.
• Required nasty service migrations
• Began watching a lot of pending-task metrics
• Blocked flush writers
• Dropped messages
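These counters are all visible in nodetool tpstats (and over JMX, which is what dashboards usually scrape); roughly:
nodetool tpstats   # per-pool Active / Pending / Blocked counts (watch FlushWriter "All time blocked")
                   # plus a dropped-message summary by type (MUTATION, READ, ...) at the bottom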
16. Lessons Learned
• Cassandra’s performance degradation is steep.
• Stay ahead of the scaling curve.
• Jump on any warning signs.
• Practice scaling. Be able to do it on short notice.
• Cassandra performance deteriorates with changes in the data set and asynchronous, eventual consistency.
• Just because your latencies were one way doesn’t mean they’re supposed to be that way.
• Don’t build for multi-tenancy in your cluster.
17. PS. We’re hiring Cassandra people (enthusiast to expert) on our Realtime and Persistence teams.
Thank you.
http://www.pagerduty.com/company/work-with-us/
http://bit.ly/1ym8j9g