Video: https://www.youtube.com/watch?v=wbUmIacfswU
Speaker: Owen Kim, Software Engineer
Company: PagerDuty
PagerDuty had the misfortune of watching its abused, underprovisioned Cassandra cluster collapse. This talk covers the lessons learned from that experience, including:
• Which of the many, many metrics we learned to watch
• What mistakes we made that led to this catastrophe
• How we have changed our usage to make our Cassandra cluster more stable
Owen Kim is a Software Engineer at PagerDuty and enjoys whiskey, riding his Honda Shadow 600 (named "Chie"), and discussing the finer points of narrative and expression in video games.
3. 10/20/14
WATCHING YOUR CASSANDRA CLUSTER MELT
Cassandra at PagerDuty
• Used to provide durable, consistent reads/writes in a critical pipeline of service applications
• Scala, Cassandra, ZooKeeper
• Receives ~25 requests a second
• Each request becomes a handful of operations that are then processed asynchronously
• Never lose an event. Never lose a message.
• This has HUGE implications around our design and architecture.
4. Cassandra at PagerDuty
• Cassandra 1.2
• Thrift API
• Using Hector/Cassie/Astyanax
• Assigned tokens
• Putting off migrating to vnodes
• It is not big data
• Clusters ~10s of GB
• Data in the pipe is considered ephemeral
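For context (this is not from the talk): "assigned tokens" means each node carries a single, hand-computed initial_token in cassandra.yaml, while the vnodes migration being put off would replace that with num_tokens. A minimal sketch of evenly spacing tokens for a five-node ring, assuming the Cassandra 1.2 default Murmur3Partitioner:
# One token per node, evenly spaced over the Murmur3 range [-2^63, 2^63)
python -c 'n = 5; print("\n".join(str((2**64 // n) * i - 2**63) for i in range(n)))'
# Goes into each node's cassandra.yaml as:  initial_token: <one value from above>
# The vnodes alternative would instead set:  num_tokens: 256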
5. Cassandra at PagerDuty
[Diagram: three datacenters, DC-A, DC-B, and DC-C, with inter-DC latencies of roughly 5–20 ms]
• Five (or ten) nodes in three regions
• Quorum CL
• RF = 5
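The exact keyspace definition isn't shown in the deck, but an RF of 5 spread over three DCs is typically expressed with NetworkTopologyStrategy. A hypothetical sketch only; the keyspace name, DC names, and the 2/2/1 split are assumptions, not PagerDuty's actual configuration:
cqlsh <<'EOF'
CREATE KEYSPACE pipeline
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC-A': 2, 'DC-B': 2, 'DC-C': 1};
EOF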
6. Cassandra at PagerDuty
• Operations cross the WAN and take an inter-DC latency hit.
• Since we use it as our pipeline without much of a user-facing front, we’re not latency sensitive, but throughput sensitive.
• We get consistent read/write operations.
• Events aren’t lost. Messages aren’t repeated.
• We get availability in the face of the loss of an entire DC region.
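Those guarantees follow from quorum arithmetic. A quick sanity check (a sketch; it assumes no single region holds more than two of the five replicas):
# RF = 5, so QUORUM = floor(5/2) + 1 = 3
# quorum reads + quorum writes: 3 + 3 = 6 > 5, so every quorum read overlaps the latest quorum write
# losing an entire region removes at most 2 replicas, leaving 3 -> quorum is still reachable
python -c 'rf = 5; q = rf // 2 + 1; print(q, q + q > rf, rf - 2 >= q)'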
7. What Happened?
• Everything fell apart and our critical pipeline began refusing new events and halted progress on existing ones.
• Caused degraded performance and a three-hour outage in PagerDuty
• Unprecedented flush of in-flight data
• Gory details on the impact can be found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/
8. What Happened…
• It was just a semi-regular day…
• …no particular changes in traffic
• …no particular changes in volume
• We had an incident the day before
• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.
• We used ‘nodetool disablethrift’ to mitigate load on nodes that couldn’t handle being coordinators.
• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3/5).
• The next day, we started a repair that had been foregone…
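For reference, these mitigations map to standard nodetool commands; a sketch, not PagerDuty's actual runbook:
nodetool compactionstats   # what is still compacting and how much is pending
nodetool tpstats           # per-thread-pool backlog, including the repair and compaction stages
nodetool disablethrift     # stop serving Thrift clients so the node is no longer used as a coordinator
nodetool enablethrift      # turn the Thrift interface back on once the node recovers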
10. What we did…
• Tried a few things to mitigate the damage
• Stopped less critical tenants.
• Disabled thrift interfaces
• Disabled nodes
• No discernible effect.
• Left with no choice, we blew away all data and restarted Cassandra fresh
• This only took 10 minutes after committing to do this.
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
• Then everything was fine and dandy, like sour candy.
11. So, what happened…?
WHAT WENT HORRIBLY WRONG?
• Multi-tenancy in the Cassandra cluster.
• Operational ease isn’t worth the transparency.
• Underprovisioning
• AWS m1.larges
• 2 cores
• 8 GB RAM ← definitely not enough.
• Poor monitoring and high-water marks
• A twisted desire to get everything out of our little cluster
12. Why we didn’t see it coming…
OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER.
• Everything was fine 99% of the time.
• Read/write latencies close to the inter-DC latencies.
• Despite load being relatively high sometimes.
• Cassandra seems to have two modes: fine and catastrophe
• We thought, “we don’t have much data, it should be able to handle this.”
• Thought we must have misconfigured something. We didn’t need to scale up…
13. What we should have seen…
CONSTANT MEMORY PRESSURE
[Memory-usage graphs: one annotated “This is bad” (constant pressure), one annotated “This is good”]
14. What we should have seen…
• Consistent memtable flushing
• “Flushing CFS(…) to relieve memory pressure”
• Slower repair/compaction times
• Likely related to the memory pressure
• Widening disparity between median and p95 read/write latencies
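Each of these signals can be pulled straight from a node. A rough sketch (the log path assumes a stock package install):
grep -c "to relieve memory pressure" /var/log/cassandra/system.log   # emergency memtable flushes
nodetool compactionstats                                              # pending and long-running compactions
nodetool proxyhistograms                                              # coordinator read/write latency distribution (compare the median to the tail)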
15. What we changed…
THE AFTERMATH WAS ROUGH…
• Immediately replaced all nodes with m2.2xlarges
• 4 cores
• 32 GB RAM
• No more multi-tenancy.
• Required nasty service migrations
• Began watching a lot of pending-task metrics
• Blocked flush writers
• Dropped messages
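These counters are all visible in nodetool tpstats (and over JMX, which is what dashboards usually scrape); roughly:
nodetool tpstats   # per-pool Active / Pending / Blocked counts (watch FlushWriter "All time blocked")
                   # plus a dropped-message summary by type (MUTATION, READ, ...) at the bottom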
16. Lessons Learned
• Cassandra’s performance degradation is steep.
• Stay ahead of the scaling curve.
• Jump on any warning signs.
• Practice scaling. Be able to do it on short notice.
• Cassandra performance deteriorates with changes in the data set and asynchronous, eventual consistency.
• Just because your latencies were one way doesn’t mean they’re supposed to be that way.
• Don’t build for multi-tenancy in your cluster.
17. PS. We’re hiring Cassandra people (enthusiast to expert) on our Realtime and Persistence teams.
Thank you.
http://www.pagerduty.com/company/work-with-us/
http://bit.ly/1ym8j9g