PagerDuty: One Year of Cassandra Failures

Every company likes to brag about their successes, but not many are willing to talk about their failures. At PagerDuty we have been rigorously tracking downtime in order to analyze it and learn from our mistakes - we even blog about these failures publicly.

Although PagerDuty is built to be highly available, we have had three outages caused by problems with our production Cassandra clusters over the past year. We'll take a look at each of these outages: what we saw from the inside, the actions we took to recover, and, most importantly, the procedures and monitoring that will help prevent the same thing from happening to you.

Published in: Technology

  1. 2015-09-23 One Year of Cassandra Failures donny@pagerduty.com #CassandraSummit
  2. 2015-09-30 PagerDuty (simplified, circa early 2014) [architecture diagram; labels: Monitoring system, events.pagerduty.com, Cassandra, Enqueuer, Dequeuer, Event Processing, Notifier, XtraDB, Phone, SMS, Email, Push, HTTP, PagerDuty Customer]
  3. Span the WAN? Yes you can! Tomorrow at 9:50 AM, Paul Rechsteiner
  4. Outage 1 “The Backlog”
  5. Background
     • Shared cluster, 5 machines (with replication factor = 5)
     • 10s of GBs of data
     • In-flight data: 10s of MBs, maybe 100s
  6. Outage 1 - Foreshadowing
     • Series of small outages / degradations
     • Repair process started
     • High load, high latency
     • Response: disable thrift, turn off nodes
  7. Coordinator Read Latency (in ms, by host) [graph; annotations: 6 seconds, ~25 ms]
  8. Coordinator Read Latency (in ms, by host) [graph]
  9. Coordinator Read Latency (in ms, by host) [graph]
  10. Coordinator Read Latency (in ms, by host) [graph]
  11. Coordinator Read Latency (in ms, by host) [graph]
  12. The Next Day…
  13. The Plan
      • Trigger repair… with lots of people watching
      • Use our load shedding strategies for any problems:
        • Proactively disable non-critical services
        • Disable thrift
  14. Surprise!
      • Cron triggers a different repair
      • Plus a compaction for a large CF
  15. Outgoing Notification Backlog Size [graph; levels: Normal, Bad, Horrible]
  16. Outgoing Notification Backlog Size [graph; levels: Normal, Bad, Horrible :(]
  17. Cassandra Pending Tasks: ReadStage (by host) [graph; annotation: Over 9000]
  18. Cassandra CPU (by host) [graph; annotation: 100%]
  19. Factory Reset: Success… kind of
  20. Aftermath: The Investigation
      • Huge investigation
      • Silver lining: learned a lot
      • Host metrics (CPU, network, etc.) fine most of the time
      • Need to look at Cassandra metrics for leading indicators
  21. Investigation Conclusion
      • Under-provisioned (mainly CPU)
      • No partial progress
  22. Lessons
      • Capacity planning
      • Important even with low volume
      • Cassandra-specific monitoring
      • Isolation
  23. Lessons - Metrics For Cassandra
      • Dropped messages (leading)
      • Blocked flush writers (leading)
      • GC behavior (leading)
      • Pending tasks: ReadStage, ResponseStage, etc. (lagging)
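One way to watch several of the Cassandra-specific metrics listed on slide 23 is to parse the thread pool report printed by `nodetool tpstats`. The Python sketch below is an illustration, not the tooling used at PagerDuty. The sample text, the function names, and the pending-task limit of 5 (borrowed from the "Should be small, < 5" note on slide 33) are assumptions. GC behavior is not covered because it is not part of that report, and the column layout may differ between Cassandra versions.

```python
# Illustrative sample in the general layout of `nodetool tpstats`.
# The numbers are made up for this sketch, not taken from the outage.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9001        1234567         0                 0
MutationStage                     0         0        7654321         0                 0
FlushWriter                       1         1            321         1                 7

Message type           Dropped
READ                        42
MUTATION                     0
"""

PENDING_LIMIT = 5  # hypothetical alert threshold


def parse_tpstats(text):
    """Split the report into per-pool counters and dropped-message counts."""
    pools, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:2] == ["Pool", "Name"]:
            section = "pools"
        elif fields[:2] == ["Message", "type"]:
            section = "dropped"
        elif (section == "pools" and len(fields) == 6
              and all(f.isdigit() for f in fields[1:])):
            pools[fields[0]] = {
                "active": int(fields[1]),
                "pending": int(fields[2]),
                "blocked": int(fields[4]),
                "all_time_blocked": int(fields[5]),
            }
        elif section == "dropped" and len(fields) >= 2 and fields[1].isdigit():
            dropped[fields[0]] = int(fields[1])
    return pools, dropped


def danger_signals(pools, dropped, pending_limit=PENDING_LIMIT):
    """Return alert strings: leading indicators first, then lagging ones."""
    alerts = []
    for name, count in sorted(dropped.items()):
        if count > 0:
            alerts.append(f"dropped {name}: {count}")
    flush = pools.get("FlushWriter")
    if flush and flush["all_time_blocked"] > 0:
        alerts.append(f"blocked FlushWriter: {flush['all_time_blocked']} (all time)")
    for name, stats in sorted(pools.items()):
        if stats["pending"] > pending_limit:
            alerts.append(f"pending {name}: {stats['pending']}")
    return alerts


pools, dropped = parse_tpstats(SAMPLE_TPSTATS)
for alert in danger_signals(pools, dropped):
    print(alert)
```

The dropped and blocked counters are cumulative, so a real check would more likely alert on how fast they grow between two samples than on any non-zero value.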
  24. Outage 2 “Aliens”
  25. Changes
      • Isolated clusters for everyone
      • New service: heaviest Cassandra user so far
      • Upgrade Cassandra version
  26. Application Logs
      ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
      com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:(String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
          at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
          at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
          at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
          at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
  27. “Cassandra Danger Metrics” (Partial)
  28. cassandra-cli - “describe cluster” - Bad Output
      [default@Artemis] describe cluster;
      Cluster Information:
         Name: prod-artemis
         Snitch: org.apache.cassandra.locator.PropertyFileSnitch
         Partitioner: org.apache.cassandra.dht.RandomPartitioner
         Schema versions:
            52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
            676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
  29. [no text on this slide]
  30. cassandra-cli - “describe cluster” - Good Output
      [default@unknown] describe cluster;
      Cluster Information:
         Name: prod-artemis
         Snitch: org.apache.cassandra.locator.PropertyFileSnitch
         Partitioner: org.apache.cassandra.dht.RandomPartitioner
         Schema versions:
            676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
  31. Notifications Sent [graph]
  32. Application-Measured Cassandra Call Latency (in ms, by CF) [graph; annotation: 15 seconds]
  33. Pending Tasks: MutationStage [graph; annotations: 22,000; “Should be small, < 5”]
  34. Actions
      17:01:21 disable thrift
      17:02:08 kill repair
      17:02:35 kill dash nine
  35. Cassandra Operations (cluster-wide, by CF) [graph]
  36. Cassandra Operations (cluster-wide, by CF) [graph; annotations: disable thrift, kill repair, kill -9]
  37. Puzzle
      • Why did one bad Cassandra node have such a huge effect?
  38. Bad Coordinator
      Timeout vs average request: 10,000 ms / 25 ms = 400
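The ratio on slide 38 means that one request waiting out the full 10,000 ms timeout occupies a client worker for as long as 400 normal 25 ms requests. The short Python calculation below illustrates how that can drag down a whole cluster. It rests on assumptions that are not stated in the slides: requests are spread evenly over the 7 hosts listed in the `describe cluster` output, exactly one host is bad, and every request coordinated by that host takes the full timeout.

```python
TIMEOUT_MS = 10_000   # from slide 38
NORMAL_MS = 25        # from slide 38
NODES = 7             # hosts listed in the describe cluster output
BAD_NODES = 1         # assumption: a single bad coordinator


def mean_latency_ms(nodes=NODES, bad=BAD_NODES,
                    normal_ms=NORMAL_MS, timeout_ms=TIMEOUT_MS):
    """Average latency when requests are spread evenly over all coordinators."""
    return ((nodes - bad) * normal_ms + bad * timeout_ms) / nodes


ratio = TIMEOUT_MS / NORMAL_MS   # cost of one timed-out request, in normal requests
mean = mean_latency_ms()         # average latency seen by a client
slowdown = mean / NORMAL_MS      # throughput loss for a worker issuing requests serially

print(f"one timeout costs as much as {ratio:.0f} normal requests")
print(f"mean latency: {mean:.0f} ms, about {slowdown:.0f}x slower than normal")
```

Under these assumptions, a single bad node out of seven raises the average latency from 25 ms to 1,450 ms, so a worker that issues requests one after another completes roughly 58 times fewer of them.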
  39. What Happened To Cassandra?
  40. What Happened To Cassandra?
  41. Lessons
      • Isolated clusters pay off
      • How to do schema changes:
        1. describe cluster;
        2. <schema change for one CF>
        3. describe cluster;
      • Monitor for schema disagreement
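The "monitor for schema disagreement" lesson can be sketched as a check on the text printed by `describe cluster;`. The Python example below is a minimal illustration, assuming the output has already been captured as a string. The function names are hypothetical, and the sample strings are trimmed copies of the bad and good outputs on slides 28 and 30, with the redacted host addresses left as they appear there.

```python
import re

# A schema version line looks like "<uuid>: [host, host, ...]".
SCHEMA_LINE = re.compile(
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}):\s*\[([^\]]*)\]"
)

BAD_OUTPUT = """\
Cluster Information:
   Name: prod-artemis
   Schema versions:
      52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""

GOOD_OUTPUT = """\
Cluster Information:
   Name: prod-artemis
   Schema versions:
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""


def schema_versions(describe_output):
    """Map each schema version UUID to the list of hosts reporting it."""
    _, _, tail = describe_output.partition("Schema versions:")
    return {
        version: [h.strip() for h in hosts.split(",") if h.strip()]
        for version, hosts in SCHEMA_LINE.findall(tail)
    }


def in_agreement(describe_output):
    """True only when exactly one schema version is reported."""
    return len(schema_versions(describe_output)) == 1


for label, text in (("bad", BAD_OUTPUT), ("good", GOOD_OUTPUT)):
    versions = schema_versions(text)
    print(label, "agreement:", in_agreement(text), "versions:", len(versions))
```

Running the same check before and after each schema change mirrors the three-step procedure on slide 41. Output with no recognizable version line is treated as "not in agreement" so that a failed command does not pass silently.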
  42. Outage 3 “Human Error”
  43. Application-Measured Cassandra Call Latency (ms, by CF) [graph; annotations: 8 seconds; Normal: ~25 ms]
  44. “Cassandra Danger Metrics” (partial)
  45. Cassandra Logs (on working hosts)
      INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)
  46. commitlog Directory
      ls -la /var/lib/cassandra/commitlog/
      total 1015360
      drwxr-xr-x 2 cassandra root          4096 2014-12-18 03:36 .
      drwxr-xr-x 6 cassandra root          4096 2014-08-19 17:00 ..
      -- SNIP --
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533553.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533554.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
      -rw-r--r-- 1 root      root      33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737851.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737852.log
      -rw-r--r-- 1 root      root      33554432 2014-11-26 21:39 CommitLog-2-1418873737853.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873800630.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873812840.log
  47. The Culprit
      Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
      Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db
  48. sstable2json
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
  49. sstable2json
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
  50. sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
  51. sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
          at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
          at java.io.RandomAccessFile.open(Native Method)
          at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
  52. sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      Success!
  53. [no text on this slide]
  54. Cassandra Thread Dump
      "MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
          at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
          at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
          at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
          at org.apache.cassandra.db.Table.apply(Table.java:375)
          at org.apache.cassandra.db.Table.apply(Table.java:354)
          at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
          at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
          at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
  55. Cassandra Thread Dump
      "COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
          at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
          at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
          at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
          at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
          at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
  56. Cassandra Logs - Commit Log Allocator
      Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
          ... 4 more
  57. Lessons
      • Be careful what habits you develop
      • Tools should be as isolated & focused as possible
      • Process startup code can create time bombs
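The time bomb in this outage was a pair of root-owned segments sitting in `/var/lib/cassandra/commitlog/` (slide 46), left behind when `sstable2json` was run under `sudo`. A periodic ownership check is one way such files could be caught before Cassandra tries to reuse them. The Python sketch below is an assumption about how that check might look, not something shown in the talk. The function name is hypothetical, and the demo runs against a throwaway directory instead of a real commit log directory.

```python
import os
import tempfile


def foreign_owned(directory, expected_uid):
    """Return sorted names of entries in `directory` not owned by `expected_uid`."""
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.stat(follow_symlinks=False).st_uid != expected_uid
    )


# Demo in a temporary directory so the sketch runs without Cassandra installed.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "CommitLog-demo.log"), "w").close()
    me = os.geteuid()
    # Everything here was created by the current user, so nothing is flagged.
    flagged_for_me = foreign_owned(demo_dir, me)
    # Pretend a different user was expected: the file is now reported.
    flagged_for_other = foreign_owned(demo_dir, me + 1)

print("flagged when the owner matches:", flagged_for_me)
print("flagged when the owner differs:", flagged_for_other)
```

On a real host the expected uid could be looked up with `pwd.getpwnam("cassandra").pw_uid`, and any name returned for the commit log directory would be worth an alert.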
  58. Concluding Thoughts
  59. donny@pagerduty.com Thank you.
