Cassandra Summit 2013: How not to use Cassandra


  1. How not to use Cassandra. Axel Liljencrantz, liljencrantz@spotify.com. June 17, 2013. #Cassandra13
  2. About me
  3. The Spotify backend
  4. The Spotify backend
     • Around 4,000 servers in 4 datacenters
     • Volumes:
       - We have ~12 soccer fields of music
       - Streaming ~4 Wikipedias/second
       - ~24,000,000 active users
  5. The Spotify backend
     • Specialized software powering Spotify
       - ~70 services
       - Mostly Python, some Java
       - Small, simple services, each responsible for a single task
  6. Storage needs
     • Used to be a pure PostgreSQL shop
     • Postgres is awesome, but...
       - Poor cross-site replication support
       - Write master failure requires manual intervention
       - Sharding throws most relational advantages out the window
  7. Cassandra @ Spotify
     • We started using Cassandra 2+ years ago
       - ~24 services use it by now
       - ~300 Cassandra nodes
       - ~50 TB of data
     • Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
  8. Cassandra @ Spotify
     • We started using Cassandra 2+ years ago
       - ~24 services use it by now
       - ~300 Cassandra nodes
       - ~50 TB of data
     • Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
     • So we screwed up
     • A lot
  9. How to misconfigure Cassandra
 10. Read repair
     • Repairs data from past outages during regular read operations
     • With RR, all reads request hash digests from all nodes
     • The result is still returned as soon as enough nodes have replied
     • If there is a mismatch, perform a repair
 11. Read repair
     • Useful factoid: read repair is performed across all datacenters
     • So in a multi-DC setup, all reads will result in requests being sent to every datacenter
     • We've made this mistake a bunch of times
     • New in 1.1: dclocal_read_repair
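     A cassandra-cli sketch of what that fix looks like, assuming Cassandra 1.1+ where dclocal_read_repair_chance is a per-column-family attribute; the column family name and values are illustrative:
         -- Never trigger cross-DC read repair; repair within the local DC
         -- on 10% of reads instead.
         update column family playlist_head with read_repair_chance = 0.0 and dclocal_read_repair_chance = 0.1;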
 12. Row cache
     • Cassandra can be configured to cache entire data rows in RAM
     • Intended as a memcache alternative
     • Let's enable it. What's the worst that could happen, right?
 13. Row cache
     NO!
     • Only stores full rows
     • All cache misses are silently promoted to full row slices
     • All writes invalidate the entire row
     • Don't use it unless you understand all your use cases
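     If the row cache does bite you, turning it off is a one-liner; a sketch assuming the Cassandra 1.1 cli, where caching is a per-column-family attribute:
         -- Cache keys only; stop pulling entire rows into RAM.
         update column family playlist_head with caching = 'keys_only';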
 14. Compression
     • Cassandra supports transparent compression of all data
     • The compression algorithm (Snappy) is super fast
     • So you can just enable it and everything will be better, right?
 15. Compression
     • Cassandra supports transparent compression of all data
     • The compression algorithm (Snappy) is super fast
     • So you can just enable it and everything will be better, right?
     • NO!
     • Compression disables a bunch of fast paths, slowing down fast reads
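     For reference, this is the cli setting in question (column family name illustrative); the slide's point is to benchmark reads before and after, rather than enabling it blindly:
         -- Transparent Snappy compression, 64 KB chunks.
         update column family track_metadata with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};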
 16. How to misuse Cassandra
 17. Performance worse over time
     • A freshly loaded Cassandra cluster is usually snappy
     • But when you keep writing to the same columns for a long time, the row will spread over more SSTables
     • And performance jumps off a cliff
     • We've seen clusters where reads touch a dozen SSTables on average
     • nodetool cfhistograms is your friend
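     How to check this on your own cluster; keyspace and column family names are illustrative:
         # The "SSTables" column is a histogram of how many SSTables each
         # recent read had to touch; a long tail here means trouble.
         nodetool -h localhost cfhistograms playlist playlist_head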
 18. Performance worse over time
     • CASSANDRA-5514
     • Every SSTable stores the first/last column of the SSTable
     • Time series-like data is effectively partitioned
 19. Few cross-continent clusters
     • There are few cross-continent Cassandra users
     • We are kind of on our own when it comes to some problems
     • CASSANDRA-5148
     • Disable TCP nodelay
     • Reduced packet count by 20%
 20. How not to upgrade Cassandra
 21. How not to upgrade Cassandra
     • Very few total cluster outages
       - Clusters have been up and running since the early 0.7 days, through rolling upgrades, expansions, full hardware replacements, etc.
     • Never lost any data!
       - No matter how spectacularly Cassandra fails, it has never written bad data
       - Immutable SSTables FTW
 22. Upgrade from 0.7 to 0.8
     • This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
     • Everyone claimed a rolling upgrade would work
       - It did not
     • One would expect 0.8.6 to have this fixed
     • Patched Cassandra and rolled it out a day later
     • Takeaways:
       - ALWAYS try rolling upgrades in a testing environment
       - Don't believe what people on the Internet tell you
 23. Upgrade from 0.8 to 1.0
     • We tried upgrading in the test env, worked fine
     • Worked fine in production...
     • Except the last cluster
     • All data gone
 24. Upgrade from 0.8 to 1.0
     • We tried upgrading in the test env, worked fine
     • Worked fine in production...
     • Except the last cluster
     • All data gone
     • Many keys per SSTable ⇾ corrupt bloom filters
     • Made Cassandra think it didn't have any keys
     • Scrub data ⇾ fixed
     • Takeaway: ALWAYS test upgrades using production data
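     The scrub step above, roughly; it rewrites every SSTable on a node, rebuilding bloom filters and indexes along the way (keyspace/column family names are illustrative):
         # Rewrite this node's SSTables for one CF, rebuilding bloom filters.
         nodetool -h localhost scrub playlist playlist_head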
 25. Upgrade from 1.0 to 1.1
     • After the previous upgrades, we did all the tests with production data and everything worked fine...
     • Until we redid it in production, and we had reports of missing rows
     • Scrub ⇾ restart made them reappear
     • This was in December; we have not been able to reproduce it since
     • PEBKAC?
     • Takeaway: ?
 26. How not to deal with large clusters
 27. Coordinator
     • The coordinator performs partitioning and passes the request on to the right nodes
     • Merges all responses
 28. What happens if one node is slow?
 29. What happens if one node is slow?
     Many reasons for temporary slowness:
     • Bad RAID battery
     • Sudden bursts of compaction/repair
     • Bursty load
     • Network hiccup
     • Major GC
     • Reality
 30. What happens if one node is slow?
     • The coordinator has a request queue
     • If a node goes down completely, gossip will notice quickly and drop the node
     • But what happens if a node is just super slow?
 31. What happens if one node is slow?
     • Gossip doesn't react quickly to slow nodes
     • The request queue for the coordinator on every node in the cluster fills up
     • And the entire cluster stops accepting requests
 32. What happens if one node is slow?
     • Gossip doesn't react quickly to slow nodes
     • The request queue for the coordinator on every node in the cluster fills up
     • And the entire cluster stops accepting requests
     • No single point of failure?
 33. What happens if one node is slow?
     • Solution: partitioner awareness in the client
     • Worst case, only the ~3 nodes replicating the slow node's ranges are affected, not the whole cluster
     • Available in Astyanax
 34. How not to delete data
 35. How not to delete data
     How is data deleted?
     • SSTables are immutable, so we can't remove the data
     • Cassandra creates tombstones for deleted data
     • Tombstones are versioned the same way as any other write
 36. How not to delete data
     Do tombstones ever go away?
     • During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
     • Once a tombstone is the only value for a specific column, the tombstone can go away
     • We still need a grace time to handle node downtime
 37. How not to delete data
     • Tombstones can only be deleted once all non-tombstone values have been deleted
     • Tombstones can only be deleted if all values for the specified row are being compacted together
     • If you're using SizeTiered compaction, old rows will rarely get deleted
 38. How not to delete data
     • Tombstones are a problem even when using levelled compaction
     • In theory, 90% of all rows should live in a single SSTable
     • In production, we've found that only 50-80% of all reads hit just one SSTable
     • In fact, frequently updated columns will exist in most levels, causing tombstones to stick around
 39. How not to delete data
     • Deletions are messy
     • Unless you perform major compactions, tombstones will rarely get deleted
     • The problem is much worse for «popular» rows
     • Avoid schemas that delete data!
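     What a delete actually does on disk, as a cassandra-cli sketch; the column family, key, and grace period value are illustrative (gc_grace is the cli name for the tombstone grace period, in seconds):
         -- This writes a tombstone; nothing is physically removed yet.
         del playlist_head[utf8('spotify:user:alice:playlist:123')];
         -- Tombstones become purgeable only after gc_grace seconds, and then
         -- only under the compaction conditions described above.
         update column family playlist_head with gc_grace = 864000;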
 40. TTL:ed data
     • Cassandra supports TTL:ed data
     • Once TTL:ed data expires, it should just be compacted away, right?
     • We know we don't need the data anymore, so no need for a tombstone; it should be fast, right?
 41. TTL:ed data
     • Cassandra supports TTL:ed data
     • Once TTL:ed data expires, it should just be compacted away, right?
     • We know we don't need the data anymore, so no need for a tombstone; it should be fast, right?
     • Noooooo...
     • (Overwritten data could theoretically bounce back)
 42. TTL:ed data
     • CASSANDRA-5228
     • Drop entire SSTables when all columns are expired
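     Setting a TTL, for reference (cassandra-cli; column family, key, and column names are illustrative):
         -- Expires after 24 hours, but until CASSANDRA-5228 the expired
         -- column still becomes a tombstone that must be compacted away.
         set sessions[utf8('user:alice')][utf8('token')] = utf8('abc123') with ttl = 86400;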
 43. The Playlist service
     Our most complex service
     • ~1 billion playlists
     • 40,000 reads per second
     • 22 TB of compressed data
 44. The Playlist service
     Our old playlist system had many problems:
     • Stored data across hundreds of millions of files, making the backup process really slow
     • Home-brewed replication model that didn't work very well
     • Frequent downtimes, huge scalability problems
 45. The Playlist service
     Our old playlist system had many problems:
     • Stored data across hundreds of millions of files, making the backup process really slow
     • Home-brewed replication model that didn't work very well
     • Frequent downtimes, huge scalability problems
     • Perfect test case for Cassandra!
 46. Playlist data model
     • Every playlist is a revisioned object
     • Think of it like a distributed versioning system
     • Allows concurrent modification on multiple offlined clients
     • We even have an automatic merge conflict resolver that works really well!
     • That's actually a really useful feature
 47. Playlist data model
     • Every playlist is a revisioned object
     • Think of it like a distributed versioning system
     • Allows concurrent modification on multiple offlined clients
     • We even have an automatic merge conflict resolver that works really well!
     • That's actually a really useful feature ...said no one ever
 48. Playlist data model
     • A sequence of changes
     • The changes are the authoritative data
     • Everything else is an optimization
     • Cassandra is pretty neat for storing this kind of stuff
     • Can use consistency level ONE safely
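     A cassandra-cli sketch of that change-log model; the column family, key, and column names are illustrative (in practice the column name would be a TimeUUID so changes sort by time):
         -- Each change is a new, immutable column, so writes never conflict
         -- and a CL ONE read can be stale but never wrong.
         consistencylevel as ONE;
         set playlist_changes[utf8('spotify:user:alice:playlist:123')][utf8('change-000042')] = utf8('add track X at position 4');
         get playlist_changes[utf8('spotify:user:alice:playlist:123')];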
 49. (diagram slide; no text content)
 50. Tombstone hell
     • The HEAD column family stores the sequence ID of the latest revision of each playlist
     • 90% of all reads go to HEAD
     • mlock
 51. Tombstone hell
     • Noticed that HEAD requests took several seconds for some lists
     • Easy to reproduce in cassandra-cli:
           get playlist_head[utf8('spotify:user...')];
     • 1-15 seconds latency; should be < 0.1 s
     • Copy SSTables to a development machine for investigation
 52. Tombstone hell
     • Noticed that HEAD requests took several seconds for some lists
     • Easy to reproduce in cassandra-cli:
           get playlist_head[utf8('spotify:user...')];
     • 1-15 seconds latency; should be < 0.1 s
     • Copy SSTables to a development machine for investigation
     • The Cassandra tool sstable2json showed that the row contained 600,000 tombstones!
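     The inspection step, roughly; the SSTable path is illustrative, and -k takes the row key hex-encoded:
         # Dump one row from one SSTable as JSON; deleted columns carry a
         # "d" deletion marker, so tombstones are easy to count.
         sstable2json /var/lib/cassandra/data/playlist/playlist_head-hd-1234-Data.db -k <hex-key>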
 53. Tombstone hell
     • WAT‽
     • The data is in the column name
     • Used to detect forks
 54. Tombstone hell
     • We expected tombstones to be deleted after 30 days
     • Nope, all tombstones from the past 1.5 years were still there
     • Revelation: rows existing in 4+ SSTables never have tombstones deleted during minor compactions
     • Frequently updated lists exist in nearly all SSTables
     Solution:
     • Major compaction (CF size cut in half)
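     Triggering a major compaction by hand (keyspace/column family names are illustrative):
         # Merge all of the CF's SSTables into one, purging eligible tombstones.
         nodetool -h localhost compact playlist playlist_head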
 55. Zombie tombstones
     • Ran major compaction manually on all nodes over a few days
     • All seemed well...
     • But a week later, the same lists took several seconds again‽‽‽
 56. Repair vs. major compactions
     A repair between the major compactions "resurrected" the tombstones :(
     New solution:
     • Repairs during Monday-Friday
     • Major compaction Saturday-Sunday
     A (by now) well-known Cassandra anti-pattern: don't use Cassandra to store queues
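     One way to express that schedule, as a cron sketch; hosts, keyspace, and times are illustrative, and it assumes each job finishes before the next begins:
         # Anti-entropy repair on weekday nights, major compaction on Saturday.
         0 2 * * 1-5  nodetool -h localhost repair playlist
         0 2 * * 6    nodetool -h localhost compact playlist playlist_head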
 57. Cassandra counters
     • There are lots of places in the Spotify UI where we count things
       - # of followers of a playlist
       - # of followers of an artist
       - # of times a song has been played
     • Cassandra has a feature called distributed counters that sounds suitable
     • Is this awesome?
 58. Cassandra counters
     • Yep
     • They've actually worked pretty well for us.
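     Counter usage, for reference (cassandra-cli; the counter column family and key are illustrative, and the CF must be created with default_validation_class = CounterColumnType):
         -- Increment and read back a distributed counter.
         incr playlist_followers[utf8('spotify:user:alice:playlist:123')][utf8('followers')];
         get playlist_followers[utf8('spotify:user:alice:playlist:123')];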
 59. Lessons
 60. How not to fail
     • Treat Cassandra as a utility belt
     • Flash
     Lots of one-off solutions:
     • Weekly major compactions
     • Delete all SSTables and recreate them from scratch every day
     • Memlock frequently used SSTables in RAM
 61. Lessons
     • Cassandra read performance is heavily dependent on the temporal patterns of your writes
     • Cassandra is initially snappy, but various write patterns make read performance slowly decrease
     • This makes benchmarks close to useless
 62. Lessons
     • Avoid repeatedly writing data to the same row over very long spans of time
     • Avoid deleting data
     • If you're working at scale, you'll need to know how Cassandra works under the hood
     • nodetool cfhistograms is your friend
 63. Lessons
     • There are still various esoteric problems with large-scale Cassandra installations
     • Debugging them is really interesting
     • If you agree with the above statements, you should totally come work with us
 64. Questions? spotify.com/jobs. June 17, 2013. #Cassandra13
