Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

Nicolas' talk from Cassandra Europe on March 28th 2012.

  1. Cassandra storage internals. Nicolas Favre-Felix, Cassandra Europe 2012
  2. What this talk covers • What happens within a Cassandra node • How Cassandra reads and writes data • What compaction is and why we need it • How counters are stored, modified, and read
  3. Concepts • Memtables • SSTables • Commit log • Key cache • Row cache • On heap, off-heap • Compaction • Bloom filters • SSTable index • Counters
  4. Why is this important? • Understand what goes on under the hood • Understand the reasons for these choices • Diagnose issues • Tune Cassandra for performance • Make your data model efficient
  5. A word about hard drives
  6. A word about hard drives • Main driver behind Cassandra’s storage choices • The last moving part • Fast sequential I/O (150 MB/s) • Slow random I/O (120-200 IOPS)
  7. What SSDs bring • Fast sequential I/O • Fast random I/O • Higher cost • Limited lifetime • Performance degradation
  8. Disk usage with B-trees • Important data structure in relational databases • In-place overwrites (random I/O) • log_B(N) random accesses for reads and writes
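
As a rough illustration of that cost (the branching factor and key count below are assumed for the example, not taken from the talk): a B-tree over N keys with branching factor B needs about log_B(N) page reads per lookup, and each page read is a random I/O.

```latex
% Illustrative numbers only: B = 100 children per node, N = 10^8 keys.
\log_B(N) \;=\; \frac{\log_{10} N}{\log_{10} B}
        \;=\; \frac{\log_{10} 10^{8}}{\log_{10} 10^{2}}
        \;=\; \frac{8}{2}
        \;=\; 4 \text{ random reads per lookup}
```
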
  9. Disk usage with Cassandra • Made for spinning disks • Sequential writes, much less than 1 I/O per insert • Several layers of cache • Random reads, approximately 1 I/O per read • Generally “write-optimised”
  10. Writing to Cassandra
  11. Writing to Cassandra • Let’s add a row with a few columns (diagram: a row key followed by several columns)
  12. The Cassandra write path (diagram: new data goes to the Memtable in the JVM and to the commit log on disk)
  13. The Commit Log • Each write is added to a log file • Guarantees durability after a crash • 1-second window during which data is still in RAM • Sequential I/O • A dedicated disk is recommended
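
A minimal sketch of the append-then-sync idea behind a commit log; this is not Cassandra's implementation, and the record format and class name are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative append-only log: every mutation is appended sequentially,
// and the file is periodically forced to disk so writes survive a crash.
public final class AppendOnlyLogSketch implements AutoCloseable {
    private final FileChannel channel;

    public AppendOnlyLogSketch(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Append one record as [length][payload]: sequential I/O only.
    public void append(byte[] mutation) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4 + mutation.length);
        buf.putInt(mutation.length).put(mutation).flip();
        channel.write(buf);
    }

    // Called periodically (roughly the 1-second window from the slide):
    // until this returns, recent writes may only live in the OS page cache.
    public void sync() throws IOException {
        channel.force(false);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```
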
  14. Memtables • In-memory key/value data structure • Implemented with ConcurrentSkipListMap • One per column family • Very fast inserts • Columns are merged in memory for the same key • Flushed at a certain threshold, into an SSTable
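
A minimal sketch of that structure, not Cassandra's code: one sorted concurrent map per column family, keyed by row key, with columns for the same key merged in memory so that the newest timestamp wins. Class and method names are invented.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative memtable: a ConcurrentSkipListMap keeps rows sorted by key,
// and each row keeps its columns sorted by name.
public final class MemtableSketch {
    public record Column(String name, byte[] value, long timestamp) {}

    // rowKey -> (columnName -> column)
    private final ConcurrentNavigableMap<String, ConcurrentSkipListMap<String, Column>> rows =
            new ConcurrentSkipListMap<>();

    // Insert a column; writes to the same row key are merged in memory,
    // keeping the version with the newest timestamp.
    public void put(String rowKey, Column column) {
        rows.computeIfAbsent(rowKey, k -> new ConcurrentSkipListMap<>())
            .merge(column.name(), column,
                   (oldCol, newCol) -> newCol.timestamp() >= oldCol.timestamp() ? newCol : oldCol);
    }

    public Column get(String rowKey, String columnName) {
        ConcurrentSkipListMap<String, Column> row = rows.get(rowKey);
        return row == null ? null : row.get(columnName);
    }
}
```
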
  15. Dumping a Memtable on disk (diagram: a full Memtable in the JVM; the commit log on disk)
  16. Dumping a Memtable on disk (diagram: a new Memtable in the JVM; the commit log and a freshly written SSTable on disk)
  17. The SSTable • One file, written sequentially • Columns are in order, grouped by row • Immutable once written, no updates!
  18. SSTables start piling up! (diagram: one Memtable in the JVM; the commit log and many SSTables on disk)
  19. SSTables • Can’t keep all of them forever • Need to reclaim disk space • Reads could touch several SSTables • Scans touch all of them • In-memory data structures per SSTable
  20. Compacting SSTables
  21. Compaction • Merge SSTables of similar size together • Remove overwrites and deleted data (timestamps) • Improve range query performance • Major compaction creates a single SSTable • I/O intensive operation
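
A minimal sketch of the merge step, assuming each input SSTable is represented here as an in-memory sorted map of key to (value, timestamp, tombstone); real compaction streams rows from files and only drops tombstones after gc_grace_seconds, which is omitted.

```java
import java.util.List;
import java.util.TreeMap;

// Illustrative compaction: merge several sorted inputs into one output,
// keeping only the most recently written version of each key and
// dropping keys whose winning version is a deletion marker.
public final class CompactionSketch {
    public record Versioned(byte[] value, long timestamp, boolean tombstone) {}

    public static TreeMap<String, Versioned> merge(List<TreeMap<String, Versioned>> sstables) {
        TreeMap<String, Versioned> merged = new TreeMap<>();
        for (TreeMap<String, Versioned> sstable : sstables) {
            sstable.forEach((key, candidate) ->
                merged.merge(key, candidate,
                             (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        }
        // Removing overwritten and deleted data is what reclaims disk space.
        merged.values().removeIf(Versioned::tombstone);
        return merged;
    }
}
```
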
  22. Recent improvements • Pluggable compaction • Different strategies, chosen per column family • SSTable compression • More efficient SSTable merges
  23. Reading from Cassandra
  24. Reading from Cassandra • Reading all these SSTables would be very inefficient • We have to read from memory as much as possible • Otherwise we need to do 2 things efficiently: • Find the right SSTable to read from • Find where in that SSTable to read the data
  25. First step for reads • The Memtable! • Read the most recent data • Very fast, no need to touch the disk
  26. (diagram: the row cache, off-heap with no GC; the Memtable in the JVM; the commit log and an SSTable on disk)
  27. Row cache • Stores a whole row in memory • Off-heap, not subject to Garbage Collection • Size is configurable per column family • Last resort before having to read from disk
  28. Finding the right SSTable (diagram: the Memtable in the JVM; the commit log and many SSTables on disk)
  29. Bloom filter • Saved with each SSTable • Answers “contains(Key) :: boolean” • Saved on disk but kept in memory • Probabilistic data structure • Configurable proportion of false positives • No false negatives
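
A minimal Bloom filter sketch showing the “contains(Key) :: boolean” contract; the sizing and hashing scheme are arbitrary and differ from Cassandra's own implementation.

```java
import java.util.BitSet;

// Illustrative Bloom filter: add() sets k bit positions per key;
// mightContain() can return false positives, but never false negatives.
public final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // A "false" answer is definitive; a "true" answer may still miss on disk.
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;  // force odd so the step spreads positions out
        return Math.floorMod(h1 + i * h2, numBits);
    }
}
```
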
  30. Bloom filter (diagram: exists(key)? is asked of the Bloom filter of each SSTable, which answers true or false before the disk is touched)
  31. Reading from an SSTable • We need to know where in the file our data is saved • Keys are sorted, why don’t we do a binary search? • Keys are not all the same size • Jumping around in a file is very slow • log2(N) random I/O, ~20 for 1 million keys
  32. Reading from an SSTable • Let’s index key ranges in the SSTable (diagram: index entries such as key k-128 at position 12098, key k-256 at position 23445, key k-384 at position 43678)
  33. SSTable index • Saved with each SSTable • Stores key ranges and their offsets: [(Key, Offset)] • Saved on disk but kept in memory • Avoids searching for a key by scanning the file • Configurable key interval (default: 128)
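
A minimal sketch of how such a sparse index is used, assuming one entry is kept for every 128th key as the slide says; floorEntry finds the closest indexed key at or before the one requested, so at most one interval of the file has to be scanned. Names are invented.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sparse SSTable index: one (key, file offset) entry is kept
// for every INDEX_INTERVAL-th key written to the data file.
public final class SparseIndexSketch {
    static final int INDEX_INTERVAL = 128;

    private final TreeMap<String, Long> offsets = new TreeMap<>();

    // Called while the SSTable is written: remember every 128th key's offset.
    public void maybeIndex(long keyOrdinal, String key, long fileOffset) {
        if (keyOrdinal % INDEX_INTERVAL == 0) {
            offsets.put(key, fileOffset);
        }
    }

    // Where to start reading for 'key': the offset of the greatest indexed
    // key <= key, so the scan covers at most one interval instead of the file.
    public long scanStartOffset(String key) {
        Map.Entry<String, Long> entry = offsets.floorEntry(key);
        return entry == null ? 0L : entry.getValue();
    }
}
```
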
  34. SSTable index (diagram: the Memtable, Bloom filter, and SSTable index in the JVM; the commit log and an SSTable on disk)
  35. Sometimes not enough • Storing key ranges is limited • We can do better by storing the exact offset • This saves approximately one I/O
  36. The key cache (diagram: the Memtable, Bloom filter, key cache, and SSTable index in the JVM; the commit log and an SSTable on disk)
  37. Key cache • Stores the exact location in the SSTable • Stored in heap • Avoids having to scan a whole index interval • Size is configurable per column family
  38-44. (diagram, repeated over seven slides, showing the full read path: 1. Memtable in the JVM, 2. row cache off-heap with no GC, 3. Bloom filter, 4. key cache, 5. SSTable index, 6. SSTable on disk, with the commit log alongside on disk)
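
A minimal sketch of a read following the numbering in the diagram above; every interface here is a stand-in invented for illustration (in reality there is one Bloom filter, index, and data file per SSTable, and several SSTables may have to be consulted).

```java
import java.util.Optional;

// Illustrative read path: 1. Memtable, 2. row cache, 3. Bloom filter,
// 4. key cache, 5. SSTable index, 6. read from the SSTable on disk.
public final class ReadPathSketch {
    interface Store { Optional<byte[]> get(String rowKey); }                   // memtable or row cache
    interface Filter { boolean mightContain(String rowKey); }                  // Bloom filter
    interface Offsets { Optional<Long> cachedOffset(String rowKey);            // key cache
                        long scanStartOffset(String rowKey); }                 // sparse index
    interface DataFile { Optional<byte[]> readAt(long offset, String rowKey); }

    public Optional<byte[]> read(String key, Store memtable, Store rowCache,
                                 Filter bloomFilter, Offsets offsets, DataFile sstable) {
        Optional<byte[]> fromMemtable = memtable.get(key);      // 1. most recent data, in memory
        if (fromMemtable.isPresent()) return fromMemtable;

        Optional<byte[]> fromRowCache = rowCache.get(key);      // 2. whole rows, off-heap
        if (fromRowCache.isPresent()) return fromRowCache;

        if (!bloomFilter.mightContain(key)) {                   // 3. skip SSTables that cannot match
            return Optional.empty();
        }
        long offset = offsets.cachedOffset(key)                 // 4. exact position if cached,
                .orElseGet(() -> offsets.scanStartOffset(key)); // 5. else nearest indexed position
        return sstable.readAt(offset, key);                     // 6. roughly one disk I/O
    }
}
```
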
  45. Distributed counters
  46. Distributed counters • 64-bit signed integer, replicated in the cluster • Atomic inc and dec by an arbitrary amount • Counting with read-inc-write would be inefficient • Stored differently from regular columns
  47. Consider a cluster with 3 nodes, RF=3
  48. Internal counter data • List of increments received by the local node • Summaries (Version, Sum) sent by the other nodes • The total value is the sum of all counts
  49. Internal counter data (diagram: local increments +5, +2, -3; received from one replica: version 3, count 5; received from another replica: version 5, count 10)
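
A minimal sketch of that per-replica state (names invented, the real on-disk layout differs): each node keeps the list of increments it received directly, plus the latest (version, sum) summary heard from every other replica, and the total value is the sum of all of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative counter state held by one replica.
public final class CounterShardSketch {
    public record Summary(long version, long sum) {}

    private final List<Long> localIncrements = new ArrayList<>();       // e.g. +5, +2, -3
    private final Map<String, Summary> receivedFrom = new HashMap<>();  // other node -> latest summary

    // A local increment is simply appended to the list.
    public void incrementLocally(long delta) {
        localIncrements.add(delta);
    }

    // The coordinator reads back the sum of its own increments to build
    // the (version, sum) summary it forwards to the other replicas.
    public Summary localSummary(long newVersion) {
        return new Summary(newVersion, localSum());
    }

    // A summary from another replica replaces the stored one only if newer.
    public void receiveSummary(String nodeId, Summary summary) {
        receivedFrom.merge(nodeId, summary,
                (old, neu) -> neu.version() > old.version() ? neu : old);
    }

    // Total value = sum of local increments + sums received from the other nodes.
    public long total() {
        long total = localSum();
        for (Summary s : receivedFrom.values()) total += s.sum();
        return total;
    }

    private long localSum() {
        return localIncrements.stream().mapToLong(Long::longValue).sum();
    }
}
```
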
  50-54. Incrementing a counter • A coordinator node is chosen (blue node) • Stores its increment locally (diagram: local increments become +5, +2, -3, +1) • Reads back the sum of its increments • Forwards a summary to the other replicas: (v.4, sum 5) • Replicas update their records: version 4, count 5
  55. Reading a counter • Replicas return their counts and versions • Including what they know about other nodes • Only the most recent versions are kept
  56-58. Reading a counter (diagram: one replica returns { v.3, count 5; v.6, count 12; v.2, count 8 }, another returns { v.3, count 5; v.5, count 10; v.4, count 5 }; keeping only the most recent version per node gives counter value 5 + 12 + 5 = 22)
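
A minimal sketch of the read-side merge from the example above (node identifiers and types invented): each replica returns the (version, count) entries it knows per node, only the highest version per node is kept, and the kept counts are summed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative counter read: merge the per-node (version, count) entries
// returned by the replicas, keeping only the most recent version per node.
public final class CounterReadSketch {
    public record Shard(String nodeId, long version, long count) {}

    public static long merge(List<List<Shard>> replicaResponses) {
        Map<String, Shard> newestPerNode = new HashMap<>();
        for (List<Shard> response : replicaResponses) {
            for (Shard shard : response) {
                newestPerNode.merge(shard.nodeId(), shard,
                        (a, b) -> a.version() >= b.version() ? a : b);
            }
        }
        return newestPerNode.values().stream().mapToLong(Shard::count).sum();
    }

    public static void main(String[] args) {
        // The example from the slides: {v.3: 5, v.6: 12, v.2: 8} and {v.3: 5, v.5: 10, v.4: 5}.
        long value = merge(List.of(
                List.of(new Shard("n1", 3, 5), new Shard("n2", 6, 12), new Shard("n3", 2, 8)),
                List.of(new Shard("n1", 3, 5), new Shard("n2", 5, 10), new Shard("n3", 4, 5))));
        System.out.println(value);  // prints 22 = 5 + 12 + 5
    }
}
```
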
  59. Storage problems
  60. Tuning • Cassandra can’t really use large amounts of RAM • Garbage Collection pauses stop everything • Compaction has an impact on performance • Reading from disk is slow • These limitations restrict the size of each node
  61. Recap • Fast sequential writes • ~1 I/O for uncached reads, 0 for cached • Counter increments read on write, columns don’t • Know where your time is spent (monitor!) • Tune accordingly
  62. Questions? (photo credits) http://www.flickr.com/photos/kubina/326628918/sizes/l/in/photostream/ http://www.flickr.com/photos/alwarrete/5651579563/sizes/o/in/photostream/ http://www.flickr.com/photos/pio1976/3330670980/sizes/o/in/photostream/ http://www.flickr.com/photos/lwr/100518736/sizes/l/in/photostream/
  63. • In-kernel backend • No Garbage Collection • No need to plan heavy compactions • Low and consistent latency • Full versioning, snapshots • No degradation with Big Data
