A Deep Dive Into Understanding Apache Cassandra

Inside Cassandra – C* is an interesting piece of software for many reasons, but especially for its use of elegant data structures and algorithms. This talk will focus on the data structures and algorithms that make C* such a scalable and performant database. We will walk along the write, read and delete paths, exploring the low-level details of how each of these operations works. We will also explore some of the background processes that maintain availability and performance. The goal of this talk is to gain a deeper understanding of C* by exploring the low-level details of its implementation.

Slide notes
  • Regular hashing
  • Today we’ll explore the path of your data through C* to the disk, and from the disk back out to your application. We discuss a lot of the processes and data formats, and explore some of the data structures involved. I’m going to explore topics that were hard for me to understand or that I found interesting. I’m not going to talk about Thrift or CQL 3, or the intermediate layers between there and C* writing/reading your data. If you have free time at night or on the weekend, I highly recommend Aaron Morton’s Cassandra Internals talk. Git clone the repo and follow along with his talk. He gives a good intro to exploring the source code. Warning: it’s like reading a Wikipedia article. You never know where you’re going to end up.
  • write_request_timeout_in_ms – memory pressure on coordinator node
  • Transcript

    • 1. Inside Cassandra Michael Penick
    • 2. Overview • To disk and back again • Cassandra Internals by Aaron Morton • Goals – RDBMS comparison to C* – Make educated decisions in configuration
    • 3. Node 3 Node 2 Node 1 Node 0 Distributed Hashing A B C D E F G H I J K L M N O P Location = Hash(Key) % # Nodes
    • 4. Node 4 Node 3 Node 2 Node 1 Node 0 Distributed Hashing A B C D F G H K J L P O M I N E % Data Moved = 100 * N / (N + 1)
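      The % Data Moved figure on slide 4 is easy to verify empirically. A minimal sketch (hypothetical hash mixing, not C* code) showing that growing a modulo-placed cluster from N to N + 1 nodes relocates roughly 100 * N / (N + 1) percent of the keys:

        public class ModuloPlacement {
            public static void main(String[] args) {
                int keys = 100_000;
                for (int n = 4; n <= 8; n++) {
                    int moved = 0;
                    for (int k = 0; k < keys; k++) {
                        // Location = Hash(Key) % # Nodes, before and after adding a node.
                        int before = Math.floorMod(hash(k), n);
                        int after = Math.floorMod(hash(k), n + 1);
                        if (before != after) moved++;
                    }
                    System.out.printf("%d -> %d nodes: %.1f%% of keys moved%n",
                            n, n + 1, 100.0 * moved / keys);
                }
            }

            // Any reasonable hash works for the demo; mix the bits a little.
            static int hash(int key) {
                int h = key * 0x9E3779B9;
                return h ^ (h >>> 16);
            }
        }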
    • 5. Consistent Hashing 0 Node 1 Node 2 Node 3 Node 4
    • 6. Consistent Hashing 0 A E I M B F J N C G K O D H L P Add Node 0 A E I M B F J N C G K O D H L P % Data Moved = 100 * 1 / N
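      Slides 5–6 replace modulo placement with a token ring. A minimal consistent-hash ring sketch (illustrative names, not C*'s implementation): a key belongs to the first node token at or after its hash, wrapping around, so adding a node only takes keys from its clockwise neighbor, roughly 100 * 1 / N percent of the data:

        import java.util.Map;
        import java.util.TreeMap;

        public class Ring {
            private final TreeMap<Integer, String> tokens = new TreeMap<>();

            void addNode(String name, int token) {
                tokens.put(token, name);
            }

            String nodeFor(int keyHash) {
                // First token at or after the key's hash, wrapping to the ring start.
                Map.Entry<Integer, String> e = tokens.ceilingEntry(keyHash);
                return (e != null ? e : tokens.firstEntry()).getValue();
            }

            public static void main(String[] args) {
                Ring ring = new Ring();
                ring.addNode("node1", Integer.MIN_VALUE / 2);
                ring.addNode("node2", 0);
                ring.addNode("node3", Integer.MAX_VALUE / 2);
                System.out.println(ring.nodeFor("A".hashCode()));  // -> node3
            }
        }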
    • 7. Virtual Nodes Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 num_tokens initial_token
    • 8. Tunable Consistency 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 replication_factor = 3 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
    • 9. Tunable Consistency 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 replication_factor = 3 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
    • 10. Hinted Handoff 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 replication_factor = 3 and hinted_handoff_enabled = true R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY Write locally: system.hints Note: Does not count toward the consistency level (except ANY)
    • 11. Tunable Consistency 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 R1 R2 Appends FWD_TO parameter to message
    • 12. Read Repair 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 R1 R2 R3 Client SELECT * FROM table USING CONSISTENCY ONE replication_factor = 3 and read_repair_chance > 0
    • 13. Write Memory Disk Commit Log Memtable K1 C1:V1 C2:V2 K1 C1:V1 C2:V2 SSTable #1 K1 C1:V1 C2:V2 … … … Flush when: > commitlog_total_space_in_mb or > memtable_total_space_in_mb
    • 14. Write Memory Disk Commit Log Memtable K1 C3:V3 K1 C3:V3 SSTable #1 SSTable #2 K1 C1:V1 C2:V2 … … … … Note: All writes are sequential! Physical Volume #1 Physical Volume #2 K1 C3:V3
    • 15. Commit Log Mutation #3 Mutation #2 Mutation #1 Commit Log Executor Commit Log Allocator Segment #3 Segment #2 Segment #1 Segment #1 Commit Log File Memory Disk Commit Log File Commit Log File Flush! Write! commitlog_segment_size_in_mb
    • 16. Commit Log • commitlog_sync 1. periodic (default) • commitlog_sync_period_in_ms (default: 10 seconds) 2. batch • commitlog_batch_window_in_ms
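      Slides 13–16 establish the write ordering: append to the commit log first (sequential I/O, replayed after a crash), then apply the mutation to the memtable. A toy sketch of that ordering (illustrative names, not C* classes; a real commit log syncs per commitlog_sync rather than just flushing):

        import java.io.FileWriter;
        import java.io.IOException;
        import java.util.concurrent.ConcurrentSkipListMap;

        class WritePath {
            private final FileWriter commitLog = new FileWriter("commitlog.seg", true);
            private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();

            WritePath() throws IOException {}

            synchronized void write(String key, String value) throws IOException {
                commitLog.write(key + "=" + value + "\n");  // durable record first
                commitLog.flush();                          // batch mode would fsync here
                memtable.put(key, value);                   // then the in-memory update
            }
        }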
    • 17. Memtable • ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows; • AtomicSortedColumns.Holder – DeletionInfo deletionInfo; // tombstone – SnapTreeMap<ByteBuffer, Column> map; • Goals – Fast operations – Fast concurrent access – Fast in-order iteration – Atomic/Isolated operations within a column family
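      A toy version of the structure above, using the same ConcurrentSkipListMap but simplified types, to show why it fits the goals: concurrent sorted writes plus cheap in-order iteration when flushing to a sorted SSTable:

        import java.nio.ByteBuffer;
        import java.util.concurrent.ConcurrentSkipListMap;

        class ToyMemtable {
            // row key -> (column name -> value), both levels kept sorted.
            private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, ByteBuffer>> rows =
                    new ConcurrentSkipListMap<>();

            void write(String rowKey, String column, ByteBuffer value) {
                rows.computeIfAbsent(rowKey, k -> new ConcurrentSkipListMap<>())
                    .put(column, value);  // an overwrite replaces the old column value
            }

            // In-order iteration is what makes flushing to a sorted SSTable cheap.
            void flush() {
                rows.forEach((key, cols) -> System.out.println(key + " -> " + cols.keySet()));
            }
        }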
    • 18. Skip List 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 19. Skip List Get 7 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 20. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 21. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 22. Skip List Delete 4 1 2 3 5 6 7 NIL NIL NIL NIL
    • 23. Skip List Insert 4 1 2 3 5 6 7 NIL NIL NIL NIL
    • 24. Skip List Insert 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 25. Skip List ConcurrentSkipListMap uses: p = 0.5
    • 26. Skip List Insert 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 27. Skip List H 1 3 T 2 H 1 3 T 2 C A S
    • 28. Skip List
          while (true):
              next = current.next
              new_node.next = next
              if CompareAndSwap(current.next, next, new_node):
                  break
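      A runnable Java rendering of that loop, sketching the lock-free insert idea with AtomicReference (a sketch, not ConcurrentSkipListMap's actual code):

        import java.util.concurrent.atomic.AtomicReference;

        class Node<T> {
            final T value;
            final AtomicReference<Node<T>> next = new AtomicReference<>();
            Node(T value) { this.value = value; }
        }

        class CasInsert {
            // Insert newNode right after current; retry if another thread
            // changed current.next between the read and the compareAndSet.
            static <T> void insertAfter(Node<T> current, Node<T> newNode) {
                while (true) {
                    Node<T> next = current.next.get();
                    newNode.next.set(next);
                    if (current.next.compareAndSet(next, newNode)) {
                        break;
                    }
                }
            }
        }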
    • 29. Skip List H 1 3 T H 1 3 T 2 CAS I’m lost!
    • 30. Skip List H 1 3 T C A S H 1 3 T H 1 3 T CAS
    • 31. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    • 32. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4 CAS
    • 33. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    • 34. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    • 35. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    • 36. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    • 37. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    • 38. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL CAS
    • 39. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    • 40. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    • 41. Skip List Delete 4 1 2 3 5 6 7 NIL NIL NIL NIL
    • 42. SnapTree 3 2 5 1 4 6 Node Balance Factor 1 0 2 1 3 0 4 0 5 0 6 0 Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
    • 43. SnapTree 5 2 6 1 3 4 Node Balance Factor 1 0 2 -1 3 -1 4 0 5 2 6 0 Balance Factor must be -1, 0 or +1
    • 44. SnapTree 5 3 4 A B C D 5 4 3 A B C D 4 3 5 A B C D Left-Right Case Left-Left Case
    • 45. SnapTree 3 5 4 D CB A 3 4 5 DC B A 4 3 5 A B C D Right-Left Case Right-Right Case
    • 46. SnapTree 5 2 6 1 3 4 Node Balance Factor 1 0 2 1 3 1 4 0 5 2 6 0 5 2 6 1 3 4 Node Balance Factor 1 0 2 -1 3 -1 4 0 5 2 6 0
    • 47. SnapTree Node Balance Factor 1 0 2 1 3 1 4 0 5 2 6 0 5 2 6 1 3 4 Node Balance Factor 1 0 2 1 3 0 4 0 5 0 6 0 3 2 5 1 4 6
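      The rebalancing on slides 42–47 is standard AVL bookkeeping. A sketch of the balance factor and the right rotation that fixes the Left-Left case (simplified: no SnapTree versioning, locking, or epochs):

        class AvlNode {
            int key, height = 1;
            AvlNode left, right;
            AvlNode(int key) { this.key = key; }
        }

        class Rotations {
            static int height(AvlNode n) { return n == null ? 0 : n.height; }

            // Balance Factor = Height(Left-Subtree) - Height(Right-Subtree);
            // it must stay in {-1, 0, +1}.
            static int balanceFactor(AvlNode n) {
                return height(n.left) - height(n.right);
            }

            // Left-Left case: pivot the left child up; its right subtree
            // (B in the slides) moves under the old root.
            static AvlNode rotateRight(AvlNode root) {
                AvlNode pivot = root.left;
                root.left = pivot.right;
                pivot.right = root;
                root.height = 1 + Math.max(height(root.left), height(root.right));
                pivot.height = 1 + Math.max(height(pivot.left), height(pivot.right));
                return pivot;  // new subtree root
            }
        }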
    • 48. Epoch SnapTree 5 2 6 1 3 4 Root Lock 4 Version(5) is 0 Version(2) is 0 Does Version(5) == 0? Insert
    • 49. Epoch SnapTree 5 2 6 1 3 4 Root 4Get Version(5) is 0 Version(2) is 0 Does Version(5) == 0?
    • 50. Epoch SnapTree Root 5 2 6 1 3 4 4Get Does Version(5) == 0? NO! Go back to 5
    • 51. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock : (
    • 52. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock
    • 53. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock SetValue(3, null)
    • 54. SnapTree Epoch #1 Root 3 2 5 1 4 6 Clone Stop Delete Insert
    • 55. SnapTree Epoch #2 Root 3 2 5 1 4 6 Clone Epoch #3 Root I’m shared!
    • 56. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert
    • 57. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6
    • 58. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6
    • 59. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6 7
    • 60. Snap Tree C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307
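      The clone on slides 54–59 is copy-on-write: the new epoch shares the old root, and a later insert copies only the path it touches. A simplified persistent-tree sketch of that idea (no rebalancing or epoch tracking):

        class PNode {
            final int key;
            final PNode left, right;
            PNode(int key, PNode left, PNode right) {
                this.key = key; this.left = left; this.right = right;
            }
        }

        class Snapshots {
            // Returns a new root; untouched subtrees are shared with the old tree.
            static PNode insert(PNode n, int key) {
                if (n == null) return new PNode(key, null, null);
                if (key < n.key) return new PNode(n.key, insert(n.left, key), n.right);
                if (key > n.key) return new PNode(n.key, n.left, insert(n.right, key));
                return n;  // already present
            }

            public static void main(String[] args) {
                PNode epoch2 = insert(insert(insert(null, 3), 2), 5);
                PNode epoch3 = insert(epoch2, 7);  // epoch2 is untouched, nodes shared
                System.out.println(epoch2 != epoch3);  // true
            }
        }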
    • 61. SSTable Filter.db Data.db K1 K2 K3 C1 C1 C2 C2 C3 CRC.db 0xFFCC23ED 0x1FEA2321 0xCE652133 Index.db K1 K2 K3 00001 00002 00003 CompressionInfo.db 00001 00002 00003 00001 00004 00006 Compression? NoYes • CASSANDRA-2319 • Promote row index • CASSANDRA-4885 • Remove … per-row bloom filters
    • 62. Delete • Essentially a write (mutation) • Data is not removed immediately; instead a tombstone record is added • tombstone time > gc_grace = data removed (during compaction)
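      A sketch of those rules (illustrative names, not C*'s cell format): on read the newest timestamp wins, so a tombstone shadows older values, and the pair becomes purgeable during compaction only after gc_grace:

        record Cell(byte[] value, long timestampMicros, boolean tombstone) {
            // Reconciling two versions of the same column: newest wins,
            // even when the newest is a tombstone.
            static Cell merge(Cell a, Cell b) {
                return a.timestampMicros() >= b.timestampMicros() ? a : b;
            }

            boolean purgeable(long nowSeconds, long gcGraceSeconds) {
                return tombstone()
                        && nowSeconds - (timestampMicros() / 1_000_000) > gcGraceSeconds;
            }
        }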
    • 63. Bloom Filter
    • 64. Bloom Filter 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Insert K1 (Hash)
    • 65. Bloom Filter 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Insert K1 (Hash)
    • 66. Bloom Filter 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 Insert K1 (Hash, Hash, Hash)
          hash = murmur3(key)  # produces two hashes
          for i in range(num_hashes):
              result[i] = abs(hash[0] + i * hash[1]) % num_bits
    • 67. Bloom Filter probability calculation: the configured bloom_filter_fp_chance and the SSTable's number of rows determine the number of hashes and the number of bits per entry
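      A minimal Bloom filter built around the double-hashing trick from slide 66 (plain hashCode-derived hashes here rather than murmur3):

        import java.util.BitSet;

        class ToyBloomFilter {
            private final BitSet bits;
            private final int numBits, numHashes;

            ToyBloomFilter(int numBits, int numHashes) {
                this.bits = new BitSet(numBits);
                this.numBits = numBits;
                this.numHashes = numHashes;
            }

            // Two base hashes combined as h1 + i * h2 simulate k hash functions.
            private int index(String key, int i) {
                int h1 = key.hashCode();
                int h2 = Integer.rotateLeft(h1, 16) ^ 0x5bd1e995;  // cheap second hash
                return Math.floorMod(h1 + i * h2, numBits);
            }

            void add(String key) {
                for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
            }

            // false means "definitely absent"; true means "maybe present".
            boolean mightContain(String key) {
                for (int i = 0; i < numHashes; i++)
                    if (!bits.get(index(key, i))) return false;
                return true;
            }
        }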
    • 68. Read Memory Disk Memtable K1 C4:V4 SSTable #2 K1 C3:V3 SSTable #1 K1 C1:V1 C2:V2 … … … Memtable K1 C5:V5 … K1 C4:V4 C1:V1 C2:V2 C3:V3 C5:V5 Row Cache = Off-heap row_cache_size_in_mb > 0
    • 69. Read Memory Disk Bloom Filter Key Cache Partition Summary Compression Offsets Partition Index Data Cache Hit Cache Miss = Off-heap key_cache_size_in_mb > 0 index_interval = 128 (default)
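      A sketch of that lookup order, where every type and method is an illustrative stand-in rather than a C* API (compression offsets omitted): Bloom filter, then key cache, then partition summary and partition index, then the data file:

        import java.util.Map;
        import java.util.Optional;

        class ReadPath {
            interface SSTableView {
                boolean bloomMightContain(String key);
                Map<String, Long> keyCache();                // key -> data file offset
                long summaryNearestIndexBlock(String key);   // sampled every index_interval keys
                long indexFindDataOffset(String key, long indexBlock);  // -1 if absent
                byte[] dataReadRowAt(long offset);
            }

            Optional<byte[]> read(String key, SSTableView t) {
                if (!t.bloomMightContain(key)) return Optional.empty();  // definitely absent
                Long offset = t.keyCache().get(key);
                if (offset == null) {                        // cache miss: walk the index
                    long block = t.summaryNearestIndexBlock(key);
                    long found = t.indexFindDataOffset(key, block);
                    if (found < 0) return Optional.empty();  // Bloom false positive
                    t.keyCache().put(key, found);
                    offset = found;
                }
                return Optional.of(t.dataReadRowAt(offset)); // cache hit skips the index
            }
        }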
    • 70. Compaction (Size-tiered)
    • 71. Compaction (Size-tiered) min_compaction_threshold = 4 Memtable flush!
    • 72. Compaction (Size-tiered)
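      A sketch of the size-tiered idea, with the bucketing rule simplified to "within 2x of the bucket average" (C*'s actual thresholds are configurable): group SSTables of similar size and compact a bucket once it reaches min_compaction_threshold:

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;

        class SizeTiered {
            static List<List<Long>> compactionBuckets(List<Long> sstableSizes, int minThreshold) {
                List<Long> sorted = new ArrayList<>(sstableSizes);
                Collections.sort(sorted);
                List<List<Long>> buckets = new ArrayList<>();
                for (long size : sorted) {
                    List<Long> last = buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);
                    if (last != null && size <= 2 * average(last)) {
                        last.add(size);               // similar size: same tier
                    } else {
                        buckets.add(new ArrayList<>(List.of(size)));
                    }
                }
                buckets.removeIf(b -> b.size() < minThreshold);  // only full tiers compact
                return buckets;
            }

            static double average(List<Long> xs) {
                return xs.stream().mapToLong(Long::longValue).average().orElse(0);
            }
        }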
    • 73. Compaction (Leveled) Memtable flush!
    • 74. Compaction (Leveled) L0: 160 MB L1: 160 MB x 10 sstable_size_in_mb = 160 L2: 160 MB x 100
    • 75. Compaction (Leveled) L0: 160 MB L1: 160 MB x 10 L2: 160 MB x 100 …
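      The level capacities follow directly from sstable_size_in_mb and the 10x fan-out shown above:

        class LeveledSizing {
            public static void main(String[] args) {
                long sstableMb = 160;                     // sstable_size_in_mb
                for (int level = 0; level <= 2; level++) {
                    long capMb = sstableMb * (long) Math.pow(10, level);
                    System.out.printf("L%d: ~%d MB (~%d SSTables of %d MB)%n",
                            level, capMb, capMb / sstableMb, sstableMb);
                }
            }
        }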
    • 76. Topics • CAS (Paxos) • Anti-entropy (Merkle trees) • Gossip (Failure detection)
    • 77. Thanks