Inside Cassandra
Michael Penick
Overview
• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
– RDBMS comparison to C*
– Make educated d...
Node 3 Node 2
Node 1 Node 0
Distributed Hashing
A B
C D
E F
G H
I J
K L
M N
O P
Location = Hash(Key) % # Nodes
Node 4
Node 3 Node 2
Node 1 Node 0
Distributed Hashing
A B
C D
F G
H
K
J
LP
O
M
I
N
E
% Data Moved = 100 * N / (N + 1)
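The two formulas on these slides can be checked with a small experiment. The sketch below is an illustration only: `node_for` and `percent_moved` are hypothetical helper names, and plain integers stand in for hash values.

```python
def node_for(key_hash: int, num_nodes: int) -> int:
    """Modulo placement: Location = Hash(Key) % # Nodes."""
    return key_hash % num_nodes


def percent_moved(num_nodes: int, hashes) -> float:
    """Percentage of keys whose node changes when one node is added."""
    hashes = list(hashes)
    moved = sum(
        1 for h in hashes
        if node_for(h, num_nodes) != node_for(h, num_nodes + 1)
    )
    return 100.0 * moved / len(hashes)


# Hash values 0 .. N*(N+1)-1 cover every (old node, new node) pairing once.
N = 4
observed = percent_moved(N, range(N * (N + 1)))
predicted = 100.0 * N / (N + 1)
print(observed, predicted)  # 80.0 80.0
```

Going from 4 to 5 nodes relocates 80% of the keys, which is why modulo placement is a poor fit for a cluster that grows.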
Consistent Hashing
0
Node 1
Node 2 Node 3
Node 4
Consistent Hashing
0
A
E
I
M
B
F
J
N C
G
K
O
D
H
L
P
Add Node 0
A
E
I
M
B
F
J
N C
G
K
O
D
H
L
P
% Data Moved = 100 * 1 / N
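A minimal sketch of a token ring, assuming one token per node and MD5 as an arbitrary stand-in hash (the `Ring` class and `token` function are hypothetical names, not Cassandra's partitioner). It shows the property behind the formula: when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring (MD5 is only an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")


class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        """First node at or after the key's token, wrapping past the end."""
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]


keys = [f"key-{i}" for i in range(1000)]
ring = Ring(["node1", "node2", "node3", "node4"])
before = {k: ring.owner(k) for k in keys}
ring.add("node0")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{100.0 * len(moved) / len(keys):.1f}% of keys moved")
```

With a single token per node the moved share depends on where the new token happens to land, so it only averages out to roughly 1 / N. Spreading many tokens per node, as virtual nodes do, evens this out.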
Virtual Nodes
Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
num_tokens
initial_token
Tunable Consistency
0
Node 1
Node 2
Node 3 Node 4
Node 5
Node 6
replication_factor = 3
R1
R2
R3
Client
INSERT INTO table (c...
Tunable Consistency
0
Node 1
Node 2
Node 3 Node 4
Node 5
Node 6
replication_factor = 3
R1
R2
R3
Client
INSERT INTO table (c...
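How many of the three replicas must acknowledge before the client gets a response depends on the consistency level. A rough sketch for a single data center, covering only ONE, QUORUM and ALL; `required_acks` is a hypothetical helper, not a Cassandra API:

```python
def required_acks(consistency: str, replication_factor: int) -> int:
    """Replica acknowledgements the coordinator waits for (single data center)."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority of replicas
    if consistency == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {consistency}")


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

With replication_factor = 3, a QUORUM write and a QUORUM read each touch two replicas, so they always overlap in at least one replica.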
Hinted Handoff
0
Node 1
Node 2
Node 3 Node 4
Node 5
Node 6
replication_factor = 3
and
hinted_handoff_enabled = true
R1
R2
R...
Tunable Consistency
0
Node 1
Node 2
Node 3 Node 4
Node 5
Node 6
R1
R2
R3
Client
INSERT INTO table (column1, …) VALUES (valu...
Read Repair
0
Node 1
Node 2
Node 3 Node 4
Node 5
Node 6
R1
R2
R3
Client
SELECT * FROM table USING CONSISTENCY ONE
replicati...
Write
Memory
Disk
Commit Log
Memtable
K1 C1:V1 C2:V2
K1 C1:V1 C2:V2
SSTable #1
K1 C1:V1 C2:V2
…
… …
Flush when:
> commitlo...
Write
Memory
Disk
Commit Log
Memtable
K1 C3:V3
K1 C3:V3
SSTable #1 SSTable #2
K1 C1:V1 C2:V2
…
… … …
Note: All writes are ...
Commit Log
Mutation
#3
Mutation
#2
Mutation
#1
Commit Log
Executor
Commit Log
Allocator
Segment #3 Segment #2 Segment #1 S...
Commit Log
• commitlog_sync
1. periodic (default)
• commitlog_sync_period_in_ms (default: 10 seconds)
2. batch
• commitlog...
Memtable
• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
– DeletionInfo dele...
Skip List
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
Get 7
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 5 6 7
NIL
NIL
NIL
NIL
Skip List
Insert 4
1 2 3 5 6 7
NIL
NIL
NIL
NIL
Skip List
Insert 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
ConcurrentSkipListMap uses: p = 0.5
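The parameter p is the chance that a newly inserted node is promoted one more level. The sketch below shows the classic coin-flip scheme, not the JDK's actual code; `random_level` and `max_level` are assumptions made for illustration.

```python
import random


def random_level(rng: random.Random, p: float = 0.5, max_level: int = 32) -> int:
    """Promote a new node one level at a time, each with probability p."""
    level = 1
    while level < max_level and rng.random() < p:
        level += 1
    return level


rng = random.Random(42)
levels = [random_level(rng) for _ in range(100_000)]
print(sum(levels) / len(levels))  # close to 1 / (1 - p) = 2 for p = 0.5
```

With p = 0.5 roughly half the nodes reach level 2, a quarter reach level 3, and so on, which is what gives the expected logarithmic search cost.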
Skip List
Insert 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
H 1 3 T
2
H 1 3 T
2
C
A
S
Skip List
while(true):
    next = current.next
    new_node.next = next
    if(CompareAndSwap(current.next, next, new_node)):
        break
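The retry loop above can be sketched in runnable form for the bottom-level linked list. Python has no native compare-and-swap, so this sketch emulates it with a per-node lock; the `Node` class and `compare_and_swap_next` method are hypothetical. It covers inserts only, since safe concurrent deletes need the marker-node technique shown on the later slides.

```python
import threading


class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self._lock = threading.Lock()  # stands in for hardware CAS atomicity

    def compare_and_swap_next(self, expected, new_node) -> bool:
        """Set self.next to new_node only if it still equals expected."""
        with self._lock:
            if self.next is expected:
                self.next = new_node
                return True
            return False


def insert(head: Node, key) -> None:
    new_node = Node(key)
    while True:
        # Find the insertion point: nxt is the first node with key >= new key.
        current, nxt = head, head.next
        while nxt is not None and nxt.key < key:
            current, nxt = nxt, nxt.next
        new_node.next = nxt
        if current.compare_and_swap_next(nxt, new_node):
            return  # nobody changed current.next in the meantime
        # Otherwise another thread won the race; search again and retry.


def keys(head: Node):
    out, node = [], head.next
    while node is not None:
        out.append(node.key)
        node = node.next
    return out


head = Node(None)  # sentinel "H" from the slides
threads = [
    threading.Thread(target=lambda lo=lo: [insert(head, k) for k in range(lo, 400, 4)])
    for lo in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(keys(head) == list(range(400)))
```

Four threads insert interleaved keys, and the list still ends up complete and sorted because a failed swap never overwrites another thread's link.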
Skip List
H 1 3 T
H 1 3 T
2
CAS
I’m lost!
Skip List
H 1 3 T
C
A
S
H 1 3 T
H 1 3 T
CAS
Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
CAS
Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
Skip List
Insert 4
1 2 3 5 6 7 8
NIL
NIL
NIL
NIL
4
Skip List
Delete 4
1 2 3 4 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL
CAS
Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 NIL 5 6 7
NIL
NIL
NIL
NIL
Skip List
Delete 4
1 2 3 5 6 7
NIL
NIL
NIL
NIL
SnapTree
3
2 5
1 4 6
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
Balance Factor = Height(Left-Subtree) – Height(Right-Subt...
SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0
Balance Factor must be -1, 0 or +1
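The balance-factor tables on these two slides can be reproduced with a few lines. This is a plain AVL-style calculation on tuples, not SnapTree code; an empty subtree is assumed to have height 0.

```python
# A tree is None or a tuple (key, left, right).
def height(tree) -> int:
    if tree is None:
        return 0
    _, left, right = tree
    return 1 + max(height(left), height(right))


def balance_factors(tree, out=None) -> dict:
    """Balance Factor = Height(Left-Subtree) - Height(Right-Subtree), per node."""
    if out is None:
        out = {}
    if tree is not None:
        key, left, right = tree
        out[key] = height(left) - height(right)
        balance_factors(left, out)
        balance_factors(right, out)
    return out


def leaf(key):
    return (key, None, None)


balanced = (3, (2, leaf(1), None), (5, leaf(4), leaf(6)))
unbalanced = (5, (2, leaf(1), (3, None, leaf(4))), leaf(6))

print(balance_factors(balanced))
print(balance_factors(unbalanced))
print(all(abs(b) <= 1 for b in balance_factors(unbalanced).values()))  # False
```

Node 5 in the second tree has a balance factor of 2, which is what triggers the rotations on the following slides.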
SnapTree
5
3
4
A
B C
D
5
4
3
A B
C
D
4
3 5
A B C D
Left-Right Case
Left-Left Case
SnapTree
3
5
4
D
CB
A
3
4
5
DC
B
A
4
3 5
A B C D
Right-Left Case
Right-Right Case
SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0
SnapTree
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
3
2 5
1 4 6
Epoch
SnapTree
5
2 6
1 3
4
Root
Lock
4
Version(5) is 0
Version(2) is 0
Does Version(5) == 0?
Insert
Epoch
SnapTree
5
2 6
1 3
4
Root
4Get
Version(5) is 0
Version(2) is 0
Does Version(5) == 0?
Epoch
SnapTree
Root
5
2
6
1
3
4
4Get
Does Version(5) == 0?
NO! Go back to 5
Epoch
SnapTree
Root
4
3
2 5
1 4 6
3Delete
Lock : (
Epoch
SnapTree
Root
4
3
2 5
1 4 6
3Delete
Lock
Epoch
SnapTree
Root
4
3
2 5
1 4 6
3Delete
Lock
SetValue(3, null)
SnapTree
Epoch #1
Root
3
2 5
1 4 6
Clone Stop
Delete
Insert
SnapTree
Epoch #2
Root
3
2 5
1 4 6
Clone
Epoch #3
Root
I’m
shared!
SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
7Insert
SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
7Insert
3
2 5
1 4 6
SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
7Insert
3
2 5
1 4 6
SnapTree
Epoch #2
Root
3
2 5
1 4 6
Epoch #3
Root
7Insert
3
2 5
1 4 6
7
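The idea of a clone that shares nodes with its snapshot can be illustrated with simple path copying. This is a deliberately reduced sketch: it leaves out SnapTree's epochs, locking and AVL rebalancing, and copies eagerly along the search path, whereas SnapTree copies lazily on write. All names are hypothetical.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root


def keys(root: Optional[Node]) -> list:
    if root is None:
        return []
    return keys(root.left) + [root.key] + keys(root.right)


epoch2 = None
for k in (3, 2, 5, 1, 4, 6):
    epoch2 = insert(epoch2, k)

epoch3 = insert(epoch2, 7)          # writer's view after "Insert 7"
print(keys(epoch2))                 # the snapshot still lacks 7
print(keys(epoch3))
print(epoch3.left is epoch2.left)   # untouched left subtree is shared
```

Readers holding the older root keep a consistent view while the writer works on its own copy of the path 3, 5, 6.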
Snap Tree
C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307
SSTable
Filter.db Data.db
K1
K2
K3
C1
C1
C2
C2
C3
CRC.db
0xFFCC23ED
0x1FEA2321
0xCE652133
Index.db
K1
K2
K3
00001
00002
00...
Delete
• Essentially a write (mutation)
• Data not removed immediately, but a
tombstone record is added
• tombstone time > gc_...
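A simplified sketch of the rule on this slide: a delete writes a tombstone, and compaction may drop that tombstone only once it is older than the grace period. The `Cell` class and `survives_compaction` function are hypothetical, and a real compaction also has to consider data in SSTables that are not part of the compaction.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: Optional[str]   # None marks a tombstone
    timestamp: float       # write time, seconds since the epoch

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


def survives_compaction(cell: Cell, gc_grace_seconds: float, now: float) -> bool:
    """Keep live cells, and tombstones younger than the grace period."""
    if not cell.is_tombstone:
        return True
    return now - cell.timestamp <= gc_grace_seconds


now = time.time()
gc_grace = 10 * 24 * 3600  # illustrative value, ten days
cells = {
    "live": Cell("V1", now - 30 * 24 * 3600),
    "fresh_tombstone": Cell(None, now - 3600),
    "old_tombstone": Cell(None, now - 11 * 24 * 3600),
}
kept = {name for name, c in cells.items() if survives_compaction(c, gc_grace, now)}
print(sorted(kept))  # ['fresh_tombstone', 'live']
```

The grace period gives replicas that missed the delete time to learn about it before the tombstone disappears.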
Bloom Filter
Bloom Filter
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K1Hash Insert
Bloom Filter
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
K1Hash Insert
Bloom Filter
1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
K1Hash InsertHashHash
hash = murmur3(key) # creates two hashes
for i in count...
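The double-hashing idea in the pseudocode (two base hashes combined as hash[0] + i * hash[1]) can be sketched as a small filter. Python's standard library has no murmur3, so SHA-256 is swapped in here purely for illustration; `BloomFilter` and its methods are hypothetical names.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key: str):
        # SHA-256 stands in for murmur3 here; split it into two 64-bit hashes.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(num_bits=16, num_hashes=3)
bf.add("K1")
print(bf.bits)
print(bf.might_contain("K1"))  # True
```

A filter can return a false positive for a key that was never added, but never a false negative, which is why a negative answer lets a read skip an SSTable entirely.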
Bloom Filter
Bloom Filter
Probability
Calculation
Config: bloom_filter_fp_chance,
and
SSTable: number of rows
Num hashes,
...
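The slide maps a target false-positive chance and a row count to a number of hashes and bits per entry. The sketch below uses the standard textbook formulas for that mapping; Cassandra's own calculation may differ in detail, and `bloom_parameters` and `filter_bits` are hypothetical helpers.

```python
import math


def bloom_parameters(fp_chance: float):
    """Textbook sizing: bits per entry and hash count for a target FP rate."""
    bits_per_entry = -math.log(fp_chance) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_entry * math.log(2)))
    return bits_per_entry, num_hashes


def filter_bits(fp_chance: float, num_rows: int) -> int:
    bits_per_entry, _ = bloom_parameters(fp_chance)
    return math.ceil(bits_per_entry * num_rows)


bits, hashes = bloom_parameters(0.01)
print(f"{bits:.2f} bits per entry, {hashes} hashes")
print(filter_bits(0.01, 1_000_000), "bits for one million rows")
```

Lowering bloom_filter_fp_chance buys fewer wasted disk reads at the cost of a larger filter in memory.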
Read
Memory
Disk
Memtable
K1 C4:V4
SSTable #2
K1 C3:V3
SSTable #1
K1 C1:V1 C2:V2
…
… …
Memtable
K1 C5:V5
… K1 C4:V4C1:V1 C...
Read
Memory
Disk
Bloom
Filter
Key
Cache
Partition
Summary
Compression
Offsets
Partition
Index Data
Cache Hit
Cache Miss
= ...
Compaction (Size-tiered)
Compaction (Size-tiered)
min_compaction_threshold = 4
Memtable flush!
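Size-tiered compaction groups SSTables of similar size and compacts a group once it holds at least min_compaction_threshold tables. A rough sketch of that bucketing follows; the 0.5 and 1.5 bounds and both function names are assumptions of this example rather than values taken from the slides.

```python
def bucket_by_size(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes that fall within [low, high] x the bucket average."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb, min_compaction_threshold=4):
    return [b for b in bucket_by_size(sizes_mb) if len(b) >= min_compaction_threshold]


sstables = [10, 11, 9, 10, 45, 160]
print(bucket_by_size(sstables))         # [[9, 10, 10, 11], [45], [160]]
print(compaction_candidates(sstables))  # [[9, 10, 10, 11]]
```

Four similar-sized flushes form a bucket that gets merged into one larger SSTable, which later joins a bucket of its own size tier.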
Compaction (Size-tiered)
Compaction (Leveled)
Memtable flush!
Compaction (Leveled)
L0: 160 MB L1: 160 MB x 10
sstable_size_in_mb = 160
L2: 160 MB x 100
Compaction (Leveled)
L0: 160 MB L1: 160 MB x 10 L2: 160 MB x 100
…
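The level sizes on these slides follow a simple pattern: each level from L1 upward holds ten times as much as the one before it. A small sketch of that arithmetic, with `level_capacity_mb` and `fanout` as hypothetical names:

```python
def level_capacity_mb(level: int, sstable_size_in_mb: int = 160, fanout: int = 10) -> int:
    """Target size of level L1 and above: sstable_size_in_mb * fanout ** level."""
    if level < 1:
        raise ValueError("L0 holds freshly flushed SSTables and has no fixed target here")
    return sstable_size_in_mb * fanout ** level


for level in (1, 2, 3):
    print(f"L{level}: {level_capacity_mb(level)} MB")
```

With sstable_size_in_mb = 160 this gives 1,600 MB for L1 and 16,000 MB for L2, matching the "160 MB x 10" and "160 MB x 100" labels.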
Topics
• CAS (PAXOS)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)
Thanks
A Deep Dive Into Understanding Apache Cassandra

Inside Cassandra – C* is an interesting piece of software for many reasons, but it is especially interesting in its use of elegant data structures and algorithms. This talk will focus on the data structures and algorithms that make C* such a scalable and performant database. We will walk along the write, read and delete paths, exploring the low-level details of how each of these operations works. We will also explore some of the background processes that maintain availability and performance. The goal of this talk is to gain a deeper understanding of C* by exploring the low-level details of its implementation.

Published in: Technology
  • Regular hashing
  • Today we’ll explore the path of your data through C* to the disk and from disk back out to your application. We’ll discuss a lot of the processes and data formats and explore some of the data structures involved. I’m going to explore topics that were hard for me to understand or that I found interesting. I’m not going to talk about Thrift or CQL 3 or the intermediate layers between there and C* writing/reading your data. If you have free time at night or on the weekend, I highly recommend the C* internals talk. Git clone the repo and follow along with his talk. He gives a good intro to exploring the source code. Warning: It’s like reading a Wikipedia article. You never know where you’re going to end up.
  • write_request_timeout_in_ms – memory pressure on coordinator node
  • A Deep Dive Into Understanding Apache Cassandra

    1. 1. Inside Cassandra Michael Penick
    2. 2. Overview • To disk and back again • Cassandra Internals by Aaron Morton • Goals – RDBMS comparison to C* – Make educated decisions in configuration
    3. 3. Node 3Node 2 Node 1Node 0 Distributed Hashing A B C D E F G H I J K L M N O P Location = Hash(Key) % # Nodes
    4. 4. Node 4 Node 3Node 2 Node 1Node 0 Distributed Hashing A B C D F G H K J LP O M I N E % Data Moved = 100 * N / (N + 1)
    5. 5. Consistent Hashing 0 Node 1 Node 2Node 3 Node 4
    6. 6. Consistent Hashing 0 A E I M B F J N C G K O D H L P Add Node 0 A E I M B F J N C G K O D H L P % Data Moved = 100 * 1 / N
    7. 7. Virtual Nodes Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 num_tokens initial_token
    8. 8. Tunable Consistency 0 Node 1 Node 2 Node 3Node 4 Node 5 Node 6 replication_factor = 3 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
    9. 9. Tunable Consistency 0 Node 1 Node 2 Node 3Node 4 Node 5 Node 6 replication_factor = 3 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
    10. 10. Hinted Handoff 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 replication_factor = 3 and hinted_handoff_enabled = true R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY Write locally: system.hints Note: Does not count toward consistency level (except ANY)
    11. 11. Tunable Consistency 0 Node 1 Node 2 Node 3Node 4 Node 5 Node 6 R1 R2 R3 Client INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM 0 Node 1 Node 2 Node 3Node 4 Node 5 Node 6 R1 R2 Appends FWD_TO parameter to message
    12. 12. Read Repair 0 Node 1 Node 2 Node 3Node 4 Node 5 Node 6 R1 R2 R3 Client SELECT * FROM table USING CONSISTENCY ONE replication_factor = 3 and read_repair_chance > 0
    13. 13. Write Memory Disk Commit Log Memtable K1 C1:V1 C2:V2 K1 C1:V1 C2:V2 SSTable #1 K1 C1:V1 C2:V2 … … … Flush when: > commitlog_total_space_in_mb or > memtable_total_space_in_mb
    14. 14. Write Memory Disk Commit Log Memtable K1 C3:V3 K1 C3:V3 SSTable #1 SSTable #2 K1 C1:V1 C2:V2 … … … … Note: All writes are sequential! Physical Volume #1 Physical Volume #2 K1 C3:V3
    15. 15. Commit Log Mutation #3 Mutation #2 Mutation #1 Commit Log Executor Commit Log Allocator Segment #3 Segment #2 Segment #1 Segment #1 Commit Log File Memory Disk Commit Log File Commit Log File Flush! Write! commitlog_segment_size_in_mb
    16. 16. Commit Log • commitlog_sync 1. periodic (default) • commitlog_sync_period_in_ms (default: 10 seconds) 2. batch • commitlog_batch_window_in_ms
    17. 17. Memtable • ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows; • AtomicSortedColumns.Holder – DeletionInfo deletionInfo; // tombstone – SnapTreeMap<ByteBuffer, Column> map; • Goals – Fast operations – Fast concurrent access – Fast in-order iteration – Atomic/Isolated operations within a column family
    18. 18. Skip List 1 2 3 4 5 6 7 NIL NIL NIL NIL
    19. 19. Skip List Get 7 1 2 3 4 5 6 7 NIL NIL NIL NIL
    20. 20. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    21. 21. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    22. 22. Skip List Delete 4 1 2 3 5 6 7 NIL NIL NIL NIL
    23. 23. Skip List Insert 4 1 2 3 5 6 7 NIL NIL NIL NIL
    24. 24. Skip List Insert 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    25. 25. Skip List ConcurrentSkipListMap uses: p = 0.5
    26. 26. Skip List Insert 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    27. 27. Skip List H 1 3 T 2 H 1 3 T 2 C A S
    28. 28. Skip List while(true): next = current.next new_node.next = next if(CompareAndSwap(current.next, next, new_node)): break
    29. 29. Skip List H 1 3 T H 1 3 T 2 CAS I’m lost!
    30. 30. Skip List H 1 3 T C A S H 1 3 T H 1 3 T CAS
    31. 31. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    32. 32. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4 CAS
    33. 33. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    34. 34. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    35. 35. Skip List Insert 4 1 2 3 5 6 7 8 NIL NIL NIL NIL 4
    36. 36. Skip List Delete 4 1 2 3 4 5 6 7 NIL NIL NIL NIL
    37. 37. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    38. 38. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL CAS
    39. 39. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    40. 40. Skip List Delete 4 1 2 3 NIL 5 6 7 NIL NIL NIL NIL
    41. 41. Skip List Delete 4 1 2 3 5 6 7 NIL NIL NIL NIL
    42. 42. SnapTree 3 2 5 1 4 6 Node Balance Factor 1 0 2 1 3 0 4 0 5 0 6 0 Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
    43. 43. SnapTree 5 2 6 1 3 4 Node Balance Factor 1 0 2 -1 3 -1 4 0 5 2 6 0 Balance Factor must be -1, 0 or +1
    44. 44. SnapTree 5 3 4 A B C D 5 4 3 A B C D 4 3 5 A B C D Left-Right Case Left-Left Case
    45. 45. SnapTree 3 5 4 D CB A 3 4 5 DC B A 4 3 5 A B C D Right-Left Case Right-Right Case
    46. 46. SnapTree 5 2 6 1 3 4 Node Balance Factor 1 0 2 1 3 1 4 0 5 2 6 0 5 2 6 1 3 4 Node Balance Factor 1 0 2 -1 3 -1 4 0 5 2 6 0
    47. 47. SnapTree Node Balance Factor 1 0 2 1 3 1 4 0 5 2 6 0 5 2 6 1 3 4 Node Balance Factor 1 0 2 1 3 0 4 0 5 0 6 0 3 2 5 1 4 6
    48. 48. Epoch SnapTree 5 2 6 1 3 4 Root Lock 4 Version(5) is 0 Version(2) is 0 Does Version(5) == 0? Insert
    49. 49. Epoch SnapTree 5 2 6 1 3 4 Root 4Get Version(5) is 0 Version(2) is 0 Does Version(5) == 0?
    50. 50. Epoch SnapTree Root 5 2 6 1 3 4 4Get Does Version(5) == 0? NO! Go back to 5
    51. 51. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock : (
    52. 52. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock
    53. 53. Epoch SnapTree Root 4 3 2 5 1 4 6 3Delete Lock SetValue(3, null)
    54. 54. SnapTree Epoch #1 Root 3 2 5 1 4 6 Clone Stop Delete Insert
    55. 55. SnapTree Epoch #2 Root 3 2 5 1 4 6 Clone Epoch #3 Root I’m shared!
    56. 56. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert
    57. 57. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6
    58. 58. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6
    59. 59. SnapTree Epoch #2 Root 3 2 5 1 4 6 Epoch #3 Root 7Insert 3 2 5 1 4 6 7
    60. 60. Snap Tree C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307
    61. 61. SSTable Filter.db Data.db K1 K2 K3 C1 C1 C2 C2 C3 CRC.db 0xFFCC23ED 0x1FEA2321 0xCE652133 Index.db K1 K2 K3 00001 00002 00003 CompressionInfo.db 00001 00002 00003 00001 00004 00006 Compression? NoYes • CASSANDRA-2319 • Promote row index • CASSANDRA-4885 • Remove … per-row bloom filters
    62. 62. Delete • Essentially a write (mutation) • Data not removed immediately, but a tombstone record is added • tombstone time > gc_grace = data removed (compaction)
    63. 63. Bloom Filter
    64. 64. Bloom Filter 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 K1Hash Insert
    65. 65. Bloom Filter 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 K1Hash Insert
    66. 66. Bloom Filter 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 K1Hash InsertHashHash hash = murmur3(key) # creates two hashes for i in count(hash): result[i] = abs(hash[0] + i * hash[1]) % num_keys
    67. 67. Bloom Filter Bloom Filter Probability Calculation Config: bloom_filter_fp_chance, and SSTable: number of rows Num hashes, and Num bits per entry
    68. 68. Read Memory Disk Memtable K1 C4:V4 SSTable #2 K1 C3:V3 SSTable #1 K1 C1:V1 C2:V2 … … … Memtable K1 C5:V5 … K1 C4:V4C1:V1 C2:V2 C3:V3 C5:V5 Row Cache = Off-heap row_cache_size_in_mb > 0
    69. 69. Read Memory Disk Bloom Filter Key Cache Partition Summary Compression Offsets Partition Index Data Cache Hit Cache Miss = Off-heap key_cache_size_in_mb > 0 index_interval = 128 (default)
    70. 70. Compaction (Size-tiered)
    71. 71. Compaction (Size-tiered) min_compaction_threshold = 4 Memtable flush!
    72. 72. Compaction (Size-tiered)
    73. 73. Compaction (Leveled) Memtable flush!
    74. 74. Compaction (Leveled) L0: 160 MB L1: 160 MB x 10 sstable_size_in_mb = 160 L2: 160 MB x 100
    75. 75. Compaction (Leveled) L0: 160 MB L1: 160 MB x 10 L2: 160 MB x 100 …
    76. 76. Topics • CAS (PAXOS) • Anti-entropy (Merkle trees) • Gossip (Failure detection)
    77. 77. Thanks