Cassandra under the hood

This presentation is a deep dive into the Cassandra architecture.

Content:
- General Overview
- Data model
- Architecture
- Read & Write operations

Published in: Engineering

  1. 1. Cassandra under the hood, 2017
  2. 2. Who I am: Andriy Rymar, Java Software Engineer @ Lohika, more than 7 years of experience
  3. 3. What we won’t • Learn how to use Cassandra • Learn about performance tuning • Learn how to manage a cluster • Learn how to interact with Cassandra
  4. 4. What we will • Learn what Cassandra is
  5. 5. Content • General overview • Data model • Architecture • Read & Write operations 5
  6. 6. Preface 6
  7. 7. RDBMS • An RDBMS is not bad • RDBMSs have been successful for the last 40 years
  8. 8. RDBMS Issues • Slow queries due to complex joins; reindexing data takes a long time • Vertical scaling is expensive and horizontal scaling is problematic • Replicating the database hurts the availability of the system
  9. 9. CAP • consistency • availability • partition tolerance [Diagram labels: RDBMS, NoSQL, NoSQL]
  10. 10. CA, CP, AP • Consistency & Availability • Consistency & Partition-tolerance • Availability & Partition-tolerance 10
  11. 11. Eventual consistency • Eventually consistent system without any failures • Eventually consistent system with failures [Figure: replicas moving from value V0 to value V1 in both cases]
  12. 12. Solution Google BigTable 2004 Cassandra 2008 (2010 , 2013) Amazon Dynamo DB 2012 12
  13. 13. Cassandra General Overview 13
  14. 14. Cassandra cluster: tokens, seed node, and ring representation • Tokens determine the position of a node in the ring and the portion of data it owns [Figure: ring of nodes N1, N2, N3 with tokens A, G, R]
  15. 15. Cassandra cluster • Replication Factor (RF) = 2 • Example write: pk: «Taras», message: «Hello» [Figure: nodes N1, N2, N3 with primary ranges A-F, G-Q, R-Z and replica ranges G-Q, R-Z, A-F]
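The ring on slide 15 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: tokens are the first letters of the ranges from the slide (A, G, R), a key is placed by its first letter, and extra replicas go to the next nodes clockwise (the way Cassandra's SimpleStrategy walks the ring; the slide's picture may lay the replicas out differently). The class name `LetterRing` and the method `replicasFor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LetterRing {
    // Start-of-range token -> owning node, as on the slide: A-F, G-Q, R-Z.
    private final TreeMap<Character, String> ring = new TreeMap<>();

    LetterRing() {
        ring.put('A', "N1");
        ring.put('G', "N2");
        ring.put('R', "N3");
    }

    /** Returns the owner of the key's range, followed by the next nodes clockwise. */
    List<String> replicasFor(String partitionKey, int replicationFactor) {
        char token = Character.toUpperCase(partitionKey.charAt(0));
        // Greatest range start <= token; keys before 'A' wrap to the last range.
        Map.Entry<Character, String> owner = ring.floorEntry(token);
        Character cursor = (owner != null) ? owner.getKey() : ring.lastKey();

        List<String> replicas = new ArrayList<>();
        int wanted = Math.min(replicationFactor, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(ring.get(cursor));
            cursor = ring.higherKey(cursor);            // next node clockwise
            if (cursor == null) cursor = ring.firstKey(); // wrap around the ring
        }
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println(new LetterRing().replicasFor("Taras", 2));
    }
}
```

With RF = 2, the key «Taras» falls into R-Z, so under these assumptions it is stored on N3 and on the next node clockwise, N1.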
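  16. 16. Tokens: Issues • Initial token values must be managed manually for all nodes • Big overhead when restoring node data • Initial tokens (needs import java.math.BigInteger): for (int i = 0; i < CLUSTER_SIZE; i++) { System.out.println(BigInteger.TWO.pow(64).divide(BigInteger.valueOf(CLUSTER_SIZE)).multiply(BigInteger.valueOf(i)).subtract(BigInteger.TWO.pow(63))); } [Figure: ring N1, N2, N3 with Replication Factor (RF) = 2 and a new node N2]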
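The initial-token formula on slide 16, (2^64 / CLUSTER_SIZE) * i - 2^63, overflows a Java `long` in its intermediate steps, so a runnable version needs `BigInteger`. The sketch below is one way to write it; the class name `InitialTokens` and the method `evenlySpaced` are hypothetical, not part of Cassandra.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class InitialTokens {
    private static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);          // 2^64
    private static final BigInteger MIN_TOKEN = BigInteger.ONE.shiftLeft(63).negate(); // -2^63

    /** Evenly spaced tokens over the signed 64-bit ring, one per node. */
    static List<Long> evenlySpaced(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("clusterSize must be positive");
        }
        BigInteger step = RING_SIZE.divide(BigInteger.valueOf(clusterSize));
        List<Long> tokens = new ArrayList<>();
        for (int i = 0; i < clusterSize; i++) {
            // (2^64 / clusterSize) * i - 2^63 always fits into a long.
            tokens.add(step.multiply(BigInteger.valueOf(i)).add(MIN_TOKEN).longValueExact());
        }
        return tokens;
    }

    public static void main(String[] args) {
        evenlySpaced(4).forEach(System.out::println);
    }
}
```

For a 4-node cluster this yields -2^63, -2^62, 0 and 2^62. Every node needs its own value configured by hand, which is the management burden the slide lists as an issue.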
  17. 17. Virtual Nodes [Figure: ring of 12 virtual nodes (1-12) distributed across Server1, Server2, Server3, Server4]
  18. 18. Virtual Nodes: Data restoring • vnode = 3 • RF = 2 [Figure: servers S1, S2, S3, S4]
  19. 19. V-nodes: Summary • Rebalancing a cluster is no longer necessary when adding or removing nodes • More powerful machines can have more v-nodes, which makes it possible to build a heterogeneous Cassandra ring
  20. 20. Cassandra Data model 20
  21. 21. Introduction to the data model • KEYSPACE • Table (column family) • partition key • columns: column1, column2, column3 [Figure: example partitions model123 and demo14 with columns such as age, email (test@test.com), name]
  22. 22. Column family • RDBMS table user (name, email, title, age): Taras, tm@gm.com, Staff Engineer; Andriy, ar@gm.com, 27 • Column family user: key Taras → email: tm@gm.com, title: Staff Engineer; key Andriy → email: ar@gm.com, age: 27
  23. 23. Column family "user" : { "Taras" : { "email" : "tm@gm.com", "title" : "Staff Engineer" }, "Andriy" : { "email" : "ar@gm.com", "age" : "27" } }
  24. 24. Other differences • No relations (no joins) • Tuples (key-value pairs) are naturally sorted • You may want to denormalize the data model • No transactions
  25. 25. Type of keys • Primary key • Composite key • Partition key • Clustering key • Composite partition key 25
  26. 26. Example 1 CREATE TABLE album ( id uuid, name text, PRIMARY KEY (id) ); id is the primary key and the partition key at the same time
  27. 27. Composite key: Example 2 CREATE TABLE author_book ( author text, book text, population int, PRIMARY KEY (author, book) ); partition key: author; primary key: (author, book)
  28. 28. Example 3: Key with composite partition & clustering keys CREATE TABLE teacher_lesson ( teacher text, lesson text, topic text, duration int, PRIMARY KEY ((teacher, lesson), topic, duration) ); composite partition key: (teacher, lesson); clustering keys: topic, duration
  29. 29. Row vs Partition Rows Partitions Node 1 Node 2 1234 5678 9101112 1234:user 5678:user 1234:address 5678:address 1234:details 5678:details 29
  30. 30. Coffee break • General overview • Data model • Architecture • Read & Write operations 30
  31. 31. Cassandra Architecture 31
  32. 32. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 32
  33. 33. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 33
  34. 34. Messaging service • In a cluster of 5 nodes, each node has 8 open socket connections • Each node has 2 open socket connections with every other node
  35. 35. Gossip 35
  36. 36. Gossip: How does Cassandra initiate sessions? • One session with a random live node • One session with a random unreachable node • If the node in point 1 is not a seed node, one more session with a random seed node
  37. 37. Gossip: Session between N1 and N2 • 1: GossipSyncMessage • 2: GossipAckMessage • 3: GossipAck2Message
  38. 38. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 38
  39. 39. Failure detection: ϕ accrual failure detector • Does not return a TRUE / FALSE verdict • Provides a continuous value • This value is called «ϕ»
  40. 40. Failure detection: ϕ accrual failure detector [Figure: timeline of sessions 1-5 with 1 s and 2 s intervals]
  41. 41. Failure detection • Proposed by Xavier Défago in 2004: https://goo.gl/xS0kB0
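A minimal Java sketch of the ϕ idea from slides 39-41 follows. It assumes heartbeat intervals are exponentially distributed, so ϕ = elapsed / (mean interval × ln 10); the original paper models intervals with a normal distribution, and this simplification is chosen here only for brevity. The class `PhiAccrualDetector`, its methods, and the window size are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PhiAccrualDetector {
    private final int windowSize;
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    PhiAccrualDetector(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat (for example a gossip message) seen at the given time. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > windowSize) {
                intervalsMillis.removeFirst(); // keep only the most recent intervals
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Suspicion level: grows continuously the longer the peer stays silent. */
    double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) {
            return 0.0; // not enough history to judge
        }
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(0.0);
        if (mean <= 0.0) {
            return 0.0;
        }
        double elapsed = nowMillis - lastHeartbeatMillis;
        // -log10(exp(-elapsed / mean)) under the exponential assumption.
        return elapsed / (mean * Math.log(10.0));
    }
}
```

A caller compares `phi(now)` against a threshold of its own choosing instead of receiving a boolean, which lets each consumer decide how suspicious is suspicious enough.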
  42. 42. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 42
  43. 43. Partitioner [Figure: all terabytes of data split across nodes N1-N8]
  44. 44. Partitioner • Murmur3Partitioner • RandomPartitioner • ByteOrderedPartitioner
  45. 45. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 45
  46. 46. Replicator • Replication factor = 3 • Consistency Level = 2 [Figure: a write data request in a ring N1-N4 is replicated to N1, N2, N3; with Consistency Level = 2, N1 and N2 are highlighted]
  47. 47. Replicator: Consistency level • ZERO (write only): push and forget • ANY (write only): success even if the write is only hinted • ONE: first replica returned successfully • QUORUM: N/2 + 1 replicas succeed • ALL: all replicas succeed
  48. 48. Replicator: Inconsistency • 5 node cluster • Replication factor 3 • Consistency level 1 [Figure: ring N1-N5 with replicas N2, N3, N4; Write and Read arrows]
  49. 49. Replicator: Tuning • Use consistency levels with at least 1 node overlap (Quorum) • Replication factor = 3, Write CL = 2, Read CL = 2 [Figure: ring N1-N5 with replicas N2, N3, N4; Write and Read arrows]
  50. 50. Replicator: Tuning • Tune read and write CL separately to reach high performance • Fast write: Write CL = 1, Read CL = ALL • Fast read: Write CL = ALL, Read CL = 1
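The tuning advice on slides 47-50 reduces to simple arithmetic: a quorum is N/2 + 1 replicas, and a read is guaranteed to touch at least one replica that saw the latest write when Write CL + Read CL > RF. A small Java sketch, with the hypothetical helper names `quorum` and `readOverlapsWrite`:

```java
final class ConsistencyMath {
    /** Quorum size for a given replication factor: N/2 + 1 (integer division). */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must share at least one replica with every write set. */
    static boolean readOverlapsWrite(int writeCl, int readCl, int replicationFactor) {
        return writeCl + readCl > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("quorum(3) = " + quorum(rf));
        System.out.println("W=1, R=1   overlap: " + readOverlapsWrite(1, 1, rf));
        System.out.println("W=2, R=2   overlap: " + readOverlapsWrite(2, 2, rf));
        System.out.println("W=1, R=ALL overlap: " + readOverlapsWrite(1, rf, rf));
        System.out.println("W=ALL, R=1 overlap: " + readOverlapsWrite(rf, 1, rf));
    }
}
```

With RF = 3, the CL = 1 / CL = 1 combination from the inconsistency slide has no guaranteed overlap, while quorum reads and writes, fast writes (1 / ALL) and fast reads (ALL / 1) all do.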
  51. 51. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 51
  52. 52. Storage layer [Figure: Client → Mutation Request → append to Commit log (hdd) and add / update MemTable (mem); MemTable → Flush → SSTable (hdd); Commit log cleanup]
  53. 53. Storage layer [Figure: Client → Mutation Request → append to Commit log (hdd) and add / update MemTable (mem); MemTable → Flush → SSTable (hdd); Commit log cleanup]
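The write path in the storage-layer diagram (append to the commit log, add / update the MemTable, flush to an SSTable, clean up the commit log) can be sketched as a toy engine. Everything here is an assumption made for illustration: in-memory lists stand in for the files on disk, the flush trigger is a simple entry count, and the names `TinyStorageEngine`, `write`, and `flushThreshold` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class TinyStorageEngine {
    private final int flushThreshold;
    // Stand-ins for on-disk structures: a sequential log and immutable sorted tables.
    final List<String> commitLog = new ArrayList<>();
    final List<SortedMap<String, String>> ssTables = new ArrayList<>();
    // In-memory sorted buffer of recent writes.
    private TreeMap<String, String> memTable = new TreeMap<>();

    TinyStorageEngine(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. append (durability)
        memTable.put(key, value);         // 2. add / update in memory
        if (memTable.size() >= flushThreshold) {
            flush();                      // 3. flush once the MemTable is full
        }
    }

    private void flush() {
        // The MemTable becomes an immutable, sorted SSTable.
        ssTables.add(Collections.unmodifiableSortedMap(memTable));
        memTable = new TreeMap<>();
        commitLog.clear();                // 4. cleanup: flushed data no longer needs the log
    }

    int memTableSize() {
        return memTable.size();
    }
}
```

Note that an update to an existing key only rewrites the MemTable entry but still appends to the commit log, so the log can be longer than the MemTable.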
  54. 54. SSTable • On-disk representation of a MemTable • Immutable • Eventually gets merged into larger SSTable files (compaction) • Has the following components: Bloom filter, Index file, Data file
  55. 55. SSTable: Bloom filter • Used to determine which SSTable may contain a key • May return a false positive (but never a false negative) • Stored in heap memory
  56. 56. SSTable: Bloom filter [Figure: three rows of 17 cells: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0 | 0 0 5 0 0 2 0 1 1 3 0 0 1 0 0 2 0] murmur3(`key`) = 15
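A Bloom filter like the one pictured on slide 56 can be sketched with a `BitSet`. The slide hashes keys with murmur3; the Java standard library has no murmur3, so this sketch substitutes `String.hashCode()` combined with double hashing, which is purely an assumption for illustration. The class `SimpleBloomFilter` and its parameters are hypothetical.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        if (size < 1 || hashCount < 1) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for a key: h1 + i * h2 (double hashing), mapped into [0, size).
    private int index(String key, int i) {
        int h1 = key.hashCode();     // stand-in for murmur3
        int h2 = (h1 >>> 16) | 1;    // second hash derived from the first, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A `false` answer lets a read skip an SSTable without touching disk; a `true` answer may still be a false positive, so the index and data files have the final word.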
  57. 57. SSTable: Index file • Contains all row keys and their offsets in the data file • Every 128th key from the file is kept in memory • Binary search over the in-memory sample finds the right position in the index
  58. 58. SSTable: Index file [Figure: in memory, the Bloom filter (BF) and the sampled index (1, 128, 256, 384); on hdd, the index file with entries …, 126, 127, 128, …, 199, 200, 201, 202, 203, …; lookup of key 201]
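The sampled-index lookup from slides 57-58 can be sketched as follows: keep every 128th key in memory, binary-search that sample to pick a block, then scan at most one block of the full index. In this sketch a sorted `long[]` stands in for the on-disk index file and the returned position stands in for a data-file offset; the class `SampledIndex` and the method `find` are hypothetical.

```java
import java.util.Arrays;

final class SampledIndex {
    private final long[] indexFile; // stand-in for the sorted on-disk index
    private final long[] sample;    // every interval-th key, kept in memory
    private final int interval;

    SampledIndex(long[] sortedKeys, int interval) {
        this.indexFile = sortedKeys.clone();
        this.interval = interval;
        int sampleSize = (sortedKeys.length + interval - 1) / interval;
        this.sample = new long[sampleSize];
        for (int j = 0; j < sampleSize; j++) {
            sample[j] = indexFile[j * interval];
        }
    }

    /** Returns the position of the key in the index file, or -1 if it is absent. */
    int find(long key) {
        int pos = Arrays.binarySearch(sample, key);
        // On a miss, binarySearch returns -(insertionPoint) - 1; we want the block before it.
        int block = (pos >= 0) ? pos : -pos - 2;
        if (block < 0) {
            return -1; // key is smaller than every indexed key
        }
        int from = block * interval;
        int to = Math.min(from + interval, indexFile.length);
        for (int i = from; i < to; i++) { // scan a single block only
            if (indexFile[i] == key) {
                return i;
            }
        }
        return -1;
    }
}
```

With keys 1 to 500 and an interval of 128, looking up key 201 selects the second block and scans at most 128 entries instead of the whole index.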
  59. 59. Cassandra components API Tools Storage layer Partitioner Replicator Failure detector Compaction Manager Messaging layer 59
  60. 60. Compaction • Merges SSTables • There are two compaction strategies • size-tiered • leveled 60
  61. 61. Compaction: Size-tiered (Minor compaction) • Compaction Level = 2 [Figure: SSTables A, B, C, D, E, F being merged step by step]
  62. 62. Compaction: Size-tiered (Major compaction) [Figure: SSTables A, B, C, D, E, F]
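The core of the merge step that compaction performs can be sketched in Java: several SSTables are combined into one sorted table, and when the same key appears more than once the cell with the newest timestamp wins. This is a simplification under stated assumptions: tombstones, size tiers, and levels are left out, and the names `Compactor`, `Cell`, and `compact` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Compactor {
    /** A stored value together with its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Merges SSTables into one sorted table, keeping the newest cell per key. */
    static TreeMap<String, Cell> compact(List<Map<String, Cell>> ssTables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> table : ssTables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (existing, incoming) ->
                                incoming.timestamp > existing.timestamp ? incoming : existing);
            }
        }
        return merged;
    }
}
```

Because the winner is chosen by timestamp, the order in which the input tables are passed does not change the result.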
  63. 63. Data repair • Hinted handoff • Read repair • Anti-entropy 63
  64. 64. Coffee break • General overview • Data model • Architecture • Read & Write operations 64
  65. 65. Cassandra Read & Write operations 65
  66. 66. Write [Figure: Write Request → Node1 (StorageProxy) → Node2: Commit Log, MemTable, SSTable]
  67. 67. Write: Replication & Consistency • Replication Factor = 3 • Consistency Level = 2 • Anti-entropy / Read repair • Hinted handoff [Figure: nodes N1, N2, N3, N4]
  68. 68. Read Snitch Function Read request Node1 StorageProxy ? 68
  69. 69. Read Snitch Function • SimpleSnitch • DynamicSnitch • PropertyFileSnitch • GossipingPropertyFileSnitch • RackInferringSnitch … 69
  70. 70. Read Snitch functions SimpleSnitch N1 N2 N3 N4 N5 N6 N7 70
  71. 71. Read: Snitch functions: DynamicSnitch [Figure: nodes N1, N2, N3 with latencies 0.6 ms, 0.4 ms, 0.9 ms and ranks 1, 2, 3]
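The idea behind the DynamicSnitch slide, preferring replicas that have been answering fastest, can be sketched as a sort by observed latency. The node-to-latency pairing used in `main` is an assumption (the slide's layout does not survive extraction), real scoring is more involved than a single number per node, and the names `LatencyRanker` and `fastestFirst` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatencyRanker {
    /** Returns node names ordered from lowest to highest observed latency. */
    static List<String> fastestFirst(Map<String, Double> latencyMillisByNode) {
        List<String> nodes = new ArrayList<>(latencyMillisByNode.keySet());
        nodes.sort(Comparator.comparingDouble((String node) -> latencyMillisByNode.get(node)));
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Double> latencies = new HashMap<>();
        latencies.put("N1", 0.6); // assumed pairing of nodes and latencies
        latencies.put("N2", 0.4);
        latencies.put("N3", 0.9);
        System.out.println(fastestFirst(latencies));
    }
}
```

The coordinator would then send the data request to the first node in the list and use the rest for digests or as fallbacks.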
  72. 72. Read Snitch functions GossipingPropertyFileSnitch $CASSANDRA_HOME/conf/cassandra-rackdc.properties # indicate the rack and dc for this node dc=DC1 rack=RAC1 72
  73. 73. Read: In action • 7 nodes • RF = 4 • CL = 3 [Figure: the read request reaches Node1 (StorageProxy), which issues one "Read data" and two "Get digest" requests; nodes shown: Node 2 to Node 7, with Node 3, Node 4, Node 5 highlighted]
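The "Read data" plus "Get digest" pattern on slide 73 can be sketched like this: with CL = 3 the coordinator asks one replica for the full value and two others only for a digest, then compares. In this sketch `String.hashCode()` stands in for a real digest function, and the names `DigestRead`, `digest`, and `replicasAgree` are hypothetical.

```java
import java.util.List;

final class DigestRead {
    /** Stand-in digest; a real system would use a cryptographic or stronger hash. */
    static int digest(String value) {
        return value.hashCode();
    }

    /**
     * Compares the full value from the data replica with the digests returned
     * by the other replicas. A mismatch means the replicas disagree and the
     * coordinator has to fetch full data and reconcile (read repair).
     */
    static boolean replicasAgree(String dataFromReplica, List<Integer> digestsFromOthers) {
        int expected = digest(dataFromReplica);
        for (int other : digestsFromOthers) {
            if (other != expected) {
                return false;
            }
        }
        return true;
    }
}
```

Sending digests instead of full rows keeps network traffic low in the common case where all contacted replicas already agree.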
  74. 74. Read on node B SSTable I B B I 74 B
  75. 75. Almost the end 75
  76. 76. Throughput comparison (source: Nishant Neeraj, «Mastering Apache Cassandra - Second Edition»)
  77. 77. Publication: A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB, https://goo.gl/z5abRu
  78. 78. Summary 78
  79. 79. The end 79
  80. 80. Any questions / problems?
  81. 81. Resources • Nishant Neeraj, Mastering Apache Cassandra, 2015 • http://docs.datastax.com/en/cassandra/3.x/cassandra/cassandraAbout.html • Thank you
