It consists of three major parts:
1) Cassandra architecture and how reads and writes work. It is largely aligned with the official C* book by DataStax (the pictures are from there). It can be useful for those who have either never used Cassandra or still have some questions. During my on-site presentation I found that it is worth hearing even for those who read the book some time ago.
2) Data modeling with CQL3. It can help those who have never used Cassandra learn a little CQL3, and help those who worked with the pre-CQL3 approach understand what happens underneath the sweet CQL3 structures.
3) Remaining topics such as the DataStax Java Driver and known C* bugs.
3. Scalable eCommerce Platform Solutions
Highlights
• Distributed column-family database
• No SPOF (single point of failure)
• Decentralized
• Data is both partitioned and replicated
• Optimized for high write throughput
• A vs C in CAP is tunable at query time
• SEDA (staged event-driven architecture)
2/14/14
Virtual Nodes
• Going from one token and range per node to many tokens per node
• No manual assignment of tokens to nodes
• Load is evenly distributed when a node joins or leaves the cluster
• Improves the use of heterogeneous machines in a cluster
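To make the "many tokens per node" idea concrete, here is a minimal sketch that compares ring ownership with one token per node and with 256 tokens per node. The MD5-based `token` function, the 2^64 ring size, and the node names are assumptions made for illustration only; they are not Cassandra's actual partitioner or token allocation.

```python
import hashlib

RING_SIZE = 2 ** 64  # assumed token space for this sketch


def token(label):
    """Hypothetical token function: first 8 bytes of an MD5 digest."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes, tokens_per_node):
    """Return a sorted list of (token, node) pairs."""
    ring = [(token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)]
    return sorted(ring)


def ownership(ring):
    """Fraction of the ring owned by each node.

    A node owns the range that ends at each of its tokens.
    """
    owned = {node: 0 for _, node in ring}
    for index, (tok, node) in enumerate(ring):
        previous = ring[index - 1][0]  # index 0 wraps around to the last token
        owned[node] += (tok - previous) % RING_SIZE
    return {node: size / RING_SIZE for node, size in owned.items()}


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    for tokens_per_node in (1, 256):
        shares = ownership(build_ring(nodes, tokens_per_node))
        print(tokens_per_node, {n: round(s, 3) for n, s in sorted(shares.items())})
```

With many small ranges per node, each node's share tends to stay close to 1/N without anyone assigning tokens by hand.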
Key Data Distribution Components
• The partitioner calculates a token from the row key (this determines where to place the first replica of a row)
• The replication strategy determines the total number of replicas and where to place them
• The snitch defines the network topology, such as the location of nodes, grouping them by racks and data centers. It is used by
  • Replication strategy
  • Request routing (+ Dynamic Snitch)
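The interplay of partitioner and replication strategy can be sketched as follows. This is a simplified illustration: `toy_partitioner` is a made-up hash so that the numbers are easy to follow by hand, the ring tokens are hypothetical, and the placement rule (walk the ring clockwise and take the next distinct nodes) ignores racks and data centers, so the snitch plays no role here.

```python
from bisect import bisect_left


def toy_partitioner(row_key):
    """Hypothetical partitioner: maps a row key to a token in 0..99."""
    return sum(row_key.encode("utf-8")) % 100


def replicas_for(row_key, ring, replication_factor):
    """Pick replicas by walking the ring clockwise from the key's token.

    ring is a list of (token, node) pairs sorted by token. The first
    replica is the node whose token is the first one >= the key's token.
    """
    tokens = [tok for tok, _ in ring]
    start = bisect_left(tokens, toy_partitioner(row_key)) % len(ring)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:  # skip nodes already chosen (matters with vnodes)
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


if __name__ == "__main__":
    ring = [(0, "a"), (25, "b"), (50, "c"), (75, "d")]
    print(toy_partitioner("k1"), replicas_for("k1", ring, 3))
```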
Write Requests
• The coordinator node sends a write request to all replicas regardless of the Consistency Level (CL)
• It acknowledges the request once the CL is satisfied
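A minimal sketch of that rule, assuming a single data center: the write goes to every replica, and the CL only decides how many acknowledgements are needed. The function names and the up/down dictionary are hypothetical, and real coordinators work asynchronously with timeouts rather than counting in a loop.

```python
def required_acks(consistency_level, replication_factor):
    """Number of replica acknowledgements needed for a given CL."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def coordinate_write(replica_is_up, consistency_level):
    """Simulate a coordinator handling one write.

    replica_is_up maps each replica name to True (alive) or False (down).
    Returns (acknowledged, sent_to).
    """
    needed = required_acks(consistency_level, len(replica_is_up))
    sent_to = list(replica_is_up)  # the write is sent to ALL replicas
    acks = sum(1 for node in sent_to if replica_is_up[node])
    return acks >= needed, sent_to


if __name__ == "__main__":
    replicas = {"a": True, "b": True, "c": False}
    print(coordinate_write(replicas, "QUORUM"))
    print(coordinate_write(replicas, "ALL"))
```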
Read Requests - Optimistic Flow
• The coordinator node sends direct read requests to the CL number of fastest replicas (as ranked by the Dynamic Snitch)
  • 1 request for a full read
  • CL - 1 requests for digest reads
• If the digests match, the data is returned to the client
• Background read repair requests are sent to the other owners of that row, based on the read repair chance
Read Requests - Mismatch Case
• If there is a mismatch, the coordinator node sends direct full read requests to the same CL number of replicas
• The most recent copy is returned to the client
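Both read flows can be sketched in a few lines. This is a simulation under assumptions: each contacted replica is represented by the `(value, timestamp)` pair it holds, the first one serves the full read, and the MD5-based `digest` helper is a stand-in for whatever digest the replicas actually compute. Background read repair is left out.

```python
import hashlib


def digest(value, timestamp):
    """Hypothetical digest of a replica's copy of the data."""
    return hashlib.md5(f"{value}@{timestamp}".encode("utf-8")).hexdigest()


def coordinate_read(contacted):
    """Simulate a read over the CL replicas chosen by the coordinator.

    contacted is a list of (value, timestamp) pairs. The first replica
    answers a full read, the others answer digest reads.
    Returns (value, mismatch_detected).
    """
    full_value, full_timestamp = contacted[0]
    expected = digest(full_value, full_timestamp)
    if all(digest(v, ts) == expected for v, ts in contacted[1:]):
        return full_value, False  # optimistic flow: digests match
    # Mismatch case: full reads from the same replicas, newest copy wins.
    newest_value, _ = max(contacted, key=lambda cell: cell[1])
    return newest_value, True


if __name__ == "__main__":
    print(coordinate_read([("v1", 10), ("v1", 10)]))
    print(coordinate_read([("v1", 10), ("v2", 20)]))
```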
Write Path
• A flush to disk happens when the memtable size threshold, the commit log size threshold, or the heap utilization threshold is reached
• Never random disk I/O or modification in place
• Compaction runs in the background
• A delete just marks a column with a tombstone
• The commit log contains all mutations
• The memtable keeps track of the latest version of the data
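The write path above can be sketched as a toy in-memory store. Everything here is a simplification made for illustration: the class name `TinyStore`, the flush trigger (a plain key count instead of size or heap thresholds), and the fact that the commit log is never truncated are assumptions, not Cassandra behavior.

```python
TOMBSTONE = object()  # marker written instead of removing data in place


class TinyStore:
    """Toy write path: commit log append, memtable update, flush to SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only: contains all mutations
        self.memtable = {}     # latest version per key, in memory
        self.sstables = []     # immutable flushed tables, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # sequential append
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key, timestamp):
        # A delete is just another write that stores a tombstone.
        self.write(key, TOMBSTONE, timestamp)

    def flush(self):
        """Write the memtable out as a new sorted, immutable table."""
        if self.memtable:
            ordered = sorted(self.memtable.items(), key=lambda item: item[0])
            self.sstables.append(dict(ordered))
            self.memtable = {}
```

Nothing on "disk" is ever modified: a flush only adds a new table, which is why compaction is needed later to merge tables and drop tombstones.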
Read Path
• Each SSTable is read, the results are combined with the unflushed memtable(s), and the latest version is returned
• The KeyCache has a fixed size and is shared among all tables
• are stored off heap (v1.2.X)
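The merge step of the read path can be sketched like this. It is a simplification under assumptions: tables are plain dictionaries of `key -> (value, timestamp)`, a sentinel marks tombstones, and the key cache and other lookup structures from the slide are not modeled.

```python
TOMBSTONE = object()  # sentinel for deleted data in this sketch


def read(key, memtable, sstables):
    """Return the latest version of key across memtable and SSTables.

    memtable and each SSTable map key -> (value, timestamp).
    Returns None if the key is unknown or its newest version is a tombstone.
    """
    candidates = [table[key] for table in sstables if key in table]
    if key in memtable:
        candidates.append(memtable[key])
    if not candidates:
        return None
    value, _ = max(candidates, key=lambda cell: cell[1])  # newest timestamp wins
    return None if value is TOMBSTONE else value


if __name__ == "__main__":
    sstables = [{"a": ("old", 1)}, {"a": ("newer", 5), "b": ("x", 2)}]
    memtable = {"b": (TOMBSTONE, 9)}
    print(read("a", memtable, sstables), read("b", memtable, sstables))
```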
ACID
• Atomicity
  • a write is atomic at the row level
  • no rollback if a write fails on some replicas
• Consistency
  • tunable through CL requirements (C vs A)
  • strong consistency when W + R > N (W = replicas written, R = replicas read, N = replication factor)
• Isolation
  • row-level
• Durability
  • yes, but
  • the commit log is fsynced every 10 seconds by default
• Lightweight transactions in Cassandra 2.0
  • for INSERT and UPDATE statements
  • using the IF clause
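The W + R > N rule from the Consistency bullet is easy to check by hand. A minimal sketch, assuming a single data center and only the ONE, QUORUM and ALL levels (the helper names are hypothetical):

```python
def replicas_for_level(consistency_level, replication_factor):
    """How many replicas a CL waits for (single data center assumed)."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {consistency_level}")


def is_strongly_consistent(write_level, read_level, replication_factor):
    """True when W + R > N, so read and write replica sets must overlap."""
    w = replicas_for_level(write_level, replication_factor)
    r = replicas_for_level(read_level, replication_factor)
    return w + r > replication_factor


if __name__ == "__main__":
    for levels in (("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")):
        print(levels, is_strongly_consistent(*levels, replication_factor=3))
```

With N = 3, QUORUM writes plus QUORUM reads give 2 + 2 > 3, while ONE plus ONE gives 1 + 1 = 2, which is not enough.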
Built-in Repair Tools
• Hinted handoff
  • does not count towards the CL requirement
  • if CL.ANY is used, the data is not readable until at least one normal owner has recovered
• Read repair
• Anti-entropy node repair
Data Modeling
• Read by partition key
• Reduce the number of reads
  • aggregate data that is used together in a single row
  • even at the expense of extra writes to duplicate some data
• Writes should not depend on reads
• Keep metadata overhead low
CQL3 Overview
• It looks like SQL
• Compound keys
• Standard data types are built-in
• Collection types
• Asynchronous queries
• Tracing of queries
• … and more
Simple Row / CQL3
CREATE TABLE simple_table (
    my_key int PRIMARY KEY,
    my_field_1 text,
    my_field_2 boolean
);

INSERT INTO simple_table (my_key, my_field_1, my_field_2) VALUES (1, 'my value 1', false);
INSERT INTO simple_table (my_key, my_field_1, my_field_2) VALUES (2, 'my value 2', true);

SELECT * FROM simple_table;

 my_key | my_field_1 | my_field_2
--------+------------+------------
      1 | my value 1 |      False
      2 | my value 2 |       True
Simple Row / Internal
[default@test] list simple_table;
-------------------
RowKey: 1
=> (name=, value=, timestamp=1395180822477000)
=> (name=my_field_1, value=6d792076616c75652031, timestamp=1395180822477000)
=> (name=my_field_2, value=00, timestamp=1395180822477000)
-------------------
RowKey: 2
=> (name=, value=, timestamp=1395180822480000)
=> (name=my_field_1, value=6d792076616c75652032, timestamp=1395180822480000)
=> (name=my_field_2, value=01, timestamp=1395180822480000)
1. The column name (its size is proportional to the column name length) and a timestamp are stored for each column
2. There is an additional “empty” column per row
Compound Key / CQL3
CREATE TABLE compound_key_table (
    my_part_key int,
    my_clust_key text,
    my_field int,
    PRIMARY KEY (my_part_key, my_clust_key)
);

INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 2', 2);
INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 1', 1);
INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 3', 3);
SELECT * FROM compound_key_table;

 my_part_key | my_clust_key | my_field
-------------+--------------+----------
           1 |   my value 1 |        1
           1 |   my value 2 |        2
           1 |   my value 3 |        3
Compound Key / Internal
[default@test] list compound_key_table;
-------------------
RowKey: 1
=> (name=my value 1:, value=, timestamp=1395192704575000)
=> (name=my value 1:my_field, value=00000001, timestamp=1395192704575000)
=> (name=my value 2:, value=, timestamp=1395192704572000)
=> (name=my value 2:my_field, value=00000002, timestamp=1395192704572000)
=> (name=my value 3:, value=, timestamp=1395192704577000)
=> (name=my value 3:my_field, value=00000003, timestamp=1395192704577000)
1. All three CQL3 rows are in the same physical row, so a single read operation can read all of them
2. They can still be read or updated individually (you need to know the PK; use a lookup table)
3. The value of the ‘my_clust_key’ column is joined with the ‘my_field’ column name and becomes the internal column name under which my_field’s value is stored
4. The value of ‘my_clust_key’ doesn’t have an associated timestamp, since it is part of the PK
5. The CQL3 rows are sorted by the value of ‘my_clust_key’, which can be used in a ‘where’ clause
6. There is an additional “empty” column per CQL3 row
7. PK column names are not stored in the row; they are kept in system.schema_columnfamilies
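The mapping from CQL3 rows to internal columns listed above can be mimicked with a short sketch. It is an illustration only: the function names are made up, the internal names are modeled as plain `clustering value:column name` strings, and sorting them as strings stands in for the real composite comparator.

```python
def internal_columns(clustering_value, fields):
    """Internal columns produced by one CQL3 row of a compound-key table.

    fields maps a CQL3 column name to its serialized value (bytes).
    """
    columns = [(f"{clustering_value}:", b"")]  # the extra "empty" marker column
    for name, value in fields.items():
        columns.append((f"{clustering_value}:{name}", value))
    return columns


def physical_row(cql_rows):
    """All internal columns of one partition, sorted by internal name."""
    columns = []
    for clustering_value, fields in cql_rows:
        columns.extend(internal_columns(clustering_value, fields))
    return sorted(columns, key=lambda column: column[0])


if __name__ == "__main__":
    rows = [
        ("my value 2", {"my_field": (2).to_bytes(4, "big")}),
        ("my value 1", {"my_field": (1).to_bytes(4, "big")}),
        ("my value 3", {"my_field": (3).to_bytes(4, "big")}),
    ]
    for name, value in physical_row(rows):
        print(f"name={name}, value={value.hex()}")
```

The insert order does not matter: the columns end up ordered by clustering value, which matches the listing from the CLI.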
Collection Type / Internal
[default@test] list collection_type_table;
-------------------
RowKey: 1
=> (name=, value=, timestamp=1395253516706000)
=> (name=my_list:d1da8820af9311e38f4e97aee9b28d0c, value=00000001, timestamp=1395253516706000)
=> (name=my_list:d1da8821af9311e38f4e97aee9b28d0c, value=00000002, timestamp=1395253516706000)
=> (name=my_map:00000001, value=00000002, timestamp=1395253516706000)
=> (name=my_map:00000003, value=00000004, timestamp=1395253516706000)
=> (name=my_set:00000001, value=, timestamp=1395253516706000)
=> (name=my_set:00000002, value=, timestamp=1395253516706000)
1. Each element of each collection gets its own column
2. Each element of a List additionally consumes 16 bytes in its column name (a UUID) to maintain the order of elements
3. Map key goes to column name
4. Set value goes to column name
Column Overhead
• name : 2 bytes (length as short int) + byte[]
• flags : 1 byte
  • if counter column : 8 bytes (timestamp of last delete)
  • if expiring column : 4 bytes (TTL) + 4 bytes (local deletion time)
• timestamp : 8 bytes (long)
• value : 4 bytes (length as int) + byte[]
http://btoddb-cass-storage.blogspot.ru/2011/07/column-overhead-and-sizing-every-column.html
Metadata Overhead
• Simple case (no TTL and not a counter column):
  • regular_column_size = column_name_size + column_value_size + 15 bytes
  • a row has 23 bytes of overhead
• A column named “my_column” of type int stores your 4 bytes and incurs 24 bytes of overhead
• Keep this in mind when internal columns are created for CQL3 structures like compound keys or collection types
• Keep this in mind when a column value is used as the column name for many other columns
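The arithmetic behind these numbers can be written out as a small sketch. The 15 fixed bytes are the sum of the per-column fields from the Column Overhead slide (2 + 1 + 8 + 4); the function names are made up for illustration, and the 23-byte row overhead is not modeled.

```python
NAME_LENGTH_BYTES = 2   # length of the name, stored as a short
FLAGS_BYTES = 1
TIMESTAMP_BYTES = 8     # long
VALUE_LENGTH_BYTES = 4  # length of the value, stored as an int
FIXED_BYTES = (NAME_LENGTH_BYTES + FLAGS_BYTES
               + TIMESTAMP_BYTES + VALUE_LENGTH_BYTES)  # 15


def column_overhead(column_name):
    """Bytes stored for a regular column besides the value itself."""
    return len(column_name.encode("utf-8")) + FIXED_BYTES


def regular_column_size(column_name, value_size):
    """Total size of a regular column (no TTL, not a counter)."""
    return column_overhead(column_name) + value_size


def overhead_fraction(column_name, value_size):
    """Share of the stored bytes that is metadata rather than payload."""
    return column_overhead(column_name) / regular_column_size(column_name, value_size)


if __name__ == "__main__":
    print(column_overhead("my_column"))          # 9 name bytes + 15 fixed bytes
    print(regular_column_size("my_column", 4))   # an int value
    print(round(overhead_fraction("my_column", 4), 3))
    print(round(overhead_fraction("my_column", 1024), 3))  # a 1 kB value
```

The same helper shows why long internal column names hurt: a name built from a clustering value plus a field name adds its full length to every single column.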
JSON vs Separate Columns
• Drastically reduces metadata overhead
  • A column named “my_column” of type text that stores your 1 kB JSON object and incurs 24 bytes of overhead sounds much better!
• Saves CPU cycles and reduces read latency
• Supports complex hierarchical structures
• But it loses on partial reads / updates and complicates schema versioning
DataStax Java Driver
• Flexible load balancing policies
  • includes token-aware load balancing
• Connection pooling
• Flexible retry policy
  • can retry on other nodes
  • or reduce the CL requirement
• Non-blocking I/O
  • up to 128 simultaneous requests per connection
  • asynchronous API
• Node discovery
Multi-gets
• When you have N keys and want to read them all
• The built-in token-aware load balancer evaluates only the first key and sends all N keys to that node! Oops…
• We preferred sending N fine-grained single-get queries in async mode
  • retries only those which failed
  • can return a partial result
  • smart routing for each key
• We tried a multi-get-aware token-aware load balancer
  • it worked worse
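The "N single gets in async mode" approach can be sketched as below. This is not the DataStax Java Driver API: it is a Python `asyncio` simulation of the pattern, where `fetch_one` is a hypothetical coroutine standing in for one asynchronous single-key query and `make_simulated_fetch` is a fake backend used only to exercise the retry logic.

```python
import asyncio


async def multi_get(keys, fetch_one, max_attempts=2):
    """Issue one async query per key, retrying only the keys that failed.

    Returns (results, failed_keys), so a partial result is still usable.
    """
    results = {}
    pending = list(keys)
    for _ in range(max_attempts):
        if not pending:
            break
        outcomes = await asyncio.gather(
            *(fetch_one(key) for key in pending), return_exceptions=True)
        failed = []
        for key, outcome in zip(pending, outcomes):
            if isinstance(outcome, Exception):
                failed.append(key)      # retry just this key next round
            else:
                results[key] = outcome
        pending = failed
    return results, pending


def make_simulated_fetch(always_fail=(), fail_once=()):
    """Fake backend for the sketch: returns (fetch_one, attempts_by_key)."""
    attempts = {}

    async def fetch_one(key):
        attempts[key] = attempts.get(key, 0) + 1
        if key in always_fail or (key in fail_once and attempts[key] == 1):
            raise TimeoutError(f"simulated timeout for {key}")
        return key.upper()

    return fetch_one, attempts


if __name__ == "__main__":
    fetch, _ = make_simulated_fetch(always_fail={"c"}, fail_once={"b"})
    print(asyncio.run(multi_get(["a", "b", "c"], fetch)))
```

Because every key travels in its own request, a token-aware policy can route each one to a replica that owns it, and one slow key does not force a retry of the whole batch.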
Data Loader
• partitions the whole data set (MOD N)
• sorts all result sets by product id
• accumulates assembled products and executes batch writes to C*
• single connection per reader thread
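One possible reading of these loader steps, as a sketch: the slide's diagram is not available, so the row shape `(product_id, fields)`, the helper names, and the idea that each reader merges several sorted result sets are all assumptions.

```python
import heapq
from itertools import groupby, islice


def reader_for(product_id, reader_count):
    """MOD N partitioning: which reader thread handles this product."""
    return product_id % reader_count


def assemble_products(*sorted_result_sets):
    """Merge result sets that are each sorted by product id.

    Every row is (product_id, fields_dict). Rows with the same id are
    combined into one assembled product.
    """
    merged = heapq.merge(*sorted_result_sets, key=lambda row: row[0])
    for product_id, rows in groupby(merged, key=lambda row: row[0]):
        product = {}
        for _, fields in rows:
            product.update(fields)
        yield product_id, product


def batches(items, batch_size):
    """Group assembled products into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    names = [(1, {"name": "tv"}), (3, {"name": "radio"})]
    prices = [(1, {"price": 10}), (3, {"price": 5})]
    for batch in batches(assemble_products(names, prices), 2):
        print(batch)  # each batch would become one batch write
```

Sorting every result set by product id is what lets the merge assemble one product at a time without holding the whole data set in memory.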
OOM #1
• select count (*) from product limit 75000000;
• wait for timeout
• hmm, try again (arrow up, enter)
• select count (*) from product limit 75000000;
• wait for timeout
• again
OOM #2
• Try the following in production and get a permanent vacation
  • truncate, drop, create table
  • load data there
  • start a light read load
• Up to all C* nodes can hit OOM simultaneously
• That is called high availability!
DROP/CREATE without TRUNCATE
• SSTable files are still on disk after DROP
• CREATE triggers reading of the files
• and C* fails…