6. Cassandra Write Path
❏ SSTable => Sorted Strings Table.
❏ Write to Disk: merging and pre-sorting happen in memory before the flush.
❏ SSTables are IMMUTABLE.
❏ Compaction happens:
❏ From time to time
❏ Prunes deleted data
❏ Has trade-offs
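The write path above can be sketched as a toy model. This is a minimal illustration, not Cassandra's implementation: the `Memtable` class and its `write`/`flush` methods are hypothetical names chosen for this example. It shows writes being merged in memory, then sorted and frozen into an immutable structure at flush time.

```python
from typing import Dict, Tuple

Row = Tuple[str, str]  # (key, value)


class Memtable:
    """Toy in-memory write buffer (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Writes to the same key are merged in memory: the last one wins.
        self._data[key] = value

    def flush(self) -> Tuple[Row, ...]:
        # Sort by key and freeze into a tuple to mimic an immutable SSTable.
        sstable = tuple(sorted(self._data.items()))
        self._data = {}
        return sstable


memtable = Memtable()
memtable.write("user:2", "bob")
memtable.write("user:1", "alice")
memtable.write("user:2", "bobby")  # overwrites the earlier write in memory
sstable = memtable.flush()
```

Because the flushed tuple cannot be modified, later updates to the same key end up in a newer SSTable, which is why compaction is needed to merge them.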
7. Tombstones
❏ Deleted data is MARKED as Removed == Tombstone
❏ Data is physically removed only during compaction
❏ That removal can take a few days, depending on the configuration (gc_grace_seconds defaults to 10 days).
❏ Queries on a partition with lots of tombstones require lots of filtering, which can slow down Cassandra performance.
❏ Collection operations can lead to tombstones, depending on what you do.
❏ There are Compaction Trade-Offs.
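How tombstones and compaction interact can be sketched with a small model. Everything here is an assumption made for illustration: SSTables are plain dicts, a value of `None` stands for a tombstone, and the `gc_grace` parameter is loosely modelled on the gc_grace_seconds table option. The newest timestamp wins, and a tombstone is only purged once it is older than the grace period.

```python
from typing import Dict, List, Optional, Tuple

# key -> (write_timestamp, value); a value of None marks a tombstone.
Cell = Tuple[int, Optional[str]]
SSTable = Dict[str, Cell]


def compact(sstables: List[SSTable], now: int, gc_grace: int) -> SSTable:
    """Merge SSTables: newest timestamp wins, expired tombstones are dropped."""
    merged: SSTable = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)

    result: SSTable = {}
    for key in sorted(merged):
        ts, value = merged[key]
        if value is None and now - ts >= gc_grace:
            continue  # tombstone older than the grace period: purge it
        result[key] = (ts, value)
    return result


old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (5, None), "b": (6, "v2"), "c": (9, None)}
compacted = compact([old, new], now=20, gc_grace=12)
```

In this run the tombstone for `a` is old enough to be purged together with the data it shadows, while the tombstone for `c` survives and would still have to be filtered by reads.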
8. Compaction Strategies
❏ STCS (SizeTieredCompactionStrategy)
❏ Default
❏ Insert-Heavy
❏ General Workloads
❏ LCS (LeveledCompactionStrategy)
❏ Read Heavy
❏ More Updates than Inserts
❏ DTCS (DateTieredCompactionStrategy)
❏ Time Series
❏ Inserts out of order
❏ Updates for old data
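To give a feel for what "size tiered" means, here is a simplified sketch of bucketing SSTables by similar size. It is an assumption-laden model, not the real strategy: the function names are hypothetical, and the `low`, `high` and `min_threshold` parameters only echo the STCS options bucket_low, bucket_high and min_threshold.

```python
from typing import List


def size_tiered_buckets(
    sizes: List[int], low: float = 0.5, high: float = 1.5
) -> List[List[int]]:
    """Group SSTable sizes into buckets of roughly similar size."""
    buckets: List[List[int]] = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * low <= size <= avg * high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes: List[int], min_threshold: int = 4) -> List[List[int]]:
    """Buckets holding at least min_threshold SSTables are worth compacting."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]


sizes_mb = [10, 11, 12, 9, 100, 105, 400]
candidates = compaction_candidates(sizes_mb)
```

Only the four small tables form a bucket big enough to compact; the larger ones wait until more tables of similar size accumulate.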
9. Cassandra ROW CACHE
❏ Caches the FULL merged row in memory
❏ Greatly increases read throughput
❏ Row Cache works with Key Cache
❏ Key Cache = Where the partition is on DISK.
CREATE TABLE status (
user text,
status_id timeuuid,
status text,
PRIMARY KEY (user, status_id))
WITH CLUSTERING ORDER BY (status_id DESC)
AND caching = {'keys': 'ALL', 'rows_per_partition': '10'};
10. Cassandra Bloom Filter
❏ Bloom Filter: Technique created in the 1970s to filter DB matches.
❏ Space Efficient
❏ Probabilistic Data Structure
❏ For each SSTable there is a Bloom Filter
❏ Used for index (key) lookups, not for range scans
❏ Stored OFF HEAP
❏ Tunable per TABLE
❏ Cassandra uses bloom filters to know whether an SSTable may contain the requested data or definitely does not.
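A bloom filter is easy to sketch. The version below is a minimal illustration under stated assumptions: the `BloomFilter` class, its sizes, and the use of SHA-256 with a seed prefix are choices made for this example and are not how Cassandra implements its filters. The key property is that a lookup can return a false positive but never a false negative.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy bloom filter (hypothetical parameters, for illustration only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for key in ("alice", "bob", "carol"):
    sstable_filter.add(key)
```

With one such filter per SSTable, a read can skip every SSTable whose filter answers `False` and only touch disk for the ones that answer `True`.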
12. SASI
❏ Secondary Index: an index on a column that is not the primary key.
❏ Lookup tables: bySomething
❏ Distributed Index
❏ LIKE-style search capabilities: %diego%
❏ Great when:
❏ Multi-field search
❏ You know the partition key
❏ Indexing static columns
❏ Issues:
❏ More than 1000 rows returned
❏ Searching in Large Partitions
❏ Aggressive Read SLOs
❏ Search for Analytics(Use Spark/Flink)
❏ Ordering of search results is important
13. SASI Samples
❏ SELECT * FROM users WHERE firstname LIKE 'Die%';
❏ SELECT * FROM users WHERE lastname LIKE '%ie%';
❏ SELECT * FROM users WHERE
created_date > '2015-01-02' AND created_date < '2017-01-02';
14. Materialized Views
❏ Automated denormalization: the table is managed for you
❏ Copies of the data in different partitions / replicas
❏ Some Write penalty but acceptable performance
❏ Store results in table which can be indexed
❏ Updated asynchronously
❏ Great For:
❏ Caching
❏ Result Sets
❏ Dashboards
SAMPLE
-- The view's PRIMARY KEY must include every primary key column of the base table.
CREATE MATERIALIZED VIEW all_time_high AS
SELECT user FROM scores WHERE
game IS NOT NULL AND
score IS NOT NULL
PRIMARY KEY (game, score) WITH CLUSTERING ORDER BY (score DESC);
15. Cassandra Counter Family
❏ Static VS Dynamic Column families
❏ Dynamic Column families A.K.A Wide Rows
❏ Wide Rows are good for: Ordering, Grouping and Filtering.
❏ Wide Rows are not split across NODES.
❏ Counters Internally:
❏ Value is calculated as the sum across all replicas
❏ Split into fragments called SHARDs.
❏ Logical clock increases monotonically
❏ 3-tuple = { NODE_COUNTER_ID, SHARD_LOGICAL_CLOCK, SHARD_VALUE }
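The shard 3-tuple above suggests how a counter value can be reconciled. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, and the rule "keep the shard with the highest logical clock per node, then sum the values" is an illustration of the idea rather than Cassandra's exact merge logic.

```python
from typing import Dict, Iterable, Tuple

# (node_counter_id, shard_logical_clock, shard_value)
Shard = Tuple[str, int, int]


def merge_shards(replica_views: Iterable[Iterable[Shard]]) -> Dict[str, Shard]:
    """Keep, for each node id, the shard with the highest logical clock."""
    merged: Dict[str, Shard] = {}
    for view in replica_views:
        for shard in view:
            node_id, clock, _ = shard
            if node_id not in merged or clock > merged[node_id][1]:
                merged[node_id] = shard
    return merged


def counter_value(shards: Dict[str, Shard]) -> int:
    """The counter's value is the sum of the surviving shard values."""
    return sum(value for _, _, value in shards.values())


replica_a = [("node-1", 3, 30), ("node-2", 1, 5)]
replica_b = [("node-1", 2, 20), ("node-2", 2, 8), ("node-3", 1, 1)]
merged = merge_shards([replica_a, replica_b])
total = counter_value(merged)
```

Here replica A has the newest shard for node-1 and replica B has the newest for node-2 and node-3, so the merged total is 30 + 8 + 1.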
16. Anti-Patterns
❏ Using Cassandra as a queue or queue-like table
❏ Tombstones
❏ Lots of deleted columns (expiry) and slice queries don't play well together
❏ http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
❏ CQL Nulls
❏ Reading Tombstones
❏ Writing NULL creates tombstones
❏ Intensive Updates on SAME column
❏ Sensor table (ID,VALUE)
❏ Physical Limits
❏ Solution: Timestamp as clustering key.
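The queue anti-pattern can be illustrated with a toy model. This is a sketch under stated assumptions: a partition is represented as an ordered list of `(key, value)` pairs, `None` stands for a tombstone left by a consumed item, and `read_first_live` is a hypothetical helper. It counts how many cells a read has to scan before it finds live data.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, Optional[str]]  # (clustering key, value or None for a tombstone)


def read_first_live(partition: List[Cell]) -> Tuple[Optional[str], int]:
    """Return the first live value and how many cells were scanned to find it."""
    scanned = 0
    for _, value in partition:
        scanned += 1
        if value is not None:
            return value, scanned
    return None, scanned


# 1000 consumed (deleted) queue items followed by one pending item.
queue_partition: List[Cell] = [(i, None) for i in range(1000)]
queue_partition.append((1000, "job-1000"))
value, scanned = read_first_live(queue_partition)
```

Reading a single pending item costs 1001 scanned cells in this model, which is the kind of tombstone filtering that makes queue-like tables slow until compaction purges the deleted entries.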