Cassandra SF Meetup - CQL Performance With Apache Cassandra 3.X

SF CASSANDRA USERS MARCH 2016
CQL PERFORMANCE WITH APACHE
CASSANDRA 3.0
Aaron Morton
@aaronmorton
CEO
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

About The Last Pickle.
Work with clients to deliver and improve Apache Cassandra
based solutions.
Apache Cassandra Committer and DataStax MVPs.
Based in New Zealand, Australia, France & USA.

How We Got Here
Storage Engine 3.0
Write Path
Read Path

How We Got Here
Way back in 2011…

2011
Blog: Cassandra Query Plans
http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html

2012
Talk: Technical Deep Dive -
Query Performance
https://www.youtube.com/watch?v=gomOKhMV0zc

2012
Explain Read & Write
performance in 45 minutes.

Skip Forward to 2016
Blog: Introduction To The
Apache Cassandra 3.x Storage
Engine
http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

“Why don’t I do another talk
about Cassandra
performance?”

It was a busy 4 years…

CQL 3, Collection Types,
UDTs, UDFs, UDAs,
Materialised Views, Triggers,
SASI, …

Explain Read & Write
performance in 45 minutes.

So Let’s Avoid
CQL 3, Collection Types,
UDTs, UDFs, UDAs,
Materialised Views, Triggers,
SASI, …

Storage Engine 3.0 Files
Data.db
Index.db
Filter.db

Storage Engine 3.0 Files
CompressionInfo.db
Statistics.db
Digest.crc32
CRC.db
Summary.db
TOC.txt

CQL Recap
CREATE TABLE my_table (
    partition_1 text,
    cluster_1 text,
    foo text,
    bar text,
    baz text,
    PRIMARY KEY (partition_1, cluster_1)
);

CQL Recap
WARNING:
FAKE DATA AHEAD

CQL With Thrift Pre 3.0
[default@dev] list my_table;
-------------------
RowKey: part_a
=> (column=clust_a:, value=, timestamp=1357…739000)
=> (column=clust_a:foo, value=some foo, timestamp=1357…739000)
=> (column=clust_a:bar, value=and bar, timestamp=1357…739000)
=> (column=clust_a:baz, value=no baz, timestamp=1357…739000)
=> (column=clust_b:, value=, timestamp=1357…739000)
=> (column=clust_b:foo, value=no foo, timestamp=1357…739000)
=> (column=clust_b:bar, value=no bar, timestamp=1357…739000)
=> (column=clust_b:baz, value=lots baz, timestamp=1357…739000)

CQL Pre 3.0
Clustering Keys Repeated
Column Names Repeated
Timestamps Repeated
Fixed Width Encoding
No Knowledge Of Row Contents

Storage Engine 3.0 Improvements
Delta Encoding
Variable Int Encoding
Clustering Written Once
Aggregated Metadata
Cell Presence

SerializationHeader
For each SSTable*.
Stored in each SSTable.
Held in memory.

SerializationHeader
public class SerializationHeader
{
    private final AbstractType<?> keyType;
    private final List<AbstractType<?>> clusteringTypes;
    private final PartitionColumns columns;
    private final EncodingStats stats;
    …
}

EncodingStats
Collected on the fly by the
Memtable.

EncodingStats
public class EncodingStats
{
    public final long minTimestamp;
    public final int minLocalDeletionTime;
    public final int minTTL;
    …
}

SerializationHeader
public class SerializationHeader
{
    public void writeTimestamp(long timestamp, DataOutputPlus out) throws IOException
    {
        // Only the delta from the SSTable's minimum timestamp is written.
        out.writeUnsignedVInt(timestamp - stats.minTimestamp);
    }
    …
}

VIntCoding
public class VIntCoding
{
    public static void writeUnsignedVInt(long value, DataOutput output) throws IOException
    {
        int size = VIntCoding.computeUnsignedVIntSize(value);
        // Small values fit in a single byte and are written directly.
        if (size == 1)
        {
            output.write((int) value);
            return;
        }
        output.write(VIntCoding.encodeVInt(value, size), 0, size);
    }
    …
}
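
A minimal sketch of why delta encoding and variable-length integers work well together. It uses a generic 7-bits-per-byte unsigned varint, which is an assumption made for illustration and not necessarily the byte layout produced by Cassandra's VIntCoding. The class name, method names and timestamps are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a generic 7-bits-per-byte unsigned varint,
// not Cassandra's VIntCoding format.
final class DeltaVarintSketch {

    // Delta against the smallest timestamp seen, in the spirit of
    // writeTimestamp() subtracting stats.minTimestamp.
    static long delta(long timestamp, long minTimestamp) {
        return timestamp - minTimestamp;
    }

    // Encode a non-negative value, low 7 bits first. The high bit of each
    // byte means "more bytes follow".
    static byte[] encodeUnsignedVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long minTimestamp = 1357000000000000L; // hypothetical microsecond timestamps
        long timestamp = 1357000000739000L;
        int raw = encodeUnsignedVarint(timestamp).length;
        int asDelta = encodeUnsignedVarint(delta(timestamp, minTimestamp)).length;
        System.out.println("varint of raw timestamp: " + raw + " bytes");
        System.out.println("varint of delta:         " + asDelta + " bytes");
    }
}
```

With these made-up values the raw timestamp still needs 8 bytes as a varint, while the delta needs 3. The saving comes from the delta being small, which is why the minimum is tracked per SSTable.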

Storage Engine 3.0 Partition Header
Partition Key
Partition Deletion Information

Storage Engine 3.0 Row
Clustering Information
Row Level Liveness
Row Level Deletion
Column Presence
Columns

Storage Engine 3.0 Clustering Block
Clustering Cell Presence
Clustering Cells

Storage Engine 3.0 Improvements
Delta Encoding
Variable Int Encoding
Clustering Written Once
Aggregated Cell Metadata
Cell Presence

Only store Cell Timestamp, TTL, and
Local Deletion Time if different to
the Row.

Simple Cell Component                             Byte Size
Flags                                             1
Optional Cell Timestamp (delta) varint            1…n
Optional Cell Local Deletion Time (delta) varint  1…n
Optional Cell TTL (delta) varint                  1…n
Optional Cell Value                               See Below
Fixed Width Cell Value                            Byte Size
Value                                             1…n
Variable Width Cell Value                         Byte Size
Value Length varint                               1…n
Value                                             1…n
Apache Cassandra 3.0 Storage Engine

Cell Presence
The SSTable stores a list of the Cells in
this SSTable.
Each Row stores a bitmap of the Cells in
this Row, with reference to the SSTable list.
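
A minimal sketch of the cell presence idea: the SSTable keeps one ordered list of column names, and each row only records which positions in that list it has values for. The class and method names are hypothetical, and a single long is assumed to be enough (at most 64 columns). The real on-disk encoding differs in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: one bit per column position in the SSTable's column list.
final class CellPresenceSketch {

    // Build the bitmap for one row: bit i is set when the row has a cell
    // for the i-th column in the SSTable's list.
    static long presenceBitmap(List<String> sstableColumns, Set<String> rowColumns) {
        if (sstableColumns.size() > 64) {
            throw new IllegalArgumentException("sketch supports at most 64 columns");
        }
        long bitmap = 0L;
        for (int i = 0; i < sstableColumns.size(); i++) {
            if (rowColumns.contains(sstableColumns.get(i))) {
                bitmap |= 1L << i;
            }
        }
        return bitmap;
    }

    // Recover the column names for a row from the bitmap and the SSTable's list.
    static List<String> columnsPresent(List<String> sstableColumns, long bitmap) {
        List<String> present = new ArrayList<>();
        for (int i = 0; i < sstableColumns.size(); i++) {
            if ((bitmap & (1L << i)) != 0) {
                present.add(sstableColumns.get(i));
            }
        }
        return present;
    }
}
```

The column names are written once per SSTable rather than once per cell, which is the contrast with the Thrift-era layout shown earlier.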

Remember Where We Came From
[default@dev] list my_table;
-------------------
RowKey: part_a
=> (column=clust_a:, value=, timestamp=1357…739000)
=> (column=clust_a:foo, value=some foo, timestamp=1357…739000)
=> (column=clust_a:bar, value=and bar, timestamp=1357…739000)
=> (column=clust_a:baz, value=no baz, timestamp=1357…739000)
=> (column=clust_b:, value=, timestamp=1357…739000)
=> (column=clust_b:foo, value=no foo, timestamp=1357…739000)
=> (column=clust_b:bar, value=no bar, timestamp=1357…739000)
=> (column=clust_b:baz, value=lots baz, timestamp=1357…739000)

Write Path
Commit Log
Merge Into Memtable

Commit Log
Allocate space in the current
commit log segment.

Allocate Segment
o.a.c.m.CommitLog.WaitingOnSegmentAllocation.95thPercentile

Merge Into Memtable
Find the Partition.
Loop trying to update the
Rows in it using CAS.

Merge Into Memtable
If more than 10MB of allocations
have been wasted, move to
pessimistic locking on the
Partition object.
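
A minimal sketch of the pattern described on the two slides above: update optimistically with compare-and-set, count the work thrown away when a CAS loses a race, and make writers queue on a lock once the waste passes a limit. The class, its fields and the allocationBytes parameter are hypothetical and are not Cassandra's implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative only: optimistic CAS updates that fall back to a lock
// once too much work has been wasted on failed attempts.
final class CasThenLockSketch<T> {
    private final AtomicReference<T> state;
    private final AtomicLong wastedBytes = new AtomicLong();
    private final long wasteLimitBytes;
    private final Object lock = new Object();

    CasThenLockSketch(T initial, long wasteLimitBytes) {
        this.state = new AtomicReference<>(initial);
        this.wasteLimitBytes = wasteLimitBytes;
    }

    // allocationBytes is the caller's estimate of what one attempt allocates.
    T update(UnaryOperator<T> change, long allocationBytes) {
        if (wastedBytes.get() > wasteLimitBytes) {
            // Heavy contention: serialise writers so attempts stop colliding.
            synchronized (lock) {
                return casLoop(change, allocationBytes);
            }
        }
        return casLoop(change, allocationBytes);
    }

    private T casLoop(UnaryOperator<T> change, long allocationBytes) {
        while (true) {
            T current = state.get();
            T updated = change.apply(current);
            if (state.compareAndSet(current, updated)) {
                return updated;
            }
            // Another writer won the race: this attempt's allocation is wasted.
            wastedBytes.addAndGet(allocationBytes);
        }
    }

    T get() {
        return state.get();
    }

    long getWastedBytes() {
        return wastedBytes.get();
    }
}
```

With a single writer nothing is wasted and the lock is never taken. The pessimistic path only matters when many threads write to the same partition.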

Read Paths
Ignoring Index Read paths.

Read Commands
PartitionRangeReadCommand
SinglePartitionReadCommand

AbstractClusteringIndexFilter
ClusteringIndexNamesFilter
(When we know the column names.)
ClusteringIndexSliceFilter
(When we do not know the column names.)

When we know what
Columns to select, we know
when the search is over.

1. Get Partition From Memtables.
2. Filter named columns into a temporary
result.
3. Select SSTables that may contain Partition
Key.
4. Order in descending max timestamp order.
5. Read from SSTables in order.

Names Filter Short Circuits
If result has a Partition Deletion
newer than next SSTable max
timestamp.
Stop Search.

If read all Columns and max
timestamp of next SSTable less than
selected Columns min timestamp.
Stop Search.

Note: list of Columns
remaining to select is pruned
after every SSTable is read
based on max timestamp.
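
A minimal sketch of the names filter steps and the second Stop Search rule above: read SSTables from newest to oldest and stop once every wanted column has been found and the next SSTable cannot hold anything newer. All class and field names are hypothetical, the partition deletion rule is left out, and SSTable contents are simplified to an in-memory map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a names-filter style read with one short circuit.
final class NamesFilterSketch {
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    static final class SSTable {
        final long maxTimestamp;
        final Map<String, Cell> cells;
        SSTable(long maxTimestamp, Map<String, Cell> cells) {
            this.maxTimestamp = maxTimestamp;
            this.cells = cells;
        }
    }

    static final class Result {
        final Map<String, Cell> cells = new HashMap<>();
        int sstablesRead;
    }

    static Result read(Set<String> wanted, Map<String, Cell> memtable, List<SSTable> sstables) {
        Result result = new Result();
        merge(result.cells, memtable, wanted);            // steps 1 and 2
        List<SSTable> ordered = new ArrayList<>(sstables); // step 3 is assumed done by the caller
        ordered.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp)); // step 4
        for (SSTable sstable : ordered) {                  // step 5
            boolean haveAll = result.cells.keySet().containsAll(wanted);
            if (haveAll && sstable.maxTimestamp < minTimestamp(result.cells)) {
                break; // nothing in this or any older SSTable can win
            }
            merge(result.cells, sstable.cells, wanted);
            result.sstablesRead++;
        }
        return result;
    }

    // Keep the newest cell for each wanted column.
    private static void merge(Map<String, Cell> into, Map<String, Cell> from, Set<String> wanted) {
        for (String name : wanted) {
            Cell candidate = from.get(name);
            Cell existing = into.get(name);
            if (candidate != null && (existing == null || candidate.timestamp > existing.timestamp)) {
                into.put(name, candidate);
            }
        }
    }

    private static long minTimestamp(Map<String, Cell> cells) {
        long min = Long.MAX_VALUE;
        for (Cell cell : cells.values()) {
            min = Math.min(min, cell.timestamp);
        }
        return min;
    }
}
```

The sstablesRead counter is only there to make the effect visible: when the newest SSTable already holds every wanted column, older SSTables are never touched.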

If search clustering value not within
clustering range in the SSTable.
Skip SSTable.

If SSTable Cell not in search set.
Skip reading value.

When we do not know which
columns to select, the search
ends when it is exhausted.

Used with:
Distinct.
Not all clustering columns
restricted.

1. Get Partition From Memtables.
2. Create Iterators for Partitions.
3. Select SSTables that may contain Partition
Key.
4. Order in reverse max timestamp order.
5. Create Iterators for SSTables in order.

Slice Filter Short Circuits
If SSTable max timestamp is before
max seen Partition Deletion
timestamp.
Stop Search.
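
A minimal sketch of the slice filter short circuit above: walk the SSTables from newest to oldest and stop as soon as one is entirely older than the newest Partition Deletion seen so far. The names are hypothetical, and each SSTable's partition deletion timestamp is assumed to be known up front, whereas the real read path discovers it while reading.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: choose which SSTables a slice read still has to visit.
final class SliceFilterSketch {
    static final long NO_DELETION = Long.MIN_VALUE;

    static final class SSTable {
        final String name;
        final long maxTimestamp;
        final long partitionDeletionTimestamp; // NO_DELETION when the partition is not deleted here
        SSTable(String name, long maxTimestamp, long partitionDeletionTimestamp) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
            this.partitionDeletionTimestamp = partitionDeletionTimestamp;
        }
    }

    static List<String> sstablesToRead(List<SSTable> candidates) {
        List<SSTable> ordered = new ArrayList<>(candidates);
        // Newest first: descending max timestamp.
        ordered.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp));
        long maxSeenDeletion = NO_DELETION;
        List<String> selected = new ArrayList<>();
        for (SSTable sstable : ordered) {
            if (sstable.maxTimestamp < maxSeenDeletion) {
                break; // everything here, and in older SSTables, is shadowed by the deletion
            }
            selected.add(sstable.name);
            maxSeenDeletion = Math.max(maxSeenDeletion, sstable.partitionDeletionTimestamp);
        }
        return selected;
    }
}
```

Because the list is sorted by max timestamp, the first SSTable that is fully shadowed ends the search for all the older ones as well.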

So…
3.x is awesome.
Start using it as soon as
possible.

Aaron Morton
@aaronmorton
Co-Founder & Principal Consultant
www.thelastpickle.com
