
Cassandra SF Meetup - CQL Performance With Apache Cassandra 3.X

Discussing the Apache Cassandra 3.0 storage engine and its performance characteristics.

Published in: Technology

  1. SF CASSANDRA USERS MARCH 2016: CQL PERFORMANCE WITH APACHE CASSANDRA 3.0. Aaron Morton, @aaronmorton, CEO. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  2. About The Last Pickle: Work with clients to deliver and improve Apache Cassandra based solutions. Apache Cassandra Committer and DataStax MVPs. Based in New Zealand, Australia, France & USA.
  3. How We Got Here, Storage Engine 3.0, Write Path, Read Path
  4. How We Got Here: Way back in 2011…
  5. 2011 Blog: Cassandra Query Plans, http://thelastpickle.com/blog/2011/07/04/Cassandra-Query-Plans.html
  6. 2012 Talk: Technical Deep Dive - Query Performance, https://www.youtube.com/watch?v=gomOKhMV0zc
  7. 2012: Explain Read & Write performance in 45 minutes.
  8. Skip Forward to 2016: Blog: Introduction To The Apache Cassandra 3.x Storage Engine, http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html
  9. Skip Forward to 2016: “Why don’t I do another talk about Cassandra performance?”
  10. Skip Forward to 2016: It was a busy 4 years…
  11. Skip Forward to 2016: CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…
  12. Skip Forward to 2016: Explain Read & Write performance in 45 minutes.
  13. So Let’s Avoid: CQL 3, Collection Types, UDTs, UDFs, UDAs, Materialised Views, Triggers, SASI,…
  14. How We Got Here, Storage Engine 3.0, Write Path, Read Path
  15. High Level Storage Engine 3.0
  16. Storage Engine 3.0 Files: Data.db, Index.db, Filter.db
  17. Storage Engine 3.0 Files: CompressionInfo.db, Statistics.db, Digest.crc32, CRC.db, Summary.db, TOC.txt
  18. CQL Recap:
      create table my_table (
          partition_1 text,
          cluster_1 text,
          foo text,
          bar text,
          baz text,
          PRIMARY KEY (partition_1, cluster_1)
      );
  19. CQL Recap: WARNING: FAKE DATA AHEAD
  20. CQL With Thrift Pre 3.0:
      [default@dev] list my_table;
      -------------------
      RowKey: part_a
      => (column=clust_a:, value=, timestamp=1357…739000)
      => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
      => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
      => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
      => (column=clust_b:, value=, timestamp=1357…739000)
      => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
      => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
      => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
  21. CQL Pre 3.0: Clustering Keys Repeated, Column Names Repeated, Timestamps Repeated, Fixed Width Encoding, No Knowledge Of Row Contents
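
The repetition listed on the slide above can be reproduced by flattening one CQL row from the recap into the cell names that the Thrift listing prints. This is only a minimal sketch: `LegacyCellNames`, `cellNames` and `repeatedPrefixChars` are hypothetical names, and the code mimics the printed `clustering:column` form rather than the actual pre-3.0 on-disk encoding.

```java
import java.util.ArrayList;
import java.util.List;

final class LegacyCellNames {
    /** Builds the "clustering:column" names the Thrift listing shows for one CQL row. */
    static List<String> cellNames(String clustering, List<String> columns) {
        List<String> names = new ArrayList<>();
        // The first cell of each row in the listing has an empty column name.
        names.add(clustering + ":");
        for (String column : columns) {
            // The clustering value is repeated in front of every column name.
            names.add(clustering + ":" + column);
        }
        return names;
    }

    /** Characters spent on the clustering value alone across all cells of the row. */
    static int repeatedPrefixChars(String clustering, List<String> columns) {
        return clustering.length() * (columns.size() + 1);
    }
}
```

For the row `clust_a` with columns `foo`, `bar` and `baz`, the clustering value appears four times, once per cell, which is the "Clustering Keys Repeated" cost the slide refers to.
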
  22. Storage Engine 3.0 Improvements: Delta Encoding, Variable Int Encoding, Clustering Written Once, Aggregated Metadata, Cell Presence
  23. SerializationHeader: For each SSTable*. Stored in each SSTable. Held in memory.
  24. SerializationHeader:
      public class SerializationHeader
      {
          private final AbstractType<?> keyType;
          private final List<AbstractType<?>> clusteringTypes;
          private final PartitionColumns columns;
          private final EncodingStats stats;
          …
      }
  25. EncodingStats: Collected on the fly by the Memtable.
  26. EncodingStats:
      public class EncodingStats
      {
          public final long minTimestamp;
          public final int minLocalDeletionTime;
          public final int minTTL;
          …
      }
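
"Collected on the fly" can be pictured as keeping running minimums while writes arrive. The sketch below is an illustration under that assumption only: `EncodingStatsSketch` and its methods are hypothetical, and it ignores any special handling the real class may apply to cells without a TTL or deletion time.

```java
final class EncodingStatsSketch {
    // Start from the largest possible values so the first observed value always wins.
    private long minTimestamp = Long.MAX_VALUE;
    private int minLocalDeletionTime = Integer.MAX_VALUE;
    private int minTtl = Integer.MAX_VALUE;

    /** Folds the metadata of one written cell into the running minimums. */
    void update(long timestamp, int localDeletionTime, int ttl) {
        minTimestamp = Math.min(minTimestamp, timestamp);
        minLocalDeletionTime = Math.min(minLocalDeletionTime, localDeletionTime);
        minTtl = Math.min(minTtl, ttl);
    }

    long minTimestamp() { return minTimestamp; }

    int minLocalDeletionTime() { return minLocalDeletionTime; }

    int minTtl() { return minTtl; }
}
```

The three minimums are the base values that later deltas are measured against.
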
  27. SerializationHeader:
      public class SerializationHeader
      {
          public void writeTimestamp(long timestamp, DataOutputPlus out) throws IOException
          {
              out.writeUnsignedVInt(timestamp - stats.minTimestamp);
          }
          …
      }
  28. VIntCoding:
      public class VIntCoding
      {
          public static void writeUnsignedVInt(long value, DataOutput output) throws IOException
          {
              int size = VIntCoding.computeUnsignedVIntSize(value);
              if (size == 1)
              {
                  output.write((int)value);
                  return;
              }
              output.write(VIntCoding.encodeVInt(value, size), 0, size);
          }
          …
      }
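
To see why delta encoding and variable-length integers work well together, here is a self-contained sketch of an unsigned variable-length integer in which the number of leading 1 bits in the first byte gives the number of extra bytes. `VIntSketch` and all of its methods are hypothetical names for illustration; the layout is modelled on the idea shown on the slides and has not been checked against Cassandra's on-disk format.

```java
final class VIntSketch {
    /** Bytes needed: 7 payload bits per byte, up to 9 bytes for a full 64-bit value. */
    static int size(long value) {
        int size = 1;
        while (size < 9 && (value >>> (7 * size)) != 0) {
            size++;
        }
        return size;
    }

    /** Writes the value big-endian, then marks the extra byte count in the first byte. */
    static byte[] encode(long value) {
        int size = size(value);
        byte[] out = new byte[size];
        long rest = value;
        for (int i = size - 1; i >= 0; i--) {
            out[i] = (byte) rest;
            rest >>>= 8;
        }
        // (size - 1) leading 1 bits in byte 0 tell the reader how many bytes follow.
        out[0] |= (byte) ~(0xFF >> (size - 1));
        return out;
    }

    /** Reverses encode(): counts the leading 1 bits, then reads the payload. */
    static long decode(byte[] in) {
        int first = in[0] & 0xFF;
        int extra = Integer.numberOfLeadingZeros(~first & 0xFF) - 24;
        long value = first & (0xFF >> extra);
        for (int i = 1; i <= extra; i++) {
            value = (value << 8) | (in[i] & 0xFF);
        }
        return value;
    }

    /** Delta against the SSTable-wide minimum, mirroring writeTimestamp on the slide. */
    static byte[] encodeTimestamp(long timestamp, long minTimestamp) {
        return encode(timestamp - minTimestamp);
    }
}
```

With a made-up microsecond timestamp of `1457000000123456L` and a minimum of `1457000000000000L`, the delta is 123456 and fits in 3 bytes, where a fixed-width long always takes 8.
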
  29. Storage Engine 3.0 Improvements: Delta Encoding, Variable Int Encoding, Clustering Written Once, Aggregated Metadata, Cell Presence
  30. CQL With Thrift Pre 3.0:
      [default@dev] list my_table;
      -------------------
      RowKey: part_a
      => (column=clust_a:, value=, timestamp=1357…739000)
      => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
      => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
      => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
      => (column=clust_b:, value=, timestamp=1357…739000)
      => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
      => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
      => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
  31. Storage Engine 3.0 Data.db
  32. Storage Engine 3.0 Partition Header: Partition Key, Partition Deletion Information
  33. Storage Engine 3.0 Partition Header
  34. Storage Engine 3.0 Row: Clustering Information, Row Level Liveness, Row Level Deletion, Column Presence, Columns
  35. Storage Engine 3.0 Row
  36. Storage Engine 3.0 Clustering Block: Clustering Cell Presence, Clustering Cells
  37. Storage Engine 3.0 Clustering Block
  38. Storage Engine 3.0 Improvements: Delta Encoding, Variable Int Encoding, Clustering Written Once, Aggregated Cell Metadata, Cell Presence
  39. CQL With Thrift Pre 3.0:
      [default@dev] list my_table;
      -------------------
      RowKey: part_a
      => (column=clust_a:, value=, timestamp=1357…739000)
      => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
      => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
      => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
      => (column=clust_b:, value=, timestamp=1357…739000)
      => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
      => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
      => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
  40. Aggregated Cell Metadata: Only store Cell Timestamp, TTL, and Local Deletion Time if different to the Row.
  41. Aggregated Cell Metadata (Apache Cassandra 3.0 Storage Engine):
      Simple Cell Component                              Byte Size
      Flags                                              1
      Optional Cell Timestamp (delta) varint             1…n
      Optional Cell Local Deletion Time (delta) varint   1…n
      Optional Cell TTL (delta) varint                   1…n
      Optional Cell Value                                See Below

      Fixed Width Cell Value                             Byte Size
      Value                                              1…n

      Variable Width Cell Value                          Byte Size
      Value Length varint                                1…n
      Value                                              1…n
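
The layout above implies that a flags byte tells the reader which optional components follow. The sketch below shows one way that decision could look. Everything in it is an assumption made for illustration: `CellFlagsSketch`, the flag names and their bit values are hypothetical, and TTL stands in for both TTL and local deletion time.

```java
import java.util.ArrayList;
import java.util.List;

final class CellFlagsSketch {
    // Hypothetical flag bits, chosen for this sketch only.
    static final int HAS_EMPTY_VALUE = 0x01;
    static final int USE_ROW_TIMESTAMP = 0x02;
    static final int USE_ROW_TTL = 0x04;

    /** Sets a "use the row's value" bit for each piece of metadata equal to the row's. */
    static int flags(long cellTimestamp, int cellTtl, long rowTimestamp, int rowTtl, int valueLength) {
        int flags = 0;
        if (valueLength == 0) {
            flags |= HAS_EMPTY_VALUE;
        }
        if (cellTimestamp == rowTimestamp) {
            flags |= USE_ROW_TIMESTAMP;
        }
        if (cellTtl == rowTtl) {
            flags |= USE_ROW_TTL;
        }
        return flags;
    }

    /** Lists the components that would actually be written for a cell with these flags. */
    static List<String> writtenComponents(int flags) {
        List<String> parts = new ArrayList<>();
        parts.add("flags");
        if ((flags & USE_ROW_TIMESTAMP) == 0) {
            parts.add("timestamp");
        }
        if ((flags & USE_ROW_TTL) == 0) {
            parts.add("ttl");
        }
        if ((flags & HAS_EMPTY_VALUE) == 0) {
            parts.add("value");
        }
        return parts;
    }
}
```

When a whole row is written in one statement, every cell shares the row's timestamp and TTL, so each cell shrinks to the flags byte plus its value.
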
  42. Storage Engine 3.0 Improvements: Delta Encoding, Variable Int Encoding, Clustering Written Once, Aggregated Cell Metadata, Cell Presence
  43. Cell Presence: The SSTable stores the list of Cells in this SSTable. Each Row stores a bitmap of the Cells in this Row, with reference to the SSTable list.
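
A bitmap relative to an SSTable-level column list can be sketched as follows. This is an illustration of the idea only: `ColumnPresenceSketch` and its methods are hypothetical, a set bit here means "present", and the sketch is limited to 64 columns; the real encoding may differ on all three points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ColumnPresenceSketch {
    /** Bit i is set when the i-th column of the SSTable-level list has a cell in this row. */
    static long presenceBitmap(List<String> sstableColumns, Set<String> rowColumns) {
        if (sstableColumns.size() > 64) {
            throw new IllegalArgumentException("sketch supports at most 64 columns");
        }
        long bitmap = 0L;
        for (int i = 0; i < sstableColumns.size(); i++) {
            if (rowColumns.contains(sstableColumns.get(i))) {
                bitmap |= 1L << i;
            }
        }
        return bitmap;
    }

    /** Recovers the row's column names from the bitmap and the SSTable-level list. */
    static List<String> columnsInRow(List<String> sstableColumns, long bitmap) {
        List<String> present = new ArrayList<>();
        for (int i = 0; i < sstableColumns.size(); i++) {
            if ((bitmap & (1L << i)) != 0) {
                present.add(sstableColumns.get(i));
            }
        }
        return present;
    }
}
```

The column names are stored once per SSTable, and each row pays only a few bits to say which of them it contains.
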
  44. Storage Engine 3.0 Row
  45. Remember Where We Came From:
      [default@dev] list my_table;
      -------------------
      RowKey: part_a
      => (column=clust_a:, value=, timestamp=1357…739000)
      => (column=clust_a:foo, value=some foo, timestamp=1357…739000)
      => (column=clust_a:bar, value=and bar, timestamp=1357…739000)
      => (column=clust_a:baz, value=no baz, timestamp=1357…739000)
      => (column=clust_b:, value=, timestamp=1357…739000)
      => (column=clust_b:foo, value=no foo, timestamp=1357…739000)
      => (column=clust_b:bar, value=no bar, timestamp=1357…739000)
      => (column=clust_b:baz, value=lots baz, timestamp=1357…739000)
  46. How We Got Here, Storage Engine 3.0, Write Path, Read Path
  47. Write Path: Commit Log, Merge Into Memtable
  48. Commit Log: Allocate space in the current commit log segment.
  49. Allocate Segment: o.a.c.m.CommitLog.WaitingOnSegmentAllocation.95thPercentile
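
Allocating space in the current segment can be pictured as reserving a byte range with a compare-and-set on a position counter. The sketch below assumes that model; `SegmentSketch`, its fields and the `-1` return convention are hypothetical and are not taken from the Cassandra source.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SegmentSketch {
    private final int capacity;
    private final AtomicInteger allocatePosition = new AtomicInteger(0);

    SegmentSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Reserves size bytes and returns the start offset, or -1 when the segment is full. */
    int allocate(int size) {
        while (true) {
            int start = allocatePosition.get();
            int end = start + size;
            if (end > capacity) {
                // In this sketch the caller would have to wait for a new segment.
                return -1;
            }
            if (allocatePosition.compareAndSet(start, end)) {
                return start;
            }
            // Another writer reserved space first; retry from the new position.
        }
    }
}
```

Under this model, a write stalls only in the `-1` case, which is presumably the wait that the metric on the slide above is meant to expose.
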
  50. Merge Into Memtable: Find the Partition. Loop trying to update the Rows in it using CAS.
  51. Merge Into Memtable: If more than 10MB of allocations are wasted, move to pessimistic locking on the Partition object.
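
The two slides above describe an optimistic CAS loop that falls back to locking once failed attempts have wasted too much memory. A minimal sketch of that pattern follows. `PartitionSketch`, the `bytesPerAttempt` parameter and the way waste is counted are assumptions for illustration; only the 10MB threshold comes from the slide.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

final class PartitionSketch<T> {
    // The 10MB threshold from the slide.
    static final long WASTE_LIMIT_BYTES = 10L * 1024 * 1024;

    private final AtomicReference<T> rows;
    private final AtomicLong wastedBytes = new AtomicLong();
    private final Object lock = new Object();

    PartitionSketch(T initial) {
        this.rows = new AtomicReference<>(initial);
    }

    /** Applies merge to the current rows; bytesPerAttempt is the assumed cost of one attempt. */
    T update(UnaryOperator<T> merge, long bytesPerAttempt) {
        if (usesLocking()) {
            // Pessimistic path: contended writers queue up instead of racing.
            synchronized (lock) {
                return casLoop(merge, bytesPerAttempt);
            }
        }
        return casLoop(merge, bytesPerAttempt);
    }

    private T casLoop(UnaryOperator<T> merge, long bytesPerAttempt) {
        while (true) {
            T current = rows.get();
            T merged = merge.apply(current);
            if (rows.compareAndSet(current, merged)) {
                return merged;
            }
            // Lost the race: the merged copy is discarded, so count it as waste.
            wastedBytes.addAndGet(bytesPerAttempt);
        }
    }

    boolean usesLocking() {
        return wastedBytes.get() > WASTE_LIMIT_BYTES;
    }

    T current() {
        return rows.get();
    }
}
```

The CAS is kept on the locked path as well, because writers that have not yet noticed the threshold may still be updating without the lock.
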
  52. How We Got Here, Storage Engine 3.0, Write Path, Read Path
  53. Read Paths: Ignoring Index Read paths.
  54. Read Commands: PartitionRangeReadCommand, SinglePartitionReadCommand
  55. AbstractClusteringIndexFilter: ClusteringIndexNamesFilter (when we know the column names), ClusteringIndexSliceFilter (when we do not know the column names).
  56. ClusteringIndexNamesFilter: When we know what Columns to select, we know when the search is over.
  57. ClusteringIndexNamesFilter:
      1. Get Partition From Memtables.
      2. Filter named columns into a temporary result.
      3. Select SSTables that may contain Partition Key.
      4. Order in descending timestamp order.
      5. Read from SSTables in order.
  58. Names Filter Short Circuits: If the result has a Partition Deletion newer than the next SSTable's max timestamp, stop the search.
  59. Names Filter Short Circuits: If all Columns have been read and the max timestamp of the next SSTable is less than the min timestamp of the selected Columns, stop the search.
  60. Names Filter Short Circuits: Note: the list of Columns remaining to select is pruned after every SSTable is read, based on max timestamp.
  61. Names Filter Short Circuits: If the search clustering value is not within the clustering range in the SSTable, skip the SSTable.
  62. Names Filter Short Circuits: If an SSTable Cell is not in the search set, skip reading its value.
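
The names filter steps and the "read all Columns" short circuit above can be sketched as a loop over SSTables ordered by max timestamp. This is a simplified model, not the Cassandra implementation: `NamesFilterSketch` and its nested types are hypothetical, a cell is reduced to its write timestamp, and partition deletions and clustering ranges are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NamesFilterSketch {
    /** One SSTable's view of a partition: column name -> write timestamp of its cell. */
    static final class SSTable {
        final long maxTimestamp;
        final Map<String, Long> cells;

        SSTable(Map<String, Long> cells) {
            this.cells = cells;
            long max = Long.MIN_VALUE;
            for (long timestamp : cells.values()) {
                max = Math.max(max, timestamp);
            }
            this.maxTimestamp = max;
        }
    }

    static final class Result {
        final Map<String, Long> cells = new HashMap<>();
        int sstablesRead = 0;
    }

    static Result read(Map<String, Long> memtableCells, List<SSTable> sstables, Set<String> names) {
        Result result = new Result();
        // Steps 1 and 2: start from the memtable, keeping only the named columns.
        merge(result.cells, memtableCells, names);
        // Steps 3 and 4: newest SSTables first.
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        // Step 5: read in order until nothing older can change the result.
        for (SSTable sstable : ordered) {
            boolean haveAll = result.cells.keySet().containsAll(names);
            if (haveAll && sstable.maxTimestamp < minTimestamp(result.cells)) {
                break;
            }
            merge(result.cells, sstable.cells, names);
            result.sstablesRead++;
        }
        return result;
    }

    /** Copies named cells across, keeping the newest timestamp per column. */
    private static void merge(Map<String, Long> into, Map<String, Long> from, Set<String> names) {
        for (String name : names) {
            Long timestamp = from.get(name);
            if (timestamp != null) {
                into.merge(name, timestamp, Long::max);
            }
        }
    }

    private static long minTimestamp(Map<String, Long> cells) {
        long min = Long.MAX_VALUE;
        for (long timestamp : cells.values()) {
            min = Math.min(min, timestamp);
        }
        return min;
    }
}
```

Because the SSTables are visited newest first, the break is safe: once one SSTable is too old to matter, every later one is too.
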
  63. ClusteringIndexSliceFilter: When we do not know which columns to select, the search ends when it is exhausted.
  64. ClusteringIndexSliceFilter: Used with: Distinct; not all clustering columns restricted.
  65. ClusteringIndexSliceFilter:
      1. Get Partition From Memtables.
      2. Create Iterators for Partitions.
      3. Select SSTables that may contain Partition Key.
      4. Order in reverse max timestamp order.
      5. Create Iterators for SSTables in order.
  66. Slice Filter Short Circuits: If the SSTable max timestamp is before the max seen Partition Deletion timestamp, stop the search.
  67. Names Filter Short Circuits: If the search clustering value is not within the clustering range in the SSTable, skip the SSTable.
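
The slice steps and the two short circuits above can be sketched in the same simplified style. Again this is a model under stated assumptions rather than the real read path: `SliceFilterSketch` is hypothetical, rows are reduced to a clustering value and a timestamp, the memtable is omitted, and results are merged eagerly instead of through iterators.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SliceFilterSketch {
    static final class SSTable {
        final long maxTimestamp;
        final long partitionDeletion; // Long.MIN_VALUE when there is no partition deletion
        final TreeMap<String, Long> rows; // clustering value -> row timestamp

        SSTable(TreeMap<String, Long> rows, long partitionDeletion) {
            this.rows = rows;
            this.partitionDeletion = partitionDeletion;
            long max = partitionDeletion;
            for (long timestamp : rows.values()) {
                max = Math.max(max, timestamp);
            }
            this.maxTimestamp = max;
        }
    }

    /** Returns the live rows with clustering values in [start, end], newest timestamp wins. */
    static TreeMap<String, Long> read(List<SSTable> sstables, String start, String end) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());
        TreeMap<String, Long> result = new TreeMap<>();
        long maxDeletion = Long.MIN_VALUE;
        for (SSTable sstable : ordered) {
            if (sstable.maxTimestamp < maxDeletion) {
                break; // this and all older SSTables are shadowed by the deletion
            }
            maxDeletion = Math.max(maxDeletion, sstable.partitionDeletion);
            boolean outsideSlice = sstable.rows.isEmpty()
                    || sstable.rows.lastKey().compareTo(start) < 0
                    || sstable.rows.firstKey().compareTo(end) > 0;
            if (outsideSlice) {
                continue; // clustering range does not overlap the slice
            }
            for (Map.Entry<String, Long> row : sstable.rows.subMap(start, true, end, true).entrySet()) {
                result.merge(row.getKey(), row.getValue(), Long::max);
            }
        }
        // Drop rows written at or before the newest partition deletion seen.
        final long deletedUpTo = maxDeletion;
        result.values().removeIf(timestamp -> timestamp <= deletedUpTo);
        return result;
    }
}
```

Unlike the names filter, there is no "all columns found" exit here, so the loop ends only through the deletion check or by running out of SSTables.
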
  68. So… 3.x is awesome. Start using it as soon as possible.
  69. Thanks.
  70. Aaron Morton, @aaronmorton, Co-Founder & Principal Consultant, www.thelastpickle.com
