Large partition in Cassandra
Shogo Hoshii
Yahoo! Japan Corp.
About me
• Cassandra operator at Yahoo! Japan Corp.
• https://issues.apache.org/jira/browse/CASSANDRA-5977
Remark
• This is a summary of the following tickets:
– https://issues.apache.org/jira/browse/CASSANDRA-11206
– https://issues.apache.org/jira/browse/CASSANDRA-9738
Agenda
• Recap the read path
• What’s the problem?
• Solutions
High level: read path
[Diagram: Row Cache, Key Cache, SSTables, MemTable]
1. Check the row cache before going to the key cache
2. Check the key cache to get the offsets to the data
3. Find the offsets to the data and retrieve the data
4. Merge data from the sstables and the memtable
5. Populate the row cache with the returned row
http://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutReads.html
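The five steps above can be sketched roughly as follows. This is a simplified model, not Cassandra's code: `SSTable`, `merge`, and `read` are hypothetical names, the caches are plain dictionaries, and a row is modeled as a map from column to (timestamp, value) so that merging can keep the newest value.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Toy sstable (illustrative): 'data' stands in for the on-disk file."""
    index: dict = field(default_factory=dict)   # partition key -> offset
    data: dict = field(default_factory=dict)    # offset -> row


def merge(rows):
    """Merge row fragments, keeping the newest timestamp per column."""
    merged = {}
    for row in rows:
        for column, (ts, value) in row.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return merged


def read(key, row_cache, key_cache, sstables, memtable):
    # 1. Check the row cache before going to the key cache.
    if key in row_cache:
        return row_cache[key]
    fragments = []
    for i, sstable in enumerate(sstables):
        # 2. Check the key cache to get the offset to the data.
        offset = key_cache.get((i, key))
        if offset is None:
            offset = sstable.index.get(key)   # fall back to the index
            if offset is None:
                continue                      # this sstable has no such key
            key_cache[(i, key)] = offset
        # 3. Use the offset to retrieve the data.
        fragments.append(sstable.data[offset])
    # 4. Merge data from the sstables and the memtable.
    if key in memtable:
        fragments.append(memtable[key])
    row = merge(fragments)
    # 5. Populate the row cache with the returned row.
    row_cache[key] = row
    return row
```

A second `read` for the same key returns straight from `row_cache`, which is Pattern 1 on the next slide.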
Pattern 1. The row is in the row cache
[Diagram: Row Cache, Key Cache, MemTable, Bloom Filter, Partition Summary and Compression Offsets in memory (heap / off heap); Partition Index and Data on disk]
1. read request
2. Return the row if it is in the row cache
Pattern 2. The key is in the key cache
[Diagram: Row Cache, Key Cache, MemTable, Bloom Filter, Partition Summary and Compression Offsets in memory (heap / off heap); Partition Index and Data on disk]
1. read request
2. Check bloom filters
3. Check whether the partition key is in the key cache
4. Find the offset to the result set
5. Access the result set
Pattern 3. The key is not cached
[Diagram: Row Cache, Key Cache, MemTable, Bloom Filter, Partition Summary and Compression Offsets in memory (heap / off heap); Partition Index and Data on disk]
1. read request
2. Miss -> Check bloom filters
3. Check whether the partition key is in the key cache
4. Miss -> Binary-search the partition summary for the nearest location in the partition index
5. Scan the partition index on disk to find the offsets
6. Find the offset into the result set
7. Access the result set
8. Update the key cache
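Steps 2 to 8 above can be sketched for a single sstable as follows. This is a toy model under stated assumptions: `ToySSTable` and its fields are hypothetical, the bloom filter is replaced by an exact set (so it has no false positives), and the partition summary simply samples every second index entry.

```python
from bisect import bisect_right


class ToySSTable:
    """Illustrative single sstable; names and layout are assumptions."""

    def __init__(self, partitions, sample_every=2):
        keys = sorted(partitions)                          # sstables store keys sorted
        self.data = [partitions[k] for k in keys]          # "Data" file
        self.index = [(k, i) for i, k in enumerate(keys)]  # "Partition Index"
        self.summary = self.index[::sample_every]          # "Partition Summary"
        self.bloom = set(keys)                             # stand-in for a bloom filter
        self.index_entries_scanned = 0

    def read(self, key, key_cache):
        # 2. Check the bloom filter.
        if key not in self.bloom:
            return None
        # 3. Check whether the partition key is in the key cache.
        cached = key_cache.get(key)
        offset = cached
        if offset is None:
            # 4. Miss: binary-search the summary for the nearest index position.
            summary_keys = [k for k, _ in self.summary]
            slot = bisect_right(summary_keys, key) - 1
            start = self.summary[slot][1]
            # 5.-6. Scan the index from that position to find the data offset.
            for k, off in self.index[start:]:
                self.index_entries_scanned += 1
                if k == key:
                    offset = off
                    break
        # 7. Access the result set.
        row = self.data[offset]
        # 8. Update the key cache (only needed after a miss).
        if cached is None:
            key_cache[key] = offset
        return row
```

Reading the same key twice scans the index only the first time; the second read is Pattern 2, served from `key_cache`.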
What’s the problem?
• GC pressure caused by the key cache when a large partition is read
Partition Index Recap
• http://distributeddatastore.blogspot.jp/2013/08/cassandra-sstable-storage-format.html
RowIndexEntry
• Partition size < 64 KB
  – RowIndexEntry
    • Position
    • Serialized size of data
• Partition size > 64 KB
  – IndexedEntry
    • Position
    • Serialized size of data
    • IndexInfo[]
      – Serialize method
      – Offset
      – Width
      – Etc.
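The two entry shapes can be pictured with a small sketch. The class and field names below paraphrase the slide and are not copied from Cassandra's source; `build_entry` is a hypothetical helper that assumes one `IndexInfo` per 64 KB block of partition data and uses the partition size as the serialized size.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexInfo:
    offset: int   # offset of the block within the partition
    width: int    # size of the block in bytes
    # remaining fields (the slide's "Etc.") are omitted here


@dataclass
class RowIndexEntry:
    position: int          # position of the partition in the data file
    serialized_size: int   # serialized size of the data


@dataclass
class IndexedEntry(RowIndexEntry):
    index_infos: List[IndexInfo] = field(default_factory=list)


def build_entry(position, partition_size, block_size=64 * 1024):
    """Small partitions get a plain entry; large ones carry IndexInfo[]."""
    if partition_size < block_size:
        return RowIndexEntry(position, partition_size)
    infos = [
        IndexInfo(offset=off, width=min(block_size, partition_size - off))
        for off in range(0, partition_size, block_size)
    ]
    return IndexedEntry(position, partition_size, infos)
```

In this model a 1 MB partition already carries 16 `IndexInfo` objects, and the list grows linearly with partition size.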
Approximation for 16-byte values (partition size : index size / number of objects)
1 MB : 3 KB / > 200 objects
4 MB : 11 KB / > 800 objects
64 MB : 180 KB / > 13k objects
512 MB : 1.4 MB / > 106k objects
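A quick way to read these figures is to divide each one by the number of index entries. The sketch below assumes one index entry per 64 KB of partition data (the threshold shown on this slide); `BLOCK`, `slide_figures`, and `per_entry` are illustrative names, and the input numbers are the ones quoted above.

```python
KB, MB = 1024, 1024 * 1024
BLOCK = 64 * KB  # assumption: one index entry per 64 KB of partition data

# (partition size, index size, object count) as quoted on the slide
slide_figures = [
    (1 * MB, 3 * KB, 200),
    (4 * MB, 11 * KB, 800),
    (64 * MB, 180 * KB, 13_000),
    (512 * MB, 1.4 * MB, 106_000),
]


def per_entry(partition_size, index_size, objects):
    """Return (entry count, bytes per entry, objects per entry)."""
    entries = partition_size // BLOCK
    return entries, index_size / entries, objects / entries


for figures in slide_figures:
    entries, bytes_each, objects_each = per_entry(*figures)
    print(f"{figures[0] // MB:>4} MB: {entries:>5} entries, "
          f"~{bytes_each:.0f} bytes and ~{objects_each:.1f} objects per entry")
```

Under that assumption every row works out to roughly 180 bytes and 12 to 13 objects per entry, so the heap cost of a cached index grows linearly with partition size.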
Pattern 3. The key is not cached
[Diagram: Row Cache, Key Cache, MemTable, Bloom Filter, Partition Summary and Compression Offsets in memory (heap / off heap); Partition Index and Data on disk]
1. read request
2. Miss -> Check bloom filters
3. Check whether the partition key is in the key cache
4. Miss -> Binary-search the partition summary for the nearest location in the partition index
5. Scan the partition index on disk to find the offsets
6. Find the offset into the result set
7. Access the result set
8. Update the key cache
9. GC, GC, GC…
Current solution
• If the serialized index size of a partition < column_index_cache_size_in_kb (configurable)
– IndexedEntry is kept on heap
• Otherwise
– The index is always read from disk when needed
• https://issues.apache.org/jira/browse/CASSANDRA-11206
• https://www.youtube.com/watch?v=qa84vABqftM
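The rule on this slide can be sketched as a small decision function. This is an illustration only: `CachedEntry`, `make_cache_entry`, and `get_index` are hypothetical names, the index is modeled as a list of offsets, and the size being compared is assumed to be the serialized size of the index entries, measured against `column_index_cache_size_in_kb`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CachedEntry:
    position: int                        # position of the partition's index on disk
    index_on_heap: Optional[List[int]]   # None means "read the index from disk"


def make_cache_entry(position, index, index_size_kb, threshold_kb):
    """Keep the index on heap only when it is smaller than the threshold."""
    if index_size_kb < threshold_kb:
        return CachedEntry(position, list(index))
    return CachedEntry(position, None)


def get_index(entry, read_index_from_disk):
    """Return the index, going to disk only when it was not kept on heap."""
    if entry.index_on_heap is not None:
        return entry.index_on_heap
    return read_index_from_disk(entry.position)
```

With this split, a large partition costs the key cache only its position, at the price of a disk read each time its index is needed.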
Other possible solutions
• Never keep IndexInfo on heap
– Read from disk when needed
– Degrades performance when small partitions are read
Other possible solutions
• Migrate the key cache to be fully off heap
– https://issues.apache.org/jira/browse/CASSANDRA-9738
– Serialization and deserialization cost too much when a large partition is read
• Will Birch help us solve this problem?
– https://issues.apache.org/jira/browse/CASSANDRA-9754
Thank you

Editor's Notes

  • #7 The row cache is invalidated when a write hits the row in question -> as a rule of thumb, use the row cache when read : write = 95 : 5
  • #9 Read path in 2.0: http://www.slideshare.net/planetcassandra/c-summit-eu-2013-keynote-by-jonathan-ellis Read path in 2.1: https://grockdoc.com/cassandra/2.1/articles/intra-node-read-path-flow_12a63adf-f148-45ff-87d8-0fcb84e8aca0/ http://www.slideshare.net/doanduyhai/cassandra-introduction-apache-con-2014-budapest ----- Meeting notes (6/21/16 19:17) ----- The offset indicates where the data is located in the data file
  • #13 Read path in 2.0: http://www.slideshare.net/planetcassandra/c-summit-eu-2013-keynote-by-jonathan-ellis Read path in 2.1: https://grockdoc.com/cassandra/2.1/articles/intra-node-read-path-flow_12a63adf-f148-45ff-87d8-0fcb84e8aca0/ http://www.slideshare.net/doanduyhai/cassandra-introduction-apache-con-2014-budapest