The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

Wei Deng, Ryan Svihla
The Missing Manual: Leveled Compaction Strategy

2© DataStax, All Rights Reserved.
1 Why
2 Background
3 How it affects you
4 Myth Busters
5 What’s coming

About us
© DataStax, All Rights Reserved. 3
Wei Deng, Ryan Svihla
• Both been working for DataStax for 3
years
• Helped a lot of customers on
production deployments and shared
their struggles with LCS
• Active on LCS related JIRAs

Why?
Are we lazy, or crazy to talk about a five-year-old technology

Why?
• Motivations
• Not a lot of technical content dedicated to LCS
• Many sharp edges for non-suspecting users that can cause operation pains
• Intended Audience
• You’ve been managing C* cluster using LCS for a few months and have been burned by it a
few times
• You just started using C* and are wondering what are the pitfalls to watch out
• What you can expect to get out of this talk
• A better understanding about inner-workings of LCS and some counter-intuitive behaviors
• Understand some concepts frequently used in the code (but not adequately discussed in
existing documentation) before diving into the LCS source code and JIRAs

A brief history on LCS
• One of the three main Compaction Strategies in C*: STCS, LCS, DTCS/TWCS
• Original idea comes from LevelDB of Google’s Chromium project
• Introduced in C* 1.0 almost 5 years ago
• A lot of people adopted it in production starting from C* 1.2 days
• Been going through many changes and is still evolving
• At various points people from community would like to change it to the default
compaction strategy to replace STCS but didn’t succeed (see CASSANDRA-7987 and
CASSANDRA-12209)
• Interesting divide in the community: committers and users
• A more detailed history can be found by looking at labels = “lcs” on ASF JIRA site.

A Quick Diversion to LevelDB
• LevelDB was introduced by Google, designed as an embedded key/value store in
browsers and mobile OS
• It was designed for low concurrency embedded purposes
• Spawned many additional key/value store use cases (Bitcoin, Akka, ActiveMQ,
RocksDB, etal)
• NoSQL database that adopted it for a multiuser database (riak, hyperdex, etal) had to
do some serious surgery similar to LCS (Don't block writes for L0, concurrent
compactions, etc.)
• As Basho’s Matthew Von-Maszewski pointed out in his Riak LevelDB talk, there are
some common problems that many LevelDB forks suffer, e.g. L0 overwhelmed
(pause), write amplification, repair speed

Background
Design Principles and Basic Concepts

What is overlap?
• It’s not a LCS-only concept, but it’s quite important for understanding LCS
• How the code decides overlap between two SSTableReader objects (‘current’ and ‘previous’):
• current.first.compareTo(previous.last) <= 0
• Assume you’ve done the following operations to insert a few partitions and incur two flushes:
• Insert P1, P2, P3, P4, flush
• Insert P5, P6, P7, P8, flush
cqlsh:lcstest> select token(key), key from t1;
system.token(key) | key
----------------------+-----
-8293482207053059348 | P3
-3394362371136168911 | P7
-2426286626330648673 | P8
2250124551998075523 | P1
3019394044321026632 | P6
5060311682392199866 | P4
5805828953249703115 | P5
9074316327295047501 | P2
-3394362371136168911 <= 9074316327295047501
-8293482207053059348 | P3
2250124551998075523 | P1
5060311682392199866 | P4
9074316327295047501 | P2
previous
-3394362371136168911 | P7
-2426286626330648673 | P8
3019394044321026632 | P6
5805828953249703115 | P5
current
current.first <= previous.last ?

Basic Design Principles
• LCS arranges data in SSTables on multiple levels
• Newly appeared SSTables (not from compaction) almost always show up in L0
• Only L0 is allowed to have SSTables of any size, L1+ levels can only have 160MB SSTables (with minor exceptions)
• Only L0 can allow SSTables to overlap, for all other levels, it’s guaranteed to not overlap within the level
• When LCS is able to catch up (i.e. no SSTable in L0), a read in the worst case just need to read from N SSTables (N
= the highest level that has data), i.e. at most one SSTable from each L(N>=1) level; ideally 90% of the read will hit
the highest level and be satisfied by one SSTable
• Each L(N) level (N>=1) is supposed to accommodate 10N SSTables. So going one level higher, you will be able to
accommodate 10x more SSTables
• When a L0 SSTable is picked as a compaction candidate, it will find all other overlapping L0 SSTables (maximum
32), as well as all of the overlapping L1 SSTables (most likely all 10 L1s), to perform a compaction, and the result
will be written into multiple SSTables in L1
• When the total size of all SSTables in L1 is greater than 10 x 160MB, the first L1 SSTable since the last compacted
key could be picked as the next compaction candidate, and it will be compacted together with all overlapping L2
SSTables, and the result will be written into multiple SSTables in L2
• For higher levels, it will follow the same fashion, adding candidates from highest level to the lowest

Basic Design Principles (cont.)
• On all levels, a Compaction Score will be periodically calculated, based on the total size of all non-
compacting SSTables in each level, divided by this level’s capacity (for example, L1 is supposed to
accommodate 1.6GB, L2 is supposed to accommodate 16GB). Going from highest level, any level having
score higher than 1.001 can be picked by a compaction thread
• Before any one of the higher-level compaction starts, check if L0 is too far-behind (i.e. having > 32
SSTables that are not yet in a compaction), if yes, do STCS in L0 first before doing higher level LCS
compaction
• After going through compactions at all higher levels, move onto doing regular LCS compaction in L0
• The number of compaction threads is capped by concurrent_compactors, and throughput throttled
• For each compaction session, the level for the added sstables is the max of the removed ones, plus one if
the removed were all on the same level
• If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level
into lower level compactions
• In addition to the DataStax blog post on LCS, more details are explained in the code comments
• http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
• org.apache.cassandra.db.compaction.LeveledManifest.java
• org.apache.cassandra.db.compaction.LeveledCompactionStrategy.java

How up-level is happening
• Use a very simple key/value table in our demo
• For each partition, the key is an integer, and the
value is a 40MB blob, so each partition is about
40MB and 4 partitions can fill up a 160MB
SSTable
• Insert partitions with key 1-8, trigger a flush, then
insert 9-16, another flush, then insert 17-24,
flush, insert 25-32, and so on
• Each L(N>0) level has capacity of 160MB x 4N ,
so L1 has max capacity of 640MB, L2 has max
capacity of 2.56GB and so on, for easier
illustration of levels
CREATE TABLE kvstore (
key int PRIMARY KEY,
value text
);

system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
-7509452495886106294 | 5
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
1634052884888577606 | 7
2705480034054113608 | 6
9010454139840013625 | 3
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-1191135763843456182 | 15
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
-9157060164899361011 | 23
-3974532302236993209 | 19
-2695747960476065067 | 18
-1117083337304738213 | 22
1388667306199997068 | 20
5176205029172940157 | 21
5467144456125416399 | 17
6462532412986175674 | 24

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
-9157060164899361011 | 23
-3974532302236993209 | 19
-2695747960476065067 | 18
-1117083337304738213 | 22
1388667306199997068 | 20
5176205029172940157 | 21
5467144456125416399 | 17
6462532412986175674 | 24
90°

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 19 18 22 20 21 17
24
5 10 16 13 11 1 8 2 4 15 7 6 9 14 12
3

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 5 10 16 13 11 1 19 8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB / (41 x 160MB) > 1.001

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 5 10 16 13 11 1 19
8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
Total size of all L1 SSTables ≈ 40MB/partition x 16 partitions = 640MB / (41 x 160MB) ≤ 1.001

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 5 10 16 13 11 1 19
8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
28 30 27 29 26 31 32
25

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 5 10 16 13 11 1 19
27 20 7 6 29 9 14 26 21 17 31
24
32 25 12 328 30 8 2 4 18 15 22
Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB / (41 x 160MB) > 1.001
lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;

----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
41 x 160MB
L2
No overlap
42 x 160MB
…
23 5 10 16 13 11 1 19 27 20 7 6
29 9 14 26 21 17 31
24
32 25 12 328 30 8 2
4 18 15 22
lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;

Fully Expired SSTable
• If an SSTable contains only tombstones and it is guaranteed that the SSTable is not
shadowing data in any other SSTable, compaction can drop that SSTable
• If you see SSTables with only tombstones (note that TTL’ed data is considered
tombstones once the time to live has expired) but it is not being dropped by
compaction, it is likely that other SSTables contain older data
• In LCS, these other SSTables containing older data are often in higher levels
• This directly impacts the effectiveness of tombstone compaction, and hence why you
always want tombstones to go to the highest level as quickly as possible

Write Amplification
• All fresh data enters L0, and will have to move up level one by one, no skipping level
possible
• In the ideal situation, highest level will have 90% of data, hence more data -> more
Write Amplification, because any data that eventually shows up on the highest level N
had to be written N+1 times
• This is why LCS is more demanding on CPU and IO capabilities compared to STCS

JBOD and CASSANDRA-6696
• JBOD is a primary use case for choosing LCS
• However, it used to create “zombie resurrection” problem when JBOD disk fails
• CASSANDRA-6696 (C* 3.2+) will create flush writer and compaction strategy per disk,
so all flush and compaction activities for each disk will only be responsible for one set
of token ranges
• More parallelism for compaction (no worries about overlap at higher levels between
different compaction threads)
• This also makes LCS to be able to handle more data per node, as each JBOD volume
is doing compaction on its own

How it affects you?
Best Practices

Where LCS fits the best
• Use cases needing very consistent read performance with much higher read to write
ratio
• Wide-partition data model with limited (or slow-growing) number of total partitions but
a lot of updates and deletes, or fully TTL’ed dataset

Typical anti-pattern use cases for LCS
• Heavy write with all unique partitions
• Too frequent memtable flushes
• Large vnode nodes
• Frequent bulk load with the goal of finishing the load as quickly as possible
• Any other condition that could create too many L0 SSTables
• High density node (> 1TB per node or > 0.5TB per CQL table per node)
• Node without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU
resources
• Partitions that are too wide (> 160MB)
• Use LCS simply for saving disk spaces (because “LCS requires less free space”)
• Have too many SSTables without enough file handles and using snapshot repair

Troubleshooting
• Debug.log
• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledManifest TRACE
• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledCompactionStrategy TRACE
• log_all=true (CASSANDRA-10805, C* 3.6+) or JMX
• enable in table schema
• Will create an individual compaction.log file outside of system.log or debug.log
• Can be very useful for a visualizer
• The offline compaction-stress tool automatically enable it

Visualization demo
• Developed by Carl Yeksigian, still Work In Progress, 0.1 version
• Leverage the compaction.log generated by CASSANDRA-10805, which was also
authored by Carl

Myth Busters
Observations that often confuse people

Myth #1
• Using LCS, I can avoid STCS’s 50% free space requirement and use up
as much as 90% of my disk space?
WRONG
• You should not size your hardware based on 90% disk utilization
• STCS in L0 can easily throw off this assumption, also when L0 is overloaded, L0->L1 can easily
break the “up to 90%” assumption

Myth #2
• Why so many small SSTables show up in L0 when I start Repair
Service?
• Every repair session will force affected nodes to flush the corresponding memtable
• This is the only way to guarantee merkle tree comparison won’t miss any data from memtable
• If something is wrong with repair scheduling, two repair sessions could be scheduled back-to-back, causing
unnecessary memtable flush to happen
• If you use vnode, one range of data tends to spread across many nodes and naturally many
SSTables, hence when streaming the replica difference, it’s more likely to source from many nodes
and many SSTables

Myth #3
• Why compaction is no longer runnning and I’m still seeing something
like [1, 12/10, 104/100, …] ?
• Even when not doing STCS, L0 SSTables compacted don’t always end up in L1. If the combined
size of L0 SSTables involved in a compaction is smaller than max_sstable_size, then the output
will stay in L0
• Upper level compaction is triggered by compaction scores (watch debug.log to see it in action),
and compaction scores is completely based on size, instead of number of SSTables in each level.
For brevity of display, “nodetool cfstats” chooses to use # of SSTables

Myth #4
• Why it takes so long to free up disk space after delete?
• Because any newly inserted data (including delete/tombstone) will have to
bubble up
• A partition can have fragments in all different levels
• To allow the data belonging to a partition to be cleared, the tombstone has to
bubble up to the top and is also expired
• See fully expired sstable slide.

What’s coming
JIRA references to LCS’s past, present and future

What has happened since C* 2.1
• Partition SSTables by token ranges (CASSANDRA-6696, C* 3.2+)
• Try to include overlapping sstables in tombstone compactions to be able to actually drop the tombstones
(CASSANDRA-7019, C* 3.10+) See talk this afternoon at 4:10pm in 210C
• Add major compaction to LCS (CASSANDRA-7272, C* 2.2+), however note CASSANDRA-11817
• Send source sstable level when bootstrapping or replacing nodes (CASSANDRA-7460, C* 2.2+)
• Adjusting LCS to work with incremental repair (CASSANDRA-8004, C* 2.1.2+)
• A tool called sstableofflinerelevel to create a "decent" sstable leveling from a bunch of given SSTables
(CASSANDRA-8301, C* 2.1.5+)
• Use a new JBOD volume every time when a newly compacted SSTable is written (CASSANDRA-8329, C* 2.1.3+)
• Compaction throughput under stress improved by 2x with various optimizations (CASSANDRA-8915, C* 3.0+,
CASSANDRA-8920, C* 2.2+, CASSANDRA-8988, C* 2.2+, CASSANDRA-10099, C* 3.6+)
• Compact only certain token range (CASSANDRA-10643, C* 3.10+)
• Allow L0 STCS and L0->L1 LCS to happen at the same time if there’s no overlap (CASSANDRA-10979, C*
2.1.14+)
• Improve Cassandra startup speed with LCS tables (CASSANDRA-12114, C* 3.0.9+, C* 3.8+)
• Compaction Throttling is not working as expected (CASSANDRA-12366, C* 3.10+)
• Compute last compacted keys for each level at startup (CASSANDRA-6216, C* 3.0.9+, C* 3.10+)

JIRAs coming
A few JIRAs worth waiting for (already patch available) or watching (WIP) or
contributing/upvoting (not yet started):
• Different optimizations to pick compaction candidates to avoid write amplification and merge/eliminate
tombstones (CASSANDRA-11106)
• Range Aware compaction (CASSANDRA-10540)
• Create new sstables in the highest possible level (CASSANDRA-6323)
• Fractual cascading (CASSANDRA-7065)
• Allow multiple overlapping SSTables in L1 or above (CASSANDRA-7409)
• L0 should have a separate configurable bloom filter false positive ratio (CASSANDRA-8434)
• Dynamically adjusting the ideal top level size to how much data is actually in it (CASSANDRA-9829)
• Option to disable bloom filter in highest level of LCS sstables (CASSANDRA-9830)
• LCS repair: compact tables before making available in L0 (CASSANDRA-10862)
• Make the fanout size for LeveledCompactionStrategy to be configurable (CASSANDRA-11150)
• Improve parallelism on higher level compactions in LCS (CASSANDRA-12464)

Some parting thoughts
• Try out the latest C* trunk/3.x and play with the logging level
• Test out bigger sstable_size_in_mb with your environment (CASSANDRA-12591)
• Leverage the new compaction-stress tool to evaluate your deployment environment

Appendix / Deleted Scenes
The stuff we didn’t have time to cover

Some more comments on LevelDB design from
Cassandra source code
“LevelDB gives each level a score of how much data it contains vs its ideal amount, and compacts the level with the
highest score. But this falls apart spectacularly once you get behind. Consider this set of levels:
L0: 988 [ideal: 4]
L1: 117 [ideal: 10]
L2: 12 [ideal: 100]
The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll do is compact a batch of
MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the result (say, 120 sstables) in L1. Then we'll
compact the next batch of MAX_COMPACTING_L0, and so forth. So we spend most of our i/o rewriting the L1 data
with each batch.
If we could just do *all* L0 a single time with L1, that would be ideal. But we can’t. LevelDB's way around this is to
simply block writes if L0 compaction falls behind. We don't have that luxury. So instead, we 1) force compacting higher
levels first, which minimizes the i/o needed to compact optimally which gives us a long term win, and 2) if L0 falls
behind, we will size-tiered compact it to reduce read overhead until we can catch up on the higher levels.
This isn't a magic wand -- if you are consistently writing too fast for LCS to keep up, you're still screwed. But if instead
you have intermittent bursts of activity, it can help a lot.”© DataStax, All Rights Reserved. 46

Basics on compaction in general
• Why you need compaction
• Log Structured Merge Tree
• Immutable data files (SSTables)
• Super fast writes, but read will need to read as little amount of data files as possible
• What it usually does
• Normally combine *multiple* SSTables and generate another set of SSTables to merge
partition fragments from the source SSTables
• In this process, remove tombstones or change expired data to tombstones
• Tombstone compaction can also be automatically triggered on individual SSTables.
• How it is triggered
• Background / minor compaction
• Major compaction
• User defined compaction

Impact from Incremental Repair
• Anti-compaction at the end of incremental repair session will create repaired and
unrepaired SSTables
• Maintaining separate pools of repaired and unrepaired SSTables causes extra
complexity for LCS
• Two instances of LeveledCompactionStrategy running, one for each pool
• When moving a SSTable (at L1+) from unrepaired to repaired, it can cause overlap
and we will have to send this SSTable back to L0

What is suspectness and why it matters?
• A suspected SSTable is marked when any IO exception is encountered when reading any
component of this SSTable. For example, a SSTable component corruption can lead compaction to
mark this SSTable as suspected.
“Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're
basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable.
So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best. ”
• This is actually one of the reasons why you may see a huge backup whenever there is sstable
corruption. Because if the corruption happens in L1, no L0 compaction can happen any more, so
no matter how many compaction threads you have, you may be blocked in L0->L1 compaction.

Subrange compaction
• For partitions with “Justin Bieber” effect
• Only compact a given sub range
• Could be useful if you know a partition/token has been misbehaving - either gathering
many updates or many deletes
• “nodetool compact -st x -et y” will pick all SSTables containing the range between x
and y and issue a compaction for those sstables
• With LCS it can issue the compaction for a subset of the SSTables, the resulting
SSTables will end up in L0

• High density node (> 1.5TB per node or > 1TB per CQL table per node)
resources

How is Flush triggered?
• memtable reaches threshold
• commitlog reaches threshold
• memtable_flush_period_in_ms
• manual flush
• repair triggering flush
• snapshot/backup
• nodetool drain
• drop table
• when you change compaction strategy
• When the node is requested to transfer ranges (streaming-in)

• High density node (> 1.5TB per node or > 1TB per CQL table per node)
resources

Other reasons massive SSTables can show up in L0
• Large vnode with repair streaming (CASSANDRA-10862)
• Bootstrap without sending source level (CASSANDRA-7460)
• Bulk load utility sstableloader from cluster-wide snapshots could be another bad
source of small L0 sstables (RF x RF) (CASSANDRA-4756)

Pitfalls to avoid
• Fast write that exceeds the capability of compaction
• Really large partitions (usually LCS is known to handle wide partitions well, but you
cannot go too high, especially not exceeding the max_sstable_size)
• A single CQL table holding too much data with LCS compaction
• Large num_token
• Bugs or data corruption that can block LCS compaction and create a lot of L0 overflow

Switch between STCS and LCS
• How do you properly do that?
• Simply change table schema can create a lot of impact on your production workload
• Least intrusive
• Use JMX to change Compaction parameters – set CompactionParametersJson
• Finish compaction on a couple of nodes before starting the next
• Be careful: don’t trigger schema update on this node before compaction around all
nodes finishes
• At the end, make schema change permanent by ALTER TABLE

Tuning Knobs to be aware of
• concurrent_compactors in yaml (be mindful of heap pressure)
• compaction_throughput_mb_per_sec in yaml
• All other yaml parameters impacting flush frequency
• memtable_heap_space_in_mb
• memtable_offheap_space_in_mb
• memtable_cleanup_threshold
• commitlog_total_space_in_mb
• JVM flag to disable stcs_in_l0 –Dcassandra.disable_stcs_in_l0
• sstable_size_in_mb in table schema
• Tombstone compaction related parameters in table schema
• For TTL’ed data, gc_grace_seconds in table schema
• Fan out size (CASSANDRA-7871, CASSANDRA-11550)
• Major compaction (caution)
• You can always throttle your write throughput
• Surprise! Switching to STCS can be an option too

Useful tools
• Passive
• nodetool cfstats
• nodetool compactionstats
• nodetool compactionhistory
• Transaction log in the data directory (CASSANDRA-7066)
• JMX for monitoring (check out the two MBeans: o.a.c.db.ColumnFamilyStoreMBean,
o.a.c.db.compaction.CompactionManagerMBean)
• Proactive
• sstableofflinerelevel (great tool, but don’t expect it to do magic) (CASSANDRA-8301)
• Major compaction (performance caveat!) (CASSANDRA-7272)
• sstablelevelreset (Weapon of Massive Destruction) (CASSANDRA-5271)
• User Defined Compaction - JMX or nodetool (CASSANDRA-10660)

Signs of a struggling LCS compaction
• 100s-1000s (or more) of L0 SSTables in cfstats
• 1000s (or more) of pending compactions in compactionstats
• Heavy GC activities that are not driven by read/write workload
• Much more disk space utilized than the real data volume
• Running out of file handles
• OOM with snapshot repair

Typical debug.log Entries
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:119 - Adding BigTableReader(path='/home/automaton/cassandra-
trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big-Data.db') to L3
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,236 CompactionTask.java:249 - Compacted (7f458a20-6976-11e6-9ee3-dd591a930c3b) 1 sstables to
[/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big,] to level=3. 160.000MiB to 160.000MiB (~100% of
original) in 12,865ms. Read Throughput = 12.436MiB/s, Write Throughput = 12.436MiB/s, Row Throughput = ~55,379/s. 719,931 total partitions merged to 719,931.
Partition merge counts were {1:719931, }
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 225, 2, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:319 - Compaction score for level 3 is 0.0010000005662441254
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:573 - Choosing candidates for L1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 0: 0
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,238 LeveledManifest.java:335 - Compaction candidates for L1 are standard1-913(L1),
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,238 CompactionTask.java:153 - Compacting (86f10a60-6976-11e6-9ee3-dd591a930c3b) [/home/automaton/cassandra-
trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-913-big-Data.db:level=1, ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:149 - Replacing [standard1-674(L2), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:166 - Adding [standard1-1179(L3), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055

Typical compaction.log Entries
{"type":"pending","keyspace":"stresscql","table":"blogposts","time":1472851796649
,"strategyId":"1","pending":5}
{"type":"compaction","keyspace":"stresscql","table":"blogposts","time":1472851799
027,"start":"1472851795595","end":"1472851799027","input":[{"strategyId":"1","tab
le":{"generation":3244,"version":"mc","size":169275615,"details":{"level":2,"min_
token":"-5663415854288007738","max_token":"-
5630042604729221974"}}},{"strategyId":"1","table":{"generation":3970,"version":"m
c","size":156101287,"details":{"level":3,"min_token":"-
5681368161876797914","max_token":"-
5523713966014233820"}}}],"output":[{"strategyId":"1","table":{"generation":4003,"
version":"mc","size":170059193,"details":{"level":3,"min_token":"-
5681368161876797914","max_token":"-
5654579422012458443"}}},{"strategyId":"1","table":{"generation":4008,"version":"m
c","size":155296723,"details":{"level":3,"min_token":"-
5654551486308333779","max_token":"-5523713966014233820"}}}]}

Myth #5
• Why I cannot use up all of my concurrent compactor threads?
• It’s all about overlap
• L0 -> L1 can only happen in one thread (most likely)
• L1+ up-level can only happen when L(N) sstable overlaps with a small subset of L(N+1) SSTables
• Some of your SSTables are marked as suspected so cannot participate in compaction

Myth #6
• Why memtable flushes at a much smaller size than the allocated
memtable space, causing way more L0 SSTables than expected?
• The threshold for regular memtable flush is calculated based not only on allocated onheap and
offheap memtable space, but also on memtable_cleanup_threshold
• memtable_cleanup_threshold by default is calculated based on memtable_flush_writers (inverse proportion)
• it’s recommended to always manually set this value, especially if you’ve manually configured
memtable_flush_writers
• Your commitlog total space is too low, which leads commitlog filled up too quickly under heavy
writes

Myth #7
• Why pending compaction numbers is often not accurate for LCS?
• It’s a known limitation of the current implementation
• Look at this code on how the pending compaction task numbers are calculated (it is an *estimate*):
estimated[i] = (long)Math.ceil((double)Math.max(0L, SSTableReader.getTotalBytes(sstables) -
(long)(maxBytesForLevel(i, maxSSTableSizeInBytes) * 1.001)) / (double)maxSSTableSizeInBytes);
• In other words, it’s (total data in the current level – current level’s capacity) / 160MB, then add up
individual estimates from all levels
• But it never consider the additional writes needed beyond the current level

Myth #8
• Why I am seeing a very low compaction throughput number (MB/s)
when I have super fast SSD and plenty of idle CPU?
• L0->L1 can only be single-threaded, so faster CPU helps but more CPU cores won’t for this phase
• Part of reporting problem, especially if you’re using compression, see CASSANDRA-11697 (WIP)
• Part of throttling problem, see CASSANDRA-12366 (C* 3.10+)
• Between 2.1 and 3.0, we’ve improved compaction throughput under stress by 2x, because of
CASSANDRA-8988 (C* 2.2+), CASSANDRA-8920 (C* 2.2+), CASSANDRA-8915 (C* 3.0+).
• There are still remaining issues, but are currently being actively worked on, so you should watch
CASSANDRA-11697 closely

The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

Similar to The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016 (20)

More from DataStax

More from DataStax (20)

Recently uploaded

Recently uploaded (20)

The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

Editor's Notes