Wei Deng, Ryan Svihla
The Missing Manual: Leveled Compaction Strategy
© DataStax, All Rights Reserved. 2
1 Why
2 Background
3 How it affects you
4 Myth Busters
5 What’s coming
About us
© DataStax, All Rights Reserved. 3
Wei Deng, Ryan Svihla
• Both have been working at DataStax for 3 years
• Have helped many customers with production deployments and shared their struggles with LCS
• Active on LCS related JIRAs
Why?
Are we lazy, or crazy, to talk about a five-year-old technology?
Why?
• Motivations
• Not a lot of technical content is dedicated to LCS
• Many sharp edges for unsuspecting users that can cause operational pain
• Intended audience
• You've been managing a C* cluster using LCS for a few months and have been burned by it a few times
• You just started using C* and are wondering what pitfalls to watch out for
• What you can expect to get out of this talk
• A better understanding of the inner workings of LCS and some of its counter-intuitive behaviors
• Familiarity with concepts frequently used in the code (but not adequately discussed in existing documentation) before diving into the LCS source code and JIRAs
© DataStax, All Rights Reserved. 5
A brief history on LCS
• One of the three main compaction strategies in C*: STCS, LCS, DTCS/TWCS
• The original idea comes from LevelDB in Google's Chromium project
• Introduced in C* 1.0, almost 5 years ago
• A lot of people adopted it in production starting from the C* 1.2 days
• It has gone through many changes and is still evolving
• At various points people in the community have wanted to make it the default compaction strategy in place of STCS, but didn't succeed (see CASSANDRA-7987 and CASSANDRA-12209)
• Interesting divide in the community: committers vs. users
• A more detailed history can be found by searching labels = "lcs" on the ASF JIRA site
© DataStax, All Rights Reserved. 6
A Quick Diversion to LevelDB
• LevelDB was introduced by Google, designed as an embedded key/value store for browsers and mobile OSes
• It was designed for low-concurrency, embedded use
• Spawned many additional key/value store use cases (Bitcoin, Akka, ActiveMQ, RocksDB, et al.)
• NoSQL databases that adopted it as a multi-user store (Riak, HyperDex, et al.) had to do serious surgery similar to LCS (don't block writes for L0, concurrent compactions, etc.)
• As Basho's Matthew Von-Maszewski pointed out in his Riak LevelDB talk, there are common problems that many LevelDB forks suffer from, e.g. L0 overwhelmed (pauses), write amplification, repair speed
© DataStax, All Rights Reserved. 7
Background
Design Principles and Basic Concepts
What is overlap?
• It's not an LCS-only concept, but it's quite important for understanding LCS
• How the code decides overlap between two SSTableReader objects ('current' and 'previous'):
• current.first.compareTo(previous.last) <= 0
• Assume you've done the following operations to insert a few partitions and trigger two flushes (a sketch of the check follows the example below):
• Insert P1, P2, P3, P4, flush
• Insert P5, P6, P7, P8, flush
© DataStax, All Rights Reserved. 9
cqlsh:lcstest> select token(key), key from t1;
system.token(key) | key
----------------------+-----
-8293482207053059348 | P3
-3394362371136168911 | P7
-2426286626330648673 | P8
2250124551998075523 | P1
3019394044321026632 | P6
5060311682392199866 | P4
5805828953249703115 | P5
9074316327295047501 | P2
-3394362371136168911 <= 9074316327295047501
-8293482207053059348 | P3
2250124551998075523 | P1
5060311682392199866 | P4
9074316327295047501 | P2
previous
-3394362371136168911 | P7
-2426286626330648673 | P8
3019394044321026632 | P6
5805828953249703115 | P5
current
current.first <= previous.last ?
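A minimal sketch of this overlap test, assuming each SSTable is reduced to the tokens of its first and last partitions. The class and field names here are simplified stand-ins, not the real SSTableReader/DecoratedKey API; the deck's one-sided check (current.first <= previous.last) suffices in the real code because candidates are walked in order of their first token, while the general, order-independent form is shown below:

final class TokenRangeOverlap {
    // Two closed token ranges overlap unless one ends before the other begins.
    static boolean overlaps(long currentFirst, long currentLast,
                            long previousFirst, long previousLast) {
        return currentFirst <= previousLast && previousFirst <= currentLast;
    }

    public static void main(String[] args) {
        // Tokens from the two flushes above:
        // previous = {P3..P2} covers [-8293482207053059348, 9074316327295047501]
        // current  = {P7..P5} covers [-3394362371136168911, 5805828953249703115]
        System.out.println(overlaps(-3394362371136168911L, 5805828953249703115L,
                                    -8293482207053059348L, 9074316327295047501L)); // true
    }
}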
Basic Design Principles
• LCS arranges data in SSTables on multiple levels
• Newly appearing SSTables (not produced by compaction) almost always show up in L0
• Only L0 is allowed to have SSTables of any size; L1+ levels can only have 160MB SSTables (with minor exceptions)
• Only L0 allows SSTables to overlap; for all other levels, SSTables are guaranteed not to overlap within the level
• When LCS is able to catch up (i.e. no SSTable in L0), a read in the worst case only needs to touch N SSTables (N = the highest level that has data), i.e. at most one SSTable from each L(N>=1) level; ideally 90% of reads will hit the highest level and be satisfied by one SSTable
• Each L(N) level (N>=1) is supposed to accommodate 10^N SSTables, so going one level higher you can accommodate 10x more SSTables (see the capacity sketch below)
• When an L0 SSTable is picked as a compaction candidate, it will find all other overlapping L0 SSTables (maximum 32), as well as all of the overlapping L1 SSTables (most likely all 10 of them), to perform a compaction, and the result will be written into multiple SSTables in L1
• When the total size of all SSTables in L1 is greater than 10 x 160MB, the first L1 SSTable past the last compacted key can be picked as the next compaction candidate; it will be compacted together with all overlapping L2 SSTables, and the result will be written into multiple SSTables in L2
• Higher levels follow the same pattern, adding candidates from the highest level to the lowest
© DataStax, All Rights Reserved. 10
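A minimal sketch of the 10^N sizing rule mentioned above, loosely modeled on the idea behind LeveledManifest.maxBytesForLevel; the L0 special case and the exact numbers below are illustrative, not the actual implementation:

final class LevelCapacitySketch {
    // With the default 160MB target size and a fanout of 10, level N is "full"
    // once it holds more than roughly 10^N SSTables worth of data.
    static long maxBytesForLevel(int level, long maxSSTableSizeInBytes) {
        if (level == 0)
            return 4L * maxSSTableSizeInBytes; // L0 has no hard size rule; this bound is only nominal
        return (long) (Math.pow(10, level) * maxSSTableSizeInBytes);
    }

    public static void main(String[] args) {
        long mb160 = 160L * 1024 * 1024;
        System.out.println(maxBytesForLevel(1, mb160)); // ~1.6GB
        System.out.println(maxBytesForLevel(2, mb160)); // ~16GB
    }
}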
Basic Design Principles (cont.)
• On all levels, a compaction score is periodically calculated as the total size of all non-compacting SSTables in the level divided by that level's capacity (for example, L1 is supposed to accommodate 1.6GB, L2 16GB). Starting from the highest level, any level with a score higher than 1.001 can be picked up by a compaction thread (see the score sketch below)
• Before any higher-level compaction starts, check whether L0 is too far behind (i.e. has > 32 SSTables that are not yet in a compaction); if so, do STCS in L0 first before doing higher-level LCS compaction
• After going through compactions at all higher levels, move on to doing regular LCS compaction in L0
• The number of compaction threads is capped by concurrent_compactors, and throughput is throttled by compaction_throughput_mb_per_sec
• For each compaction session, the level for the added SSTables is the max of the removed ones, plus one if the removed SSTables were all on the same level
• If we go 25 rounds without compacting the highest level, we start bringing SSTables from that level into lower-level compactions
• In addition to the DataStax blog post on LCS, more details are explained in the code comments
• http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
• org.apache.cassandra.db.compaction.LeveledManifest.java
• org.apache.cassandra.db.compaction.LeveledCompactionStrategy.java
© DataStax, All Rights Reserved. 12
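A minimal sketch of the score-driven selection described above, reusing the hypothetical maxBytesForLevel helper from the previous sketch; the real logic in LeveledManifest.getCompactionCandidates also handles the L0 STCS fallback and skips SSTables that are already compacting, which is omitted here:

final class CompactionScoreSketch {
    // levelBytes[i] = total size of non-compacting SSTables currently in level i.
    static int pickLevelToCompact(long[] levelBytes, long maxSSTableSizeInBytes) {
        for (int level = levelBytes.length - 1; level >= 1; level--) {
            double score = (double) levelBytes[level]
                         / LevelCapacitySketch.maxBytesForLevel(level, maxSSTableSizeInBytes);
            if (score > 1.001)
                return level; // highest over-capacity level wins
        }
        return 0; // nothing above capacity; fall back to regular L0 -> L1 work
    }
}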
How up-level is happening
• Use a very simple key/value table in our demo
• For each partition, the key is an integer, and the
value is a 40MB blob, so each partition is about
40MB and 4 partitions can fill up a 160MB
SSTable
• Insert partitions with key 1-8, trigger a flush, then
insert 9-16, another flush, then insert 17-24,
flush, insert 25-32, and so on
• Each L(N>0) level has a capacity of 160MB x 4^N (a fanout of 4 instead of the default 10, for easier illustration of levels), so L1 has a max capacity of 640MB, L2 has a max capacity of 2.56GB, and so on
© DataStax, All Rights Reserved. 14
CREATE TABLE kvstore (
key int PRIMARY KEY,
value text
);
How up-level is happening
© DataStax, All Rights Reserved. 15
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
-7509452495886106294 | 5
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
1634052884888577606 | 7
2705480034054113608 | 6
9010454139840013625 | 3
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-1191135763843456182 | 15
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
How up-level is happening
© DataStax, All Rights Reserved. 16
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
How up-level is happening
© DataStax, All Rights Reserved. 17
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
-9157060164899361011 | 23
-3974532302236993209 | 19
-2695747960476065067 | 18
-1117083337304738213 | 22
1388667306199997068 | 20
5176205029172940157 | 21
5467144456125416399 | 17
6462532412986175674 | 24
How up-level is happening
© DataStax, All Rights Reserved. 18
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
-7509452495886106294 | 5
-6715243485458697746 | 10
-5477287129830487822 | 16
-5034495173465742853 | 13
-4156302194539278891 | 11
-4069959284402364209 | 1
-3799847372828181882 | 8
-3248873570005575792 | 2
-2729420104000364805 | 4
-1191135763843456182 | 15
1634052884888577606 | 7
2705480034054113608 | 6
3728482343045213994 | 9
4279681877540623768 | 14
8582886034424406875 | 12
9010454139840013625 | 3
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
-9157060164899361011 | 23
-3974532302236993209 | 19
-2695747960476065067 | 18
-1117083337304738213 | 22
1388667306199997068 | 20
5176205029172940157 | 21
5467144456125416399 | 17
6462532412986175674 | 24
How up-level is happening
© DataStax, All Rights Reserved. 19
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 19 18 22 20 21 17
24
5 10 16 13 11 1 8 2 4 15 7 6 9 14 12
3
How up-level is happening
© DataStax, All Rights Reserved. 20
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19 8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB; 960MB / (4^1 x 160MB) > 1.001
How up-level is happening
© DataStax, All Rights Reserved. 21
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
3728482343045213994 | 9 | aaa
4279681877540623768 | 14 | aaa
5176205029172940157 | 21 | aaa
5467144456125416399 | 17 | aaa
6462532412986175674 | 24 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19
8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
Total size of all L1 SSTables ≈ 40MB/partition x 16 partitions = 640MB; 640MB / (4^1 x 160MB) ≤ 1.001
How up-level is happening
© DataStax, All Rights Reserved. 22
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19
8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
28 30 27 29 26 31 32
25
How up-level is happening
© DataStax, All Rights Reserved. 23
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19
8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3
28 30 27 29 26 31 32
25
How up-level is happening
© DataStax, All Rights Reserved. 24
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19
27 20 7 6 29 9 14 26 21 17 31
24
32 25 12 328 30 8 2 4 18 15 22
Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB; 960MB / (4^1 x 160MB) > 1.001
lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;
How up-level is happening
© DataStax, All Rights Reserved. 25
system.token(key) | key | value (40MB)
----------------------+-----+-------------
-9157060164899361011 | 23 | aaa
-7509452495886106294 | 5 | aaa
-7492342649816923291 | 28 | aaa
-6715243485458697746 | 10 | aaa
-5477287129830487822 | 16 | aaa
-5034495173465742853 | 13 | aaa
-4944298994521308347 | 30 | aaa
-4156302194539278891 | 11 | aaa
-4069959284402364209 | 1 | aaa
-3974532302236993209 | 19 | aaa
-3799847372828181882 | 8 | aaa
-3248873570005575792 | 2 | aaa
-2729420104000364805 | 4 | aaa
-2695747960476065067 | 18 | aaa
-1191135763843456182 | 15 | aaa
-1117083337304738213 | 22 | aaa
945780720898208171 | 27 | aaa
1388667306199997068 | 20 | aaa
1634052884888577606 | 7 | aaa
2705480034054113608 | 6 | aaa
...
8041825692956475789 | 25 | aaa
8582886034424406875 | 12 | aaa
9010454139840013625 | 3 | aaa
L0
Can overlap
No size limit
L1
No overlap
4^1 x 160MB
L2
No overlap
4^2 x 160MB
…
23 5 10 16 13 11 1 19 27 20 7 6
29 9 14 26 21 17 31
24
32 25 12 328 30 8 2
4 18 15 22
lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;
Fully Expired SSTable
© DataStax, All Rights Reserved. 26
• If an SSTable contains only tombstones and it is guaranteed that the SSTable is not shadowing data in any other SSTable, compaction can drop that SSTable entirely
• If you see SSTables with only tombstones (note that TTL'ed data is considered a tombstone once the time to live has expired) that are not being dropped by compaction, it is likely that other SSTables contain older data
• In LCS, these other SSTables containing older data are often in higher levels
• This directly impacts the effectiveness of tombstone compaction, and is why you always want tombstones to reach the highest level as quickly as possible (a simplified sketch of the check follows)
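A heavily simplified sketch of that "not shadowing anything" check; the real logic lives in CompactionController.getFullyExpiredSSTables and also considers memtables, gc_grace and repair state, and SSTableMeta below is a made-up holder for the handful of metadata fields used here:

import java.util.List;

final class FullyExpiredSketch {
    // Made-up holder for the few metadata fields this sketch needs.
    static final class SSTableMeta {
        boolean onlyTombstones;     // true if the SSTable holds nothing but (TTL-expired) tombstones
        long maxTimestamp;          // newest write timestamp in this SSTable
        long minTimestamp;          // oldest write timestamp in this SSTable
        int  maxLocalDeletionTime;  // when the newest tombstone becomes gc-able
    }

    // A candidate can be dropped whole if everything in it is a gc-able tombstone and
    // no overlapping SSTable can contain older data that those tombstones still shadow.
    static boolean fullyExpired(SSTableMeta candidate, List<SSTableMeta> overlapping, int gcBefore) {
        if (!candidate.onlyTombstones || candidate.maxLocalDeletionTime >= gcBefore)
            return false;                                   // still has live data or un-gc-able tombstones
        for (SSTableMeta other : overlapping)
            if (other.minTimestamp <= candidate.maxTimestamp)
                return false;                               // might hold older data the tombstones shadow
        return true;
    }
}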
Write Amplification
© DataStax, All Rights Reserved. 27
• All fresh data enters L0 and has to move up one level at a time; no skipping levels is possible
• In the ideal situation the highest level holds 90% of the data, hence more data -> more write amplification, because any data that eventually ends up on the highest level N has been written N+1 times (a small worked example follows)
• This is why LCS is more demanding on CPU and IO capability compared to STCS
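A rough worked example of that N+1 factor, under the assumption that the node's highest populated level is L4: a row is written once at flush time (into L0) and rewritten once per promotion (L0->L1, L1->L2, L2->L3, L3->L4), i.e. 5 writes, so 200GB of logical data settling in L4 costs on the order of 1TB of compaction writes. This is only a lower bound, since each up-level compaction also rewrites the overlapping SSTables already sitting in the target level.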
JBOD and CASSANDRA-6696
• JBOD is a primary use case for choosing LCS
• However, it used to create a "zombie resurrection" problem when a JBOD disk fails
• CASSANDRA-6696 (C* 3.2+) creates a flush writer and a compaction strategy per disk, so all flush and compaction activity on each disk is responsible for only one set of token ranges
• More parallelism for compaction (no worries about overlap at higher levels between different compaction threads)
• This also enables LCS to handle more data per node, as each JBOD volume does compaction on its own
© DataStax, All Rights Reserved. 28
How it affects you?
Best Practices
Where LCS fits the best
• Use cases that need very consistent read performance with a much higher read-to-write ratio
• Wide-partition data models with a limited (or slow-growing) number of total partitions but a lot of updates and deletes, or a fully TTL'ed dataset
© DataStax, All Rights Reserved. 30
Typical anti-pattern use cases for LCS
• Heavy writes with all-unique partitions
• Too-frequent memtable flushes
• Nodes with a large number of vnodes
• Frequent bulk loads with the goal of finishing the load as quickly as possible
• Any other condition that can create too many L0 SSTables
• High-density nodes (> 1TB per node or > 0.5TB per CQL table per node)
• Nodes without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources
• Partitions that are too wide (> 160MB)
• Using LCS simply to save disk space (because "LCS requires less free space")
• Having too many SSTables without enough file handles while using snapshot repair
© DataStax, All Rights Reserved. 31
Troubleshooting
• debug.log
• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledManifest TRACE
• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledCompactionStrategy TRACE
• log_all=true (CASSANDRA-10805, C* 3.6+) or JMX
• Enable it in the table schema (see the example below)
• Creates an individual compaction.log file outside of system.log and debug.log
• Can be very useful for a visualizer
• The offline compaction-stress tool enables it automatically
© DataStax, All Rights Reserved. 32
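A sketch of enabling it in the table schema; ks.tbl is a placeholder name, and log_all is the compaction sub-option introduced by CASSANDRA-10805 (when altering compaction you have to restate the class):

ALTER TABLE ks.tbl
WITH compaction = {'class': 'LeveledCompactionStrategy', 'log_all': 'true'};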
Visualization demo
• Developed by Carl Yeksigian, still a work in progress (version 0.1)
• Leverages the compaction.log generated by CASSANDRA-10805, which was also authored by Carl
© DataStax, All Rights Reserved. 33
Myth Busters
Observations that often confuse people
Myth #1
• Using LCS, can I avoid STCS's 50% free space requirement and use up as much as 90% of my disk space?
WRONG
• You should not size your hardware based on 90% disk utilization
• STCS in L0 can easily throw off this assumption; also, when L0 is overloaded, L0->L1 compactions can easily break the "up to 90%" assumption
© DataStax, All Rights Reserved. 35
Myth #2
• Why do so many small SSTables show up in L0 when I start the Repair Service?
• Every repair session forces the affected nodes to flush the corresponding memtable
• This is the only way to guarantee the merkle tree comparison won't miss any data still in the memtable
• If something is wrong with repair scheduling, two repair sessions can be scheduled back-to-back, causing unnecessary memtable flushes
• If you use vnodes, one range of data tends to spread across many nodes and naturally many SSTables, so when streaming the replica difference it's more likely to be sourced from many nodes and many SSTables
© DataStax, All Rights Reserved. 36
Myth #3
• Why is compaction no longer running while I'm still seeing something like [1, 12/10, 104/100, …]?
• Even when not doing STCS, compacted L0 SSTables don't always end up in L1: if the combined size of the L0 SSTables involved in a compaction is smaller than max_sstable_size, the output will stay in L0
• Upper-level compaction is triggered by compaction scores (watch debug.log to see it in action), and the compaction score is based entirely on size, not on the number of SSTables in each level; for brevity of display, "nodetool cfstats" chooses to show the number of SSTables
© DataStax, All Rights Reserved. 37
Myth #4
• Why does it take so long to free up disk space after a delete?
• Because any newly inserted data (including deletes/tombstones) has to bubble up level by level
• A partition can have fragments in all of the different levels
• For the data belonging to a partition to be cleared, the tombstone has to bubble up to the top level and also expire
• See the fully expired SSTable slide
© DataStax, All Rights Reserved. 38
What’s coming
JIRA references to LCS’s past, present and future
What has happened since C* 2.1
• Partition SSTables by token ranges (CASSANDRA-6696, C* 3.2+)
• Try to include overlapping sstables in tombstone compactions to be able to actually drop the tombstones
(CASSANDRA-7019, C* 3.10+) See talk this afternoon at 4:10pm in 210C
• Add major compaction to LCS (CASSANDRA-7272, C* 2.2+), however note CASSANDRA-11817
• Send source sstable level when bootstrapping or replacing nodes (CASSANDRA-7460, C* 2.2+)
• Adjusting LCS to work with incremental repair (CASSANDRA-8004, C* 2.1.2+)
• A tool called sstableofflinerelevel to create a "decent" sstable leveling from a bunch of given SSTables
(CASSANDRA-8301, C* 2.1.5+)
• Use a new JBOD volume each time a newly compacted SSTable is written (CASSANDRA-8329, C* 2.1.3+)
• Compaction throughput under stress improved by 2x with various optimizations (CASSANDRA-8915, C* 3.0+,
CASSANDRA-8920, C* 2.2+, CASSANDRA-8988, C* 2.2+, CASSANDRA-10099, C* 3.6+)
• Compact only certain token range (CASSANDRA-10643, C* 3.10+)
• Allow L0 STCS and L0->L1 LCS to happen at the same time if there’s no overlap (CASSANDRA-10979, C*
2.1.14+)
• Improve Cassandra startup speed with LCS tables (CASSANDRA-12114, C* 3.0.9+, C* 3.8+)
• Compaction Throttling is not working as expected (CASSANDRA-12366, C* 3.10+)
• Compute last compacted keys for each level at startup (CASSANDRA-6216, C* 3.0.9+, C* 3.10+)
© DataStax, All Rights Reserved. 40
JIRAs coming
A few JIRAs worth waiting for (patch already available), watching (WIP), or contributing/upvoting (not yet started):
• Different optimizations to pick compaction candidates to avoid write amplification and merge/eliminate
tombstones (CASSANDRA-11106)
• Range Aware compaction (CASSANDRA-10540)
• Create new sstables in the highest possible level (CASSANDRA-6323)
• Fractal cascading (CASSANDRA-7065)
• Allow multiple overlapping SSTables in L1 or above (CASSANDRA-7409)
• L0 should have a separate configurable bloom filter false positive ratio (CASSANDRA-8434)
• Dynamically adjusting the ideal top level size to how much data is actually in it (CASSANDRA-9829)
• Option to disable bloom filter in highest level of LCS sstables (CASSANDRA-9830)
• LCS repair: compact tables before making available in L0 (CASSANDRA-10862)
• Make the fanout size for LeveledCompactionStrategy to be configurable (CASSANDRA-11150)
• Improve parallelism on higher level compactions in LCS (CASSANDRA-12464)
© DataStax, All Rights Reserved. 42
Some parting thoughts
• Try out the latest C* trunk/3.x and play with the logging level
• Test out a bigger sstable_size_in_mb in your environment (CASSANDRA-12591)
• Leverage the new compaction-stress tool to evaluate your deployment environment
© DataStax, All Rights Reserved. 43
Q & A
Appendix / Deleted Scenes
The stuff we didn’t have time to cover
Some more comments on LevelDB design from
Cassandra source code
“LevelDB gives each level a score of how much data it contains vs its ideal amount, and compacts the level with the
highest score. But this falls apart spectacularly once you get behind. Consider this set of levels:
L0: 988 [ideal: 4]
L1: 117 [ideal: 10]
L2: 12 [ideal: 100]
The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll do is compact a batch of
MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the result (say, 120 sstables) in L1. Then we'll
compact the next batch of MAX_COMPACTING_L0, and so forth. So we spend most of our i/o rewriting the L1 data
with each batch.
If we could just do *all* L0 a single time with L1, that would be ideal. But we can’t. LevelDB's way around this is to
simply block writes if L0 compaction falls behind. We don't have that luxury. So instead, we 1) force compacting higher
levels first, which minimizes the i/o needed to compact optimally which gives us a long term win, and 2) if L0 falls
behind, we will size-tiered compact it to reduce read overhead until we can catch up on the higher levels.
This isn't a magic wand -- if you are consistently writing too fast for LCS to keep up, you're still screwed. But if instead
you have intermittent bursts of activity, it can help a lot.”
© DataStax, All Rights Reserved. 46
Basics on compaction in general
• Why you need compaction
• Log Structured Merge Tree
• Immutable data files (SSTables)
• Super fast writes, but a read needs to touch as few data files as possible
• What it usually does
• Normally combines *multiple* SSTables and generates another set of SSTables, merging partition fragments from the source SSTables
• In this process, removes tombstones or converts expired data to tombstones
• Tombstone compaction can also be automatically triggered on individual SSTables
• How it is triggered
• Background / minor compaction
• Major compaction
• User defined compaction
© DataStax, All Rights Reserved. 47
Impact from Incremental Repair
• Anti-compaction at the end of an incremental repair session creates repaired and unrepaired SSTables
• Maintaining separate pools of repaired and unrepaired SSTables adds extra complexity for LCS
• Two instances of LeveledCompactionStrategy run, one for each pool
• Moving an SSTable (at L1+) from unrepaired to repaired can cause overlap, and we will have to send that SSTable back to L0
© DataStax, All Rights Reserved. 48
What is suspectness and why does it matter?
• An SSTable is marked as suspect when any IO exception is encountered while reading any component of that SSTable. For example, a corrupted SSTable component can lead compaction to mark the SSTable as suspect.
“Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're
basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable.
So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best. ”
• This is actually one of the reasons why you may see a huge backup whenever there is SSTable corruption: if the corruption happens in L1, no L0 compaction can happen any more, so no matter how many compaction threads you have, you may be blocked on L0->L1 compaction.
© DataStax, All Rights Reserved. 49
Subrange compaction
• For partitions with the "Justin Bieber" effect
• Only compact a given sub-range
• Can be useful if you know a partition/token has been misbehaving - either gathering many updates or many deletes
• "nodetool compact -st x -et y" picks all SSTables containing the range between x and y and issues a compaction for those SSTables (see the example below)
• With LCS it can issue the compaction for a subset of the SSTables; the resulting SSTables will end up in L0
© DataStax, All Rights Reserved. 50
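A hypothetical invocation against the demo table from earlier (lcstest.t1), using two tokens from that demo's output; -st/-et are the start/end token flags added by CASSANDRA-10643 (C* 3.10+):

nodetool compact -st -3394362371136168911 -et -2426286626330648673 lcstest t1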
Typical anti-pattern use cases for LCS
• Heavy writes with all-unique partitions
• Too-frequent memtable flushes
• Nodes with a large number of vnodes
• Any other condition that can create too many L0 SSTables
• High-density nodes (> 1.5TB per node or > 1TB per CQL table per node)
• Nodes without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources
• Frequent bulk loads with the goal of finishing the load as quickly as possible
• Partitions that are too wide (> 160MB)
• Using LCS simply to save disk space (because "LCS requires less free space")
• Having too many SSTables without enough file handles while using snapshot repair
© DataStax, All Rights Reserved. 51
How is Flush triggered?
• memtable reaches threshold
• commitlog reaches threshold
• memtable_flush_period_in_ms
• manual flush
• repair triggering flush
• snapshot/backup
• nodetool drain
• drop table
• when you change compaction strategy
• When the node is requested to transfer ranges (streaming-in)
© DataStax, All Rights Reserved. 52
Other reasons massive SSTables can show up in L0
• Large numbers of vnodes with repair streaming (CASSANDRA-10862)
• Bootstrap without sending the source level (CASSANDRA-7460)
• The bulk load utility sstableloader run against cluster-wide snapshots can be another bad source of small L0 SSTables (RF x RF) (CASSANDRA-4756)
© DataStax, All Rights Reserved. 54
Pitfalls to avoid
• Write rates that exceed what compaction can keep up with
• Really large partitions (LCS is known to handle wide partitions well, but you cannot go too wide, and especially should not exceed max_sstable_size)
• A single CQL table holding too much data with LCS compaction
• Large num_tokens
• Bugs or data corruption that can block LCS compaction and create a lot of L0 overflow
© DataStax, All Rights Reserved. 55
Switch between STCS and LCS
• How do you do this properly?
• Simply changing the table schema can create a lot of impact on your production workload
• Least intrusive approach (sketched below):
• Use JMX to change the compaction parameters - set CompactionParametersJson
• Finish compaction on a couple of nodes before starting the next
• Be careful: don't trigger a schema update on this node before compaction on all nodes finishes
• At the end, make the schema change permanent with ALTER TABLE
© DataStax, All Rights Reserved. 56
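A sketch of the node-by-node flow with placeholder names; the MBean path shown follows the C* 3.x layout and can differ by version, so verify it against your own JMX console before relying on it:

On each node in turn, via any JMX client:
  MBean:     org.apache.cassandra.db:type=Tables,keyspace=ks,table=tbl
  Attribute: CompactionParametersJson
  Value:     {"class": "LeveledCompactionStrategy", "sstable_size_in_mb": "160"}
Wait for nodetool compactionstats to drain on that node before moving on, then make it permanent in cqlsh:
  ALTER TABLE ks.tbl WITH compaction = {'class': 'LeveledCompactionStrategy'};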
Tuning Knobs to be aware of
• concurrent_compactors in yaml (be mindful of heap pressure)
• compaction_throughput_mb_per_sec in yaml
• All other yaml parameters impacting flush frequency
• memtable_heap_space_in_mb
• memtable_offheap_space_in_mb
• memtable_cleanup_threshold
• commitlog_total_space_in_mb
• JVM flag to disable stcs_in_l0: -Dcassandra.disable_stcs_in_l0
• sstable_size_in_mb in the table schema (see the example below)
• Tombstone compaction related parameters in the table schema
• For TTL’ed data, gc_grace_seconds in table schema
• Fan out size (CASSANDRA-7871, CASSANDRA-11550)
• Major compaction (caution)
• You can always throttle your write throughput
• Surprise! Switching to STCS can be an option too
© DataStax, All Rights Reserved. 57
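A sketch combining the table-level knobs into one statement; ks.tbl and all values are placeholders rather than recommendations (sstable_size_in_mb, tombstone_threshold, tombstone_compaction_interval and unchecked_tombstone_compaction are standard compaction sub-options, and the low gc_grace_seconds only makes sense for a fully TTL'ed table that never sees explicit deletes):

ALTER TABLE ks.tbl
WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '320',
  'tombstone_threshold': '0.2',
  'tombstone_compaction_interval': '86400',
  'unchecked_tombstone_compaction': 'true'
}
AND gc_grace_seconds = 3600;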
Useful tools
• Passive
• nodetool cfstats
• nodetool compactionstats
• nodetool compactionhistory
• Transaction log in the data directory (CASSANDRA-7066)
• JMX for monitoring (check out the two MBeans: o.a.c.db.ColumnFamilyStoreMBean,
o.a.c.db.compaction.CompactionManagerMBean)
• Proactive
• sstableofflinerelevel (great tool, but don’t expect it to do magic) (CASSANDRA-8301)
• Major compaction (performance caveat!) (CASSANDRA-7272)
• sstablelevelreset (Weapon of Massive Destruction) (CASSANDRA-5271)
• User Defined Compaction - JMX or nodetool (CASSANDRA-10660); see the example below
© DataStax, All Rights Reserved. 58
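A hypothetical example of the last item; nodetool gained a --user-defined flag for compact via CASSANDRA-10660, and the same thing can be done through the CompactionManager MBean's forceUserDefinedCompaction operation (the data file path below is just a placeholder):

nodetool compact --user-defined /var/lib/cassandra/data/ks/tbl/mc-913-big-Data.db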
Signs of a struggling LCS compaction
• 100s-1000s (or more) of L0 SSTables in cfstats
• 1000s (or more) of pending compactions in compactionstats
• Heavy GC activity that is not driven by the read/write workload
• Much more disk space utilized than the real data volume
• Running out of file handles
• OOM with snapshot repair
© DataStax, All Rights Reserved. 59
Typical debug.log Entries
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:119 - Adding BigTableReader(path='/home/automaton/cassandra-
trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big-Data.db') to L3
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,236 CompactionTask.java:249 - Compacted (7f458a20-6976-11e6-9ee3-dd591a930c3b) 1 sstables to
[/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big,] to level=3. 160.000MiB to 160.000MiB (~100% of
original) in 12,865ms. Read Throughput = 12.436MiB/s, Write Throughput = 12.436MiB/s, Row Throughput = ~55,379/s. 719,931 total partitions merged to 719,931.
Partition merge counts were {1:719931, }
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 225, 2, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:319 - Compaction score for level 3 is 0.0010000005662441254
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 2 is 1.0000003707408904
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 1 is 23.493476694822313
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:573 - Choosing candidates for L1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 0: 0
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 1: 107
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 2: 0
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 3: 1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,238 LeveledManifest.java:335 - Compaction candidates for L1 are standard1-913(L1),
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,238 CompactionTask.java:153 - Compacting (86f10a60-6976-11e6-9ee3-dd591a930c3b) [/home/automaton/cassandra-
trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-913-big-Data.db:level=1, ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:149 - Replacing [standard1-674(L2), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:166 - Adding [standard1-1179(L3), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055
© DataStax, All Rights Reserved. 60
Typical compaction.log Entries
{"type":"pending","keyspace":"stresscql","table":"blogposts","time":1472851796649
,"strategyId":"1","pending":5}
{"type":"compaction","keyspace":"stresscql","table":"blogposts","time":1472851799
027,"start":"1472851795595","end":"1472851799027","input":[{"strategyId":"1","tab
le":{"generation":3244,"version":"mc","size":169275615,"details":{"level":2,"min_
token":"-5663415854288007738","max_token":"-
5630042604729221974"}}},{"strategyId":"1","table":{"generation":3970,"version":"m
c","size":156101287,"details":{"level":3,"min_token":"-
5681368161876797914","max_token":"-
5523713966014233820"}}}],"output":[{"strategyId":"1","table":{"generation":4003,"
version":"mc","size":170059193,"details":{"level":3,"min_token":"-
5681368161876797914","max_token":"-
5654579422012458443"}}},{"strategyId":"1","table":{"generation":4008,"version":"m
c","size":155296723,"details":{"level":3,"min_token":"-
5654551486308333779","max_token":"-5523713966014233820"}}}]}
© DataStax, All Rights Reserved. 61
Myth #5
• Why can't I use up all of my concurrent compactor threads?
• It's all about overlap
• L0 -> L1 can (most likely) only happen in one thread
• L1+ up-leveling can only happen when an L(N) SSTable overlaps with a small subset of L(N+1) SSTables
• Some of your SSTables may be marked as suspect and so cannot participate in compaction
© DataStax, All Rights Reserved. 62
Myth #6
• Why does the memtable flush at a much smaller size than the allocated memtable space, causing way more L0 SSTables than expected?
• The threshold for a regular memtable flush is calculated not only from the allocated on-heap and off-heap memtable space, but also from memtable_cleanup_threshold (worked example below)
• memtable_cleanup_threshold by default is calculated from memtable_flush_writers (inverse proportion)
• It's recommended to always set this value manually, especially if you've manually configured memtable_flush_writers
• Your commitlog total space may also be too low, which causes the commitlog to fill up too quickly under heavy writes
© DataStax, All Rights Reserved. 63
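A worked example under the default rule that memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1); the yaml values are hypothetical:

memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 0
memtable_flush_writers: 4        -> default memtable_cleanup_threshold = 1 / (4 + 1) = 0.2

A flush of the largest memtable kicks in once total memtable usage crosses 0.2 x 2048MB ≈ 410MB, so with several busy tables each individual memtable gets flushed while still far smaller than the 2GB you thought you allocated.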
Myth #7
• Why is the pending compaction number often not accurate for LCS?
• It's a known limitation of the current implementation
• Look at this code for how the pending compaction task number is calculated (it is an *estimate*):
estimated[i] = (long) Math.ceil((double) Math.max(0L, SSTableReader.getTotalBytes(sstables)
    - (long) (maxBytesForLevel(i, maxSSTableSizeInBytes) * 1.001)) / (double) maxSSTableSizeInBytes);
• In other words, it's (total data in the current level - the current level's capacity) / 160MB, summed across all levels (worked example below)
• But it never considers the additional writes needed beyond the current level
© DataStax, All Rights Reserved. 64
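Plugging in the numbers from the debug.log slide a few pages back (L1 holding 36.709GiB against a 10 x 160MiB capacity) shows where its "Estimating [0, 225, 2, ...]" line comes from (rounded slightly):

L1 bytes     ≈ 36.709 GiB ≈ 37,590 MiB
L1 capacity  = 10 x 160 MiB x 1.001 ≈ 1,602 MiB
estimated[1] = ceil((37,590 - 1,602) / 160) = ceil(224.9) = 225

Note this only counts the data that must leave L1; the follow-on work it will create in L2 and above is not included, which is why the pending count understates reality once you fall behind.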
Myth #8
• Why am I seeing a very low compaction throughput number (MB/s) when I have super fast SSDs and plenty of idle CPU?
• L0->L1 can only be single-threaded, so a faster CPU helps but more CPU cores won't for this phase
• Partly a reporting problem, especially if you're using compression; see CASSANDRA-11697 (WIP)
• Partly a throttling problem; see CASSANDRA-12366 (C* 3.10+)
• Between 2.1 and 3.0, compaction throughput under stress improved by 2x, thanks to CASSANDRA-8988 (C* 2.2+), CASSANDRA-8920 (C* 2.2+), CASSANDRA-8915 (C* 3.0+)
• There are still remaining issues, but they are being actively worked on, so you should watch CASSANDRA-11697 closely
© DataStax, All Rights Reserved. 65

LMAX Disruptor - High Performance Inter-Thread Messaging Library
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
 
Bohan pac sec_2016
Bohan pac sec_2016Bohan pac sec_2016
Bohan pac sec_2016
 
Hekaton introduction for .Net developers
Hekaton introduction for .Net developersHekaton introduction for .Net developers
Hekaton introduction for .Net developers
 
LISP and NSH in Open vSwitch
LISP and NSH in Open vSwitchLISP and NSH in Open vSwitch
LISP and NSH in Open vSwitch
 
The Nightmare of Locking, Blocking and Isolation Levels!
The Nightmare of Locking, Blocking and Isolation Levels!The Nightmare of Locking, Blocking and Isolation Levels!
The Nightmare of Locking, Blocking and Isolation Levels!
 
Redis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetupRedis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetup
 
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest The Amino Analytical Framework - Leveraging Accumulo to the Fullest
The Amino Analytical Framework - Leveraging Accumulo to the Fullest
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Webinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under ControlWebinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under Control
 

More from DataStax

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Recently uploaded (20)

WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public AdministrationWSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
 
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAMWSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public AdministrationWSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
 
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million PeopleWSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next IntegrationWSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
WSO2Con2024 - Simplified Integration: Unveiling the Latest Features in WSO2 L...
WSO2Con2024 - Simplified Integration: Unveiling the Latest Features in WSO2 L...WSO2Con2024 - Simplified Integration: Unveiling the Latest Features in WSO2 L...
WSO2Con2024 - Simplified Integration: Unveiling the Latest Features in WSO2 L...
 

The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

  • 1. Wei Deng, Ryan Svihla The Missing Manual: Leveled Compaction Strategy
  • 2. 2© DataStax, All Rights Reserved. 1 Why 2 Background 3 How it affects you 4 Myth Busters 5 What’s coming
  • 3. About us © DataStax, All Rights Reserved. 3 Wei Deng, Ryan Svihla • Both been working for DataStax for 3 years • Helped a lot of customers on production deployments and shared their struggles with LCS • Active on LCS related JIRAs
  • 4. Why? Are we lazy, or crazy to talk about a five-year-old technology
  • 5. Why? • Motivations • Not a lot of technical content dedicated to LCS • Many sharp edges for non-suspecting users that can cause operation pains • Intended Audience • You’ve been managing C* cluster using LCS for a few months and have been burned by it a few times • You just started using C* and are wondering what are the pitfalls to watch out • What you can expect to get out of this talk • A better understanding about inner-workings of LCS and some counter-intuitive behaviors • Understand some concepts frequently used in the code (but not adequately discussed in existing documentation) before diving into the LCS source code and JIRAs © DataStax, All Rights Reserved. 5
  • 6. A brief history on LCS • One of the three main Compaction Strategies in C*: STCS, LCS, DTCS/TWCS • Original idea comes from LevelDB of Google’s Chromium project • Introduced in C* 1.0 almost 5 years ago • A lot of people adopted it in production starting from C* 1.2 days • Been going through many changes and is still evolving • At various points people from community would like to change it to the default compaction strategy to replace STCS but didn’t succeed (see CASSANDRA-7987 and CASSANDRA-12209) • Interesting divide in the community: committers and users • A more detailed history can be found by looking at labels = “lcs” on ASF JIRA site. © DataStax, All Rights Reserved. 6
  • 7. A Quick Diversion to LevelDB • LevelDB was introduced by Google, designed as an embedded key/value store in browsers and mobile OS • It was designed for low concurrency embedded purposes • Spawned many additional key/value store use cases (Bitcoin, Akka, ActiveMQ, RocksDB, etal) • NoSQL database that adopted it for a multiuser database (riak, hyperdex, etal) had to do some serious surgery similar to LCS (Don't block writes for L0, concurrent compactions, etc.) • As Basho’s Matthew Von-Maszewski pointed out in his Riak LevelDB talk, there are some common problems that many LevelDB forks suffer, e.g. L0 overwhelmed (pause), write amplification, repair speed © DataStax, All Rights Reserved. 7
  • 9. What is overlap? • It’s not a LCS-only concept, but it’s quite important for understanding LCS • How the code decides overlap between two SSTableReader objects (‘current’ and ‘previous’): • current.first.compareTo(previous.last) <= 0 • Assume you’ve done the following operations to insert a few partitions and incur two flushes: • Insert P1, P2, P3, P4, flush • Insert P5, P6, P7, P8, flush © DataStax, All Rights Reserved. 9 cqlsh:lcstest> select token(key), key from t1; system.token(key) | key ----------------------+----- -8293482207053059348 | P3 -3394362371136168911 | P7 -2426286626330648673 | P8 2250124551998075523 | P1 3019394044321026632 | P6 5060311682392199866 | P4 5805828953249703115 | P5 9074316327295047501 | P2 -3394362371136168911 <= 9074316327295047501 -8293482207053059348 | P3 2250124551998075523 | P1 5060311682392199866 | P4 9074316327295047501 | P2 previous -3394362371136168911 | P7 -2426286626330648673 | P8 3019394044321026632 | P6 5805828953249703115 | P5 current current.first <= previous.last ?
  • 10. Basic Design Principles • LCS arranges data in SSTables on multiple levels • Newly appeared SSTables (not from compaction) almost always show up in L0 • Only L0 is allowed to have SSTables of any size; L1+ levels can only have 160MB SSTables (with minor exceptions) • Only L0 allows SSTables to overlap; for all other levels, SSTables are guaranteed not to overlap within the level • When LCS is able to catch up (i.e. no SSTable in L0), a read in the worst case only needs to read from N SSTables (N = the highest level that has data), i.e. at most one SSTable from each L(N>=1) level; ideally 90% of reads will hit the highest level and be satisfied by one SSTable • Each L(N) level (N>=1) is supposed to accommodate 10^N SSTables, so each level up can accommodate 10x more SSTables • When an L0 SSTable is picked as a compaction candidate, it will find all other overlapping L0 SSTables (maximum 32), as well as all of the overlapping L1 SSTables (most likely all 10 L1s), perform a compaction, and write the result into multiple SSTables in L1 • When the total size of all SSTables in L1 is greater than 10 x 160MB, the first L1 SSTable after the last compacted key can be picked as the next compaction candidate; it will be compacted together with all overlapping L2 SSTables, and the result will be written into multiple SSTables in L2 • Higher levels follow the same pattern, with candidates added from the highest level down to the lowest © DataStax, All Rights Reserved. 10
  • 12. Basic Design Principles (cont.) • On all levels, a compaction score is periodically calculated as the total size of all non-compacting SSTables in the level divided by that level’s capacity (for example, L1 is supposed to accommodate 1.6GB, L2 is supposed to accommodate 16GB). Starting from the highest level, any level with a score higher than 1.001 can be picked by a compaction thread • Before any higher-level compaction starts, check whether L0 is too far behind (i.e. has > 32 SSTables that are not yet in a compaction); if so, do STCS in L0 first before doing the higher-level LCS compaction • After going through compactions at all higher levels, move on to doing regular LCS compaction in L0 • The number of compaction threads is capped by concurrent_compactors, and throughput is throttled • For each compaction session, the level for the added sstables is the max of the removed ones, plus one if the removed sstables were all on the same level • If we go 25 rounds without compacting in the highest level, we start bringing sstables from that level into lower-level compactions • In addition to the DataStax blog post on LCS, more details are explained in the code comments • http://www.datastax.com/dev/blog/when-to-use-leveled-compaction • org.apache.cassandra.db.compaction.LeveledManifest.java • org.apache.cassandra.db.compaction.LeveledCompactionStrategy.java © DataStax, All Rights Reserved. 12
  • 14. How up-level is happening • Use a very simple key/value table in our demo • For each partition, the key is an integer and the value is a 40MB blob, so each partition is about 40MB and 4 partitions can fill up a 160MB SSTable • Insert partitions with keys 1-8, trigger a flush, then insert 9-16, another flush, then insert 17-24, flush, insert 25-32, and so on • For easier illustration of levels, each L(N>0) level has a capacity of 160MB x 4^N, so L1 has a max capacity of 640MB, L2 has a max capacity of 2.56GB, and so on © DataStax, All Rights Reserved. 14 CREATE TABLE kvstore ( key int PRIMARY KEY, value text );
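A minimal sketch of how this demo workload could be driven, assuming the lcstest keyspace from the earlier overlap example and LCS on the kvstore table; the placeholder values and the exact batching/flush commands are illustrative, not taken from the slides:

    -- run from cqlsh; each value is padded to roughly 40MB in the actual demo
    INSERT INTO lcstest.kvstore (key, value) VALUES (1, '<~40MB of text>');
    INSERT INTO lcstest.kvstore (key, value) VALUES (2, '<~40MB of text>');
    -- ... keys 3 through 8 ...
    -- then, outside cqlsh: nodetool flush lcstest kvstore   (each flush produces one new L0 SSTable)
    -- repeat with keys 9-16, 17-24, 25-32, flushing after each batch of 8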
  • 15.-25. How up-level is happening (illustrated walkthrough; the per-slide token/key tables for the ~40MB partitions are omitted here) © DataStax, All Rights Reserved. 15-25
  • Newly flushed SSTables land in L0, which can overlap and has no size limit; L1 must not overlap and holds 4^1 x 160MB; L2 must not overlap and holds 4^2 x 160MB; and so on
  • The first two flushes (keys 1-16) overlap in L0 and are compacted together into non-overlapping L1 SSTables; the third flush (keys 17-24) then lands in L0 and is compacted with the overlapping L1 SSTables
  • At that point the total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB, and 960MB / (4^1 x 160MB) = 1.5 > 1.001, so an L1 SSTable is picked and compacted with the overlapping L2 SSTables
  • After that up-level compaction, L1 holds ≈ 40MB/partition x 16 partitions = 640MB, and 640MB / (4^1 x 160MB) = 1.0 ≤ 1.001, so L1 is no longer a candidate
  • When the next batch (keys 25-32) is flushed and compacted up from L0, L1 exceeds its capacity again; the next L1 candidate is the first SSTable after the last compacted key, tracked as lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;
  • 26. Fully Expired SSTable © DataStax, All Rights Reserved. 26 • If an SSTable contains only tombstones and it is guaranteed that the SSTable is not shadowing data in any other SSTable, compaction can drop that SSTable • If you see SSTables with only tombstones (note that TTL’ed data is considered tombstones once the time to live has expired) that are not being dropped by compaction, it is likely that other SSTables contain older data • In LCS, these other SSTables containing older data are often in higher levels • This directly impacts the effectiveness of tombstone compaction, which is why you want tombstones to reach the highest level as quickly as possible
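To make the TTL point concrete, here is a hedged sketch of a fully TTL’ed table (the keyspace, table and column names are hypothetical and the values are illustrative, not from the slides): once every cell in an SSTable is past its TTL plus gc_grace_seconds, and no other SSTable holds older data for those partitions, the whole file can be dropped without a normal compaction.

    CREATE TABLE my_ks.sensor_events (
        sensor_id int,
        ts timestamp,
        reading text,
        PRIMARY KEY (sensor_id, ts)
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'}
      AND default_time_to_live = 604800   -- every row expires after 7 days
      AND gc_grace_seconds = 10800;       -- only safe to lower if repair completes well within this window

    -- an explicit per-write TTL works the same way:
    INSERT INTO my_ks.sensor_events (sensor_id, ts, reading)
    VALUES (1, '2016-09-01 00:00:00', '42') USING TTL 604800;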
  • 27. Write Amplification © DataStax, All Rights Reserved. 27 • All fresh data enters L0 and has to move up the levels one by one; no level can be skipped • In the ideal situation the highest level holds 90% of the data, so more data means more write amplification, because any data that eventually shows up on the highest level N had to be written N+1 times • This is why LCS is more demanding on CPU and IO capabilities compared to STCS
  • 28. JBOD and CASSANDRA-6696 • JBOD is a primary use case for choosing LCS • However, it used to create a “zombie resurrection” problem when a JBOD disk fails • CASSANDRA-6696 (C* 3.2+) creates a flush writer and compaction strategy per disk, so all flush and compaction activity for each disk is responsible for only one set of token ranges • More parallelism for compaction (no worries about overlap at higher levels between different compaction threads) • This also allows LCS to handle more data per node, as each JBOD volume is doing compaction on its own © DataStax, All Rights Reserved. 28
  • 29. How it affects you? Best Practices
  • 30. Where LCS fits the best • Use cases needing very consistent read performance with a much higher read-to-write ratio • Wide-partition data models with a limited (or slow-growing) number of total partitions but a lot of updates and deletes, or a fully TTL’ed dataset © DataStax, All Rights Reserved. 30
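A minimal sketch of declaring LCS on a table that matches this profile (the keyspace, table and column names are hypothetical, not from the slides); the only LCS-specific part is the compaction map:

    CREATE TABLE my_ks.user_profile (
        user_id uuid,
        field text,
        value text,
        PRIMARY KEY (user_id, field)   -- wide partitions that receive frequent updates/deletes
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};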
  • 31. Typical anti-pattern use cases for LCS • Heavy writes with all unique partitions • Too-frequent memtable flushes • Nodes with a large number of vnodes • Frequent bulk loads with the goal of finishing the load as quickly as possible • Any other condition that could create too many L0 SSTables • High-density nodes (> 1TB per node or > 0.5TB per CQL table per node) • Nodes without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources • Partitions that are too wide (> 160MB) • Using LCS simply to save disk space (because “LCS requires less free space”) • Having too many SSTables without enough file handles while using snapshot repair © DataStax, All Rights Reserved. 31
  • 32. Troubleshooting • debug.log • nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledManifest TRACE • nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledCompactionStrategy TRACE • log_all=true (CASSANDRA-10805, C* 3.6+) or JMX • enabled in the table schema (see the sketch below) • Creates a separate compaction.log file outside of system.log and debug.log • Can be very useful for a visualizer • The offline compaction-stress tool automatically enables it © DataStax, All Rights Reserved. 32
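A hedged sketch of the table-schema route mentioned above, reusing the demo table from the earlier slides; the log_all sub-option comes from CASSANDRA-10805 (C* 3.6+), so verify it against your exact Cassandra version before relying on it:

    ALTER TABLE lcstest.kvstore
      WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': '160',
        'log_all': 'true'    -- writes detailed per-strategy decisions to a separate compaction log
      };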
  • 33. Visualization demo • Developed by Carl Yeksigian, still Work In Progress, 0.1 version • Leverage the compaction.log generated by CASSANDRA-10805, which was also authored by Carl © DataStax, All Rights Reserved. 33
  • 34. Myth Busters Observations that often confuse people
  • 35. Myth #1 • Using LCS, I can avoid STCS’s 50% free space requirement and use up as much as 90% of my disk space? WRONG • You should not size your hardware based on 90% disk utilization • STCS in L0 can easily throw off this assumption; when L0 is overloaded, the L0->L1 compaction can also easily break the “up to 90%” assumption © DataStax, All Rights Reserved. 35
  • 36. Myth #2 • Why do so many small SSTables show up in L0 when I start the Repair Service? • Every repair session forces the affected nodes to flush the corresponding memtable • This is the only way to guarantee the merkle tree comparison won’t miss any data from the memtable • If something is wrong with repair scheduling, two repair sessions could be scheduled back-to-back, causing unnecessary memtable flushes • If you use vnodes, one range of data tends to spread across many nodes and naturally many SSTables, so when streaming the replica difference it is more likely to source from many nodes and many SSTables © DataStax, All Rights Reserved. 36
  • 37. Myth #3 • Why is compaction no longer running while I’m still seeing something like [1, 12/10, 104/100, …]? • Even when not doing STCS, compacted L0 SSTables don’t always end up in L1: if the combined size of the L0 SSTables involved in a compaction is smaller than max_sstable_size, the output will stay in L0 • Upper-level compaction is triggered by compaction scores (watch debug.log to see it in action), and the compaction score is based entirely on size, not on the number of SSTables in each level. For brevity of display, “nodetool cfstats” chooses to use the # of SSTables © DataStax, All Rights Reserved. 37
  • 38. Myth #4 • Why does it take so long to free up disk space after a delete? • Because any newly inserted data (including deletes/tombstones) has to bubble up through the levels • A partition can have fragments in all different levels • For the data belonging to a partition to be cleared, the tombstone has to bubble up to the top level and also be expired (past gc_grace_seconds) • See the Fully Expired SSTable slide © DataStax, All Rights Reserved. 38
  • 39. What’s coming JIRA references to LCS’s past, present and future
  • 40. What has happened since C* 2.1 • Partition SSTables by token ranges (CASSANDRA-6696, C* 3.2+) • Try to include overlapping sstables in tombstone compactions to be able to actually drop the tombstones (CASSANDRA-7019, C* 3.10+) See talk this afternoon at 4:10pm in 210C • Add major compaction to LCS (CASSANDRA-7272, C* 2.2+), however note CASSANDRA-11817 • Send source sstable level when bootstrapping or replacing nodes (CASSANDRA-7460, C* 2.2+) • Adjusting LCS to work with incremental repair (CASSANDRA-8004, C* 2.1.2+) • A tool called sstableofflinerelevel to create a “decent” sstable leveling from a bunch of given SSTables (CASSANDRA-8301, C* 2.1.5+) • Use a new JBOD volume every time a newly compacted SSTable is written (CASSANDRA-8329, C* 2.1.3+) • Compaction throughput under stress improved by 2x with various optimizations (CASSANDRA-8915, C* 3.0+, CASSANDRA-8920, C* 2.2+, CASSANDRA-8988, C* 2.2+, CASSANDRA-10099, C* 3.6+) • Compact only a certain token range (CASSANDRA-10643, C* 3.10+) • Allow L0 STCS and L0->L1 LCS to happen at the same time if there’s no overlap (CASSANDRA-10979, C* 2.1.14+) • Improve Cassandra startup speed with LCS tables (CASSANDRA-12114, C* 3.0.9+, C* 3.8+) • Compaction throttling is not working as expected (CASSANDRA-12366, C* 3.10+) • Compute last compacted keys for each level at startup (CASSANDRA-6216, C* 3.0.9+, C* 3.10+) © DataStax, All Rights Reserved. 40
  • 42. JIRAs coming A few JIRAs worth waiting for (patch already available), watching (WIP) or contributing/upvoting (not yet started): • Different optimizations to pick compaction candidates to avoid write amplification and merge/eliminate tombstones (CASSANDRA-11106) • Range-aware compaction (CASSANDRA-10540) • Create new sstables in the highest possible level (CASSANDRA-6323) • Fractional cascading (CASSANDRA-7065) • Allow multiple overlapping SSTables in L1 or above (CASSANDRA-7409) • L0 should have a separate configurable bloom filter false positive ratio (CASSANDRA-8434) • Dynamically adjust the ideal top level size to how much data is actually in it (CASSANDRA-9829) • Option to disable bloom filter in highest level of LCS sstables (CASSANDRA-9830) • LCS repair: compact tables before making available in L0 (CASSANDRA-10862) • Make the fanout size for LeveledCompactionStrategy configurable (CASSANDRA-11150) • Improve parallelism on higher level compactions in LCS (CASSANDRA-12464) © DataStax, All Rights Reserved. 42
  • 43. Some parting thoughts • Try out the latest C* trunk/3.x and play with the logging levels • Test out a bigger sstable_size_in_mb in your environment (CASSANDRA-12591) • Leverage the new compaction-stress tool to evaluate your deployment environment © DataStax, All Rights Reserved. 43
  • 44. Q & A
  • 45. Appendix / Deleted Scenes The stuff we didn’t have time to cover
  • 46. Some more comments on LevelDB design from Cassandra source code “LevelDB gives each level a score of how much data it contains vs its ideal amount, and compacts the level with the highest score. But this falls apart spectacularly once you get behind. Consider this set of levels: L0: 988 [ideal: 4] L1: 117 [ideal: 10] L2: 12 [ideal: 100] The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll do is compact a batch of MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the result (say, 120 sstables) in L1. Then we'll compact the next batch of MAX_COMPACTING_L0, and so forth. So we spend most of our i/o rewriting the L1 data with each batch. If we could just do *all* L0 a single time with L1, that would be ideal. But we can’t. LevelDB's way around this is to simply block writes if L0 compaction falls behind. We don't have that luxury. So instead, we 1) force compacting higher levels first, which minimizes the i/o needed to compact optimally which gives us a long term win, and 2) if L0 falls behind, we will size-tiered compact it to reduce read overhead until we can catch up on the higher levels. This isn't a magic wand -- if you are consistently writing too fast for LCS to keep up, you're still screwed. But if instead you have intermittent bursts of activity, it can help a lot.”© DataStax, All Rights Reserved. 46
  • 47. Basics on compaction in general • Why you need compaction • Log Structured Merge Tree • Immutable data files (SSTables) • Super fast writes, but a read needs to touch as few data files as possible • What it usually does • Normally combines *multiple* SSTables and generates another set of SSTables, merging partition fragments from the source SSTables • In this process, removes tombstones or converts expired data into tombstones • Tombstone compaction can also be automatically triggered on individual SSTables • How it is triggered • Background / minor compaction • Major compaction • User defined compaction © DataStax, All Rights Reserved. 47
  • 48. Impact from Incremental Repair • Anti-compaction at the end of an incremental repair session will create repaired and unrepaired SSTables • Maintaining separate pools of repaired and unrepaired SSTables causes extra complexity for LCS • Two instances of LeveledCompactionStrategy are running, one for each pool • When moving an SSTable (at L1+) from unrepaired to repaired, it can cause overlap, and we will have to send this SSTable back to L0 © DataStax, All Rights Reserved. 48
  • 49. What is suspectness and why does it matter? • A suspect SSTable is marked when an IO exception is encountered while reading any component of that SSTable. For example, corruption in an SSTable component can lead compaction to mark the SSTable as suspect. “Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable. So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best. ” • This is actually one of the reasons why you may see a huge backup whenever there is sstable corruption: if the corruption happens in L1, no L0 compaction can happen any more, so no matter how many compaction threads you have, you may be blocked on L0->L1 compaction. © DataStax, All Rights Reserved. 49
  • 50. Subrange compaction • For partitions with the “Justin Bieber” effect (a few very hot partitions) • Only compact a given sub-range • Could be useful if you know a partition/token has been misbehaving - either gathering many updates or many deletes • “nodetool compact -st x -et y” will pick all SSTables containing the range between x and y and issue a compaction for those sstables • With LCS it can issue the compaction for a subset of the SSTables; the resulting SSTables will end up in L0 © DataStax, All Rights Reserved. 50
  • 51. Typical anti-pattern use cases for LCS • Heavy writes with all unique partitions • Too-frequent memtable flushes • Nodes with a large number of vnodes • Any other condition that could create too many L0 SSTables • High-density nodes (> 1.5TB per node or > 1TB per CQL table per node) • Nodes without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources • Frequent bulk loads with the goal of finishing the load as quickly as possible • Partitions that are too wide (> 160MB) • Using LCS simply to save disk space (because “LCS requires less free space”) • Having too many SSTables without enough file handles while using snapshot repair © DataStax, All Rights Reserved. 51
  • 52. How is Flush triggered? • memtable reaches threshold • commitlog reaches threshold • memtable_flush_period_in_ms • manual flush • repair triggering flush • snapshot/backup • nodetool drain • drop table • when you change compaction strategy • When the node is requested to transfer ranges (streaming-in) © DataStax, All Rights Reserved. 52
  • 54. Other reasons massive numbers of SSTables can show up in L0 • Many vnodes combined with repair streaming (CASSANDRA-10862) • Bootstrap without sending the source level (CASSANDRA-7460) • The bulk load utility sstableloader, when fed cluster-wide snapshots, can be another bad source of small L0 SSTables (RF x RF copies) (CASSANDRA-4756) © DataStax, All Rights Reserved. 54
  • 55. Pitfalls to avoid • Write rates that exceed what compaction can keep up with • Really large partitions (LCS is usually known to handle wide partitions well, but you cannot go too far, especially not beyond the configured sstable_size_in_mb) • A single CQL table holding too much data with LCS compaction • Large num_tokens • Bugs or data corruption that can block LCS compaction and create a lot of L0 overflow © DataStax, All Rights Reserved. 55
  • 56. Switch between STCS and LCS • How do you properly do that? • Simply changing the table schema can have a big impact on your production workload, since every node starts recompacting at once • Least intrusive approach • Use JMX to change the compaction parameters – set CompactionParametersJson • Let compaction finish on a couple of nodes before starting the next batch • Be careful: don’t trigger a schema update on these nodes before compaction has finished on all nodes • At the end, make the schema change permanent with ALTER TABLE © DataStax, All Rights Reserved. 56
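One hedged way to do the node-local JMX step from Java is sketched below. The per-table CompactionParametersJson attribute is the one mentioned above; the exact MBean ObjectName pattern varies between versions (type=ColumnFamilies/columnfamily= on 2.x vs type=Tables/table= on 3.x), and the keyspace/table names ("myks"/"mytable") are placeholders, so verify the name with jconsole before relying on this sketch.

import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: flip one node's table to LCS via JMX without touching the schema,
// so the rest of the cluster keeps its current strategy until you're ready.
public class SwitchToLcsViaJmx
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // 3.x-style ObjectName; adjust for your version, keyspace and table
            ObjectName table = new ObjectName(
                    "org.apache.cassandra.db:type=Tables,keyspace=myks,table=mytable");
            String lcs = "{\"class\":\"LeveledCompactionStrategy\","
                       + "\"sstable_size_in_mb\":\"160\"}";
            mbs.setAttribute(table, new Attribute("CompactionParametersJson", lcs));
        }
        finally
        {
            connector.close();
        }
    }
}

Once compaction has finished everywhere, the permanent schema change is a normal ALTER TABLE, for example: ALTER TABLE myks.mytable WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};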
  • 57. Tuning Knobs to be aware of • concurrent_compactors in yaml (be mindful of heap pressure) • compaction_throughput_mb_per_sec in yaml • All other yaml parameters impacting flush frequency • memtable_heap_space_in_mb • memtable_offheap_space_in_mb • memtable_cleanup_threshold • commitlog_total_space_in_mb • JVM flag to disable STCS in L0: -Dcassandra.disable_stcs_in_l0=true • sstable_size_in_mb in table schema • Tombstone-compaction-related parameters in table schema • For TTL’ed data, gc_grace_seconds in table schema • Fan-out size (CASSANDRA-7871, CASSANDRA-11550) • Major compaction (caution) • You can always throttle your write throughput • Surprise! Switching to STCS can be an option too © DataStax, All Rights Reserved. 57
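To tie the yaml knobs together, here is an illustrative cassandra.yaml excerpt; the values are examples for discussion, not recommendations, and should be tuned against your own workload:

concurrent_compactors: 4
compaction_throughput_mb_per_sec: 64
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
memtable_cleanup_threshold: 0.2
commitlog_total_space_in_mb: 8192

The STCS-in-L0 switch is a JVM system property, typically added to cassandra-env.sh (or jvm.options on 3.0+) as -Dcassandra.disable_stcs_in_l0=true.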
  • 58. Useful tools • Passive • nodetool cfstats • nodetool compactionstats • nodetool compactionhistory • Transaction log in the data directory (CASSANDRA-7066) • JMX for monitoring (check out the two MBeans: o.a.c.db.ColumnFamilyStoreMBean, o.a.c.db.compaction.CompactionManagerMBean) • Proactive • sstableofflinerelevel (great tool, but don’t expect it to do magic) (CASSANDRA-8301) • Major compaction (performance caveat!) (CASSANDRA-7272) • sstablelevelreset (Weapon of Massive Destruction) (CASSANDRA-5271) • User Defined Compaction - JMX or nodetool (CASSANDRA-10660) © DataStax, All Rights Reserved. 58
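A few hedged usage examples for the proactive tools (paths, keyspace and table names are placeholders; double-check the flags against your version's built-in help, and run the offline tool only while the node is stopped):

sstableofflinerelevel --dry-run mykeyspace mytable
sstableofflinerelevel mykeyspace mytable
nodetool compact --user-defined /var/lib/cassandra/data/mykeyspace/mytable-<id>/mc-1234-big-Data.db

On versions without the nodetool option, the same user defined compaction can be invoked through the forceUserDefinedCompaction operation on CompactionManagerMBean.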
  • 59. Signs of a struggling LCS compaction • 100s-1000s (or more) of L0 SSTables in cfstats • 1000s (or more) of pending compactions in compactionstats • Heavy GC activities that are not driven by read/write workload • Much more disk space utilized than the real data volume • Running out of file handles • OOM with snapshot repair © DataStax, All Rights Reserved. 59
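For example, for an LCS table nodetool cfstats prints an “SSTables in each level” line; a healthy table looks something like [2, 10, 98, 40, 0, 0, 0, 0, 0], while a struggling one shows a huge first entry with slash-marked overflow such as [2741/4, 118/10, 1016/100, ...], where the number after the slash is the level's expected maximum. The figures here are illustrative.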
  • 60. Typical debug.log Entries
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:119 - Adding BigTableReader(path='/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big-Data.db') to L3
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,236 CompactionTask.java:249 - Compacted (7f458a20-6976-11e6-9ee3-dd591a930c3b) 1 sstables to [/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big,] to level=3. 160.000MiB to 160.000MiB (~100% of original) in 12,865ms. Read Throughput = 12.436MiB/s, Write Throughput = 12.436MiB/s, Row Throughput = ~55,379/s. 719,931 total partitions merged to 719,931. Partition merge counts were {1:719931, }
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 225, 2, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:319 - Compaction score for level 3 is 0.0010000005662441254
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 2 is 1.0000003707408904
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 1 is 23.493476694822313
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:573 - Choosing candidates for L1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 0: 0
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 1: 107
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 2: 0
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 3: 1
TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,238 LeveledManifest.java:335 - Compaction candidates for L1 are standard1-913(L1),
DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,238 CompactionTask.java:153 - Compacting (86f10a60-6976-11e6-9ee3-dd591a930c3b) [/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-913-big-Data.db:level=1, ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:149 - Replacing [standard1-674(L2), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:166 - Adding [standard1-1179(L3), ]
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055
TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055
© DataStax, All Rights Reserved. 60
  • 61. Typical compaction.log Entries
{"type":"pending","keyspace":"stresscql","table":"blogposts","time":1472851796649,"strategyId":"1","pending":5}
{"type":"compaction","keyspace":"stresscql","table":"blogposts","time":1472851799027,"start":"1472851795595","end":"1472851799027","input":[{"strategyId":"1","table":{"generation":3244,"version":"mc","size":169275615,"details":{"level":2,"min_token":"-5663415854288007738","max_token":"-5630042604729221974"}}},{"strategyId":"1","table":{"generation":3970,"version":"mc","size":156101287,"details":{"level":3,"min_token":"-5681368161876797914","max_token":"-5523713966014233820"}}}],"output":[{"strategyId":"1","table":{"generation":4003,"version":"mc","size":170059193,"details":{"level":3,"min_token":"-5681368161876797914","max_token":"-5654579422012458443"}}},{"strategyId":"1","table":{"generation":4008,"version":"mc","size":155296723,"details":{"level":3,"min_token":"-5654551486308333779","max_token":"-5523713966014233820"}}}]}
© DataStax, All Rights Reserved. 61
  • 62. Myth #5 • Why can’t I use up all of my concurrent compactor threads? • It’s all about overlap • L0 -> L1 can (most likely) only happen in one thread • L1+ up-level compaction can only happen when an L(N) SSTable overlaps with a small subset of L(N+1) SSTables • Some of your SSTables may be marked as suspect, so they cannot participate in compaction © DataStax, All Rights Reserved. 62
  • 63. Myth #6 • Why does the memtable flush at a much smaller size than the allocated memtable space, causing way more L0 SSTables than expected? • The threshold for a regular memtable flush is calculated based not only on the allocated on-heap and off-heap memtable space, but also on memtable_cleanup_threshold • memtable_cleanup_threshold by default is calculated from memtable_flush_writers (1 / (memtable_flush_writers + 1)) • It’s recommended to always set this value manually, especially if you’ve manually configured memtable_flush_writers • Your commitlog total space may also be too low, which leads to the commitlog filling up too quickly under heavy writes © DataStax, All Rights Reserved. 63
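A hypothetical worked example (the numbers are illustrative): with memtable_heap_space_in_mb: 2048, memtable_offheap_space_in_mb: 0 and memtable_flush_writers: 8, the default memtable_cleanup_threshold is 1 / (8 + 1) ≈ 0.11, so a flush is triggered once memtables occupy only about 0.11 × 2048 MB ≈ 227 MB (far below the 2 GB you might expect); under a heavy unique-partition write load that translates directly into a stream of small L0 SSTables.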
  • 64. Myth #7 • Why is the pending compaction number often not accurate for LCS? • It’s a known limitation of the current implementation • Look at this code for how the pending compaction task numbers are calculated (it is an *estimate*): estimated[i] = (long)Math.ceil((double)Math.max(0L, SSTableReader.getTotalBytes(sstables) - (long)(maxBytesForLevel(i, maxSSTableSizeInBytes) * 1.001)) / (double)maxSSTableSizeInBytes); • In other words, it’s (total data in the current level – current level’s capacity) / 160MB, with the individual estimates from all levels then added up • But it never considers the additional compaction work needed beyond the current level © DataStax, All Rights Reserved. 64
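To make that concrete with a hypothetical example: L1's target capacity is 10 × 160 MB = 1600 MB, so if L1 currently holds 5120 MB the formula gives ceil((5120 − 1600 × 1.001) / 160) ≈ 22 pending tasks for that level. The per-level estimates are then summed, but none of them accounts for the follow-on compactions those 22 tasks will themselves trigger in L2 and above.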
  • 65. Myth #8 • Why am I seeing a very low compaction throughput number (MB/s) when I have super fast SSDs and plenty of idle CPU? • L0->L1 can only be single-threaded, so a faster CPU helps but more CPU cores won’t for this phase • Partly a reporting problem, especially if you’re using compression, see CASSANDRA-11697 (WIP) • Partly a throttling problem, see CASSANDRA-12366 (C* 3.10+) • Between 2.1 and 3.0, we’ve improved compaction throughput under stress by 2x, because of CASSANDRA-8988 (C* 2.2+), CASSANDRA-8920 (C* 2.2+), CASSANDRA-8915 (C* 3.0+) • There are still remaining issues, but they are being actively worked on, so you should watch CASSANDRA-11697 closely © DataStax, All Rights Reserved. 65

Editor's Notes

  1. For example: the default LevelDB algorithm blocks writers for one millisecond each time the amount of un-compacted data exceeds a predefined threshold, thus limiting further writes to at most one thousand per second and effectively stalling writes until compaction catches up. By limiting writes in this manner, LevelDB ensures that the cost of a read will not grow without bound -- effectively trading write throughput for read throughput.
  2. Before we move on to the design principles of LCS, let’s first take a look at a key concept – overlap. Even though this is not an LCS-specific concept, if you want to read the LCS source code or related JIRAs, being crystal clear on what overlap truly means helps a lot. I’ve copy/pasted a tiny piece from the source code here. In the code, assuming we have two SSTable objects (current and previous), we compare current.first <= previous.last; if it’s true, then overlap exists between the two SSTables. Now let’s look at an example: assume I’ve got a really simple single-column table, with the only partition key column called ‘key’. I insert P1, P2, P3, P4 in that order and then flush; this generates an SSTable on disk represented by ‘previous’, in blue. Then I insert P5, P6, P7, P8 in that order and do another flush, and a second SSTable, represented by ‘current’, shows up on disk, in black. Note that in both SSTables I deliberately added a column before the partition key: this is the token value corresponding to the partition key. Another immediate observation: no matter what order the partitions were originally inserted in, within an SSTable they are sorted neither by insertion order nor by partition key, but by token value. Now if we use the earlier formula to check overlap, current.first is indeed less than previous.last, as we’re comparing their token values rather than the partition key values. Intuitively, if you run a select statement to get data from these two SSTables, you will see their colors mixed together, so they are indeed overlapping.
  3. 2. The only exception is bootstrap, where freshly streamed SSTables can keep their level from the source node. 3. The anti-pattern can cause >160MB SSTables in higher levels; also, the last-written SSTables may be smaller than 160MB because there is not enough data left to fill them. 3. 160MB is a table compaction parameter that can be adjusted, and we will talk about that in detail later.
  5. The only exception is bootstrap, where freshly streamed SSTables can keep their level from the source node. Explain the motivation for STCS in L0: it keeps read latency from deteriorating too much when there’s a backlog in L0. 3. 160MB is a table compaction parameter that can be adjusted, and we will talk about that in detail later. 5. If we do something that makes many levels contain too little data (cleanup, changing the sstable size) we will “never” compact the high levels. This method detects whether we have gone many compaction rounds without doing any high-level compaction; if so, we start bringing in one sstable from the highest level until that level is either empty or is doing compaction. 6. When picking compaction candidates we have to make sure that the compaction does not create overlap in the target level. This is done by always including all overlapping sstables in the next level.
  7. The 40MB blob for the value is not a realistic data model, but it is simple and good enough to demonstrate the up-level behavior. Each flush will be represented by a different color, so you will see four different colors of SSTables.
  8. As you can see, if we add another SSTable from a flush, this diagram becomes too busy to show. To reduce the clutter, I’m going to rotate all of the SSTables and levels by 90 degrees, and remove the token values while keeping the order of the partition keys.
  10. There are two main downsides of having a high-density node: it’s easy to create an L0 overload situation, and with a lot of data per node more levels are needed, hence higher read latency and more write amplification.
  11. Some of them may also have gone into the 2.0 branch; this list is mainly about the JIRAs created after Sep 2014, which was when C* 2.1 GA was announced.
  13. Leave hot data in the lower levels to avoid write amplification, and also move tombstones to the highest level as quickly as possible. 3. This idea is from LevelDB, where a memtable flush is allowed to go directly to the highest possible level, in this case L(N) where L(N+1) has overlap.
  16. If people realize they have an anti-pattern, they may choose to switch