Scylla 5.0 introduces several new features to improve node operations and compaction:
1. Repair-based node operations (RBNO) provide more efficient, consistent, and simplified bootstrap, replace, rebuild, and other node operations by using row-level repair as the underlying mechanism instead of streaming.
2. Off-strategy compaction keeps sstables generated during node operations in a separate data set and compacts them together after the operation finishes for less compaction work and faster completion.
3. Space amplification goal (SAG) for compaction optimizes space efficiency for overwrite workloads by dynamically adapting compaction to meet latency and space goals, improving storage density.
3. Asias He
■ Asias He is a long-time open source developer who previously
worked on Debian Project, Solaris Kernel, KVM Virtualization for
Linux and OSv unikernel. He now works on Seastar and Scylla.
Principal Software Engineer
6. What is RBNO
■ Use row level repair as the underlying mechanism to sync data between nodes
instead of streaming
■ Single mechanism for all the node operations
• Bootstrap / replace / rebuild / decommission / removenode / repair
7. Benefits of RBNO
Significant improvements in performance and data safety
■ Resumable
• Resume from previous failed bootstrap operations
■ Consistency
• Latest replica is guaranteed
■ Simplified
• No need to run repair before or after node operations like replace and removenode
■ Unified
• All node ops use the same underlying mechanism
8. Towards using RBNO by default
■ Enabled by default for replace operations
■ More operations will use RBNO by default in the future
■ All node operations are supported
■ Options to turn on specific node operations
• E.g., --enable-repair-based-node-ops true
• E.g., --allowed-repair-based-node-ops replace, bootstrap
■ I/O scheduler improvements to reduce latency impact
10. Introduction
Make compaction during node operations more efficient
■ What is it
• Sstables generated by node operations are kept in a separate data set
• Compact them together and integrate them into the main set when the node operation is done
■ Benefits
• Less compaction work during node operations
• Faster to complete node operations
11. Current status
■ Enabled for all node ops
• repair, bootstrap, replace, decommission, removenode, rebuild
■ Normal trigger for node ops
• Trigger at the end of node operation
■ Smart trigger for repair
• Wait for more repairs to arrive so that more off-strategy compaction can be batched
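The flow above can be sketched in a few lines of Python (a toy model with illustrative names, not Scylla's actual internals): sstables produced by a node operation land in a separate maintenance set, are compacted together once when the operation finishes, and only then join the main set.

```python
class Table:
    """Toy model of off-strategy compaction (illustrative only)."""
    def __init__(self):
        self.main_set = []         # sstables visible to the main compaction strategy
        self.maintenance_set = []  # sstables produced by an ongoing node operation

    def add_from_node_op(self, sstable):
        # During bootstrap/replace/rebuild/..., new sstables bypass the
        # main strategy so they don't trigger extra compaction work.
        self.maintenance_set.append(sstable)

    def finish_node_op(self):
        # Off-strategy trigger: compact all maintenance sstables together
        # once, then integrate the result into the main set.
        if self.maintenance_set:
            merged = "+".join(sorted(self.maintenance_set))
            self.main_set.append(merged)
            self.maintenance_set = []

t = Table()
for s in ["sst1", "sst2", "sst3"]:
    t.add_from_node_op(s)
t.finish_node_op()
print(t.main_set)  # ['sst1+sst2+sst3']
```

The main strategy never sees the in-flight sstables, which is why the node operation completes with less compaction work.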
13. What is it
■ A new feature that uses RPC verbs instead of gossip status to perform node
operations such as adding or removing a node.
14. Benefits
■ Requires all nodes to participate by default
• Avoids data inconsistency issues if nodes are network partitioned
• Allows users to ignore dead nodes explicitly
• nodetool removenode --ignore-dead-nodes
• scylla --ignore-dead-nodes-for-replace
■ Automatically reverts to the previous state in case of error
■ Detects user operation mistakes
• Detects and rejects multiple node operations run in parallel
■ Each operation is assigned a UUID
• Easier to identify a node operation
15. How to use it
■ No action is needed from the user
■ Enabled for bootstrap, replace, decommission and removenode.
18. Asias He
Principal Software Engineer
19. Agenda
■ Background of tombstones
■ Timeout based tombstone GC
■ Repair based tombstone GC
20. Background
■ Tombstones are used to delete data
■ Tombstones can’t be kept forever
■ Tombstone GC happens when
• The tombstone and the data it covers can be compacted away together
• The tombstone is old enough, i.e., older than gc_grace_seconds
■ A tombstone might be missed on some replicas
■ Then data resurrection happens
• Replica nodes with the tombstone GC it
• Replica nodes without the tombstone still contain data that should have been deleted
• Reads return deleted data to the user
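The resurrection scenario can be sketched as a toy Python model (illustrative only; `read` is a hypothetical merge function, not Scylla's read path):

```python
# Toy model: two replicas hold a deleted key; only one received the tombstone.
replica_a = {"key": "tombstone"}   # got the delete
replica_b = {"key": "value"}       # missed the delete (e.g., the node was down)

# GC runs after gc_grace_seconds: replica_a purges its tombstone.
replica_a.pop("key")

# A later read merges replica states: the stale value wins,
# because nothing marks it as deleted anymore.
def read(key, replicas):
    values = [r[key] for r in replicas if key in r]
    return values[0] if values else None

print(read("key", [replica_a, replica_b]))  # prints "value" -- deleted data resurrected
```

Running a cluster-wide repair before the tombstone is GC'd would have propagated the tombstone to replica_b, avoiding the resurrection.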
21. Current timeout based tombstone gc
■ Must run a full cluster-wide repair within gc_grace_seconds
■ Not robust enough
• Nothing guarantees repair can finish in time
• Repair is a low-priority maintenance operation
■ Puts pressure on people who operate Scylla
■ Performance impact during critical periods
22. Introducing repair-based tombstone GC
■ The idea
• GC a tombstone only after repair is performed
■ Main benefits
• No need to tune and find a proper gc_grace_seconds
• No data resurrection if a cluster-wide repair can’t be performed within gc_grace_seconds
• Less pressure on users operating Scylla clusters to run repairs in a timely manner
• Repair intensity can be throttled even more, reducing the latency impact on user workloads,
since there is no longer a hard requirement to finish repair in time
• If repair is performed more frequently than gc_grace_seconds, tombstones can be
garbage-collected sooner, improving performance
23. How to use it
■ ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
• The mode can be {timeout, repair, disabled, immediate}
• timeout = gc tombstone after gc_grace_seconds
• repair = gc tombstone after repair
• disabled = never gc tombstone
• immediate = gc tombstone immediately
■ CREATE TABLE ks.cf (key blob PRIMARY KEY, val blob) WITH tombstone_gc =
{'mode':'repair'};
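The four modes can be summarized as a toy decision function in Python (a sketch; `can_gc_tombstone` and its parameters are illustrative, not Scylla's implementation):

```python
import time

def can_gc_tombstone(mode, deletion_time, gc_grace_seconds,
                     last_repair_time, now=None):
    """Toy decision function for the tombstone_gc modes (illustrative only)."""
    now = time.time() if now is None else now
    if mode == "immediate":
        return True                      # purge as soon as it compacts
    if mode == "disabled":
        return False                     # never purge
    if mode == "timeout":
        return now - deletion_time > gc_grace_seconds
    if mode == "repair":
        # purge only if a repair covering this data ran after the delete
        return last_repair_time is not None and last_repair_time > deletion_time
    raise ValueError(f"unknown mode: {mode}")

# With mode=repair, frequent repairs let tombstones go sooner than the
# default 10-day gc_grace_seconds would allow:
print(can_gc_tombstone("repair", deletion_time=100,
                       gc_grace_seconds=864000,
                       last_repair_time=200, now=300))  # True
```

Note how in `repair` mode the decision depends only on repair history, not on wall-clock time, which is what removes the race between GC and repair.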
24. More considerations
■ When to use mode = immediate
• Use it for TWCS with no user deletes
• Safer than gc_grace_seconds = 0
• Reject deletes if mode = immediate
■ When to use mode = disabled
• Tools that may load Scylla with out-of-order writes or writes in the past, e.g., sstableloader
and the CDC replicator
• Disable tombstone GC while such tools are in progress
■ What happens if mode = repair but repair cannot finish for some reason
• A new RESTful API to fake repair history
• Use it in an emergency to allow GC
• Or switch back to mode = timeout
25. How to upgrade from existing cluster
■ A gossip feature TOMBSTONE_GC_OPTIONS is added
■ The tombstone_gc option cannot be used until the cluster is fully upgraded
• E.g., in a mixed cluster:
• cqlsh> ALTER TABLE ks.table WITH tombstone_gc = { 'mode':'repair'} ;
• ConfigurationException: tombstone_gc option not supported by the cluster
■ To keep maximum compatibility and introduce fewer surprises to users
• All tables default to mode = timeout (same as without this feature)
• Users have to set mode = repair explicitly
28. Raphael S. Carvalho
■ Member of the ScyllaDB storage team
■ Responsible for the compaction subsystem
■ Previously worked on Syslinux and OSv
Software Engineer at ScyllaDB
29. Agenda
■ Space optimization for incremental compaction
■ “Bucketless” time series, i.e. time series made much easier for you
■ Upcoming improvements
30. Let’s take a look back
■ Incremental compaction (ICS) introduced in enterprise release 2019.1.4
■ Known for combining techniques from both size-tiered and leveled strategies
■ Fixes the 100% space overhead problem in size-tiered compaction, increasing disk utilization significantly.
31. Is it enough?
■ Space overhead in tiered compaction was efficiently fixed, however…
■ Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x)
with overwrite workloads:
• They cover a similar region in the three-dimensional efficiency space, also known as the RUM
conjecture trade-offs
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; STCS and ICS occupy a similar region]
32. Turns out it’s not enough. But can we do better?
■ The leveled and size-tiered (or ICS) strategies cover different regions
• Interesting regions cannot be reached with either strategy
• But they can be reached by combining the data layouts of both strategies
• i.e., a hybrid (tiered + leveled) approach
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; LCS covers a different region than STCS/ICS]
33. Let’s work to optimize space efficiency then
■ A few high-level goals:
• Optimize space efficiency with overwrite workloads
• Ensure write and read latency meet SLA requirements
34. ■ That’s Space Amplification Goal (SAG) for you.
■ Increased storage density per node? YES.
■ Reduce costs? YES.
35. A few facts about this feature
■ This feature (available since Scylla Enterprise 2020.1.6) can only be used with Incremental Compaction
■ Compaction will dynamically adapt to the workload to meet requirements
■ Under heavy write load, compaction strategy will work to meet write latency requirement.
■ Otherwise, strategy works to optimize space efficiency to the desired extent
■ Translates into:
• Storage Density per node ++
• Costs --
• Scale ++
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; ICS+SAG reaches a previously unreachable region]
36. Enabling the space optimization (SAG)
■ This will enable the feature with a space amplification goal of 1.5
■ The lower the configured value the higher the write amplification
■ Adaptive approach minimizes the impact of extra amplification
■ Gives user control to reach interesting regions in the three-dimensional efficiency space
ALTER TABLE keyspace.table
WITH compaction = {
'class': 'IncrementalCompactionStrategy',
'space_amplification_goal': '1.5'
};
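Conceptually, the adaptive behavior can be sketched as a toy mode selector in Python (names and thresholds are illustrative assumptions, not Scylla's actual controller):

```python
def choose_compaction_mode(write_backlog, backlog_limit,
                           space_amp, space_amplification_goal):
    """Toy sketch of SAG's adaptive behavior (illustrative only):
    under heavy write load, favor write latency; otherwise keep
    compacting until space amplification meets the goal."""
    if write_backlog > backlog_limit:
        return "write-optimized"   # keep up with the ingest rate
    if space_amp > space_amplification_goal:
        return "space-optimized"   # cross-tier compaction to shed redundancy
    return "idle"                  # goal met, nothing left to do

print(choose_compaction_mode(write_backlog=10, backlog_limit=100,
                             space_amp=2.0,
                             space_amplification_goal=1.5))  # space-optimized
```

The coexistence of the two modes is what makes the approach hybrid: tiered-style behavior under write pressure, leveled-style space reduction otherwise.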
38. A common schema for time series looked like…
CREATE TABLE billy.readings (
sensor_id int,
date date,
time time,
temperature int,
PRIMARY KEY ((sensor_id, date), time)
)
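To see the complexity this schema pushes onto the application, here is a minimal Python sketch (helper names are hypothetical) of what the client must do with daily buckets:

```python
from datetime import date, datetime, timedelta

def bucket_key(sensor_id, ts):
    # The application must derive the partition key for every write.
    return (sensor_id, ts.date())

def buckets_for_range(sensor_id, start, end):
    # ...and enumerate every daily bucket a range query touches,
    # then query each partition separately and aggregate the results.
    d, out = start.date(), []
    while d <= end.date():
        out.append((sensor_id, d))
        d += timedelta(days=1)
    return out

print(buckets_for_range(42,
                        datetime(2022, 3, 1, 23, 0),
                        datetime(2022, 3, 3, 1, 0)))
# three daily buckets -> three separate partition queries to aggregate
```

With the unbucketed schema shown later, both helpers disappear: one time series is one partition, and a time-range query hits it directly.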
39. Why bucket in time series?
■ Large partitions were known to cause all sorts of problems
• Index inefficiency when reading from the middle of a large partition
• Latency issues when repairing large partitions
• High resource usage and read amplification when querying multiple time windows
• Reactor stalls which caused higher P99 latencies
• And so on…
■ Consequently applications were forced to “bucket” partitions to keep their size within a limit.
40. Bucketed vs Unbucketed time series
[Diagram: a time window spanning sstables 1-4; the bucketed case splits a single time series
into many small partitions, while the unbucketed case keeps it in one large partition]
41. Time series made much easier for you!
■ But those bad days are gone!
• Scylla allows a large partition to be efficiently indexed: O(logN)
• Scylla’s row-level repair allows large partitions to be efficiently repaired
• TimeWindowCompactionStrategy can now efficiently query multiple time windows
• by discarding SSTable files whose time range is irrelevant to the query
• Incrementally opening the relevant files to reduce resource overhead
• Therefore, read amplification and resource usage problems are fixed
42. A schema for time series can now look like…
■ There is no longer a date field in the schema, meaning that:
• The application won’t have to create new partitions on a fixed interval for a time series
• Querying a time series is much easier, as only a single partition is involved
■ Bucketing days are potentially gone!
■ Lots of complexity removed on the application side
CREATE TABLE billy.readings (
sensor_id int,
time time,
temperature int,
PRIMARY KEY (sensor_id, time)
)
43. Upcoming improvements
■ Compaction becoming overall more resilient / performant:
• Changes were recently made to make cleanup and major compactions more resilient when the
system is running out of disk space
• Dynamic control of compaction fan-in to increase overall compaction efficiency
• Based on the observation that efficiency is a function of the number of input files and their
relative sizes
• Don’t dilute overall efficiency: only submit jobs whose efficiency is greater than or equal to
that of ongoing jobs
• Tests show that write amplification is reduced under heavy write load while keeping space and
read amplification within bounds
• Makes the system adapt even better to changing workloads
• More stability. More performance.
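A toy Python sketch of the fan-in idea (the efficiency formula and function names are illustrative assumptions, not Scylla's actual heuristic):

```python
def job_efficiency(file_sizes):
    """Toy efficiency score (illustrative only): more input files of
    similar size -> a cheaper merge per byte written."""
    if len(file_sizes) < 2:
        return 0.0
    similarity = min(file_sizes) / max(file_sizes)   # 1.0 = equal sizes
    return len(file_sizes) * similarity

def should_submit(candidate_sizes, ongoing_efficiencies):
    # Don't dilute overall efficiency: only submit jobs at least as
    # efficient as those already running.
    e = job_efficiency(candidate_sizes)
    return all(e >= o for o in ongoing_efficiencies)

print(should_submit([100, 100, 100, 90], [2.0]))   # 4 similar files -> True
print(should_submit([100, 5], [2.0]))              # dissimilar pair -> False
```

This mirrors why size-tiered and ICS prefer compacting similarly sized files together: a high-fan-in, similar-size job reduces write amplification the most per unit of work.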
44. Upcoming improvements
■ Reduce compaction aggressiveness by:
• Improvements in I/O scheduler (Pavel Emelyanov covers this in depth in his talk)
• Improvements in Compaction backlog controller
• Aiming to improve tail latency and overall system stability
■ Off-strategy compaction (Asias He covers this in depth)
• Makes repair-based node operations more efficient and faster
• Consequently, better elasticity
45. Thank you!
Stay in touch
Raphael S. Carvalho
@raphael_scarv
raphaelsc@scylladb.com
Editor's Notes
Hello everyone and welcome to my talk about SSTable compaction enhancements
My name is Raphael Carvalho and I have been working on the ScyllaDB storage layer since its early days. Enough about me; let’s move on to the interesting part
In this session, we’ll describe space optimization for incremental compaction that will allow the storage density of nodes to increase
Additionally, how Scylla makes it much easier to model time series data. Without having to rely on old techniques like data bucketing, which is commonly used to avoid running into large partition performance issues
Last but not least, we’ll talk about upcoming improvements that will make compaction better for you
Now, let’s take a look back. Incremental compaction strategy, or ICS, was introduced back in 2019, to solve the large space overhead that affected the users. Before its existence, users were left with no choice but to leave 50% of free disk space for compactions to succeed.
But was it enough? Well, the aforementioned space overhead was efficiently fixed by Incremental Compaction; however, it still suffered from bad space amplification when facing overwrite-heavy workloads. That’s because the compaction strategy wasn’t efficient at removing the data redundancy accumulated across the tiers
We use a theoretical model called the RUM conjecture to reason about compaction efficiency. It states that a compaction strategy cannot be optimal at all three efficiency goals: read, write, and space. That’s why we have different strategies available, each better suited to a particular use case
If we look at the three-dimensional efficiency space, which represents the RUM conjecture trade-offs, we’ll see that Incremental and size tiered strategies cover a similar region.
Turns out Incremental Compaction can do much better than fixing the space overhead problem. We know for a fact that leveled and size tiered strategies cover completely different regions in the efficiency space. Also, we know that interesting regions cannot be reached with either of them. However, very interesting regions in the efficiency space can be reached by combining the concepts of both strategies. We call it a hybrid approach.
What do we actually want to accomplish with this hybrid approach? Let’s set a few goals.
First, we want to optimize space efficiency for overwrite workloads, while ensuring write and read latencies meet service-level requirements
In other words, Performance must be sufficient to meet the needs, but (space) efficiency should be as good as possible to allow for scale.
That’s Space amplification goal for you. A feature that will help you increasing storage density per node therefore reducing costs. Who doesn’t like that?
It is available only in Scylla Enterprise and can be used with our incremental compaction only. Everything was carefully implemented. To ensure latency will meet service-level requirements, compaction will dynamically adapt to the workload.
Under heavy write load, compaction strategy will enter a write optimized mode to make sure system can keep up with the write rate. Otherwise, the strategy will be continuously working to optimize space efficiency. The coexistence of both modes is the reason we call this a hybrid approach.
The adaptive approach, combined with the hybrid one is what makes this feature unique in the compaction world.
Let’s get to a bit of action. How to enable the space optimization? That’s simply a matter of specifying a value between 1 and 2 to the strategy option named space_amplification_goal. The lower the value the lower the space amplification but the higher the write amplification. 1.5 is a good value to start with
In order to optimize space efficiency, we’re willing to trade-off extra write amplification. However, the adaptive approach minimizes the impact of the extra amplification given that the strategy will switch between the modes, that is, write and space, whenever it has to.
To conclude, this will nicely give user control to reach interesting regions in the efficiency space, allowing the strategy to perform better for your particular use case
Now comes the interesting part… the optimization in action. We can clearly see in the graph that the lower the configured value the lower the disk usage will be. In the example, with a value of 1.25, the space amplification reached a maximum of 100%, but eventually went below 50% mark. If the system isn’t under heavy write load, then space amplification will meet the goal faster. As expected, the system is optimizing for space once performance objectives are achieved.
Now let’s switch gears and talk about how Scylla made time series less painful for application developers. Please look at that create table statement. That’s a typical schema for time series data. Note how the field date composes the partition key along with the sensor id. That’s a technique called data bucketing.
This bucketing technique is mainly used to prevent large partitions from being created, as they were known to cause all sorts of performance issues.
For example: Scylla was very inefficient when reading from the middle of a large partition
Repair was an enemy of large partitions too
And Time Window Compaction strategy wasn’t optimized for reading a large partition spanning multiple time windows
In the picture, each individual line represents a partition. In the bucketed case, the time series is split into multiple smaller partitions. While in the unbucketed case, the time series is kept in a single large partition.
One of the main problems with bucketing is that lots of complexity is pushed to the application side. The application will have to keep track of all partitions that belong to a particular time series. Also, aggregation is more complex as the application has to figure out which partitions store a particular time range, query each one of them individually and finally aggregate the results.
Fortunately, those bad days are gone.
Scylla fixed all problems aforementioned. Large partitions can be efficiently indexed, row-level repair was introduced to solve the problem of repairing large partitions, and time window compaction strategy can now efficiently read large partitions stored across multiple time windows
When a table uses time window strategy, its SSTable files do not overlap in timestamp range, so a specialized reader was implemented that will discard irrelevant files and efficiently read the relevant ones. This will reduce resource consumption and read amplification, making the queries much more efficient
With all those problems fixed, the schema for a time series application can now look much simpler. Queries become simpler. Your application becomes simpler. Please just make sure you have enough time series (partitions) to avoid hot spots, where a subset of shards may be processing much more data than its counterparts. For example, if your application is monitoring millions of devices, where each has its own time series, then you will not run into imbalance issues. But if you only have a few time series in your application, it’s better to rely on the old bucketing technique to guarantee proper balancing.
As for upcoming improvements,
Cleanup and major compaction will now be more resilient when system is running out of disk space.
The compaction manager will now dynamically control the compaction fan-in, essentially a threshold on minimum # of input files for compaction, to increase overall compaction efficiency, which translates into lower write amplification.
This decision is based on the observation that compaction efficiency is a function of the number of input files for compaction, and also their relative sizes. That’s why size-tiered and ICS favor compacting similarly sized files together
Essentially, we will increase the overall efficiency by not diluting it with compaction jobs that are less efficient than the ongoing compaction jobs. The system becomes more stable and performant as a result of this change.
Last but not least, the I/O scheduler is being nicely enhanced by Pavel Emelyanov. The enhancements will make the system more stable, allowing compaction to have less impact on other activities in the system like user queries, streaming, and so on. To learn more about this, you can watch Pavel’s talk.
And off-strategy compaction was written with the goal of making compaction less aggressive for node operations like bootstrap, regular repair, etc, allowing them to complete faster. Consequently the system will be able to scale faster, making Scylla elasticity even better.