Scylla 5.0 introduces several new features to improve node operations and compaction:
1. Repair-based node operations (RBNO) provide more efficient, consistent, and simplified bootstrap, replace, rebuild, and other node operations by using row-level repair as the underlying mechanism instead of streaming.
2. Off-strategy compaction keeps sstables generated during node operations in a separate data set and compacts them together after the operation finishes for less compaction work and faster completion.
3. Space amplification goal (SAG) for compaction optimizes space efficiency for overwrite workloads by dynamically adapting compaction to meet latency and space goals, improving storage density.
3. Asias He
■ Asias He is a long-time open source developer who previously
worked on Debian Project, Solaris Kernel, KVM Virtualization for
Linux and OSv unikernel. He now works on Seastar and Scylla.
Principal Software Engineer
6. What is RBNO
■ Use row level repair as the underlying mechanism to sync data between nodes
instead of streaming
■ Single mechanism for all the node operations
• Bootstrap / replace / rebuild / decommission / removenode / repair
7. Benefits of RBNO
Significant improvements in performance and data safety
■ Resumable
• Resume from previous failed bootstrap operations
■ Consistency
• Latest replica is guaranteed
■ Simplified
• No need to run repair before or after node operations like replace and removenode
■ Unified
• All node ops use the same underlying mechanism
8. Towards using RBNO by default
■ Enabled by default for replace operations
■ More operations will use RBNO by default in the future
■ All node operations are supported
■ Options to turn on specific node operations
• E.g., --enable-repair-based-node-ops true
• E.g., --allowed-repair-based-node-ops replace, bootstrap
■ I/O scheduler improvements to reduce latency impact
10. Introduction
Make compaction during node operations more efficient
■ What is it
• Sstables generated by node operations are kept in a separate data set
• Compact them together and integrate them into the main set when the node operation is done
■ Benefits
• Less compaction work during node operations
• Faster to complete node operations
11. Current status
■ Enabled for all node ops
• repair, bootstrap, replace, decommission, removenode, rebuild
■ Normal trigger for node ops
• Trigger at the end of node operation
■ Smart trigger for repair
• Wait for more repairs to arrive so that more off-strategy compaction can be batched
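The flow above can be sketched in a few lines of Python (a toy model with illustrative names, not Scylla's actual internals): sstables produced by a node operation land in a separate maintenance set, are compacted together once when the operation finishes, and only then join the main set.

```python
class Table:
    """Toy model of off-strategy compaction (illustrative only)."""
    def __init__(self):
        self.main_set = []         # sstables visible to the main compaction strategy
        self.maintenance_set = []  # sstables produced by an ongoing node operation

    def add_from_node_op(self, sstable):
        # During bootstrap/replace/rebuild/..., new sstables bypass the
        # main strategy so they don't trigger extra compaction work.
        self.maintenance_set.append(sstable)

    def finish_node_op(self):
        # Off-strategy trigger: compact all maintenance sstables together
        # once, then integrate the result into the main set.
        if self.maintenance_set:
            merged = "+".join(sorted(self.maintenance_set))
            self.main_set.append(merged)
            self.maintenance_set = []

t = Table()
for s in ["sst1", "sst2", "sst3"]:
    t.add_from_node_op(s)
t.finish_node_op()
print(t.main_set)  # ['sst1+sst2+sst3']
```

The main strategy never sees the in-flight sstables, which is why the node operation completes with less compaction work.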
13. What is it
■ A new feature that uses RPC verbs instead of gossip status to perform node
operations such as adding or removing a node.
14. Benefits
■ Requires all nodes to participate by default
• Avoids data inconsistency issues if nodes are network partitioned
• Allows users to ignore dead nodes explicitly
• nodetool removenode --ignore-dead-nodes
• scylla --ignore-dead-nodes-for-replace
■ Automatically reverts to the previous state in case of error
■ Detects user operation mistakes
• Detects and rejects multiple node operations run in parallel
■ Each operation is assigned a UUID
• Easier to identify a node operation
15. How to use it
■ No action is needed from the user
■ Enabled for bootstrap, replace, decommission and removenode.
18. Asias He
Principal Software Engineer
19. Agenda
■ Background of tombstones
■ Timeout based tombstone GC
■ Repair based tombstone GC
20. Background
■ Tombstones are used to delete data
■ Tombstones can’t be kept forever
■ Tombstone GC happens when
• The tombstone and the data it covers can be compacted away together
• The tombstone is old enough, i.e., older than gc_grace_seconds
■ A tombstone might be missed on some replicas
■ Then data resurrection happens
• Replica nodes with the tombstone GC it
• Replica nodes without the tombstone still contain data that should have been deleted
• Reads return deleted data to the user
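The resurrection scenario can be sketched as a toy Python model (illustrative only; `read` is a hypothetical merge function, not Scylla's read path):

```python
# Toy model: two replicas hold a deleted key; only one received the tombstone.
replica_a = {"key": "tombstone"}   # got the delete
replica_b = {"key": "value"}       # missed the delete (e.g., the node was down)

# GC runs after gc_grace_seconds: replica_a purges its tombstone.
replica_a.pop("key")

# A later read merges replica states: the stale value wins,
# because nothing marks it as deleted anymore.
def read(key, replicas):
    values = [r[key] for r in replicas if key in r]
    return values[0] if values else None

print(read("key", [replica_a, replica_b]))  # prints "value" -- deleted data resurrected
```

Running a cluster-wide repair before the tombstone is GC'd would have propagated the tombstone to replica_b, avoiding the resurrection.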
21. Current timeout based tombstone gc
■ Must run a full cluster-wide repair within gc_grace_seconds
■ Not robust enough
• Nothing guarantees repair can finish in time
• Repair is a low-priority maintenance operation
■ Puts pressure on people who operate Scylla
■ Performance impact during critical periods
22. Introducing repair-based tombstone GC
■ The idea
• GC a tombstone only after repair is performed
■ Main benefits
• No need to tune and find a proper gc_grace_seconds
• No data resurrection if a cluster-wide repair can’t be performed within gc_grace_seconds
• Less pressure on users operating Scylla clusters to run repairs in a timely manner
• Repair intensity can be throttled even more, reducing the latency impact on user workloads,
since there is no longer a hard requirement to finish repair in time
• If repair is performed more frequently than gc_grace_seconds, tombstones can be
garbage-collected sooner, improving performance
23. How to use it
■ ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
• The mode can be {timeout, repair, disabled, immediate}
• timeout = gc tombstone after gc_grace_seconds
• repair = gc tombstone after repair
• disabled = never gc tombstone
• immediate = gc tombstone immediately
■ CREATE TABLE ks.cf (key blob PRIMARY KEY, val blob) WITH tombstone_gc =
{'mode':'repair'};
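The four modes can be summarized as a toy decision function in Python (a sketch; `can_gc_tombstone` and its parameters are illustrative, not Scylla's implementation):

```python
import time

def can_gc_tombstone(mode, deletion_time, gc_grace_seconds,
                     last_repair_time, now=None):
    """Toy decision function for the tombstone_gc modes (illustrative only)."""
    now = time.time() if now is None else now
    if mode == "immediate":
        return True                      # purge as soon as it compacts
    if mode == "disabled":
        return False                     # never purge
    if mode == "timeout":
        return now - deletion_time > gc_grace_seconds
    if mode == "repair":
        # purge only if a repair covering this data ran after the delete
        return last_repair_time is not None and last_repair_time > deletion_time
    raise ValueError(f"unknown mode: {mode}")

# With mode=repair, frequent repairs let tombstones go sooner than the
# default 10-day gc_grace_seconds would allow:
print(can_gc_tombstone("repair", deletion_time=100,
                       gc_grace_seconds=864000,
                       last_repair_time=200, now=300))  # True
```

Note how in `repair` mode the decision depends only on repair history, not on wall-clock time, which is what removes the race between GC and repair.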
24. More considerations
■ When to use mode = immediate
• Use it for TWCS with no user deletes
• Safer than gc_grace_seconds = 0
• Reject deletes if mode = immediate
■ When to use mode = disabled
• Tools that may load Scylla with out-of-order writes or writes in the past, e.g., sstableloader
and the CDC replicator
• Disable tombstone GC while such tools are in progress
■ What happens if mode = repair but repair cannot finish for some reason
• A new RESTful API to fake repair history
• Use it in an emergency to allow GC
• Or switch back to mode = timeout
25. How to upgrade from existing cluster
■ A gossip feature TOMBSTONE_GC_OPTIONS is added
■ The tombstone_gc option cannot be used until the cluster is fully upgraded
• E.g., in a mixed cluster:
• cqlsh> ALTER TABLE ks.table WITH tombstone_gc = { 'mode':'repair'} ;
• ConfigurationException: tombstone_gc option not supported by the cluster
■ To keep maximum compatibility and introduce fewer surprises to users
• All tables default to mode = timeout (same as without this feature)
• Users have to set mode = repair explicitly
28. Raphael S. Carvalho
■ Member of the ScyllaDB storage team
■ Responsible for the compaction subsystem
■ Previously worked on Syslinux and OSv
Software Engineer at ScyllaDB
29. Agenda
■ Space optimization for incremental compaction
■ “Bucketless” time series, i.e. time series made much easier for you
■ Upcoming improvements
30. Let’s take a look back
■ Incremental compaction (ICS) introduced in enterprise release 2019.1.4
■ Known for combining techniques from both size-tiered and leveled strategies
■ Fixes the 100% space overhead problem in size-tiered compaction, increasing disk utilization significantly.
31. Is it enough?
■ Space overhead in tiered compaction was efficiently fixed, however…
■ Incremental (ICS) and size-tiered (STCS) strategies share the same space amplification (~2-4x)
with overwrite workloads:
• They cover a similar region in the three-dimensional efficiency space, also known as the RUM
conjecture trade-offs
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; STCS and ICS occupy a similar region]
32. Turns out it’s not enough. But can we do better?
■ The leveled and size-tiered (or ICS) strategies cover different regions
• Interesting regions cannot be reached with either strategy
• But they can be reached by combining the data layouts of both strategies
• i.e., a hybrid (tiered + leveled) approach
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; LCS covers a different region than STCS/ICS]
33. Let’s work to optimize space efficiency then
■ A few high-level goals:
• Optimize space efficiency with overwrite workloads
• Ensure write and read latency meet SLA requirements
34. ■ That’s Space Amplification Goal (SAG) for you.
■ Increased storage density per node? YES.
■ Reduce costs? YES.
35. A few facts about this feature
■ This feature (available since Scylla Enterprise 2020.1.6) can only be used with Incremental Compaction
■ Compaction will dynamically adapt to the workload to meet requirements
■ Under heavy write load, compaction strategy will work to meet write latency requirement.
■ Otherwise, strategy works to optimize space efficiency to the desired extent
■ Translates into:
• Storage Density per node ++
• Costs --
• Scale ++
[Diagram: RUM trade-off triangle with axes READ, WRITE, SPACE; ICS+SAG reaches a previously unreachable region]
36. Enabling the space optimization (SAG)
■ This will enable the feature with a space amplification goal of 1.5
■ The lower the configured value the higher the write amplification
■ Adaptive approach minimizes the impact of extra amplification
■ Gives user control to reach interesting regions in the three-dimensional efficiency space
ALTER TABLE keyspace.table
WITH compaction = {
'class': 'IncrementalCompactionStrategy',
'space_amplification_goal': '1.5'
};
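Conceptually, the adaptive behavior can be sketched as a toy mode selector in Python (names and thresholds are illustrative assumptions, not Scylla's actual controller):

```python
def choose_compaction_mode(write_backlog, backlog_limit,
                           space_amp, space_amplification_goal):
    """Toy sketch of SAG's adaptive behavior (illustrative only):
    under heavy write load, favor write latency; otherwise keep
    compacting until space amplification meets the goal."""
    if write_backlog > backlog_limit:
        return "write-optimized"   # keep up with the ingest rate
    if space_amp > space_amplification_goal:
        return "space-optimized"   # cross-tier compaction to shed redundancy
    return "idle"                  # goal met, nothing left to do

print(choose_compaction_mode(write_backlog=10, backlog_limit=100,
                             space_amp=2.0,
                             space_amplification_goal=1.5))  # space-optimized
```

The coexistence of the two modes is what makes the approach hybrid: tiered-style behavior under write pressure, leveled-style space reduction otherwise.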
38. A common schema for time series looked like…
CREATE TABLE billy.readings (
sensor_id int,
date date,
time time,
temperature int,
PRIMARY KEY ((sensor_id, date), time)
)
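To see the complexity this schema pushes onto the application, here is a minimal Python sketch (helper names are hypothetical) of what the client must do with daily buckets:

```python
from datetime import date, datetime, timedelta

def bucket_key(sensor_id, ts):
    # The application must derive the partition key for every write.
    return (sensor_id, ts.date())

def buckets_for_range(sensor_id, start, end):
    # ...and enumerate every daily bucket a range query touches,
    # then query each partition separately and aggregate the results.
    d, out = start.date(), []
    while d <= end.date():
        out.append((sensor_id, d))
        d += timedelta(days=1)
    return out

print(buckets_for_range(42,
                        datetime(2022, 3, 1, 23, 0),
                        datetime(2022, 3, 3, 1, 0)))
# three daily buckets -> three separate partition queries to aggregate
```

With the unbucketed schema shown later, both helpers disappear: one time series is one partition, and a time-range query hits it directly.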
39. Why bucket in time series?
■ Large partitions were known to cause all sorts of problems
• Index inefficiency when reading from the middle of a large partition
• Latency issues when repairing large partitions
• High resource usage and read amplification when querying multiple time windows
• Reactor stalls which caused higher P99 latencies
• And so on…
■ Consequently applications were forced to “bucket” partitions to keep their size within a limit.
40. Bucketed vs Unbucketed time series
[Diagram: a time window spanning sstables 1-4; the bucketed case splits a single time series
into many small partitions, while the unbucketed case keeps it in one large partition]
41. Time series made much easier for you!
■ But those bad days are gone!
• Scylla allows a large partition to be efficiently indexed: O(logN)
• Scylla’s row-level repair allows large partitions to be efficiently repaired
• TimeWindowCompactionStrategy can now efficiently query multiple time windows
• by discarding SSTable files whose time range is irrelevant to the query
• Incrementally opening the relevant files to reduce resource overhead
• Therefore, read amplification and resource usage problems are fixed
42. A schema for time series can now look like…
■ There is no longer a date field in the schema, meaning that:
• The application won’t have to create new partitions on a fixed interval for a time series
• Querying a time series is much easier, as only a single partition is involved
■ Bucketing days are potentially gone!
■ Lots of complexity removed on the application side
CREATE TABLE billy.readings (
sensor_id int,
time time,
temperature int,
PRIMARY KEY (sensor_id, time)
)
43. Upcoming improvements
■ Compaction becoming overall more resilient / performant:
• Changes were recently made to make cleanup and major compactions more resilient when the
system is running out of disk space
• Dynamic control of compaction fan-in to increase overall compaction efficiency
• Based on the observation that efficiency is a function of the number of input files and their
relative sizes
• Don’t dilute overall efficiency: only submit jobs whose efficiency is greater than or equal to
that of ongoing jobs
• Tests show that write amplification is reduced under heavy write load while keeping space and
read amplification within bounds
• Makes the system adapt even better to changing workloads
• More stability. More performance.
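A toy Python sketch of the fan-in idea (the efficiency formula and function names are illustrative assumptions, not Scylla's actual heuristic):

```python
def job_efficiency(file_sizes):
    """Toy efficiency score (illustrative only): more input files of
    similar size -> a cheaper merge per byte written."""
    if len(file_sizes) < 2:
        return 0.0
    similarity = min(file_sizes) / max(file_sizes)   # 1.0 = equal sizes
    return len(file_sizes) * similarity

def should_submit(candidate_sizes, ongoing_efficiencies):
    # Don't dilute overall efficiency: only submit jobs at least as
    # efficient as those already running.
    e = job_efficiency(candidate_sizes)
    return all(e >= o for o in ongoing_efficiencies)

print(should_submit([100, 100, 100, 90], [2.0]))   # 4 similar files -> True
print(should_submit([100, 5], [2.0]))              # dissimilar pair -> False
```

This mirrors why size-tiered and ICS prefer compacting similarly sized files together: a high-fan-in, similar-size job reduces write amplification the most per unit of work.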
44. Upcoming improvements
■ Reduce compaction aggressiveness by:
• Improvements in I/O scheduler (Pavel Emelyanov covers this in depth in his talk)
• Improvements in Compaction backlog controller
• Aiming to improve tail latency and overall system stability
■ Off-strategy compaction (Asias He covers this in depth)
• Makes repair-based node operations more efficient and faster
• Consequently, better elasticity
45. Thank you!
Stay in touch
Raphael S. Carvalho
@raphael_scarv
raphaelsc@scylladb.com
Editor's Notes
Hello everyone and welcome to my talk about SSTable compaction enhancements
My name is Raphael Carvalho and I have been working on the ScyllaDB storage layer since its early days. Enough about me; let’s move on to the interesting part
In this session, we’ll describe space optimization for incremental compaction that will allow the storage density of nodes to increase
Additionally, how Scylla makes it much easier to model time series data. Without having to rely on old techniques like data bucketing, which is commonly used to avoid running into large partition performance issues
Last but not least, we’ll talk about upcoming improvements that will make compaction better for you
Now, let’s take a look back. Incremental compaction strategy, or ICS, was introduced back in 2019, to solve the large space overhead that affected the users. Before its existence, users were left with no choice but to leave 50% of free disk space for compactions to succeed.
But was it enough? Well, the aforementioned space overhead was efficiently fixed by Incremental Compaction; however, it still suffered from bad space amplification when facing overwrite-heavy workloads. That’s because the compaction strategy wasn’t efficient at removing the data redundancy accumulated across the tiers
We use a theoretical model called the RUM conjecture to reason about compaction efficiency. It states that a compaction strategy cannot be optimal at all three efficiency goals: read, write, and space. That’s why we have different strategies available, each better suited to a particular use case
If we look at the three-dimensional efficiency space, which represents the RUM conjecture trade-offs, we’ll see that Incremental and size tiered strategies cover a similar region.
Turns out Incremental Compaction can do much better than fixing the space overhead problem. We know for a fact that leveled and size tiered strategies cover completely different regions in the efficiency space. Also, we know that interesting regions cannot be reached with either of them. However, very interesting regions in the efficiency space can be reached by combining the concepts of both strategies. We call it a hybrid approach.
What do we actually want to accomplish with this hybrid approach? Let’s set a few goals.
First, we want to optimize space efficiency for overwrite workloads, while ensuring write and read latencies meet service-level requirements
In other words, Performance must be sufficient to meet the needs, but (space) efficiency should be as good as possible to allow for scale.
That’s Space amplification goal for you. A feature that will help you increasing storage density per node therefore reducing costs. Who doesn’t like that?
It is available only in Scylla Enterprise and can be used with our incremental compaction only. Everything was carefully implemented. To ensure latency will meet service-level requirements, compaction will dynamically adapt to the workload.
Under heavy write load, compaction strategy will enter a write optimized mode to make sure system can keep up with the write rate. Otherwise, the strategy will be continuously working to optimize space efficiency. The coexistence of both modes is the reason we call this a hybrid approach.
The adaptive approach, combined with the hybrid one is what makes this feature unique in the compaction world.
Let’s get to a bit of action. How to enable the space optimization? That’s simply a matter of specifying a value between 1 and 2 to the strategy option named space_amplification_goal. The lower the value the lower the space amplification but the higher the write amplification. 1.5 is a good value to start with
In order to optimize space efficiency, we’re willing to trade-off extra write amplification. However, the adaptive approach minimizes the impact of the extra amplification given that the strategy will switch between the modes, that is, write and space, whenever it has to.
To conclude, this will nicely give user control to reach interesting regions in the efficiency space, allowing the strategy to perform better for your particular use case
Now comes the interesting part… the optimization in action. We can clearly see in the graph that the lower the configured value the lower the disk usage will be. In the example, with a value of 1.25, the space amplification reached a maximum of 100%, but eventually went below 50% mark. If the system isn’t under heavy write load, then space amplification will meet the goal faster. As expected, the system is optimizing for space once performance objectives are achieved.
Now let’s switch gears and talk about how Scylla made time series less painful for application developers. Please look at that create table statement. That’s a typical schema for time series data. Note how the field date composes the partition key along with the sensor id. That’s a technique called data bucketing.
This bucketing technique is mainly used to prevent large partitions from being created, as they were known to cause all sorts of performance issues.
For example: Scylla was very inefficient when reading from the middle of a large partition
Repair was an enemy of large partitions too
And Time Window Compaction strategy wasn’t optimized for reading a large partition spanning multiple time windows
In the picture, each individual line represents a partition. In the bucketed case, the time series is split into multiple smaller partitions. While in the unbucketed case, the time series is kept in a single large partition.
One of the main problems with bucketing is that lots of complexity is pushed to the application side. The application will have to keep track of all partitions that belong to a particular time series. Also, aggregation is more complex as the application has to figure out which partitions store a particular time range, query each one of them individually and finally aggregate the results.
Fortunately, those bad days are gone.
Scylla fixed all problems aforementioned. Large partitions can be efficiently indexed, row-level repair was introduced to solve the problem of repairing large partitions, and time window compaction strategy can now efficiently read large partitions stored across multiple time windows
When a table uses time window strategy, its SSTable files do not overlap in timestamp range, so a specialized reader was implemented that will discard irrelevant files and efficiently read the relevant ones. This will reduce resource consumption and read amplification, making the queries much more efficient
With all those problems fixed, the schema for a time series application can now look much simpler. Queries become simpler. Your application becomes simpler. Please just make sure you have enough time series (partitions) to avoid hot spots, where a subset of shards may be processing much more data than its counterparts. For example, if your application is monitoring millions of devices, where each has its own time series, then you will not run into imbalance issues. But if you only have a few time series in your application, it’s better to rely on the old bucketing technique to guarantee proper balancing.
As for upcoming improvements,
Cleanup and major compaction will now be more resilient when system is running out of disk space.
The compaction manager will now dynamically control the compaction fan-in, essentially a threshold on minimum # of input files for compaction, to increase overall compaction efficiency, which translates into lower write amplification.
This decision is based on the observation that compaction efficiency is a function of the number of input files for compaction, and also their relative sizes. That’s why size-tiered and ICS favor compacting similarly sized files together
Essentially, we will increase the overall efficiency by not diluting it with compaction jobs that are less efficient than the ongoing compaction jobs. The system becomes more stable and performant as a result of this change.
Last but not least, the I/O scheduler is being nicely enhanced by Pavel Emelyanov. The enhancements will make the system more stable, allowing compaction to have less impact on other activities in the system like user queries, streaming, and so on. To learn more about this, you can watch Pavel’s talk.
And off-strategy compaction was written with the goal of making compaction less aggressive for node operations like bootstrap, regular repair, etc, allowing them to complete faster. Consequently the system will be able to scale faster, making Scylla elasticity even better.