Scylla 5.0 New
Features, Part 1
Avi Kivity, CTO; Eliran Sinvani, Software Team Leader; Botond
Dénes, Software Engineer; Tomasz Grabiec, Distinguished
Software Engineer; Kamil Braun, Software Engineer
I/O Scheduling
In ScyllaDB 5.0
Avi Kivity
CTO
Avi Kivity
■ Original maintainer of Linux KVM - Kernel-based Virtual Machine
■ Co-maintainer of Seastar, ScyllaDB
■ Co-founder of ScyllaDB
CTO
Why I/O Scheduling?
A database is a balancing act…
■ Your reads
■ Compaction
■ Repair/bootstrap/decommission
The spice bytes must flow
[Diagram: Read, Compaction, and Maintenance queues feed a single Scheduler, which feeds the Disk.]
Understanding disk performance
The new I/O Scheduler
■ Collect information about disks
■ Build a more accurate mathematical disk model
■ Embody the model into the I/O scheduler
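A minimal sketch of the modeling idea, not ScyllaDB's actual scheduler code (all names below are hypothetical): given measured bandwidth and IOPS figures for a disk, each request can be assigned a normalized cost, the fraction of a second of disk time it consumes, and the scheduler admits requests only as fast as the model says the disk can absorb them.

```cpp
#include <cstdint>

// Hypothetical disk model built from measured disk properties,
// e.g. bandwidth/IOPS numbers collected by benchmarking the disk.
struct disk_model {
    double read_bw;     // bytes/sec the disk sustains for reads
    double read_iops;   // read requests/sec the disk sustains
    double write_bw;    // bytes/sec for writes
    double write_iops;  // write requests/sec
};

// Normalized cost of a request in "disk seconds": a request pays for
// its size (bandwidth-bound term) and for simply existing (IOPS-bound
// term). Summed costs per second must stay at or below 1.0 for the
// disk to keep up.
double request_cost(const disk_model& d, uint64_t length, bool is_write) {
    double bw   = is_write ? d.write_bw   : d.read_bw;
    double iops = is_write ? d.write_iops : d.read_iops;
    return double(length) / bw + 1.0 / iops;
}
```

Under this toy model, a 128 KB read on a disk rated at 1 GB/s and 100k read IOPS costs about 0.000131 + 0.00001 ≈ 0.000141 disk seconds, so roughly 7,000 such reads per second saturate the disk.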
Thank you!
Stay in touch
Avi Kivity
@AviKivity
avi@scylladb.com
ScyllaDB 5.0
Workload Specific
Optimizations
Eliran Sinvani
Software Team Leader
Eliran Sinvani
■ Core SW Team Leader at ScyllaDB for the past 3 years.
■ BSP and Embedded SW Team Leader at Airspan Networks for
over a year.
■ 3 years as Embedded SW engineer in the Cellular industry (both
UE and BS sides).
Software Team Leader
Dealing With Different Workloads
As the number of use cases supported by Scylla keeps growing, we sometimes encounter conflicting requirements for different types of workloads.
■ Parallelism
■ Mean latency or P99
■ User priorities (different SLA requirements)
Workload Prioritization
The key to dealing with some differentiating aspects of workloads is provided in our enterprise version and is called workload prioritization. It already provides several benefits:
■ Resource isolation of workloads (CPU and Memory)
■ Prioritization of workloads
■ Can balance to some extent between OLAP and OLTP workloads
OLTP and OLAP latencies with workload prioritization enabled.
Problem: Workload Isolation Is Not Enough :(
■ We have solved (to some extent) the problem of cross workload impact on
resource allocation.
■ But:
● The way requirements are expressed is lacking:
● Only a quantitative description - we characterize workloads by shares: the more shares a workload gets relative to others, the more important it is and the more resources it will get.
● A lot of real-world requirements can’t be expressed
● A relative description somewhat breaks the isolation concept - if there is isolation, why should I care about the relation between workloads? (at least in some aspects)
● Some of Scylla’s configuration options and behaviours are global
● Timeouts
● Parallelism limitation (Botond’s talk: Improvements to the OOM resilience of reads on the replica
side)
Prioritization and isolation are simply not enough
Example: A web server database with analytics
Scenario:
1. Main workload: present information to a user in response to a click on a webpage.
2. Secondary workload: periodically run some DB-wide analytics.
Example: A web server database with analytics cont.
Main workload:
1. Needs at worst (timeout) tens to hundreds of ms of latency, or the page will appear unresponsive to some users.
2. Has high concurrency, as requests are independent.
Example: A web server database with analytics cont.
Secondary workload:
1. Needs as much throughput as possible.
2. Has bounded and controllable concurrency (since it originates from the same client/logic).
Example: A web server database with analytics cont.
The timeout dilemma:
1. We will need to set some timeout on the server side. This timeout must satisfy two constraints:
2. For the main workload it can’t be too high, since that will cause the interactive user either to retry (click again and again) or to drop the request; both waste resources, or even worse, can be experienced as unbounded concurrency by the server.
3. It can’t be too low, or the analytics requests will fail, since achieving high throughput normally increases latency as the queues fill up.
Example: A web server database with analytics cont.
Overload response:
1. The interactive client (main workload) can’t be throttled, since the requests are unrelated: delaying the response to application user A will not cause some other user B to delay or stop sending requests (unbounded concurrency).
2. The batch workload, in contrast, we would like to throttle, since this gives us a knob that controls the pace of the analytics workload (bounded concurrency).
Examples Of Workload Characteristics
■ Latency distribution (some approximate desired histogram)
■ Timeout
■ Throughput / latency orientation (i.e. OLTP vs. OLAP)
■ Expected parallelism
■ Burstiness
Benefits of Workload Characterization
■ Better cluster utilization, increased ability to serve multi-workload scenarios
● Side effect: serving on the same cluster means less administrative overhead.
■ Resource usage efficiency and more correct resource distribution
■ Better overload handling
■ More accurate metrics (e.g. the per-workload timeout metric is reliable, with no need to check latency distributions to calculate timeouts)
■ Smarter alerting when requirements can’t be met
■ Better isolation capabilities (not necessarily relative)
■ Can serve as a base for elasticity applications (e.g. grow to meet requirements)
The Service Level Mechanism
We have already implemented some capabilities using our service levels mechanism, which we imported from our enterprise version and extended to support more detailed workload configuration.
■ Service Level -
• Contains a workload characteristic (can be partial)
• Can be attached to a role
■ A connection’s workload characteristics are determined by merging all service levels attached to the authenticated role and its parent roles.
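As a rough illustration of that merging step (hypothetical types and merge rules; the authoritative rules live in the service-levels design notes referenced at the end of this talk), the effective characteristics could be computed by folding the attached service levels together, taking the strictest timeout:

```cpp
#include <algorithm>
#include <chrono>
#include <optional>
#include <vector>

// Hypothetical, simplified service level: every field is optional,
// since a service level may specify only part of the characteristics.
struct service_level {
    std::optional<std::chrono::milliseconds> timeout;
    std::optional<int> shares;  // Enterprise only
    enum class workload { unspecified, interactive, batch };
    workload type = workload::unspecified;
};

// Merge all service levels attached to a role and its parent roles.
// Assumed rules, for illustration only: the strictest (smallest)
// timeout wins, the highest shares win, and an explicit workload type
// overrides an unspecified one.
service_level merge(const std::vector<service_level>& attached) {
    service_level effective;
    for (const auto& sl : attached) {
        if (sl.timeout) {
            effective.timeout = effective.timeout
                ? std::min(*effective.timeout, *sl.timeout) : *sl.timeout;
        }
        if (sl.shares) {
            effective.shares = effective.shares
                ? std::max(*effective.shares, *sl.shares) : *sl.shares;
        }
        if (sl.type != service_level::workload::unspecified) {
            effective.type = sl.type;
        }
    }
    return effective;
}
```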
Workload characterization
Ideally, we would like Scylla to:
■ For main workload:
● Have low timeout (~30-100ms)
● Load shedding (fail excessive requests immediately), because the
server cannot cause the interactive workload to slow down.
● Dedicate most of the resources to this workload.
■ For secondary workload:
● Relatively high timeout (~10-30s)
● Throttling - delaying some responses to distribute the load over time
and control it.
● Use mostly unused resources (unused by the main workload). If we have interactive workloads, they will fluctuate and not always be at their peak level, which means we have resources lying around some of the time.
Workload characterization cont.
The ability to configure the workload characteristics is already implemented in Scylla.

CREATE SERVICE LEVEL main WITH timeout = 30ms AND workload_type = interactive AND shares = 800

CREATE SERVICE LEVEL secondary WITH timeout = 30s AND workload_type = batch AND shares = 200

shares = xxxx is only available in the Enterprise version.
Workload characterization
Ideally, we would like Scylla to do: “CREATE SERVICE LEVEL XXX WITH”:

For the main workload:
  Have low timeout (30ms)                        → timeout=30ms
  Load shedding (*)                              → AND workload_type=interactive
  Dedicate most of the resources to this
  workload (80% guaranteed resources) (**)       → AND shares=800

For the secondary workload:
  Have relatively high timeout (30s)             → timeout=30s
  Throttling                                     → AND workload_type=batch
  Use mostly unused resources
  (only 20% guaranteed resources) (**)           → AND shares=200

* Implemented but still hasn’t been tested extensively.
** Enterprise only
Workload characterization future improvements
■ Overload and timeout behaviour according to workload type (i.e. shedding vs. throttling).
■ Auto tuning according to workload
characterization.
■ Workload specific metrics and alerts.
■ Workload specific behaviours:
● Bypass cache as a default for analytics
● Cache division or isolation according to prioritization
● Disallowing filtering queries.
■ More precise and elaborate configuration
options.
References
■ https://scylla.docs.scylladb.com/master/design-notes/service_levels.html
■ https://docs.scylladb.com/using-scylla/workload-prioritization/
Thank you!
Stay in touch
Eliran Sinvani
eliransin@scylladb.com
ScyllaDB 5.0
Improvements to
the OOM Resilience of
Reads on the Replica
Botond Dénes
Software Engineer
■ Working @ ScyllaDB since 2017
■ Member of the storage team
Botond Dénes
Software Engineer
The basic idea
■ The concurrency of reads on the replica is controlled
• To keep concurrency within a useful limit
• To avoid resource exhaustion, in particular: OOM
■ Implemented via a semaphore
■ Semaphore is dual limited by count and memory
■ Separate semaphores for scheduling groups
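A minimal sketch of the dual-limit idea (hypothetical names; the real implementation is a Seastar-based semaphore with a wait queue, timeouts, and permits): a read is admitted only if both a count unit and its estimated memory are available, and both are returned when the read finishes.

```cpp
#include <cstddef>

// Hypothetical semaphore dual-limited by count and memory. A read must
// obtain one count unit *and* its memory estimate before proceeding;
// otherwise it is queued (queueing/timeout logic omitted).
class reader_concurrency_semaphore {
    size_t _count;   // remaining concurrent-read slots
    size_t _memory;  // remaining memory budget, in bytes
public:
    reader_concurrency_semaphore(size_t count, size_t memory)
        : _count(count), _memory(memory) {}

    // Try to admit a read estimated to need `mem` bytes.
    bool try_admit(size_t mem) {
        if (_count == 0 || _memory < mem) {
            return false;  // caller queues the read instead
        }
        --_count;
        _memory -= mem;
        return true;
    }

    // Called when the read finishes and its buffers are freed.
    void release(size_t mem) {
        ++_count;
        _memory += mem;
    }
};
```

In the real system the admitted read holds a permit that keeps charging memory to the semaphore as I/O and reader buffers are allocated, which is exactly what the tracking work on the next slide improves.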
Recent work - much tracking, such buffers
■ Track I/O buffers as soon as they are allocated
(instead of when read completes)
■ Track buffers used for parsing sstable data
■ Track reader buffers
■ (still not 100% of all buffers are tracked)
Result: reader permit everywhere and vastly improved tracking accuracy
Recent work - addressing the usual suspects
■ Unpaged reads
■ Reverse reads
■ & variations (unpaged full scan -- true story)
Introduce a special (soft, hard) limit pair for
these reads
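To illustrate the (soft, hard) pair (a hypothetical sketch, not the actual code): crossing the soft limit only logs a warning, so the offending query can be identified, while crossing the hard limit fails the read rather than risking an OOM.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>

// Hypothetical (soft, hard) memory limit pair applied to unpaged and
// reversed reads, which can legitimately consume far more memory than
// regular paged reads.
struct special_read_limits {
    std::size_t soft;  // crossing this logs a warning
    std::size_t hard;  // crossing this fails the read
};

void check_memory(const special_read_limits& limits, std::size_t consumed) {
    if (consumed > limits.hard) {
        throw std::runtime_error("unpaged/reversed read exceeded hard memory limit");
    }
    if (consumed > limits.soft) {
        std::fprintf(stderr, "warning: read exceeded soft memory limit\n");
    }
}
```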
Recent work - semaphore in the front
[Diagram: the reader stack before and after. Before: the memtable reader and cache reader feed a combined reader, and on a cache miss a restricted reader (the semaphore) wraps a second combined reader over sstable readers 1..N, so admission happens deep in the stack. After: the semaphore sits at the front of the read, and the stack is simply the memtable reader and cache reader feeding a combined reader over sstable readers 1..N.]
Recent work - better diagnostics
■ Dump memory diagnostics on OOM
■ Dump semaphore diagnostics on queue overload/timeout
So where are we now?
■ Read-induced OOM is actually quite rare(?) now
■ Doesn’t mean it’s gone for good; there might be corner cases
Thank you!
Stay in touch
Botond Dénes
dns.botond@gmail.com
ScyllaDB 5.0
SSTable Index
Caching
Tomasz Grabiec
Distinguished Software Engineer
Tomasz Grabiec
■ Core engineer and maintainer at ScyllaDB for the past 8 years
■ Started coding when Commodore 64 was still a thing
■ Lives in Cracow, Poland
Distinguished Software Engineer
SSTable indexing - what’s new?
■ Automatic caching of SSTable indexes
■ Reads from disk got faster!
■ …especially for large partitions
SSTable indexing
SELECT … WHERE key > …
[Diagram: the index maps keys to positions in the data file.]
SSTable indexing
[Diagram: the index is made up of a partition key index and a clustering key index.]
SSTable indexing
[Diagram: the partition key index lives on disk; a summary of it lives in RAM.]
SSTable indexing
Summary:
● Always loaded in RAM
● Decided when the sstable is written to disk
● Separate file on disk
SSTable indexing
Summary:
● 1:20k ratio of summary size to data file size
● Trade-off between memory footprint and speed of reads
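Putting the structure together, a partition lookup conceptually goes: binary-search the in-RAM summary, read one partition index page from disk, then jump to the data file. A hypothetical sketch (real keys and entries are more involved than the plain tokens used here):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical entries. One summary entry covers ~20k of data file.
struct summary_entry { uint64_t first_token; uint64_t index_offset; };
struct index_entry   { uint64_t token;       uint64_t data_offset;  };

// summary_in_ram: always resident. read_index_page: one disk I/O.
// Assumes the token is covered by the summary (>= the first entry).
uint64_t find_data_offset(
        const std::vector<summary_entry>& summary_in_ram,
        uint64_t token,
        std::vector<index_entry> (*read_index_page)(uint64_t offset)) {
    // 1. Binary search the in-RAM summary for the covering index page.
    auto s = std::upper_bound(
        summary_in_ram.begin(), summary_in_ram.end(), token,
        [](uint64_t t, const summary_entry& e) { return t < e.first_token; });
    --s;  // last summary entry whose first token is <= token

    // 2. Read that index page from disk (this is the I/O the new
    //    caches in 5.0 avoid on a hot path) and search within it.
    auto page = read_index_page(s->index_offset);
    auto i = std::lower_bound(
        page.begin(), page.end(), token,
        [](const index_entry& e, uint64_t t) { return e.token < t; });

    // 3. The index entry says where to start reading the data file.
    return i->data_offset;
}
```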
SSTable indexing - problem
■ Only the summary is permanently cached
■ Reads typically* need to touch the disk while walking the index
■ Increases load on the disk
■ Adds latency
* Partition index pages are shared among concurrent readers
SSTable indexing - problem
■ Large-partition workloads experience diminishing index-cache benefit as the partition size grows
■ The average amount of I/O needed is O(log(partition_size))
SSTable indexing - new in 5.0
■ The whole index can now be cached in memory
■ Populated on access (read-through)
■ Evicted on memory pressure
■ The partition index summary is still non-evictable and always resident
SSTable indexing
■ Reads for different rows still share access to parts of the index
■ Caching the index reduces the amount of I/O for future reads
SSTable indexing - large partition example
Partition size: 10 GB, Rows: 10 M, Index file size: 5 MB
I/O for a single row read, cold cache:
■ 2x 32 KB for partition index summary page read
■ 17x 4 KB for binary search in the clustering index read
■ 2x 32 KB for data file read
TOTAL: 196 KB, 21 I/O reqs, 20ms
With a hot index cache:
TOTAL: 64 KB, 2 I/O reqs, 0.2ms
SSTable indexing - large partition example
Partition size: 10 GB, Rows: 10 M, Index file size: 5 MB
scylla-5.0 -c1 -m4G
scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 -clustering-row-count 10000000 -duration 60m
Before: 2,011 rows/s
After: 6,191 rows/s
(the node was bound by disk bandwidth, ~530 MB/s)
SSTable indexing - Index file page cache
■ Populated on index file access (read-through cache)
■ Granularity: 4 KB chunk of an index file
■ Idea similar to the page cache in Linux
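A minimal read-through page cache sketch (hypothetical, std-only; the real cache is Seastar-based and is evicted under memory pressure via the unified LRU described later): pages are keyed by their 4 KB-aligned file offset and fetched from disk only on a miss.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t page_size = 4096;

// Hypothetical read-through cache over an index file: 4 KB pages keyed
// by page-aligned offset. LRU eviction under memory pressure omitted.
class index_file_page_cache {
    std::unordered_map<uint64_t, std::vector<uint8_t>> _pages;
    std::vector<uint8_t> (*_read_from_disk)(uint64_t offset);  // one I/O
public:
    explicit index_file_page_cache(std::vector<uint8_t> (*rd)(uint64_t))
        : _read_from_disk(rd) {}

    const std::vector<uint8_t>& get_page(uint64_t file_offset) {
        uint64_t key = file_offset / page_size * page_size;
        auto it = _pages.find(key);
        if (it == _pages.end()) {   // miss: read through to disk
            it = _pages.emplace(key, _read_from_disk(key)).first;
        }
        return it->second;          // hit: served from RAM, no I/O
    }
};
```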
SSTable indexing - Index file page cache
■ The clustering key index is cached by means of the index file page cache
■ The on-disk representation is random-access, so there is no need to keep parsed entries
SSTable indexing
[Diagram: a partition key index page, the span of the partition key index covered by one summary entry.]
SSTable indexing - partition index page cache
■ Granularity: partition index summary page
■ Contains parsed index pages for fast lookup (the on-disk representation is not random-access)
■ Saves CPU time compared to having just the index file page cache
SSTable index caching
■ No tunables; caches use all available free space
■ Multiple caches compete for space:
● row cache
● sstable partition index page cache
● sstable index file page cache
■ Fair eviction, single LRU for all caches
● E.g. no reads from disk => row cache uses all the free space
● Not optimal for all workloads
Thank you!
Stay in touch
Tomasz Grabiec
@tgrabiec
tgrabiec@scylladb.com
ScyllaDB 5.0
Improved Reversed
Queries
Kamil Braun
Software Engineer
Kamil Braun
■ Software engineer working on Scylla
■ Passionate about distributed systems, functional programming,
and formal methods in software development
■ Graduated from the University of Warsaw with an MSc in Computer Science and a BSc in Mathematics
Software Engineer
What are reversed queries?
CREATE TABLE ks.t (
pk int,
ck int,
v int,
PRIMARY KEY (pk, ck)
) WITH CLUSTERING ORDER BY (ck ASC)
Reversed query:
SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck DESC;
CREATE TABLE ks.t (
pk int,
ck int,
v int,
PRIMARY KEY (pk, ck)
) WITH CLUSTERING ORDER BY (ck ASC)
SELECT * FROM ks.t WHERE pk = 0
(or SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck ASC):
pk | ck | v
----+----+---
0 | 0 | 0
0 | 1 | 2
0 | 2 | 3
CREATE TABLE ks.t (
pk int,
ck int,
v int,
PRIMARY KEY (pk, ck)
) WITH CLUSTERING ORDER BY (ck ASC)
SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck DESC:
pk | ck | v
----+----+---
0 | 2 | 3
0 | 1 | 2
0 | 0 | 0
Before
Query range: [6, 16]
E.g.: SELECT * from ks.t WHERE pk = 0 AND ck >= 6 AND ck <= 16;

[Diagram: the old implementation reads the whole range [6, 16] from the sstable into memory, reverses it, and returns only the first page (rows 16..13); reading and reversing the rest of the range is wasted work. For the second page, the remaining range [6, 12] is read into memory again, reversed, and rows 12..9 are returned, repeating the wasted work.]
Problem 1: quadratic complexity
N pages: N + (N-1) + (N-2) + … + 1 = O(N^2) pages read
Problem 2: huge memory consumption
To read a single page from the range, we need to fetch the entire range into
memory.
It may not even fit in memory, causing the read to fail.
After
Query range: [6, 16]

[Diagram: the new implementation asks the sstable index where the tail of the range lives (“I know where 14 is”) and starts reading there instead of at the beginning. It reads forward to the end of the range (rows 14, 15, 16), then walks backwards: each row’s metadata records the previous row’s size, so the reader can compute where row 13 starts, then row 12, and so on. Once a page worth of rows (16..12) is in memory, the page is reversed and returned; nothing outside the page is read.]
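A sketch of that backwards walk (hypothetical structures; the real code parses the mc sstable format, whose row metadata is what makes this possible): knowing where one row starts and how large the previous row is lets the reader hop row by row toward the start of the range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row header: the mc format records enough metadata to
// recover where the previous row begins.
struct row_header {
    uint64_t previous_row_size;  // bytes of the row preceding this one
};

// Walk backwards from a known row position, collecting up to
// `page_rows` row offsets in reverse clustering order.
std::vector<uint64_t> collect_reversed_page(
        uint64_t start_pos,                         // last row in the range
        std::size_t page_rows,
        row_header (*read_header)(uint64_t pos)) {  // small read per row
    std::vector<uint64_t> offsets;
    uint64_t pos = start_pos;
    while (offsets.size() < page_rows) {
        offsets.push_back(pos);
        row_header h = read_header(pos);
        if (h.previous_row_size == 0) {
            break;  // reached the first row of the partition
        }
        pos -= h.previous_row_size;  // hop to the previous row's start
    }
    return offsets;  // already in descending clustering order
}
```

This is why both the time (one visit per returned row) and the memory (one page) are linear in the page, not in the queried range.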
New implementation:
■ Linear complexity
■ Memory consumption is O(page size)
Reversed reads from memtables were also improved.
Caveat: reversed queries are allowed for single-partition queries only.
Comparison
■ Querying different partitions of sizes 10MB, 15MB, …, 110MB, 115MB
■ Forward and reversed queries
■ Scylla 4.5 branch vs master branch (as of 30.12.2021)
Schema: pk int, ck int, v text, primary key (pk, ck)
Query:
SELECT * FROM ks.t WHERE pk = ?
{ORDER BY ck DESC}
BYPASS CACHE
{LIMIT 1000}
Summary
■ Reversed queries in Scylla <= 4.5:
● time complexity quadratic w.r.t. the size of the queried range
● memory consumption linear w.r.t. the size of the queried range
■ The mc sstable format allows a better implementation
■ Reversed queries in the upcoming release:
● time complexity linear w.r.t. the size of the queried range
● memory consumption linear w.r.t. the page size
Thank you!
Stay in touch
Kamil Braun
kbraun@scylladb.com

More Related Content

What's hot

Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephScyllaDB
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with ScyllaScyllaDB
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2ScyllaDB
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Getting the Scylla Shard-Aware Drivers Faster
Getting the Scylla Shard-Aware Drivers FasterGetting the Scylla Shard-Aware Drivers Faster
Getting the Scylla Shard-Aware Drivers FasterScyllaDB
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centersMariaDB plc
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraScyllaDB
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 

What's hot (20)

Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
How to be Successful with Scylla
How to be Successful with ScyllaHow to be Successful with Scylla
How to be Successful with Scylla
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2Scylla Summit 2022: Scylla 5.0 New Features, Part 2
Scylla Summit 2022: Scylla 5.0 New Features, Part 2
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Getting the Scylla Shard-Aware Drivers Faster
Getting the Scylla Shard-Aware Drivers FasterGetting the Scylla Shard-Aware Drivers Faster
Getting the Scylla Shard-Aware Drivers Faster
 
Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centers
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache Cassandra
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 

Similar to Scylla Summit 2022: Scylla 5.0 New Features, Part 1

How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintScyllaDB
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity PlanningMongoDB
 
How to Meet Your P99 Goal While Overcommitting Another Workload
How to Meet Your P99 Goal While Overcommitting Another WorkloadHow to Meet Your P99 Goal While Overcommitting Another Workload
How to Meet Your P99 Goal While Overcommitting Another WorkloadScyllaDB
 
adap-stability-202310.pptx
adap-stability-202310.pptxadap-stability-202310.pptx
adap-stability-202310.pptxMichael Ming Lei
 
Functional? Reactive? Why?
Functional? Reactive? Why?Functional? Reactive? Why?
Functional? Reactive? Why?Aleksandr Tavgen
 
Sql server tips from the field
Sql server tips from the fieldSql server tips from the field
Sql server tips from the fieldJoAnna Cheshire
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP PerformanceBIOVIA
 
Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013Ivan Sanders
 
Large Scale SQL Considerations for SharePoint Deployments
Large Scale SQL Considerations for SharePoint DeploymentsLarge Scale SQL Considerations for SharePoint Deployments
Large Scale SQL Considerations for SharePoint DeploymentsJoel Oleson
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Handling Massive Writes
Handling Massive WritesHandling Massive Writes
Handling Massive WritesLiran Zelkha
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...BI Brainz
 
How_To_Soup_Up_Your_Farm
How_To_Soup_Up_Your_FarmHow_To_Soup_Up_Your_Farm
How_To_Soup_Up_Your_FarmNigel Price
 
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Continuent
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward
 
Towards User-Defined SLA in Cloud Flash Storage.pptx
Towards User-Defined SLA in Cloud Flash Storage.pptxTowards User-Defined SLA in Cloud Flash Storage.pptx
Towards User-Defined SLA in Cloud Flash Storage.pptxPo-Chuan Chen
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftJie Li
 

Similar to Scylla Summit 2022: Scylla 5.0 New Features, Part 1 (20)

How Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter FootprintHow Workload Prioritization Reduces Your Datacenter Footprint
How Workload Prioritization Reduces Your Datacenter Footprint
 
Capacity Planning
Capacity PlanningCapacity Planning
Capacity Planning
 
MongoDB
MongoDBMongoDB
MongoDB
 
How to Meet Your P99 Goal While Overcommitting Another Workload
How to Meet Your P99 Goal While Overcommitting Another WorkloadHow to Meet Your P99 Goal While Overcommitting Another Workload
How to Meet Your P99 Goal While Overcommitting Another Workload
 
Performance tuning in sql server
Performance tuning in sql serverPerformance tuning in sql server
Performance tuning in sql server
 
adap-stability-202310.pptx
adap-stability-202310.pptxadap-stability-202310.pptx
adap-stability-202310.pptx
 
Functional? Reactive? Why?
Functional? Reactive? Why?Functional? Reactive? Why?
Functional? Reactive? Why?
 
Sql server tips from the field
Sql server tips from the fieldSql server tips from the field
Sql server tips from the field
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
 
Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013
 
Large Scale SQL Considerations for SharePoint Deployments
Large Scale SQL Considerations for SharePoint DeploymentsLarge Scale SQL Considerations for SharePoint Deployments
Large Scale SQL Considerations for SharePoint Deployments
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Handling Massive Writes
Handling Massive WritesHandling Massive Writes
Handling Massive Writes
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...
Analysing and Troubleshooting Performance Issues in SAP BusinessObjects BI Re...
 
How_To_Soup_Up_Your_Farm
How_To_Soup_Up_Your_FarmHow_To_Soup_Up_Your_Farm
How_To_Soup_Up_Your_Farm
 
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
Webinar Slides: High Noon at AWS — Amazon RDS vs. Tungsten Clustering with My...
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
 
Towards User-Defined SLA in Cloud Flash Storage.pptx
Towards User-Defined SLA in Cloud Flash Storage.pptxTowards User-Defined SLA in Cloud Flash Storage.pptx
Towards User-Defined SLA in Cloud Flash Storage.pptx
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
 

More from ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Recently uploaded

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Recently uploaded (20)

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Scylla Summit 2022: Scylla 5.0 New Features, Part 1

  • 1. Scylla 5.0 New Features, Part 1 Avi Kivity, CTO; Eliran Sinvani, Software Team Leader; Botond Dénes, Software Engineer; Tomasz Grabiec, Distinguished Software Engineer; Kamil Braun, Software Engineer
  • 2. I/O Scheduling In ScyllaDB 5.0 Avi Kivity CTO
  • 3. Avi Kivity ■ Original maintainer of Linux KVM - Kernel-based Virtual Machine ■ Co-maintainer of Seastar, ScyllaDB ■ Co-founder of ScyllaDB CTO
  • 4. A database is a balancing act… ■ Your reads ■ Compaction ■ Repair/bootstrap/decommission Why I/O Scheduling?
  • 5. The spice bytes must flow Read Queue Compaction Queue Maintenance Queue Scheduler Disk
  • 7. The new I/O Scheduler ■ Collect information about disks ■ Build a more accurate mathematical disk model ■ Embody the model into the I/O scheduler
  • 8. Thank you! Stay in touch Avi Kivity @AviKivity avi@scylladb.com
  • 10. Eliran Sinvani ■ Core SW Team Leader at ScyllaDB for the past 3 years. ■ BSP and Embedded SW Team Leader at Airspan Networks for over a year. ■ 3 years as Embedded SW engineer in the Cellular industry (both UE and BS sides). Software Team Leader
  • 11. Dealing With Different Workloads As the number of use cases supported by Scylla gets bigger consistently, we sometimes encounter conflicting requirements for different types of workloads. ■ Parallelism ■ Mean latency or P99 ■ Users priorities (different SLA requirements)
  • 12. Workload Prioritization The key for dealing with some differentiating aspects of workloads is provided in our enterprise version and is called workload prioritization it provides several benefits already: ■ Resource isolation of workloads (CPU and Memory) ■ Prioritization of workloads ■ Can balance to some extent between OLAP and OLTP workloads
  • 13. OLTP and OLAP latencies with workload prioritization enabled.
  • 14. Problem: Workload Isolation Is Not Enough :( ■ We have solved (to some extent) the problem of cross workload impact on resource allocation. ■ But: ● Expressing the requirements is lacking: ● A quantitative description - we characterize workload by shares, the more shares a workload gets relative to others the more important it is and more resources it will get. ● A lot of real world requirements can’t be expressed ● Relative description somewhat breaks the isolation concept - if there is an isolation, what do I care about relation between workloads? (at least on some aspects) ● Some of Scylla’s configuration options and behaviours are global ● Timeouts ● Parallelism limitation (Botond’s talk: Improvements to the OOM resilience of reads on the replica side)
  • 15. Prioritization and isolation is simply not enough
  • 16. Example: A web server database with analytics Scenario: 1. Main workload: we would like to present to a user some information in response to a click on a webpage. 2. Secondary workload: Periodically we would like to run some DB wide analytics.
  • 17. Example: A web server database with analytics cont. Main workload: 1. Need to have at worst (timeout) tens to hundreds of ms latency or the page will appear unresponsive for some users. 2. Has high concurrency as requests are independent.
  • 18. Example: A web server database with analytics cont. Secondary workload: 1. Needs to have as much throughput as possible 2. Has bounded and controllable concurrency. (since it is originated at the same client/logic)
  • 19. Example: A web server database with analytics cont. The timeout dilemma: 1. We will need to set some timeout for the server side. This timeout should follow: 2. For the main workload this can’t be too high, since it will cause the interactive user either to retry (click again and again) or to drop the request, both of which will be a waste of resources or even worse, can be experienced as unbounded concurrency by the server. 3. Can’t be too low as the analytics requests will fail, since achieving high throughput will normally increase latency since the queues are full.
  • 20. Example: A web server database with analytics cont. Overload response: 1. Interactive client (main workload) can’t be throttled since the requests are unrelated, delaying response to some application user A will not cause some other user B to delay or stop sending requests (Unbounded concurrency) 2. In order to control the batch workload better we would like to throttle since this will allow us to have a knob that controls the pace of the analytics workload. (Bounded concurrency)
  • 21. Examples Of Workload Characteristics ■ Latency distribution (some approximate desired histogram) ■ Timeout ■ Throughput / Latency orientation (ie. OLTP vs OLAP) ■ Expected parallelism ■ Burstiness
  • 22. Benefits of Workload Characterization ■ Better cluster utilization, increased ability to serve multi workload scenarios ● Side effect: serving on the same cluster means less administrative overhead. ■ Resource usage efficiency and more correct resource distribution ■ Better overload handling ■ More accurate metrics (i.e the per workload timeout metric is reliable and no need to check latency distribution to calculate timeouts). ■ Smarter alerting when requirements can’t be met ■ Better isolation capabilities (not necessarily relative) ■ Can serve as a base for elasticity application (i.e Grow to meet requirements)
  • 23. The Service Level Mechanism We have already implemented some capabilities using our service levels mechanism which we have imported from our enterprise version and extended to support more detailed workload configuration. ■ Service Level - • Contains a workload characteristic (can be partial) • Can be attached to a role ■ A connection workload characteristics are determined by merging all service levels attached to the authenticated role and its parent roles.
  • 24. Workload characterization Ideally we would like scylla to do: ■ For main workload: ● Have low timeout (~30-100ms) ● Load shedding (fail excessive requests immediately), because the server cannot cause the interactive workload to slow down. ● Dedicate most of the resources to this workload. ■ For secondary workload: ● Relatively high timeout (~10-30s) ● Throttling - delaying some responses to distribute the load over time and control it. ● Use mostly unused resources (unused by the main workload), if we have interactive workloads, it means that those will fluctuate and not always be at their peak level, this means we have resources lying around some of the time.
  • 25. Workload characterization cont. The ability to configure the workload characteristics is already implemented in Scylla. CREATE SERVICE LEVEL main WITH timeout = 30ms AND workload_type = interactive AND shares = 800; CREATE SERVICE LEVEL secondary WITH timeout = 30s AND workload_type = batch AND shares = 200; shares = xxx is only available in the Enterprise version.
  • 26. Workload characterization Ideally, we would like Scylla to do ("CREATE SERVICE LEVEL XXX WITH ..."): ■ For the main workload: have a low timeout (30ms) -> timeout = 30ms; load shedding (*) -> AND workload_type = interactive; dedicate most of the resources to this workload (80% guaranteed resources) (**) -> AND shares = 800. ■ For the secondary workload: have a relatively high timeout (30s) -> timeout = 30s; throttling -> AND workload_type = batch; use mostly unused resources (only 20% guaranteed resources) (**) -> AND shares = 200. * Implemented but not yet extensively tested. ** Enterprise only
  • 27. Workload characterization future improvements ■ Overload and timeout behaviour according to workload type (i.e. shedding vs. throttling). ■ Auto-tuning according to workload characterization. ■ Workload-specific metrics and alerts. ■ Workload-specific behaviours: ● Bypass cache as a default for analytics ● Cache division or isolation according to prioritization ● Disallowing filtering queries. ■ More precise and elaborate configuration options.
  • 29. Thank you! Stay in touch Eliran Sinvani eliransin@scylladb.com
  • 30. ScyllaDB 5.0 Improvements to the OOM Resilience of Reads on the Replica Botond Dénes Software Engineer
  • 31. ■ Working @ ScyllaDB since 2017 ■ Member of the storage team Botond Dénes Software Engineer
  • 32. The basic idea ■ The concurrency of reads on the replica is controlled • To keep concurrency within a useful limit • To avoid resource exhaustion, in particular: OOM ■ Implemented via a semaphore ■ Semaphore is dual limited by count and memory ■ Separate semaphores for scheduling groups
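To make the accounting concrete, here is a minimal single-threaded C++ sketch of such a dual-limited semaphore. This is an illustration only, not Scylla's actual reader concurrency semaphore (the real one is asynchronous, queues waiters, and exists per scheduling group); the 100 counts and 2% of shard memory follow the typical values mentioned in the talk, while the admission and buffer sizes are made up for the example.

#include <cstdint>
#include <iostream>

// Toy dual-limited semaphore: a read is admitted only if both a count
// slot and enough memory units are available.
class read_semaphore {
    int64_t _count;
    int64_t _memory;
public:
    read_semaphore(int64_t count, int64_t memory) : _count(count), _memory(memory) {}

    // Admission consumes 1 count and a fixed initial amount of memory.
    bool try_admit(int64_t initial_memory) {
        if (_count < 1 || _memory < initial_memory) {
            return false; // in Scylla the read would queue here (and may time out)
        }
        _count -= 1;
        _memory -= initial_memory;
        return true;
    }

    // As the read progresses, every buffer it allocates is charged here --
    // the "track buffers from the moment they are allocated" improvement.
    bool try_consume(int64_t bytes) {
        if (_memory < bytes) return false;
        _memory -= bytes;
        return true;
    }

    void release(int64_t bytes) {
        _count += 1;
        _memory += bytes;
    }
};

int main() {
    const int64_t shard_memory = 4LL << 30;      // a 4 GB shard
    read_semaphore sem(100, shard_memory / 50);  // 100 counts, 2% of shard memory
    const int64_t admission_cost = 128 * 1024;   // hypothetical fixed admission charge
    if (sem.try_admit(admission_cost)) {
        sem.try_consume(1 << 20);                // e.g. a 1 MiB I/O buffer
        sem.release(admission_cost + (1 << 20)); // read finished: return all units
        std::cout << "read admitted and completed\n";
    }
}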
  • 33. Recent work - much tracking, such buffers ■ Track I/O buffers as soon as they are allocated (instead of when the read completes) ■ Track buffers used for parsing sstable data ■ Track reader buffers ■ (still not 100% of all buffers are tracked) Result: reader permit everywhere and vastly improved tracking accuracy
  • 34. Recent work - addressing the usual suspects ■ Unpaged reads ■ Reverse reads ■ & variations (unpaged full scan -- true story) Introduce a special (soft, hard) limit pair for these reads
  • 35. Recent work - semaphore in the front [diagram: two reader trees. Before: a memtable reader and a cache reader feed a combined reader, and on a cache miss the disk is read through a restricted reader wrapping a combined reader over sstable readers 1..N - i.e. the semaphore only guards the disk readers. After: the same tree without the restricted reader - the semaphore now sits in front, before any readers are created]
  • 36. Recent work - better diagnostics ■ Dump memory diagnostics on OOM ■ Dump semaphore diagnostics on queue overload/timeout
  • 37. So where are we now? ■ Read-induced OOM is actually quite rare(?) now ■ Doesn't mean it's gone for good; there might be corner cases
  • 38. Thank you! Stay in touch Botond Dénes dns.botond@gmail.com
  • 39. ScyllaDB 5.0 SSTable Index Caching Tomasz Grabiec Distinguished Software Engineer
  • 40. Tomasz Grabiec ■ Core engineer and maintainer at ScyllaDB for the past 8 years ■ Started coding when Commodore 64 was still a thing ■ Lives in Cracow, Poland Distinguished Software Engineer
  • 41. SSTable indexing - what’s new? ■ Automatic caching of SSTable indexes ■ Reads from disk got faster! ■ …especially for large partitions
  • 42. SSTable indexing [diagram: a SELECT … WHERE key > … query uses the index to narrow down the matching location in the data file]
  • 43. SSTable indexing [diagram: the index as a search tree - the top levels form the partition key index, the bottom levels the clustering key index]
  • 44. SSTable indexing [diagram: the partition key index is split into two levels, the top one being the summary]
  • 45. SSTable indexing [diagram: the summary resides in RAM, the rest of the partition key index on disk] Summary: ● Always loaded in RAM ● Decided when the sstable is written to disk ● Separate file on disk
  • 46. SSTable indexing [diagram: the summary in RAM, the partition key index on disk] ● 1:20k ratio of summary size to data file size ● Trade-off between memory footprint and speed of reads
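For a sense of scale, the 1:20k ratio keeps the always-resident summary small even for big sstables; a worked example (not a figure from the talk): summary_size ≈ data_file_size / 20,000, so a 1 TB data file needs only about 1 TB / 20,000 ≈ 50 MB of summary in RAM.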
  • 47. SSTable indexing - problem ■ Only the summary is permanently cached ■ Reads typically* need to touch the disk while walking the index ■ Increases load on the disk ■ Adds latency * Partition index pages are shared among concurrent readers
  • 48. SSTable indexing - problem ■ Large partition workloads experience diminishing index caching benefit as the partition size grows ■ The average amount of I/O needed is O(log(partition_size))
  • 49. SSTable indexing - new in 5.0 ■ The whole index can now be cached in memory ■ Populated on access (read-through) ■ Evicted on memory pressure ■ The partition index summary is still non-evictable and always resident
  • 50. SSTable indexing ■ Reads for different rows still share access to parts of the index ■ Caching the index reduces the amount of I/O for future reads
  • 51. SSTable indexing - large partition example Partition size: 10 GB, Rows: 10 M, Index file size: 5 MB I/O for a single row read, cold cache: ■ 2x 32 KB for partition index summary page read ■ 17x 4 KB for binary search in the clustering index read ■ 2x 32 KB for data file read TOTAL: 196 KB, 21 I/O reqs, 20ms
  • 52. SSTable indexing - large partition example Partition size: 10 GB, Rows: 10 M, Index file size: 5 MB I/O for a single row read: ■ 2x 32 KB for partition index summary page read ■ 17x 4 KB for binary search in the clustering index ■ 2x 32 KB for data file read TOTAL (cold cache): 196 KB, 21 I/O reqs, 20 ms TOTAL (hot index cache): 64 KB, 2 I/O reqs, 0.2 ms
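The slide's totals are easy to verify: cold cache, 2x32 KB + 17x4 KB + 2x32 KB = 64 + 68 + 64 = 196 KB over 2 + 17 + 2 = 21 requests; with the index fully cached only the two 32 KB data-file reads remain, i.e. 2x32 KB = 64 KB over 2 requests.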
  • 53. SSTable indexing - large partition example Partition size: 10 GB, Rows: 10 M, Index file size: 5 MB scylla-5.0 -c1 -m4G scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 -clustering-row-count 10000000 -duration 60m Before: 2,011 rows/s After: 6,191 rows/s (the node was bound by disk bandwidth, ~530 MB/s)
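Note that on a bandwidth-bound disk the speedup lines up with the per-read byte savings worked out above: 196 KB / 64 KB ≈ 3.06, which matches 6,191 / 2,011 ≈ 3.08.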
  • 54. SSTable indexing - Index file page cache ■ Populated on index file access (read-through cache) ■ Granularity: 4 KB chunk of an index file ■ Idea similar to the page cache in Linux [diagram: 4K chunks of the on-disk index file cached in RAM]
  • 55. SSTable indexing - Index file page cache ■ The clustering key index is cached by means of the index file page cache ■ The on-disk representation is random-access, so there is no need to keep parsed entries
  • 56. SSTable indexing [diagram: the summary and the partition key index, with a single partition key index page highlighted]
  • 57. SSTable indexing - partition index page cache ■ Granularity: partition index summary page ■ Contains parsed index pages for fast lookup (the on-disk representation is not random-access) ■ Saves CPU time compared to having just the index file page cache
  • 58. ■ No tunables, caches use all available free space ■ Multiple caches compete for space: ● row cache ● sstable partition index page cache ● sstable index file page cache ■ Fair eviction, single LRU for all caches ● E.g. no reads from disk => row cache uses all the free space ● Not optimal for all workloads SSTable index caching
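The "single LRU shared by all caches" idea can be sketched as follows - a toy C++ model with hypothetical names, not Scylla's actual cache code: whichever cache has not been touched recently gives up its entries first, so an idle cache naturally shrinks while a busy one grows.

#include <cstddef>
#include <iostream>
#include <list>
#include <string>

struct entry {
    std::string owner; // e.g. "row cache" or "sstable index page cache"
    size_t bytes;
};

class unified_lru {
    std::list<entry> _lru; // front = most recently used
    size_t _used = 0;
    size_t _limit;
public:
    explicit unified_lru(size_t limit) : _limit(limit) {}

    void insert(entry e) {
        _used += e.bytes;
        _lru.push_front(std::move(e));
        // Fair eviction: the least recently used entry goes,
        // no matter which cache it belongs to.
        while (_used > _limit && !_lru.empty()) {
            const entry& victim = _lru.back();
            std::cout << "evicting " << victim.bytes << " B from " << victim.owner << '\n';
            _used -= victim.bytes;
            _lru.pop_back();
        }
    }
};

int main() {
    unified_lru cache(1024); // total free space shared by all caches
    // Index pages fill the cache first...
    for (int i = 0; i < 8; ++i) cache.insert({"sstable index page cache", 128});
    // ...then a burst of row-cache activity evicts the coldest index pages.
    for (int i = 0; i < 4; ++i) cache.insert({"row cache", 128});
}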
  • 59. Thank you! Stay in touch Tomasz Grabiec @tgrabiec tgrabiec@scylladb.com
  • 61. Kamil Braun ■ Software engineer working on Scylla ■ Passionate about distributed systems, functional programming, and formal methods in software development ■ Graduated from the University of Warsaw with an MSc in Computer Science and a BSc in Mathematics Software Engineer
  • 62. What are reversed queries? CREATE TABLE ks.t ( pk int, ck int, v int, PRIMARY KEY (pk, ck) ) WITH CLUSTERING ORDER BY (ck ASC) Reversed query: SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck DESC;
  • 63. CREATE TABLE ks.t ( pk int, ck int, v int, PRIMARY KEY (pk, ck) ) WITH CLUSTERING ORDER BY (ck ASC) SELECT * FROM ks.t WHERE pk = 0 (or SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck ASC):
     pk | ck | v
    ----+----+---
      0 |  0 | 0
      0 |  1 | 2
      0 |  2 | 3
  • 64. CREATE TABLE ks.t ( pk int, ck int, v int, PRIMARY KEY (pk, ck) ) WITH CLUSTERING ORDER BY (ck ASC) SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck DESC:
     pk | ck | v
    ----+----+---
      0 |  2 | 3
      0 |  1 | 2
      0 |  0 | 0
  • 66. Query range: [6, 16] E.g.: SELECT * FROM ks.t WHERE pk = 0 AND ck >= 6 AND ck <= 16; [diagram: the sstable on disk with the range 6..16 marked; memory still empty]
  • 67. Query range: [6, 16] [diagram: the whole range 6..16 is read from the sstable into memory]
  • 68. Query range: [6, 16] [diagram: the in-memory range is reversed and the first page, rows 13..16, is returned]
  • 69. Query range: [6, 16] [diagram: same as above, with the rows fetched but not returned (6..12) marked as wasted work]
  • 70. Query range: [6, 12] [diagram: for the second page, the remaining range 6..12 is read into memory again, reversed, and rows 9..12 are returned - the rest is again wasted work]
  • 71. Problem 1: quadratic complexity [diagram: the re-read ranges shrinking page by page]
  • 72. Problem 1: quadratic complexity N pages: N + (N-1) + (N-2) + … + 1 = O(N^2) pages read
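Worked out, this is the arithmetic series N + (N-1) + … + 1 = N(N+1)/2, i.e. on the order of N² page reads to serve N pages; for example, 10 pages of data cost 10 + 9 + … + 1 = 55 page reads.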
  • 73. Problem 2: huge memory consumption To read a single page from the range, we need to fetch the entire range into memory. It may not even fit in memory, causing the read to fail.
  • 74. After
  • 75. Query range: [6, 16] [diagram: the index is consulted - it knows where row 14 starts, the nearest known position before the range's upper bound 16]
  • 76. Query range: [6, 16] [diagram: a chunk of data starting at row 14's position is read into memory]
  • 77. Query range: [6, 16] [diagram: parsing forward from row 14 yields the positions of rows 15 and 16]
  • 78. Query range: [6, 16] [diagram: data beyond the end of the queried range is discarded]
  • 79. Query range: [6, 16] [diagram: another chunk, ending just before row 14, is read into memory]
  • 80. Query range: [6, 16] [diagram: problem - where does row 13 start within this buffer?]
  • 81. Query range: [6, 16] [diagram: row metadata to the rescue - each row stores the previous row's size, which pinpoints row 13]
  • 82. Query range: [6, 16] [diagram: chaining through the previous-row sizes reaches row 12]
  • 83. Query range: [6, 16] [diagram: the collected rows are reversed and the page 16..12 is returned]
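The backward walk can be sketched in a few lines of C++. The record layout here ([u32 prev_size][u32 key][payload]) is invented purely for illustration - real mc sstables encode far more - but it shows the essential trick: the stored previous-row size lets the reader chain backwards from a position supplied by the index.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Append one record: [u32 prev_size][u32 key][payload]; returns its size.
static uint32_t append_record(std::vector<uint8_t>& buf, uint32_t prev_size,
                              uint32_t key, const std::string& payload) {
    size_t start = buf.size();
    auto put_u32 = [&](uint32_t v) {
        uint8_t b[4];
        std::memcpy(b, &v, 4);
        buf.insert(buf.end(), b, b + 4);
    };
    put_u32(prev_size);
    put_u32(key);
    buf.insert(buf.end(), payload.begin(), payload.end());
    return static_cast<uint32_t>(buf.size() - start);
}

static uint32_t read_u32(const std::vector<uint8_t>& buf, size_t pos) {
    uint32_t v;
    std::memcpy(&v, buf.data() + pos, 4);
    return v;
}

int main() {
    std::vector<uint8_t> buf;
    uint32_t prev = 0;
    size_t last_start = 0;
    // Rows 12..16 with variable-size payloads: without prev_size there
    // would be no way to step backwards through them.
    for (uint32_t key = 12; key <= 16; ++key) {
        last_start = buf.size();
        prev = append_record(buf, prev, key, std::string(key, 'x'));
    }
    // Reverse read: start at the last row (in a real read the index
    // supplies this position) and chain through the prev_size fields.
    size_t pos = last_start;
    for (;;) {
        uint32_t prev_size = read_u32(buf, pos);
        std::cout << "row ck=" << read_u32(buf, pos + 4) << '\n'; // 16, 15, ..., 12
        if (prev_size == 0) break; // reached the first row
        pos -= prev_size;
    }
}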
  • 84. New implementation: ■ Linear complexity ■ Memory consumption is O(page size)
  • 85. New implementation: ■ Linear complexity ■ Memory consumption is O(page size) Reversed reads from memtables were also improved.
  • 86. New implementation: ■ Linear complexity ■ Memory consumption is O(page size) Reversed reads from memtables were also improved. Caveat: reversed queries are allowed for single-partition queries only.
  • 87. Comparison ■ Querying different partitions of sizes 10MB, 15MB, …, 110MB ■ Forward and reversed queries ■ Scylla 4.5 branch vs. master branch (as of 30.12.2021)
  • 88. Comparison Schema: pk int, ck int, v text, primary key (pk, ck)
  • 89. Comparison Schema: pk int, ck int, v text, primary key (pk, ck) Query: SELECT * FROM ks.t WHERE pk = ? {ORDER BY ck DESC} {LIMIT 1000} BYPASS CACHE
  • 90. [graph: query duration vs. partition size with LIMIT 1000 - forward queries are flat (~3 ms) on both versions; on 4.5 reversed queries grow linearly with partition size and start failing around 100 MB, while on master they are flat, close to forward queries]
  • 91. [graph: query duration vs. partition size without LIMIT - forward queries grow linearly; reversed queries are quadratic on 4.5 and linear on master]
  • 92. Summary ■ Reversed queries in Scylla <= 4.5: ● time complexity quadratic w.r.t. the size of the queried range ● memory consumption linear w.r.t. the size of the queried range ■ The mc sstable format allows a better implementation ■ Reversed queries in the upcoming release: ● time complexity linear w.r.t. the size of the queried range ● memory consumption linear w.r.t. the page size
  • 93. Thank you! Stay in touch Kamil Braun kbraun@scylladb.com

Editor's Notes

  1. Hi, My name is Eliran, a SW Team Leader at ScyllaDB, and I am here to share some of the results of an activity we had in the past year aiming at reducing overload impact on Scylla.
  2. Overload is something that may happen even on a correctly configured cluster, and the main concern is not preventing it but keeping serving requests despite the overload, producing performance results that are as predictable as possible. When thinking about it, once we eliminate all of overload's obviously undesired effects like crashes, stalls and disconnections, we still have to deal with a simple fact of life: the cluster, at the moment, can't fulfill every request according to the user's expectations.
  3. Of course there is a whole class of solutions that works right out of the box, some of them presented in the past, the most popular being "throw more money at it"; a derivative of that approach is keeping a worst-case-sized DC for every workload. In the future we might have elasticity to deal with this at runtime, but even when we do, it is not guaranteed to solve all of our problems. We came to realize two things: In the real world, where we have limited amounts of money and processing power, overload is a fact of life, so it is better to optimize behaviour under overload than to aim for an overload-free environment. Sometimes the point at which this overload hurts the user can be stretched; moreover, there are some workloads that are not really hurt by overload but only affected by it. Workloads can have different properties depending on the client SW or user profile that drives them: they can have different kinds of parallelism (small vs. large, correlated vs. uncorrelated, etc.), latency distribution expectations, and some more subjective properties like priorities, which are more about user perspective.
  4. We have already introduced a means of dealing with at least the last example: workload prioritization. At its core, this feature aims to let the user specify the importance of workloads relative to each other; this in turn reduces the impact workloads have on each other by distributing some of the system's resources according to the user's preferences.
  5. This alone has the potential to balance between some workloads and mitigate some cross workload impact (like shown in this graph taken from our blogpost), it shows how workload prioritization can be used to mitigate the impact of analytics on the latency of interactive workload.
  6. Unfortunately, in our latest efforts we discovered that it is not enough; this relative way of expressing workload priorities is not expressive enough on its own. For starters, different workloads have different and explicit expectations about latencies, which means that mitigation is not enough. There are some configurations that quite apparently can't hold globally in multi-workload environments, timeout being one of the easiest to reason about; there are others and we will touch on one more later. In addition, some of the very advanced techniques that we use today can be further tweaked to accommodate this difference in workloads, our reader concurrency semaphore being one of them. (I will not get into the details of this semaphore here, but you are welcome to attend Botond's talk about it.)
  7. With this understanding in mind we started to try and figure out what is missing and how we can improve our behaviour under overload. We very quickly came to realise that before improving the behaviour, we should make sure that it is actually an improvement, meaning that, given that we are overloaded, the newly implemented behaviour is the one the user expects. After going over some real-life examples from our bugs and issues, we found that Scylla needs some hints about what the expected behaviour for a specific workload is, and so workload characterization was created.
  8. The following example is a classic use case where we are lacking the information to deal with overload. We want to support a simple webserver application with two workloads: the main workload consists of queries triggered by the user clicking or navigating to some areas of the website; the second workload is some analytics being run periodically, to collect some statistics or to aggregate some information to be presented to all users.
  9. The users behind the main workload expect high responsiveness, which translates to low latency, and it means they will have a short timeout. Another thing to notice is that you can't prevent users from just clicking over and over again because of what appears to them as the page being stuck. Failing to set a low enough timeout on the server side can also trigger a whole retry-avalanche effect that would appear on the server side as very high or unbounded concurrency.
  10. On the other hand we have the secondary workload, which makes a series of computations and can be, and probably is, designed with limited concurrency, which means that it can be controlled with methods like throttling. This workload is a lot less sensitive to latency and is more throughput oriented.
  11. As I mentioned, for the main workload it is more suitable to have a very small timeout, and for the secondary workload we need a large timeout to accommodate always-full queues. Even when the main interactive workload's timeout is configured, there is still the interactive user, who can't be configured; if they click over and over again there is little you can do about it except have an accounting mechanism in another layer of the system, which means more development effort. However, since there is only one server-side timeout configuration, and it should be less than the client one (or we can have a retry avalanche in the extreme case and wasted resources in the less extreme case), we can't optimize for both, and whatever choice we make will be suboptimal for one of the workloads.
  12. We also need to decide on a proper response to overload: for an interactive workload it is probably beneficial to fail early by shedding load (if we see that our in-flight requests are going to time out or are starting to pile up), while for analytics we should delay some responses (or even wait for a timeout to naturally happen on the client side), since this serves as a backpressure mechanism.
  13. There are a lot of useful characteristics that can hint Scylla about the expected behavior, not all of which are implemented.
  14. The webserver example demonstrates that there is some information about workloads that can help Scylla to behave better, and to stretch the overload limit further. It can help us utilize our cluster better and help us to reduce the administrative effort while providing us valuable metrics (such as timeout per workload) and better isolation capabilities. It is also beneficial to characterise workloads in order to size the cluster correctly and in the future it can also help us to employ elasticity in a smarter way.
  15. A way to express those workload properties has already existed in our enterprise version for a while, and it is now extended and backported to our OSS version as well. This concept is called a service level; service levels can be attached to roles. When a user logs into the system, all of the service levels that are attached to the user and their granted roles are combined to form the workload characteristics. In turn, Scylla tweaks its behaviour for requests that are sent in this session (which is now tagged with specific workload characteristics).
  16. Utilizing this in our webserver example: For the main workload, we need low timeouts with load shedding as our overload response, and we would like to have a lot of dedicated resources available whenever this workload needs them. For the secondary workload, we can have pretty large timeouts to accommodate always-full queues, we would like to throttle requests under load so the computation is stable and controllable, and finally we would like this workload to have very few dedicated resources, using mostly unused resources to achieve better cluster utilization.
  17. The aforementioned requirements can already be expressed in Scylla as shown here.
  18. This breakdown of the commands demonstrates how we would express our expectations for each workload, it is already fully implemented.
  19. Workload characterization is still work in progress, the service level mechanism gives us a way to easily add more advanced configuration options in the future. There are still a lot of future improvements that can be implemented on a per workload basis, but according to what we have learnt, per workload characterization is one of the cornerstones in utilizing the cluster in full on one hand and doing the right thing in the presence of overload on the other.
  20. Hi everyone. In this presentation I'm going to go over our recent improvements to the out-of-memory resilience of reads on the replica.
  21. My name is Botond and I'm a software engineer working at ScyllaDB since 2017, as a member of the storage team.
  22. We want to control the concurrency of reads on the replica, with the goal of keeping concurrency within a useful limit and to avoid resource exhaustion. This happens via a semaphore, which is dual-limited with count and memory resources. Each read consumes 1 count and fixed amount of memory on admission. As the read progresses its memory usage is tracked and is consumed from the semaphore's memory units. We have a separate semaphore for each scheduling group on each shard and semaphores are created with a fixed amount of count resources and an amount of memory that is some percentage of the shard's total amount of memory. 2% of the shard’s memory is a typical value. As for count, the user read semaphore has 100 counts and internal semaphores have 10 counts.
  23. We depend on the accuracy of the tracking of the memory consumption of reads to be able to determine whether a new read can be admitted or not. It is crucial that we track the memory consumption of as many aspects of reads as practical. This is an area that we've improved a lot recently: * The I/O buffer tracking had a bug where buffers were tracked only after the I/O completed, not when they were allocated. * We now track all buffers used in I/O and parsing, from the moment they are allocated. * We also track the internal buffers of readers. These improvements vastly improved the effectiveness of our memory-based concurrency control. There are still aspects of reads that are untracked, but extending the coverage has diminishing returns. For example, fixing the tracking of I/O buffers was a 3-line change and brought huge improvements, while tracking the buffers of readers was a lot of work and in some workloads the effect is not even noticeable.
  24. We have also addressed the most common causes of out-of-memory conditions directly. These are unpaged and reverse reads, or their combination for extra effect, both consuming unbounded amounts of memory internally. We have recently fixed this aspect of reverse reads -- but we have a separate presentation on that. We introduced a soft and hard limit pair for these kinds of reads: the replica will print a warning when the read's memory consumption reaches the soft limit and abort the read when it reaches the hard limit.
  25. Historically we only applied concurrency control to reads at the moment they had to go to disk. In these pictures you can see a reader tree for a typical read. On the top level we have a memtable reader and a cache reader, their output being merged by a combined reader. The cache represents the content of the disk, so on a cache miss the cache creates a disk reader and the read from disk happens through the cache, populating the cache with the read data in the process. We might read from more than one sstable when reading from the disk, in which case we again use a combined reader to merge their contents into a single stream. As you can see, the concurrency control, represented here by the "restricted reader", is injected between the cache and the disk and therefore is only activated when the disk readers are created. The dotted purple rectangle represents the readers covered by concurrency control. I suppose the assumption here was that in-memory reads complete very quickly and therefore don't need concurrency control. This assumption was proven false when we started seeing out-of-memory conditions caused by cache reads, so we had to "move" the semaphore to the very front. Reads now have to pass admission before the reader objects are even created.
  26. Despite all the improvements just discussed, problems can still happen. Previously, debugging out-of-memory and concurrency-semaphore-related bugs was only possible with coredumps. Coredumps are a pain to work with. First of all, a coredump has to be available, which is not always the case when memory runs out; oftentimes the only symptom is std::bad_alloc error messages being spammed in the logs. To help with investigating out-of-memory and concurrency semaphore related bugs we have improved the diagnostics around these. When memory runs out, scylla will dump a report about the state of its memory allocator. By default this only happens when critical allocations fail -- those that will eventually cause a crash -- but this can be configured to be dumped on any bad_alloc. We added a similar report dump to the concurrency semaphore: this is dumped when the semaphore times out or its queue overflows. These reports are dumped to the logs, which are easily obtainable, and can be used to kick-start the investigation. In some cases the report itself might be enough to pinpoint the bug on its own.
  27. With all the discussed improvements we are in quite a good place now. Out-of-memory crashes are actually quite rare now. We are also currently working on improving other weak spots that we know about. There might still be corner cases hiding, so we are not letting our guard down.
  28. Let's do a quick recap on sstable indexing in Scylla. The index is a data structure used during a query to narrow down the data file location according to the query restrictions. The sstable index has a complex representation on disk, but it can be conceptualized as a search tree, and that's how I will be depicting it for simplicity.
  29. The top of the tree corresponds to the partition key index. The bottom corresponds to the clustering key index. Each partition has its own clustering key index, which can vary in depth between partitions; it's shown as equal depth here for simplicity.
  30. The partition key index is actually divided into two levels, with the top level called the summary.
  31. The summary is stored in a separate file on disk and is always present in RAM. We inherited this from Cassandra, and the reason for the split as far as I know was to keep part of the partition index in RAM to speed up reads. The size of the summary is limited so that it fits in RAM
  32. The summary is stored in a separate file on disk and is always present in RAM. We inherited this from Cassandra, and the reason for the split as far as I know was to keep part of the partition index in RAM to speed up reads. The size of the summary is limited so that it fits in RAM
  33. Almost all, except: Keys known to be outside of the data file based on summary information Partition index pages are shared among concurrent readers
  34. The index has a hierarchy.
  35. The index has a hierarchy.
  36. Hi. Recently Scylla gained a new implementation of reversed queries, which I’m going to talk about in this presentation.
  37. My name is Kamil, I’m a software engineer at Scylla.
  38. A reversed query is a query in which the specified clustering key ordering is different from the ordering specified in the schema. There are two possible orderings: ascending and descending. If your schema was created with ascending order, a query with descending order is a reversed query, and vice-versa.
  39. For example, suppose your schema specifies ascending order. In your query, if you don’t specify the order or explicitly specify it to be ascending, you get a regular “forward” query and your data will be sorted according to the clustering key in ascending order.
  40. If you specify the order to be descending, this is a reversed query and your data will be sorted according to the clustering key in descending order.
  41. The old implementation of reversed queries had significant problems. I’ll illustrate it using sstable reads.
  42. Suppose we want to query a range of clustering keys: from 6 to 16.
  43. The old implementation would start by performing a regular forward query on this range, fetching it entirely into memory.
  44. It would then iterate over rows in the range in reverse and construct a page, then return it.
  45. All those rows fetched from the queried range which don’t belong to the first page are wasted work. We throw them away.
  46. When a second page is requested, we do the same thing, but with a smaller query range.
  47. This solution has a quadratic complexity problem. Suppose there are 10 pages of data. We would: Read pages 1 to 10 into memory, return page 10 Read pages 1 to 9 into memory, return page 9 … Read page 1 into memory, return page 1
  48. For N pages, we read N + (N-1) + (N-2) + … + 1 = O(N^2) pages! This could be improved by caching the result of the initial query after the first page, and getting back to it when the second page is requested. Unfortunately, we don’t know how long we’re going to wait for the second page, and in the meantime, the cache may be cleared or invalidated. And we may not even want to cache the result in the first place due to the second problem…
  49. Which is that the entire range may consume huge amounts of memory. Just to return that single page we need to fetch the entire queried range. It may not even fit in memory, causing the read to fail. With sufficiently large partitions and sufficiently large queries, this old implementation would simply not work.
  50. The new implementation solves these two problems. Both sstable and memtable reversed reads were improved. For illustration I’ll focus on sstables since that’s where most of the data resides and, in my opinion, where the most interesting changes happened.
  51. Back to the example. The queried range of clustering keys is 6 to 16. To return the first page, we first need to find the last row in the queried range, in this example with key 16. We consult the index to find the nearest row before 16 that it knows of. Let’s say that the index knows where 14 is.
  52. We fetch a chunk of data into memory, starting at the position given by the index.
  53. At this point it looks just like a forward query. We parse row 14 which gives us the position of row 15 - which we didn’t know until now because the sizes of rows are not constant in general. Now we can parse row 15. Then row 16.
  54. If there is any remaining data in the fetched buffer we can discard it, as we only care about the queried range.
  55. Now suppose we need more rows for our page. We fetch a chunk of data into memory, but this time before row 14.
  56. And now we face a problem: where does row 13 start in this buffer?
  57. Thankfully, the sstable mc format comes to the rescue. In the mc format (and newer ones), every row stores the size of the previous row. Which we use to learn the position of row 13 so we can parse it.
  58. We can continue like this, fetching more buffers if necessary, until we fill the page.
  59. Finally we can drop any unnecessary data and return the page reversed. Note: this is a simplified picture, but it gives the rough idea of how things work today when reading in reverse from mc sstables and newer formats. An important part of the implementation is the previous row size metadata which was not available in ka/la sstable formats. If your data is stored in these older formats, we keep using the old method.
  60. The new implementation features linear complexity, And much better memory consumption, proportional to the page size; we no longer need to fetch the entire queried range into memory at once.
  61. Reversed reads from memtables previously worked similarly to sstables: we would perform a forward read of the entire queried range, then return the actual requested page. Now we perform a direct reverse traversal of the memtable structure.
  62. Note that reversed queries are not allowed for range scans, only for single-partition queries.
  63. Now let’s look at some numbers. For the purposes of this presentation I did a very simple benchmark to see how this new implementation performs compared to the old one. I used a simple single-node setup on my laptop. Didn’t set up a larger cluster since most of the changes happened on the storage layer - in the code where we read sstables and memtables. I created partitions of different sizes: 10MB, 15MB, 20MB, and so on up to 110MB, and queried them forward and backward. I did this using the OS 4.5 branch and master branch.
  64. The schema had 2 integer columns for partition and clustering key, and one text column so I could insert larger rows; the only reason for this was to reduce the time to insert the data.
  65. For each partition I'm running these queries: with and without the ORDER BY clause, with BYPASS CACHE so we exercise our sstables, and in two versions: with and without a row limit. The row limit version makes the result fit in a single page (the specific number of rows is not that important). The no-limit version gives us the entire partition. I performed each query 10 times, took the mean and standard deviation, and plotted error bars. By the way, I'm using the Python driver, which is not the fastest of drivers, so it may cause a bit of overhead.
  66. On the X axis we see the partition size in MB, on the Y axis the query duration in milliseconds. This is a graph for the row-limit select, so we fetch only a single page. As we can see, forward query duration does not depend on the partition size; it's roughly constant, here about 3 ms. The results are pretty much the same on master and 4.5. But for reversed queries there's a significant difference. On 4.5 the duration of a reversed query increases linearly with the partition size, even though we're always fetching the same number of rows, and at around 100 MB reversed queries start to fail because there is a memory consumption limit which we exceed. On master, however, reversed queries behave similarly to forward queries: they are a bit worse, here the duration is about 3 ms, but it does not depend on the partition size.
  67. If we drop the row limit, this is what we get. The duration of forward queries increases linearly with the partition size, which is expected since we’re querying the whole partition. On 4.5, the duration of a reversed query is a quadratic function of the partition size, but on master, it’s again linear, as with forward queries.
  68. Summarizing: in Scylla 4.5 and older, reversed queries had quadratic time complexity w.r.t the size of the queried range, memory consumption was linear, for sufficiently large ranges the query would have to fail. Mc and newer sstable formats allow a better implementation. In upcoming release the time complexity of reversed queries is linear w.r.t the size of the queried range, and memory consumption is linear w.r.t page size, so even if your range is large, you can still perform reversed queries on it.