This document discusses the attributes of a high-performance, low-latency database like ScyllaDB. It begins with introductions and an overview of ScyllaDB. It then summarizes how hardware has evolved over 20 years with more cores, memory, and faster disks. ScyllaDB was redesigned from first principles to take advantage of modern hardware, using an asynchronous, shared-nothing architecture with one shard per core. This allows it to achieve significantly higher performance than Cassandra. The document shows benchmark results demonstrating ScyllaDB's lower latencies and ability to scale to higher throughput. It also discusses how ScyllaDB uses workload prioritization to manage different types of workloads.
3. Agenda
+ About ScyllaDB
+ 20 years of hardware evolution in 5 minutes
+ Scylla - Designed for performance
+ Results
+ Workload Prioritization
+ Summary
4. About ScyllaDB
+ The Real-Time Big Data Database
+ Fully compatible with Apache Cassandra and Amazon DynamoDB
+ 10X the performance & low tail latency
+ Open Source, Enterprise, and Cloud options
+ Founded by the creators of the KVM hypervisor
+ HQs: Palo Alto, CA, USA; Herzliya, Israel; Warsaw, Poland
7. Why Scylla?
+ Available On-Prem, Cloud Hosted, or as Scylla Cloud
+ Best High Availability in the industry
+ Best Disaster Recovery in the industry
+ Best Scalability in the industry
+ Best Performance in the industry
+ Auto-tune — out-of-the-box performance
+ Fully compatible with Cassandra & DynamoDB
+ The power of Cassandra at the speed of Redis, and more
17. Sharding/partitioning
+ Common concept in distributed databases
+ Break the system into N non-interacting parts
+ Usually done by hash(partition_key) % N
+ Data/load may be unbalanced
+ Fact of life in distributed databases 🤷
+ Logical mapping of data shards to core shards
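The hash-based sharding on this slide can be sketched in a few lines. This is a hypothetical helper, not Scylla's implementation (Scylla maps murmur3 token ranges to shards); the function name and use of MD5 are assumptions for illustration:

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Hypothetical helper: hash(partition_key) % N, as on the slide.
    (Scylla actually maps murmur3 token ranges to per-core shards.)"""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user-{i}" for i in range(10_000)]
counts = [0] * 8
for key in keys:
    counts[shard_for(key, 8)] += 1
print(counts)  # roughly, but not perfectly, balanced across the 8 shards
```

The per-shard counts come out close to, but not exactly, 10,000 / 8 — the "data/load may be unbalanced" fact of life noted above.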
19. Seastar
+ Open source framework, powering Scylla, Ceph,
Redpanda, ValuStor and more
+ A “mini operating system in userspace”
+ Task scheduler, I/O scheduler
+ Fully asynchronous - userspace coroutines
+ Direct I/O (bypasses the kernel page cache)
+ The app must implement its own caching
+ One thread per core, one shard per core
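The "fully asynchronous, cooperative" model can be illustrated with a toy sketch, assuming Python's asyncio loop stands in for Seastar's per-core reactor (the task names are illustrative, not Seastar APIs): a long-running job yields at explicit preemption points, so short latency-sensitive tasks run between its chunks instead of waiting for it to finish.

```python
import asyncio

async def compaction(chunks: int, log: list):
    """Long-running job that yields after each bounded chunk of work."""
    for i in range(chunks):
        log.append(f"compact-{i}")
        await asyncio.sleep(0)  # cooperative preemption point

async def query(qid: int, log: list):
    """Short, latency-sensitive task."""
    log.append(f"query-{qid}")

async def shard_main():
    log = []
    await asyncio.gather(compaction(3, log), query(1, log), query(2, log))
    return log

log = asyncio.run(shard_main())
print(log)  # queries run between compaction chunks, not after all of them
```

Because everything on a shard is cooperative, there are no locks: a task owns the core until it yields, which is the "(*) cooperative-preemption model" caveat on the next slide.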
20. Shard per Core
[Diagram: Traditional stack vs. SeaStar's sharded stack]
+ Traditional stack (Cassandra): kernel-scheduled threads, kernel TCP/IP, shared NIC queues; suffers from memory and lock contention, cache contention, and is NUMA unfriendly.
+ SeaStar's sharded stack (Scylla): one task scheduler, one set of queues, and one database shard per core; smp queues pass messages between shards; userspace TCP/IP over DPDK with its own NIC queue, so the kernel isn't involved. No contention (*), linear scaling, NUMA friendly.
(*) cooperative-preemption model within a shard
23. Thou shalt not block
[Diagram: Query, Commitlog, and Compaction each feed their own userspace queue into the I/O scheduler, which submits requests to the disk. Only up to the maximum useful disk concurrency is queued in the FS/device; everything beyond that waits in the userspace queues, so there are no uncontrolled kernel-side queues.]
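The idea can be sketched as a toy shard-local I/O scheduler (class names and MAX_CONCURRENCY are assumptions, not Scylla's implementation): keep at most the useful disk concurrency in flight, and hold everything else in userspace queues where the scheduler can still decide what goes next.

```python
from collections import deque

MAX_CONCURRENCY = 4  # "max useful disk concurrency" (assumed value)

class IOScheduler:
    """Toy shard-local I/O scheduler: classes queue in userspace,
    and only MAX_CONCURRENCY requests are ever in flight at the disk."""

    def __init__(self):
        self.in_flight = 0
        self.queues = {"query": deque(), "commitlog": deque(), "compaction": deque()}
        self.order = ["query", "commitlog", "compaction"]  # queries are latency-sensitive

    def submit(self, klass, req):
        self.queues[klass].append(req)
        self._dispatch()

    def complete(self):
        """Called when the disk finishes a request."""
        self.in_flight -= 1
        self._dispatch()

    def _dispatch(self):
        while self.in_flight < MAX_CONCURRENCY:
            klass = next((k for k in self.order if self.queues[k]), None)
            if klass is None:
                return
            self.queues[klass].popleft()  # request goes to the device
            self.in_flight += 1

sched = IOScheduler()
for i in range(6):
    sched.submit("compaction", i)   # 4 go to the disk, 2 wait in userspace
sched.submit("query", "q1")         # waits where the scheduler can see it
sched.complete()                    # one finishes: the query is sent next
print(sched.in_flight, len(sched.queues["query"]), len(sched.queues["compaction"]))
```

Because the compaction backlog waits in userspace rather than in the device, the query jumps ahead as soon as a slot frees up — exactly what would be impossible if everything had been pushed into kernel queues.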
25. Shard-aware I/O scheduler(s)
+ Each shard has independent scheduler
+ Capacity groups per NUMA zone
+ Shards grab capacity leases
Minimal, low-cost coordination between shards!
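The capacity-lease idea might be sketched like this, assuming a shared token pool per NUMA zone (names and numbers are made up, and a lock stands in for the atomic operations a real implementation would use). Shards grab capacity in batches, so cross-shard coordination happens once per lease rather than once per request:

```python
import threading

class CapacityGroup:
    """Shared token pool for all shards in one NUMA zone (hypothetical
    sketch; a lock stands in for atomics in a real implementation)."""

    def __init__(self, total_tokens: int):
        self.tokens = total_tokens
        self.lock = threading.Lock()

    def grab_lease(self, want: int) -> int:
        """Grant up to `want` tokens in one batch; 0 means the shard waits."""
        with self.lock:
            granted = min(want, self.tokens)
            self.tokens -= granted
            return granted

    def release(self, tokens: int):
        with self.lock:
            self.tokens += tokens

group = CapacityGroup(total_tokens=100)
lease_a = group.grab_lease(64)  # shard A grabs a batch of capacity
lease_b = group.grab_lease(64)  # shard B gets only what remains
print(lease_a, lease_b, group.tokens)
```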
27. The controllers - memtable
[Graphs: with the controller allocating ~50% of CPU to flushing — the percentage needed to keep the memtable buffers at a stable level — both throughput and total system CPU usage barely oscillate.]
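The controller can be thought of as a simple feedback loop: a toy sketch, assuming a made-up setpoint and gain rather than Scylla's actual tuning. When the dirty-buffer level is above target, flushing gets more CPU share; once the level is on target, the share stops changing, which is why the graphs barely oscillate.

```python
TARGET_DIRTY = 0.5  # setpoint: keep memtable buffers half full (assumed)
GAIN = 1.0          # feedback gain (assumed)

def next_share(dirty_fraction: float, current_share: float) -> float:
    """More dirty data than target -> give flushing more CPU; less -> back off."""
    error = dirty_fraction - TARGET_DIRTY
    return min(1.0, max(0.0, current_share + GAIN * error))

share = 0.2
for dirty in (0.9, 0.7, 0.5, 0.5):
    share = next_share(dirty, share)
print(round(share, 2))  # the share settles once the buffer level is on target
```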
38. Workload Prioritization: Different types of loads
■ OLTP
● Small work items
● Latency sensitive
● Involves a narrow portion of the data
■ OLAP
● Large work items
● Throughput oriented
● Performed on large amounts of data
39. Scheduler Basics - operation highlights
+ Shares are really all there is to it :)
+ Schedulers maintain fairness by optimizing ratios, not absolute throughput.
+ Schedulers only kick in when there is contention for the resource.
+ Schedulers are dynamic - you can change the number of shares in real time.
+ This limits the impact of one share-holder on another.
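The shares model can be sketched as a virtual-runtime scheduler (a toy model, not Seastar's actual scheduler): each group accumulates cost inversely proportional to its shares, and the group furthest behind runs next. This optimizes the ratio of work between groups, not their absolute throughput.

```python
import heapq

def schedule(groups: dict, rounds: int) -> dict:
    """groups maps name -> shares; returns units of work done per group."""
    SCALE = 1_000_000                       # integer virtual time, avoids float drift
    heap = [(0, name) for name in groups]   # (accumulated cost, name)
    heapq.heapify(heap)
    done = {name: 0 for name in groups}
    for _ in range(rounds):
        cost, name = heapq.heappop(heap)    # group furthest behind runs next
        done[name] += 1                     # run one unit of work
        heapq.heappush(heap, (cost + SCALE // groups[name], name))
    return done

# with 1000 vs 500 shares, the OLTP group gets twice the work under contention
result = schedule({"oltp": 1000, "olap": 500}, rounds=300)
print(result)
```

If only one group has pending work, it simply gets every round — matching the point above that schedulers only kick in when there is contention.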
41. How does it work?
[Diagram: internal workloads - memtable flush, compaction, query, repair, and commitlog - are queued into the Seastar scheduler, which multiplexes them onto NET, CPU, and SSD. Backlog and memory monitors adjust each workload's priority.]
42. How does it work? Workload Prioritization!
[Diagram: as before, but a service-level controller now also adjusts priorities in the Seastar scheduler, so user-defined service levels steer how NET, CPU, and SSD are shared.]
45. Configuring Workload Prioritization
1. Make users that generate the same workload part of the same group.
● Priorities are attached to groups or individual users.
2. Create a service level for the workload and set its shares:
● Shares determine the importance of the service level.
● Importance is always relative to other service levels.
3. Attach the service level to the group of users.
● This grants the shares to the group of users.
● From that point, the workload prioritization mechanism treats their requests according to their priorities.
46. Managing Workload Prioritization using CQL
1. Make users that generate the same workload part of the same group:
● CREATE ROLE super_high_priority;
● GRANT super_high_priority TO special_user;
2. Create a service level for the workload and set its shares:
● CREATE SERVICE_LEVEL 'important_load' WITH SHARES=1000;
3. Attach the service level to the group of users:
● ATTACH SERVICE_LEVEL 'important_load' TO 'super_high_priority';
48. Summary
+ Designed and built for modern hardware
+ Fully asynchronous, shared-nothing, shard-per-core architecture
+ Superior throughput and consistently low latency
+ Internal schedulers exposed to the user as Workload Prioritization