Under The Hood Of A Shard-Per-Core Database Architecture

Tzach Livyatan, VP Product, ScyllaDB

Brought to you by
P99 Conf: All Things Performance
Virtual event | October 19 + 20
The event for developers who care about P99 percentiles and high-performance, low-latency applications.
Register at p99conf.io

Poll
Where are you in your NoSQL adoption?

Tzach Livyatan
VP of Product, ScyllaDB
+ Leads the product team at ScyllaDB
+ Appreciates distributed system testing
+ Lives in Tel Aviv, father of two

Agenda
+ What is ScyllaDB?
+ How did we get here? 5m history lesson
+ ScyllaDB Design Decisions
+ Shard Per Core
+ IO Scheduler revisit
+ Benchmark a Petabyte Cluster
+ Q&A

The Database Built for Gamechangers
+ Distributed NoSQL database for OLTP workloads
+ Founded by designers of the KVM hypervisor: Avi Kivity and Dor Laor
+ Resolves challenges of legacy NoSQL databases
  + >5x higher throughput
  + >20x lower latency
  + >75% TCO savings
+ DBaaS/Cloud, Enterprise and Open Source solutions
+ Compatible with Apache Cassandra and AWS DynamoDB
“ScyllaDB stands apart... It’s the rare product that exceeds my expectations.”
– Martin Heller, InfoWorld contributing editor and reviewer
“For 99.9% of applications, ScyllaDB delivers all the power a customer will ever need, on workloads that other databases can’t touch – and at a fraction of the cost of an in-memory solution.”
– Adrian Bridgewater, Forbes senior contributor

+400 Gamechangers Leverage ScyllaDB
+ Seamless experiences across content + devices
+ Fast computation of flight pricing
+ Corporate fleet management
+ Real-time analytics
+ 2,000,000 SKU e-commerce management
+ Real-time location tracking for friends/family
+ Video recommendation management
+ IoT for industrial machines
+ Synchronize browser properties for millions
+ Threat intelligence service using JanusGraph
+ Real-time fraud detection across 6M transactions/day
+ Uber scale, mission critical chat & messaging app
+ Network security threat detection
+ Power ~50M X1 DVRs with billions of reqs/day
+ Precision healthcare via Edison AI
+ Inventory hub for retail operations
+ Property listings and updates
+ Unified ML feature store across the business
+ Cryptocurrency exchange app
+ Geography-based recommendations
+ Distributed storage for distributed ledger tech
+ Global operations: Avon, Body Shop + more
+ Predictable performance for on-sale surges
+ GPS-based exercise tracking

ScyllaDB Architecture
Active/active, replicated, auto-sharded

Tunable, Eventual Consistency
Active/active, replicated, auto-sharded
[Diagram: App clients issuing requests with CL = Local Quorum and CL = One]
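The consistency levels in the diagram can be related to replica counts with a little arithmetic. The following is a minimal sketch, assuming the usual Cassandra-style rule that a quorum is a majority of the replicas. The function names and the replication factors are hypothetical and are not taken from the talk.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(consistency_level: str, rf_per_dc: dict, local_dc: str) -> int:
    """Replica acknowledgements needed before the coordinator answers the app."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "LOCAL_QUORUM":
        # Only replicas in the coordinator's own datacenter are counted.
        return quorum(rf_per_dc[local_dc])
    if consistency_level == "QUORUM":
        # Majority across all replicas in all datacenters.
        return quorum(sum(rf_per_dc.values()))
    raise ValueError(f"unsupported consistency level: {consistency_level}")


# Hypothetical cluster: two datacenters, three replicas in each.
rf = {"dc1": 3, "dc2": 3}
for cl in ("ONE", "LOCAL_QUORUM", "QUORUM"):
    print(cl, required_acks(cl, rf, local_dc="dc1"))
```

A lower consistency level waits for fewer replicas, which is the latency versus consistency trade-off that "tunable" refers to.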

Scylla Architecture
Source: https://code.kiwi.com/nonstop-operations-with-scylla-even-through-the-ovhcloud-fire-4c0191e43ba1

A Brief History of Humankind
modern HW and DB

Non-Uniform Memory Access (NUMA)

What happened?
+ Per thread performance plateaued
+ Cores: 1 ⟶ 256, NUMA
+ RAM: 2GB ⟶ 2TB
+ Disk space: 10GB ⟶ 10TB
+ Disk seek time: 10-20ms ⟶ 20µs
+ Network throughput: 1Gbps ⟶ 100Gbps
This year: 64/128 cores/threads/CPU, 400Gbps NIC, disk 10µs latency, 1.5TB/device, DDR5 2TB/DIMM
AWS u-24tb1.metal: 224 cores, 448 threads, 24TB RAM

A Brief History of Databases
+ 1970s, mainframes: inception of the relational model
+ 1980s: SQL, relational databases become de-facto standard
+ 1990s, LAN age: replication, external caching, ORMs
+ 2000s, Web 2.0: NoSQL databases for scale
+ 2010s, cloud age: commoditization of NoSQL, NewSQL inception

Scylla Design Decisions
1 C++ instead of Java
2 All Things Async
3 Shard per Core
4 Unified Cache
5 I/O Scheduler
6 Autonomous
High-level Goals
● Efficiency
● Utilization
● Control


Threads vs. Shards

Unified Cache: Legacy NoSQL vs. Scylla
+ Legacy NoSQL: key cache, row cache (on-heap / off-heap), Linux page cache, SSTables. Complex tuning
+ Scylla: unified cache, SSTables

Page fault path (App thread, Kernel, SSD):
Page fault ⟶ Suspend thread ⟶ Initiate I/O ⟶ Context switch ⟶ I/O completes ⟶ Interrupt ⟶ Context switch ⟶ Map page ⟶ Resume thread

I/O Scheduler
[Diagram: Query, Commitlog and Compaction each feed a queue in the userspace I/O scheduler, which sends the disk only its max useful disk concurrency. With the scheduler: no queues in the FS/device. Without it: I/O queued in FS/device]
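The idea in the diagram, keeping requests queued in userspace and handing the disk no more than its useful concurrency, can be sketched as follows. This is an illustrative toy under stated assumptions, not Seastar's implementation: the class name `UserspaceIOQueue`, the plain round-robin pick across classes, and the concurrency value of 4 are all hypothetical, and priorities between classes are left out.

```python
from collections import deque


class UserspaceIOQueue:
    """Per-class queues in userspace with a cap on requests sent to the disk."""

    def __init__(self, class_names, max_disk_concurrency):
        self.queues = {name: deque() for name in class_names}
        self.max_disk_concurrency = max_disk_concurrency
        self.in_flight = 0  # requests currently owned by the disk

    def submit(self, class_name, request):
        self.queues[class_name].append(request)

    def dispatch(self):
        """Round-robin over the classes until the concurrency cap is reached."""
        sent = []
        progressed = True
        while self.in_flight < self.max_disk_concurrency and progressed:
            progressed = False
            for name, queue in self.queues.items():
                if queue and self.in_flight < self.max_disk_concurrency:
                    sent.append((name, queue.popleft()))
                    self.in_flight += 1
                    progressed = True
        return sent

    def complete(self, n=1):
        """The disk finished n requests, so n slots are free again."""
        self.in_flight = max(0, self.in_flight - n)

    def queued(self):
        """Requests still waiting in userspace, where they can be reordered."""
        return sum(len(q) for q in self.queues.values())


io = UserspaceIOQueue(["query", "commitlog", "compaction"], max_disk_concurrency=4)
for i in range(3):
    io.submit("query", f"q{i}")
    io.submit("commitlog", f"l{i}")
    io.submit("compaction", f"c{i}")
first_batch = io.dispatch()
print(first_batch, "still queued in userspace:", io.queued())
```

Because the backlog stays in userspace, the scheduler still decides which request goes next when a slot frees up; once a request is queued in the filesystem or device, that choice is gone.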

Scylla Design Decisions: Autonomous
[Diagram: the Seastar scheduler shares CPU, SSD and WAN among Memtable, Compaction, Query, Repair and Commitlog; the Compaction Backlog Monitor and the Memory Monitor adjust priority]

Shard Per Core
Share nothing, block nothing

Sharding/Partitioning
+ Common concept in distributed databases
+ Break the system into N non-interacting parts
+ Usually done by hash(partition_key) % N
+ Data/load may be unbalanced
+ Fact of life in distributed databases 🤷
+ Logical mapping of data shards to core shards
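The hash(partition_key) % N rule, and the way it can leave load unbalanced, can be illustrated with a small sketch. MD5 is used here only as a stable stand-in hash (an assumption for the example, not the database's partitioner), and the workload with one hot key is hypothetical.

```python
import hashlib
from collections import Counter


def shard_of(partition_key: str, n_shards: int) -> int:
    """Map a partition key to a shard with hash(partition_key) % N."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def load_per_shard(requests, n_shards: int) -> Counter:
    """Count how many requests land on each shard."""
    return Counter(shard_of(key, n_shards) for key in requests)


# Hypothetical workload: 1,000 distinct users with one request each,
# plus a single hot user that issues 500 requests.
requests = [f"user-{i}" for i in range(1000)] + ["user-hot"] * 500
load = load_per_shard(requests, n_shards=8)
for shard in range(8):
    print(shard, load[shard])
```

The distinct keys spread roughly evenly, but every request for the hot key lands on the same shard, which is the imbalance the slide calls a fact of life.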

Sharding All The Way Down
Node ID
Shard ID
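"Sharding all the way down" means a key is resolved first to a node and then to a shard (core) inside that node. Below is a simplified sketch of such a two-level lookup. The token ring with one token per node, the MD5 stand-in hash, and the `token % shards_per_node` rule are assumptions made for illustration; the slide does not spell out ScyllaDB's actual token-to-shard function.

```python
import bisect
import hashlib


def token_of(partition_key: str) -> int:
    """64-bit unsigned token derived from the key (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_tokens: dict, shards_per_node: int):
        self.entries = sorted((tok, node) for node, tok in node_tokens.items())
        self.shards_per_node = shards_per_node

    def node_of(self, token: int) -> str:
        tokens = [tok for tok, _ in self.entries]
        index = bisect.bisect_left(tokens, token) % len(self.entries)  # wrap around
        return self.entries[index][1]

    def shard_of(self, token: int) -> int:
        # Second level: pick the core inside the node.
        return token % self.shards_per_node

    def locate(self, partition_key: str):
        token = token_of(partition_key)
        return self.node_of(token), self.shard_of(token)


ring = Ring({"node-a": 2**62, "node-b": 2**63, "node-c": 2**64 - 1}, shards_per_node=8)
node, shard = ring.locate("user-42")
print("Node ID:", node, "Shard ID:", shard)
```

A client that can compute both levels can send a request straight to the core that owns the data, so no extra hop inside the node is needed.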

Shard per Core
Traditional Stack (Cassandra) vs. Seastar’s Sharded Stack
+ Traditional stack: threads share memory and run on top of the kernel’s TCP/IP stack, scheduler and NIC queues
  + Lock contention
  + Cache contention
  + NUMA unfriendly
+ Seastar’s sharded stack: each core runs the database on its own userspace TCP/IP stack and task scheduler, with its own queues, smp queue and NIC queue (DPDK; the kernel isn’t involved)
  + No contention
  + Linear scaling
  + NUMA friendly
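The sharded stack replaces locks on shared memory with message passing between cores. A toy, single-process sketch of that pattern is shown below: each shard owns its data and an inbox that stands in for the smp queue, and work for a key is shipped to the owning shard. All names (`Shard`, `Node`, `poll`) and the byte-sum key mapping are hypothetical, and real shards would each run on their own core rather than being polled in a loop.

```python
from collections import deque


class Shard:
    """One shard: private data plus an inbox standing in for the smp queue."""

    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.data = {}        # touched only by this shard, so no locks
        self.inbox = deque()  # tasks sent by other shards

    def run_once(self) -> None:
        """Drain the inbox; every task runs on the shard that owns the data."""
        while self.inbox:
            task = self.inbox.popleft()
            task(self)


class Node:
    def __init__(self, n_shards: int):
        self.shards = [Shard(i) for i in range(n_shards)]

    def owner(self, key: str) -> Shard:
        # Deterministic stand-in for the real key-to-shard mapping.
        return self.shards[sum(key.encode("utf-8")) % len(self.shards)]

    def write(self, key: str, value: str) -> None:
        """Ship the write to the owning shard instead of locking shared state."""
        self.owner(key).inbox.append(lambda shard: shard.data.__setitem__(key, value))

    def poll(self) -> None:
        """Stand-in for every core's event loop taking one turn."""
        for shard in self.shards:
            shard.run_once()


node = Node(n_shards=4)
node.write("a", "1")
node.write("b", "2")
node.poll()
print([sorted(shard.data) for shard in node.shards])
```

Since only the owner ever touches a shard's data, there is nothing to lock and no cache line bouncing between cores, which is where the "no contention" and "linear scaling" labels come from.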

Seastar
+ Open source framework, powering ScyllaDB, Redpanda, ValuStor, Ceph, RageDB and more
+ A “mini operating system in userspace”
+ Task scheduler, I/O scheduler
+ Fully asynchronous - userspace coroutines
+ Direct I/O, self managed cache (bypass pagecache)
+ One thread per core, one shard per core

Why Scheduling At All
+ Different components compete for limited resources (Reads, Writes, Admin)
+ They have different priorities
+ They have no idea how not to over-consume the resource

How Does It Work?
+ Flush sched-group
+ Compaction sched-group
+ Query sched-group
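One common way to divide a core between such groups is proportional sharing: always run the group that has used the least CPU time relative to its shares. The sketch below simulates that policy. It is an illustration under assumptions, not Seastar's scheduler: the share values, the per-task costs, and the premise that every group always has work to do are all hypothetical.

```python
class SchedGroup:
    def __init__(self, name: str, shares: int):
        self.name = name
        self.shares = shares
        self.cpu_time = 0.0  # raw CPU time consumed, in milliseconds
        self.vruntime = 0.0  # CPU time divided by shares


def run(groups, task_cost_ms: dict, total_ms: float) -> None:
    """Repeatedly run one task from the group with the lowest vruntime."""
    clock = 0.0
    while clock < total_ms:
        group = min(groups, key=lambda g: g.vruntime)
        cost = task_cost_ms[group.name]  # how long this group's task runs
        group.cpu_time += cost
        group.vruntime += cost / group.shares
        clock += cost


# Hypothetical shares: queries get most of the CPU, background work gets the rest.
groups = [SchedGroup("query", 1000), SchedGroup("compaction", 200), SchedGroup("flush", 200)]
run(groups, {"query": 0.1, "compaction": 0.5, "flush": 0.5}, total_ms=1000.0)

total = sum(g.cpu_time for g in groups)
for g in groups:
    print(f"{g.name}: {g.cpu_time / total:.1%} of CPU")
```

Even though compaction and flush tasks run five times longer than a query task here, each group ends up with CPU time close to its share of the total (1000 : 200 : 200), so background work cannot crowd out queries.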

How disk should work
Little’s law
Internal parallelism
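Little's law ties the two ideas on this slide together: the number of requests in flight equals throughput multiplied by latency, so a disk with internal parallelism needs enough concurrent requests to reach its rated throughput. A small worked example follows; the 1M read IOPS figure is the one quoted for the SSD in this talk, while the 100 µs latency is an assumed value for illustration.

```python
def required_concurrency(iops: float, latency_seconds: float) -> float:
    """Little's law: L = lambda * W (in flight = arrival rate * time in system)."""
    return iops * latency_seconds


def achievable_iops(concurrency: float, latency_seconds: float) -> float:
    """The same law rearranged: throughput reachable at a given concurrency."""
    return concurrency / latency_seconds


# Assumed latency of 100 microseconds per request.
latency = 100e-6
print("in-flight requests needed for 1M IOPS:", required_concurrency(1_000_000, latency))
print("IOPS reachable with 8 requests in flight:", achievable_iops(8, latency))
```

Under these assumptions the disk needs on the order of a hundred requests in flight to reach its rated IOPS, while sending far more than that only adds queueing delay.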

SSDs are Amazing
+ 6.4 GB/s read
+ 3.3 GB/s write
+ 1M read IOPS
+ 200k write IOPS
+ Often, several disks in a single server!

SSDs are Amazing, but not Magic
+ 6.4 GB/s read
+ OR 3.3 GB/s write
+ OR 1M read IOPS
+ OR 200k write IOPS
+ Or some kind of mix
+ But what kind of mix?!

Why mixed workloads?
+ Online transaction processing (OLTP)
  + There’s a real user at the other end
+ Maintenance workloads
  + Scaling out
  + Compaction
  + Backup
+ Analytics (OLAP)
  + Want to soak up free bandwidth, but not under a tight deadline
+ Multi-tenancy
  + Several OLTP and OLAP workloads on the same disk/data

Introducing Diskplorer
+ Tool to test disks at a variety of mixed workloads
+ Open source: https://github.com/scylladb/diskplorer
+ Python, fio, matplotlib
+ Fancy graphs
+ Hours of fun!

Diskplorer 3 (AWS i3en.3xlarge)

Disk math (step 1)
Source: https://www.scylladb.com/2022/08/03/implementing-a-new-io-scheduler-algorithm-for-mixed-read-write-workloads/
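One simple way to reason about "some kind of mix" is a linear budget: each kind of traffic uses up a fraction of its own maximum, and the fractions must sum to at most 1. The sketch below applies that idea to the four SSD limits quoted earlier in the talk. The linear form is a simplifying assumption made here for illustration, and the names `DiskLimits` and `utilization` are hypothetical; the linked blog post describes the scheduler's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class DiskLimits:
    read_bw: float     # bytes per second
    write_bw: float    # bytes per second
    read_iops: float
    write_iops: float


def utilization(limits: DiskLimits, read_bw=0.0, write_bw=0.0,
                read_iops=0.0, write_iops=0.0) -> float:
    """Fraction of the disk's budget used by a mix, under a linear model."""
    return (read_bw / limits.read_bw + write_bw / limits.write_bw
            + read_iops / limits.read_iops + write_iops / limits.write_iops)


GB = 1e9
# Limits from the "SSDs are Amazing" slide.
ssd = DiskLimits(read_bw=6.4 * GB, write_bw=3.3 * GB,
                 read_iops=1_000_000, write_iops=200_000)

# Hypothetical mix: 2 GB/s of reads together with 1 GB/s of writes.
mix = utilization(ssd, read_bw=2 * GB, write_bw=1 * GB)
print(f"budget used: {mix:.2f}", "(fits)" if mix <= 1.0 else "(over budget)")
```

Under this model the disk cannot deliver 6.4 GB/s of reads and 3.3 GB/s of writes at once: either one alone already uses the whole budget.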

Latest Results I3 vs I4i - One Node
i3.16xlarge vs i4i.16xlarge (64 vCPU servers)
50% Reads / 50% Writes
Latency tests with 50% of the max throughput
Source: https://www.scylladb.com/2022/09/07/benchmarking-scylladb-5-0-on-aws-i4i-4xlarge/

Latest Results I3 vs I4i - 3 Node Cluster
Big thanks to Michał Chojnowski for benchmarking all the new AWS instance types!
i3.16xlarge vs i4i.16xlarge (64 vCPU servers)
50% Reads / 50% Writes
Latency tests with 50% of the max throughput
67% better price/performance!

Bill of Materials
+ ScyllaDB cluster: 20 x i3en.metal AWS instances, each having:
  + 96 vCPUs
  + 768 GiB RAM
  + 60 TB NVMe disk space
  + 100 Gbps network bandwidth
+ Load Generators: 50 x c5n.9xlarge AWS instances, each having:
  + 36 vCPUs
  + 96 GiB RAM
  + 50 Gbps network bandwidth

Petabyte Performance
Source: https://www.scylladb.com/2022/07/14/benchmarking-petabyte-scale-nosql-workloads-with-scylladb/

ScyllaDB is Different
+ Multi queue
+ Poll mode
+ Userspace TCP/IP
+ Thread per core
+ Lock-free
+ Task scheduler
+ Reactor programming
+ C++14/17/20…
+ NUMA friendly
+ Log structured allocator
+ Zero copy
+ DMA
+ Log structured merge tree
+ DB-aware cache
+ Userspace I/O scheduler

Higher Throughput - Lower Cost
+ ScyllaDB vs Google Bigtable: 1/7th the cost, 26x performance in real-life scenario
+ ScyllaDB vs Cassandra: 4 ScyllaDB nodes vs 40 Cassandra nodes, 2.5X less expensive, up to 22x better latencies
+ ScyllaDB vs DynamoDB: 1/5th cost, 20x better latencies in real-life scenario

Poll
How much data do you have under management in your transactional database?

Thank you
for joining us today.
@scylladb scylladb/
slack.scylladb.com
@scylladb company/scylladb/
scylladb/
