Transforming the Database:
Critical Innovations for
Performance at Scale
Benny Halevy, Director Software Engineering, ScyllaDB
Tzach Livyatan, VP Product, ScyllaDB
Brought to you by
VIRTUAL EVENT | OCTOBER 19 + 20
P99 Conf: All Things
Performance
The event for developers who care about
P99 percentiles and high-performance,
low-latency applications.
Register at p99conf.io
Poll
Where are you in your NoSQL adoption?
Tzach Livyatan
VP of Product, ScyllaDB
+ Leads the product team at ScyllaDB
+ Appreciates distributed system testing
+ Lives in Tel Aviv, father of two
+ Leads the storage software development team at ScyllaDB
+ Has been working on operating systems and distributed file
systems for over 20 years
+ Most recently, Benny led software development for GSI
Technology, and previously co-founded Tonian (later acquired
by Primary Data) and led it as CTO
+ Before Tonian, Benny was the lead architect in Panasas of the
pNFS protocol
Director, Software Engineering, ScyllaDB
Benny Halevy
Agenda
+ How did we get here? Quick history of HW
+ Shard Per Core Architecture
+ I/O Scheduler revisited
+ I4i results
+ Benchmark a PB cluster
+ Even More Optimizations
+ Q&A
+ InfoWorld 2020 Technology of the Year!
+ Founded by designers of KVM Hypervisor
The Database Built for Gamechangers
“ScyllaDB stands apart...It’s the rare product
that exceeds my expectations.”
– Martin Heller, InfoWorld contributing editor and reviewer
“For 99.9% of applications, ScyllaDB delivers all the
power a customer will ever need, on workloads that other
databases can’t touch – and at a fraction of the cost of
an in-memory solution.”
– Adrian Bridgewater, Forbes senior contributor
+ Resolves challenges of legacy NoSQL databases
+ >5x higher throughput
+ >20x lower latency
+ >75% TCO savings
+ DBaaS/Cloud, Enterprise and Open Source solutions
+ Proven globally at scale
Why Scylla?
On-Prem
Cloud Hosted
Scylla Cloud
Best High Availability in the industry
Best Disaster Recovery in the industry
Best Scalability in the industry
Best Performance in the industry
Auto-tune — out of the box performance
Fully compatible with Cassandra & DynamoDB
The power of Cassandra at the speed of Redis and more
400+ Gamechangers Leverage ScyllaDB
Seamless experiences
across content + devices
Fast computation of flight
pricing
Corporate fleet
management
Real-time analytics
2,000,000 SKU e-commerce
management
Real-time location tracking
for friends/family
Video recommendation
management
IoT for industrial
machines
Synchronize browser
properties for millions
Threat intelligence service
using JanusGraph
Real time fraud detection
across 6M transactions/day
Uber scale, mission critical
chat & messaging app
Network security threat
detection
Power ~50M X1 DVRs with
billions of reqs/day
Precision healthcare via
Edison AI
Inventory hub for retail
operations
Property listings and
updates
Unified ML feature store
across the business
Cryptocurrency exchange
app
Geography-based
recommendations
Distributed storage for
distributed ledger tech
Global operations: Avon,
Body Shop + more
Predictable performance for
on sale surges
GPS-based exercise
tracking
Active/active, replicated, auto-sharded
Scylla Architecture
Quick history of HW and
DBs
Non-Uniform Memory Access (NUMA)
What happened?
+ Per thread performance plateaued
+ Cores: 1 ⟶ 256, NUMA
+ RAM: 2GB ⟶ 2TB
+ Disk space: 10GB ⟶ 10TB
+ Disk seek time: 10-20ms ⟶ 20µs
+ Network throughput: 1Gbps ⟶ 100Gbps
This year: 64-core/128-thread CPUs, 400Gbps NICs, 10µs disk latency, 1.5TB/device, DDR5
2TB/DIMM
AWS u-24tb1.metal: 224 cores, 448 threads, 24TB RAM
A Brief History of Databases
1970s
Mainframes: inception of the relational model
1980s
SQL, relational databases become the de-facto standard
1990s
LAN age: replication, external caching, ORMs
2000s
Web 2.0: NoSQL databases for scale
2010s
Cloud age: commoditization of NoSQL, inception of NewSQL
Cloud Infrastructure: The Last ~10 Years
From 2008 to 2022:
+ SSD: $2500/TB ⟶ $100/TB - 1000x faster, 10x cheaper
+ Typical instance 4 cores ⟶ 96-core VMs - 20x more cores
+ 100Gbps NICs - 100x more network throughput
+ 2000-CPU-core systems and beyond
Shard Per Core
Share nothing, block nothing
18
Sharding/Partitioning
+ Common concept in distributed databases
+ Breaks the system into N non-interacting parts
+ Usually done by hash(partition_key) % N
+ Data/load may be unbalanced
+ Fact of life in distributed databases 🤷
+ Logical mapping of data shards to core shards
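As a sketch of this routing idea (the partitioner and constants below are illustrative assumptions; ScyllaDB actually distributes token ranges computed with Murmur3, not this exact scheme): a partition key is hashed, the hash picks the owning node, and a further mapping picks the core-level shard within that node.

```python
import hashlib

NUM_NODES = 3    # hypothetical cluster size
NUM_SHARDS = 8   # hypothetical cores per node

def partition_hash(key: str) -> int:
    # Stand-in for the real partitioner (ScyllaDB uses Murmur3).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def route(key: str) -> tuple[int, int]:
    h = partition_hash(key)
    node = h % NUM_NODES                    # data shard -> owning node
    shard = (h // NUM_NODES) % NUM_SHARDS   # logical mapping to a core shard
    return node, shard

node, shard = route("user:42")
```

Because both mappings are pure functions of the key, every client and every replica agrees on which core owns which data without any coordination.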
Sharding All The Way Down
[Diagram: partition keys map first to a Node ID, then to a Shard ID within the node]
Traditional Stack vs. Seastar's Sharded Stack
[Diagram: in the traditional stack (e.g. Cassandra), kernel threads share one TCP/IP stack, task scheduler, and memory: lock contention, cache contention, NUMA unfriendly. In Seastar's sharded stack, each core runs its own task scheduler, userspace TCP/IP stack (DPDK; the kernel isn't involved), and NIC queue, with smp queues passing messages between shards: no contention, linear scaling, NUMA friendly.]
Seastar
+ Open source framework, powering ScyllaDB,
Redpanda, ValuStor
+ A “mini operating system in userspace”
+ Task scheduler, I/O scheduler
+ Fully asynchronous - userspace coroutines
+ Direct I/O, self managed cache (bypass pagecache)
+ One thread per core, one shard per core
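A minimal shard-per-core sketch in Python (an illustration of the share-nothing idea only, not Seastar itself, which pins one thread per core and is built on C++ futures): each shard exclusively owns its slice of the data, and cross-shard requests travel over message queues, standing in for Seastar's smp queues, instead of taking locks.

```python
import queue
import threading

NUM_SHARDS = 4

class Shard(threading.Thread):
    """One shard: exclusively owns self.data; only this thread touches it."""
    def __init__(self, shard_id: int):
        super().__init__(daemon=True)
        self.shard_id = shard_id
        self.data = {}              # never shared, so no locks needed
        self.inbox = queue.Queue()  # stand-in for an smp queue

    def run(self):
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "put":
                self.data[key] = value
                reply.put(None)
            elif op == "get":
                reply.put(self.data.get(key))

shards = [Shard(i) for i in range(NUM_SHARDS)]
for s in shards:
    s.start()

def owning_shard(key: str) -> Shard:
    return shards[hash(key) % NUM_SHARDS]

def put(key, value):
    reply = queue.Queue()
    owning_shard(key).inbox.put(("put", key, value, reply))
    reply.get()

def get(key):
    reply = queue.Queue()
    owning_shard(key).inbox.put(("get", key, None, reply))
    return reply.get()

put("k1", "v1")
```

Because no two shards ever touch the same dictionary, there is nothing to lock: contention is replaced by explicit message passing, which is the property that makes the design scale linearly with cores.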
ScyllaDB is Different
+ Multi queue
+ Poll mode
+ Userspace TCP/IP
+ Thread per core
+ Lock-free
+ Task scheduler
+ Reactor programming
+ C++14
+ NUMA friendly
+ Log-structured allocator
+ Zero copy
+ DMA
+ Log-structured merge tree
+ DB-aware cache
+ Userspace I/O scheduler
New I/O Scheduler
Why Scheduling At All
+ Different components (reads, writes, admin) compete for limited resources
+ They have different priorities
+ Left unchecked, each would over-consume the resource
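One common remedy, sketched here with hypothetical share values (this is an illustration of proportional sharing, not ScyllaDB's actual code), is to give each component a number of shares and cap its consumption at its proportional budget:

```python
# Hypothetical shares per component; higher shares = higher priority.
groups = {"query": 1000, "compaction": 250, "flush": 250}
total_shares = sum(groups.values())

def budget(group: str, capacity: float) -> float:
    # Fraction of a resource (e.g. disk bandwidth per second) this
    # group may consume in one accounting period.
    return capacity * groups[group] / total_shares
```

Under this scheme queries get two-thirds of the resource while compaction and flush split the rest, and no component can starve the others by over-consuming.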
How Does It Work?
+ Flush scheduling group
+ Compaction scheduling group
+ Query scheduling group
[Charts: Diskplorer disk measurements (AWS i3en.3xlarge) and the scheduler safety area]
The New I/O Scheduler
+ Collect information about disks
+ Build a more accurate mathematical disk model
+ Embody the model into the I/O scheduler
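As a sketch of what such a mathematical disk model can look like (the formula and limits below are simplifying assumptions for illustration; ScyllaDB's published model is more detailed): each request costs capacity both as an operation and as bytes transferred, and the scheduler dispatches only while total in-flight cost stays within the disk's budget.

```python
# Hypothetical measured disk limits (Diskplorer-style measurements).
MAX_IOPS = 100_000              # operations per second
MAX_BANDWIDTH = 1_000_000_000   # bytes per second

def request_cost(size_bytes: int) -> float:
    # Fraction of one second of disk time this request consumes:
    # one IOPS slot plus its share of the bandwidth.
    return 1 / MAX_IOPS + size_bytes / MAX_BANDWIDTH

def can_dispatch(in_flight_cost: float, size_bytes: int) -> bool:
    # Keep total outstanding cost under one second's worth of disk time,
    # which keeps the disk inside its safety area.
    return in_flight_cost + request_cost(size_bytes) <= 1.0
```

The useful property of combining both terms is that small requests are limited by the IOPS term and large requests by the bandwidth term, so neither mix of I/O sizes can push the disk past its measured limits.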
Latency While Replacing a Node
Replace a Node in ScyllaDB 4.6
[Chart: P99 latency from "new node added" until "streaming completed"]
Replace a Node in ScyllaDB 5.1
[Chart: P99 latency from "new node added" until "streaming completed"]
New I4i Instances
I4i NVMe Storage
Latest Results, I3 vs I4i - One Node
i3.16xlarge vs i4i.16xlarge (64 vCPU servers)
50% Reads / 50% Writes
Latency tests at 50% of the max throughput
Latest Results, I3 vs I4i - 3 Node Cluster
Big thanks to Michał Chojnowski for benchmarking all the new AWS instance types!
i3.16xlarge vs i4i.16xlarge (64 vCPU servers)
50% Reads / 50% Writes
Latency tests at 50% of the max throughput
67% better price/performance!
The Petabyte-scale
Benchmark
Bill of Materials
+ ScyllaDB cluster: 20 x i3en.metal AWS instances, each having:
+ 96 vCPUs
+ 768 GiB RAM
+ 60 TB NVMe disk space
+ 100 Gbps network bandwidth
+ Load Generators: 50 x c5n.9xlarge AWS instances, each having:
+ 36 vCPUs
+ 96 GiB RAM
+ 50 Gbps network bandwidth
Petabyte Performance
Concurrent Workloads: R/W + 80/20
Throughput is in transactions/second; latency is in milliseconds.

Application workload: 200K 50/50 R/W ops/sec
  Write latency: 0.682 P50 / 2.454 P99
  Read latency:  1.195 P50 / 4.555 P99
User workload: 5M 80/20 R/W ops/sec
  Write latency: 0.326 P50 / 1.252 P99
  Read latency:  0.744 P50 / 3.709 P99

+ The added user write workload increases the application workload's latency.
Making Conflicting Loads Coexist with
Workload Prioritization
https://www.scylladb.com/2019/05/23/workload-prioritization-running-oltp-and-olap-traffic-on-the-same-superhighway/
Workload Prioritization at a Glance
Concurrent Workloads: R/W + 80/20, With Workload Prioritization
Throughput is in transactions/second; latency is in milliseconds.
+ As the 80/20 user workload interfered with the application latency,
let's reduce its relative priority to better share the system resources.

Application workload: 200K 50/50 R/W (1000 shares before, 1000 shares after)
  Write latency: 0.682 P50 / 2.454 P99 ⟶ 0.354 P50 / 1.184 P99
  Read latency:  1.195 P50 / 4.555 P99 ⟶ 0.855 P50 / 3.731 P99
User workload: 5M 80/20 R/W (1000 shares before, 500 shares after)
  Write latency: 0.326 P50 / 1.252 P99 ⟶ 0.440 P50 / 3.244 P99
  Read latency:  0.744 P50 / 3.709 P99 ⟶ 1.043 P50 / 6.455 P99
+ Each service level has its own per-shard queue for consuming CPU and I/O
Service Levels in Action
Application workload (200K ops/sec)
User workload (5M ops/sec)
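A stride-scheduling sketch of how per-shard service-level queues can share a resource by shares (an illustrative assumption, not ScyllaDB's implementation): the level with the smallest accumulated "pass" runs next, so service ends up proportional to shares, which is exactly the before/after effect shown above.

```python
import heapq

class ServiceLevel:
    def __init__(self, name: str, shares: int):
        self.name, self.shares = name, shares
        self.pass_ = 0.0   # accumulated virtual time

def schedule(levels, rounds: int):
    """Run `rounds` scheduling decisions; return how often each level ran."""
    served = {lvl.name: 0 for lvl in levels}
    heap = [(lvl.pass_, i, lvl) for i, lvl in enumerate(levels)]
    heapq.heapify(heap)
    for _ in range(rounds):
        _, i, lvl = heapq.heappop(heap)   # lowest pass runs next
        served[lvl.name] += 1
        lvl.pass_ += 1.0 / lvl.shares     # higher shares -> smaller stride
        heapq.heappush(heap, (lvl.pass_, i, lvl))
    return served

# Hypothetical levels mirroring the slide: app at 1000 shares, user at 500.
served = schedule([ServiceLevel("app", 1000), ServiceLevel("user", 500)], 300)
```

With these shares, "app" is served roughly twice as often as "user", which is how halving the user workload's shares shifts latency back toward the application.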
More Optimizations
Large Partition?
Wide Partition Example
Removed Large Partition Penalty
[Diagram: large partitions in RAM and on disk]
Reversed Queries
Rust Driver
code: https://github.com/cvybhu/rust-driver-benchmarks
Higher Throughput - Lower Cost
ScyllaDB vs Google Bigtable
ScyllaDB vs DynamoDB ScyllaDB vs Cassandra
1/7th the cost
26x performance
in real-life scenario
4 ScyllaDB nodes vs
40 Cassandra nodes
2.5X less expensive
up to 22x better latencies
1/5th cost
20x better latencies
in real-life scenario
Poll
How much data do you have under management in your transactional database?
Q/A
Thank you
for joining us today.
@scylladb · slack.scylladb.com
