ScyllaDB Architecture - Built for Speed
Tzach Livyatan, VP Product
Tzach Livyatan
■ VP Product ScyllaDB
■ Love Databases and NoSQL
Presentation Agenda
■ High Availability
■ Data Modeling
■ Implementation
(Diagram: Distributed / Node / HW / Control layers)
High Availability
NoSQL – By Data Model
■ Key / Value: Redis, Aerospike, RocksDB
■ Document store: MongoDB, Couchbase
■ Wide column store: Scylla, Apache Cassandra, HBase, DynamoDB
■ Graph: Neo4j, JanusGraph
(Ordered by increasing complexity of the data model)
NoSQL – By Availability vs Consistency
CAP theorem: pick two of Consistency, Availability, and Partition Tolerance.
PACELC extends CAP: when there is no partition, the trade-off is Latency vs Consistency.
Cluster - Node Ring
(Diagram: Node 1 through Node 5 arranged in a ring)
Data Replication
■ Replication Factor (RF): the number of nodes to which each row/partition is replicated
■ Replication happens automatically
■ Set per keyspace

CREATE KEYSPACE mykeyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'replication_factor': 3}
    AND durable_writes = true;
Replication Factor (RF) = 3
(Diagram: with RF = 3, each partition is stored on 3 of the 5 nodes in the ring)
Multiple Data Centers
(Diagram: USA DC, EU DC, and Asia DC, each with its own replication factor)
'us_1': 3, 'eu': 3, 'asia': 3
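A sketch of how these per-DC factors plug into a keyspace definition; the data-center names must match the names the cluster's snitch reports:

CREATE KEYSPACE mykeyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_1': 3,   -- 3 replicas in the USA DC
    'eu': 3,     -- 3 replicas in the EU DC
    'asia': 3};  -- 3 replicas in the Asia DC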
Consistency Level
■ CL: the number of replicas that must acknowledge a read or write
■ E.g.: ONE, QUORUM, LOCAL_QUORUM, ALL
■ Tunable Consistency: CL is set per operation
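Because CL is per operation, cqlsh lets you set it for a session with its CONSISTENCY command (drivers expose the same setting per statement). A minimal sketch, using the heartrate_v10 table introduced later in this deck:

-- cqlsh: apply LOCAL_QUORUM to subsequent requests in this session
CONSISTENCY LOCAL_QUORUM;

SELECT * FROM heartrate_v10 WHERE
    pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003 LIMIT 1;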
Cluster Level Write
(Diagram: the write goes through the coordinator to the replicas)
Cluster Level Read
(Diagram: with CL = 1, the coordinator returns once a single replica answers)
Datacenters
(Diagram: USA DC and Asia DC; with LOCAL_QUORUM, the quorum is counted only in the coordinator's local DC)
Scylla Architecture
Data Modeling
CQL Example
Query:

SELECT * FROM heartrate_v10 WHERE
    pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003 LIMIT 1;

SELECT * FROM heartrate_v10 WHERE
    pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003 AND
    time >= '2021-05-01 01:00+0000' AND
    time < '2021-05-01 01:03+0000';

https://gist.github.com/tzach/7486f1a0cc904c52f4514f20f14d2a97
Wide Partition Example

CREATE TABLE heartrate_v10 (
    pet_chip_id uuid,
    owner uuid,
    time timestamp,
    heart_rate int,
    PRIMARY KEY (pet_chip_id, time)
);

pet_chip_id                          | time                            | heart_rate
80d39c78-9dc0-11eb-a8b3-0242ac130003 | 2021-05-01 01:00:00.000000+0000 | 120
80d39c78-9dc0-11eb-a8b3-0242ac130003 | 2021-05-01 01:01:00.000000+0000 | 121
80d39c78-9dc0-11eb-a8b3-0242ac130003 | 2021-05-01 01:02:00.000000+0000 | 120

pet_chip_id is the Partition Key; time is the Clustering Key.
Architecture

The partitioner runs the partition key (pet_chip_id) through a hash function to produce a token; the token falls into a token range owned by a set of replica nodes. Every row with the same partition key hashes to the same token, so a whole partition lives on the same replicas.
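The token is visible from CQL via the token() function; a small sketch against the table above:

-- Every row of the same partition returns the same token
SELECT pet_chip_id, token(pet_chip_id) FROM heartrate_v10;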
Advanced Data Modeling
Materialized Views (MV)
Secondary Index (SI)
Change Data Capture (CDC)
Collections
User Defined Types
Time To Live (TTL) (see the example below)
…
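Of these, TTL is the quickest to show; a minimal sketch against heartrate_v10 (the 24-hour expiry is an arbitrary choice):

-- The row expires automatically 86400 seconds (24 hours) after the write
INSERT INTO heartrate_v10 (pet_chip_id, time, heart_rate)
VALUES (80d39c78-9dc0-11eb-a8b3-0242ac130003, '2021-05-01 01:03+0000', 119)
USING TTL 86400;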
View is another table

Materialized View write path:
1. The client sends the base-table write to the coordinator:
   INSERT INTO heartrate (pet_chip_id, owner, time, heart_rate) VALUES (..);
2. The coordinator forwards the write to the base replica: INSERT INTO heartrate
3. The base replica updates the view replica: INSERT INTO heartrate_by_owner
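For context, a sketch of how such a view could be declared over the heartrate_v10 schema above; the view key must contain every base primary-key column, plus at most one extra column (here, owner):

CREATE MATERIALIZED VIEW heartrate_by_owner AS
    SELECT * FROM heartrate_v10
    WHERE owner IS NOT NULL AND pet_chip_id IS NOT NULL AND time IS NOT NULL
    PRIMARY KEY (owner, pet_chip_id, time);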
View is another table

Materialized View read path: the view is queried like any other table.
1. The client sends the query to the coordinator:
   SELECT * FROM heartrate_by_owner WHERE owner = '642a..';
2. The coordinator reads from the view replica:
   SELECT * FROM heartrate_by_owner WHERE owner = '642a..';
Global Secondary Index - Different Partition Key
1. The client queries the coordinator by a non-key column:
   SELECT * FROM heartrate_v10 WHERE owner = '642a..';
2. The coordinator first queries the index table on the view replica:
   SELECT name FROM pet_by_owner_index WHERE owner = '642a..';
3. It then reads the matching rows from the base replica:
   SELECT * FROM heartrate_v10
   WHERE pet_chip_id IN (...) AND time IN (...);
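For reference, a sketch of the declaration behind such an index (the index name is illustrative). Scylla backs a global secondary index with a materialized-view-like table partitioned by the indexed column, which is why the read above takes two hops:

-- Index the base table by the non-key column owner
CREATE INDEX heartrate_owner_idx ON heartrate_v10 (owner);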
Write Path - Replica
(Diagram: a write is appended to the commitlog and applied to the in-memory memtable; memtables are later flushed to SSTables, as the next slides show)
Read Path - Replica
(Diagram: a read first checks the cache in memory; on a miss, Bloom filters screen out SSTables that cannot contain the key before the remaining SSTables are read from disk)
Storage - Log-Structured Merge Tree
(Diagram, three stages over time: a flush creates SSTable 1; later flushes add SSTable 2 and SSTable 3; compaction merges SSTables 1+2+3 into one while a new SSTable 4 arrives)
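Compaction behavior is tunable per table in CQL; a sketch using the size-tiered strategy (the default; other strategies exist):

-- Merge similarly sized SSTables, as in the diagram above
ALTER TABLE heartrate_v10
WITH compaction = {'class': 'SizeTieredCompactionStrategy'};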
Implementation
ScyllaDB Design Decisions
1. C++ instead of Java
2. All Things Async
3. Shard per Core
4. Unified Cache
5. I/O Scheduler
6. Autonomous
ScyllaDB Design Decisions: Shard per Core
(Diagram: the data set is split into Shards, one per core)
Small, Medium, Large Machines
Why larger nodes?
■ Fewer nodes means failures are less frequent (with many small nodes, the time between failures across the cluster is shorter)
■ Ease of maintenance
■ No noisy neighbours
■ No virtualization or container overhead
■ No other moving parts
■ Scale up before scaling out!
Linear Scale Ingestion
(Chart: ingestion time stays constant while data volume and throughput double at each step: 2X, 2X, 2X, 2X, 2X)
Network Comparison
Traditional Stack (Cassandra):
■ TCP/IP and the scheduler live in the kernel; threads share memory and contend on per-thread queues and shared NIC queues

SeaStar's Sharded Stack:
■ The application and TCP/IP run in userspace, driving the NIC via DPDK; the kernel isn't involved
■ Each core runs its own task scheduler with its own queues, an SMP queue for cross-core messages, and a dedicated NIC queue
■ One such engine per core, with the database layered on top
ScyllaDB Has Its Own Task Scheduler
Traditional Stack: each CPU runs a scheduler over threads.
■ A thread is a function pointer; its stack is a byte array from 64k to megabytes.
Scylla's Stack: each CPU runs chains of promises and tasks.
■ A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function.
ScyllaDB Design Decisions: Unified Cache
Cassandra: key cache, row cache, an on-heap/off-heap split, and the Linux page cache all sit between the application and the SSTables; complex tuning.
Scylla: a single unified cache in front of the SSTables.
ScyllaDB Design Decisions: Unified Cache (cont.)
Why not rely on the Linux page cache, as Cassandra does? Every cache miss costs kernel round-trips:
1. The app thread touches an unmapped page: page fault; the kernel suspends the thread, initiates I/O, context switch
2. The I/O completes: interrupt, another context switch
3. The kernel maps the page and resumes the thread
ScyllaDB Design Decisions: I/O Scheduler
(Diagram: Query, Commitlog, and Compaction requests each enter their own queue; a userspace I/O scheduler dispatches them to the disk, capping it at the max useful disk concurrency)
ScyllaDB Design Decisions: Autonomous
(Diagram: the Seastar scheduler multiplexes Memtable, Compaction, Query, Repair, and Commitlog work across the SSD, WAN, and CPU; a compaction backlog monitor and a memory monitor adjust priorities on the fly)
Different Types of Loads
■ OLTP
  ● Small work items
  ● Latency sensitive
  ● Involves a narrow portion of the data
■ OLAP
  ● Large work items
  ● Throughput oriented
  ● Performed on large amounts of data
Workload Prioritization
■ Load #1: 200 shares
■ Load #2: 400 shares
■ Load #3: 800 shares
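In ScyllaDB Enterprise, shares like these are assigned through service levels in CQL; a sketch (the level and role names are illustrative):

CREATE SERVICE LEVEL IF NOT EXISTS load1 WITH shares = 200;
CREATE SERVICE LEVEL IF NOT EXISTS load2 WITH shares = 400;
CREATE SERVICE LEVEL IF NOT EXISTS load3 WITH shares = 800;

-- Sessions authenticated as this role inherit load3's priority
ATTACH SERVICE LEVEL load3 TO analytics_role;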
More than 1M req/sec on i4i.8xlarge
https://github.com/scylladb/1m-ops-demo by Attila Tóth
Summary
■ Built for High Availability
■ Designed for modern hardware
■ Fully async, shared-nothing, shard-per-core architecture
■ Superior throughput and consistently low latency
■ Exposes the internal scheduler to the user as Workload Prioritization
Scylla vs Competition
(Log-scaled chart comparing Scylla against CockroachDB, Google's Bigtable, DynamoDB, and Cassandra)
■ 1/7th the cost
■ 26x better in a real-life scenario
■ 10x volume
■ 9.3x throughput
■ 1/4x latency
■ 4 Scylla nodes vs 40 Cassandra nodes
■ 2.5X cheaper
■ 11x better latency
■ 1/5th the cost in a benchmark
■ 20x better in a real-life scenario
■ No throttling
■ No locking
Stay in Touch
Tzach Livyatan
tzach@scylladb.com
https://twitter.com/TzachL
https://github.com/tzach
https://www.linkedin.com/in/tzach/
