ScyllaDB Leaps Forward
Dor Laor, Co-founder & CEO of ScyllaDB
Dor Laor
■ Hello world, Core router, KVM, OSv, ScyllaDB
■ PhD in Snowboard, aspiring MTB rider
■ Let’s shard!
3 Learnings from the KVM Hypervisor Dev Period
1. Layer == overhead
2. Locking == Evil
3. Simplicity == True
4. 1 million op/s == Expectations
5. Off-by == one || 0x10
ScyllaDB is Proud to Serve!
You’ll hear from many of our customers
Why ScyllaDB?
Best High Availability in the industry
Best Disaster Recovery in the industry
Best scalability in the industry
Best Price/Performance in the industry
Auto-tune - out of the box performance
Compatible with Cassandra & DynamoDB
The power of Cassandra at the speed of Redis with the usability of DynamoDB
No Lock-in
Open Source Software
Agenda
Part 1
■ Arch overview
■ New 2024 results & benchmarks
■ ScyllaDB Cloud brief
Part 2
■ ScyllaDB 6.0
■ Tablets
■ What’s coming
Shard Per Core Architecture
[Diagram: shards vs. threads]
ScyllaDB Architecture
Homogeneous nodes, Ring Architecture
Can we linearly scale up?
Small vs Large Machines
■ Time between failures (TBF)
■ Ease of maintenance
■ No noisy neighbours
■ No virtualization or container overhead
■ No other moving parts
■ Scale up before out!
Linear Scale Ingestion (2X keys at each step)
3 x 4-vcpu VMs: 1B keys
3 x 8-vcpu VMs: 2B keys
3 x 16-vcpu VMs: 4B keys
3 x 32-vcpu VMs: 8B keys
3 x 64-vcpu VMs: 16B keys
3 x 128-vcpu VMs: 32B keys
Credit: Felipe Cardeneti Mendes
Linear Scale Ingestion (2023 vs 2018)
Constant Time Ingestion
Ingestion: 17.9k OPS per shard * 16
Ingestion: <2ms P99 per shard
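(If the 16 here is the per-node shard count — an assumption from the chart — that works out to roughly 17.9k × 16 ≈ 286k OPS per node.)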
“Nodes must be small, in case they fail”
You can scale OPS, but can you scale failures?
“Nodes must be small, in case they fail”
No they don’t! {Replace, Add, Remove} a node in constant time
What’s new in 2024.1?
2024.1 vs 2023.1 vs OSS 5.4
Benchmark: ScyllaDB vs MongoDB
■ Fair
■ 3rd party
■ SaaS - zero config
■ YCSB - standard
■ 132 wins in 133 workloads
■ 10x-20x better performance
■ 10x-20x lower latency
■ MongoDB doesn’t scale!
Scale Benchmark: ScyllaDB vs MongoDB
ScyllaDB Cloud News
ScyllaDB Cloud - 65% of Customers
■ Terraform providers: launch, scale, manage
■ Multiple networking modes
■ Encryption at rest, BYOK
■ Certifications: SOC 2, ISO, PCI
■ Coming: Azure
Welcome ScyllaDB
Our journey from eventual to immediate consistency
Scylla today is awesome but
■ Topology changes are allowed one-at-a-time
■ Rely on 30+ second timeouts for consistency
■ A failed or down node blocks scaling
■ Streaming time is a function of the schema
■ Additional complex operations: Cleanup, repair
ScyllaDB 6.0
■ Consistent schema changes (raft, >= 5.2)
■ Consistent topology changes (raft, 6.0)
■ Tablets
ScyllaDB 6.0 Value
Elasticity
■ Faster bootstrap
■ Concurrent node operations
■ Immediate request serving
Simplicity
■ Transparent cleanups
■ Semi transparent repairs
■ Auto gc-grace period
■ Parallel maintenance operations
Speed
■ Streaming sstables - 30x faster
■ Load balancing of tablets
Consistency
TCO
■ Shrink free space
■ Reduce static, over-provisioned deployments
Behind the Scenes
Raft - Consistent Metadata
Protocol for state machine replication
Total order broadcast of state change commands
Example command sequence: X = 0, X += 1, CAS(X, 0, 1)
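To make that command sequence concrete, here is a minimal sketch (not ScyllaDB code) of a replicated state machine: Raft's total order broadcast means every replica applies the same committed log in the same order, so all replicas converge on the same value of X. The struct and function names are invented for illustration.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Illustration only: a toy replicated state machine. Raft guarantees every
// replica sees the same committed log, so applying it in order yields the
// same state on every replica.
enum class Op { Set, Add, Cas };

struct Command {
    Op op;
    int64_t arg1;      // Set: new value, Add: delta, Cas: expected value
    int64_t arg2 = 0;  // Cas: new value
};

struct StateMachine {
    int64_t x = 0;

    void apply(const Command& c) {
        switch (c.op) {
        case Op::Set: x = c.arg1; break;
        case Op::Add: x += c.arg1; break;
        case Op::Cas: if (x == c.arg1) x = c.arg2; break;  // compare-and-swap
        }
    }
};

int main() {
    // The committed log from the slide: X = 0, X += 1, CAS(X, 0, 1).
    std::vector<Command> raft_log = {{Op::Set, 0}, {Op::Add, 1}, {Op::Cas, 0, 1}};
    StateMachine a, b, c;  // three replicas applying the same log in the same order
    for (auto* replica : {&a, &b, &c})
        for (const auto& cmd : raft_log) replica->apply(cmd);
    std::cout << a.x << " " << b.x << " " << c.x << "\n";  // prints "1 1 1"
    return 0;
}
```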
Linearizable Token Metadata
[Diagram: nodes A, B, and C — bootstrap operations serialized through read barriers on system.token_metadata]
Changes in the Data Plane
■ Fencing - each write is signed with the topology version
■ If there is a version mismatch, the write doesn’t go through
[Diagram: Coordinator, Replica, Topology coordinator]
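A hedged sketch of the fencing rule above, assuming a single monotonically increasing topology version: each write carries the version the coordinator signed it with, and a replica that has already moved to a newer version rejects it rather than applying stale-topology data. The types and error handling below are illustrative, not ScyllaDB's write path.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustration only: a write stamped with a stale topology version is
// rejected (fenced) instead of being applied by replicas that moved on.
struct Write {
    uint64_t topology_version;  // version the coordinator signed the write with
    // key, value, timestamp, etc. elided
};

struct Replica {
    uint64_t current_topology_version;

    void apply(const Write& w) {
        if (w.topology_version < current_topology_version) {
            // The topology changed under the coordinator; force a retry
            // against the new topology rather than accepting stale data.
            throw std::runtime_error("stale topology version: write fenced");
        }
        // ... apply the mutation locally ...
    }
};

int main() {
    Replica r{7};                      // replica already at topology version 7
    try {
        r.apply(Write{6});             // write signed under an older topology
    } catch (const std::exception&) {
        // fenced: the coordinator must refresh its topology and retry
    }
    r.apply(Write{7});                 // current version: accepted
    return 0;
}
```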
Consistent Metadata Journey (5.0 → 5.2 → 5.2+ → 6.0)
Raft → Safe schema changes → Safe topology changes → Dynamic partitioning, Consistent tables, Tablets
6.0 Tablets FTW
Standard tables
Replication metadata {R1, R2, R3} is per keyspace; key1 maps to replicas R1, R2, R3
Standard tables
Per-keyspace replication metadata {R1, R2, R3}; key1 and key2 map to replicas R1, R2, R3
The sharding function generates good load distribution between CPUs
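A rough sketch of the shard-per-core idea, not ScyllaDB's actual implementation: hash the partition key to a token and map tokens evenly onto the node's CPUs (shards). ScyllaDB uses Murmur3 tokens and a more careful token-to-shard mapping; std::hash, the modulo mapping, and the 8-shard node below are stand-ins.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Illustration only: every partition key hashes to a 64-bit token, and tokens
// are spread evenly over the node's shards (one shard per core). std::hash and
// the modulo mapping stand in for ScyllaDB's Murmur3 token and shard routing.
uint64_t token_of(const std::string& key) {
    return std::hash<std::string>{}(key);
}

unsigned shard_of(uint64_t token, unsigned shard_count) {
    // Simplified mapping; good enough to show the load-spreading idea.
    return static_cast<unsigned>(token % shard_count);
}

int main() {
    const unsigned shards = 8;  // e.g. an 8-vCPU node
    std::map<unsigned, unsigned> per_shard;
    for (int i = 0; i < 100000; ++i)
        ++per_shard[shard_of(token_of("key" + std::to_string(i)), shards)];
    for (const auto& [shard, count] : per_shard)  // roughly even counts per shard
        std::cout << "shard " << shard << ": " << count << " keys\n";
    return 0;
}
```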
Raft tables
Each tablet is its own Raft group (e.g. group No. 299236, No. 299237, No. 299238); key1 and key2 map to tablet replicas
Raft tables
Good load distribution requires lots of Raft groups.
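To make the tablet model concrete, here is an illustrative (assumed, not actual) lookup path a coordinator could use: hash the key to a token, find the tablet whose token range contains it, and read that tablet's own replica list, instead of the per-keyspace replica set used by standard tables. The container layout and names are invented for this sketch.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustration only: each tablet owns a contiguous token range and carries its
// own replica list, unlike the keyspace-wide replica set of standard tables.
struct Tablet {
    uint64_t last_token;                 // upper bound of this tablet's token range
    std::vector<std::string> replicas;   // e.g. {"nodeA:shard3", "nodeB:shard1", ...}
};

// Keyed by each tablet's last token, so lower_bound() finds the owning tablet.
using TabletMap = std::map<uint64_t, Tablet>;

const Tablet& tablet_for(const TabletMap& m, const std::string& key) {
    uint64_t token = std::hash<std::string>{}(key);          // stand-in token function
    auto it = m.lower_bound(token);
    return it == m.end() ? m.begin()->second : it->second;   // wrap around the ring
}

int main() {
    TabletMap m;
    m[UINT64_MAX / 2] = {UINT64_MAX / 2, {"nodeA", "nodeB", "nodeC"}};
    m[UINT64_MAX]     = {UINT64_MAX,     {"nodeB", "nodeC", "nodeD"}};
    for (const auto& replica : tablet_for(m, "key1").replicas)
        std::cout << replica << " ";
    std::cout << "\n";
    return 0;
}
```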
Tablets - balancing
A table starts with a few tablets; small tables stay that way,
not fragmented into tiny pieces as with tokens.
Tablets - balancing
When a tablet becomes too heavy (disk, CPU, …) it is split
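A minimal sketch of that split rule, assuming a per-tablet size threshold: once a tablet's footprint crosses the threshold, it is divided at the midpoint of its token range so the two halves can be placed independently. The 10 GiB threshold and field names are assumptions for illustration.

```cpp
#include <cstdint>
#include <vector>

// Illustration only: a tablet that grows too heavy is divided at the midpoint
// of its token range, so the load balancer can place the halves independently.
struct Tablet {
    uint64_t first_token;
    uint64_t last_token;
    uint64_t disk_bytes;
};

constexpr uint64_t kSplitThreshold = 10ull * 1024 * 1024 * 1024;  // assumed 10 GiB

std::vector<Tablet> maybe_split(const Tablet& t) {
    if (t.disk_bytes < kSplitThreshold)
        return {t};                               // small enough: leave it alone
    uint64_t mid = t.first_token + (t.last_token - t.first_token) / 2;
    return {
        {t.first_token, mid, t.disk_bytes / 2},   // lower half of the token range
        {mid + 1, t.last_token, t.disk_bytes / 2} // upper half
    };
}

int main() {
    Tablet hot{0, UINT64_MAX, 12ull * 1024 * 1024 * 1024};  // 12 GiB: over threshold
    return maybe_split(hot).size() == 2 ? 0 : 1;
}
```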
Tablets - balancing
The load balancer can decide to move tablets
Tablets - balancing
Depends on fault-tolerant, reliable, and fast topology changes.
Tablets
Resharding is cheap.
SStables split at tablet boundary.
Reassign tablets to shards (logical operation).
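An illustrative sketch of why resharding is a logical operation under tablets: since SSTables already end at tablet boundaries, changing the shard count only means reassigning whole tablets to shards in metadata. The round-robin placement below is an assumption for the sketch, not ScyllaDB's policy.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustration only: under tablets, resharding is a metadata change.
// SSTables already end at tablet boundaries, so moving a tablet to a
// different shard on the same node rewrites nothing on disk.
using TabletId = uint64_t;

std::unordered_map<TabletId, unsigned>
reassign(const std::vector<TabletId>& tablets, unsigned new_shard_count) {
    std::unordered_map<TabletId, unsigned> owner;
    unsigned next = 0;
    for (TabletId t : tablets) {
        owner[t] = next;                          // assumed round-robin placement
        next = (next + 1) % new_shard_count;
    }
    return owner;                                 // logical reassignment only
}

int main() {
    auto owner = reassign({101, 102, 103, 104}, 2);  // 4 tablets onto 2 shards
    return owner.size() == 4 ? 0 : 1;
}
```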
Tablets
Cleanup after topology change is cheap.
Just delete SStables.
Tablet Scheduler
The scheduler globally controls movement and maintenance operations on a per-tablet basis
[Diagram: tablet 0 and tablet 1, each with scheduled repair, migration, and backup work]
Tablet Scheduler
Goals:
■ Maximize throughput (saturate)
■ Keep migrations short (don’t overload)
Rules:
■ migrations-in <= 2 per shard
■ migrations-out <= 4 per shard
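A hedged sketch of how those two rules could be enforced, assuming the balancer tracks in-flight migrations per shard; the counters, names, and shard identifiers are illustrative, not the actual scheduler code.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative admission control mirroring the slide's rules:
// at most 2 migrations into a shard and 4 migrations out of a shard.
struct MigrationBook {
    std::unordered_map<std::string, int> in_flight_in;   // per destination shard
    std::unordered_map<std::string, int> in_flight_out;  // per source shard

    bool try_admit(const std::string& src_shard, const std::string& dst_shard) {
        if (in_flight_in[dst_shard] >= 2 || in_flight_out[src_shard] >= 4)
            return false;              // would overload a shard; defer this move
        ++in_flight_in[dst_shard];
        ++in_flight_out[src_shard];
        return true;                   // caller may start streaming the tablet
    }

    void finish(const std::string& src_shard, const std::string& dst_shard) {
        --in_flight_in[dst_shard];
        --in_flight_out[src_shard];
    }
};

int main() {
    MigrationBook book;
    assert(book.try_admit("A:shard0", "B:shard3"));    // 1st into B:shard3 - admitted
    assert(book.try_admit("A:shard0", "B:shard3"));    // 2nd into B:shard3 - admitted
    assert(!book.try_admit("A:shard1", "B:shard3"));   // 3rd into B:shard3 - deferred
    book.finish("A:shard0", "B:shard3");               // one migration completes
    assert(book.try_admit("A:shard1", "B:shard3"));    // room again - admitted
    return 0;
}
```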
Tablets - Repair - Fire & forget
■ Tablet based
■ Continuous, transparent, controlled by the load balancer
■ Auto GC grace period
Tablets - Streaming (Enterprise)
Send files over RPC. No per-schema, per-row processing.
30x faster, saturates links.
Scylla Enterprise only
[Diagram: SSTable files streamed between nodes]
Post 6.0
Since we have
■ 30x faster streaming
■ Parallel operations
■ Small unit size - based on tablet/shard, not hardware
■ Negligible performance impact
■ Incremental serving as you add machines
No reliance on cluster size, instance size or instance type
Tablets =~ Serverless
Serverless
[Chart: required capacity over time — base capacity vs. on-demand, served by i3en / i4i instance types]
Typeless, sizeless, limitless
What’s Cooking?
What’s in the Oven
■ Full transactional consistency with Raft
■ HDD and dense standard nodes (d3en)
■ S3 backend
■ Incremental repair
■ Point-in-time backup/restore
■ Tiered storage
Eventual Consistency
Thank You! Keep Innovating!
IoT
Crypto
eCommerce
Telco
Feature Store
Streaming
Fintech
Cyber
Social network
Graph Storage Layer
Recommendation & Personalization Engine
Fraud & Threat Detection
AI/ML
Analytics
Customer Experience
