Glauber Costa
Principal Architect, ScyllaDB
SCYLLA
Want to win a ScyllaDB T-shirt?
+ Tweet a picture of you and your plushie at a well-known landmark
+ Make sure to mention @ScyllaDB!
Today we will cover:
+ What’s ScyllaDB; why ScyllaDB
+ How ScyllaDB helps AdGear win
+ What’s under the hood that makes that possible
No clear winner in NoSQL
Challenges:
• Cost
• Lock-in
Challenges:
• Scale
• Multi DC
• Latency
Challenges:
• Not persistent
• Manageability
Challenges:
• Price/performance
• Complexity
• JVM..
What we do: Scylla, towards the best NoSQL
+ > 1 million OPS per node
+ < 1 ms 99th-percentile latency
+ Auto tuned
+ Scale up and out
Some of Our Users
             | Cassandra                               | Scylla
Throughput   | Cannot utilize multi-core efficiently   | Scales linearly - shard-per-core
Latency      | High due to Java and JVM’s GC           | Low and consistent - own cache
Complexity   | Intricate tuning and configuration      | Auto tuned, dynamic scheduling
Admin        | Maintenance impacts performance         | SLA guarantee for admin vs serving
Scylla Scales UP and OUT
Ingestion time. Every point doubles node size and data per node.
Total data size per node in the i3.16xlarge case is 4.8TB.
[Figure: time to ingest for 1B, 2B, 4B, 8B, and 16B rows]
Scylla Scales UP and OUT
nodetool compact from quiescent state. Each point doubles node size and data per node
4.8TB i3.16xlarge: 2:11:34
[Figure: time to fully compact the node at 0.3TB, 0.6TB, 1.2TB, 2.4TB, and 4.8TB]
“Nodes must be small in case they fail”
+ No, they don’t.
+ Same clusters as previous experiments.
+ Destroy compacted node, rebuild from remaining two.
[Figure: rebuild results for 1B, 2B, 4B, 8B, and 16B rows (0.3TB, 0.6TB, 1.2TB, 2.4TB, and 4.8TB)]
Please welcome Mina Naguib!
About AdGear Samsung Ads
1. AdTech (Advertising Technology) space
2. Started ~10 years ago here in Montreal
▪ Classical Publisher and Advertiser use cases
▪ “Big Data” 250-5k ad impressions / second
3. Then added RTB (Real-Time-Bidding) functionality
▪ Classical buyer/seller use cases
▪ “Big Data” 1M+ transactions / second
4. Then acquired by Samsung VD (Visual Display) while forming Samsung Ads
▪ Classical hardware manufacturer
▪ Unique “Big Data” and opportunities
Real-Time-Bidding:
RTB: Value in execution based on data asymmetry
bob: previously purchased a $4k bike
bob: habitually watches cycling races
bob: is male
bob: db timeout
Requirements for that database:
1. Key-value(s) store
2. Low-latency reads. Single-digit milliseconds or less
3. High-throughput to keep up with the rest of the stack volume
4. Horizontal scalability
5. Multi-DC by design
6. Behaves well under mixed concurrent loads:
a. Point Reads X Point Writes X Bulk Writes
Apache Cassandra at AdGear
1. Used Cassandra since 2010 (v0.6) on sun-jdk (1.6)
a. Those were the days of many operational “WTFs” and gnashing of teeth
i. Fun fact! That JVM enters 100% CPU usage on leap second adjustments!
b. But it worked fairly well all things considered
2. Cassandra matured as our company matured:
a. Now with VTokens as described in the Dynamo paper. Yay!
b. Now with LevelDB-like compaction strategy. Yay!
c. Now with off-heap low-GC-cost data structures. Yay!
d. Now with G1GC on by default. Yay!
e. Now with forked community vs enterprise roadmap.. Yay?
2017 Tipping Point
Cassandra:
• Slowly losing the latency battle
• Node proliferation
• Load-induced deep JVM bugs beyond our capacity to debug -> instability
• Not particularly interested in enterprise-packaged version of the above
What to do:
• What are modern alternatives?
• Have you guys heard of ScyllaDB? Seen them pop up a few times
• Willing to help POC with great engineering guidance!
• Marketed as:
▪ service cassandra stop
▪ service scylladb start
2017 ScyllaDB at AdGear
                     | Cassandra                   | Scylla
Servers              | 31                          | 16
Read latency         | ~21ms                       | <5ms
Backlog and timeouts | As high as 15% at peak ☹    | ~0
2017 ScyllaDB at AdGear: POC metrics
2018 ScyllaDB at AdGear: In Production
HOW?
Threads Shards
Two-level sharding - shard per core
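The two-level idea can be sketched roughly as follows: a key is first mapped to a node, then to a core (shard) on that node, so each core owns its own slice of the data. This is a simplified illustration only. The hash function, the modulo mapping, the node names, and the `token` and `route` helpers are all assumptions for the example, not Scylla's actual token ring or shard-assignment algorithm.

```python
import hashlib

def token(key):
    """Map a partition key to a 64-bit integer token (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def route(key, nodes, cores_per_node):
    """Return the (node, shard) pair that owns the key in this toy model."""
    t = token(key)
    node = nodes[t % len(nodes)]                 # level 1: pick the node
    shard = (t // len(nodes)) % cores_per_node   # level 2: pick the core on that node
    return node, shard

nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
for key in ("bob", "alice", "carol"):
    print(key, route(key, nodes, cores_per_node=8))
```

Because the mapping is deterministic, every request for the same key lands on the same core, which is what lets a shard work without sharing data with other cores.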
Seastar, Scylla’s engine: “All things async”
Close to the hardware
• Our own memory allocator
• Our own Disk I/O Scheduler
• Our own CPU Scheduler
• Our own cache, bypasses Linux entirely.
The Autonomous NoSQL Database
• SLA for Requests over maintenance operations
• Automatic tuning
• Automatic backpressure
• Scale up/down easily and stream as fast as possible
• Ongoing repair
• Smooths complex data models
Throughput is EASY
• Maybe costly, but easy
• Bruce Wayne can get any throughput he wants from any modern NoSQL, including Cassandra.
LATENCY IS HARD
Dear Scylla,
What do you call a latency distribution for which the high percentiles are much higher than the average?
Three main sources of latencies - Act 1
(Speed mismatch)
How fast is my system?
▪ There are two speeds:
o Disk speed
o CPU/memory speed
▪ What happens when they are not in sync?
latency mean : 51.9
latency median : 9.8
latency 95th percentile : 125.6
latency 99th percentile : 1184.0 (x 22)
latency 99.9th percentile : 1991.2 (x 38)
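The multipliers in parentheses appear to be each high percentile divided by the mean (1184.0 / 51.9 is a little under 23, which the slide shows as x 22). A minimal sketch of that arithmetic, assuming this reading is right; `tail_multiplier` is a hypothetical helper name:

```python
def tail_multiplier(percentile_latency, mean_latency):
    """How many times larger a percentile latency is than the mean."""
    return percentile_latency / mean_latency

# Values copied from the speed-mismatch run above.
mean = 51.9
for label, value in (("99th", 1184.0), ("99.9th", 1991.2)):
    print(f"{label} percentile is {tail_multiplier(value, mean):.1f}x the mean")
```

The larger this ratio, the further the tail sits from typical behavior, which is the gap the rest of the talk sets out to close.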
The Wall - where is it relevant?
▪ Disk speed slower than CPU speed
o plain slow disk, large payloads
▪ Any other mismatch between resources
o For example, large memory capped by narrow network
The Wall
The Wall - Results
latency mean : 54.9
latency median : 43.5
latency 95th percentile : 126.9
latency 99th percentile : 253.9 (x 4.6)
latency 99.9th percentile : 364.6 (x 6.6)
Three main sources of latencies - Act 2
(Lack of respect for limits)
Tasks in Scylla
[Figure: traditional stack (threads, each with its own stack, under a scheduler) compared with Scylla’s stack (queues of promises and tasks per CPU)]
Thread is a function pointer
Stack is a byte array from 64k to megabytes
Promise is a pointer to an eventually computed value
Task is a pointer to a lambda function
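To make the promise/task vocabulary concrete, here is a toy model in Python: a promise holds a value that arrives later, and each continuation attached to it becomes a small task on a run queue instead of a thread with its own stack. This is only a sketch of the general idea. The `Promise` class, `run_queue`, and `run` are hypothetical names, and Seastar's real C++ futures differ in many details.

```python
from collections import deque

run_queue = deque()  # tasks waiting to run on this (single) CPU

class Promise:
    """Holds a value that will be computed eventually."""

    def __init__(self):
        self._callbacks = []
        self._value = None
        self._ready = False

    def then(self, fn):
        """Attach a continuation; returns a promise for the continuation's result."""
        nxt = Promise()

        def task(value):
            nxt.set_value(fn(value))

        if self._ready:
            run_queue.append(lambda: task(self._value))
        else:
            self._callbacks.append(task)
        return nxt

    def set_value(self, value):
        """Fulfil the promise and queue every waiting continuation as a task."""
        self._value, self._ready = value, True
        for cb in self._callbacks:
            run_queue.append(lambda cb=cb: cb(value))
        self._callbacks.clear()

def run():
    """Drain the run queue; each task is just a small callable."""
    while run_queue:
        run_queue.popleft()()

results = []
p = Promise()
p.then(lambda v: v + 1).then(results.append)
p.set_value(41)
run()
print(results)
```

A waiting continuation costs one small object here, where a blocked thread would hold an entire stack.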
The task quota
▪ How often do we check the work queues?
▪ Pre-2.0 defaults too high for latency-bound systems
▪ Tasks not respecting it will cause spikes
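A rough sketch of why a task that ignores the quota causes a spike: in a cooperative loop the clock is only checked between tasks, so nothing can interrupt a long one. The loop below is an illustration under that assumption, not Seastar's reactor. `run_until_quota` and `FakeClock` are hypothetical names, and the microsecond figures are made up for the demo.

```python
import time
from collections import deque

def run_until_quota(tasks, quota, clock=time.monotonic):
    """Run queued tasks until the quota is used up; return how many ran.

    The clock is only consulted between tasks, so one long task overruns the quota.
    """
    start = clock()
    ran = 0
    while tasks and clock() - start < quota:
        tasks.popleft()()
        ran += 1
    return ran

class FakeClock:
    """Deterministic clock for the demo; time is in microseconds."""

    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def advance(self, amount):
        self.now += amount

clock = FakeClock()
short_tasks = deque([lambda: clock.advance(200)] * 10)
print(run_until_quota(short_tasks, quota=500, clock=clock), "short tasks ran")
```

With well-behaved tasks the loop returns to the work queues shortly after the quota expires. A single task that runs for many quotas delays everything queued behind it, and that delay shows up in the high percentiles.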
Three main sources of latencies - Act 3
(Imperfect Isolation)
The I/O Scheduler
[Figure: Query, Commitlog, and Compaction each feed a queue into the userspace I/O scheduler, which feeds the disk. Labels: max useful disk concurrency; I/O queued in FS/device; no queues.]
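The diagram's idea can be sketched as follows: requests wait in per-class queues in userspace, and only as many as the disk can usefully handle at once are dispatched, so nothing piles up in the filesystem or device. This is a toy model. The `IOSchedulerSketch` class, the share values, and the "least service relative to shares" selection rule are assumptions for illustration, not Scylla's actual scheduling algorithm.

```python
from collections import deque

class IOSchedulerSketch:
    """Toy userspace I/O scheduler with per-class queues and a concurrency cap."""

    def __init__(self, max_concurrency, shares):
        self.max_concurrency = max_concurrency  # max useful disk concurrency
        self.shares = dict(shares)
        self.queues = {name: deque() for name in shares}
        self.dispatched = {name: 0 for name in shares}
        self.in_flight = 0

    def submit(self, io_class, request):
        self.queues[io_class].append(request)

    def dispatch(self):
        """Send requests to the disk until the concurrency cap is reached."""
        sent = []
        while self.in_flight < self.max_concurrency:
            ready = [name for name, q in self.queues.items() if q]
            if not ready:
                break
            # Pick the class that has had the least service relative to its shares.
            chosen = min(ready, key=lambda n: self.dispatched[n] / self.shares[n])
            sent.append((chosen, self.queues[chosen].popleft()))
            self.dispatched[chosen] += 1
            self.in_flight += 1
        return sent

    def complete(self):
        """Call when the disk finishes a request, freeing one slot."""
        self.in_flight -= 1

sched = IOSchedulerSketch(max_concurrency=4, shares={"query": 3, "compaction": 1})
for i in range(10):
    sched.submit("query", f"read-{i}")
    sched.submit("compaction", f"chunk-{i}")
print(sched.dispatch())
```

Everything beyond the cap stays in the userspace queues, where the scheduler can still reorder it in favor of latency-sensitive classes.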
• Major component of Scylla since early versions
▪ Central component in The Wall
▪ Getting major improvements for latency workloads in Scylla 2.3
The CPU Scheduler
• Since Scylla 2.0, initial version
▪ disabled by default, AdGear enables it.
▪ enabled in our AWS AMI if using i3 instances.
• 2.2 ships with the full solution
▪ Ships this week!
▪ Enabled by default everywhere.
▪ Much better isolation
The Autonomous Database
[Figure: the Seastar scheduler sits between Memtable, Compaction, Query, Repair, and Commitlog work and the CPU, SSD, and WAN resources; a compaction backlog controller and a memory controller adjust priority.]
The controllers
The controllers - memtable
latency mean : 0.6
latency median : 0.5
latency 95th percentile : 0.8
latency 99th percentile : 3.6 (x 6.0)
latency 99.9th percentile : 4.5 (x 7.5)
latency mean : 0.4
latency median : 0.4
latency 95th percentile : 0.6
latency 99th percentile : 0.8 (x 2.0)
latency 99.9th percentile : 1.9 (x 4.7)
The controllers - compactions
[Figure axes: % CPU time used by compactions; throughput]
workload changes:
- automatic adjustment
- new equilibrium
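One way to picture "automatic adjustment" and "new equilibrium" is a proportional feedback loop: the more compaction backlog accumulates, the more CPU shares compaction gets, until the compaction rate matches the ingest rate. The sketch below assumes a simple proportional controller with made-up constants. `compaction_shares`, `simulate`, `gain`, and `work_per_share` are hypothetical names, and this is not Scylla's actual controller.

```python
def compaction_shares(backlog, gain=10.0, max_shares=1000.0):
    """Proportional control: more backlog means more CPU shares for compaction."""
    return min(max_shares, gain * backlog)

def simulate(ingest_rate, steps=300, work_per_share=0.01):
    """Return (backlog, shares) after the loop has had time to settle."""
    backlog = 0.0
    for _ in range(steps):
        shares = compaction_shares(backlog)
        compacted = min(backlog, shares * work_per_share)
        backlog = backlog - compacted + ingest_rate
    return backlog, compaction_shares(backlog)

for rate in (2.0, 4.0):
    backlog, shares = simulate(rate)
    print(f"ingest={rate}: backlog settles near {backlog:.1f}, shares near {shares:.0f}")
```

When the ingest rate changes, the backlog drifts until the shares it produces balance the new rate, with no operator retuning in between.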
At 100% load:
2ms : 99.9th percentile latency
< 2ms : 99th percentile latency
1ms : 95th percentile latency
The controllers - coming soon
• Scylla 2.2: SizeTiered compactions are controlled.
• Scylla 2.3: All compaction strategies are controlled.
• Repairs
▪ Repairs already respect latencies very well, but are not as fast as they could be. Controllers will help unleash their full potential.
▪ Done: Scylla Enterprise Manager schedules repairs automatically, no human involvement needed
Summary
• Scylla inherits the user-visible architecture from Cassandra, a solution that is known to scale out very well
• Scylla employs a radically different internal architecture, allowing it to scale up as well as out while keeping latencies predictable
• Scylla reduces TCO across the board, by also minimizing operational expenses.
Thank You!
Resources
slideshare.net/ScyllaDB
glauber@scylladb.com (@glcst)
@scylladb
http://bit.ly/2oHAfok
youtube.com/c/scylladb
scylladb.com/blog

AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millisecond Latencies
