This presentation gives a brief overview of a high-load service that stores users' actions. The service handles up to 240k writes per second with a 95th-percentile latency under 2 ms on just a few ScyllaDB nodes packed with HDDs. The hardware setup, cluster specification, live load numbers, and achieved latencies are presented, and the problems we encountered with the HDD setup are described along with possible solutions.
2. Kirill Alekseev
Software Engineering Team Lead, Mail.Ru Group
■ Software Engineering Team Lead @ Mail Service @ Mail.Ru Group
■ Master’s degree in Computer Science in 2019 @ Lomonosov Moscow State University
■ Love coding, music and parties
10. Problems of previous storage
The previous storage had the following problems:
▪ poor scalability
▪ difficult to maintain
▪ lack of must-have DBMS features (secondary indexes, tunable replication, query language, etc.)
11. Scylla as a storage for users’ actions
Cluster and data model overview, hardware specs
12. HTTP API
Serves Mail Service, Cloud Service, and Calendar Service:
▪ write an action by a user
▪ read a list of actions by a user
13. Cluster overview
▪ 2 DCs, 4+5 nodes, RF=1 inside each DC
▪ CL=ONE for writes/reads
▪ Bare metal
• 2 x Intel Xeon Gold 6230
• 6 x 32GB DDR4 2666 MHz
• 2 x SATA SSD 1TB in RAID 1 for commitlogs, 10 x HDD 16TB in RAID 10 for data
• 10 Gb/s Network
14. CREATE TABLE becca.events (
user text, year smallint, week tinyint,
time timeuuid,
project_id smallint, event_id smallint,
ip inet, args map<text, text>,
PRIMARY KEY ((user, year, week, project_id), time)
) WITH CLUSTERING ORDER BY (time DESC)
Data model
▪ Partition is a list of actions sorted by time
▪ Partition is identified by user, year, week and project
▪ Thanks to the promoted index, large partitions can be iterated using the ‘time’ column
▪ We use Time Window Compaction Strategy with a window size of 1 week
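The per-week partitioning above can be sketched as follows; `partition_key` is a hypothetical helper (not from the talk) that derives the (user, year, week, project_id) partition key from an event timestamp, assuming ISO year/week numbering:

```python
from datetime import datetime, timezone

def partition_key(user: str, project_id: int, ts: datetime) -> tuple:
    """Derive the (user, year, week, project_id) partition key for an
    event timestamp, assuming ISO year/week numbering."""
    iso = ts.isocalendar()           # (ISO year, ISO week, ISO weekday)
    return (user, iso[0], iso[1], project_id)

# Events from the same user, project, and calendar week share a partition:
a = partition_key("alice", 7, datetime(2021, 1, 4, tzinfo=timezone.utc))
b = partition_key("alice", 7, datetime(2021, 1, 10, tzinfo=timezone.utc))
assert a == b == ("alice", 2021, 1, 7)
```

With a one-week compaction window, all rows of one partition then fall into a single time window.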
15. Reading by a secondary key
▪ Out-of-the-box secondary indexes give unpredictable performance and lots of random IO
▪ Materialized views require a read-before-update for every write operation (not feasible with HDDs)
▪ Our choice: duplicating writes to a separate table with a different partition key
16. CREATE TABLE becca.events_by_ip (
ip inet, year smallint, week tinyint,
user text, time timeuuid,
project_id smallint, event_id smallint,
args map<text, text>,
PRIMARY KEY ((ip, year, week, project_id), time, user)
) WITH CLUSTERING ORDER BY (time DESC)
Secondary key data model
▪ Requires 2x space and 2x write load
▪ Gives predictable performance on reads
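A minimal sketch of the dual-write approach, with in-memory dicts standing in for the two tables (names and row shapes are illustrative, not the production code):

```python
from collections import defaultdict

# In-memory stand-ins for becca.events and becca.events_by_ip;
# each dict key mirrors the table's partition key.
events = defaultdict(list)        # keyed by (user, year, week, project_id)
events_by_ip = defaultdict(list)  # keyed by (ip, year, week, project_id)

def write_event(user, ip, year, week, project_id, time, event_id, args):
    """Duplicate every write into both tables under different partition
    keys, so reads by user and reads by ip each hit one partition."""
    row = {"user": user, "ip": ip, "time": time,
           "event_id": event_id, "args": args}
    events[(user, year, week, project_id)].append(row)
    events_by_ip[(ip, year, week, project_id)].append(row)

write_event("alice", "10.0.0.1", 2021, 1, 7, "t1", 42, {})
assert events[("alice", 2021, 1, 7)][0]["ip"] == "10.0.0.1"
assert events_by_ip[("10.0.0.1", 2021, 1, 7)][0]["user"] == "alice"
```

This is where the 2x space and 2x write load come from: every event is stored twice, once per access path.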
20. Using Scylla with HDDs
Potential problems and possible solutions to them
21. num-io-queues
▪ num-io-queues stands for the number of threads that interact with the disks
▪ You have to find your sweet spot so that throughput is optimal and latencies are acceptable (Little’s Law)
▪ 10 HDDs in RAID 10 provide a maximum write concurrency of 5, so set num-io-queues to 4–5
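A back-of-the-envelope sketch of this reasoning (hypothetical helpers; the most accurate way, as the talk notes, is still to benchmark): Little's Law says the mean number of in-flight requests is L = λ·W, and RAID 10 mirrors disks in pairs, so 10 HDDs offer about 5 independent write streams:

```python
def raid10_write_concurrency(n_disks: int) -> int:
    """RAID 10 mirrors disks in pairs and every write goes to both
    halves of a pair, so independent write streams = n_disks / 2."""
    return n_disks // 2

def inflight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: mean number of in-flight requests L = lambda * W."""
    return throughput_rps * latency_s

assert raid10_write_concurrency(10) == 5
# e.g. ~500 writes/s per node at ~10 ms per HDD operation keeps ~5 in flight,
# matching the disks' useful concurrency:
assert abs(inflight_requests(500, 0.01) - 5) < 1e-9
```

If the configured queue count exceeds what the disks can serve concurrently, extra requests only wait longer; if it is lower, throughput is left on the table.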
22. Cluster repairs
▪ nodetool repair does not finish in acceptable time (months)
▪ nodetool repair overloads the cluster (read latencies grow 4 times)
▪ We came up with a more IO-efficient way to repair the cluster in our case
24. Cluster repairs
▪ nodetool refresh will finish quickly
▪ compactions of new data will be triggered, but the
cluster will not be overloaded
▪ compactions will finish in a couple of hours
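One hypothetical building block for this recovery path (not shown in the talk, and file-naming details are illustrative): on a healthy replica, select only the SSTable data files written at or after the known failure moment, transfer them to the affected node, and run nodetool refresh:

```python
import os
import tempfile
from pathlib import Path

def sstables_since(data_dir: str, failed_at_epoch: float) -> list:
    """Hypothetical helper: list SSTable data files modified at or after
    the known failure moment, to copy before `nodetool refresh`."""
    return sorted(
        str(p) for p in Path(data_dir).glob("*-Data.db")
        if p.stat().st_mtime >= failed_at_epoch
    )

# Demo on a throwaway directory with fake SSTable file names:
with tempfile.TemporaryDirectory() as d:
    old = Path(d, "md-1-big-Data.db"); old.touch()
    os.utime(old, (1000, 1000))       # written before the failure
    new = Path(d, "md-2-big-Data.db"); new.touch()
    os.utime(new, (2000, 2000))       # written after the failure
    assert sstables_since(d, 1500) == [str(new)]
```

This avoids the full scan a regular repair would do: only data newer than the failure moment is transferred and compacted.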
25. Problems yet to be solved
The following problems are yet to be solved:
▪ latencies grow during compactions, cleanup, bootstrap
▪ latencies grow when a node is down
▪ slow bootstrapping
28. Results
We have achieved the following results:
▪ we have built a high-load service for storing users’ actions with Scylla and HDDs
▪ the service handles 240 000 writes per second with a 95th-percentile latency of 1.5 ms on just a few Scylla nodes
▪ we have implemented an approach to serve reads by a secondary key with predictable performance
29. Future work
In 2021:
▪ third DC
▪ optimize Scylla and clients to get even better latencies
▪ integrate Scylla into more projects
30. Special Thanks
I would like to give special thanks to:
▪ Dmitry Pavlov, Pavel Buchinchik, Igor Platonov
▪ Vladislav Zolotarov, Avi Kivity, Raphael Carvalho
▪ The whole ScyllaDB team
Speaker notes
▪ Let’s talk numbers: the figures do not include bots, only real users
▪ We store every action
▪ A user may want to see what happened in their mailbox
▪ Other examples: investigating possible attacks, sorting out user complaints
▪ The thing we wanted to replace in this scheme was the storage
▪ Writes outnumber reads roughly 1000:1
▪ Explain why the DCs have different numbers of nodes
▪ Prepare an answer for questions about CL=ONE: we want to stay available when a DC goes down, and it is acceptable for us to serve inconsistent reads
▪ All user data is split by weeks and projects
▪ The number of network requests to other nodes is ambiguous
▪ We can’t transform all those writes into reads
▪ We create another table and duplicate all writes there from the application
▪ Latencies are measured from the client
▪ Write RPS = API RPS + replication (RF) + secondary-index duplication
▪ Remind that we are talking about HDDs; ScyllaDB does not recommend HDDs, which is reasonable
▪ In SSD setups num-io-queues will probably be set to some large value, like the number of shards
▪ The most accurate way is to run benchmarks with different values of num-io-queues
▪ Say one node failed and we know the exact moment when it happened
▪ Normally nodetool repair would run a full scan, but we know the exact moment the problem happened
▪ We go to nodes in a different DC, transfer data to the affected node, and run nodetool repair
▪ Refresh finishes soon, then compactions run without overloading the cluster; in our case they finished in 6 hours
▪ Latencies stay in a reasonable range
▪ Resharding is slow but faster than repair and does not overload the cluster
▪ A whole section is dedicated to problems with HDDs: explain why