When a new node is added or removed, Scylla has to transfer part of the existing data from some nodes to their neighbors. When a node fails, Scylla has to repopulate its data from the surviving replicas. These operations are collectively referred to as "streaming" operations, since they simply stream data from one node to another without using the opportunity to also fix discrepancies in the data. This is in contrast with the repair operation, which examines all existing replicas and reconciles their contents. Scylla is moving towards unifying these two operations. In this talk we will discuss why this is considered beneficial, and what other possibilities it opens up for users.
How Scylla Makes Adding and Removing Nodes Faster and Safer
1. How We Make Adding and Removing Nodes Faster and Safer
Asias He, Software Developer
2. Presenter
Asias He, Software Developer
Asias He is a software developer with over 10 years of programming
experience. In the past, he worked on the Debian Project, the Solaris
kernel, KVM virtualization for Linux, and the OSv unikernel. He now
works on Seastar and ScyllaDB.
4. 1. Replace operation
■ To replace a dead node
● Token ring does not change
● Uses the same tokens and host_id as the replaced node
■ Suffers from the resumable issue
● If replace fails after 99% of the data has been streamed,
● running replace again streams all the data again
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
● Streaming from all the replicas would solve the problem,
● but it is too heavy and wasteful to stream the same data more than once
5. Replace operation: Not streaming latest copy
■ What do we expect from a QUORUM read that follows a QUORUM write?
● Strong consistency: Write CL + Read CL > RF
● X2 is newer than X1
Node1 = X2, Node2 = X1, Node3 = X2 → Correct quorum read!
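The strong-consistency condition on this slide is simple arithmetic, so it can be checked mechanically. A minimal sketch (the helper name is hypothetical, not a Scylla API):

```python
def is_strongly_consistent(write_cl: int, read_cl: int, rf: int) -> bool:
    """Strong consistency holds when every read quorum must overlap
    every write quorum, i.e. Write CL + Read CL > RF."""
    return write_cl + read_cl > rf

# RF = 3 with QUORUM (2) writes and QUORUM (2) reads: 2 + 2 > 3 holds.
assert is_strongly_consistent(2, 2, 3)
# RF = 3 with CL = ONE on both sides: the quorums need not overlap.
assert not is_strongly_consistent(1, 1, 3)
```

With RF = 3, a QUORUM write followed by a QUORUM read always touches at least one replica that saw the write, which is why the X2 on the slide is guaranteed to be observed.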
6. Replace operation: Not streaming latest copy
■ Node 3 dies and Node 4 replaces it
■ What do we expect from a QUORUM read that follows a QUORUM write?
Node1 = X2, Node2 = X1, Node4 (replacing) = ?
7. Replace operation: Not streaming latest copy
■ This is what you would expect:
Node1 = X2 --Stream: X2--> Node4 (replacing) = X2
Node2 = X1
8. Replace operation: Not streaming latest copy
■ This is what (may) happen:
Node2 = X1 --Stream: X1--> Node4 (replacing) = X1
Node1 = X2
9. Replace operation: Not streaming latest copy
■ This is what (may) happen:
Node1 = X2, Node2 = X1, Node4 (replacing) = X1 → Wrong quorum read!
10. Replace operation: Not streaming latest copy
■ What if we have to replace again before repair is run?
■ Node 1 dies and Node 5 replaces it
Node2 = X1 --Stream: X1--> Node5 (replacing) = X1
Node4 = X1
New data was lost!
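Slides 5 through 10 can be condensed into a toy model: each replica holds a (value, timestamp) pair, a quorum read returns the newest value among the replicas it contacts, and the replacing node either copies from one arbitrary replica (stream-based) or reconciles all surviving replicas (repair-based). All names below are illustrative, not Scylla internals:

```python
# Toy model of the scenario on the slides. Values carry write timestamps;
# a quorum read returns the newest value among the replicas it contacts.

def newest(values):
    """Pick the (value, timestamp) pair with the latest timestamp."""
    return max(values, key=lambda v: v[1])

node1 = ("X2", 2)   # saw the newer quorum write
node2 = ("X1", 1)   # missed the quorum write
# Node3 (which held X2) dies; Node4 replaces it.

# Stream-based replace: Node4 copies from *one* replica -- maybe the stale one.
node4_stream = node2
# Repair-based replace: Node4 reconciles *all* surviving replicas.
node4_repair = newest([node1, node2])

# Now Node1 dies too, so a quorum read can only reach Node2 and Node4.
assert newest([node2, node4_stream])[0] == "X1"   # wrong: X2 was lost
assert newest([node2, node4_repair])[0] == "X2"   # correct
```

Once Node1 is gone, the stream-based path has permanently lost X2, while the repair-based path still returns it.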
11. 2. Rebuild operation
■ To get all the data this node owns from other nodes
● E.g., rebuilding a new DC
● Token ring does not change
■ Suffers from the resumable issue
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
12. 3. Removenode operation
■ To remove a dead node from the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Suffers from the "not streaming latest copy" issue
● Remaining nodes pull data from other nodes for the new ranges they own
● Streams from only one of the replicas, which might not have the latest copy
13. 4. Decommission operation
■ To remove a live node from the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Does not suffer from the "not streaming latest copy" issue
● The leaving node pushes data to the other nodes that are the new owners
Node3 (leaving) = X2, Y2 --Stream: X2--> Node1 = X2
Node3 (leaving) = X2, Y2 --Stream: Y2--> Node2 = Y2
Node 1: new owner of the range for X2
Node 2: new owner of the range for Y2
Node 3: loses the ranges for X2 and Y2
14. 5. Bootstrap operation
■ To add a new node to the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Does not suffer from the "not streaming latest copy" issue
● The new node pulls data from the existing nodes that are losing the token ranges
Node1 = X2 --Stream: X2--> Node3 (joining) = X2, Y2
Node2 = Y2 --Stream: Y2--> Node3 (joining) = X2, Y2
Node 1: loses the range for X2
Node 2: loses the range for Y2
Node 3: new owner of the ranges for X2 and Y2
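Bootstrap and decommission both move range ownership around the token ring. The single-token, RF = 1 sketch below is a deliberately simplified model of that ownership change (Scylla's real vnode ring is far more involved):

```python
import bisect

# Toy single-token ring with RF = 1: a key is owned by the first node
# whose token is >= the key's token, wrapping around the ring.

def owner(ring, key_token):
    """Return the node owning key_token in a {token: node} ring."""
    tokens = sorted(ring)
    i = bisect.bisect_left(tokens, key_token) % len(tokens)
    return ring[tokens[i]]

ring = {100: "Node1", 200: "Node2"}
assert owner(ring, 150) == "Node2"

# Bootstrap: Node3 joins with token 160 and takes over part of Node2's range.
ring[160] = "Node3"
assert owner(ring, 150) == "Node3"   # Node2 lost this sub-range to Node3
assert owner(ring, 180) == "Node2"   # the rest of Node2's range stays put
```

Decommission is the mirror image: deleting a token from the ring hands its range back to the next node, which is why both operations must move data.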
15. Node operation summary

Node operation | Token ring change | Resumable issue | Latest copy issue
Replace        | No                | Yes             | Yes
Rebuild        | No                | Yes             | Yes
Removenode     | Yes               | Yes             | Yes
Decommission   | Yes               | Yes             | No
Bootstrap      | Yes               | Yes             | No
17. Repair based node operations
The idea: use repair to sync data between replicas instead of streaming.
18. Benefits of repair based node operations
■ Latest copy is guaranteed
● The operating node will always end up with the latest copy
■ Resumable in nature
● Repair skips already synced data very quickly
● E.g., a failed replace operation restarts from where it failed
■ No extra data is streamed
● E.g., rebuilding twice will not stream the same data twice
■ Free repair during node operations
● No need to run repair before/after node operations
● Simplifies the procedure and reduces the chance of mistakes
■ Unified code path for node operations and repair
● Retires the regular streaming code
■ The way you operate the cluster stays the same
● You can still use the nodetool rebuild and decommission commands
19. Isn't repair a heavy operation?
■ Node operations assume the data is already consistent
● The job of making data consistent belongs to repair
● We recommend running repair before node operations
● Repair + streaming won't be faster than doing only repair
■ Old repair is not fast enough (partition level repair)
● Over-streaming problem
● Granularity is ~100 partitions
■ New repair is fast (row level repair, introduced in Scylla 3.1)
● No over-streaming
● Only mismatched rows are synced
● The foundation of repair based node operations
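The row-level idea can be sketched as hashing rows on both sides and shipping only the rows whose hashes differ. This illustrates the principle only; it is not Scylla's actual wire protocol, and the helper names are made up:

```python
import hashlib

def row_hash(key, value):
    """Digest of one row; only hashes cross the wire, not row contents."""
    return hashlib.sha256(f"{key}={value}".encode()).hexdigest()

def rows_to_sync(local, remote_hashes):
    """Keys the peer must send us: missing here, or hash mismatch."""
    local_hashes = {k: row_hash(k, v) for k, v in local.items()}
    return {k for k, h in remote_hashes.items() if local_hashes.get(k) != h}

follower = {"a": 1, "b": 2}           # our replica
leader = {"a": 1, "b": 3, "c": 4}     # peer replica with newer data
leader_hashes = {k: row_hash(k, v) for k, v in leader.items()}

# Only the mismatched row "b" and the missing row "c" are transferred;
# the matching row "a" is skipped, so there is no over-streaming.
assert rows_to_sync(follower, leader_hashes) == {"b", "c"}
```

Skipping already-matching rows is also what makes the operations resumable: re-running after a failure re-hashes quickly and transfers only what is still missing.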
20. Optimizing repair for node operations
■ Increased the internal row buffer size
● From 256 KiB (3.1) to 32 MiB (3.2+)
● Good for cross-DC clusters with high-latency links
■ Improved data transfer efficiency between nodes
● From rpc verb (3.1) to rpc stream (3.2+)
● More efficient for transferring large amounts of data
● Same as regular stream based node operations
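The effect of the larger row buffer can be illustrated with a toy batching function: the bigger the buffer, the fewer batches, and on a high-latency cross-DC link each batch roughly costs a round trip. The helper below is hypothetical, not Scylla code:

```python
def batch_rows(rows, buffer_limit):
    """Group serialized rows into batches of at most buffer_limit bytes."""
    batches, current, size = [], [], 0
    for row in rows:
        if current and size + len(row) > buffer_limit:
            batches.append(current)   # flush the full buffer
            current, size = [], 0
        current.append(row)
        size += len(row)
    if current:
        batches.append(current)
    return batches

rows = [b"x" * 100] * 10  # ten 100-byte rows

# A 256-byte buffer holds 2 rows per batch -> 5 batches (5 round trips);
# a 1024-byte buffer holds all 10 rows -> 1 batch (1 round trip).
assert len(batch_rows(rows, 256)) == 5
assert len(batch_rows(rows, 1024)) == 1
```

On an 80 ms link, cutting the batch count this way directly cuts latency-bound wait time, which is the point of the 256 KiB to 32 MiB change.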
22. Repair vs. stream based rebuild operation 1/2

Rebuild from 1 DC | Method | Space after rebuild | Time to rebuild | Notes
us-east           | Stream | 26 GB               | 573 s           | Streams 10% of vnode ranges at a time
us-east           | Repair | 26 GB               | 368 s           | Repair works on more vnode ranges in parallel; 1.5x less time

● 3 nodes in the cluster, 1 node per DC, 3 DCs
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency between DCs
● Run rebuild on DC us-west-2
23. Repair vs. stream based rebuild operation 2/2

Rebuild from 2 DCs  | Method | Space after rebuild | Time to rebuild | Notes
us-east and eu-west | Stream | 39 GB               | 1500 s          | Two rebuild operations; streams 2x the data; total time 573 + 927 = 1500 s
us-east and eu-west | Repair | 26 GB               | 611 s           | Single rebuild to sync from two DCs; streams no extra data; 2.5x less time

● 3 nodes in the cluster, 3 DCs, 1 node per DC
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency between DCs
● Run rebuild on DC us-west-2
24. Thank you! Stay in touch.
Any questions?
Asias He
asias@scylladb.com
@asias_he
Editor's Notes
Here is an example of not streaming latest copy issue.
We have 3 nodes.
We do a quorum write.
The second node missed the write.
When we do a quorum read. We will still have the correct result.
This is what you would expect.
We would stream the latest copy to the replacing node.
But this is what may happen.
We stream the old copy to the replacing node.
As a result, the quorum read will be wrong.
In this case, node 1 dies and node 5 replaces it.
Unfortunately, node2 streams the old copy to the replacing node.
As a result, the new data was lost!
The first test is to rebuild from 1DC.
3 nodes in the cluster, ...
As we can see, in this test the repair based operation is actually faster.
This is mainly because repair internally works on more vnode ranges in parallel than streaming does. With streaming, we stream 10% of the vnode ranges at a time, and there is only one pending stream plan at a time.
However, even if repair based rebuild were slower in some cases, for instance where the parallelism advantage does not apply, that would be acceptable, because repair does more work and is much, much safer.
The second test is to rebuild from 2 DCs.
For the stream based operation, we have to perform two rebuild operations, which stream twice as much data to the rebuilding node.
For the repair based operation, we only need to perform one rebuild operation to sync from two DCs. It streams no extra data and takes less time; the difference is around 2.5x.