When a new node is added or removed, Scylla has to transfer part of the existing data from some nodes to their neighbors. When a node fails, Scylla has to repopulate its data from the surviving replicas. These operations are collectively referred to as "streaming" operations, since they simply stream data from one node to another without using the opportunity to also fix discrepancies in the data. This is in contrast with the repair operation, which examines all existing replicas and reconciles their contents. Scylla is moving towards unifying these two operations. In this talk we will discuss why this is considered beneficial, and what other possibilities it opens up for users.
How Scylla Makes Adding and Removing Nodes Faster and Safer
1. How We Make Adding and Removing Nodes Faster and Safer
Asias He, Software Developer
2. Presenter
Asias He, Software Developer
Asias He is a software developer with over 10 years of programming
experience. In the past, he worked on the Debian Project, the Solaris
kernel, KVM virtualization for Linux, and the OSv unikernel. He now
works on Seastar and ScyllaDB.
4. 1. Replace operation
■ To replace a dead node
● Token ring does not change
● Uses the same tokens and host_id as the replaced node
■ Suffers from the resumable issue
● If replace fails after 99% of the data has been streamed,
● running replace again streams all the data again
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
● Streaming from all the replicas would solve the problem,
● but it is too heavy and wasteful to stream the same data more than once
5. Replace operation: Not streaming latest copy
■ What do we expect from a QUORUM read that follows a QUORUM write?
● Strong consistency: Write CL + Read CL > RF
● X2 is newer than X1
Node1 = X2, Node2 = X1, Node3 = X2 → Correct quorum read!
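The strong-consistency condition on this slide is simple arithmetic, so it can be checked mechanically. A minimal sketch (the helper name is hypothetical, not a Scylla API):

```python
def is_strongly_consistent(write_cl: int, read_cl: int, rf: int) -> bool:
    """Strong consistency holds when every read quorum must overlap
    every write quorum, i.e. Write CL + Read CL > RF."""
    return write_cl + read_cl > rf

# RF = 3 with QUORUM (2) writes and QUORUM (2) reads: 2 + 2 > 3 holds.
assert is_strongly_consistent(2, 2, 3)
# RF = 3 with CL = ONE on both sides: the quorums need not overlap.
assert not is_strongly_consistent(1, 1, 3)
```

With RF = 3, a QUORUM write followed by a QUORUM read always touches at least one replica that saw the write, which is why the X2 on the slide is guaranteed to be observed.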
6. Replace operation: Not streaming latest copy
■ Node 3 dies and Node 4 replaces it
■ What do we expect from a QUORUM read that follows a QUORUM write?
Node1 = X2, Node2 = X1, Node4 (replacing) = ?
7. Replace operation: Not streaming latest copy
■ This is what you would expect:
Node1 = X2 --Stream: X2--> Node4 (replacing) = X2
Node2 = X1
8. Replace operation: Not streaming latest copy
■ This is what (may) happen:
Node2 = X1 --Stream: X1--> Node4 (replacing) = X1
Node1 = X2
9. Replace operation: Not streaming latest copy
■ This is what (may) happen:
Node1 = X2, Node2 = X1, Node4 (replacing) = X1 → Wrong quorum read!
10. Replace operation: Not streaming latest copy
■ What if we have to replace again before repair is run?
■ Node 1 dies and Node 5 replaces it
Node2 = X1 --Stream: X1--> Node5 (replacing) = X1
Node4 = X1
New data was lost!
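Slides 5 through 10 can be condensed into a toy model: each replica holds a (value, timestamp) pair, a quorum read returns the newest value among the replicas it contacts, and the replacing node either copies from one arbitrary replica (stream-based) or reconciles all surviving replicas (repair-based). All names below are illustrative, not Scylla internals:

```python
# Toy model of the scenario on the slides. Values carry write timestamps;
# a quorum read returns the newest value among the replicas it contacts.

def newest(values):
    """Pick the (value, timestamp) pair with the latest timestamp."""
    return max(values, key=lambda v: v[1])

node1 = ("X2", 2)   # saw the newer quorum write
node2 = ("X1", 1)   # missed the quorum write
# Node3 (which held X2) dies; Node4 replaces it.

# Stream-based replace: Node4 copies from *one* replica -- maybe the stale one.
node4_stream = node2
# Repair-based replace: Node4 reconciles *all* surviving replicas.
node4_repair = newest([node1, node2])

# Now Node1 dies too, so a quorum read can only reach Node2 and Node4.
assert newest([node2, node4_stream])[0] == "X1"   # wrong: X2 was lost
assert newest([node2, node4_repair])[0] == "X2"   # correct
```

Once Node1 is gone, the stream-based path has permanently lost X2, while the repair-based path still returns it.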
11. 2. Rebuild operation
■ To get all the data this node owns from other nodes
● E.g., rebuilding a new DC
● Token ring does not change
■ Suffers from the resumable issue
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
12. 3. Removenode operation
■ To remove a dead node from the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Suffers from the "not streaming latest copy" issue
● Remaining nodes pull data from other nodes for the new ranges they own
● Streams from only one of the replicas, which might not have the latest copy
13. 4. Decommission operation
■ To remove a live node from the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Does not suffer from the "not streaming latest copy" issue
● The leaving node pushes data to the other nodes that are the new owners
Node3 (leaving) = X2, Y2 --Stream: X2--> Node1 = X2
Node3 (leaving) = X2, Y2 --Stream: Y2--> Node2 = Y2
Node 1: new owner of the range for X2
Node 2: new owner of the range for Y2
Node 3: loses the ranges for X2 and Y2
14. 5. Bootstrap operation
■ To add a new node to the cluster
● Token ring changes
■ Suffers from the resumable issue
■ Does not suffer from the "not streaming latest copy" issue
● The new node pulls data from the existing nodes that are losing the token ranges
Node1 = X2 --Stream: X2--> Node3 (joining) = X2, Y2
Node2 = Y2 --Stream: Y2--> Node3 (joining) = X2, Y2
Node 1: loses the range for X2
Node 2: loses the range for Y2
Node 3: new owner of the ranges for X2 and Y2
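Bootstrap and decommission both move range ownership around the token ring. The single-token, RF = 1 sketch below is a deliberately simplified model of that ownership change (Scylla's real vnode ring is far more involved):

```python
import bisect

# Toy single-token ring with RF = 1: a key is owned by the first node
# whose token is >= the key's token, wrapping around the ring.

def owner(ring, key_token):
    """Return the node owning key_token in a {token: node} ring."""
    tokens = sorted(ring)
    i = bisect.bisect_left(tokens, key_token) % len(tokens)
    return ring[tokens[i]]

ring = {100: "Node1", 200: "Node2"}
assert owner(ring, 150) == "Node2"

# Bootstrap: Node3 joins with token 160 and takes over part of Node2's range.
ring[160] = "Node3"
assert owner(ring, 150) == "Node3"   # Node2 lost this sub-range to Node3
assert owner(ring, 180) == "Node2"   # the rest of Node2's range stays put
```

Decommission is the mirror image: deleting a token from the ring hands its range back to the next node, which is why both operations must move data.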
15. Node operation summary

Node operation | Token ring change | Resumable issue | Latest copy issue
Replace        | No                | Yes             | Yes
Rebuild        | No                | Yes             | Yes
Removenode     | Yes               | Yes             | Yes
Decommission   | Yes               | Yes             | No
Bootstrap      | Yes               | Yes             | No
17. Repair based node operations
The idea: use repair to sync data between replicas instead of streaming.
18. Benefits of repair based node operations
■ Latest copy is guaranteed
● The operating node will always end up with the latest copy
■ Resumable in nature
● Repair skips already synced data very quickly
● E.g., a failed replace operation restarts from where it failed
■ No extra data is streamed
● E.g., rebuilding twice will not stream the same data twice
■ Free repair during node operations
● No need to run repair before/after node operations
● Simplifies the procedure and reduces the chance of mistakes
■ Unified code path for node operations and repair
● Retires the regular streaming code
■ The way you operate the cluster stays the same
● You can still use the nodetool rebuild and decommission commands
19. Isn't repair a heavy operation?
■ Node operations assume the data is already consistent
● The job of making data consistent belongs to repair
● We recommend running repair before node operations
● Repair + streaming won't be faster than doing only repair
■ Old repair is not fast enough (partition level repair)
● Over-streaming problem
● Granularity is ~100 partitions
■ New repair is fast (row level repair, introduced in Scylla 3.1)
● No over-streaming
● Only mismatched rows are synced
● The foundation of repair based node operations
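The row-level idea can be sketched as hashing rows on both sides and shipping only the rows whose hashes differ. This illustrates the principle only; it is not Scylla's actual wire protocol, and the helper names are made up:

```python
import hashlib

def row_hash(key, value):
    """Digest of one row; only hashes cross the wire, not row contents."""
    return hashlib.sha256(f"{key}={value}".encode()).hexdigest()

def rows_to_sync(local, remote_hashes):
    """Keys the peer must send us: missing here, or hash mismatch."""
    local_hashes = {k: row_hash(k, v) for k, v in local.items()}
    return {k for k, h in remote_hashes.items() if local_hashes.get(k) != h}

follower = {"a": 1, "b": 2}           # our replica
leader = {"a": 1, "b": 3, "c": 4}     # peer replica with newer data
leader_hashes = {k: row_hash(k, v) for k, v in leader.items()}

# Only the mismatched row "b" and the missing row "c" are transferred;
# the matching row "a" is skipped, so there is no over-streaming.
assert rows_to_sync(follower, leader_hashes) == {"b", "c"}
```

Skipping already-matching rows is also what makes the operations resumable: re-running after a failure re-hashes quickly and transfers only what is still missing.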
20. Optimizing repair for node operations
■ Increased the internal row buffer size
● From 256 KiB (3.1) to 32 MiB (3.2+)
● Good for cross-DC clusters with high-latency links
■ Improved data transfer efficiency between nodes
● From rpc verb (3.1) to rpc stream (3.2+)
● More efficient for transferring large amounts of data
● Same as regular stream based node operations
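The effect of the larger row buffer can be illustrated with a toy batching function: the bigger the buffer, the fewer batches, and on a high-latency cross-DC link each batch roughly costs a round trip. The helper below is hypothetical, not Scylla code:

```python
def batch_rows(rows, buffer_limit):
    """Group serialized rows into batches of at most buffer_limit bytes."""
    batches, current, size = [], [], 0
    for row in rows:
        if current and size + len(row) > buffer_limit:
            batches.append(current)   # flush the full buffer
            current, size = [], 0
        current.append(row)
        size += len(row)
    if current:
        batches.append(current)
    return batches

rows = [b"x" * 100] * 10  # ten 100-byte rows

# A 256-byte buffer holds 2 rows per batch -> 5 batches (5 round trips);
# a 1024-byte buffer holds all 10 rows -> 1 batch (1 round trip).
assert len(batch_rows(rows, 256)) == 5
assert len(batch_rows(rows, 1024)) == 1
```

On an 80 ms link, cutting the batch count this way directly cuts latency-bound wait time, which is the point of the 256 KiB to 32 MiB change.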
22. Repair vs. stream based rebuild operation 1/2

Rebuild from 1 DC | Method | Space after rebuild | Time to rebuild | Notes
us-east           | Stream | 26 GB               | 573 s           | Streams 10% of vnode ranges at a time
us-east           | Repair | 26 GB               | 368 s           | Repair works on more vnode ranges in parallel; 1.5x less time

● 3 nodes in the cluster, 1 node per DC, 3 DCs
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency between DCs
● Run rebuild on DC us-west-2
23. Repair vs. stream based rebuild operation 2/2

Rebuild from 2 DCs  | Method | Space after rebuild | Time to rebuild | Notes
us-east and eu-west | Stream | 39 GB               | 1500 s          | Two rebuild operations; streams 2x the data; total time 573 + 927 = 1500 s
us-east and eu-west | Repair | 26 GB               | 611 s           | Single rebuild to sync from two DCs; streams no extra data; 2.5x less time

● 3 nodes in the cluster, 3 DCs, 1 node per DC
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency between DCs
● Run rebuild on DC us-west-2
24. Thank you! Stay in touch.
Any questions?
Asias He
asias@scylladb.com
@asias_he
Editor's Notes
Here is an example of not streaming latest copy issue.
We have 3 nodes.
We do a quorum write.
The second node missed the write.
When we do a quorum read. We will still have the correct result.
This is what you would expect.
We would stream the latest copy to the replacing node.
But this is what may happen.
We stream the old copy to the replacing node.
As a result, the quorum read will be wrong.
In this case, node 1 dies and node 5 replaces it.
Unfortunately, node2 streams the old copy to the replacing node.
As a result, the new data was lost!
The first test is to rebuild from 1DC.
3 nodes in the cluster, ...
As we can see, in this test the repair based operation is actually faster.
This is mainly because repair internally works on more vnode ranges in parallel than streaming does. With streaming, we stream 10% of the vnode ranges at a time, and there is only one pending stream plan at a time.
However, even if repair based rebuild were slower in some cases, for instance where the parallelism advantage does not apply, that would be acceptable, because repair does more work and is much, much safer.
The second test is to rebuild from 2 DCs.
For the stream based operation, we have to perform two rebuild operations, which stream twice as much data to the rebuilding node.
For the repair based operation, we only need to perform one rebuild operation to sync from two DCs. It streams no extra data and takes less time; the difference is around 2.5x.