These slides were presented at ACM/IFIP/USENIX Middleware 2013, for the paper "Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores". The abstract of the paper is shown below.
Abstract. Distributed key-value stores (KVSs) have become an important component for data management in cloud applications. Since resources can be provisioned on demand in the cloud, there is a need for efficient node bootstrapping and decommissioning, i.e. to incorporate or eliminate the provisioned resources as members of the KVS. This requires the data to be handed over and the load to be shifted across the nodes quickly. However, the data partitioning schemes in current shared-nothing KVSs are not efficient for quick bootstrapping. In this paper, we have designed a middleware layer that provides a decentralised scheme of auto-sharding with two-phase bootstrapping. We experimentally demonstrate that our scheme reduces bootstrap time and improves load balancing, thereby increasing the scalability of the KVS.
1. Efficient Bootstrapping for Decentralised
Shared-nothing Key-value Stores
Han Li, Srikumar Venugopal
Never Stand Still
Faculty of Engineering
Computer Science and Engineering
2. Agenda
• Motivations for Node Bootstrapping
• Research Gap
• Challenges and Solutions
• Evaluations
• Conclusion
School of Computer Science and Engineering
4. Key-value Stores
• The standard component for cloud data management
• Increasing workload → Node bootstrapping
– Incorporate a new, empty node as a member of KVS
• Decreasing workload → Node decommissioning
– Eliminate an existing member with redundant data off the KVS
5. Goals for Efficient Node Bootstrapping
• Minimise the overhead of data movement
– How to partition/store data?
• Balance the load at node bootstrapping
– Both data volume and workload
– How to place/allocate data?
• Maintain data consistency and availability
– How to execute data movement?
6. Background: Storage model
• Shared Storage
– Access same storage
• Distributed file systems
• Networked attached storage
– E.g. GFS, HDFS
– Simply exchange metadata
• Albatross, by S. Das, UCSB
• Shared Nothing
– Use individual local storage
– Decentralised, peer-to-peer
– E.g. Dynamo, Cassandra,
Voldemort, etc.
– Require data movement
• Lightweight solutions?
7. Background: Split-Move Approach
[Figure: key-space ring, "Partition at node bootstrapping". ① Partition B is split into B1 and B2; ② the key-value pairs of B2 are moved from the source nodes (master and slave replicas) to the new node; ③ the redundant data on the source nodes is marked "to be deleted".]
8. Background: Virtual-Node Approach
[Figure: key-space ring, "Partition at system startup". The key space is split into many small partitions (virtual nodes) at startup and distributed across Nodes 1–4; bootstrapping a new node moves a selection of these partitions to it.]
Data skew: e.g., the majority of data is stored in a minority of partitions.
Moving around giant partitions is not a good idea.
9. Research Gap
• Shared Storage vs. Shared Nothing
– Require data movement
• Centralised vs. Decentralised
– Require coordination
• Split-Move vs. Virtual-node Based
– Partition at node bootstrapping is heavyweight
– Partition at system startup causes data skew
• The Gap: A scheme of data partitioning and placement that
improves the efficiency of bootstrapping in shared-nothing KVS
10. Our Solution
• Virtual-node based movement
– Each partition of data is stored in separated files
– Reduced overhead of data movement
– Many existing nodes can participate in bootstrapping
• Automatic sharding
– Split and merge partitions at runtime
– Each partition stores a bounded volume of data
• Easy to reallocate data
• Easy to balance the load
11. The timing for data partitioning
• Shard partitions at writes (insert and delete)
– Split when Size(Pi) ≥ Θmax
– Merge when Size(Pi) + Size(Pi+1) ≤ Θmin
• To avoid oscillation, require Θmax ≥ 2Θmin
[Figure: an insert grows partition B past Θmax, so B is split into B1 and B2; a delete shrinks two adjacent partitions below Θmin, so they are merged.]
12. Challenge 1: Sharding coordination
• Issues
– Totally decentralised
– Each partition has multiple replicas
– Each replica is split or merged locally
• Question
– How to guarantee that all the replicas of a certain partition are simultaneously sharded?
13. Challenge 1: Sharding coordination
• Solution: Election-based coordination
[Figure: election-based coordination in four steps. Step 1: the nodes serving the partition elect a coordinator (e.g., Node-C, first in the sorted candidate list, wins the election). Step 2: the coordinator enforces the split/merge on all replica nodes. Step 3: each node reports "Split/Merge Finish" to the coordinator. Step 4: the coordinator announces the updated data/node mapping to all nodes.]
14. Challenge 2: Node failover during sharding
[Figure: failover flowcharts for coordinator and non-coordinator nodes across Steps 1–4. A node that fails before execution may resurrect within a gossip window, otherwise it is removed from the candidate list; on failure during execution, the remaining nodes elect a new coordinator or continue without one; on timeout, partition Pi is invalidated on the failed node and its replicas are replaced after execution.]
15. Challenge 3: Data consistency during sharding
• Use two sets of replicas at sharding
– Original partition and future partition
– Data from different partitions is stored in separate files
• Approach 1
– Write to future partition, roll back at failure
– Read from both partitions
• Approach 2
– Write to both partitions, abandon future partition at failure
– Read from original partition
16. Challenge 3: Data consistency during movement
• Use a pair of tokens for each partition
– A Boolean token to approve or reject reads/writes
[Figure: token timeline t0–t4 for moving one partition from the source node to the destination (solid = positive, dashed = negative). The source serves reads and writes throughout the transfer; the destination starts accepting writes at t1, completes the data transfer at t2, and starts serving reads at t3; the source releases both tokens at t4.]
17. Replica Placement at Node Bootstrap
• Partition re-allocation and sharding are mutually exclusive;
• Maintain data availability
– Each partition has at least R replicas
• Balance the load (e.g., number of requests)
– Heavily loaded nodes have higher priority to “move out” data
• Balance the data
– Balance the number of partitions across nodes
• Each partition, via sharding, is of similar size
• Two-phase bootstrap
– Phase 1: guarantee R replicas, shift load from heavily loaded nodes
– Phase 2: achieve load and data balancing in low-priority threads
18. Evaluation Setup
• ElasCass: an implementation of auto-sharding, built on Apache Cassandra (version 1.0.5), which uses the Split-Move approach.
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8GB RAM)
• Benchmark: YCSB
• System scale: start from 1 node with 100GB of data and R=2; scale up to 10 nodes.
19. Evaluation – Bootstrap Time
• In Split-Move, the data volume transferred reduces by half from 3 nodes onwards.
• In ElasCass, the data volume transferred remains below 10GB from 2 nodes onwards.
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales.
20. Evaluation – Data Volume
• ElasCass uses a two-phase bootstrap; more data is pulled in at phase 2.
• Imbalance Index = standard deviation / average. Data is well balanced in ElasCass.
• ElasCass occupies less storage space than the Split-Move approach.
21. Evaluation – Query Processing
• ElasCass is scalable, while Split-Move is not.
• Write throughput is higher than read throughput.
• ElasCass has better resource utilisation.
• ElasCass achieves balanced load.
22. Key Takeaways
• Using virtual nodes introduces less overhead in data movement, and reduces the bootstrap time to below 10 minutes.
– Apache Cassandra v1.1 uses virtual nodes
• Consolidating the partitions into bounded ranges simplifies replica placement and facilitates load balancing
– MySQL and MongoDB have started to auto-shard partitions
• A balanced load leads to 80% resource utilisation and throughput that scales with the number of nodes.
23. Contributions and Acknowledgments
• We have designed and implemented a decentralised auto-sharding
scheme that
– consolidates each partition replica into a single transferable unit to provide efficient data movement;
– automatically shards the partitions into bounded ranges to address data skew;
– reduces the time to bootstrap nodes, and achieves better load balancing and better query-processing performance.
• The authors would like to thank Smart Services CRC Pty Ltd for the
grant of Services Aggregation project that made this work possible.
I will start from the picture that we want to achieve in the end. The workload is mostly dynamic in web applications and services. There are peak hours and off-peaks every day and every week. In the Infrastructure-as-a-service cloud, computation resources can be provisioned on demand to deal with increasing workloads, and when the workload decreases, the resources can be dismissed in order to save on economic costs, assuming the billing model is pay-as-you-go.
In the cloud environment, key-value stores have become the standard reference architecture for data management. When the workload rises, key-value stores are required to bootstrap nodes, that is, to incorporate new, empty nodes as members. When the workload declines, existing members holding redundant data can be eliminated; that is node decommissioning.
This work is focused on efficient node bootstrapping in key-value stores. There are a few goals to achieve. First of all, we want to minimise the overhead of data movement, so as to reduce the time required to bootstrap a node. It depends on the way the data is partitioned and stored. Second, after an empty node is added to the system, we want to balance the load, in terms of both data volume and workload each node undertakes. It depends on how the data is allocated amongst the nodes. Third, we need to maintain data consistency and availability while nodes are being added or removed. It depends on how data movement is executed.
The behaviour of node bootstrapping largely depends on the storage model of the system. In the shared-storage model, all the nodes access the same underlying storage. Node bootstrapping is efficient in this model because it does not require data movement; instead, the ownership of data can be taken over simply by exchanging metadata. An example is Albatross, proposed by Das from UCSB. In contrast, in the shared-nothing model, each node of the key-value store uses individual disks for storing the data. These systems are usually deployed in a decentralised manner, and require actual data movement across nodes at bootstrapping. The question is, how to move the data in a lightweight manner?
We reviewed the literature, and there are generally two approaches to data movement in shared-nothing key-value stores. The first is what we call the Split-Move approach, which leverages hash functions to partition the key space based on the number of nodes in the system. When a new node is added to the system, one or a few existing partitions are split to generate extra partitions; for example, Partition B is split into B1 and B2. Then the source nodes scan their local files to move out the data in the form of key-value pairs. The new node, which is the destination, receives the key-value pairs and reassembles the data into files. The redundant data in the source nodes is deleted later. This approach is inefficient in two ways. First, it involves scanning and reassembling, which are heavyweight operations and not appropriate when dealing with large amounts of data. Second, only a limited number of existing nodes can participate in node bootstrapping, mostly because these systems use consistent-hashing-like algorithms.
Alternatively, systems like Dynamo and Voldemort use the virtual-node-based approach, which originated from Chord. In this approach, the key space is split when the system starts, resulting in many small partitions (or virtual nodes, as we call them). Bootstrapping a new node becomes simpler: a list of partitions is selected to move out from the source nodes, and they are stored in the new node as they were. However, the drawback of this approach is that it introduces data-skew problems. Since the key space is partitioned at startup, and data is inserted and deleted at runtime, there is no guarantee that each partition is of similar size. In the worst case, the system may end up storing the majority of its data in a minority of the partitions, and moving partitions holding large amounts of data across nodes is never efficient.
We have seen that data movement is inevitable in decentralised shared-nothing key-value stores. We have also reviewed the Split-Move and virtual-node-based approaches: they are either heavyweight or suffer from data skew. What is lacking is a scheme of data partitioning and placement that handles node bootstrapping in an efficient and timely manner.
We propose our strategy, which builds on the virtual-node approach. Each partition of data should be stored in separate files, so that when we move a partition replica, we simply move the corresponding files; there are no heavyweight operations such as scanning or reassembling key-value pairs. In addition, many existing nodes can participate in bootstrapping a new node, which also improves performance. We also propose to automatically split and merge the partitions at runtime, so that each partition holds a similar volume of data. Once the partitions are consolidated into bounded sizes, it becomes easier to reallocate the data and to balance the load.
Now we talk about the timing for partitioning. Remember that in the Split-Move approach the key space is partitioned at node bootstrapping, while in the virtual-node approach it is partitioned at system startup. In our approach, the key space is partitioned at writes. We have defined an upper bound on the size of each partition: a partition is split when its size reaches this threshold due to data insertion. We have also defined a lower bound on the total size of any two adjacent partitions: two neighbouring partitions are merged if their total size falls below it. Of course, to avoid oscillation, the upper bound should be considerably larger than the lower bound. The idea of our approach is simple, but realising it is non-trivial; there are a few challenges to address.
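The split and merge triggers described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; the threshold values and function names are my own assumptions.

```python
# Hypothetical sketch of the write-time sharding check; the constants and
# names are illustrative, not taken from ElasCass.

THETA_MAX = 64 * 2**20   # upper bound on one partition's size (bytes)
THETA_MIN = 16 * 2**20   # lower bound on the total size of two adjacent partitions
# Needed to avoid oscillation: after a split, each half is about THETA_MAX / 2,
# so the pair sums to THETA_MAX >= 2 * THETA_MIN and will not merge right back.
assert THETA_MAX >= 2 * THETA_MIN

def maybe_shard(sizes, i):
    """Decide the sharding action for partition i after a write.

    `sizes` is the ordered list of partition sizes in bytes.
    Returns 'split', 'merge', or None.
    """
    if sizes[i] >= THETA_MAX:
        return "split"   # the partition grew past the upper bound
    if i + 1 < len(sizes) and sizes[i] + sizes[i + 1] <= THETA_MIN:
        return "merge"   # two adjacent partitions shrank below the lower bound
    return None
```
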
First of all, there is the coordination issue. We assume the system is totally decentralised: each partition has multiple copies on different nodes, so the data files of each replica are split or merged locally within each node. The question is, how do we guarantee that all the replicas of a certain partition are simultaneously sharded?
The answer to this question is to elect a coordinator for each sharding operation. When a partition reaches the upper or lower bound, a poll is triggered, and any node that serves this partition can be voted coordinator. The election is based on the Chubby implementation, in which the coordinator obtains votes from a majority of the participating nodes. Once we have the coordinator, it can enforce a sharding operation amongst all the nodes that serve the partition. When a node finishes sharding locally, it sends an acknowledgement to the coordinator. If the coordinator manages to collect acknowledgements from all the participating nodes, the sharding is considered successful. In the end, the coordinator broadcasts an update of the partition's key range to all the nodes in the system.
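The four steps can be sketched as a small state machine. This is a hypothetical illustration of the coordination protocol just described, not the actual ElasCass code; the class and message names are my own.

```python
# Illustrative sketch of election-based sharding coordination.

class ShardingCoordinator:
    def __init__(self, replica_nodes):
        self.replica_nodes = set(replica_nodes)   # nodes serving the partition
        self.votes = set()
        self.acks = set()

    def majority(self):
        return len(self.replica_nodes) // 2 + 1

    # Step 1: a candidate becomes coordinator once a majority votes for it
    # (Chubby-style election).
    def receive_vote(self, node):
        self.votes.add(node)
        return len(self.votes) >= self.majority()

    # Step 2: the coordinator enforces the split/merge on every replica node.
    def enforce(self):
        return [("SHARD", node) for node in sorted(self.replica_nodes)]

    # Step 3: each node acknowledges when its local split/merge finishes;
    # the operation succeeds only with acknowledgements from ALL participants.
    def receive_ack(self, node):
        self.acks.add(node)
        return self.acks == self.replica_nodes

    # Step 4: on success, broadcast the updated key range to all system nodes.
    def announce(self, all_nodes):
        return [("UPDATE_RANGES", node) for node in sorted(all_nodes)]
```
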
So this is the four-step coordination I described. The second challenge is how to deal with node failure during coordination. Basically, we allow a dead node to resurrect within a time window; even if a dead node does not come back to life, the sharding operation can still proceed and succeed. However, if more than one node fails during sharding, the operation is aborted and re-initiated later. Our paper has a detailed description; I will have to skip this slide due to time constraints. Overall, our solution tolerates the failure of one node during sharding, and also guarantees that no data is lost if the sharding is aborted when multiple nodes fail.
The third challenge is data consistency. There are two aspects. One is related to sharding. Remember that each partition of data is stored in separate files. So we have to use two sets of replicas. One belongs to the partition before sharding, that is the original partition. The other set of replicas belongs to the partition after sharding, that is the future partition. There are two ways to handle reads and writes. One approach is to write to the future partition, and read from both partitions. If, unfortunately, the sharding fails, the future partition is merged back to the original partition so that we can recover the latest updates. The alternative approach is to write to both partitions, so that we can simply abandon the future partition when failure occurs. We prefer the first solution, because the failure of sharding does not happen very often in practice.
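The trade-off between the two approaches can be stated as a routing rule. A minimal sketch, with the replica-set names purely illustrative:

```python
# Which replica set(s) an operation touches while a sharding runs.
# "original" = the partition before sharding, "future" = the partition after.

def route(op, approach):
    """Return the replica sets consulted for op ('read' or 'write')."""
    if approach == 1:
        # Approach 1: write only to the future partition; on failure it is
        # merged back into the original, so reads must consult both sets.
        return {"write": ["future"], "read": ["original", "future"]}[op]
    else:
        # Approach 2: write to both sets, so the future partition can simply
        # be abandoned on failure and reads stay on the original partition.
        return {"write": ["original", "future"], "read": ["original"]}[op]
```

Approach 1 pays on the read path, Approach 2 on the write path; the talk prefers Approach 1 because sharding failures are rare in practice.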
The other aspect of consistency is related to data movement. We propose that each node uses a pair of tokens to control the reads and writes for every partition; each token is a Boolean value. This figure shows how the token values are switched when moving one partition from the source node to the destination: the solid line means positive, while the dashed line means negative. As can be seen, the source node serves both reads and writes during the whole process. The destination node starts accepting writes before the replica is transferred, at time t1, so that it can receive the latest updates during the data transfer. After the replica is successfully accepted by the destination at t2, the destination node also starts serving reads for this partition at time t3, a short while after t2. Once the destination node can serve both reads and writes, the source node is allowed to release both tokens; its data files can be deleted later.
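The token hand-over can be sketched as a timeline of Boolean flips. The encoding below is my own illustration of the figure, not the paper's data structure:

```python
# (time, node, token, value): True = positive (serving), False = negative.
TOKEN_TIMELINE = [
    ("t0", "source", "read",  True),    # source serves reads and writes throughout
    ("t0", "source", "write", True),
    ("t1", "dest",   "write", True),    # dest accepts writes BEFORE the transfer,
                                        # so it catches updates made during the copy
    ("t2", "dest",   "transfer_done", True),
    ("t3", "dest",   "read",  True),    # dest starts serving reads shortly after t2
    ("t4", "source", "read",  False),   # source releases both tokens at the end;
    ("t4", "source", "write", False),   # its data files can be deleted later
]

def tokens_at(time, node):
    """Latest token values for `node` at or before `time` (t0 < t1 < ... < t4)."""
    order = ["t0", "t1", "t2", "t3", "t4"]
    state = {}
    for t, n, token, value in TOKEN_TIMELINE:
        if n == node and order.index(t) <= order.index(time):
            state[token] = value
    return state
```
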
Now that we have the solutions for auto-sharding, let's take a look at the replica placement algorithm that builds on it. There are a few rules to follow. Number one, to make our life easier, we never move and shard the same partition at the same time. Second, we make sure each partition has R copies, where R is typically 3. Third, we try to balance the workload each node undertakes, so if a node is heavily loaded, a number of partitions will be moved off it. Last but not least, we try to balance the number of partitions if no node is heavily loaded. Based on these rules, we propose a two-phase node bootstrapping. In Phase 1, the new node receives a limited number of hotspot replicas from the most loaded nodes, so that it can start serving queries within a few minutes. In Phase 2, the new node continues to pull in more replicas from different nodes until it possesses an average number of partitions. This process can take one or several hours, as long as it does not affect front-end query processing.
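The two-phase policy can be sketched as follows. The data model (a mapping from node to per-partition load values) and the function are hypothetical simplifications; the real algorithm in the paper carries more constraints (replication factor, mutual exclusion with sharding).

```python
# Illustrative sketch of two-phase bootstrap placement for one new node.

def plan_bootstrap(nodes, hotspot_limit):
    """Plan replicas for a new node.

    Phase 1 takes up to `hotspot_limit` of the hottest partitions from the most
    loaded nodes, so the new node can serve queries within minutes. Phase 2
    (run in low-priority background threads) tops the node up to an average
    partition count. Returns (phase1_picks, phase2_count).
    """
    # Rank existing nodes by total load, most loaded first.
    ranked = sorted(nodes.items(), key=lambda kv: -sum(kv[1]))
    phase1 = []
    for node, partitions in ranked:
        for load in sorted(partitions, reverse=True):   # hottest partitions first
            if len(phase1) < hotspot_limit:
                phase1.append((node, load))
    # Phase 2 target: the average partition count once the new node joins.
    total = sum(len(p) for p in nodes.values())
    target = total // (len(nodes) + 1)
    phase2_count = max(0, target - len(phase1))
    return phase1, phase2_count
```
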
We have implemented the scheme of auto-sharding and placement on Apache Cassandra version 1.0.5, which uses the Split-Move approach. We call our system ElasCass, for elastic Cassandra. We set up the evaluation on Amazon EC2, using instances with 2 CPU cores and 8GB of memory, and use the YCSB benchmark to launch queries. We used a small cluster of 10 nodes for two reasons: one, we did not have the resources to scale up to hundreds of nodes; the other, a smaller scale allows a fine-grained analysis of system behaviour.
Firstly, we evaluated the bootstrap performance. Remember the replication number is 2, so the whole data set is copied to the second node in both approaches. From the third node onwards, the behaviours of node bootstrapping are totally different. In ElasCass, the volume of data transferred remains below 10GB at all times, while in Split-Move the volume transferred is reduced by half at each step. We analysed the result and realised that, in Apache Cassandra with Split-Move, the data is always transferred from the node with the largest volume of data, which in this case is the first node in the system. Each time data is moved out of the first node, the key range it serves is reduced by half. However, the data on disk is not deleted at runtime, because the data files are immutable. As a result, the first node always stores the most data, but the actual key range it can offer shrinks exponentially by powers of 2. In contrast, ElasCass does not have this problem: each partition is stored in separate files, so it can simply move the files between the nodes. The time to bootstrap a node is determined by the volume of data transferred; overall, ElasCass is able to bootstrap a node within 10 minutes. The BalanceVolume is the average volume of data each node should store at each scale. So let's look at the data volume at bootstrap.
Remember that ElasCass bootstraps a node in two phases. In the second phase, more data is pulled in by the new node from multiple nodes using a low-priority background thread, which stops when the data volume reaches the balance volume. As a result, data is well balanced in ElasCass. We used an imbalance index to evaluate load balancing, so lower is better. As can be seen in the second figure, with the Split-Move approach the data gets more and more imbalanced as the system scales up, while the imbalance index remains low in ElasCass. In the third figure, we can see that ElasCass uses less storage space than the Split-Move approach. This is because the data files are immutable in Apache Cassandra: even if the key-value pairs are moved out, they cannot be deleted until the files are reconstructed. In contrast, ElasCass can simply delete the data files of any specific partition.
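The imbalance index itself follows directly from the definition given above (standard deviation divided by average; lower is better):

```python
# Imbalance index = population standard deviation / mean of per-node volumes.
import statistics

def imbalance_index(volumes):
    """0 for a perfectly balanced cluster; larger values mean more skew."""
    return statistics.pstdev(volumes) / statistics.fmean(volumes)
```
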
We also used the YCSB benchmark to evaluate the performance of query processing. The two figures on top show the throughput of queries under a Zipfian distribution; the figure on the left is write throughput, while read throughput is on the right. The throughput of ElasCass increases mostly linearly as the system scales up, while with the Split-Move approach the throughput stops improving from 5 nodes onwards. This is because the data volume transferred is far less than the balance volume from 5 nodes, so a new node is unable to serve enough queries without a sufficient key range. The two figures at the bottom depict CPU usage, which we use to indicate the workload each node undertakes. As can be seen in the figure on the left, the CPU usage of ElasCass is above 70% at all times, because it manages to offer higher throughput. In contrast, in Apache Cassandra with Split-Move, the average CPU usage decreases as the system scales up, and the workload becomes more and more imbalanced. This is the penalty of the imbalanced data allocation that we discussed on the previous slide.
These are the takeaways from me. Using the virtual-node approach reduces the overhead of data movement, and thus improves the performance of node bootstrapping. Automatically sharding the partitions at runtime simplifies replica placement and improves load balancing, which leads to 80% resource utilisation and throughput that scales with the number of nodes. After we submitted this paper, we were happy to see that Apache Cassandra, MySQL and MongoDB have started to adopt auto-sharding and virtual nodes. So we are quite convinced that this approach can be used to improve data movement in shared-nothing key-value stores.
In conclusion, we have used auto-sharding to reduce the time to bootstrap nodes, and to achieve better load balancing and better query-processing performance. We would also like to thank Smart Services CRC for the grant that made this work possible.