These slides were presented at ACM/IFIP/USENIX Middleware 2013, for the paper "Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores". The abstract of the paper is shown below.
Abstract. Distributed key-value stores (KVSs) have become an important component for data management in cloud applications. Since resources can be provisioned on demand in the cloud, there is a need for efficient node bootstrapping and decommissioning, i.e. to incorporate or eliminate the provisioned resources as members of the KVS. This requires the data to be handed over and the load to be shifted across the nodes quickly. However, the data partitioning schemes in current shared-nothing KVSs are not efficient for quick bootstrapping. In this paper, we have designed a middleware layer that provides a decentralised scheme of auto-sharding with two-phase bootstrapping. We experimentally demonstrate that our scheme reduces bootstrap time and improves load balancing, thereby increasing the scalability of the KVS.
1. Efficient Bootstrapping for Decentralised
Shared-nothing Key-value Stores
Han Li, Srikumar Venugopal
Never Stand Still
Faculty of Engineering
Computer Science and Engineering
2. Agenda
• Motivations for Node Bootstrapping
• Research Gap
• Challenges and Solutions
• Evaluations
• Conclusion
School of Computer Science and Engineering
4. Key-value Stores
• The standard component for cloud data management
• Increasing workload → Node bootstrapping
– Incorporate a new, empty node as a member of KVS
• Decreasing workload → Node decommissioning
– Eliminate an existing member with redundant data off the KVS
5. Goals for Efficient Node Bootstrapping
• Minimise the overhead of data movement
– How to partition/store data?
• Balance the load at node bootstrapping
– Both data volume and workload
– How to place/allocate data?
• Maintain data consistency and availability
– How to execute data movement?
6. Background: Storage model
• Shared Storage
– Access same storage
• Distributed file systems
• Networked attached storage
– E.g. GFS, HDFS
– Simply exchange metadata
• Albatross, by S. Das, UCSB
• Shared Nothing
– Use individual local storage
– Decentralised, peer-to-peer
– E.g. Dynamo, Cassandra,
Voldemort, etc.
– Require data movement
• Lightweight solutions?
7. Background: Split-Move Approach
[Figure: key-space ring, "Partition at node bootstrapping". ① Partition B is split into B1 and B2; ② the key-value pairs of B2 are moved from the source nodes (master and slave replicas) to the new node; ③ the redundant data on the source nodes is marked "to be deleted".]
8. Background: Virtual-Node Approach
[Figure: key-space ring, "Partition at system startup". The key space is split into many small partitions (virtual nodes) at startup and distributed across Nodes 1–4; bootstrapping a new node moves a selection of these partitions to it.]
Data skew: e.g., the majority of data is stored in a minority of partitions.
Moving around giant partitions is not a good idea.
9. Research Gap
• Shared Storage vs. Shared Nothing
– Require data movement
• Centralised vs. Decentralised
– Require coordination
• Split-Move vs. Virtual-node Based
– Partition at node bootstrapping is heavyweight
– Partition at system startup causes data skew
• The Gap: A scheme of data partitioning and placement that
improves the efficiency of bootstrapping in shared-nothing KVS
10. Our Solution
• Virtual-node based movement
– Each partition of data is stored in separated files
– Reduced overhead of data movement
– Many existing nodes can participate in bootstrapping
• Automatic sharding
– Split and merge partitions at runtime
– Each partition stores a bounded volume of data
• Easy to reallocate data
• Easy to balance the load
11. The timing for data partitioning
• Shard partitions at writes (insert and delete)
– Split when Size(Pi) ≥ Θmax
– Merge when Size(Pi) + Size(Pi+1) ≤ Θmin
• To avoid oscillation, require Θmax ≥ 2Θmin
[Figure: an insert grows partition B past Θmax, so B is split into B1 and B2; a delete shrinks two adjacent partitions below Θmin, so they are merged.]
12. Challenge 1: Sharding coordination
• Issues
– Totally decentralised
– Each partition has multiple replicas
– Each replica is split or merged locally
• Question
– How to guarantee that all the replicas of a certain partition are simultaneously sharded?
13. Challenge 1: Sharding coordination
• Solution: Election-based coordination
[Figure: election-based coordination in four steps. Step 1: the nodes serving the partition elect a coordinator (e.g., Node-C, first in the sorted candidate list, wins the election). Step 2: the coordinator enforces the split/merge on all replica nodes. Step 3: each node reports "Split/Merge Finish" to the coordinator. Step 4: the coordinator announces the updated data/node mapping to all nodes.]
14. Challenge 2: Node failover during sharding
[Figure: failover flowcharts for coordinator and non-coordinator nodes across Steps 1–4. A node that fails before execution may resurrect within a gossip window, otherwise it is removed from the candidate list; on failure during execution, the remaining nodes elect a new coordinator or continue without one; on timeout, partition Pi is invalidated on the failed node and its replicas are replaced after execution.]
15. Challenge 3: Data consistency during sharding
• Use two sets of replicas at sharding
– Original partition and future partition
– Data from different partitions is stored in separate files
• Approach 1
– Write to future partition, roll back at failure
– Read from both partitions
• Approach 2
– Write to both partitions, abandon future partition at failure
– Read from original partition
16. Challenge 3: Data consistency during movement
• Use a pair of tokens for each partition
– A Boolean token to approve or reject reads/writes
[Figure: token timeline t0–t4 for moving one partition from the source node to the destination (solid = positive, dashed = negative). The source serves reads and writes throughout the transfer; the destination starts accepting writes at t1, completes the data transfer at t2, and starts serving reads at t3; the source releases both tokens at t4.]
17. Replica Placement at Node Bootstrap
• Partition re-allocation and sharding are mutually exclusive;
• Maintain data availability
– Each partition has at least R replicas
• Balance the load (e.g., number of requests)
– Heavily loaded nodes have higher priority to “move out” data
• Balance the data
– Balance the number of partitions across nodes
• Each partition, via sharding, is of similar size
• Two-phase bootstrap
– Phase 1: guarantee R replicas, shift load from heavily loaded nodes
– Phase 2: achieve load and data balancing in low-priority threads
18. Evaluation Setup
• ElasCass: an implementation of auto-sharding, built on Apache Cassandra (version 1.0.5), which uses the Split-Move approach.
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8GB RAM)
• Benchmark: YCSB
• System scale: start from 1 node with 100GB of data and R=2; scale up to 10 nodes.
19. Evaluation – Bootstrap Time
• In Split-Move, the data volume transferred reduces by half from 3 nodes onwards.
• In ElasCass, the data volume transferred remains below 10GB from 2 nodes onwards.
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales.
20. Evaluation – Data Volume
• ElasCass uses a two-phase bootstrap; more data is pulled in at phase 2.
• Imbalance Index = standard deviation / average. Data is well balanced in ElasCass.
• ElasCass occupies less storage space than the Split-Move approach.
21. Evaluation – Query Processing
• ElasCass is scalable, while Split-Move is not.
• Write throughput is higher than read throughput.
• ElasCass has better resource utilisation.
• ElasCass achieves balanced load.
22. Key Takeaways
• Using virtual nodes introduces less overhead in data movement, and reduces the bootstrap time to below 10 minutes.
– Apache Cassandra v1.1 uses virtual nodes
• Consolidating the partitions into bounded ranges simplifies replica placement and facilitates load balancing
– MySQL and MongoDB have started to auto-shard partitions
• A balanced load leads to 80% resource utilisation and throughput that scales with the number of nodes.
23. Contributions and Acknowledgments
• We have designed and implemented a decentralised auto-sharding
scheme that
– consolidates each partition replica into a single transferable unit to provide efficient data movement;
– automatically shards the partitions into bounded ranges to address data skew;
– reduces the time to bootstrap nodes, and achieves better load balancing and better query-processing performance.
• The authors would like to thank Smart Services CRC Pty Ltd for the
grant of Services Aggregation project that made this work possible.
I will start from the picture that we want to achieve in the end. The workload is mostly dynamic in web applications and services. There are peak hours and off-peaks every day and every week. In the Infrastructure-as-a-service cloud, computation resources can be provisioned on demand to deal with increasing workloads, and when the workload decreases, the resources can be dismissed in order to save on economic costs, assuming the billing model is pay-as-you-go.
In the cloud environment, key-value stores have become the standard reference architecture for data management. When the workload rises, key-value stores are required to bootstrap nodes, that is, to incorporate new, empty nodes as members. When the workload declines, existing members holding redundant data can be eliminated; that is node decommissioning.
This work is focused on efficient node bootstrapping in key-value stores. There are a few goals to achieve. First of all, we want to minimise the overhead of data movement, so as to reduce the time required to bootstrap a node. It depends on the way the data is partitioned and stored. Second, after an empty node is added to the system, we want to balance the load, in terms of both data volume and workload each node undertakes. It depends on how the data is allocated amongst the nodes. Third, we need to maintain data consistency and availability while nodes are being added or removed. It depends on how data movement is executed.
The behaviour of node bootstrapping largely depends on the storage model of the system. In the shared-storage model, all the nodes access the same underlying storage. Node bootstrapping is efficient in this model because it does not require data movement; instead, the ownership of data can be taken over simply by exchanging metadata. An example is Albatross, proposed by Das from UCSB. In contrast, in the shared-nothing model, each node of the key-value store uses individual disks for storing the data. These systems are usually deployed in a decentralised manner, and require actual data movement across nodes at bootstrapping. The question is, how to move the data in a lightweight manner?
We reviewed the literature, and there are generally two approaches to data movement in shared-nothing key-value stores. The first is what we call the Split-Move approach, which leverages hash functions to partition the key space based on the number of nodes in the system. When a new node is added to the system, one or a few existing partitions are split to generate extra partitions; for example, Partition B is split into B1 and B2. Then the source nodes scan their local files to move out the data in the form of key-value pairs. The new node, which is the destination, receives the key-value pairs and reassembles the data into files. The redundant data in the source nodes is deleted later. This approach is inefficient in two ways. First, it involves scanning and reassembling, which are heavyweight operations and not appropriate when dealing with large amounts of data. Second, only a limited number of existing nodes can participate in node bootstrapping, mostly because these systems use consistent-hashing-like algorithms.
Alternatively, systems like Dynamo and Voldemort use the virtual-node-based approach, which originated from Chord. In this approach, the key space is split when the system starts, resulting in many small partitions (or virtual nodes, as we call them). Bootstrapping a new node becomes simpler: a list of partitions is selected to move out from the source nodes, and they are stored in the new node as they were. However, the drawback of this approach is that it introduces data-skew problems. Since the key space is partitioned at startup, and data is inserted and deleted at runtime, there is no guarantee that each partition is of similar size. In the worst case, the system may end up storing the majority of its data in a minority of the partitions, and moving partitions holding large amounts of data across nodes is never efficient.
We have seen that data movement is inevitable in decentralised shared-nothing key-value stores. We have also reviewed the Split-Move and virtual-node-based approaches: they are either heavyweight or suffer from data skew. What is lacking is a scheme of data partitioning and placement that handles node bootstrapping in an efficient and timely manner.
We propose our strategy, which builds on the virtual-node approach. Each partition of data should be stored in separate files, so that when we move a partition replica, we simply move the corresponding files; there are no heavyweight operations such as scanning or reassembling key-value pairs. In addition, many existing nodes can participate in bootstrapping a new node, which also improves performance. We also propose to automatically split and merge the partitions at runtime, so that each partition holds a similar volume of data. Once the partitions are consolidated into bounded sizes, it becomes easier to reallocate the data and to balance the load.
Now we talk about the timing for partitioning. Remember that in the Split-Move approach the key space is partitioned at node bootstrapping, while in the virtual-node approach it is partitioned at system startup. In our approach, the key space is partitioned at writes. We have defined an upper bound on the size of each partition: a partition is split when its size reaches this threshold due to data insertion. We have also defined a lower bound on the total size of any two adjacent partitions: two neighbouring partitions are merged if their total size falls below it. Of course, to avoid oscillation, the upper bound should be considerably larger than the lower bound. The idea of our approach is simple, but realising it is non-trivial; there are a few challenges to address.
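The split and merge triggers described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; the threshold values and function names are my own assumptions.

```python
# Hypothetical sketch of the write-time sharding check; the constants and
# names are illustrative, not taken from ElasCass.

THETA_MAX = 64 * 2**20   # upper bound on one partition's size (bytes)
THETA_MIN = 16 * 2**20   # lower bound on the total size of two adjacent partitions
# Needed to avoid oscillation: after a split, each half is about THETA_MAX / 2,
# so the pair sums to THETA_MAX >= 2 * THETA_MIN and will not merge right back.
assert THETA_MAX >= 2 * THETA_MIN

def maybe_shard(sizes, i):
    """Decide the sharding action for partition i after a write.

    `sizes` is the ordered list of partition sizes in bytes.
    Returns 'split', 'merge', or None.
    """
    if sizes[i] >= THETA_MAX:
        return "split"   # the partition grew past the upper bound
    if i + 1 < len(sizes) and sizes[i] + sizes[i + 1] <= THETA_MIN:
        return "merge"   # two adjacent partitions shrank below the lower bound
    return None
```
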
First of all, there is the coordination issue. We assume the system is totally decentralised: each partition has multiple copies on different nodes, so the data files of each replica are split or merged locally within each node. The question is, how do we guarantee that all the replicas of a certain partition are simultaneously sharded?
The answer to this question is to elect a coordinator for each sharding operation. When a partition reaches the upper or lower bound, a poll is triggered, and any node that serves this partition can be voted coordinator. The election is based on the Chubby implementation, in which the coordinator obtains votes from a majority of the participating nodes. Once we have the coordinator, it can enforce a sharding operation amongst all the nodes that serve the partition. When a node finishes sharding locally, it sends an acknowledgement to the coordinator. If the coordinator manages to collect acknowledgements from all the participating nodes, the sharding is considered successful. In the end, the coordinator broadcasts an update of the partition's key range to all the nodes in the system.
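The four steps can be sketched as a small state machine. This is a hypothetical illustration of the coordination protocol just described, not the actual ElasCass code; the class and message names are my own.

```python
# Illustrative sketch of election-based sharding coordination.

class ShardingCoordinator:
    def __init__(self, replica_nodes):
        self.replica_nodes = set(replica_nodes)   # nodes serving the partition
        self.votes = set()
        self.acks = set()

    def majority(self):
        return len(self.replica_nodes) // 2 + 1

    # Step 1: a candidate becomes coordinator once a majority votes for it
    # (Chubby-style election).
    def receive_vote(self, node):
        self.votes.add(node)
        return len(self.votes) >= self.majority()

    # Step 2: the coordinator enforces the split/merge on every replica node.
    def enforce(self):
        return [("SHARD", node) for node in sorted(self.replica_nodes)]

    # Step 3: each node acknowledges when its local split/merge finishes;
    # the operation succeeds only with acknowledgements from ALL participants.
    def receive_ack(self, node):
        self.acks.add(node)
        return self.acks == self.replica_nodes

    # Step 4: on success, broadcast the updated key range to all system nodes.
    def announce(self, all_nodes):
        return [("UPDATE_RANGES", node) for node in sorted(all_nodes)]
```
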
So this is the four-step coordination I described. The second challenge is how to deal with node failure during coordination. Basically, we allow a dead node to resurrect within a time window; even if a dead node does not come back to life, the sharding operation can still proceed and succeed. However, if more than one node fails during sharding, the operation is aborted and re-initiated later. Our paper has a detailed description; I will have to skip this slide due to time constraints. Overall, our solution tolerates the failure of one node during sharding, and also guarantees that no data is lost if the sharding is aborted when multiple nodes fail.
The third challenge is data consistency. There are two aspects. One is related to sharding. Remember that each partition of data is stored in separate files. So we have to use two sets of replicas. One belongs to the partition before sharding, that is the original partition. The other set of replicas belongs to the partition after sharding, that is the future partition. There are two ways to handle reads and writes. One approach is to write to the future partition, and read from both partitions. If, unfortunately, the sharding fails, the future partition is merged back to the original partition so that we can recover the latest updates. The alternative approach is to write to both partitions, so that we can simply abandon the future partition when failure occurs. We prefer the first solution, because the failure of sharding does not happen very often in practice.
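The trade-off between the two approaches can be stated as a routing rule. A minimal sketch, with the replica-set names purely illustrative:

```python
# Which replica set(s) an operation touches while a sharding runs.
# "original" = the partition before sharding, "future" = the partition after.

def route(op, approach):
    """Return the replica sets consulted for op ('read' or 'write')."""
    if approach == 1:
        # Approach 1: write only to the future partition; on failure it is
        # merged back into the original, so reads must consult both sets.
        return {"write": ["future"], "read": ["original", "future"]}[op]
    else:
        # Approach 2: write to both sets, so the future partition can simply
        # be abandoned on failure and reads stay on the original partition.
        return {"write": ["original", "future"], "read": ["original"]}[op]
```

Approach 1 pays on the read path, Approach 2 on the write path; the talk prefers Approach 1 because sharding failures are rare in practice.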
The other aspect of consistency is related to data movement. We propose that each node uses a pair of tokens to control the reads and writes for every partition; each token is a Boolean value. This figure shows how the token values are switched when moving one partition from the source node to the destination: the solid line means positive, while the dashed line means negative. As can be seen, the source node serves both reads and writes during the whole process. The destination node starts accepting writes before the replica is transferred, at time t1, so that it can receive the latest updates during the data transfer. After the replica is successfully accepted by the destination at t2, the destination node also starts serving reads for this partition at time t3, a short while after t2. Once the destination node can serve both reads and writes, the source node is allowed to release both tokens; its data files can be deleted later.
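The token hand-over can be sketched as a timeline of Boolean flips. The encoding below is my own illustration of the figure, not the paper's data structure:

```python
# (time, node, token, value): True = positive (serving), False = negative.
TOKEN_TIMELINE = [
    ("t0", "source", "read",  True),    # source serves reads and writes throughout
    ("t0", "source", "write", True),
    ("t1", "dest",   "write", True),    # dest accepts writes BEFORE the transfer,
                                        # so it catches updates made during the copy
    ("t2", "dest",   "transfer_done", True),
    ("t3", "dest",   "read",  True),    # dest starts serving reads shortly after t2
    ("t4", "source", "read",  False),   # source releases both tokens at the end;
    ("t4", "source", "write", False),   # its data files can be deleted later
]

def tokens_at(time, node):
    """Latest token values for `node` at or before `time` (t0 < t1 < ... < t4)."""
    order = ["t0", "t1", "t2", "t3", "t4"]
    state = {}
    for t, n, token, value in TOKEN_TIMELINE:
        if n == node and order.index(t) <= order.index(time):
            state[token] = value
    return state
```
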
Now that we have the solutions for auto-sharding, let's take a look at the replica placement algorithm that builds on it. There are a few rules to follow. Number one, to make our life easier, we never move and shard the same partition at the same time. Second, we make sure each partition has R copies, where R is typically 3. Third, we try to balance the workload each node undertakes, so if a node is heavily loaded, a number of partitions will be moved off it. Last but not least, we try to balance the number of partitions if no node is heavily loaded. Based on these rules, we propose a two-phase node bootstrapping. In Phase 1, the new node receives a limited number of hotspot replicas from the most loaded nodes, so that it can start serving queries within a few minutes. In Phase 2, the new node continues to pull in more replicas from different nodes until it possesses an average number of partitions. This process can take one or several hours, as long as it does not affect front-end query processing.
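The two-phase policy can be sketched as follows. The data model (a mapping from node to per-partition load values) and the function are hypothetical simplifications; the real algorithm in the paper carries more constraints (replication factor, mutual exclusion with sharding).

```python
# Illustrative sketch of two-phase bootstrap placement for one new node.

def plan_bootstrap(nodes, hotspot_limit):
    """Plan replicas for a new node.

    Phase 1 takes up to `hotspot_limit` of the hottest partitions from the most
    loaded nodes, so the new node can serve queries within minutes. Phase 2
    (run in low-priority background threads) tops the node up to an average
    partition count. Returns (phase1_picks, phase2_count).
    """
    # Rank existing nodes by total load, most loaded first.
    ranked = sorted(nodes.items(), key=lambda kv: -sum(kv[1]))
    phase1 = []
    for node, partitions in ranked:
        for load in sorted(partitions, reverse=True):   # hottest partitions first
            if len(phase1) < hotspot_limit:
                phase1.append((node, load))
    # Phase 2 target: the average partition count once the new node joins.
    total = sum(len(p) for p in nodes.values())
    target = total // (len(nodes) + 1)
    phase2_count = max(0, target - len(phase1))
    return phase1, phase2_count
```
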
We have implemented the scheme of auto-sharding and placement on Apache Cassandra version 1.0.5, which uses the Split-Move approach. We call our system ElasCass, for elastic Cassandra. We set up the evaluation on Amazon EC2, using instances with 2 CPU cores and 8GB of memory, and use the YCSB benchmark to launch queries. We used a small cluster of 10 nodes for two reasons: one, we did not have the resources to scale up to hundreds of nodes; the other, a smaller scale allows a fine-grained analysis of system behaviour.
Firstly, we evaluated the bootstrap performance. Remember the replication number is 2, so the whole data set is copied to the second node in both approaches. From the third node onwards, the behaviours of node bootstrapping are totally different. In ElasCass, the volume of data transferred remains below 10GB at all times, while in Split-Move the volume transferred is reduced by half at each step. We analysed the result and realised that, in Apache Cassandra with Split-Move, the data is always transferred from the node with the largest volume of data, which in this case is the first node in the system. Each time data is moved out of the first node, the key range it serves is reduced by half. However, the data on disk is not deleted at runtime, because the data files are immutable. As a result, the first node always stores the most data, but the actual key range it can offer shrinks exponentially by powers of 2. In contrast, ElasCass does not have this problem: each partition is stored in separate files, so it can simply move the files between the nodes. The time to bootstrap a node is determined by the volume of data transferred; overall, ElasCass is able to bootstrap a node within 10 minutes. The BalanceVolume is the average volume of data each node should store at each scale. So let's look at the data volume at bootstrap.
Remember that ElasCass bootstraps a node in two phases. In the second phase, more data is pulled in by the new node from multiple nodes using a low-priority background thread, which stops when the data volume reaches the balance volume. As a result, data is well balanced in ElasCass. We used an imbalance index to evaluate load balancing, so lower is better. As can be seen in the second figure, with the Split-Move approach the data gets more and more imbalanced as the system scales up, while the imbalance index remains low in ElasCass. In the third figure, we can see that ElasCass uses less storage space than the Split-Move approach. This is because the data files are immutable in Apache Cassandra: even if the key-value pairs are moved out, they cannot be deleted until the files are reconstructed. In contrast, ElasCass can simply delete the data files of any specific partition.
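The imbalance index itself follows directly from the definition given above (standard deviation divided by average; lower is better):

```python
# Imbalance index = population standard deviation / mean of per-node volumes.
import statistics

def imbalance_index(volumes):
    """0 for a perfectly balanced cluster; larger values mean more skew."""
    return statistics.pstdev(volumes) / statistics.fmean(volumes)
```
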
We also used the YCSB benchmark to evaluate the performance of query processing. The two figures on top show the throughput of queries under a Zipfian distribution; the figure on the left is write throughput, while read throughput is on the right. The throughput of ElasCass increases mostly linearly as the system scales up, while with the Split-Move approach the throughput stops improving from 5 nodes onwards. This is because the data volume transferred is far less than the balance volume from 5 nodes, so a new node is unable to serve enough queries without a sufficient key range. The two figures at the bottom depict CPU usage, which we use to indicate the workload each node undertakes. As can be seen in the figure on the left, the CPU usage of ElasCass is above 70% at all times, because it manages to offer higher throughput. In contrast, in Apache Cassandra with Split-Move, the average CPU usage decreases as the system scales up, and the workload becomes more and more imbalanced. This is the penalty of the imbalanced data allocation that we discussed on the previous slide.
These are the takeaways from me. Using the virtual-node approach reduces the overhead of data movement, and thus improves the performance of node bootstrapping. Automatically sharding the partitions at runtime simplifies replica placement and improves load balancing, which leads to 80% resource utilisation and throughput that scales with the number of nodes. After we submitted this paper, we were happy to see that Apache Cassandra, MySQL and MongoDB have started to adopt auto-sharding and virtual nodes. So we are quite convinced that this approach can be used to improve data movement in shared-nothing key-value stores.
In conclusion, we have used auto-sharding to reduce the time to bootstrap nodes, and to achieve better load balancing and better query-processing performance. We would also like to thank Smart Services CRC for the grant that made this work possible.