A request skew aware heterogeneous distributed storage system based on Cassandra

Zhen Ye, Shanping Li
Department of Computer Science and Technology
Zhejiang University
Hangzhou, China
{yezhen, shan}

978-1-4244-9283-1/11/$26.00 ©2011 IEEE

Abstract: Many distributed storage systems have been proposed to provide high scalability and high availability for modern web applications. However, most of them consider only data skew, while request skew is also widespread and needs to be considered as well. In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra, a well-known NoSQL database that aims to manage very large scale data without a single point of failure. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically shifting the node that the client application connects to, choosing the node that can handle the maximum number of skewed requests; 2) when balancing the data load among all nodes within the cluster, we take their storage capacity into consideration. Our experiments show that approach 1) reduces forward read requests by about 25% and forward write requests by about 15%, and that approach 2) visibly balances the storage utilization of each node.

Keywords: Distributed Storage System; NoSQL Database; Request Skew; Heterogeneous environment

I. INTRODUCTION

Modern web applications often have to deal with very large data sets. For most of these applications, the most desired characteristics are high scalability, high availability and the ability to respond quickly even when hundreds of thousands of requests arrive concurrently. Compared with these, strong data consistency and strict transaction support, such as ACID, can sometimes be weakened or even dropped. Traditional relational databases are not well suited to these applications: most of them are transaction based and are hard to scale to a very large size, or can do so only at very high cost. In addition, relational databases provide a large feature set, much of which goes unused and only adds cost and complexity [6, 7].

In the past few years, many scalable NoSQL distributed storage systems have therefore been proposed, e.g. Google’s Bigtable [2], Amazon’s Dynamo [3] and Yahoo!’s PNUTS [4]. According to the CAP theorem [8], a distributed system cannot provide high availability while still maintaining strong consistency. For this reason, most of these systems relax strong consistency and strict transactions to gain scalability and availability. They can thus scale out dynamically to Internet size, are highly resilient to node and network failures, and serve massive access well. Data partitioning and data replication are the two techniques these systems typically use to achieve these goals. Data partitioning improves scalability and performance, while data replication is a good way to obtain high availability and to balance the load.

Most Internet-scale applications exhibit highly skewed workloads, including data skew and request skew, which the system should distribute evenly over its nodes in order to improve their usability. However, data skew and request skew may conflict. Sometimes the data is already distributed evenly while the request load is still skewed towards one or a few nodes, or vice versa. Although most systems claim to be aware of these skews, many of them actually only balance the data across nodes. Balancing both request skew and data skew is still a challenging issue in many systems. In addition, when balancing data, many systems assume that all nodes are identical, whereas in a distributed environment different nodes may have different capacities, which should be taken into consideration when designing the system.

In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra [5]. Cassandra is a NoSQL database for managing very large amounts of data spread across many commodity servers, while providing a highly available service with no single point of failure. However, Cassandra only provides a data skew aware solution and assumes that there is no request skew. It also assumes that all nodes have the same storage capacity. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically changing the node that the client application is connected to, choosing the node that can handle the maximum number of skewed requests; 2) when balancing the storage load among all nodes within the cluster, we take their storage capacity into consideration to maximize utilization.

The remainder of the paper is organized as follows. Section II introduces the related work. Section III presents the background of Cassandra and how we improve it. Section IV studies different experiments. Finally, we present conclusions and future work in Section V.

II. RELATED WORK

Bigtable, Dynamo and PNUTS are all widely studied and cited in the distributed storage system domain. Bigtable is implemented as a sparse, multidimensional sorted map, which is richer than the key-value data model while remaining simple. It
uses Google File System (GFS) [1] to store data and log information. GFS divides files into fixed-size chunks and balances the data load by distributing those chunks evenly over different nodes. However, GFS uses a primary copy, pessimistic algorithm to synchronize data between replica nodes, which makes it scale and perform poorly in write intensive and wide area scenarios.

Dynamo is a highly available, eventually consistent key-value storage system that uses Consistent Hashing to increase scalability, Vector Clocks for reconciliation, and Sloppy Quorum and Hinted Handoff to handle temporary failures. In Dynamo, all the nodes and the keys of data items are hashed and the output values are mapped onto a “ring”. A node’s output represents the position of this node in the “ring”, while a key’s output decides on which node the item will be stored. In order to balance the data load, Dynamo maps one real node onto many virtual nodes, each of which occupies one position in the “ring”. Different nodes may have different numbers of virtual nodes, based on their capacity.

PNUTS, a geographically distributed database system, uses pub/sub messaging to ensure the order of updates for one key and provides per-record timeline consistency, which lies between strict serialization and eventual consistency. Similar to Bigtable, PNUTS uses one centralized router to look up the right node for a specified key and divides data into many fixed-size tablets. It can move tablets from overloaded nodes to lightly loaded nodes to balance the data load, and it dynamically changes the node the client connects to in order to reduce forward requests.

ecStore [9] is a cloud based elastic storage system which supports automatic data partitioning and replication, load balancing, efficient range queries and transactional access. It uses a stratum architecture: BATON tree based data partitioning as the bottom layer to provide high scalability, two-tier load-adaptive replication as the middle layer to balance load and provide high availability, and multi-version optimistic concurrency control as the top transaction layer to provide data consistency. It uses data partitioning to solve the data skew issue and solves request skew by adding secondary replicas for hotspot data. ecStore uses primary copy optimistic replication and provides adaptive read consistency. However, if the read consistency value is set equal to the number of replicas, which means the system does not reply to the client until the data has been updated on all replicas, the result is the same as using primary copy pessimistic replication and may cause poor performance. Otherwise, the client may read stale data when recently updated data has not yet been synchronized to the nodes being accessed.

S. Bianchi et al. [10] study the load of a P2P system under biased request workloads. They find that such systems show a heavy lookup traffic load, and also a heavy load on the intermediate nodes that are responsible for forwarding accesses to the target node. Based on this, the authors propose to use Routing Table Reorganization to reduce the forward request load, and to use caches and data replicas to reduce the local request load, in order to balance the traffic load. As the experiments need hundreds of nodes, the authors only ran simulations.

M. Abdallah et al. [11] propose a load balancing mechanism that takes into account both data popularity and node heterogeneity. It defines the load measure as a function of the number of queries issued on data items per time frame and then proposes a mechanism that balances the system’s load by adjusting the DHT structure so that it best captures query load distributions and node heterogeneity. However, this system does not consider data replication. It also assumes that a forward request and a response message carry the same load, which may not be correct.

III. SYSTEM DESIGN

Cassandra is inspired by Bigtable and Dynamo. It integrates Bigtable’s Column Family based data model with Dynamo’s eventual consistency behavior and thus gains the advantages of both.

As with Dynamo, Cassandra uses Consistent Hashing to partition and distribute data over different nodes by hashing both the nodes and the data values onto a “ring”. To improve availability and balance the load, every data item is replicated to N nodes, where N is the replication number that can be configured in advance. First, each data item is assigned a Coordinator Node, which is the first node the item meets when walking clockwise around the “ring”. The item is then replicated to the next N-1 clockwise successor nodes in the “ring”. To trade off between strong consistency and high performance, Cassandra provides different consistency level options for both read and write operations. For writes, Consistency.One means the operation is routed only to the closest replica node. Consistency.Quorum means the system routes the request to a quorum of nodes, usually N/2+1, and waits for their responses. Consistency.All means the request is routed to all N replica nodes and the system waits for all of their responses. For reads, the operation is routed to all replicas, but the system waits only for a specific number of responses; the others are received and handled asynchronously. For Consistency.One, this number is 1; for Consistency.Quorum, it is N/2+1; and for Consistency.All, it is N.

In this section we introduce how we improve Cassandra.

A. Minimize forward request load

In order to use a Cassandra cluster, a client application needs to connect to a node within the cluster. We call this node the Connected Node. When the client reads or writes a data item and the Connected Node is not one of the replica nodes responsible for this item, the Connected Node has to forward the request to one or more nodes, wait for their responses and finally reply to the client. Since most applications exhibit request skew, the total forward request load differs depending on which node is chosen as the Connected Node. However, the client often does not know in advance which node has the least forward request load. Even if the client initially connects to the node with the highest local request load, the users’ access pattern changes as time goes by, and thus the hot spot data changes as well. For these two reasons, we need to change this configuration dynamically to minimize the forward request load.

In Cassandra, we meet three kinds of forward request:
  • K1: The request’s consistency level is Consistency.One. In this situation the Connected Node only forwards the request to the closest replica and waits for its response.

  • K2: The request is a read and its consistency level is not Consistency.One. The Connected Node forwards two types of request: a read request, routed to the closest replica, which returns the whole message to the Connected Node; and read digest requests, sent to the other replicas, which return only a message digest. After getting all the responses, the system digests the message and compares it with the digests received from the other replicas to see whether they are the same version.

  • K3: The request is a write and its consistency level is not Consistency.One. The Connected Node forwards the request and waits for blockNumber write responses according to the consistency level. Here blockNumber is N/2+1 if the consistency level is Consistency.Quorum, or N if it is Consistency.All.

When accessing a data item, the benefit of shifting the Connected Node from the original node to one of the replicas responsible for this item differs for each kind of forward request. In K1, if the Connected Node is one of the replicas, it does not need to wait for any message from remote nodes, which improves its response time considerably. In K2, it still needs to wait for read digest messages from the other replicas, but the Connected Node can handle the read request itself. In K3, it still needs to wait for (blockNumber-1) write responses from other replicas, so compared with the other two situations the improvement is limited.

Based on this observation, we propose our improvement. The idea is to record every node’s request load at the Connected Node, assigning different weights to the different kinds of request. At a fixed interval, the system compares the maximum request load among all nodes with that of the Connected Node. If the maximum is much larger than the load of the current Connected Node, we change the Connected Node to the node with the maximum load. Fig. 1 and Fig. 2 give the pseudocode.

    // Runs on the Connected Node for every request
    nodes ← findNodes(key)
    for node ∈ nodes do
        load ← baseLoad                        // default: load of a write operation (K3)
        if blockNumber = 1 then                // Consistency.One (K1)
            load ← baseLoad * weightOne
        end if
        if blockNumber > 1 and isReadOperation() then   // K2
            load ← baseLoad * weighRead
        end if
        addLoad(node, load)
    end for

Figure 1. Record each node’s request load

    maxNode ← maxLoadNode()
    if getLoad(maxNode) - getLoad(connectedNode) > changeFactor * totalClusterLoad() then
        changeConnectedNode(maxNode)
    end if

Figure 2. Change Connected Node

B. Consider node’s storage capacity when balancing data among each node

In order to balance the data load among the nodes, Cassandra monitors each node’s data load information on the ring. If it finds that a node is overloaded, the system alleviates the node’s load by moving its position along the ring. The detailed checking and moving algorithms are described in [12].

However, Cassandra assumes that each node has the same storage capacity. It only monitors the storage size each node has used and uses this information to judge whether an overloaded node exists. In reality, within one Cassandra cluster, different commodity server nodes may have different storage capacities, which also needs to be taken into consideration when balancing the data.

In order to maximize each node’s storage utilization, we propose an enhanced data balancing algorithm. We compare each node’s local storage used ratio with the whole cluster’s average storage used ratio. If the local used ratio is larger than moveRatio times the cluster average, where moveRatio is configurable, we say that the local node is overloaded. When a node is overloaded, we select as candidates those nodes that have enough free space to keep the combined used ratio of the candidate and the overloaded node below the same threshold after the overloaded node has been moved. If there is more than one candidate, we select the one with the minimum original used ratio as the target node and move the overloaded node beside it to balance the data between them.

Since our balancing algorithm is based on the average used ratio, two nodes with very different total storage capacities will hold very different amounts of data even when their used ratios are similar. One node then stores far more data than the other, which sometimes also means it receives far more request load. We address this potential issue with a variable called allowCapacityRatio. For any node whose total storage is larger than allowCapacityRatio times the smallest node’s total storage, we use this maximum allowed capacity as its total capacity instead.

    localUsedRatio ← localUsedSize / localTotalSize
    averageUsedRatio ← getClusterAverageUsedRatio()
    if localUsedRatio > moveRatio * averageUsedRatio then
        candidateNodes ← all nodes whose
            (usedSize + localUsedSize) / (totalSize + localTotalSize) < moveRatio * averageUsedRatio
        targetNode ← minUsedRatio(candidateNodes)
        move local node so that newLocalUsedRatio = newTargetNodeUsedRatio
    end if

Figure 3. Storage balance algorithm
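The overload check and target selection of Fig. 3, together with the allowCapacityRatio cap, can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the `Node` class, the function names and the sample cluster are hypothetical, the cluster average is assumed to be the plain mean of the per-node used ratios (the text does not define getClusterAverageUsedRatio), and the actual move along the ring is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    used: float   # storage currently used
    total: float  # raw storage capacity


def effective_total(node, nodes, allow_capacity_ratio):
    """Cap a node's capacity at allow_capacity_ratio times the smallest capacity."""
    cap = allow_capacity_ratio * min(n.total for n in nodes)
    return min(node.total, cap)


def pick_target(local, nodes, move_ratio, allow_capacity_ratio):
    """Return the node the overloaded `local` node should move beside, or None."""
    totals = {n.name: effective_total(n, nodes, allow_capacity_ratio) for n in nodes}
    ratios = {n.name: n.used / totals[n.name] for n in nodes}
    # Assumption: the cluster average is the mean of the per-node used ratios.
    threshold = move_ratio * sum(ratios.values()) / len(nodes)

    if ratios[local.name] <= threshold:
        return None  # local node is not overloaded

    # Candidates keep the combined used ratio below the threshold.
    candidates = [
        n for n in nodes
        if n is not local
        and (n.used + local.used) / (totals[n.name] + totals[local.name]) < threshold
    ]
    if not candidates:
        return None
    # Prefer the candidate with the smallest original used ratio.
    return min(candidates, key=lambda n: ratios[n.name])


# Hypothetical cluster; D's capacity of 400 is capped at 2 * 100 = 200.
a = Node("A", used=90, total=100)
b = Node("B", used=20, total=100)
c = Node("C", used=30, total=200)
d = Node("D", used=40, total=400)
cluster = [a, b, c, d]

target = pick_target(a, cluster, move_ratio=1.5, allow_capacity_ratio=2)
print(target.name if target else None)
```

In this sample cluster, node A is the only overloaded node, B is rejected because pairing it with A would still exceed the threshold, and C is preferred over D because its original used ratio is lower once D's capacity is capped.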
IV. EXPERIMENT

We ran a series of experiments to evaluate the result. The base Cassandra version we use is 0.6.4. We set up a cluster of 6 commodity server nodes; all nodes are within the same LAN and the same rack.

We made some changes to the TPC-W benchmark application and used it on our system. The requests in TPC-W follow a Zipf distribution, which is used very often in the web application domain to simulate users’ real access patterns.

A. The result of minimizing forward request load

The criterion we use here is the total number of forward requests for each node. We assign the following values to the variables described in Fig. 1: baseLoad = 1, weightOne = 2, weighRead = 1.2. For Fig. 2, we set changeFactor = 5% and make the system check once a day whether the Connected Node needs to be changed.

To see how the Replicas Factor and the consistency level affect the result, we set up 3 rounds of tests. Each round runs for 24 hours, and the unit of the request numbers in the following tables is ten thousand.

1) Replicas Factor = 3, W = 1, R = 1

In this round, each data item has 3 replicas distributed over 3 different nodes, and the consistency level of all operations is set to Consistency.One. A write operation therefore only touches one node, and a read operation only waits for one response while the other responses are received asynchronously.

We connect the client to Node3 at first. Since its Factor Diff is 1.8%, which is smaller than changeFactor, the Connected Node does not change after running the algorithm. From Table I we can see that if we connect to Node1 or Node2 at first, the Connected Node is changed to Node4. In this round, there are no K2 and K3 requests since all operations are Consistency.One. Factor Diff is defined as:

$\mathrm{FactorDiff}_i = (\mathit{load}_{max} - \mathit{load}_i) / \mathit{load}_{total}$

Table II shows that if the Connected Node is Node1 at first, the change reduces the forward read requests by 34.8% and the forward write requests by 31%. For Node2, the values are 29.4% and 25.7% respectively. No read digest request needs to be forwarded synchronously in this round.

TABLE I. REQUEST LOAD DIFFERENCE IN ROUND 1

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request    232     256     315     349     323     266
Total Load    464     512     630     698     646     532
Factor Diff   6.4%    5.0%    1.8%    0%      1.4%    4.5%

TABLE II. FORWARD REQUEST REDUCED RATIO IN ROUND 1

         Forward Read   Forward Write   Read Reduced   Write Reduced
         request        request         Ratio          Ratio
Node1    236            113             34.8%          31%
Node2    218            105             29.4%          25.7%
Node4    154            78              0%             0%

2) Replicas Factor = 2, W = 1, R = 1

In this round we change the factor to 2 in order to see how the number of replicas affects our algorithm. From Table III we can see that if we connect our TPC-W client application to Node1, Node2 or Node3 at first, the Connected Node is changed to Node5 after running the algorithm. Table IV shows that if the Connected Node is shifted from Node1 to Node5, the forward read requests are reduced by 25.2% and the forward write requests by 18%. When it is changed from Node2 to Node5, the results are 22.5% and 19.2% respectively. For Node3, the results are 17% and 11.6%.

Comparing with the result of round 1, we find that when the other configuration remains the same, the fewer replicas we use, the more likely it is that the Connected Node will be changed, but for each change the improvement is smaller than in round 1.

TABLE III. REQUEST LOAD DIFFERENCE IN ROUND 2

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request    147     156     187     230     247     195
Total Load    294     312     374     460     494     390
Factor Diff   8.6%    7.8%    5.2%    1.5%    0%      4.5%

TABLE IV. FORWARD REQUEST REDUCED RATIO IN ROUND 2

         Forward Read   Forward Write   Read Reduced   Write Reduced
         request        request         Ratio          Ratio
Node1    295            139             25.2%          18%
Node2    284            141             22.5%          19.2%
Node3    265            129             17.0%          11.6%
Node5    220            114             0%             0%

3) Replicas Factor = 3, W = 2, R = 2

In round 3 we change both the read and the write consistency level to Consistency.Quorum, to see how this affects our algorithm.

Since no request is Consistency.One, there are no K1 requests. Table V presents the detailed result, from which we can see that if we first connect the client application to Node1 or Node2, the Connected Node needs to be changed to Node4. As we can see in Table VI, the read reduced ratio is the same as in round 1, while the write reduced ratio is lower than in round 1. That means that for the same number of replicas, the stricter the consistency level, the smaller the improvement.

TABLE V. REQUEST LOAD DIFFERENCE IN ROUND 3

              Node1   Node2   Node3   Node4   Node5   Node6
K2 Request    154     171     213     236     218     177
K3 Request    78      84      102     113     105     89
Total Load    262.8   289.2   357.6   396.2   366.6   301.4
Factor Diff   6.8%    5.4%    2.0%    0%      1.5%    4.8%

TABLE VI. FORWARD REQUEST REDUCED RATIO IN ROUND 3

         Forward Read   Forward Read     Forward Write   Read Reduced   Write Reduced
         request        Digest request   request         Ratio          Ratio
Node1    236            390              304             34.8%          11.5%
Node2    218            390              296             29.4%          9.1%
Node4    154            390              269             0%             0%
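The weighting of Fig. 1 and the switching rule of Fig. 2 can be replayed on the tabulated request counts. The Python sketch below is illustrative only: the function and variable names are hypothetical, the weights and changeFactor are the values stated in Section IV.A, and the request counts are copied from Tables I and V.

```python
# Weights and threshold as stated in Section IV.A.
BASE_LOAD = 1.0
WEIGHT_ONE = 2.0      # K1: Consistency.One requests
WEIGHT_READ = 1.2     # K2: reads above Consistency.One
CHANGE_FACTOR = 0.05


def total_load(k1=0, k2=0, k3=0):
    """Weighted load of one node following Fig. 1 (K3 counts at baseLoad)."""
    return BASE_LOAD * (WEIGHT_ONE * k1 + WEIGHT_READ * k2 + k3)


def next_connected_node(loads, connected):
    """Fig. 2 rule: return the node to switch to, or `connected` if no switch."""
    max_node = max(loads, key=loads.get)
    gap = loads[max_node] - loads[connected]
    if gap > CHANGE_FACTOR * sum(loads.values()):
        return max_node
    return connected


# Request counts (unit: ten thousand) copied from Tables I and V.
round1_k1 = {"Node1": 232, "Node2": 256, "Node3": 315,
             "Node4": 349, "Node5": 323, "Node6": 266}
round3_k2 = {"Node1": 154, "Node2": 171, "Node3": 213,
             "Node4": 236, "Node5": 218, "Node6": 177}
round3_k3 = {"Node1": 78, "Node2": 84, "Node3": 102,
             "Node4": 113, "Node5": 105, "Node6": 89}

round1_loads = {n: total_load(k1=c) for n, c in round1_k1.items()}
round3_loads = {n: total_load(k2=round3_k2[n], k3=round3_k3[n]) for n in round3_k2}

for start in ("Node1", "Node2", "Node3"):
    print(start,
          "round 1 ->", next_connected_node(round1_loads, start),
          "| round 3 ->", next_connected_node(round3_loads, start))
```

With these inputs the weighted sums match the Total Load rows of Tables I and V, and the rule moves a client that starts on Node1 or Node2 to Node4 while leaving a client on Node3 in place, which agrees with the behavior described for rounds 1 and 3.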
B. The result of considering storage capacity

In this experiment we set moveRatio = 1.5 and allowCapacityRatio = 2.

First we run TPC-W for a long time to populate enough data into the cluster nodes. Then we run the load balance script command several times. In round one we use Cassandra’s original load balance algorithm; in round two we use our algorithm. Fig. 4 and Fig. 5 present the results. LB1 is the load balance algorithm provided by Cassandra and LB2 is our algorithm.

From Fig. 4 we can see that although LB2 cannot distribute data over the nodes as evenly as LB1, it still has a significant effect compared with the original load distribution.

From Fig. 5, it is clear that our algorithm utilizes the different nodes’ storage capacity much better than the original one. As the utilization is more balanced, the whole cluster can store more data, which means the storage utilization is improved.

Figure 4. Storage size used by each node

Figure 5. Storage utilization of each node

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented two ways to improve Cassandra, making it aware of request skew and of the different capacities of nodes when balancing the storage load.

Firstly, we propose an algorithm that minimizes forward requests by dynamically shifting the Connected Node to the node that can handle the maximum number of requests locally.

Secondly, we propose a new approach that improves the storage utilization of each node by using the used ratio instead of the used size to balance data storage.

We then ran several experiments to evaluate the effectiveness of our approach. The results show that in the different scenarios we can reduce both forward read requests and forward write requests considerably. The experiments also show that the storage utilization becomes visibly more balanced and is improved.

For now, we assume that all nodes are within the same datacenter; we will extend our research to multiple datacenters in the future. Also, all data currently has the same number of replicas. As a next step, we will consider adding adaptive replicas for the nodes that contain hot spot data.

REFERENCES

[1] S. Ghemawat, H. Gobioff and S. Leung, “The Google File System”, In 19th Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29–43.
[2] F. Chang et al., “Bigtable: A distributed storage system for structured data”, In Proc. OSDI, 2006, pp. 205–218.
[3] G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store”, In Proc. SOSP, 2007, pp. 205–220.
[4] B. F. Cooper et al., “PNUTS: Yahoo!’s Hosted Data Serving Platform”, Proc. VLDB Endow., vol. 1, pp. 1277–1288, August 2008.
[5] A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system”, SIGOPS Oper. Syst. Rev., vol. 44, pp. 35–40, 2009.
[6] M. Stonebraker, “SQL Databases v. NoSQL Databases”, Commun. ACM, vol. 53, pp. 10–11, April 2010.
[7] N. Leavitt, “Will NoSQL Databases Live Up to Their Promise?”, Computer, vol. 43, pp. 12–14, February 2010.
[8] E. A. Brewer, “Towards robust distributed systems”, Principles of Distributed Computing, Portland, Oregon, July 2000.
[9] H. T. Vo, C. Chen and B. C. Ooi, “Towards elastic transactional cloud storage with range query support”, Proc. VLDB Endow., vol. 3, pp. 506–517, 2010.
[10] S. Bianchi, S. Serbu, P. Felber and P. Kropf, “Adaptive Load Balancing for DHT Lookups”, ICCCN, 2006, pp. 411–418.
[11] M. Abdallah and E. Buyukkaya, “Fair load balancing under skewed popularity patterns in heterogeneous DHT-based P2P systems”, International Conference on Parallel and Distributed Computing and Systems, 2007, pp. 484–490.
[12] M. Abdallah and H. C. Le, “Scalable Range Query Processing for Large-Scale Distributed Database Applications”, Proc. Intl Conf. Parallel and Distributed Computing Systems (PDCS), 2005.