Optimization on Key-value Stores in Cloud Environment
  • 1. Optimization on Key-value Stores in Cloud Environment
Fei Dong, Duke University, April 17, 2012

Abstract

In the current information age, large amounts of data are being generated and accumulated rapidly in various domains. This places heavy demands on data processing systems, which must extract sensible and valuable information from the data in a timely manner. Hadoop, the open source implementation of Google's data processing framework (MapReduce, Google File System and BigTable), is becoming increasingly popular and is used to solve data processing problems in various application scenarios. Hadoop also includes several distributed storage systems. Some applications need to store structured data efficiently while keeping a variable schema, and HBase is useful in this scenario. In this project, we leverage the cost-based optimizer and the rule-based optimizer for key-value storage to get better performance for workflows.

1 Introduction

Storage in large-scale data centers is changing dramatically. The move from high-cost, enterprise-class storage systems to storage nodes that are part application, part storage and part service is pervasive. We have many alternative systems to choose from, such as HBase [6], Cassandra [5] and VoltDB [7], and most of them claim to be scalable and reliable. However, it is difficult to decide which system is right for an application, partly because the features differ between systems and partly because there is no easy way to compare the performance of one system against another. Although Yahoo launched the Yahoo! Cloud Serving Benchmark [8], whose goal is to facilitate performance comparisons of the new generation of cloud data serving systems, it is still very challenging to tune a specific key-value store in a cloud environment. In this project, we survey some storage services and choose HBase as the experiment target.
  • 2. Furthermore, we analyze the features and properties of the HBase implementation, such as scalability and reliability. Finally, we measure its performance in the EC2 [3] environment.

1.1 Cloud Computing Environment

Infrastructure-as-a-Service (IaaS) cloud platforms have brought unprecedented changes to the cloud leasing market. Amazon EC2 [3] is a popular cloud provider in this market, offering standard on-demand instances, reserved instances and Spot Instances (SI). An EC2 instance is basically the equivalent of a virtual machine. When launching an instance, users are asked to choose an AMI (Amazon Machine Image), which is an image of an operating system. Next, users choose the instance type; there are quite a few options, depending on how much "power" they need and the kind of operations they run. Table 1 shows the features and renting costs of some representative EC2 node types.

In the following sections, we prepare an AMI that includes all of the software Hadoop needs and compare performance by choosing a proper node type and cluster size.

EC2 Node Type   CPU (# EC2 Units)   Memory (GB)   Storage (GB)   I/O Performance   Cost (U.S. $ per hour)
m1.small        1                   1.7           160            moderate          0.085
m1.large        4                   7.5           850            high              0.34
m1.xlarge       8                   15            1,690          high              0.68
m2.xlarge       6.5                 17.1          1,690          high              0.54
c1.medium       5                   1.7           350            moderate          0.17
c1.xlarge       20                  7             1,690          high              0.68
Table 1: Six representative EC2 node types, along with resources and costs

1.2 HBase Overview

HBase is a faithful, open source implementation of Google's Bigtable [1]. It is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets, and it provides a scalable, distributed database. Figure 1 presents the HBase cluster architecture. HBase is built around the following concepts.
Tables: HBase tables are like those in an RDBMS, except that cells are versioned, rows are sorted, and columns can be added on the fly by the client as long as the column family they belong to preexists. Tables are automatically partitioned horizontally by HBase into regions.

Rows: Row keys are uninterpreted bytes. Rows are lexicographically sorted, with the lowest order appearing first in a table. The empty byte array
  • 3. is used to denote both the start and end of a table's namespace.

Column Family: Columns in HBase are grouped into column families. All column family members have a common prefix. Physically, all column family members are stored together on the file system. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

Cells: A (row, column, version) tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

Regions: Tables are automatically partitioned horizontally by HBase into regions, each of which comprises a subset of a table's rows. HBase is characterized by an HBase master node orchestrating a cluster of one or more regionserver slaves. The HBase Master is responsible for bootstrapping, for assigning regions to registered regionservers, and for recovering from regionserver failures. HBase keeps special catalog tables named -ROOT- and .META., which maintain the current list, state, recent history, and location of all regions afloat on the cluster. The -ROOT- table holds the list of .META. table regions. The .META. table holds the list of all user-space regions. Entries in these tables are keyed using the region's start row.

Figure 1: HBase cluster architecture

1.3 Hadoop Storage System

HDFS is the most used distributed filesystem in Hadoop. The primary reasons HDFS is so popular are its built-in replication, fault tolerance, and scalability. However, this is not enough. Some applications need to store structured data efficiently while keeping a variable schema, and HBase is useful in this scenario. Its goal is to host very large tables – billions of rows X millions
  • 4. of columns, and to provide high reliability.

Figure 2: HBase components

Figure 2 shows how the various components of HBase are orchestrated to make use of existing systems, such as HDFS and ZooKeeper, while adding its own layers to form a complete platform.

Requirements             HDFS   HBase
Scalable Storage         yes    yes
System Fault Tolerance   yes    yes
Sequence Read/Write      yes    yes
Random Write             no     yes
Client Fault Tolerance   no     yes
Append & Flush           no     yes
Table 2: HDFS vs. HBase

Table 2 summarizes some tradeoffs between HDFS and HBase. In general, HBase is a column-oriented store implemented as the open source version of Google's BigTable system. Each row in the sparse tables corresponds to a set of nested key-value pairs indexed by the same top-level key (called the "row key"). Scalability is achieved by transparently range-partitioning data based on row keys into partitions of equal total size, following a shared-nothing architecture. As the size of data grows, more data partitions are created. Persistent distributed data storage systems are normally used to store all the data for fault tolerance purposes.

1.4 Cost Based Optimization vs. Rule Based Optimization

There are two kinds of optimization strategies used by database engines for executing a plan.

Rule Based Optimization (RBO): RBO uses a set of rules to determine how to execute a plan. Cloudera [2] gives us some tips to improve
  • 5. MapReduce performance. For example, if a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks will be smaller. They also recommend a number of reduce tasks equal to or a bit less than the number of reduce slots in the cluster.

Cost Based Optimization (CBO): The motivation behind CBO is to come up with the cheapest execution plan available for each query. For example, the cheapest plan may be the plan that uses the least amount of resources (CPU, memory, I/O, etc.) to get the desired output, the one that completes in the least amount of time, or the one that costs the least money. Query optimizers are responsible for finding a good execution plan p for a given query q, given an input set of tables with some data properties d, and some resources r allocated to run the plan. The kernel of CBO is a cost model, which is used to estimate the performance cost y of a plan p. The query optimizer employs a search strategy to explore the space of all possible execution plans in search of the plan with the least estimated cost y.

Here we consider rule-based optimization in our project.

2 HBase Profile

Some projects rely on HDFS and HBase for data storage. HBase records important calculation results such as intersection area and target geolocation. We also want to measure the cost of HBase usage. Hence, we develop the HBase Profiler.

The principle is to hook HBase client API calls such as put() and get() when HBase is enabled. At the entry of each method, we capture the input size and starting time. When the function returns, we update the counters, and after the MapReduce jobs finish we compute the statistics for the metrics below.
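The hooking principle can be illustrated with a small sketch. It is not the profiler built in this project and it does not use the HBase Java client: `FakeTable`, `profiled`, `counters` and the counter names are hypothetical stand-ins, chosen only to mirror the kind of metrics the profiler reports. The wrapper records the start time at method entry and updates the counters when the call returns.

```python
import time
from collections import Counter
from functools import wraps

counters = Counter()  # hypothetical per-task counters


def profiled(op, size_of):
    """Wrap a client method: record call count, elapsed time and byte size."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter_ns()  # captured at method entry
            try:
                result = fn(*args, **kwargs)
            finally:
                # updated when the call returns (or raises)
                counters[f"{op}_DURATION"] += time.perf_counter_ns() - start
                counters[f"{op}_COUNT"] += 1
            counters[f"{op}_BYTE_COUNT"] += size_of(args, result)
            return result
        return wrapper
    return decorate


class FakeTable:
    """In-memory stand-in for a key-value table (not the HBase API)."""

    def __init__(self):
        self._rows = {}

    # args is (self, row_key, value), so args[2] is the written value
    @profiled("PUT", lambda args, result: len(args[2]))
    def put(self, row_key, value):
        self._rows[row_key] = value

    # a miss returns None, which counts as zero bytes
    @profiled("GET", lambda args, result: len(result or b""))
    def get(self, row_key):
        return self._rows.get(row_key)


table = FakeTable()
table.put(b"row1", b"hello")
table.put(b"row2", b"world!")
table.get(b"row1")
table.get(b"missing")
```

After these four calls, `counters` holds two puts totalling 11 bytes and two gets totalling 5 bytes, plus the accumulated durations in nanoseconds. A real profiler would aggregate such per-task counters once the job finishes.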
The task profiler populates the metrics:

HBASE PUT_DURATION 36002716
HBASE PUT_COUNT 3435
HBASE PUT_BYTE_COUNT 5789092
HBASE GET_DURATION 162122328
HBASE GET_COUNT 23232
HBASE GET_RESULT_COUNT 27692
HBASE GET_RESULT_BYTE_COUNT 3889092

When the job finishes running, the job profiler gathers the task profiles together and exports them as an XML report.

<counter key="HBASE_PUT_SIZE" value="3435"/>
<counter key="HBASE_PUT_BYTES" value="4780"/>
  • 6. <counter key="HBASE_GET_SIZE" value="23232"/>
<counter key="HBASE_GET_RESULT_SIZE" value="27692"/>
<counter key="HBASE_GET_RESULT_BYTES" value="3889092"/>
<cost_factor key="HBASE_PUT_COST" value="106064.0"/>
<cost_factor key="HBASE_GET_COST" value="212321.0"/>
<timing key="HBASE_PUT" value="36.633596"/>
<timing key="HBASE_GET" value="74.985654"/>

How to integrate the HBase profile with Starfish's cost-based model is still a pending problem.

3 Memory Allocation

Process               Heap   Description
NameNode              2G     About 1G for each 10TB of data
SecondaryNameNode     2G     Applies the edits in memory, needs about the same amount as the NameNode
JobTracker            2G     Moderate requirement
HBase Master          4G     Usually lightly loaded
DataNode              1G     Moderate requirement
TaskTracker           1G     Moderate requirement
Task Attempts         1G     Multiply by the maximum number allowed
HBase Region Server   8G     Majority of available memory
ZooKeeper             1G     Moderate requirement
Table 3: Memory allocation per Java process for a cluster

As suggested by the HBase book [4], Table 3 shows a basic distribution of memory to specific processes. The setup is as follows: the master machine runs the NameNode, Secondary NameNode, JobTracker and HBase Master with 17 GB of memory; the slaves run the DataNodes, TaskTrackers, and HBase RegionServers.

We note that setting the heap of region servers to larger than 16 GB is considered dangerous. Once a stop-the-world garbage collection is required, it simply takes too long to rewrite the fragmented heap.

4 HBase Performance Tuning

4.1 Experiment Environment

    • Cluster Type1: m1.large, 10-20 nodes: 7.5 GB memory, 2 virtual cores, 850 GB storage, configured to run 3 map tasks and 2 reduce tasks concurrently.
  • 7.
    • Cluster Type2: m2.xlarge, 10 nodes: 17.1 GB of memory, 6.5 EC2 Compute Units (2 virtual cores with 3.25 EC2 Compute Units each).
    • Hadoop Version1: Apache 0.20.205.
    • Hadoop Version2: CDH3U3 (Cloudera Update Version) based on 0.20.2.
    • HBase Version1: Apache 0.90.4.
    • HBase Version2: CDH3U3 (supports the LZO and SNAPPY compression libraries).
    • YCSB Version: 1.2.4
    • Data Set: 10 million records for reads, writes, and updates. Each record is about 10 KB in size.

4.2 Optimization Space

1. Operating System: ulimit, nproc, JVM settings
2. Hadoop Configuration: xcievers, handlers
3. HBase Configuration: garbage collection, heap size, block cache
4. HBase Schema: row key, compression, block size, max filesize, bloom filter
5. HBase Processes: split, compaction

4.3 JVM Setting

Hadoop JVM instances were set to run with the "-server" option and a heap size of 4 GB. HBase nodes also ran with a 4 GB heap, using the JVM settings "-server -XX:+UseParallelGC -XX:ParallelGCThreads=8 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError". The parallel GC leverages multiple CPUs.

4.4 Configuration Settings

Some of the configuration settings we applied are listed in Table 4.

4.5 HBase Operation Performance

Table 5 describes the basic operation performance under a workload of 10M records on 10 m2.xlarge nodes.
  • 8.
Configuration File   Property                           Value      Description
hdfs-site.xml        dfs.block.size                     33554432   Lower value offers more parallelism.
hdfs-site.xml        dfs.datanode.max.xcievers          4096       Upper bound on the number of files that a DataNode will serve at any one time.
core-site.xml        io.file.buffer.size                16384      Read and write buffer size. Setting the limit to 16 KB allows continuous streaming.
hbase-site.xml       hbase.regionserver.handler.count   100        Number of RPC server instances spun up on HBase regionservers.
hbase-site.xml       hfile.min.blocksize.size           65536      A smaller size increases the index but reduces the data fetched on a random access.
hbase-site.xml       hbase.hregion.max.filesize         512M       Maximum HStoreFile size. If any HStoreFile grows to exceed this value, the hosting HRegion is split in two.
hbase-site.xml       zookeeper.session.timeout          60000      ZooKeeper session timeout.
hbase-site.xml       hfile.block.cache.size             0.39       Percentage of maximum heap to allocate to the block cache used by HFile/StoreFile.
Table 4: HBase configuration tuning

Operation           Throughput   Latency (ms)
random read         2539         3.921
random insert       46268        0.2
random update       12361        0.797
read-modify-write   4245         2.214
scan                592          16.59
Table 5: Performance test in HBase

4.5.1 Writing

From Figure 3, we can see that when the number of client threads reaches 250, the response time is around 6 ms and the throughput is 7000. However, we observe some fluctuations during the run. In order to tune HBase, we choose six different scenarios to test the write speed. The following two graphs show test results with client-side or server-side settings that may affect speed and fluctuation.

The first test case uses the default settings. In the first graph, the latency fluctuates between 5 ms and 25 ms, and writes stall for several seconds at a time. For an online business, this is not acceptable. We also meet a common HBase problem: when writing to an empty table, the load concentrates on one region server for a long period of time, which wastes cluster capacity. It can be solved by pre-splitting regions.
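Pre-splitting amounts to choosing region boundary keys before any data is loaded. The following is a minimal sketch of computing evenly spaced split points. It assumes, hypothetically, that row keys are fixed-width, zero-padded decimal strings such as user0002500000; the function name and the key format are illustrative and do not come from the experiments, and the resulting keys would still have to be handed to HBase when the table is created.

```python
def split_points(num_regions, key_space, prefix="user", width=10):
    """Return num_regions - 1 boundary keys dividing [0, key_space) evenly.

    Zero padding keeps lexicographic order identical to numeric order,
    which matters because HBase sorts row keys as raw bytes.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    return [
        f"{prefix}{i * key_space // num_regions:0{width}d}"
        for i in range(1, num_regions)
    ]


# Four regions over 10 million sequential keys need three boundaries.
boundaries = split_points(4, 10_000_000)
```

This only spreads load evenly if writes are themselves spread across the key space; with monotonically increasing keys, all writes would still land on the last region.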
  • 9. Figure 3: HBase write performance change with threads

Figure 4: HBase performance test with different client parameters

Figure 5: HBase performance test with different server parameters

Reading the source code of the HBase client, we find that it retries several times when a write fails. The strategy is to increase the retry interval exponentially.
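The effect of such a schedule on worst-case waiting time can be checked with a little arithmetic. The sketch below uses the multiplier sequence and the pause values quoted in the paragraph that follows; `retry_waits_ms` is a hypothetical helper, not HBase client code, and reusing the last multiplier for attempts beyond the end of the table is an assumption of this sketch.

```python
# Multiplier table as quoted in the text; reusing the last entry for later
# attempts is an assumption of this sketch.
RETRY_BACKOFF = [1, 1, 1, 2, 2, 4, 4, 8, 16, 32]


def retry_waits_ms(pause_ms, retries):
    """Return the wait before each retry, in milliseconds."""
    last = len(RETRY_BACKOFF) - 1
    return [pause_ms * RETRY_BACKOFF[min(i, last)] for i in range(retries)]


default_total_ms = sum(retry_waits_ms(1000, 10))  # 1 s pause, 10 retries
tuned_total_ms = sum(retry_waits_ms(20, 11))      # 20 ms pause, 11 retries
```

Under these assumptions the default schedule can keep a client waiting for 71 seconds in total, while the tuned schedule adds up to 2060 ms, roughly the 2 s bound mentioned in the text.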
  • 10. For example, it starts from 1 second and retries 10 times. Hence, the interval sequence is [1,1,1,2,2,4,4,8,16,32]. If the server throws exceptions, clients will wait for a long time. Therefore one solution is to limit the pause time to 20 ms (hbase.client.pause) and set the number of retries to 11. With these settings, clients are guaranteed to finish retrying within about 2 s. We also set hbase.ipc.client.tcpnodelay = true and a further 3 s setting, which avoids long waits. After applying these changes, we can see that although write latency still fluctuates occasionally, the peak is under 5 ms, which is much better than before.

After tuning the client, we also tune the server. By mining the server logs, we find that one key factor is region splitting. If we disable splitting, fluctuations occur less often than in the baseline, but they still happen from time to time.

Another problem is that as more data is written, the server compacts larger files, and disk I/O becomes the bottleneck during the compaction phase. We therefore experiment with disabling compaction. In the first test with compaction disabled, the fluctuation range becomes larger and the logs show "Blocking updates for" and "delay flush up to". When the memstore of a region is about to be flushed, the server first checks the number of store files. If there are too many files, it defers the flush. While the flush is deferred, more data may accumulate in the memstore. If the memstore grows beyond 2X its configured size, the region blocks write/delete requests until the memstore size returns to normal. We modify the related parameters:

hbase.hstore.blockingStoreFiles = Integer.MAX_VALUE
hbase.hregion.memstore.block.multiplier = 8

When testing another region server with many regions, we find that even with compaction and splitting disabled and some configuration changes, there is still large fluctuation. One reason is that the server maintains too many regions, which can be solved by decreasing the number of regions or the memstore size. Another reason is the HLog, which asks the memstore to flush regions regardless of the size of the memstore.
The final graph shows that disabling the HLog reduces fluctuation even more.

Some other parameters that affect writing:
1. Batch operations
2. Write buffer size
3. WAL (HLog)
4. AutoFlush

WAL=false, autoFlush=false, buffer=25165824: insert costs 0.058 ms
WAL=true, autoFlush=false, buffer=25165824: insert costs 0.081 ms
  • 11. WAL=false, autoFlush=true: insert costs 0.75 ms
WAL=true, autoFlush=true: insert costs 1.00 ms

Our experiment shows that without autoFlush, the insert can be 10X faster.

4.5.2 Reading

Figure 6: HBase read performance

We continue by testing read performance. Figure 6 shows that read throughput is 2539 with the default settings and 7385 with a Bloom filter. We can conclude that using Bloom filters that match the update or read patterns can save a great deal of I/O.

4.6 Compression

HBase comes with support for a number of compression algorithms that can be enabled at the column family level. We insert 10 million records with different compression algorithms.

Algorithm   Insert Time (ms)
NONE        0.076
GZIP        0.150
LZO         0.12
SNAPPY      0.086
Table 6: Compression performance in HBase

4.7 Costs of Running Benchmark

Benchmarking can take a lot of time and money. There are many permutations of factors to test, so the cost of each test in terms of setup time (launching services and loading data) and compute resources can be a real limitation. Table 7 shows the test duration and AWS cost at the normal list price. This cost could be reduced by using spot pricing, or by sharing unused reservations with the production site. Figure 7 shows the monetary cost of running those experiments.
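As a rough illustration of how such costs add up, the sketch below estimates the list-price cost of one benchmark run from the hourly prices in Table 1. The function name is hypothetical, and billing every started hour as a full hour is an assumption of this sketch rather than something stated in the paper.

```python
import math
from decimal import Decimal

# Hourly on-demand prices in U.S. dollars, copied from Table 1.
PRICE_PER_HOUR = {
    "m1.small": Decimal("0.085"),
    "m1.large": Decimal("0.34"),
    "m1.xlarge": Decimal("0.68"),
    "m2.xlarge": Decimal("0.54"),
    "c1.medium": Decimal("0.17"),
    "c1.xlarge": Decimal("0.68"),
}


def cluster_cost(node_type, nodes, hours):
    """Estimate the list-price cost of running `nodes` instances for `hours`."""
    billed_hours = math.ceil(hours)  # assumption: partial hours billed in full
    return PRICE_PER_HOUR[node_type] * nodes * billed_hours


# Ten m2.xlarge nodes for a 2.5-hour test are billed as 3 hours each.
run_cost = cluster_cost("m2.xlarge", 10, 2.5)
```

Under these assumptions a 2.5-hour test on ten m2.xlarge nodes costs 16.20 dollars, so a matrix of a few dozen configurations quickly reaches hundreds of dollars.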
  • 12. Figure 7: EC2 cost for running experiments

5 Rules and Guidelines

Based on the above tests and our understanding of the HBase model, we can extract some best practices:

    • We recommend c1.xlarge, m2.xlarge or an even better node type to run HBase. Isolate the HBase cluster to avoid memory competition with other services.
    • Some important factors affect write speed: Writing HLog < Split < Compact.
    • When HBase writes, it is hard to get rid of "shaking" (latency fluctuation). The factors are ranked as Split < Writing HLog < Compact.
    • If applications do not need the "delete" operation, we can disable compaction and splitting, which saves some cost.
    • In the reducer phase, we can write results into HDFS and HBase. If we use batch operations, or set autoFlush=false with a proper buffer size, writes speed up to 10X over the default configuration.
    • For high-frequency write applications, one region server should not host too many regions. The right number depends on row size and machine memory.
    • If applications do not require strict data durability, disabling the HLog can give a 2X speedup.
    • hbase.regionserver.handler.count (default is 10): with the default, around 100 threads are blocked when load increases. If we set it to 100, requests are served fairly and throughput then becomes the bottleneck.
    • If pre-splitting regions is not an option, we can manually move regions and design a proper row key so that load is distributed evenly across region servers.
    • A split takes on the order of seconds. If a region is accessed during the splitting phase, the client may get a NotServingRegionException.
  • 13.
    • Compression can save storage space. Snappy provides high speed and reasonable compression.
    • In a read-heavy system, using Bloom filters that match the update or read patterns can save a huge amount of I/O.

6 Conclusion

In this project, we described our approach to optimizing a key-value store using rule-based techniques. We analyzed some performance bottlenecks of HBase and discussed improvements for writing and reading. Finally, we summarized some optimization rules that can be applied in practice. For future work, we will develop a related profiler to capture run-time behavior. Furthermore, diagnosing the cause of performance problems in workflows is a challenging exercise. A diagnosis tool that makes use of different log sources and profiles and presents them in a clear visual way would help developers find the root of a problem more productively.

References

[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, OSDI '06, pages 15–15, Berkeley, CA, USA, 2006. USENIX Association.
[2] Cloudera. 7 Tips for Improving MapReduce Performance, 2009. 7-tips-for-improving-mapreduce-performance/.
[3] Amazon Elastic Compute Cloud, 2012.
[4] L. George. HBase: The Definitive Guide. O'Reilly Media, 2011.
[5] Apache Cassandra, 2012.
[6] Apache HBase, 2012.
[7] VoltDB, 2012.
[8] Yahoo! Cloud Serving Benchmark, 2012. brianfrankcooper/YCSB/wiki.