1
Rigorous and Multi-tenant HBase Performance
Govind Kamat, Yanpei Chen
Performance Engineering
2
Bio
Govind Kamat
• Member of the Performance Engineering Team at Cloudera
• Focuses on Hadoop and HBase performance and ...
3
Outline
• Apache HBase overview
• Measuring performance + YCSB basics
• Cluster setup best practices
• Techniques for ri...
4
HBase Overview
• Distributed, "NoSQL" key-value store
• Column-oriented, sorted map
• Keys are lexicographically sorted
...
5
Measuring HBase Performance is Hard!
• Numbers not reproducible
• Large run-to-run variation
• Testbeds not clearly defi...
6
Cluster is
down … sigh!
7
Workloads for Performance Measurement
• Set of transactions to be imposed against it
• read, update, insert, scan and mi...
8
Yahoo! Cloud Serving Benchmark (YCSB) Basics
• Performance evaluation framework for key-value
databases, such as:
• HBas...
9
YCSB Basics - Running YCSB
• Create a table called "usertable" in HBase
$ ycsb [load | run] hbase
-p workload=
com.yahoo...
10
YCSB Basics – YCSB Parameters
• Specified like so: '-p property=value’
• columnfamily, fieldcount, fieldlength
• record...
11
YCSB Basics - YCSB Output 1/2
2014-05-28 17:08:34:025 1310 sec: 2951422 operations; 2737.33 current
ops/sec; [READ Aver...
12
YCSB Basics - YCSB Output 2/2
[READ], 0, 2168499
[READ], 1, 445777
[READ], 2, 29748
[READ], 3, 32264
[READ], 4, 28154
[...
13
Cluster Setup Best Practices
• Setting up the cluster
• Configuring HBase
• Creating tables
• Pre-splitting tables
• Lo...
14
HBase Cluster Configuration Best Practices
• Use the appropriate hardware, correctly sized: memory, disk
• Dedicate sep...
15
16
Data Loading – Several Options
• Real, actual, production (hot) data 
• Custom loader
• PerformanceEvaluation
• Loadin...
17
Data Loading - Pre-split the Table
• Auto-splitting has significant overhead
• RegionSplitter utility
• UniformSplit
• ...
18
Techniques for Rigorous Measurement
• Keep the input data set fixed
• Warm up the cache
• Set the target throughput
• U...
19
Keep the Input Data Set Fixed!
20
Keep the Input Data Set Fixed!
A beginning is the time for taking the most
delicate care that the balances are correct....
21
Cluster is
down … sigh!
22
Warm Up the Cache
• Performance depends significantly on memory
• HBase block cache and OS page cache for reads
• Memst...
23
Warm Up the Cache
24
Set the Target Throughput
• Two parameters to set desired throughput
• -threads
• -target
• Actual throughput will matc...
25
Set the Target Throughput
26
Use the Appropriate Workload Distribution
• Various types possible
• Uniform (default, but unrealistic)
• Latest
• Hots...
27
Rigorous Measurement Techniques
• Set the cluster up properly
• Keep the input data set fixed
• Pre-split the key space...
28 ©2014 Cloudera, Inc. All rights reserved.
• Multi-tenant as in different compute frameworks
Multi-tenant HBase Performa...
29 ©2014 Cloudera, Inc. All rights reserved.
HBase in a Multi-tenant Environment
Integration
Storage
Resource Management
M...
30 ©2014 Cloudera, Inc. All rights reserved.
• Customer wants to do free-text search on data in HBase
• Explore relevant d...
31 ©2014 Cloudera, Inc. All rights reserved.
• Inevitable constraints
• More processing, different processing on the same ...
32 ©2014 Cloudera, Inc. All rights reserved.
• Configure HBase, Search, and MapReduce
• Large set of performance-relevant ...
33 ©2014 Cloudera, Inc. All rights reserved.
Start with stand-alone performance
• Stand-alone MR indexing rate of HBase  ...
34 ©2014 Cloudera, Inc. All rights reserved.
• Stand-alone MR indexing rate of HBase  Search
• Should be no lower than th...
35 ©2014 Cloudera, Inc. All rights reserved.
• MR indexing HBase  Solr while both are active
• Test efficiency, fairness,...
36 ©2014 Cloudera, Inc. All rights reserved.
• HBase essential to an enterprise data hub
• Need for multiple frameworks to...
37 ©2014 Cloudera, Inc. All rights reserved.
• gkamat@cloudera.com
• yanpei@cloudera.com
Thanks!
38 ©2014 Cloudera, Inc. All rights reserved.
Backup slides
39
Building YCSB
$ git clone http://github.com/brianfrankcooper/YCSB
$ mvn package –DskipTests
diff --git a/pom.xml b/pom....
40
Building YCSB (contd.)
diff --git a/hbase/pom.xml b/hbase/pom.xml
- <artifactId>hbase</artifactId>
+ <artifactId>hbase-...
Upcoming SlideShare
Loading in...5
×

Rigorous and Multi-tenant HBase Performance

1,267

Published on

Speakers: Govind Kamat and Yanpei Chen

Hadoop Summit 2014

Published in: Software, Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,267
On Slideshare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Rigorous and Multi-tenant HBase Performance

  1. 1. 1 Rigorous and Multi-tenant HBase Performance Govind Kamat, Yanpei Chen Performance Engineering
  2. 2. 2 Bio Govind Kamat • Member of the Performance Engineering Team at Cloudera • Focuses on Hadoop and HBase performance and scalability • Experience includes the development of large-scale software systems, microprocessor architecture, compilers and electronic design Yanpei Chen • Member of the Performance Engineering Team at Cloudera • Works on cross-component performance - Hadoop, HBase, Search and Impala • Ph.D. from UC Berkeley, focus on performance measurement method and theory
  3. 3. 3 Outline • Apache HBase overview • Measuring performance + YCSB basics • Cluster setup best practices • Techniques for rigorous measurement • HBase in a multi-tenant environment
  4. 4. 4 HBase Overview • Distributed, "NoSQL" key-value store • Column-oriented, sorted map • Keys are lexicographically sorted • Multiple regions across “regionservers” • Built on HDFS, MapReduce not required
  5. 5. 5 Measuring HBase Performance is Hard! • Numbers not reproducible • Large run-to-run variation • Testbeds not clearly defined/properly setup • Various workloads have been used • Configuration parameters not specified • State of regionservers not taken into account • Reported numbers not comparable
  6. 6. 6 Cluster is down … sigh!
  7. 7. 7 Workloads for Performance Measurement • Set of transactions to be imposed against it • read, update, insert, scan and mixes thereof • Initial data to be loaded into the DB • Insert • Transaction load intensity variation over time • Possible HBase workloads: • Actual customer/production workloads (best) • PerformanceEvaluation (not really a workload ) • YCSB (Yahoo! Cloud Serving Benchmark, commonly used)
  8. 8. 8 Yahoo! Cloud Serving Benchmark (YCSB) Basics • Performance evaluation framework for key-value databases, such as: • HBase, Cassandra, Sherpa, Accumulo, Voldemort • Abstracts out the client from the DB • Flexible and configurable • Comes with a standard “core” workload • Reports throughput and latency metrics
  9. 9. 9 YCSB Basics - Running YCSB • Create a table called "usertable" in HBase $ ycsb [load | run] hbase -p workload= com.yahoo.ycsb.workloads.CoreWorkload -p columnfamily=cf -p operationcount=1000000 -P workloads/randomWrite -threads 10 -s
  10. 10. 10 YCSB Basics – YCSB Parameters • Specified like so: '-p property=value’ • columnfamily, fieldcount, fieldlength • recordcount, operationcount • readproportion, updateproportion, scanproportion, .. • readallfields, writeallfields • requestdistribution • maxscanlength, scanlengthdistribution • maxexecutiontime
  11. 11. 11 YCSB Basics - YCSB Output 1/2 2014-05-28 17:08:34:025 1310 sec: 2951422 operations; 2737.33 current ops/sec; [READ AverageLatency(us)=8098.29] 2014-05-28 17:08:44:026 1320 sec: 2972315 operations; 2089.09 current ops/sec; [READ AverageLatency(us)=8671.15] [OVERALL], RunTime(ms), 1334884.0 [OVERALL], Throughput(ops/sec), 2247.3862897450267 [READ], Operations, 3000000 [READ], AverageLatency(us), 8876.560442666667 [READ], MinLatency(us), 205 [READ], MaxLatency(us), 2530720 [READ], 95thPercentileLatency(ms), 9 [READ], 99thPercentileLatency(ms), 15
  12. 12. 12 YCSB Basics - YCSB Output 2/2 [READ], 0, 2168499 [READ], 1, 445777 [READ], 2, 29748 [READ], 3, 32264 [READ], 4, 28154 [READ], 5, 26195 [READ], 6, 32222 [READ], 7, 39343 [READ], 8, 44038 [READ], 9, 41481 [...] [READ], >1000, 11925
  13. 13. 13 Cluster Setup Best Practices • Setting up the cluster • Configuring HBase • Creating tables • Pre-splitting tables • Loading data
  14. 14. 14 HBase Cluster Configuration Best Practices • Use the appropriate hardware, correctly sized: memory, disk • Dedicate separate nodes for master services and worker roles • No Task Trackers and Node Managers on regionserver nodes • Segregate clients from the regionservers • Configure HBase properly: • Block cache (read), memstore (write) • Bloom filters, compression, compaction, short-circuit reads, etc. • Use the appropriate data set size, number of regions, etc. • Monitor the cluster constantly
  15. 15. 15
  16. 16. 16 Data Loading – Several Options • Real, actual, production (hot) data  • Custom loader • PerformanceEvaluation • Loading using YCSB • HFileGenerator followed by bulk-load
  17. 17. 17 Data Loading - Pre-split the Table • Auto-splitting has significant overhead • RegionSplitter utility • UniformSplit • HexStringSplit • YCSB: user100000 .. user999999 hbase(main):1:0> create 'usertable', 'cf’, { SPLITS=> (1..(50-1)).map {|i| "user#{1000 + i*9000/50}" } } #50 splits • Set maximum region file size to a large value
  18. 18. 18 Techniques for Rigorous Measurement • Keep the input data set fixed • Warm up the cache • Set the target throughput • Use the correct workload distribution
  19. 19. 19 Keep the Input Data Set Fixed!
  20. 20. 20 Keep the Input Data Set Fixed! A beginning is the time for taking the most delicate care that the balances are correct. The manual of Muad’Dib From “Dune” by Frank Herbert
  21. 21. 21 Cluster is down … sigh!
  22. 22. 22 Warm Up the Cache • Performance depends significantly on memory • HBase block cache and OS page cache for reads • Memstore and WAL for writes • Load all the rows in the table • Write until data starts getting flushed • Compaction can affect performance significantly • Carry out long-running tests • Repeat till steady-state • Otherwise, performance can vary a lot
  23. 23. 23 Warm Up the Cache
  24. 24. 24 Set the Target Throughput • Two parameters to set desired throughput • -threads • -target • Actual throughput will match target throughput ... • ... until the DB hits its limit • Performance may then begin to degrade • This throughput defines maximum cluster performance • Can be used to evaluate different HBase releases • Otherwise, HBase is never stressed beyond saturation
  25. 25. 25 Set the Target Throughput
  26. 26. 26 Use the Appropriate Workload Distribution • Various types possible • Uniform (default, but unrealistic) • Latest • Hotspot • Zipfian
  27. 27. 27 Rigorous Measurement Techniques • Set the cluster up properly • Keep the input data set fixed • Pre-split the key space • Warm up the cache properly • Set the target throughput • Use the correct workload distribution • Monitor cluster statistics continually
  28. 28. 28 ©2014 Cloudera, Inc. All rights reserved. • Multi-tenant as in different compute frameworks Multi-tenant HBase Performance
  29. 29. 29 ©2014 Cloudera, Inc. All rights reserved. HBase in a Multi-tenant Environment Integration Storage Resource Management Metadata Processing Batch MR … Interactive SQL Impala Interactive Search Solr Interactive Serving HBase Machine Learning System Management Data Management Support Security
  30. 30. 30 ©2014 Cloudera, Inc. All rights reserved. • Customer wants to do free-text search on data in HBase • Explore relevant data beyond just key look-up • This is “multi-tenant” as in multiple frameworks • HBase + MapReduce + Cloudera Search (Apache Solr) • Data indexed into Solr via MapReduce (or Lily HBase Indexer) • Challenge is to not impact HBase and Solr performance Real Multi-tenant Use Case
  31. 31. 31 ©2014 Cloudera, Inc. All rights reserved. • Inevitable constraints • More processing, different processing on the same hardware • Multi-tenant performance of each framework < stand-alone perf. • Good multi-tenant performance means • Efficient - good aggregate performance across HBase/MR/Search • Fair - performance of each reflects assigned share of resources • Elastic - transient spare resources get quickly and fully used Multi-tenant Performance is Hard!
  32. 32. 32 ©2014 Cloudera, Inc. All rights reserved. • Configure HBase, Search, and MapReduce • Large set of performance-relevant parameters for each • Configure each for achieve a desired resource share • Many implicit resource controls • Setup the datasets for high performance • How many regions for the HBase table • How many shards for the Solr collection Practically doing HBase  Solr via MapReduce
  33. 33. 33 ©2014 Cloudera, Inc. All rights reserved. Start with stand-alone performance • Stand-alone MR indexing rate of HBase  Search • Should be no lower than that for HDFS  Search
  34. 34. 34 ©2014 Cloudera, Inc. All rights reserved. • Stand-alone MR indexing rate of HBase  Search • Should be no lower than that for HDFS  Search Start with stand-alone performance time MapReduce indexing HBase  Solr resource HBase, MR, Solr all idle HBase, MR, Solr all idle capacity
  35. 35. 35 ©2014 Cloudera, Inc. All rights reserved. • MR indexing HBase  Solr while both are active • Test efficiency, fairness, elasticity Multi-tenant Performance HBase transactions HBase transactions HBase transactions MR indexing HBase  Solr Search queries Search queriesSearch queries time resource capacity
  36. 36. 36 ©2014 Cloudera, Inc. All rights reserved. • HBase essential to an enterprise data hub • Need for multiple frameworks to analyze HBase data • Challenging to define/measure multi-tenant performance • Not tractable without rigorous techniques • Look for discipline and rigor in performance numbers! Recap
  37. 37. 37 ©2014 Cloudera, Inc. All rights reserved. • gkamat@cloudera.com • yanpei@cloudera.com Thanks!
  38. 38. 38 ©2014 Cloudera, Inc. All rights reserved. Backup slides
  39. 39. 39 Building YCSB $ git clone http://github.com/brianfrankcooper/YCSB $ mvn package –DskipTests diff --git a/pom.xml b/pom.xml - <maven.assembly.version>2.2.1</maven.assembly.version> - <hbase.version>0.92.1</hbase.version> + <maven.assembly.version>2.4</maven.assembly.version> + <hbase.version>0.98.1-hadoop2</hbase.version>
  40. 40. 40 Building YCSB (contd.) diff --git a/hbase/pom.xml b/hbase/pom.xml - <artifactId>hbase</artifactId> + <artifactId>hbase-client</artifactId> - <artifactId>hadoop-core</artifactId> - <version>1.0.0</version> + <artifactId>hadoop-common</artifactId> + <version>2.3.0</version>

×