HBase Consistency and Performance Improvements

  • June 13, 2012: HBase Consistency and Performance Improvements. Esteban Gutierrez, Gregory Chanan {esteban, gchanan}@cloudera.com
  • HBase Consistency •  ACID guarantees within a single row •  “Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time)”[1] [1] http://hbase.apache.org/acid-semantics.html ©2012 Cloudera, Inc. All Rights Reserved.
  • HBase Consistency Issues •  Write Consistency Issues •  Read Consistency Issues
  • Write Consistency HBASE-4552 •  Importing HFiles for multiple CFs was not an atomic operation
  • Write Consistency HBASE-4552 (< HBase 0.90.5): HRegion.bulkLoadHFile() loads HFile1 to HFile4, one per column family, into Row 1
      Scan   fam1:col1  fam2:col2  fam3:col3  fam4:col4
      T1     val1
      T2     val1       val2
      T3     val1       val2       val3
      T4     val1       val2       val3       val4
    Concurrent scans can observe a partially loaded row.
  • Write Consistency HBASE-4552 (≥ HBase 0.90.5): HRegion.bulkLoadHFiles() holds the region write lock across the whole multi-family load
      public void bulkLoadHFiles(List<Pair<byte[], String>> familyPaths) {
        ...
        startRegionOperation();          // ← lock.writeLock().lock()
        try {
          ...
        } finally {
          closeBulkRegionOperation();    // ← lock.writeLock().unlock()
        }
        ...
      }
    Scans at T1, T2 and T3 see none of the new values; the scan at T4 sees val1, val2, val3 and val4 in fam1:col1 to fam4:col4 together.
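The effect of the write lock can be sketched with a toy region. `ToyRegion` is hypothetical (it is not HBase's `HRegion`): its multi-family bulk load holds the write side of a `java.util.concurrent.locks.ReentrantReadWriteLock`, while scans take the read side, so a scan observes either none or all of the loaded column families. As on the slides, it assumes a single row and one value per family.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical single-row region used only for illustration; not HBase's HRegion.
class ToyRegion {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // "family:qualifier" -> value for the one row
    private final Map<String, String> row = new TreeMap<>();

    // Loads one {column, value} pair per family. The write lock is held for the
    // whole loop, so the multi-family load is atomic with respect to scans.
    void bulkLoad(List<String[]> columnValues) {
        lock.writeLock().lock();
        try {
            for (String[] cv : columnValues) {
                row.put(cv[0], cv[1]);
            }
        } finally {
            lock.writeLock().unlock(); // always released, like the finally block on the slide
        }
    }

    // A scan holds the read lock while copying the row, so it can never run
    // in the middle of a bulk load.
    Map<String, String> scan() {
        lock.readLock().lock();
        try {
            return new TreeMap<>(row);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this sketch a scan issued while the load is running simply waits for the write lock to be released and then returns the complete row; it never sees fam1 without fam2 to fam4.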
  • Read Consistency HBASE-2856 •  Seen only twice in the wild •  Hard to detect if application-level monitoring is not in place
  • Read Consistency HBASE-2856 •  Table size ≈ 50 M records •  Large number of CFs •  New records are continuously added to the table •  Concurrent MR Jobs on the same table •  Cluster has to meet strict SLAs
  • Read Consistency HBASE-2856 Symptoms
      Counter                                    Run 1     Run 2     Run 3
      …                                          …         …         …
      SPLIT_RAW_FILES                            …         …         …
      Map-Reduce Framework: Map output records   500,000   499,997   500,001
    Rows: cf1:col1 cf2:col2 cf3:col3 / cf1:col1 cf2:col2 cf3:col3 / cf1:col1 (some rows come back missing column families)
    •  Scale testing shows between 0.5% and 2% inconsistent results between runs
  • Read Consistency HBASE-2856 Impact •  Result is used to update user-facing records •  Customer is not happy: “Where is my data?”
  • Read Consistency HBASE-2856 Workarounds •  Re-try scan if not all CFs are present •  Re-submit job if any inconsistency is found •  Sometimes that is not possible: SLAs!
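The first workaround can be sketched in plain Java. The sketch assumes every row is expected to contain all of the listed column families, models a row as a map from family name to value, and uses a `Supplier` as a hypothetical stand-in for re-reading the row; it does not use the HBase client API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

// Sketch of the "re-try scan if not all CFs are present" workaround.
final class CompleteRowRetry {
    private CompleteRowRetry() {}

    // True if every expected column family appears in the row (family -> value).
    static boolean hasAllFamilies(Map<String, String> row, Set<String> expectedFamilies) {
        return row.keySet().containsAll(expectedFamilies);
    }

    // Re-reads the row up to maxAttempts times. An empty result tells the caller
    // the row never came back complete, e.g. so the job can be re-submitted.
    static Optional<Map<String, String>> readComplete(Supplier<Map<String, String>> fetchRow,
                                                      Set<String> expectedFamilies,
                                                      int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Map<String, String> row = fetchRow.get();
            if (hasAllFamilies(row, expectedFamilies)) {
                return Optional.of(row);
            }
        }
        return Optional.empty();
    }
}
```

Retrying only hides the symptom, which is why the slides treat it as a workaround: under strict SLAs there may be no time for extra reads or a re-submitted job.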
  • MVCC •  HBase maintains ACID semantics using Multiversion Concurrency Control •  Instead of overwriting state, create a new version of the object with a timestamp
      Timestamp  Row   fam1:col1  fam2:col2
      t2         row1  val2       val2
      t1         row1  val1       val1
    •  Reads never have to block •  Note this timestamp is not externally visible! Internally called “memStoreTs”
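A toy MVCC row makes the idea concrete. `MvccRow` is hypothetical: each put gets one internal write number (playing the role of memStoreTs) for all of its columns, writes add versions instead of overwriting, and a reader fixes a read point up front and returns, per column, the newest version at or below it without taking any lock. Writers are simply serialized with `synchronized`, which is a simplification, not how HBase assigns write numbers.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical MVCC row: writes add versions tagged with a write number
// (the role memStoreTs plays internally) instead of overwriting values.
class MvccRow {
    // column -> (write number -> value)
    private final Map<String, ConcurrentSkipListMap<Long, String>> versions = new ConcurrentHashMap<>();
    // Newest write number whose columns are all in place.
    private volatile long readPoint = 0;

    // Writers are serialized for simplicity. All columns of one put share a
    // write number and become visible together when readPoint advances.
    synchronized void put(Map<String, String> columns) {
        long writeNumber = readPoint + 1;
        for (Map.Entry<String, String> e : columns.entrySet()) {
            versions.computeIfAbsent(e.getKey(), k -> new ConcurrentSkipListMap<>())
                    .put(writeNumber, e.getValue());
        }
        readPoint = writeNumber; // publish the whole put at once
    }

    long currentReadPoint() {
        return readPoint;
    }

    // Lock-free read: for each column, the newest version at or below readPt.
    Map<String, String> get(long readPt) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, ConcurrentSkipListMap<Long, String>> e : versions.entrySet()) {
            Map.Entry<Long, String> v = e.getValue().floorEntry(readPt);
            if (v != null) {
                result.put(e.getKey(), v.getValue());
            }
        }
        return result;
    }
}
```

A reader that captured the read point after the first put keeps seeing val1 in both columns even after a second put writes val2, which is the consistent-view guarantee quoted at the start.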
  • HBase Write Path 1.  Write to the WAL (per RegionServer) 2.  Write to an in-memory sorted map, the MemStore (per Region + ColumnFamily) 3.  Flush the MemStore to disk as an HFile when it reaches the configurable hbase.hregion.memstore.flush.size
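The three steps can be sketched for a single region and column family. `ToyStore` is hypothetical and assumes a single writer: a `List` stands in for the WAL, a `ConcurrentSkipListMap` for the sorted MemStore, and an immutable sorted map for each flushed HFile. The flush threshold is counted in entries only to keep the example small; `hbase.hregion.memstore.flush.size` itself is a size in bytes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical write path for one region + column family (single writer assumed).
class ToyStore {
    final List<String> wal = new ArrayList<>();                            // stand-in for the WAL
    final List<NavigableMap<String, String>> hfiles = new ArrayList<>();   // flushed, immutable "HFiles"
    private NavigableMap<String, String> memStore = new ConcurrentSkipListMap<>(); // sorted in-memory map
    private final int flushThreshold; // entries here; the real setting is a size in bytes

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);           // 1. append to the log first
        memStore.put(rowKey, value);             // 2. insert into the sorted MemStore
        if (memStore.size() >= flushThreshold) { // 3. flush once the threshold is reached
            flush();
        }
    }

    private void flush() {
        // Freeze the sorted contents as a new immutable file and start a fresh MemStore.
        hfiles.add(Collections.unmodifiableNavigableMap(new TreeMap<>(memStore)));
        memStore = new ConcurrentSkipListMap<>();
    }

    int memStoreSize() {
        return memStore.size();
    }
}
```

Note what this simple flush copies: only keys and values. Any per-cell MVCC information kept alongside the MemStore would be lost unless the flush writes it out too.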
  • Internals / Bug: Now that we know the internals, what could go wrong?
  • Putting it together: Let’s go back to the beginning…
      1. The MemStore holds row1 at t1 (val1, val1), and a scan starts.
      2. Concurrently, a put writes row1 at t2 (val2, val2).
      3. The put causes a flush, which writes fam2:col2 to an HFile.
      MemStore                                  HFile
      Timestamp  Row   fam1:col1  fam2:col2     Row   fam2:col2
      t2         row1  val2       val2          row1  val2
      t1         row1  val1       val1          row1  val1
  • Putting it together: Now, the scan needs to make sense of this…
      MemStore                  HFile
      Ts   Row   fam1:col1      Row   fam2:col2
      t2   row1  val2           row1  val2
      t1   row1  val1           row1  val1
    But the HFile has no timestamp! The scan started before t2, so it correctly reads val1 for fam1:col1 from the MemStore, but it cannot tell which HFile version is too new and returns val2 for fam2:col2:
      Inconsistent result:  Row   fam1:col1  fam2:col2
                            row1  val1       val2
  • Solution: Store the timestamp in the HFile
      MemStore                  HFile
      Ts   Row   fam1:col1      Ts   Row   fam2:col2
      t2   row1  val2           t2   row1  val2
      t1   row1  val1           t1   row1  val1
      Correct result:  Row   fam1:col1  fam2:col2
                       row1  val1       val1
    Now we have all the information we need
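The bug and the fix can be reproduced for the fam2:col2 column in a few lines. In this hypothetical model each `Cell` carries a write order (`seq`) and an MVCC write number (`memStoreTs`); a flush that drops the write number stores 0, which in this model every scan treats as visible, so a scan whose read point is t1 picks the newest flushed version. Keeping the write number in the flushed file restores the t1 view. This is a simplification for illustration, not the HFile format or HBase's scanner code.

```java
import java.util.ArrayList;
import java.util.List;

// One version of a single column (fam2:col2) in a hypothetical model.
final class Cell {
    final long seq;        // write order: a newer version has a larger seq
    final long memStoreTs; // MVCC write number; 0 means "visible to every scan" in this model
    final String value;

    Cell(long seq, long memStoreTs, String value) {
        this.seq = seq;
        this.memStoreTs = memStoreTs;
        this.value = value;
    }
}

final class FlushModel {
    private FlushModel() {}

    // Copies MemStore cells into an "HFile". With keepMemStoreTs == false the
    // write number is dropped (stored as 0), which models the bug.
    static List<Cell> flush(List<Cell> memStore, boolean keepMemStoreTs) {
        List<Cell> hfile = new ArrayList<>();
        for (Cell c : memStore) {
            hfile.add(new Cell(c.seq, keepMemStoreTs ? c.memStoreTs : 0L, c.value));
        }
        return hfile;
    }

    // Newest version that is visible at the scan's read point, or null if none.
    static String read(List<Cell> cells, long readPoint) {
        Cell best = null;
        for (Cell c : cells) {
            if (c.memStoreTs <= readPoint && (best == null || c.seq > best.seq)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }
}
```

With the write number dropped, a scan at read point t1 combines val1 (fam1:col1, still in the MemStore) with val2 (fam2:col2, from the HFile), which is the inconsistent row shown earlier; with it kept, both columns come back as val1.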
  • Consistency •  These are only some of the consistency issues in 0.90, e.g. HBASE-5121: MajorCompaction may affect scan correctness •  Solution: Upgrade to 0.92 or 0.94
  • HBase 0.94 “Performance Release”
  • Performance Improvements in 0.94 •  HBASE-5047 Support checksums in HBase block cache •  HBASE-5199 Delete out of TTL store files before compaction selection •  HBASE-4608 HLog Compression •  HBASE-4465 Lazy-seek optimization for StoreFile scanners
  • HBASE-5047 •  HDFS stores checksums in a separate file (diagram: HFile alongside its Checksum file) •  So each file read actually requires two disk I/O operations •  HBase is often bottlenecked by random disk IOPS
  • HBASE-5047 Solution •  Store the checksum in the HFile block (diagram: HFile → HFile Block = Chksum + Data) •  On by default ("hbase.regionserver.checksum.verify") •  Bytes per checksum ("hbase.hstore.bytes.per.checksum"): default is 16K
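Keeping checksums next to the data they protect can be sketched with `java.util.zip.CRC32`. The block layout used here (the data followed by one 4-byte CRC per 16K chunk) is an assumption for illustration, not the actual HFile block format; the point is that a single read returns both the data and its checksums, so no second file has to be read to verify it.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical inline-checksum block: [data][one CRC32 per bytesPerChecksum chunk].
final class InlineChecksums {
    static final int BYTES_PER_CHECKSUM = 16 * 1024; // mirrors the 16K default on the slide

    private InlineChecksums() {}

    // Builds a block whose checksums travel with the data.
    static byte[] writeBlock(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        ByteBuffer buf = ByteBuffer.allocate(data.length + 4 * chunks);
        buf.put(data);
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChecksum;
            buf.putInt(crc(data, off, Math.min(bytesPerChecksum, data.length - off)));
        }
        return buf.array();
    }

    // Verifies every chunk using only the bytes of this one block.
    static boolean verifyBlock(byte[] block, int dataLength, int bytesPerChecksum) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        int chunks = (dataLength + bytesPerChecksum - 1) / bytesPerChecksum;
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, dataLength - off);
            if (buf.getInt(dataLength + 4 * i) != crc(block, off, len)) {
                return false;
            }
        }
        return true;
    }

    private static int crc(byte[] bytes, int off, int len) {
        CRC32 c = new CRC32();
        c.update(bytes, off, len);
        return (int) c.getValue(); // keep the low 32 bits of the CRC
    }
}
```

The chunk size trades space for granularity: a smaller bytes-per-checksum value adds more 4-byte checksums per block but narrows down where a corruption lies.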
  • HBASE-5199 •  Users can specify a TTL per column family •  If all values in an HFile are expired, delete the HFile rather than compact it •  Off by default; turn on via "hbase.store.delete.expired.storefile"
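Selecting fully expired files can be sketched as follows. `StoreFileInfo` and `ExpiredFileSelector` are hypothetical, and the sketch assumes each store file records the newest cell timestamp it contains: a file whose newest cell is older than `now - ttl` holds only expired data. The `enabled` flag stands in for `hbase.store.delete.expired.storefile`, which the slide says is off by default; the exact boundary comparison HBase uses is not shown on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical summary of a store file: name plus the newest cell timestamp it holds.
final class StoreFileInfo {
    final String name;
    final long maxTimestampMs;

    StoreFileInfo(String name, long maxTimestampMs) {
        this.name = name;
        this.maxTimestampMs = maxTimestampMs;
    }
}

final class ExpiredFileSelector {
    private ExpiredFileSelector() {}

    // Returns the files that can be deleted outright instead of compacted:
    // their newest cell is already older than now - ttl, so every cell is expired.
    // 'enabled' stands in for hbase.store.delete.expired.storefile (off by default).
    static List<StoreFileInfo> selectExpired(List<StoreFileInfo> files, long ttlMs,
                                             long nowMs, boolean enabled) {
        List<StoreFileInfo> expired = new ArrayList<>();
        if (!enabled) {
            return expired;
        }
        long cutoff = nowMs - ttlMs;
        for (StoreFileInfo f : files) {
            if (f.maxTimestampMs < cutoff) {
                expired.add(f);
            }
        }
        return expired;
    }
}
```

Checking only each file's newest timestamp keeps the decision cheap: no cell has to be read to know that a whole file can be dropped before compaction selection runs.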
  • Conclusion •  Most consistency issues are fixed in 0.92 / CDH4 •  Performance improvements in 0.94 •  0.94 is wire compatible with 0.92, so it will be in a CDH4 update
  • References •  HBase ACID Semantics. http://hbase.apache.org/acid-semantics.html •  Apache HBase Meetup @ SU, Michael Stack. http://files.meetup.com/1350427/20120327hbase_meetup.pdf •  HBase Internals, Lars Hofhansl. http://www.cloudera.com/resource/hbasecon-2012-learning-hbase-internals/