HBase Consistency and Performance Improvements

Presentation Transcript

  • 1. June 13, 2012. HBase Consistency and Performance Improvements. Esteban Gutierrez, Gregory Chanan, {esteban, gchanan}@cloudera.com
  • 2. HBase Consistency • ACID guarantees within a single row • “Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time)” [1]. [1] http://hbase.apache.org/acid-semantics.html
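
    For context, a minimal client-side illustration of that guarantee (the table and column-family names below are made up, not from the deck): a single Put that touches several column families of one row is applied atomically, so a concurrent reader sees either all of the new cells or none of them.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SingleRowAtomicity {
          public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "demo_table");   // hypothetical table name
            try {
              // One Put spanning two column families of the SAME row: applied atomically.
              Put put = new Put(Bytes.toBytes("row1"));
              put.add(Bytes.toBytes("fam1"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
              put.add(Bytes.toBytes("fam2"), Bytes.toBytes("col2"), Bytes.toBytes("val1"));
              table.put(put);

              // A reader gets either both new cells or neither, never a half-applied row.
              Result r = table.get(new Get(Bytes.toBytes("row1")));
              System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("fam1"), Bytes.toBytes("col1"))));
            } finally {
              table.close();
            }
          }
        }
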
  • 3. HBase Consistency Issues • Write Consistency Issues • Read Consistency Issues
  • 4. Write Consistency: HBASE-4552 • Importing multiple CFs’ HFiles is not an atomic operation
  • 5. Write Consistency: HBASE-4552 • Importing multiple CFs’ HFiles was not an atomic operation (note the past tense: the issue has been fixed)
  • 6. Write Consistency: HBASE-4552 (< HBase 0.90.5). HRegion.bulkLoadHFile() loads one HFile at a time. With HFile1-HFile4 holding Row 1’s fam1:col1, fam2:col2, fam3:col3, fam4:col4, concurrent scans see the row appear piecemeal: T1 scan: val1; T2 scan: val1, val2; T3 scan: val1, val2, val3; T4 scan: val1, val2, val3, val4.
  • 7. Write Consistency: HBASE-4552 (≥ HBase 0.90.5). HRegion.bulkLoadHFiles() loads all families under a single region operation:
        public void bulkLoadHFiles(List<Pair<byte[], String>> familyPaths) {
          ...
          startRegionOperation();          // <- lock.writeLock().lock()
          try {
            ...                            // load each family's HFile
          } finally {
            closeBulkRegionOperation();
          }
        }
    Concurrent scans (T1-T4 in the diagram) see none of the new values while the load is in progress.
  • 8. Write Consistency: HBASE-4552 (≥ HBase 0.90.5). Same code path, second half: closeBulkRegionOperation() in the finally block releases the lock (<- lock.writeLock().unlock()). Scans issued while the lock is held still see none of the new values.
  • 9. Write Consistency: HBASE-4552 (≥ HBase 0.90.5). Once the bulk load completes and the lock is released, the next scan (T4) sees all four values at once: val1, val2, val3, val4. The idea is sketched below.
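
    The shape of the fix, reduced to a standalone sketch (the class and method names here are illustrative, not HBase's actual HRegion internals): every per-family load happens while the region's write-level lock is held, so a scanner can never observe a half-loaded row.

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Illustrative stand-in for a region; not HBase's actual HRegion.
        public class BulkLoadSketch {
          private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
          private final Map<String, String> storesByFamily = new HashMap<String, String>();

          // Loads one HFile per column family as a single atomic unit.
          public void bulkLoadHFiles(List<String[]> familyAndPath) {
            lock.writeLock().lock();        // the slide annotates startRegionOperation() as lock.writeLock().lock()
            try {
              for (String[] fp : familyAndPath) {
                storesByFamily.put(fp[0], fp[1]);   // "load" this family's HFile
              }
            } finally {
              lock.writeLock().unlock();    // the slide annotates closeBulkRegionOperation() as lock.writeLock().unlock()
            }
          }

          // A scan takes the read lock, so it runs either before the load or after it, never in between.
          public Map<String, String> scanRow() {
            lock.readLock().lock();
            try {
              return new HashMap<String, String>(storesByFamily);
            } finally {
              lock.readLock().unlock();
            }
          }
        }
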
  • 10. Read Consistency: HBASE-2856 • Seen only twice in the wild • Hard to detect if application-level monitoring is not in place
  • 11. Read Consistency: HBASE-2856 • Table size ≈ 50 M records • Large number of CFs • New records are continuously added to the table • Concurrent MR jobs on the same table • Cluster has to meet strict SLAs
  • 12. Read Consistency: HBASE-2856 Symptoms. Run 1 counter dump (… SPLIT_RAW_FILES …, Map-Reduce Framework): Map output records = 500,000
  • 13. Read Consistency: HBASE-2856 Symptoms. Run 2 of the same job: Map output records = 499,997
  • 14. Read Consistency: HBASE-2856 Symptoms. Run 3: Map output records = 500,001
  • 15. Read Consistency: HBASE-2856 Symptoms. On top of the varying counts (500,000 / 499,997 / 500,001), rows come back with column families missing: Runs 1 and 2 return cf1:col1, cf2:col2, cf3:col3 while Run 3 returns only cf1:col1
  • 16. Read Consistency: HBASE-2856 Symptoms. Scale testing shows between 0.5% and 2% of results are inconsistent between runs
  • 17. Read Consistency: HBASE-2856 Impact • The result is used to update user-facing records • The customer is not happy
  • 18. Read Consistency: HBASE-2856 Impact • The result is used to update user-facing records • The customer is not happy: “Where is my data?”
  • 19. Read Consistency: HBASE-2856 Workarounds • Retry the scan if not all CFs are present • Re-submit the job if any inconsistency is found
  • 20. Read Consistency: HBASE-2856 Workarounds • Retry the scan if not all CFs are present • Re-submit the job if any inconsistency is found • Sometimes that is not possible
  • 21. Read Consistency: HBASE-2856 Workarounds • Retry the scan if not all CFs are present • Re-submit the job if any inconsistency is found • Sometimes that is not possible: SLAs!
  • 22. MVCC • HBase maintains ACID semantics using Multiversion Concurrency Control • Instead of overwriting state, create a new version of the object with a timestamp:
        Timestamp  Row   fam1:col1  fam2:col2
        t1         row1  val1       val1
  • 23. MVCC • Instead of overwriting state, create a new version of the object with a timestamp:
        Timestamp  Row   fam1:col1  fam2:col2
        t2         row1  val2       val2
        t1         row1  val1       val1
    • Reads never have to block • Note this timestamp is not externally visible! Internally it is called the “memStoreTs”
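
    A toy model of that idea (an illustration of multiversion concurrency control in general, not HBase's actual MultiVersionConsistencyControl class): each write gets a monotonically increasing write number, and a reader fixes a read point when it starts and only sees versions at or below that point.

        import java.util.Collections;
        import java.util.Map;
        import java.util.NavigableMap;
        import java.util.concurrent.ConcurrentSkipListMap;
        import java.util.concurrent.atomic.AtomicLong;

        // Toy multiversion store: one logical cell, many timestamped versions.
        public class MvccSketch {
          private final AtomicLong writeNumber = new AtomicLong(0);
          // write number -> value, newest first thanks to the reversed comparator.
          private final NavigableMap<Long, String> versions =
              new ConcurrentSkipListMap<Long, String>(Collections.<Long>reverseOrder());

          // A write never overwrites: it appends a new version tagged with the next write number.
          public void put(String value) {
            versions.put(writeNumber.incrementAndGet(), value);
          }

          // A reader ignores anything written after its read point, so it never blocks
          // and never sees a state that did not exist when it started.
          public String get(long readPoint) {
            for (Map.Entry<Long, String> e : versions.entrySet()) {
              if (e.getKey() <= readPoint) {
                return e.getValue();        // newest version visible at this read point
              }
            }
            return null;
          }

          public long currentReadPoint() {
            return writeNumber.get();
          }

          public static void main(String[] args) {
            MvccSketch cell = new MvccSketch();
            cell.put("val1");                           // write number 1
            long readPoint = cell.currentReadPoint();   // a scan starts here
            cell.put("val2");                           // write number 2, after the scan began
            System.out.println(cell.get(readPoint));                // val1: the scan's consistent view
            System.out.println(cell.get(cell.currentReadPoint()));  // val2: a later reader sees the new version
          }
        }
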
  • 24. HBase Write Path: 1. Write to the WAL (per RegionServer). 2. Write to an in-memory sorted map, the MemStore (per Region + ColumnFamily). 3. Flush the MemStore to disk as an HFile when it reaches the configurable hbase.hregion.memstore.flush.size
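
    In miniature, those three steps look roughly like this (a deliberately simplified model, not HBase's actual RegionServer code; the tiny flush threshold below is purely illustrative):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ConcurrentSkipListMap;

        // Simplified single-family write path: WAL append, MemStore insert, size-based flush.
        public class WritePathSketch {
          private final List<String> wal = new ArrayList<String>();              // stands in for the per-RegionServer WAL
          private final ConcurrentSkipListMap<String, String> memstore =
              new ConcurrentSkipListMap<String, String>();                       // the per Region+ColumnFamily sorted map
          private final List<ConcurrentSkipListMap<String, String>> hfiles =
              new ArrayList<ConcurrentSkipListMap<String, String>>();            // flushed, immutable "HFiles"
          private final long flushSize;                                          // plays the role of hbase.hregion.memstore.flush.size
          private long memstoreBytes = 0;

          public WritePathSketch(long flushSize) {
            this.flushSize = flushSize;
          }

          public void put(String rowAndColumn, String value) {
            wal.add(rowAndColumn + "=" + value);          // 1. append to the WAL first, for durability
            memstore.put(rowAndColumn, value);            // 2. insert into the in-memory sorted map
            memstoreBytes += rowAndColumn.length() + value.length();
            if (memstoreBytes >= flushSize) {             // 3. flush once the MemStore is big enough
              hfiles.add(new ConcurrentSkipListMap<String, String>(memstore));
              memstore.clear();
              memstoreBytes = 0;
            }
          }

          public static void main(String[] args) {
            WritePathSketch region = new WritePathSketch(32);   // tiny threshold so the flush triggers quickly
            region.put("row1/fam1:col1", "val1");
            region.put("row1/fam2:col2", "val1");
            region.put("row2/fam1:col1", "val2");               // this write pushes the MemStore over the threshold
          }
        }
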
  • 25. Internals / Bug. Now that we know the internals, what could go wrong?
  • 26. Putting it together. Let’s go back to the beginning…
        MemStore:
        Timestamp  Row   fam1:col1  fam2:col2
        t1         row1  val1       val1
  • 27. Putting it together. Let’s go back to the beginning…
        MemStore:
        Timestamp  Row   fam1:col1  fam2:col2
        t1         row1  val1       val1
    And start a scan.
  • 28. Putting it together. Let’s go back to the beginning…
        MemStore:
        Timestamp  Row   fam1:col1  fam2:col2
        t2         row1  val2       val2
        t1         row1  val1       val1
    And start a scan. And concurrently put.
  • 29. Putting it together. Let’s go back to the beginning…
        MemStore:
        Timestamp  Row   fam1:col1  fam2:col2
        t2         row1  val2       val2
        t1         row1  val1       val1
        HFile (fam2:col2):
        row1  val2
        row1  val1
    And start a scan. And concurrently put. Which causes a flush.
  • 30. Putting it together. Now the scan needs to make sense of this…
        MemStore:
        Ts  Row   fam1:col1
        t2  row1  val2
        t1  row1  val1
        HFile (fam2:col2):
        row1  val2
        row1  val1
  • 31. Putting it together. Now the scan needs to make sense of this…
        MemStore:
        Ts  Row   fam1:col1
        t2  row1  val2
        t1  row1  val1
        HFile (fam2:col2):
        row1  val2
        row1  val1
    But the HFile has no timestamp!
  • 32. Putting it together. Now the scan needs to make sense of this…
        MemStore:
        Ts  Row   fam1:col1
        t2  row1  val2
        t1  row1  val1
        HFile (fam2:col2):
        row1  val2
        row1  val1
        Inconsistent Result:
        Row   fam1:col1  fam2:col2
        row1  val1       val2
    But the HFile has no timestamp!
  • 33. Solution: store the timestamp in the HFile.
        MemStore:
        Ts  Row   fam1:col1
        t2  row1  val2
        t1  row1  val1
        HFile:
        Ts  Row   fam2:col2
        t2  row1  val2
        t1  row1  val1
    With the memstore timestamp persisted on both sides, the scan can assemble a correct, consistent version of row1. Now we have all the information we need.
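
    A sketch of how a scan can use that stored timestamp (again a simplification, not the real StoreFile scanner code): whether a cell comes from the MemStore or from an HFile, it is visible only if its memstoreTs is at or below the scanner's read point.

        // Simplified visibility check a scanner could apply uniformly to MemStore and HFile cells.
        public class ReadPointFilter {

          static class Cell {
            final String value;
            final long memstoreTs;   // now persisted in the HFile as well, as in the solution above
            Cell(String value, long memstoreTs) {
              this.value = value;
              this.memstoreTs = memstoreTs;
            }
          }

          // A cell is visible to a scan only if it was written at or before the scan's read point.
          static boolean isVisible(Cell cell, long readPoint) {
            return cell.memstoreTs <= readPoint;
          }

          public static void main(String[] args) {
            long readPoint = 1;                          // the scan started before the put at t2
            Cell fromMemstore = new Cell("val2", 2);     // fam1:col1, written after the scan began
            Cell fromHFile = new Cell("val2", 2);        // fam2:col2, flushed with its memstoreTs intact
            System.out.println(isVisible(fromMemstore, readPoint));  // false: filtered out
            System.out.println(isVisible(fromHFile, readPoint));     // false: now filtered out too
          }
        }
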
  • 34. Consistency • These are only some of the consistency issues in 0.90 (e.g. HBASE-5121: MajorCompaction may affect scan’s correctness) • Solution: upgrade to 0.92 or 0.94
  • 35. HBase 0.94: the “Performance Release”
  • 36. Performance Improvements in 0.94 • HBASE-5047 Support checksums in HBase block cache • HBASE-5199 Delete out of TTL store files before compaction selection • HBASE-4608 HLog Compression • HBASE-4465 Lazy-seek optimization for StoreFile scanners
  • 37. Performance Improvements in 0.94 (same list, repeated as a build; next up: HBASE-5047, checksums in the HBase block cache)
  • 38. HBASE-5047 • HDFS stores checksums in a separate file from the HFile data • So each file read actually requires two disk iops • HBase is often bottlenecked by random disk iops
  • 39. HBASE-5047 • Solution: store the checksum in the HFile block itself (block = checksum + data) • On by default (“hbase.regionserver.checksum.verify”) • Bytes per checksum (“hbase.hstore.bytes.per.checksum”), default is 16K
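
    These are server-side settings that would normally live in hbase-site.xml on the RegionServers; the snippet below only spells out the property names and the 16K default quoted on the slide, using the Hadoop Configuration API.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class ChecksumSettings {
          public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // HBASE-5047: verify HBase-level checksums stored inside HFile blocks
            // (avoids the extra HDFS checksum-file read); on by default per the slide.
            conf.setBoolean("hbase.regionserver.checksum.verify", true);
            // Number of data bytes covered by each checksum; 16K is the default quoted above.
            conf.setInt("hbase.hstore.bytes.per.checksum", 16 * 1024);
          }
        }
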
  • 40. Performance Improvements in 0.94 (same list again; next up: HBASE-5199, delete out of TTL store files before compaction selection)
  • 41. HBASE-5199 • Users can specify a TTL per column family • If all values in an HFile are expired, delete the HFile rather than compacting it • Off by default; turn on via “hbase.store.delete.expired.storefile”
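
    For illustration, setting a per-column-family TTL looks like this with the 0.92/0.94-era client API (the table and family names are made up); the hbase.store.delete.expired.storefile switch itself is a server-side setting, shown here only to spell out the property name.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.client.HBaseAdmin;

        public class TtlTableExample {
          public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Server-side switch from the slide, normally set in hbase-site.xml (off by default).
            conf.setBoolean("hbase.store.delete.expired.storefile", true);

            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
              HTableDescriptor table = new HTableDescriptor("events");   // hypothetical table
              HColumnDescriptor family = new HColumnDescriptor("raw");   // hypothetical family
              family.setTimeToLive(7 * 24 * 60 * 60);                    // expire cells after 7 days
              table.addFamily(family);
              admin.createTable(table);
            } finally {
              admin.close();
            }
          }
        }
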
  • 42. Conclusion • Most consistency issues are fixed in 0.92 / CDH4 • Performance improvements land in 0.94 • 0.94 is wire-compatible with 0.92, so it will be in a CDH4 update
  • 43. References • HBase ACID Semantics, http://hbase.apache.org/acid-semantics.html • Apache HBase Meetup @ SU, Michael Stack, http://files.meetup.com/1350427/20120327hbase_meetup.pdf • HBase Internals, Lars Hofhansl, http://www.cloudera.com/resource/hbasecon-2012-learning-hbase-internals/