HBase Consistency and Performance Improvements

Published in: Health & Medicine, Technology


  1. HBase Consistency and Performance Improvements
     June 13, 2012
     Esteban Gutierrez, Gregory Chanan, {esteban, gchanan}@cloudera.com
  2. HBase Consistency
     • ACID guarantees within a single row
     • “Any row returned by the scan will be a consistent view (i.e. that version of the complete row existed at some point in time)” [1]
     [1] http://hbase.apache.org/acid-semantics.html
     ©2012 Cloudera, Inc. All Rights Reserved.
  3. HBase Consistency Issues
     • Write Consistency Issues
     • Read Consistency Issues
  4–5. Write Consistency HBASE-4552
     • Importing HFiles for multiple CFs was not an atomic operation
  6. Write Consistency HBASE-4552: HRegion.bulkLoadHFile() (< HBase 0.90.5)
     HFile1 to HFile4 each carry one column family of Row 1. Values returned by each scan:
       Scan  fam1:col1  fam2:col2  fam3:col3  fam4:col4
       T1    val1
       T2    val1       val2
       T3    val1       val2       val3
       T4    val1       val2       val3       val4
  7–9. Write Consistency HBASE-4552: HRegion.bulkLoadHFiles() (≥ HBase 0.90.5)
     HFile1 to HFile4 each carry one column family of Row 1.
       public void bulkLoadHFiles(List<Pair<byte[], String>> familyPaths) {
         ...
         startRegionOperation();        // ← lock.writeLock().lock()
         } finally {
           closeBulkRegionOperation();  // ← lock.writeLock().unlock()
         }
         ...
     Values returned by each scan:
       Scan  fam1:col1  fam2:col2  fam3:col3  fam4:col4
       T1
       T2
       T3
       T4    val1       val2       val3       val4
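
The locking pattern on the slides above can be sketched in plain Java. This is a minimal illustration, not HBase code: the class BulkLoadSketch, its single-row map, and the helper scansSeeAllOrNothing() are hypothetical names introduced here. It assumes scans take the read side of the same lock that the bulk load takes on the write side.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a region holding one row; this is not HBase's HRegion.
class BulkLoadSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> row = new LinkedHashMap<>(); // column -> value

    // Loads every family while holding the write lock, so no scan can run in between.
    void bulkLoadHFiles(Map<String, String> familyValues) {
        lock.writeLock().lock();
        try {
            for (Map.Entry<String, String> e : familyValues.entrySet()) {
                row.put(e.getKey(), e.getValue());
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // A scan holds the read lock while it copies the row.
    Map<String, String> scan() {
        lock.readLock().lock();
        try {
            return new LinkedHashMap<>(row);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Scans concurrently with a bulk load and checks that each scan saw 0 or 4 columns.
    static boolean scansSeeAllOrNothing() {
        BulkLoadSketch region = new BulkLoadSketch();
        Map<String, String> files = new LinkedHashMap<>();
        for (int i = 1; i <= 4; i++) {
            files.put("fam" + i + ":col" + i, "val" + i);
        }
        Thread loader = new Thread(() -> region.bulkLoadHFiles(files));
        loader.start();
        boolean ok = true;
        while (loader.isAlive()) {
            int columns = region.scan().size();
            ok &= (columns == 0 || columns == 4);
        }
        return ok && region.scan().size() == 4;
    }

    public static void main(String[] args) {
        System.out.println("all-or-nothing scans: " + scansSeeAllOrNothing());
    }
}
```

Without the write lock around the whole loop, a scan could interleave between two put calls and observe one to three columns, which is the partial view shown for versions before 0.90.5.
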
  10. Read Consistency HBASE-2856
     • Seen only twice in the wild
     • Hard to detect if application monitoring is not implemented
  11. Read Consistency HBASE-2856
     • Table size ≈ 50 M records
     • Large number of CFs
     • New records are continuously added to the table
     • Concurrent MR jobs on the same table
     • Cluster has to meet strict SLAs
  12–16. Read Consistency HBASE-2856: Symptoms
     MapReduce job counters across runs:
       Counter                                    Run 1    Run 2    Run 3
       SPLIT_RAW_FILES                            …        …        …
       Map-Reduce Framework, Map output records   500,000  499,997  500,001
     Columns seen per row: cf1:col1 cf2:col2 cf3:col3; cf1:col1 cf2:col2 cf3:col3; cf1:col1
     • Scale testing shows between 0.5% and 2% inconsistent results between runs
  17–18. Read Consistency HBASE-2856: Impact
     • Result is used to update user-facing records
     • Customer is not happy: “Where is my data?”
  19–21. Read Consistency HBASE-2856: Workarounds
     • Re-try the scan if not all CFs are present
     • Re-submit the job if any inconsistency is found
     • Sometimes that is not possible: SLAs!
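
The first workaround (re-try when not all CFs are present) could look roughly like the following. It is a sketch under stated assumptions: rows are modeled as a plain map from column family to value rather than an HBase client result, and RetryIncompleteRow, readComplete, and demo are hypothetical names, not part of any HBase API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class RetryIncompleteRow {
    // Re-reads a row until every expected column family is present, up to maxAttempts.
    static Map<String, String> readComplete(Supplier<Map<String, String>> reader,
                                            Set<String> expectedFamilies,
                                            int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Map<String, String> row = reader.get();
            if (row.keySet().containsAll(expectedFamilies)) {
                return row;
            }
        }
        throw new IllegalStateException(
            "row still incomplete after " + maxAttempts + " attempts");
    }

    // Hypothetical reader: the first read misses cf2 and cf3, the second is complete.
    static Map<String, String> demo() {
        Iterator<Map<String, String>> reads = List.of(
            Map.of("cf1", "a"),
            Map.of("cf1", "a", "cf2", "b", "cf3", "c")).iterator();
        return readComplete(reads::next, Set.of("cf1", "cf2", "cf3"), 3);
    }

    public static void main(String[] args) {
        System.out.println(demo().size() + " families read");
    }
}
```

This only works when the application knows which families every row must contain, and each retry costs latency, which is why the slides note that strict SLAs can rule it out.
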
  22–23. MVCC
     • HBase maintains ACID semantics using Multiversion Concurrency Control
     • Instead of overwriting state, create a new version of the object with a timestamp
       Timestamp  Row   fam1:col1  fam2:col2
       t2         row1  val2       val2
       t1         row1  val1       val1
     • Reads never have to block
     • Note this timestamp is not externally visible! Internally called “memStoreTs”
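
The MVCC idea above can be illustrated with a toy single-cell store. This is a simplified sketch, not HBase's implementation: MvccSketch and its methods are hypothetical, writes are assumed to commit in order, and the write number plays the role of the internal timestamp the slide calls “memStoreTs”.

```java
import java.util.ArrayList;
import java.util.List;

class MvccSketch {
    // One version of the cell, tagged with an internal write number.
    private static final class Version {
        final long writeNumber;
        final String value;
        Version(long writeNumber, String value) {
            this.writeNumber = writeNumber;
            this.value = value;
        }
    }

    private final List<Version> versions = new ArrayList<>(); // oldest first
    private long nextWriteNumber = 1;
    private long readPoint = 0; // highest committed write number

    // Adds a new version instead of overwriting; it is not visible until committed.
    synchronized long put(String value) {
        long writeNumber = nextWriteNumber++;
        versions.add(new Version(writeNumber, value));
        return writeNumber;
    }

    // Makes the write visible to readers opened from now on.
    synchronized void commit(long writeNumber) {
        readPoint = Math.max(readPoint, writeNumber);
    }

    // A reader remembers the read point at the time it was opened.
    synchronized long openReader() {
        return readPoint;
    }

    // Returns the newest version at or below the reader's read point.
    // (synchronized only guards the toy list; readers never wait for a commit.)
    synchronized String read(long readerPoint) {
        String visible = null;
        for (Version v : versions) {
            if (v.writeNumber <= readerPoint) {
                visible = v.value;
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        MvccSketch cell = new MvccSketch();
        cell.commit(cell.put("val1"));
        long reader = cell.openReader();
        cell.commit(cell.put("val2"));
        System.out.println(cell.read(reader) + " " + cell.read(cell.openReader()));
    }
}
```

A reader opened before the second put keeps seeing val1, while a reader opened afterwards sees val2.
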
  24. HBase Write Path
     1. Write to WAL (per RegionServer)
     2. Write to in-memory sorted map (MemStore) (per Region + ColumnFamily)
     3. Flush MemStore to disk as an HFile when the MemStore reaches the configurable hbase.hregion.memstore.flush.size
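
The three steps of the write path can be sketched as a small in-memory model. Everything here is illustrative: WritePathSketch is a hypothetical class, the "WAL" and "HFiles" are plain Java collections rather than files, and the size accounting is a rough character count standing in for the configured flush size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class WritePathSketch {
    final List<String> wal = new ArrayList<>();                      // 1. append-only log
    TreeMap<String, String> memStore = new TreeMap<>();              // 2. in-memory sorted map
    final List<TreeMap<String, String>> hfiles = new ArrayList<>();  // 3. flushed sorted snapshots

    private final long flushSizeBytes; // stands in for hbase.hregion.memstore.flush.size
    private long memStoreBytes = 0;

    WritePathSketch(long flushSizeBytes) {
        this.flushSizeBytes = flushSizeBytes;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);                      // log first
        memStore.put(key, value);                        // then the sorted map
        memStoreBytes += key.length() + value.length();  // rough size estimate
        if (memStoreBytes >= flushSizeBytes) {
            flush();
        }
    }

    // Moves the current MemStore contents into a new "HFile" and starts an empty MemStore.
    private void flush() {
        hfiles.add(memStore);
        memStore = new TreeMap<>();
        memStoreBytes = 0;
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch(10);
        store.put("row1", "val1");
        store.put("row2", "val2");
        System.out.println("hfiles=" + store.hfiles.size() + " wal=" + store.wal.size());
    }
}
```
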
  25. Internals / Bug
     Now that we know the internals, what could go wrong?
  26–29. Putting it together
     Let’s go back to the beginning…
     MemStore:
       Timestamp  Row   fam1:col1  fam2:col2
       t1         row1  val1       val1
     And start a scan.
     And concurrently put.
     MemStore:
       Timestamp  Row   fam1:col1  fam2:col2
       t2         row1  val2       val2
       t1         row1  val1       val1
     Which causes a flush.
     HFile:
       Row   fam2:col2
       row1  val2
       row1  val1
  30–32. Putting it together
     Now, scan needs to make sense of this…
     MemStore:
       Ts  Row   fam1:col1
       t2  row1  val2
       t1  row1  val1
     HFile:
       Row   fam2:col2
       row1  val2
       row1  val1
     But the HFile has no timestamp!
     Inconsistent result:
       Row   fam1:col1  fam2:col2
       row1  val1       val2
  33. Solution
     Store the timestamp in the HFile
     MemStore:
       Ts  Row   fam1:col1
       t2  row1  val2
       t1  row1  val1
     HFile:
       Ts  Row   fam2:col2
       t2  row1  val2
       t1  row1  val1
     Correct result:
       Row   fam1:col1  fam2:col2
       row1  val1       val1
     Now we have all the information we need
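
The difference between the broken and the fixed read can be reproduced with a few lines of Java. This is a toy model with hypothetical names (ReadPointSketch, Cell, scanRow); it assumes a flushed cell without a stored timestamp behaves as if its timestamp were 0, so it is visible to every scan. The scan is opened at t1, before the put at t2.

```java
import java.util.List;

class ReadPointSketch {
    static final class Cell {
        final String value;
        final long memstoreTs; // 0 models a flushed cell whose timestamp was not stored
        Cell(String value, long memstoreTs) {
            this.value = value;
            this.memstoreTs = memstoreTs;
        }
    }

    // Cells are ordered newest first; returns the first one visible at readPoint.
    static String newestVisible(List<Cell> newestFirst, long readPoint) {
        for (Cell c : newestFirst) {
            if (c.memstoreTs <= readPoint) {
                return c.value;
            }
        }
        return null;
    }

    // Reads fam1:col1 from the MemStore and fam2:col2 from the HFile for one row.
    static String scanRow(List<Cell> memStoreFam1, List<Cell> hfileFam2, long readPoint) {
        return newestVisible(memStoreFam1, readPoint) + " " + newestVisible(hfileFam2, readPoint);
    }

    // The flush dropped the timestamps, so the scan cannot filter the HFile cells.
    static String withoutTimestampsInHFile() {
        List<Cell> memStore = List.of(new Cell("val2", 2), new Cell("val1", 1));
        List<Cell> hfile = List.of(new Cell("val2", 0), new Cell("val1", 0));
        return scanRow(memStore, hfile, 1);
    }

    // The flush kept the timestamps, so both sources are filtered the same way.
    static String withTimestampsInHFile() {
        List<Cell> memStore = List.of(new Cell("val2", 2), new Cell("val1", 1));
        List<Cell> hfile = List.of(new Cell("val2", 2), new Cell("val1", 1));
        return scanRow(memStore, hfile, 1);
    }

    public static void main(String[] args) {
        System.out.println("without: " + withoutTimestampsInHFile());
        System.out.println("with:    " + withTimestampsInHFile());
    }
}
```

With timestamps dropped, the scan mixes val1 from the MemStore with val2 from the HFile; with timestamps kept, it returns val1 for both families, a version of the row that actually existed.
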
  34. Consistency
     • These are only some of the consistency issues in 0.90
       – e.g. HBASE-5121: MajorCompaction may affect scan's correctness
     • Solution: Upgrade to 0.92 or 0.94
  35. HBase 0.94: “Performance Release”
  36–37. Performance Improvements in 0.94
     • HBASE-5074 Support checksums in HBase block cache
     • HBASE-5199 Delete out of TTL store files before compaction selection
     • HBASE-4608 HLog Compression
     • HBASE-4465 Lazy-seek optimization for StoreFile scanners
  38. HBASE-5074
     • HDFS stores the checksum in a separate file: [HFile] [Checksum]
     • So each file read actually requires two disk IOPS
     • HBase is often bottlenecked by random disk IOPS
  39. HBASE-5074 Solution
     • Solution: Store the checksum in the HFile block: HFile [ HFile Block: Chksum | Data ]
     • On by default (“hbase.regionserver.checksum.verify”)
     • Bytes per checksum (“hbase.hstore.bytes.per.checksum”): default is 16K
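
To make the "bytes per checksum" setting concrete, here is a small sketch that computes one checksum per 16K chunk of a block and verifies them on read. It is an illustration only: InlineChecksumSketch is a hypothetical class, CRC32 from the JDK is an assumed checksum choice, and the real HFile block layout is not reproduced.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

class InlineChecksumSketch {
    // Computes one CRC32 per chunk of bytesPerChecksum bytes (the last chunk may be shorter).
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Recomputes the checksums and compares them with the ones stored next to the data.
    static boolean verify(byte[] data, long[] stored, int bytesPerChecksum) {
        return Arrays.equals(checksums(data, bytesPerChecksum), stored);
    }

    public static void main(String[] args) {
        byte[] block = new byte[40000];
        long[] stored = checksums(block, 16 * 1024);
        System.out.println(stored.length + " checksums, valid=" + verify(block, stored, 16 * 1024));
    }
}
```

Because the checksums travel with the data, a single read of the block is enough to both fetch and verify it, instead of one read for the data and another for a separate checksum file.
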
  41. HBASE-5199
     • User can specify TTL per column family
     • If all values in the HFile are expired, delete the HFile rather than compact it
     • Off by default, turn on via “hbase.store.delete.expired.storefile”
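
The selection rule behind this optimization can be sketched as follows. The names are hypothetical (ExpiredStoreFileSketch, StoreFile, fullyExpired), and the sketch assumes each store file records the newest cell timestamp it contains; a file is treated as fully expired when even that newest cell is older than the TTL.

```java
import java.util.ArrayList;
import java.util.List;

class ExpiredStoreFileSketch {
    static final class StoreFile {
        final String name;
        final long maxTimestampMs; // timestamp of the newest cell in the file
        StoreFile(String name, long maxTimestampMs) {
            this.name = name;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    // Returns the files whose newest cell is already past the TTL; these can be
    // deleted outright instead of being rewritten by a compaction.
    static List<String> fullyExpired(List<StoreFile> files, long ttlMs, long nowMs) {
        long cutoff = nowMs - ttlMs;
        List<String> expired = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.maxTimestampMs < cutoff) {
                expired.add(f.name);
            }
        }
        return expired;
    }

    // Two files, TTL of 5000 ms, current time 10000 ms: only the older file is expired.
    static List<String> demo() {
        List<StoreFile> files = List.of(new StoreFile("a", 1000), new StoreFile("b", 9000));
        return fullyExpired(files, 5000, 10000);
    }

    public static void main(String[] args) {
        System.out.println("delete before compaction: " + demo());
    }
}
```
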
  42. Conclusion
     • Most consistency issues are fixed in 0.92 / CDH4
     • Performance improvements in 0.94
     • 0.94 is wire compatible with 0.92, so it will be in a CDH4 update
  43. References
     • HBase ACID Semantics, http://hbase.apache.org/acid-semantics.html
     • Apache HBase Meetup @ SU, Michael Stack. http://files.meetup.com/1350427/20120327hbase_meetup.pdf
     • HBase Internals, Lars Hofhansl. http://www.cloudera.com/resource/hbasecon-2012-learning-hbase-internals/
