
BigBucket Cache, Texas Edition


HBase read performance is important for HPE and Cloudera customers. As such, to fully take advantage of hardware capabilities, HBase BucketCache needs to perform and scale well. This talk covers adventures in BucketCache internals on the way to reaching 4 million ops (YCSB Workload C) in a 4U rack space.



  1. BigBucketCache, Texas Edition
     Daniel Pol (HPE), Michael Stack (Cloudera)
  2. Agenda
     • HBase caching introduction
     • Why BucketCache on SSD?
     • BucketCache SSD wish list
     • BucketCache SSD observations
     • Tools used to diagnose
     • Current status
     • Current performance numbers
  3. Read
     Reads data from HDFS:
     HBaseClient.table.get(row=a,...);
  4. Read "Blocks"
     • Blocks are the "unit" of HBase I/O, not Cells/KeyValues
     • A Block has 1-N Cells; 64k size by default (so a Block is >= 64k)
     • Blocks are read from an HFile; an HFile is "Blocks" plus an Index on Blocks
     • Each Block carries a CRC and Block Metadata (size, compression, etc.)
     • Block types: DATA (user Cells), META (about the HFile), BLOOM (HFile bloom filters), INDEX (which blocks have which Cells)
     Sample Cells from a DATA block:
     row05 column=family:field3, timestamp=1460751485935, value=>000.t-L3%0`"3,4,x8Ty#5b,Y}>7n?Pe2+2="05,` H'$
     ... (further rows elided)
  5. Read... from Cache
     Client goes to HBase; a hit in the in-process cache saves I/O (no trip to HDFS).
     HBaseClient.table.get(row=a,...);
  6. LRUBlockCache
     Default: 40% of the Java heap
     hbase.bucketcache.size
  7. BucketCache
     • Not on by default
     • L1/L2 deploy: LRUBlockCache = L1, hosts INDEX, BLOOM, META; BucketCache = L2, hosts DATA blocks
     • BucketCache has a flexible engine: onheap/offheap, file-backed, or mmap'd file
     • Developed by FusionIO; allocates "buckets" of fixed size
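A file-backed BucketCache of the kind described above is typically wired up in hbase-site.xml. A minimal sketch; the SSD path and the 8 GB size are placeholder values, not figures from the slides:

```xml
<!-- hbase-site.xml: enable a file-backed BucketCache as the L2 cache -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <!-- "offheap" keeps the cache in direct memory; "file:<path>" backs it with a file on SSD -->
  <value>file:/mnt/ssd/bucketcache.data</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- cache size in MB (placeholder: 8 GB) -->
  <value>8192</value>
</property>
```

With the file engine, LRUBlockCache automatically becomes the L1 tier for INDEX/BLOOM/META blocks while DATA blocks land in the SSD-backed L2.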
  8. BucketCache Developments
     • Must be enabled if the offheap read path is enabled (hbase-2.0.0)
     • BucketCache offheap blocks are reference counted: +1 on a new Scan (a Get is a Scan), -1 when the result is shipped to the client
     • Default BucketCache always on in hbase-2.0.0?
     • HBase reads are faster when offheap, even when the working set fits entirely in the L1 cache
     • BucketCache is the gateway to (big) machine resources: RAM, NVRAM, SSD
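The reference-counting lifecycle above can be sketched in plain Java. This is a simplified model for illustration, not HBase's actual implementation; the class and method names are made up:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of a reference-counted offheap cache block:
// a Scan retains the block (+1) and releases it (-1) once the
// result has been shipped; the cache may only reclaim the block's
// memory when the count is back at zero.
public class RefCountedBlock {
    private final AtomicInteger refCnt = new AtomicInteger(0);

    public void retain() {                 // +1: a new Scan (a Get is a Scan)
        refCnt.incrementAndGet();
    }

    public void release() {                // -1: result shipped to the client
        if (refCnt.decrementAndGet() < 0)
            throw new IllegalStateException("released more times than retained");
    }

    public boolean evictable() {           // safe to reclaim only at zero
        return refCnt.get() == 0;
    }

    public int refCount() {
        return refCnt.get();
    }
}
```

The point of the scheme is that eviction and in-flight reads never race: a block pinned by an active Scan reports a nonzero count and cannot be reclaimed underneath the reader.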
  9. Why BucketCache on SSD performance is important
     • Most of our customers write data once and read it at least 3 times
     • While you have "Big" data stored, you usually have "smaller" "hot" data areas
     • SSD provides the best $/ops for HBase BucketCache
     • Can take advantage of persistent memory devices without code changes
     (chart: 1700%)
  10. BucketCache SSD wish list for great performance
      • A single-cell HBase get request should generate a single BucketCache IO
      • Aligned IO to the BucketCache; measured performance delta of 44% (2048 vs 2056 bytes)
      • File access that bypasses the filesystem cache (mmap/direct IO)
      • Big BucketCache space (>4 TB per region server), with multiple-files support
      • Fast prefetch, so hot data is quickly loaded into the BucketCache at startup
      • Fast deallocation, so hot tables can be quickly changed
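The alignment item in the wish list can be illustrated with a small model: if the SSD services reads in 4 KiB pages (an assumption for illustration), entries whose size divides the page size never straddle a page boundary, while entries just 8 bytes larger frequently do, roughly doubling device traffic. The numbers and class below are illustrative, not from HBase:

```java
// Sketch: why unaligned BucketCache entries cost extra device I/O.
// Assumption: the SSD services reads in 4 KiB pages.
public class AlignmentCost {
    static final long PAGE = 4096;

    // number of 4 KiB pages a read of `len` bytes at `offset` touches
    static long pagesTouched(long offset, long len) {
        long firstPage = offset / PAGE;
        long lastPage = (offset + len - 1) / PAGE;
        return lastPage - firstPage + 1;
    }

    // total pages touched reading `entries` back-to-back entries of `entrySize` bytes
    static long totalPages(long entrySize, int entries) {
        long pages = 0;
        for (int i = 0; i < entries; i++)
            pages += pagesTouched((long) i * entrySize, entrySize);
        return pages;
    }

    public static void main(String[] args) {
        // 2048-byte entries pack evenly: exactly one page per read
        System.out.println("2048-byte entries: " + totalPages(2048, 1000) + " pages");
        // 2056-byte entries drift across page boundaries: often two pages per read
        System.out.println("2056-byte entries: " + totalPages(2056, 1000) + " pages");
    }
}
```

About half of the 2056-byte reads cross a page boundary, which is consistent in spirit with the 44% delta the slide reports for 2048- vs 2056-byte IO.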
  11. Initial observations
      • Precise info on the BucketCache file format is hard to find/digest
      • Prefetching not working with big files (>100,000 blocks)
      • A single-cell HBase get request generates 2.5 BucketCache IOs
      • BucketCache uses a lot more space than the HFiles (cacheSize vs usedSize)
      • Prefetch is much faster (~5x) when starting with an empty BucketCache file
      • Recent 2P servers are hard to drive to full CPU load; extra settings are needed to reach full potential
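The cacheSize-vs-usedSize gap noted above follows from how BucketCache allocates: each block is rounded up to the nearest configured bucket size. The size table below mirrors the spirit of HBase's default bucket sizes (a data size plus 1 KiB of slack for block metadata), but treat the specific values as an assumption for illustration, not a definitive copy of the allocator:

```java
// Sketch of why BucketCache occupies more space than the HFile bytes cached:
// every block is placed in the smallest bucket that fits it, so a block just
// over a tier boundary pays for the whole next tier.
public class BucketRounding {
    // assumed default tiers: (4, 8, 16, 32, 40, 48, 56, 64, 96, ... KiB) + 1 KiB slack
    static final int[] BUCKET_SIZES = {
        4 * 1024 + 1024, 8 * 1024 + 1024, 16 * 1024 + 1024, 32 * 1024 + 1024,
        40 * 1024 + 1024, 48 * 1024 + 1024, 56 * 1024 + 1024, 64 * 1024 + 1024,
        96 * 1024 + 1024, 128 * 1024 + 1024, 192 * 1024 + 1024, 256 * 1024 + 1024,
        384 * 1024 + 1024, 512 * 1024 + 1024
    };

    // smallest bucket that fits a block of `blockSize` bytes
    static int bucketFor(int blockSize) {
        for (int s : BUCKET_SIZES)
            if (blockSize <= s) return s;
        throw new IllegalArgumentException("block too large: " + blockSize);
    }

    public static void main(String[] args) {
        int block = 66 * 1024;   // a nominally-64k block that grew a little past 64k+slack
        int bucket = bucketFor(block);
        System.out.printf("block=%d bytes -> bucket=%d bytes (wasted %d)%n",
                block, bucket, bucket - block);
    }
}
```

Since blocks are ">= 64k" rather than exactly 64k, many land just past a tier boundary, which is one plausible source of the cacheSize/usedSize discrepancy the slide observed.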
  12. Tools used for diagnostics
      • HBase charts for BucketCache freeSize, evictions, StorefileSize, regions, all per server instance
      • Java Mission Control: yes, it does expose HBase metrics, and you can have sub-second samples
      • TRACE-level logging, used to detect that prefetch was not working
      • HTrace, to gain additional insight
      • SystemTap, for file IO tracing
      • hexdump -C on the BucketCache file
  13. Custom CM Charts
  14. BucketTrace.sh
      #!/bin/sh
      cat << '_Marker_' > BucketTrace.stp
      global verbose=0
      probe begin {
        %( $# > 2 %? verbose=1 %)
        task = pid2task($1)
        if (task == 0)
          error(sprintf("Process-id %d is invalid, please provide valid PID for HBase region server", $1))
        printf("Tracing active...\n")
      }
      probe vfs.read.return, vfs.write.return {
        if (pid() != $1) next
        if (devname == "N/A") next
        op = name == "vfs.read" ? "R" : "W"
        if (ino == $2) {
          printf("VFS Op: %1s Size: %7d Position: %u\n", op, $count, $pos)
        }
      }
      probe ioblock.end {
        if (ino == $2) {
          printf("BIO Op: %1s Size: %7d Position: %d\n", bio_rw_str(rw), $bio->bi_seg_back_size, sector)
        }
      }
      _Marker_
      pid=$(ps -Af | grep -i java | grep -i regionserver | grep -v grep | awk '{print $2}')
      file=$(curl -s http://localhost:60030/conf | grep hbase.bucketcache.ioengine | awk -F'[<>]' '{split($9,a,":"); print(a[2])}')
      inode=$(stat -c "%i" $file)
      echo "Preparing to trace process $pid and file $file"
      stap -DMAXACTION=5000 ./BucketTrace.stp $pid $inode $1
  15. SystemTap (https://sourceware.org/systemtap/)
      probe vfs.read.return, vfs.write.return {
        if (pid() != $1) next
        if (devname == "N/A") next
        op = name == "vfs.read" ? "R" : "W"
        if (ino == $2) {
          printf("VFS Op: %1s Size: %7d Position: %u\n", op, $count, $pos)
          if (verbose) {
            for (i = 0; i < $count; i += 16) {
              printf("\n%08X  ", i)
              for (y = 0; y < 16; y++) {
                if ((i + y) >= $count) continue
                ch = user_uint8($buf + i + y)
                printf("%02X ", ch)
                if ((y % 8) == 7) printf(" ")
              }
              printf("|")
              for (y = 0; y < 16; y++) {
                if ((i + y) >= $count) continue
                ch = user_uint8($buf + i + y)
                if (ch >= 32 && ch <= 126) printf("%c", ch)
                else printf(".")
              }
              printf("|")
            }
            print("\n")
          }
        }
      }
  16. BucketCache File Format (no tags)
  17. Current status
      Umbrella JIRA: https://issues.apache.org/jira/browse/HBASE-15240
      Prefetch:
      • HBASE-15386 PREFETCH_BLOCKS_ON_OPEN in HColumnDescriptor is ignored
      • HBASE-15241 Blockcache hits hbase.ui.blockcache.by.file.max limit and is silent that it will load no more blocks
      UI:
      • HBASE-15640 L1 cache doesn't give fair warning that it is showing partial stats only when it hits limit
      Overreads:
      • HBASE-15477 Do not save 'next block header' when we cache hfileblocks
      • HBASE-15392 Single Cell Get reads two HFileBlocks
      YCSB: https://github.com/brianfrankcooper/YCSB/pull/674 (>= YCSB 0.9.0)
      Continuous effort. Join us!
  18. Current performance observed on SSD
      YCSB Workload C result: before 208,197 ops, after 454,121 ops, using BucketCache on SSD.
      Almost 1,000,000 ops per 1U of rack space.
      Whitepaper: https://www.hpe.com/h20195/v2/GetPDF.aspx/4AA5-8757ENW.pdf
      Hardware: 5 region servers, HPE Moonshot m710p cartridges
      • CPU: Intel E3-1284L v4 @ 2.90GHz (4 cores)
      • Memory: 32 GB RAM
      • Disk: 960 GB NVMe
      • Network: 2x 10GbE
      The 4.3U chassis supports up to 45 cartridges.
      Software:
      • RHEL 6.6
      • Cloudera CDH 5.7
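The headline figures appear to extrapolate linearly from the 5 measured cartridges to a fully populated 45-cartridge chassis. A back-of-the-envelope check, assuming linear scaling (which the slides imply but do not state):

```java
// Sanity-check the per-1U claim: 454,121 ops measured on 5 cartridges,
// projected to a 45-cartridge, 4.3U Moonshot chassis.
public class ChassisProjection {
    static double perCartridge(double ops, int cartridges) {
        return ops / cartridges;
    }

    // assume throughput scales linearly with cartridge count
    static double fullChassis(double ops, int cartridges, int slots) {
        return perCartridge(ops, cartridges) * slots;
    }

    public static void main(String[] args) {
        double ops = 454_121;                        // YCSB Workload C, 5 cartridges
        double chassis = fullChassis(ops, 5, 45);    // ~4.1M ops in one chassis
        double perU = chassis / 4.3;                 // chassis occupies 4.3U
        System.out.printf("full chassis: %.0f ops, per 1U: %.0f ops%n", chassis, perU);
    }
}
```

This lands at roughly 4.1 million ops per chassis and ~950k ops per 1U, consistent with the abstract's "4 million ops in a 4U rack space" and the slide's "almost 1,000,000 ops per 1U".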
  19. Thank you for your patience! (20 minutes)
      Additional references:
      • HBase BlockCache 101: http://hortonworks.com/blog/hbase-blockcache-101/
      • Comparing BlockCache Deploys: https://blogs.apache.org/hbase/entry/comparing_blockcache_deploys
      • HBase Reference Guide: https://hbase.apache.org/book.html#block.cache
