Off-heaping the Apache HBase Read Path

Anoop Sam John and Ramkrishna Vasudevan (Intel)

HBase provides an LRU-based on-heap cache, but its size (and so the total data size that can be cached) is limited by Java’s max heap space. This talk highlights our work under HBASE-11425 to allow the HBase read path to work directly from the off-heap area.

Published in: Software
  • https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
  • postScannerFilterRow
  • A temporary 64 KB array is created and the data copied into it.
    Typically only 20% of the heap is left once the memstore and block cache (BC) are accounted for.
  • Same slide to show the E2E picture after explanation

    1. Off Heaping HBase Read Path (HBASE-11425)
         Anoop Sam John, Ramkrishna S Vasudevan
         Intel BigData Team, Bangalore, India
    2. Overview
         • The L2 off-heap cache can give a large cache size.
             • It is not constrained by the Java max heap size or its possible issues.
             • It is built from 4 MB physical memory buffers.
             • Buckets come in different sizes: 5 KB, 9 KB,… 513 KB. Each bucket has at least 4 slots.
             • HFile blocks are placed in an appropriately sized bucket.
             • One block may span 2 ByteBuffers.
         • The read path assumes data is in a byte array.
             • Cells assume their data parts (i.e. rowkey, family, value, etc.) are in a byte array.
             • A read that hits a block in the cache needs an on-heap copy of that block.
             • A temporary 64 KB array is created and copied into, which produces more garbage.
         [Diagram: 4 MB buffers divided into buckets of up to 513 KB.]
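The bucket layout on the slide above can be sketched as follows. This is a minimal illustration, not BucketCache's actual allocator: the class `BucketSizing`, its method names, and the intermediate bucket sizes between 9 KB and 513 KB are assumptions made for the example. Only the 5 KB, 9 KB and 513 KB sizes and the 4 MB buffer size come from the slide.

```java
final class BucketSizing {
    // 4 MB physical buffers, as stated on the slide.
    static final int BUFFER_SIZE = 4 * 1024 * 1024;

    // Hypothetical size ladder in KB; only 5, 9 and 513 appear on the slide.
    static final int[] BUCKET_SIZES_KB = {5, 9, 17, 33, 65, 129, 257, 513};

    /** Returns the smallest bucket size in bytes that fits the block, or -1 if none does. */
    static int bucketSizeFor(int blockBytes) {
        for (int kb : BUCKET_SIZES_KB) {
            int size = kb * 1024;
            if (blockBytes <= size) {
                return size;
            }
        }
        return -1;
    }

    /** True if the byte range [offset, offset + length) crosses a 4 MB buffer boundary. */
    static boolean spansTwoBuffers(long offset, int length) {
        long firstBuffer = offset / BUFFER_SIZE;
        long lastBuffer = (offset + length - 1) / BUFFER_SIZE;
        return firstBuffer != lastBuffer;
    }

    public static void main(String[] args) {
        // A 64 KB block goes into the (assumed) 65 KB bucket.
        System.out.println(bucketSizeFor(64 * 1024));
        // A range that starts 10 bytes before a buffer boundary spans two buffers.
        System.out.println(spansTwoBuffers(BUFFER_SIZE - 10, 20));
    }
}
```

A block for which `spansTwoBuffers` returns true is the case that later motivates a multi-buffer wrapper: without one, the reader has to copy the two halves into a single array.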
    3. Read from Bucket Cache
         [Diagram: inside the HRegionServer, read requests for Region1 and Region2 pass through the scanner layers. Each read works on an on-heap HFileBlock sitting between the scanner layers and the off-heap Bucket Cache, and then returns a read response.]
    4. Off Heap Read Path from Bucket Cache
         [Diagram: inside the HRegionServer, read requests for Region1 and Region2 pass through the scanner layers, which read directly from the off-heap Bucket Cache and return read responses.]
         End to end off heap: from the bucket cache until the RPC.
    5. Off Heap Data Structure
         • Selection of the data structure for off-heap storage
             • During reads, individual Cell components are parsed multiple times.
             • Cells are frequently compared for proper ordering.
         • Bucket cache uses NIO DirectByteBuffer for the off-heap cache.
         • JMH benchmark: NIO vs Netty
             • Test doing reads of int, long, bytes from NIO ByteBuffer and Netty ByteBuff
             • Test with Unsafe-based reads
         • Conclusion: continue with the existing NIO DBB-based buckets in BucketCache.

         Benchmark       Mode    Cnt   Score           Error            Units
         nettyOffheap    thrpt         57366360.944    ±11533933.769    ops/s
         nioOffheap      thrpt         60089837.738    ±14171768.229    ops/s

         Benchmark       Mode    Cnt   Score           Error            Units
         nettyOffheap    thrpt         83613659.416    ±  535211.991    ops/s
         nioOffheap      thrpt         84514777.734    ± 1199369.976    ops/s
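The kind of access the benchmark exercises, reads of int, long and bytes from an NIO direct buffer, can be sketched as below. This is not the JMH harness from the talk; the class `DirectBufferReads` and its helper methods are hypothetical, and only standard `java.nio.ByteBuffer` calls are used.

```java
import java.nio.ByteBuffer;

final class DirectBufferReads {
    /** Builds a small off-heap buffer holding an int, a long and a single byte. */
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocateDirect(32);
        buf.putInt(0, 42);              // absolute write, position is untouched
        buf.putLong(4, 1234567890123L);
        buf.put(12, (byte) 7);
        return buf;
    }

    /** Copies a range to the heap; this is the copy the off-heap read path tries to avoid. */
    static byte[] copyBytes(ByteBuffer buf, int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer dup = buf.duplicate(); // independent position, shared memory
        dup.position(offset);
        dup.get(out, 0, length);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = sample();
        System.out.println(buf.isDirect());
        System.out.println(buf.getInt(0));   // absolute reads need no on-heap copy
        System.out.println(buf.getLong(4));
        System.out.println(copyBytes(buf, 12, 1)[0]);
    }
}
```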
    6. Building Blocks for Off Heaping
         • Cellify the read path: HBASE-7320, HBASE-11871, HBASE-11805
             • Cells flow through the read path.
             • Move away from the KeyValue assumption.
             • HFile block is backed by a ByteBuffer rather than a byte[].
             • Remove all byte[] assumptions in seeking, encoding, etc.
         • Cell extension
             • Support ByteBuffer-backed getXXX APIs.
             • Added the Cell extension ByteBufferedCell, exposed within the server only.
             • An off-heap backed ByteBufferedCell is created when reading blocks from the off-heap bucket cache.
             • getXXXArray() calls on off-heap buffer backed Cells work with a temporary byte[] copy, which produces more garbage.
             • CellUtil APIs for operations like equals and copy check for ByteBufferedCell.
             • Coprocessors (CPs) and custom filters are advised to use these APIs.
         • Note
             • Filter#filterRowKey(byte[] buffer, int offset, int length) is deprecated in favor of filterRowKey(Cell firstRowCell).
             • RegionObserver#postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, byte[], int, short, boolean) is deprecated in favor of postScannerFilterRow(ObserverContext<RegionCoprocessorEnvironment>, InternalScanner, Cell, boolean).
    7. Building Blocks for Off Heaping
         • KVComparator -> CellComparator: HBASE-10800, HBASE-13500
         • JMH benchmark with off-heap buffer compare vs byte[] compare
             • Using the Unsafe way of comparing
             • Each buffer with 135 bytes
             • Both buffers equal
         • No performance overhead when comparing off-heap backed cells

         Benchmark        Mode    Cnt   Score           Error          Units
         offheapCompare   thrpt         38205893.545    ± 265309.769   ops/s
         onheapCompare    thrpt         37166847.740    ± 430242.970   ops/s
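What a byte-wise comparison over buffers looks like can be sketched as below. Note the swap: the benchmark on the slide uses an Unsafe-based compare, whereas this sketch uses a plain loop over `ByteBuffer.get(int)` so that it stays self-contained. The class `BufferCompare` and its helpers are hypothetical and are not HBase's CellComparator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BufferCompare {
    /** Lexicographic comparison of two byte ranges, treating bytes as unsigned. */
    static int compare(ByteBuffer a, int aOff, int aLen, ByteBuffer b, int bOff, int bLen) {
        int common = Math.min(aLen, bLen);
        for (int i = 0; i < common; i++) {
            int x = a.get(aOff + i) & 0xFF; // absolute get works for heap and direct buffers
            int y = b.get(bOff + i) & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen; // on a shared prefix, the shorter range sorts first
    }

    /** Copies a string into a direct (off-heap) buffer. */
    static ByteBuffer offHeap(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip();
        return buf;
    }

    /** Wraps a string's bytes in a heap buffer. */
    static ByteBuffer onHeap(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The same call compares an off-heap row key against an on-heap one.
        System.out.println(compare(offHeap("row-1"), 0, 5, onHeap("row-1"), 0, 5));
        System.out.println(compare(offHeap("row-1"), 0, 5, offHeap("row-2"), 0, 5) < 0);
    }
}
```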
    8. NIO ByteBuffer Wrapper
         • HFile block data might be split across 2 ByteBuffers.
             • Avoid the copy.
             • Need a single data structure that is backed by N ByteBuffers.
         • The Java NIO ByteBuffer is not extendable.
         • Wrapper class org.apache.hadoop.hbase.nio.ByteBuff
             • org.apache.hadoop.hbase.nio.SingleByteBuff
             • org.apache.hadoop.hbase.nio.MultiByteBuff
         • The HFile block's data structure type is changed to ByteBuff.
         [Diagram: MultiByteBuff and SingleByteBuff.]
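The idea of one structure backed by N ByteBuffers can be sketched as below. `TinyMultiBuff` is a hypothetical, read-only toy written for this example; it is not the API of `org.apache.hadoop.hbase.nio.MultiByteBuff`. It assumes big-endian ints and shows a value being read across a buffer boundary without first copying the two halves into one array.

```java
import java.nio.ByteBuffer;

final class TinyMultiBuff {
    private final ByteBuffer[] items;
    private final int[] startOffsets; // global offset at which each item begins
    private final int capacity;

    TinyMultiBuff(ByteBuffer... items) {
        this.items = items;
        this.startOffsets = new int[items.length];
        int total = 0;
        for (int i = 0; i < items.length; i++) {
            startOffsets[i] = total;
            total += items[i].remaining();
        }
        this.capacity = total;
    }

    /** Reads one byte at a global index, locating the backing buffer first. */
    byte get(int index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int i = items.length - 1;
        while (startOffsets[i] > index) {
            i--;
        }
        ByteBuffer item = items[i];
        return item.get(item.position() + index - startOffsets[i]);
    }

    /** Reads a big-endian int that may straddle two backing buffers. */
    int getInt(int index) {
        int value = 0;
        for (int k = 0; k < 4; k++) {
            value = (value << 8) | (get(index + k) & 0xFF);
        }
        return value;
    }

    /** Two 6-byte direct buffers with 0x01020304 written across their boundary. */
    static TinyMultiBuff demo() {
        ByteBuffer first = ByteBuffer.allocateDirect(6);
        ByteBuffer second = ByteBuffer.allocateDirect(6);
        first.put(4, (byte) 1);
        first.put(5, (byte) 2);
        second.put(0, (byte) 3);
        second.put(1, (byte) 4);
        return new TinyMultiBuff(first, second);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(demo().getInt(4)));
    }
}
```

A production wrapper would presumably read whole values from a single buffer whenever they do not straddle a boundary; the byte-at-a-time path here is only the simplest correct form.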
    9. Bucket Cache Block Eviction
         • BucketCache evicts blocks and frees the buckets when it runs out of space.
             • Before: any block can be evicted, because readers copy block data to a temporary byte[].
             • After HBASE-11425: readers refer to the bucket memory area directly, so only unreferenced blocks can be evicted.
         • Ref count based block cache and block eviction
             • Increment the ref count when a reader hits a block in the L2 cache.
             • Decrement it once the response is created for the RPC.
             • Evict if/when ref count = 0.
         • Call chain that decrements the ref count:
             Call#setResponse -> RpcCallback#run -> RegionScanner#shipped -> KeyValueHeap#shipped -> StoreScanner#shipped -> KeyValueHeap#shipped -> StoreFileScanner#shipped -> HFileScanner#shipped -> HFile.Reader#returnBlock -> BlockCache#returnBlock -> decrement ref count
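The ref-count rule on the slide above (increment on a cache hit, decrement once the response is built, evict only at zero) can be sketched as below. `RefCountedBlock` and its methods are hypothetical names, and the use of `synchronized` is a simplification chosen for the example; this is not how BucketCache itself is written.

```java
final class RefCountedBlock {
    private int refCount = 0;
    private boolean evicted = false;

    /** Called when a reader hits the block; fails if the block is already evicted. */
    synchronized boolean retain() {
        if (evicted) {
            return false;
        }
        refCount++;
        return true;
    }

    /** Called once the RPC response has been created from the block's memory. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release without matching retain");
        }
        refCount--;
    }

    /** Evicts only when no reader still points into the bucket memory. */
    synchronized boolean tryEvict() {
        if (refCount > 0) {
            return false;
        }
        evicted = true;
        return true;
    }

    synchronized int references() {
        return refCount;
    }

    public static void main(String[] args) {
        RefCountedBlock block = new RefCountedBlock();
        block.retain();                        // reader hits the block
        System.out.println(block.tryEvict());  // still referenced
        block.release();                       // response shipped
        System.out.println(block.tryEvict());  // now unreferenced
    }
}
```

Doing the check and the state change under one lock keeps a reader from acquiring the block between the zero check and the eviction, which is the hazard that makes copy-free reads need reference counting in the first place.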
    10. Complete Picture
         [Diagram: inside the HRegionServer, read requests for Region1 and Region2 pass through the scanner layers to the off-heap Bucket Cache. Block data reaches the scanners as MultiByteBuff or SingleByteBuff, and each cache hit does Refcount++. After the read response is produced, a callback does Refcount--.]
         End to end off heap: from the bucket cache until the RPC.
    11. Performance Test Results
         • PerformanceEvaluation Tool (PE)
         • Table with one CF and one cell per row. 100 GB total data. Each row with 1K value size
         • Entire data is loaded into the bucket cache
         • Single node cluster
             • CPU: Intel(R) Xeon(R) CPU with 8 cores. RAM: 150 GB
             • JDK: 1.8
         • HBase configuration
             • HBASE_HEAPSIZE = 9 GB
             • HBASE_OFFHEAPSIZE = 105 GB
             • hbase.bucketcache.size = 102GB
             • GC: default HBase GC setting (CMS)
         • Multi get with 100 rows
         • Every thread doing 100 K operations = 10 million rows get
         • Avg completion run time of each thread (in secs)
         • Converted to throughput: gain of 102% - 460%

         HBase Random GET, Average Completion Time (s) (the lower the better)
         Threads   Before HBASE-11425   After HBASE-11425
         5         89.38                44.04
         10        139.81               50.55
         20        285.66               70.23
         25        361.23               88.6
         50        817.91               165.4
         75        1372.81              244.72
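The "convert to throughput" step can be checked with a few lines of arithmetic. The sketch assumes that, because each thread does a fixed amount of work, throughput is inversely proportional to the average completion time, so the gain is before/after - 1. The class `ThroughputGain` is a hypothetical helper; the input figures are the 5-thread and 75-thread rows of the chart.

```java
final class ThroughputGain {
    /** Throughput gain in percent for a fixed workload, from completion times in seconds. */
    static double gainPercent(double beforeSecs, double afterSecs) {
        return (beforeSecs / afterSecs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // 5 threads: 89.38 s before, 44.04 s after.
        System.out.println(String.format("%.2f%%", gainPercent(89.38, 44.04)));
        // 75 threads: 1372.81 s before, 244.72 s after.
        System.out.println(String.format("%.2f%%", gainPercent(1372.81, 244.72)));
    }
}
```

With those two rows the gains come out just under 103% and 461%, which agrees with the 102% - 460% range quoted on the slide once the fractions are dropped.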
    12. Performance Test Results
         • PerformanceEvaluation Tool (PE)
         • Random range scan, 10K range
             • With filterAll filter (no data returned back)
         • Each thread doing the range scan 1000 times
         • Entire data is loaded into the bucket cache

         Range Scan, server side only, Average Completion Time (s) (the lower the better)
         Threads   Before HBASE-11425   After HBASE-11425
         10        449.1                319.87
         20        728.64               451.58
         25        908.26               560.46
         50        1904.93              1158
    13. Performance Test Results
         • PerformanceEvaluation Tool (PE)
         • Random range scan, 10K range
             • Returning 10% of rows back to the client
         • Each thread doing the range scan 1000 times
         • Entire data is loaded into the bucket cache

         HBase Range Scan with filter, Average Completion Time (s) (the lower the better)
         Threads   Before HBASE-11425   After HBASE-11425
         10        449.1                319.87
         20        728.64               451.58
         25        908.26               560.46
         50        1904.93              1158
    14. Performance Test Results
         • YCSB test
         • Table with one CF and 10 columns per row. Each row with 1K value. 90 GB total data
         • Entire data is loaded into the bucket cache
         • Single node cluster
             • CPU: Intel(R) Xeon(R) CPU with 8 cores. RAM: 150 GB
             • JDK: 1.8
         • HBase configuration
             • HBASE_HEAPSIZE = 9 GB
             • HBASE_OFFHEAPSIZE = 105 GB
             • hbase.bucketcache.size = 102GB
         • Multi get with 100 rows
         • Every thread doing 5 million operations
         • 20 - 160% improvement

         YCSB Random GET Throughput
         Threads   Before HBASE-11425   After HBASE-11425
         10        23277.97             28045.53
         25        25922.18             45767.99
         50        24558.72             58904.03
         75        24316.74             63280.86
    15. Performance Test Results
         • PE test comparing L1 cache vs off-heap L2 cache with 20GB data
         • Multi get with 100 rows
         • Entire data is loaded into the bucket cache
         • Each thread doing 10 million operations = 10 billion rows get
         • L1 test: max heap 32 GB. L2 test: max heap 12 GB

         HBase Random GET, Average Completion Time (s) (the lower the better)
         Threads   L1 cache   L2 cache
         10        300.5      307.6
         25        559.3      523.9
         50        1195.9     1144.2
         75        1793.1     1707.6
    16. GC Graphs
         [Figure: GC graphs for MultiGets with 25 threads, before HBASE-11425 and after HBASE-11425.]
    17. GC Graphs
         [Figure: GC graphs for ScanRange10000 with 20 threads, before HBASE-11425 and after HBASE-11425.]
    18. Feature Availability
         • The feature will be available in the HBase 2.0 release.
         • Make bucket cache the default in HBase 2.0: refer to HBASE-11323.
         • ‘Rocketfuel’ started using this for a random read workload.
             • Backported to a 1.x based version
         • More details
             • https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in
    19. Future work & QA
         • Future work
             • Off-heaping the write path: HBASE-11579
                 • Off-heap MSLAB pool
                 • Read request bytes into an off-heap buffer pool
                 • Lazy creation of ByteBuffer pools
                 • Fixed-size off-heap ByteBuffers from the pool
                 • Protobuf changes to handle off-heap ByteBuffers
             • In-memory flushes/compaction (HBASE-14918) from Yahoo
         • Questions??
