HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
HBase Accelerated:
In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov | Hadoop Summit Europe, April 13, 2016
2.
Motivation: Dynamic Content Processing on Top of HBase
Real-time content processing pipelines
› Store intermediate results in a persistent map
› A notification mechanism is prevalent
Storage and notifications on the same platform
Sieve – Yahoo’s real-time content management platform
[Diagram: Sieve pipeline: the crawl schedule feeds Crawl, Docproc, and Link Analysis stages, connected by a content queue and a links queue to Serving; built on Apache Storm and Apache HBase]
3.
Notification Mechanism is Like a Sliding Window
Small working set, but not necessarily a FIFO queue
Short life-cycle: a message is deleted after it is processed
High churn: message state can be updated
Frequent scans to consume messages
[Diagram: multiple producers put messages into the store; multiple consumers scan and delete them]
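The message life-cycle above can be sketched as a toy Python model over a sorted key-value map (all names here are illustrative, not Sieve's actual API):

```python
# Toy sketch of the notification workload: produce, update, scan-consume,
# delete. Illustrative only; the real store is HBase.

store = {}  # key -> message state

def produce(key, msg):
    store[key] = msg            # put: a new message

def update(key, msg):
    store[key] = msg            # high churn: state updated in place

def consume(limit):
    # frequent scan over the (small) working set, in key order
    batch = sorted(store)[:limit]
    for k in batch:
        del store[k]            # short life-cycle: delete after processing
    return batch

produce("k1", "crawl")
produce("k2", "index")
update("k1", "recrawl")
assert consume(1) == ["k1"]     # not strictly FIFO: ordered by key
assert "k1" not in store
```

Note that every produce, update, and delete is a write to the same keys, which is exactly the redundancy the in-memory compaction described later exploits.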
4.
HBase Accelerated: Mission Definition
Goal
Real-time performance and less I/O in persistent KV-stores
How?
Exploit redundancies in the workload to eliminate duplicates in memory
Reduce the number of new files
Reduce the write amplification effect (overall I/O)
Reduce retrieval latencies
How much can we gain?
The gain is proportional to the amount of duplication in the workload
5.
HBase Data Model
Persistent map
› Atomic get, put, and snapshot scans
Table: a sorted set of rows
› Each row is uniquely identified by a key
Region: the unit of horizontal partitioning
Column family: the unit of vertical partitioning
Store: implements a column family within a region
In production a region server (RS) can manage 100s to 1000s of regions
› Shared resources: memory, CPU, disk
[Diagram: a table with columns Col1–Col3 keyed by row key; a column family within a region]
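The data model can be sketched as nested maps, with one store per column family inside a region. This is a toy Python model with made-up names, not the HBase client API:

```python
# Toy model of the HBase data model: a region holds one Store per
# column family; each store maps row key -> {column qualifier: value}.

class Store:                      # implements one column family in a region
    def __init__(self):
        self.rows = {}            # row key -> {column qualifier: value}

    def put(self, key, col, val):
        self.rows.setdefault(key, {})[col] = val

    def get(self, key, col):
        return self.rows.get(key, {}).get(col)

    def scan(self):
        # snapshot scan in key order (the table is a sorted set of rows)
        return [(k, dict(self.rows[k])) for k in sorted(self.rows)]

# vertical partitioning: one store per column family
region = {"cf1": Store(), "cf2": Store()}
region["cf1"].put("row1", "Col1", "a")
region["cf1"].put("row1", "Col2", "b")
assert region["cf1"].get("row1", "Col1") == "a"
```

Horizontal partitioning (regions holding disjoint key ranges) is omitted here; in a real cluster a region server hosts many such regions sharing memory, CPU, and disk.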
6.
Log-Structured Merge (LSM) Key-Value Store Approach
Turns random I/O into sequential I/O
Random writes absorbed in RAM (Cm)
› Low latency
› Built-in versioning
› Write-ahead log (WAL) for durability
Periodic batch merge to disk
› High-throughput I/O
Disk components (Cd) are immutable
› Periodic merge & compact eliminates duplicate versions and tombstones
Random reads served from Cm or the Cd cache
[Diagram: Cm in memory; C1, C2, and Cd on disk, merged periodically]
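The LSM mechanics above can be condensed into a minimal Python sketch: a mutable memory component, immutable "disk" components, and a compaction that drops duplicate versions and tombstones. This is a conceptual toy, not HBase's Java implementation:

```python
# Minimal LSM sketch: random writes absorbed in a memory component (Cm),
# periodically flushed to immutable "disk" components, then compacted.

TOMBSTONE = object()

cm = {}            # mutable in-memory component (Cm)
cds = []           # immutable disk components, newest first

def put(key, val):
    cm[key] = val                  # absorbed in RAM: low latency

def delete(key):
    cm[key] = TOMBSTONE            # deletes are also just writes

def flush():
    cds.insert(0, dict(cm))        # batch merge to disk: sequential I/O
    cm.clear()

def compact():
    # merge & compact: the newest version wins, tombstones are dropped
    merged = {}
    for comp in reversed(cds):     # oldest first, newer overwrites
        merged.update(comp)
    cds[:] = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

def get(key):
    for comp in [cm] + cds:        # check Cm first, then newest Cd
        if key in comp:
            v = comp[key]
            return None if v is TOMBSTONE else v
    return None

put("a", 1); put("b", 2); flush()
put("a", 3); delete("b"); flush()
compact()
assert cds == [{"a": 3}]           # duplicate of "a" and tombstone gone
assert get("a") == 3 and get("b") is None
```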
7.
HBase Writes
Random writes absorbed in the active segment (Cm)
When the active segment is full
› It becomes immutable (a snapshot, C'm)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk and the WAL is truncated
Compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: write → WAL and Cm; prepare-for-flush → C'm; flush → Cd on HDFS]
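The segment swap on the write path can be sketched as follows; `FLUSH_SIZE` and the function names are hypothetical, and the real MemStore does this under fine-grained synchronization:

```python
# Sketch of the classic HBase write path: every write goes to the WAL
# and the active segment; a full segment becomes an immutable snapshot
# (C'm), which is flushed to disk, after which the WAL is truncated.

FLUSH_SIZE = 3          # hypothetical flush threshold (entries, not bytes)

wal = []                # write-ahead log, for durability
active = {}             # mutable active segment (Cm)
snapshot = None         # immutable segment awaiting flush (C'm)
disk = []               # flushed files (Cd)

def write(key, val):
    wal.append((key, val))          # durability first
    active[key] = val
    if len(active) >= FLUSH_SIZE:
        prepare_for_flush()
        flush_to_disk()

def prepare_for_flush():
    global active, snapshot
    snapshot = active               # the segment becomes immutable
    active = {}                     # a new mutable segment serves writes

def flush_to_disk():
    global snapshot
    disk.append(dict(sorted(snapshot.items())))   # sequential write
    snapshot = None
    wal.clear()                     # entries are now safe on disk

write("a", 1); write("b", 2); write("c", 3)
assert active == {} and disk == [{"a": 1, "b": 2, "c": 3}]
assert wal == []                    # WAL truncated after the flush
```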
8.
HBase Reads
Random reads served from Cm, C'm, or the Cd block cache
When data piles up on disk
› Hit ratio drops
› Retrieval latency goes up
Compaction rewrites small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: reads served from Cm, C'm, or Cd through the block cache]
9.
New Design: In-Memory Flush and Compaction
New compaction pipeline
› The active segment is flushed to the pipeline (in-memory flush)
› Pipeline segments are compacted in memory
› Flush to disk only when needed
[Diagram: write → WAL and Cm; in-memory-flush → compaction pipeline; prepare-for-flush → C'm; flush-to-disk → Cd on HDFS]
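The pipeline idea can be sketched in a few lines of Python. The threshold and function names are hypothetical; this is not the actual CompactingMemStore code, just the mechanism:

```python
# Sketch of the new design: the active segment is flushed into an
# in-memory compaction pipeline; pipeline segments are merged in
# memory, and data reaches disk only under memory pressure.

active = {}
pipeline = []          # immutable in-memory segments, newest first
disk = []
PIPELINE_LIMIT = 2     # hypothetical memory-pressure threshold

def in_memory_flush():
    global active
    pipeline.insert(0, dict(active))   # active segment becomes immutable
    active = {}
    in_memory_compact()
    if len(pipeline) > PIPELINE_LIMIT:
        flush_to_disk()                # only when needed

def in_memory_compact():
    # eliminate duplicates across pipeline segments, no disk I/O
    merged = {}
    for seg in reversed(pipeline):     # oldest first, newer wins
        merged.update(seg)
    pipeline[:] = [merged]

def flush_to_disk():
    disk.append(pipeline.pop())

active["a"] = 1
in_memory_flush()
active["a"] = 2                        # redundant update of the same key
in_memory_flush()
assert pipeline == [{"a": 2}]          # duplicate eliminated in memory
assert disk == []                      # no disk I/O was needed
```

This is where the "gain proportional to duplication" claim comes from: every duplicate key absorbed by the in-memory compaction is a key that never costs a disk flush or an on-disk compaction.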
10.
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (compaction pipeline)
[Diagram: the same pipeline, with memory divided between the write cache (compaction pipeline) and the read cache (block cache)]
11.
New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (compaction pipeline)
More CPU cycles for less I/O
[Diagram: the same pipeline, annotated with the CPU-for-I/O trade-off]
12.
Evaluation Settings: In-Memory Working Set
YCSB benchmark comparing the compacting memstore vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 region server
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block cache
› Compacting: 192MB flush size, 36MB block cache
High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads / 50% updates, target 1,000 ops
› 1% (long) scans / 99% updates, target 500 ops
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
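Note that the two configurations embody the read-cache-for-write-cache trade directly: they use the same overall memory budget and only shift where it is spent. A quick check of the numbers above:

```python
# Both configurations spend the same memory (MB); the compacting setup
# moves memory from the block cache (reads) to the memstore (writes).
default_total = 128 + 100      # flush size + block cache
compacting_total = 192 + 36
assert default_total == compacting_total == 228
```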
13.
Evaluation Results: Read Latency (Zipfian Distribution)
[Chart: read latency over time; annotations mark flush-to-disk and compaction events and the region where the data fits into the cache]
Work in Progress: Memory Optimizations
Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
Optimizations
› Flat layout for the index of immutable segments
› Manage (allocate, store, release) data buffers off-heap
Pros
› Locality in access to index
› Reduce memory fragmentation
› Significantly reduce GC work
› Better utilization of memory and CPU
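The flat-index idea, a single contiguous buffer searched by binary search instead of a pointer-chasing skip-list, can be illustrated with fixed-width keys. This is a Python toy with made-up helpers; the actual work targets off-heap buffers in Java:

```python
# Sketch of a flat index for an immutable segment: one contiguous
# buffer of fixed-width key entries, searched with binary search.
# Contiguity gives access locality and avoids per-node objects (no GC).

from bisect import bisect_left

KEY_WIDTH = 8   # fixed-width keys for this toy example

def build_flat_index(sorted_keys):
    # one contiguous allocation instead of a linked skip-list
    buf = bytearray()
    for k in sorted_keys:
        buf += k.encode().ljust(KEY_WIDTH)
    return bytes(buf)

def flat_lookup(buf, key):
    n = len(buf) // KEY_WIDTH
    entries = [buf[i * KEY_WIDTH:(i + 1) * KEY_WIDTH] for i in range(n)]
    target = key.encode().ljust(KEY_WIDTH)
    i = bisect_left(entries, target)      # binary search over the buffer
    return i if i < n and entries[i] == target else None

idx = build_flat_index(["apple", "kiwi", "pear"])
assert flat_lookup(idx, "kiwi") == 1
assert flat_lookup(idx, "mango") is None
```

A flat layout is only possible because the segment is immutable: the index never needs the concurrent inserts a skip-list exists to support.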
Summary
Feature intended for HBase 2.0.0
New design pros over default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces write amplification effect
› Less disk I/O and network traffic reduces load on HDFS
We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu