1. HBase Accelerated:
In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov | Hadoop Summit Europe, April 13, 2016
2. Motivation: Dynamic Content Processing on Top of HBase
Real-time content processing pipelines
› Store intermediate results in a persistent map
› A notification mechanism is prevalent
Storage and notifications on the same platform
Sieve – Yahoo's real-time content management platform
[Diagram: the Sieve pipeline – Crawl, Docproc, and Link Analysis stages connected by queues (crawl schedule, content queue, links) feeding Serving; processing runs on Apache Storm, state and notifications on Apache HBase]
3. Notification Mechanism is Like a Sliding Window
Small working set, but not necessarily a FIFO queue
Short life-cycle – a message is deleted after it is processed
High-churn workload – message state can be updated
Frequent scans to consume messages
[Diagram: multiple producers write messages into the store while multiple consumers scan and consume them, like a sliding window]
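The bullets above describe the message lifecycle. A minimal sketch of that lifecycle against the standard HBase client API is shown below; the table and column names (notifications, m, state) are made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NotificationQueueSketch {
  // Hypothetical table and column names, for illustration only.
  private static final TableName TABLE = TableName.valueOf("notifications");
  private static final byte[] CF = Bytes.toBytes("m");
  private static final byte[] STATE = Bytes.toBytes("state");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TABLE)) {
      // Producer: enqueue a message; its state column may later be updated in place (high churn).
      Put put = new Put(Bytes.toBytes("msg-0001"));
      put.addColumn(CF, STATE, Bytes.toBytes("NEW"));
      table.put(put);

      // Consumer: frequent scans over the small working set, then delete after processing.
      Scan scan = new Scan();
      scan.addFamily(CF);
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          process(r);                            // application-specific work
          table.delete(new Delete(r.getRow()));  // short life-cycle: remove once processed
        }
      }
    }
  }

  private static void process(Result r) { /* consume the message */ }
}
```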
4. HBase Accelerated: Mission Definition
Goal
Real-time performance and less I/O in persistent KV-stores
How?
Exploit redundancies in the workload to eliminate duplicates in memory
Reduce the number of new files
Reduce the write amplification effect (overall I/O)
Reduce retrieval latencies
How much can we gain?
The gain is proportional to the amount of duplication in the workload
5. HBase Data Model
Persistent map
› Atomic get, put, and snapshot scans
Table – a sorted set of rows
› Each row uniquely identified by a key
Region – the unit of horizontal partitioning
Column family – the unit of vertical partitioning
Store – implements a column family within a region
In production a region server (RS) can manage 100s–1000s of regions
› Shared resources: memory, CPU, disk
[Diagram: a row identified by key with columns Col1, Col2, Col3 grouped into a column family; rows partitioned into regions]
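Regions and column families map directly onto the client admin API. A hedged sketch of creating a table with one column family, pre-split into several regions, assuming the HBase 2.x builder-style API (the table name, family name, and split keys are made up):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableSketch {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // One column family ("cf") means one Store in every region that hosts this table.
      TableDescriptor desc = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("docs"))
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
          .build();

      // Pre-split the key space: each split key starts a new region,
      // the unit of horizontal partitioning assigned to some region server.
      byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("p") };
      admin.createTable(desc, splitKeys);
    }
  }
}
```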
6. Log-Structured Merge (LSM) Key-Value Store Approach
Random I/O → Sequential I/O
Random writes absorbed in RAM (Cm)
› Low latency
› Built-in versioning
› Write-ahead log (WAL) for durability
Periodic batch merge to disk
› High-throughput I/O
Disk components (Cd) are immutable
› Periodic merge & compact eliminates duplicate versions and tombstones
Random reads served from Cm or from the Cd cache
[Diagram: writes land in the memory component Cm; merges flush it to immutable disk components C1, C2, ... (Cd)]
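To make the write/read/merge split concrete, here is a deliberately tiny, single-threaded toy model of the LSM idea; it is illustrative only and not the HBase implementation (all names are invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy LSM store: illustrative only, not the HBase implementation. */
public class ToyLsm {
  private final List<String> wal = new ArrayList<>();                              // stand-in for the write-ahead log
  private final NavigableMap<String, String> cm = new ConcurrentSkipListMap<>();   // memory component Cm
  private final List<NavigableMap<String, String>> cds = new ArrayList<>();        // immutable disk components Cd, newest first

  /** Random write: made durable by a WAL append, then absorbed in RAM. */
  public void put(String key, String value) {
    wal.add(key + "=" + value);
    cm.put(key, value);
  }

  /** Random read: newest data first (Cm), then older disk components. */
  public String get(String key) {
    String v = cm.get(key);
    if (v != null) return v;
    for (NavigableMap<String, String> cd : cds) {
      v = cd.get(key);
      if (v != null) return v;
    }
    return null;
  }

  /** Periodic batch merge to "disk": one sequential write, after which the WAL can be truncated. */
  public void flush() {
    cds.add(0, new TreeMap<>(cm));
    cm.clear();
    wal.clear();
  }
}
```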
7. HBase Writes
Random writes absorbed in the active segment (Cm)
When the active segment is full
› It becomes immutable (a snapshot, C'm)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk and the WAL is truncated
Compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: writes go to the WAL and to Cm in memory; prepare-for-flush turns Cm into C'm, which flush writes to Cd on HDFS]
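A toy continuation of the sketch above, with the same caveats, showing just the prepare-for-flush swap and the flush itself:

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy flush path: snapshot swap, flush to "disk", WAL truncation. Illustrative only. */
public class ToyFlush {
  private NavigableMap<String, String> active = new ConcurrentSkipListMap<>(); // Cm, serves writes
  private NavigableMap<String, String> snapshot;                               // C'm, immutable

  /** Called when the active segment exceeds the configured flush size. */
  void prepareForFlush() {
    snapshot = active;                       // the active segment becomes an immutable snapshot
    active = new ConcurrentSkipListMap<>();  // a fresh mutable segment keeps serving writes
  }

  /** Runs in the background while the new active segment absorbs writes. */
  void flushToDisk() {
    writeHFile(new TreeMap<>(snapshot));     // one sequential write of the sorted snapshot
    snapshot = null;
    truncateWal();                           // WAL entries covered by the flush are no longer needed
  }

  void writeHFile(NavigableMap<String, String> sorted) { /* simulated disk write */ }
  void truncateWal() { /* simulated WAL truncation */ }
}
```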
8. HBase Reads
Random reads served from Cm, C'm, or the Cd block cache
When data piles up on disk
› Hit ratio drops
› Retrieval latency goes up
Compaction re-writes small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: reads check Cm and C'm in memory, then the block cache backed by Cd on HDFS]
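The compaction step itself is essentially a merge of sorted files in which only the newest version of each key survives. A toy sketch under the same caveats (a real store streams a k-way merge instead of materializing maps):

```java
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy disk compaction: merge several small sorted files into one bigger one. Illustrative only. */
public class ToyCompaction {
  /** Input files are ordered newest-first; older duplicate versions are dropped. */
  static NavigableMap<String, String> compact(List<NavigableMap<String, String>> newestFirst) {
    NavigableMap<String, String> merged = new TreeMap<>();
    for (NavigableMap<String, String> file : newestFirst) {
      for (Map.Entry<String, String> e : file.entrySet()) {
        merged.putIfAbsent(e.getKey(), e.getValue()); // keep only the newest version of each key
      }
    }
    // A major compaction would also drop tombstones (delete markers) here.
    return merged;
  }
}
```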
9. New Design: In-Memory Flush and Compaction
New compaction pipeline
› Active segment flushed to the pipeline (in-memory flush)
› Pipeline segments compacted in memory
› Flushed to disk only when needed
[Diagram: writes go to the WAL and to Cm; an in-memory flush moves Cm into the compaction pipeline; prepare-for-flush and flush-to-disk write the pipeline out to Cd on HDFS, next to the block cache]
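A toy sketch of the pipeline idea, with the same caveats as the earlier sketches: an in-memory flush pushes the active segment into a pipeline of immutable segments, in-memory compaction merges those segments (eliminating duplicate versions) without any disk I/O, and only a later flush-to-disk writes the result out.

```java
import java.util.LinkedList;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy compacting memstore: illustrative only, not the HBase implementation. */
public class ToyCompactingMemstore {
  private NavigableMap<String, String> active = new ConcurrentSkipListMap<>();          // mutable segment (Cm)
  private final LinkedList<NavigableMap<String, String>> pipeline = new LinkedList<>(); // immutable segments, newest first

  /** In-memory flush: the full active segment joins the pipeline; no disk I/O. */
  void inMemoryFlush() {
    pipeline.addFirst(active);
    active = new ConcurrentSkipListMap<>();
  }

  /** In-memory compaction: merge pipeline segments, dropping duplicate versions. */
  void inMemoryCompaction() {
    NavigableMap<String, String> merged = new TreeMap<>();
    for (NavigableMap<String, String> segment : pipeline) {      // newest first
      for (Map.Entry<String, String> e : segment.entrySet()) {
        merged.putIfAbsent(e.getKey(), e.getValue());            // the newest version wins
      }
    }
    pipeline.clear();
    pipeline.addFirst(merged);                                   // a single compacted segment remains
  }

  /** Flush to disk only when needed, e.g. under global memory pressure. */
  void flushToDisk() {
    for (NavigableMap<String, String> segment : pipeline) {
      writeHFile(segment);
    }
    pipeline.clear();
  }

  void writeHFile(NavigableMap<String, String> segment) { /* simulated disk write */ }
}
```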
10. New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (compaction pipeline)
[Diagram: the same write path as above, with the compaction pipeline acting as a write cache in memory alongside a smaller BlockCache read cache]
11. New Design: In-Memory Flush and Compaction
Trade read cache (BlockCache) for write cache (compaction pipeline)
More CPU cycles for less I/O
[Diagram: the same architecture, annotated with the CPU vs. I/O trade-off]
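In the HBase 2.x releases that eventually shipped this work, the feature is exposed per column family; the sketch below shows what enabling it can look like. The API and property names (MemoryCompactionPolicy, setInMemoryCompaction, hbase.hregion.compacting.memstore.type) follow the later releases rather than this 2016 talk, so treat them as assumptions.

```java
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class InMemoryCompactionConfigSketch {
  static TableDescriptor withInMemoryCompaction() {
    return TableDescriptorBuilder.newBuilder(TableName.valueOf("notifications")) // hypothetical table
        .setColumnFamily(ColumnFamilyDescriptorBuilder
            .newBuilder(Bytes.toBytes("m"))                                      // hypothetical family
            // BASIC merges pipeline segment indexes in memory; EAGER also eliminates duplicate data.
            .setInMemoryCompaction(MemoryCompactionPolicy.BASIC)
            .build())
        .build();
    // A cluster-wide default can also be set via hbase.hregion.compacting.memstore.type
    // (NONE / BASIC / EAGER) in hbase-site.xml.
  }
}
```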
12. Evaluation Settings: In-Memory Working Set
YCSB: compares the compacting memstore vs. the default memstore
Small cluster: 3 HDFS nodes on a single rack, 1 RS
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block cache
› Compacting: 192MB flush size, 36MB block cache
High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads / 50% updates, target 1,000 ops/sec
› 1% (long) scans / 99% updates, target 500 ops/sec
Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
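For reference, a hedged sketch of how the memstore and block-cache sizes above map onto standard HBase configuration keys (the keys are standard settings, but the exact fractions assume the 1GB heap described in the setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class EvaluationConfigSketch {
  /** Settings for the "compacting" run: 192MB flush size, ~36MB block cache on a 1GB heap. */
  static Configuration compactingRun() {
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.hregion.memstore.flush.size", 192L * 1024 * 1024);     // flush size in bytes
    conf.setFloat("hfile.block.cache.size", 0.035f);                           // ~36MB of a 1GB heap
    conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);             // MSLAB enabled
    conf.setInt("hbase.hregion.memstore.mslab.chunksize", 2 * 1024 * 1024);    // 2MB chunks
    return conf;
  }
}
```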
13. Evaluation Results: Read Latency (Zipfian Distribution)
[Graph: average read latency over time for the default vs. compacting memstore; annotations mark flush-to-disk and disk compaction events and the point where the data fits into the cache]
20. Work in Progress: Memory Optimizations
Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
Optimizations
› Flat layout for the index of immutable segments
› Manage (allocate, store, release) data buffers off-heap
Pros
› Locality in access to the index
› Reduced memory fragmentation
› Significantly reduced GC work
› Better utilization of memory and CPU
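A toy illustration of the flat-layout idea, under the same caveats as the earlier sketches: once a segment is immutable, its skip-list index can be replaced by a flat sorted array searched with binary search, which is more compact and cache-friendly than per-node skip-list objects.

```java
import java.util.Arrays;
import java.util.NavigableMap;

/** Toy flat index for an immutable segment: illustrative only. */
public class FlatSegmentIndex {
  private final String[] keys;    // sorted and laid out contiguously: good locality, no per-node overhead
  private final String[] values;

  /** Built once, when the segment becomes immutable (e.g. on in-memory flush). */
  FlatSegmentIndex(NavigableMap<String, String> mutableSegment) {
    keys = mutableSegment.keySet().toArray(new String[0]);
    values = mutableSegment.values().toArray(new String[0]);
  }

  String get(String key) {
    int i = Arrays.binarySearch(keys, key);   // O(log n), no skip-list pointer chasing
    return i >= 0 ? values[i] : null;
  }
}
```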
22. Summary
Feature intended for HBase 2.0.0
Advantages of the new design over the default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces the write amplification effect
› Less disk I/O and network traffic reduces the load on HDFS
We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu