HBase Accelerated:
In-Memory Flush and Compaction
Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov | Hadoop Summit Europe, April 13, 2016
Motivation: Dynamic Content Processing on Top of HBase
 Real-time content processing pipelines
› Store intermediate results in persistent map
› Notification mechanism is prevalent
 Storage and notifications on the same platform
 Sieve – Yahoo’s real-time content management platform
[Diagram: Sieve pipeline – a crawl schedule drives Crawl → Docproc → Link Analysis; a Content Queue and Links feed Serving; processing runs on Apache Storm, storage on Apache HBase]
Notification Mechanism is Like a Sliding Window
 Small working set, but not necessarily a FIFO queue
 Short life-cycle – a message is deleted after it is processed
 High-churn workload – message state can be updated
 Frequent scans to consume messages
[Diagram: multiple producers enqueue messages; multiple consumers scan and consume them]
HBase Accelerated: Mission Definition
Goal
Real-time performance and less I/O in persistent KV-stores
How?
Exploit redundancies in the workload to eliminate duplicates in memory
Reduce the number of new files
Reduce the write amplification effect (overall I/O)
Reduce retrieval latencies
How much can we gain?
The gain is proportional to the amount of duplication in the workload
HBase Data Model
 Persistent map
› Atomic get, put, and snapshot scans
 Table – a sorted set of rows
› Each row is uniquely identified by a key
 Region – unit of horizontal partitioning
 Column family – unit of vertical partitioning
 Store – implements a column family within a region
 In production a region server (RS) can manage 100s-1000s of regions
› Regions share resources: memory, CPU, disk
[Diagram: a table of rows (key, Col1, Col2, Col3) partitioned horizontally into regions and vertically into column families]
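The model above is essentially a sorted map of maps. A toy Python sketch of it (class and method names are illustrative, not the HBase client API):

```python
# Toy sketch of the HBase data model: a persistent map where each row,
# identified by a unique key, holds {column family: {qualifier: value}}.
# SimpleTable and its methods are illustrative names, not the HBase API.
class SimpleTable:
    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = {}                  # row key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        return self.rows.get(row, {}).get(family, {}).get(qualifier)

    def scan(self, start, stop):
        """Scan rows in sorted key order over the range [start, stop)."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = SimpleTable(families=["cf"])
table.put("row1", "cf", "col1", "v1")
table.put("row2", "cf", "col1", "v2")
```

A real region would hold one such sorted structure per store (column family); the sketch collapses this to a single map for brevity.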
Log-Structured Merge (LSM) Key-Value Store Approach
 Random I/O → Sequential I/O
 Random writes absorbed in RAM (Cm)
› Low latency
› Built-in versioning
› Write-ahead log (WAL) for durability
 Periodic batch merge to disk
› High-throughput I/O
 Disk components (Cd) are immutable
› Periodic merge & compact eliminates duplicate versions and tombstones
 Random reads served from Cm or the Cd cache
[Diagram: memory component Cm above disk components C1 and C2, which are merged into Cd]
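The LSM cycle above can be illustrated with a toy Python store. Thresholds and names are invented for the example; a real LSM engine works with sorted files and block I/O, not dicts:

```python
# Toy LSM store: random writes are absorbed in the mutable component Cm;
# when Cm reaches a threshold it is written out as an immutable, sorted
# component; compaction merges components and drops duplicate versions.
class TinyLSM:
    def __init__(self, flush_threshold=2):
        self.cm = {}     # mutable in-memory component (Cm)
        self.cds = []    # immutable "disk" components (Cd), newest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.cm[key] = value                  # absorbed in RAM, low latency
        if len(self.cm) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write of a sorted, immutable component.
        self.cds.insert(0, dict(sorted(self.cm.items())))
        self.cm = {}

    def get(self, key):
        if key in self.cm:                    # newest data first
            return self.cm[key]
        for cd in self.cds:                   # then components, newest to oldest
            if key in cd:
                return cd[key]
        return None

    def compact(self):
        # Merge-sort all disk components; newer versions shadow older ones.
        merged = {}
        for cd in reversed(self.cds):         # oldest first, newest overwrites
            merged.update(cd)
        self.cds = [merged]

lsm = TinyLSM()
lsm.put("a", 1)
lsm.put("b", 2)     # reaching the threshold triggers a flush
lsm.put("a", 9)     # a newer version of "a"
lsm.put("c", 3)     # second flush: two immutable components now exist
before_compact = len(lsm.cds)
lsm.compact()       # merges components, dropping the stale version of "a"
```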
HBase Writes
 Random writes absorbed in the active segment (Cm)
 When the active segment is full
› It becomes immutable (a snapshot, C'm)
› A new mutable (active) segment serves writes
› The snapshot is flushed to disk and the WAL is truncated
 Compaction reads a few files, merge-sorts them, and writes back new files
[Diagram: writes append to the WAL and Cm; prepare-for-flush swaps Cm for C'm; flush writes C'm to Cd on HDFS]
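The prepare-for-flush/flush sequence can be sketched as follows. Concurrency, file formats, and the real WAL protocol are omitted; names are illustrative:

```python
# Sketch of the HBase write path: every write is appended to the WAL for
# durability, then applied to the active segment (Cm). prepare_for_flush
# swaps in a fresh active segment; once the snapshot (C'm) is safely on
# disk, the WAL prefix it covers can be truncated.
class WriteStore:
    def __init__(self):
        self.wal = []          # write-ahead log entries
        self.cm = {}           # active (mutable) segment
        self.snapshot = None   # C'm: immutable segment being flushed
        self.flush_point = 0   # WAL entries covered by the snapshot
        self.disk = []         # flushed files (Cd)

    def write(self, key, value):
        self.wal.append((key, value))   # durable before the memory update
        self.cm[key] = value

    def prepare_for_flush(self):
        # In HBase this swap is atomic; writes continue into the new Cm.
        self.snapshot, self.cm = self.cm, {}
        self.flush_point = len(self.wal)

    def flush_to_disk(self):
        self.disk.append(dict(sorted(self.snapshot.items())))
        self.snapshot = None
        del self.wal[:self.flush_point]  # truncate: data is now persistent

store = WriteStore()
store.write("k1", "v1")
store.prepare_for_flush()
store.write("k2", "v2")   # lands in the new active segment
store.flush_to_disk()
```

Note how only the WAL prefix covered by the snapshot is truncated; entries for writes that arrived after the swap must survive until the next flush.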
HBase Reads
 Random reads served from Cm, C'm, or the Cd block cache
 When data piles up on disk
› Hit ratio drops
› Retrieval latency goes up
 Compaction re-writes small files into fewer, bigger files
› Causes replication-related network and disk I/O
[Diagram: reads consult Cm, C'm, and the block cache over Cd on HDFS]
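A read (and especially a scan) must merge several sorted components and keep only the newest version of each key. A toy sketch of that merge, using the standard-library `heapq.merge`:

```python
import heapq

# Sketch of a scan across several sorted components (Cm, C'm, Cd files):
# entries are merge-sorted and only the newest version of each key is
# returned. Components are ordered newest first; ties on a key are broken
# by tagging each entry with its component's age.
def merged_scan(components):
    """components: sorted (key, value) lists, newest component first."""
    tagged = [[(k, age, v) for k, v in comp]
              for age, comp in enumerate(components)]
    last_key = object()   # sentinel that equals no real key
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:              # first hit per key is the newest
            yield key, value
            last_key = key

cm = [("a", "a-new"), ("c", "c1")]       # in-memory component, newest
cd = [("a", "a-old"), ("b", "b1")]       # on-disk component, older
```

The more components there are, the more work each scan does, which is why piling up small files hurts read latency.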
New Design: In-Memory Flush and Compaction
 New compaction pipeline
› The active segment is flushed to the pipeline
› Pipeline segments are compacted in memory
› Flush to disk only when needed
[Diagram: Cm is in-memory-flushed into the compaction pipeline; the pipeline is flushed to disk (Cd on HDFS) only when needed; WAL and block cache as before]
Trade read cache (BlockCache) for write cache (compaction pipeline)
More CPU cycles for less I/O
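A minimal sketch of the compacting memstore idea. All thresholds and names are invented for illustration; the real implementation tracks sizes in bytes and runs compaction on a background thread:

```python
# Toy sketch of the compacting memstore: the active segment is flushed
# into an in-memory compaction pipeline; pipeline segments are merged in
# RAM so duplicate versions never reach disk; a disk flush happens only
# when the compacted pipeline still exceeds a size limit.
class CompactingMemstore:
    def __init__(self, segment_size=2, flush_limit=4):
        self.active = {}
        self.pipeline = []        # immutable in-memory segments, newest first
        self.disk = []            # files flushed to "HDFS"
        self.segment_size = segment_size
        self.flush_limit = flush_limit

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.segment_size:
            self._in_memory_flush()

    def _in_memory_flush(self):
        self.pipeline.insert(0, self.active)
        self.active = {}
        if len(self.pipeline) > 1:
            self._in_memory_compact()
        if sum(len(s) for s in self.pipeline) >= self.flush_limit:
            self._flush_to_disk()

    def _in_memory_compact(self):
        merged = {}
        for seg in reversed(self.pipeline):   # oldest first, newest overwrites
            merged.update(seg)                # duplicates eliminated in RAM
        self.pipeline = [merged]

    def _flush_to_disk(self):
        self.disk.append(dict(sorted(self.pipeline[0].items())))
        self.pipeline = []

    def get(self, key):
        for seg in [self.active] + self.pipeline:
            if key in seg:
                return seg[key]
        for f in reversed(self.disk):
            if key in f:
                return f[key]
        return None

m = CompactingMemstore()
for v in (1, 2):          # high-churn updates to the same two keys
    m.put("a", v)
    m.put("b", v)
churn_disk_files = len(m.disk)   # still 0: duplicates never reached disk
m.put("c", 3)
m.put("d", 3)             # now the compacted pipeline hits the limit
```

This is the gain the deck claims: on a high-churn working set, in-memory compaction absorbs the duplicate versions, so flushes and disk compactions become rarer.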
Evaluation Settings: In-Memory Working Set
 YCSB: compares the compacting vs. the default memstore
 Small cluster: 3 HDFS nodes on a single rack, 1 RS
› 1GB heap space, MSLAB enabled (2MB chunks)
› Default: 128MB flush size, 100MB block cache
› Compacting: 192MB flush size, 36MB block cache
 High-churn workload, small working set
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 50% reads, 50% updates, target 1,000 ops
› 1% (long) scans, 99% updates, target 500 ops
 Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
Evaluation Results: Read Latency (Zipfian Distribution)
[Graph: read latency over time; annotations mark a flush to disk, a disk compaction, and the point where the data fits into the cache]
Evaluation Results: Read Latency (Uniform Distribution)
[Graph: read latency over time; annotation marks a region split]
Evaluation Results: Scan Latency (Uniform Distribution)
Evaluation Settings: Handling Tombstones
 YCSB: compares the compacting vs. the default memstore
 Small cluster: 3 HDFS nodes on a single rack, 1 RS
› 1GB heap space, MSLAB enabled (2MB chunks), 128MB flush size, 64MB block cache
› Default: minimum 4 files for compaction
› Compacting: minimum 2 files for compaction
 High-churn workload, small working set with deletes
› 128,000 records, 1KB value field
› 10 threads running 5 million operations, various key distributions
› 40% reads, 40% updates, 20% deletes (with a 50,000-update head start), target 1,000 ops
› 1% (long) scans, 66% updates, 33% deletes (with head start), target 500 ops
 Measure average latency over time
› Latencies accumulated over intervals of 10 seconds
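The delete-heavy workload exercises tombstone handling: a delete writes a short-lived marker that shadows older versions, and compaction can drop both the shadowed versions and (when every segment takes part, i.e. a major compaction) the tombstone itself. A toy sketch, with invented names:

```python
TOMBSTONE = object()   # sentinel delete marker

# Toy sketch of tombstone handling: a delete writes a marker that shadows
# older versions of the key; compaction drops the shadowed versions and,
# because *all* segments take part in this compaction, the tombstone too.
class SegmentStore:
    def __init__(self):
        self.segments = [{}]            # newest first; segments[0] is active

    def put(self, key, value):
        self.segments[0][key] = value

    def delete(self, key):
        self.segments[0][key] = TOMBSTONE

    def seal(self):
        self.segments.insert(0, {})     # freeze the active segment

    def get(self, key):
        for seg in self.segments:       # newest version (or tombstone) wins
            if key in seg:
                return None if seg[key] is TOMBSTONE else seg[key]
        return None

    def compact(self):
        merged = {}
        for seg in reversed(self.segments):   # oldest first, newest overwrites
            merged.update(seg)
        # Dropping tombstones is safe only because every segment was merged.
        self.segments = [{k: v for k, v in merged.items()
                          if v is not TOMBSTONE}]

s = SegmentStore()
s.put("msg1", "payload")
s.seal()
s.delete("msg1")     # the consumer processed the message
s.compact()
```

In the notification workload, messages are deleted shortly after being written, so in-memory compaction can often cancel the put/delete pair before anything is flushed.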
Evaluation Results: Read Latency (Zipfian Distribution)
(total 2 flushes and 1 disk compaction)
(total 15 flushes and 4 disk compactions)
Evaluation Results: Read Latency (Uniform Distribution)
(total 3 flushes and 2 disk compactions)
(total 15 flushes and 4 disk compactions)
Evaluation Results: Scan Latency (Zipfian Distribution)
Work in Progress: Memory Optimizations
 Current design
› Data stored in flat buffers, index is a skip-list
› All memory allocated on-heap
 Optimizations
› Flat layout for the index of immutable segments
› Manage (allocate, store, release) data buffers off-heap
 Pros
› Locality in access to index
› Reduce memory fragmentation
› Significantly reduce GC work
› Better utilization of memory and CPU
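The flat-layout optimization can be illustrated briefly: once a segment is immutable, its concurrent skip-list index can be replaced by a plain sorted array searched with binary search, which improves locality and removes per-node overhead. A Python sketch (`bisect` stands in for the binary search over a flat buffer; names are illustrative):

```python
import bisect

# Sketch of the flat-index idea: an immutable segment needs no concurrent
# skip-list, so its index can be a plain sorted array with binary search.
# This improves access locality and avoids per-node allocation overhead.
class FlatSegment:
    def __init__(self, mutable_segment):
        # Built once, at the moment the segment becomes immutable.
        self.keys = sorted(mutable_segment)
        self.values = [mutable_segment[k] for k in self.keys]

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

seg = FlatSegment({"b": 2, "a": 1, "c": 3})
```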
Status: Umbrella JIRA HBASE-14918
 HBASE-14919, HBASE-15016, HBASE-15359 – infrastructure refactoring
› Status: committed
 HBASE-14920 new compacting memstore
› Status: pre-commit
 HBASE-14921 memory optimizations (memory layout, off-heaping)
› Status: under code review
Summary
 Feature intended for HBase 2.0.0
 New design pros over default implementation
› Predictable retrieval latency by serving (mainly) from memory
› Less compaction on disk reduces write amplification effect
› Less disk I/O and network traffic reduces load on HDFS
 We would like to thank the reviewers
› Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu
Thank You
Evaluation Results: Write Latency
