A fully optimized HBase cluster could easily hit the limit of the underlying storage device’s capability, which is beyond the reach of software optimization alone. To get around this constraint, we need a new design that brings data processing and data storage closer together. In this presentation, we will look at how persistent memory will change the way large datasets are stored. We will review the hardware characteristics of 3D XPoint™, a new persistent memory technology with low latency and high capacity. We will also discuss opportunities for further improvement within the HBase framework using persistent memory.
4. Problems at Hand
• Disk writes in burst fashion
  • High bandwidth demand during flush and compaction
• Disk bandwidth inflation
  • Write: Key/Value (KV) pairs are written to disk many times across consecutive compactions
  • Read: all KVs in an HFile block are brought back to memory when the block is hit
• Data format changing
  • Write: in-memory maps are "serialized" into HFile blocks
  • Read: linear scan within HFile blocks
These problems come with the Mem+Disk hardware architecture.
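The bandwidth-inflation point can be made concrete with a little arithmetic: a point read of one small value drags an entire HFile block off the device. A minimal sketch (block and value sizes taken from the speaker notes; this is illustrative arithmetic, not HBase code):

```java
// Read inflation: fetching one small value pulls a whole HFile block.
public class InflationExample {
    public static void main(String[] args) {
        int blockSize = 64 * 1024;   // 64 KB HFile block
        int valueSize = 100;         // 100-byte value
        int inflation = blockSize / valueSize;
        // I/O moved is ~655x the data actually requested.
        System.out.println("Read inflation: ~" + inflation + "x");
    }
}
```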
8. Addressing
• Disk writes in burst fashion → faster drives (PCIe SSDs): expensive, still not fast enough
• Disk bandwidth inflation → larger DRAM: more expensive, small, volatile
• Data format changing → no remedy
Do we have to persist on a block device?
9. Addressing with Persistent Memory
• Disk writes in burst fashion → high bandwidth, low latency
• Disk bandwidth inflation → high bandwidth, non-volatile
• Data format changing → could be eliminated
10. Persistent Memory
[Figure: memory/storage pyramid — registers, CPU caches (L1-L3), DRAM, persistent memory, NAND SSD, HDD and other block devices; bandwidth and latency improve toward the top, size grows toward the bottom]
• Persistent memory is byte addressable, like DRAM
• It fills the performance gap between DRAM and NAND SSD
11. Experiment on BucketCache
• BucketCache on persistent memory
• Code change in HBase
• Introduce new IOEngine for BucketCache
• Use libraries from http://pmem.io for persistent memory operations
• Experiment
• Persistent memory emulation with configurable latencies
• Focus on performance impact
11
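At its core, a persistent-memory-backed bucket store serves cached blocks from a memory-mapped region of a file that would live on a DAX-mounted pmem filesystem, rather than from a block device. A minimal sketch under that assumption — the class name, method signatures, and file layout are illustrative, not the actual HBase IOEngine interface, and on an ordinary filesystem it simply maps a regular file:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch: cached blocks are kept in a memory-mapped file.
// On a DAX-mounted persistent-memory filesystem the mapping is direct
// load/store to pmem, with no page cache in between.
public class PmemBucketStore {
    private final MappedByteBuffer region;

    public PmemBucketStore(Path backingFile, int capacity) throws IOException {
        try (FileChannel ch = FileChannel.open(backingFile,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // A READ_WRITE mapping extends the file to `capacity` bytes.
            region = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    // Copy a serialized block into the region at the given bucket offset.
    public void write(int offset, byte[] block) {
        ByteBuffer dup = region.duplicate();
        dup.position(offset);
        dup.put(block);
        region.force();  // flush the stores so they are durable on the device
    }

    // Read `length` bytes back from the given bucket offset.
    public byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        ByteBuffer dup = region.duplicate();
        dup.position(offset);
        dup.get(out);
        return out;
    }
}
```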
12. Experiment Design
• Basic setup
• HBase 2.0.0-SNAPSHOT, YCSB 0.6.0, Hadoop 2.5.2
• Preloaded table, 10 fields per row, 100 Bytes per field
• 100% un-throttled uniform read, measures after BlockCache is filled
• Experiments
• Baseline: DRAM_LRU_BlockCache only
• PM runs: DRAM_LRU_BlockCache + different_size_PM_BucketCache
• Measure the throughput/latency impact
12
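The workload described above maps onto a YCSB core-workload configuration along these lines (record and operation counts are illustrative assumptions; only the field shape, read mix, and distribution are stated on the slide):

```properties
# YCSB CoreWorkload approximating the slide's setup (counts are illustrative)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=100000000
operationcount=100000000
fieldcount=10
fieldlength=100
readproportion=1.0
updateproportion=0
insertproportion=0
scanproportion=0
requestdistribution=uniform
```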
13. Result: Throughput
Speedup with extra PM BucketCache, normalized to the 0% baseline:

Size of extra PM BucketCache:  0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
Normalized speedup:            1.00  1.06  1.13  1.23  1.44  1.65  1.98  2.44  3.13  4.22  5.95

Roughly a 5x speedup once the PM BucketCache covers most of the dataset.
15. Summary
• Significant performance improvement with persistent memory (~6x throughput, latency reduced by 85%)
• Offers a new possible solution on the architecture side
  • Persist partially or completely on persistent memory
  • No more format-changing overhead
  • Faster recovery
Speaker notes:
• These are general issues wherever a two-level LSM-tree architecture is used.
• For 64 KB HFile blocks and 100-byte values, read inflation for caching is on the order of hundreds.
• Reference latencies: DRAM ~100 ns; NAND SSD (Intel P3700) ~20 µs; NAND SSD (Intel S3700) ~50-60 µs; HDD (WD Black, per Iometer): avg. 5-6 ms, max 50 ms.