Now that you've seen HBase 1.0, what's ahead in HBase 2.0 and beyond, and why? Find out from this panel of people who have designed and/or are working on 2.0 features.
HBase 2.0 and Beyond Panel
Moderator: Jonathan Hsieh
Panel: Matteo Bertozzi / Sean Busbey / Jingcheng Du / Lars Hofhansl / Enis Soztutar / Jimmy Xiang
Who are we?
Matteo Bertozzi – HBase PMC, Cloudera
Sean Busbey – HBase PMC, Cloudera
Jingcheng Du – Intel
Lars Hofhansl – HBase PMC, 0.94.x RM, Salesforce.com
Jonathan Hsieh – HBase PMC
Enis Soztutar – HBase PMC, 1.0.0 RM, Hortonworks
Jimmy Xiang – HBase PMC, Cloudera
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Why Moderate Object Storage (MOB)?
A growing demand for the ability to store moderate-sized objects (MOBs) in HBase (100KB up to 10MB).
Compactions create write amplification, and write performance degrades as massive numbers of MOBs accumulate in HBase:
Too many store files -> Frequent region compactions -> Massive I/O -> Slow compactions -> Flush delay -> High memory usage -> Blocking updates
[Chart: Data Insertion Average Latency (5MB/record, 32 pre-split regions): 8.098s at 125G, 10.159s at 500G, 10.700s at 1T]
[Chart: 1T Data Insertion Average Latency (5MB/record, 32 pre-split regions): latency (sec) over an 8-hour run]
Benefits
Move the MOBs out of the main I/O path to make the write amplification more predictable.
The same APIs to read and write MOBs (see the client sketch below).
Work with HBase export/copy table, bulk load, replication and snapshot features.
Work with the HBase security mechanisms.
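A minimal client-side sketch of the above, assuming the MOB attributes surface on HColumnDescriptor as setMobEnabled/setMobThreshold (names from the MOB work; they may differ in the release you run). The reads and writes themselves are plain Put/Get:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MobExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Mark the column family as MOB-enabled: values above the threshold are
      // stored in separate MOB files, outside the normal compaction path.
      // (setMobEnabled/setMobThreshold are assumed names from the MOB work.)
      HColumnDescriptor mobFamily = new HColumnDescriptor("f");
      mobFamily.setMobEnabled(true);
      mobFamily.setMobThreshold(100 * 1024L); // 100KB threshold
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mob_table"));
      desc.addFamily(mobFamily);
      admin.createTable(desc);

      // Reads and writes use the ordinary client API; nothing MOB-specific here.
      try (Table table = conn.getTable(TableName.valueOf("mob_table"))) {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), new byte[5 * 1024 * 1024]); // 5MB blob
        table.put(put);
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("value size = " + r.getValue(Bytes.toBytes("f"), Bytes.toBytes("q")).length);
      }
    }
  }
}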
[Chart: Data Insertion Average Latency (5MB/record, 32 pre-split regions): MOB disabled 8.098s/10.159s/10.700s vs. MOB enabled 6.851s/6.963s/7.033s at 125G/500G/1T]
[Chart: Average Latency for R/W Mixed Workload (5MB/record, 32 pre-split regions, 300G pre-load, 200G insertion): data insertion 10.590s disabled vs. 6.212s enabled; data random get 57.975s disabled vs. 33.886s enabled]
[Chart: Data Insertion Average Latency over time (minutes), MOB enabled vs. MOB disabled]
[Chart: Data Random Get Average Latency over time (minutes), MOB enabled vs. MOB disabled]
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Problem – Multi-Step Ops & Failures
DDL & other operations consist of multiple steps
e.g. the Create Table handler:
Create regions on FileSystem
Add regions to META
Assign
cpHost.postCreateTableHandler() -> (ACLs)
If we crash between steps, we end up with a half-completed state,
e.g. FileSystem layout present, META entries not present.
hbck MAY be able to repair it.
If we crash in the middle of a single step (e.g. creating N regions on the FileSystem),
hbck does not have enough information to rebuild a correct state.
Manual intervention is required to repair the state.
Solution – Multi-Step Ops & Failures
Rewrite each operation to use a State-Machine
e.g. the Create Table handler:
Create regions on FileSystem
Add regions to META
Assign
cpHost.postCreateTableHandler() -> (ACLs)
...each executed step is written to a store,
so if the machine goes down
we know what was pending,
what should be rolled back,
or how to continue to complete the operation (see the sketch below).
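A minimal sketch of the state-machine idea: every completed step is persisted, so a restarted Master can resume or roll back. The class and method names here (ProcedureStore, persist, loadStateOrDefault) are illustrative placeholders, not the actual Procedure-v2 API:

// Illustrative sketch only: names are hypothetical, not the real Procedure-v2 API.
enum CreateTableState { CREATE_FS_LAYOUT, ADD_TO_META, ASSIGN_REGIONS, POST_CREATE_HOOK, DONE }

interface ProcedureStore {
  CreateTableState loadStateOrDefault(CreateTableState def); // last persisted state, if any
  void persist(CreateTableState state);                      // durably record progress
}

class CreateTableProcedure {
  private final ProcedureStore store;
  private CreateTableState state = CreateTableState.CREATE_FS_LAYOUT;

  CreateTableProcedure(ProcedureStore store) { this.store = store; }

  // Run the remaining steps; after a crash we resume from the last persisted state.
  void run() {
    state = store.loadStateOrDefault(state);
    while (state != CreateTableState.DONE) {
      executeStep(state);          // do the work for this step
      state = nextState(state);    // advance the state machine
      store.persist(state);        // record progress before moving on
    }
  }

  // Undo the already-persisted steps in reverse order if the operation must be aborted.
  void rollback() { /* walk completed steps backwards and undo each one */ }

  private void executeStep(CreateTableState s) { /* create FS layout, update META, assign, ... */ }

  private CreateTableState nextState(CreateTableState s) {
    return CreateTableState.values()[s.ordinal() + 1];
  }
}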
Procedure-v2/Notification-Bus
The Procedure v2/NotificationBus aims to provide a unified way to build:
Synchronous calls, with the ability to see the state/result in case of failure.
Multi-step procedures with rollback/roll-forward ability in case of failure (e.g. create/delete table)
Notifications across multiple machines (e.g. ACLs/Labels/Quota cache updates)
Coordination of long-running/heavy procedures (e.g. compactions, splits, …)
Procedures across multiple machines (e.g. Snapshots, Assignment)
Replication for Master operations (e.g. grant/revoke)
Procedure-v2/Notification-Bus - Roadmap
Apache HBase 1.1
Fault tolerant Master Operations (e.g. create/delete/…)
Sync Client (We are still wire compatible, both ways)
Apache HBase 1.2
Master WebUI
Notification bus, with at least Snapshot using it.
Apache HBase 1.3+ or 2.0 (depending on how hard it is to keep Master/RS compatibility)
Replace Cache Updates, Assignment Manager, Distributed Log Replay,…
New Features: Coordinated compactions, Master ops Replication (e.g. grant/revoke)
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
ZK-based Region Assignment
Region states could be inconsistent
Assignment info stored in both meta table and ZooKeeper
Both Master and RegionServer can update them
Limited scalability and operational efficiency
ZooKeeper events used for coordination
ZK-less Region Assignment
RPC based
Master, the true coordinator
Only Master can update meta table
All state changes are persisted
Follow the state machine
RegionServer does what it is told by the Master
Report status to Master
Each step needs acknowledgement from Master
Current Status
Off by default in 1.0 (see the config sketch below)
Impact
Master is in the critical path
Meta should be co-located with Master
Procedure V2 could solve it (future work)
Deployment topology change
Master is a RegionServer, serves small system tables
Blog post has more info
https://blogs.apache.org/hbase/entry/hbase_zk_less_region_assignment
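Since it ships disabled, here is a minimal sketch of the switch, assuming the hbase.assignment.usezk property described in the blog post; in a real deployment this belongs in hbase-site.xml on the Master and RegionServers rather than being set programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkLessAssignmentConfig {
  public static void main(String[] args) {
    // Illustrative only: property name taken from the ZK-less assignment blog post.
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.assignment.usezk", false); // use the RPC/meta-based assignment path
    System.out.println("hbase.assignment.usezk = " + conf.get("hbase.assignment.usezk"));
  }
}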
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Compatibility Dimensions
(the long version)
Client-Server wire protocol compatibility
Server-Server protocol compatibility
File format compatibility
Client API compatibility
Client Binary compatibility
Server-Side Limited API compatibility (taken from Hadoop)
Dependency Compatibility
Operational Compatibility
TL;DR:
A patch upgrade is a drop-in replacement.
A minor upgrade requires no application or client code modification.
A major upgrade allows us - the HBase community - to make breaking changes.
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Improving read availability
HBase is CP
When a node goes down, some regions are unavailable until recovery
Some classes of applications want high availability (for reads)
Region replicas (see the table-setup sketch below)
TIMELINE consistency reads
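A minimal sketch of enabling region replicas on a table with the 1.0 client API (the table name and replica count here are just examples):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RegionReplicaTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("ha_table"));
      desc.addFamily(new HColumnDescriptor("f"));
      // Two replicas per region: the writable primary plus one read-only
      // secondary hosted on a different RegionServer.
      desc.setRegionReplication(2);
      admin.createTable(desc);
    }
  }
}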
Phase contents
Phase 1
Region replicas
Stale data up to minutes (15 min); clients can detect stale reads (see the read sketch after this list)
In 1.0
Phase 2
millisecond-latencies for staleness (WAL replication)
Replicas for the meta table
Region splits and merges with region replicas
Scan support
In 1.1
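And a minimal sketch of a TIMELINE-consistency read against such a table with the 1.0 client API; Result.isStale() tells you whether the answer came from a possibly-lagging secondary:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("ha_table"))) {
      Get get = new Get(Bytes.toBytes("row1"));
      // TIMELINE: the read may be served by a secondary replica if the primary
      // is slow or down, trading strict consistency for availability.
      get.setConsistency(Consistency.TIMELINE);
      Result result = table.get(get);
      if (result.isStale()) {
        System.out.println("served by a region replica; value may lag the primary");
      }
    }
  }
}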
Pluggable WAL Replication
Pluggable WAL replication endpoint
You can write your own replicators! (see the sketch below)
Similar to co-processors (runs in the same RS process)
hbase> add_peer 'my_peer',
  ENDPOINT_CLASSNAME => 'org.hbase.MyReplicationEndpoint',
  DATA => { "key1" => 1 },
  CONFIG => { "config1" => "value1", "config2" => "value2" }
Outline
Storing Larger Objects efficiently
Making DDL Operations fault tolerant
Better Region Assignment
Compatibility guarantees for our users
Improving Availability
Using all machine resources
Q+A
Modest Gain: Multiple WALs
All regions write to one write-ahead log file (WAL).
Idea: let's have multiple write-ahead logs so that we can write more in parallel.
Follow-up work: taken to the limit, if we're on SSD we could have one WAL per region.
[Diagram: RegionServer writing to DataNode disks (1, 2, 3), single WAL vs. multiple WALs; with one WAL some disks sit IDLE]
When working with a big mass of machines, your first optimization step has to be pushing the workload until it exhausts one of these three resources.
The specifics will depend on your workload, but right now we have big room for improvement.
This is a mixed write/update/read workload after reaching a state where memstore flushes and compactions are happening. It's mostly waiting on synchronization, AFAICT.
Historically, one of the long poles in the tent has been the WAL, since all the regions served by a region server hit the same one.
As of HBase 1.0, there are options to expand to multiple pipelines. But the gains are modest.
As of HBase 1.1, we can make use of HDFS storage policies to keep just the WAL on SSD in mixed disk deployments. We need more testing and operational feedback from the community though.
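A rough sketch of the knobs being referenced, assuming the 1.0/1.1-era property names (hbase.wal.provider=multiwal for multiple WAL pipelines, hbase.wal.regiongrouping.numgroups for how many, and hbase.wal.storage.policy for keeping the WAL on SSD); in practice these are set in hbase-site.xml on each RegionServer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalTuningSketch {
  public static void main(String[] args) {
    // Illustrative only: property names assume the 1.0/1.1-era WAL options.
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.wal.provider", "multiwal");            // multiple WAL pipelines per RegionServer
    conf.setInt("hbase.wal.regiongrouping.numgroups", 2);  // number of WAL groups
    conf.set("hbase.wal.storage.policy", "ONE_SSD");       // HDFS storage policy for the WAL (1.1+)
    System.out.println("WAL provider: " + conf.get("hbase.wal.provider"));
  }
}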
Longer term solutions that will start showing up in HBase 2.0 involve updates to both the read and write paths.
For WAL limitations, we need to examine some base assumptions; HDFS is made for throughput of large blobs, not for many small writes.
Custom DFSClient in HBase to show value, then push upstream
Maybe it’s best to defer to a system made for these kinds of writes, e.g. Kafka
Stack has recently done some excellent work profiling what happens in an HBase system under load, and some optimizations to work better with the JIT compiler have been landing as a result.
Frankly, we have a huge number of tuning options now that can eat a lot of hardware, but they remain inaccessible. We need documentation improvements and a round of updated defaults based on current machine specs.