In this session, we’ll discuss the various practices around HBase in use at Xiaomi, including those relating to HA, tiered compaction, multi-tenancy, and failover across data centers.
1.
Some improvements and practices of HBase at Xiaomi
Duo Zhang, Liangliang He
{zhangduo, heliangliang}@xiaomi.com
2.
About Xiaomi
Xiaomi Inc. (literally "millet technology") is a privately owned Chinese electronics company headquartered in Beijing.
▶ Sold 70M+ smartphones in 2015
▶ 100M+ DAU for MIUI
▶ Lots of other smart devices (Mi Band, Air Purifier, etc.)
7.
Offline Scenario: User Profile
▶ Input data is replicated from the online cluster to the offline cluster
▶ Output data is written to the offline cluster and replicated back to the online cluster
Numbers
▶ 200+ million users
▶ Both batch and streaming processing
8.
Agenda
1. Current Status
2. Problems and Solutions
3. HBase as a service
9.
Per-CF Flush
HBase book, section 34, "On the number of column families":
"HBase currently does not do well with anything above two or three column families ... if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small ..."
So let's not flush the small families.
11.
Per-CF Flush
▶ Why must we flush all families today?
▶ Sequence id accounting is per region.
▶ We cannot know the lowest unflushed sequence id per family.
▶ Track the sequence id per store, i.e., per family
▶ Map<RegionName, SequenceId> becomes Map<RegionName, Map<FamilyName, SequenceId>>
▶ SequenceId map in the WAL implementation
▶ FlushedSequenceId in ServerManager on the master
▶ Report a map of flushed sequence ids to the master (thanks to protobuf for compatibility)
▶ Skip WAL cells per store when replaying
▶ FlushPolicy (see the sketch below)
▶ FlushAllStoresPolicy
▶ FlushLargeStoresPolicy
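To make the FlushLargeStoresPolicy idea concrete, here is a minimal sketch, assuming simplified, hypothetical types rather than the actual HBase classes: only families whose memstores exceed a threshold are selected for flush, and the region's lowest unflushed sequence id (used for WAL truncation) becomes the minimum across families.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of per-family flush selection; types and the threshold are illustrative.
final class PerFamilyFlushSketch {
  static final long FLUSH_SIZE_LOWER_BOUND = 16L * 1024 * 1024; // assumed per-family threshold

  /** Pick only the families whose memstores are large enough to be worth flushing. */
  static List<String> selectFamiliesToFlush(Map<String, Long> memstoreSizeByFamily) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Long> e : memstoreSizeByFamily.entrySet()) {
      if (e.getValue() >= FLUSH_SIZE_LOWER_BOUND) {
        selected.add(e.getKey());
      }
    }
    // If nothing qualifies, fall back to flushing everything so the flush still frees memory.
    return selected.isEmpty() ? new ArrayList<>(memstoreSizeByFamily.keySet()) : selected;
  }

  /**
   * With Map<RegionName, Map<FamilyName, SequenceId>> accounting, the WAL can only be
   * truncated up to the minimum unflushed sequence id across all families of the region.
   */
  static long lowestUnflushedSequenceId(Map<String, Long> unflushedSeqIdByFamily) {
    long min = Long.MAX_VALUE;
    for (long seqId : unflushedSeqIdByFamily.values()) {
      min = Math.min(min, seqId);
    }
    return min;
  }
}
```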
12.
Per-CF Flush
▶ Flush is not only used for releasing memory
▶ WAL truncation
▶ Region merge, split, move...
▶ Bulk load
▶ Introduce a 'force' flag (see the sketch below)
▶ Always flush all families, regardless of which FlushPolicy is in use
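A sketch of the 'force' flag idea, again with illustrative names rather than the real HBase API: callers that need a complete flush (WAL truncation, split/merge/move, bulk load) bypass whatever policy is configured.

```java
import java.util.Collection;
import java.util.Map;

// Illustrative only: a forced flush ignores the configured policy and flushes every family.
final class ForcedFlushSketch {
  interface FamilyFlushPolicy {
    Collection<String> selectFamiliesToFlush(Map<String, Long> memstoreSizeByFamily);
  }

  static Collection<String> familiesToFlush(FamilyFlushPolicy policy, boolean force,
      Map<String, Long> memstoreSizeByFamily) {
    // force = true for WAL truncation, region merge/split/move and bulk load
    return force ? memstoreSizeByFamily.keySet()
                 : policy.selectFamiliesToFlush(memstoreSizeByFamily);
  }
}
```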
17.
Async WAL
Solution: AsyncFSWAL and FanOutOneBlockAsyncDFSOutput
▶ Simple: can only write one block
▶ Fail-fast
▶ Everything is done in netty's EventLoop, fully event-driven
▶ Fan out: write to the 3 datanodes concurrently (see the sketch below)
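The fan-out, fail-fast behaviour can be illustrated with a rough sketch using plain CompletableFuture; the real FanOutOneBlockAsyncDFSOutput is event-driven inside netty's EventLoop and speaks the DataNode wire protocol, and the writeToDatanode function here is a placeholder.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Conceptual sketch: write one buffer to all replicas in parallel instead of through a
// pipeline, and fail fast as soon as any single replica write fails.
final class FanOutWriteSketch {
  static CompletableFuture<Void> write(List<String> datanodes, byte[] buf,
      BiFunction<String, byte[], CompletableFuture<Void>> writeToDatanode) {
    CompletableFuture<Void> result = new CompletableFuture<>();
    CompletableFuture<?>[] writes = new CompletableFuture<?>[datanodes.size()];
    for (int i = 0; i < datanodes.size(); i++) {
      writes[i] = writeToDatanode.apply(datanodes.get(i), buf)
          // fail-fast: the first replica error completes the overall future exceptionally
          .whenComplete((v, err) -> { if (err != null) result.completeExceptionally(err); });
    }
    // success only once every replica has acked
    CompletableFuture.allOf(writes).whenComplete((v, err) -> {
      if (err == null) { result.complete(null); } else { result.completeExceptionally(err); }
    });
    return result;
  }
}
```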
18.
Async WAL
Implementation:
▶ Why not the disruptor?
▶ We should not block the EventLoop thread
▶ Submit the consumer task only if there are entries in the queue (see the sketch below)
▶ Avoid submitting a task for every entry
▶ SASL and encryption support
▶ Compatible with Hadoop 2.4.x through 2.7.x
▶ Classes and methods have been changed, moved, removed, etc.
▶ Abstract a common interface
▶ Reflection
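One way to read the "submit consumer task only if there are entries in queue" point is the following sketch (illustrative, not the actual AsyncFSWAL code): producers append to a lock-free queue, and only the append that finds no consumer running schedules a single drain task on the event loop.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: avoid submitting one task per WAL entry and never block the
// EventLoop thread; a single drain task runs only while there is work queued.
final class WalEntryQueueSketch {
  private final Queue<byte[]> entries = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean consumerScheduled = new AtomicBoolean(false);
  private final Executor eventLoop; // e.g. a netty EventLoop

  WalEntryQueueSketch(Executor eventLoop) {
    this.eventLoop = eventLoop;
  }

  void append(byte[] entry) {
    entries.offer(entry);
    // Only the caller that flips the flag schedules the consumer; everyone else just enqueues.
    if (consumerScheduled.compareAndSet(false, true)) {
      eventLoop.execute(this::drain);
    }
  }

  private void drain() {
    byte[] entry;
    while ((entry = entries.poll()) != null) {
      // hand the entry to the async output (omitted)
    }
    consumerScheduled.set(false);
    // Re-check: a producer may have enqueued between the last poll() and the reset above.
    if (!entries.isEmpty() && consumerScheduled.compareAndSet(false, true)) {
      eventLoop.execute(this::drain);
    }
  }
}
```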
20.
Async WAL
▶ Available in HBase 2.0
▶ Also the default WAL implementation in HBase 2.0
▶ Will push the AsyncFSOutput-related code down to HDFS
▶ HBASE-14790
21.
Revisit the semantic of Delete
Problem: the 'Delete Version' problem
▶ Let MaxVersions = 2 and timestamps T1 < T2 < T3
Order A:
1. Put T1, T2, T3
2. Major compaction
3. Delete version T2
Order B:
1. Put T1, T2, T3
2. Delete version T2
3. Major compaction
Result: T3 (order A) vs. T3, T1 (order B). In order A the compaction has already dropped T1 (only the newest two versions are kept) before T2 is deleted, leaving only T3; in order B the compaction runs after the delete, so T1 is still among the newest two surviving versions. Order B is reproduced in the sketch below.
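The two orders are easy to reproduce with the standard 1.x-era client API; a sketch, assuming a table 't' with family 'cf' created with MAX_VERSIONS = 2:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Order B from the slide: Put T1, T2, T3 -> delete version T2 -> major compaction.
// With MAX_VERSIONS = 2 this leaves T3 and T1 readable; running the compaction before
// the delete (order A) leaves only T3.
public class DeleteVersionDemo {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("cf"), q = Bytes.toBytes("q");
    long t1 = 1L, t2 = 2L, t3 = 3L;
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t"));
         Admin admin = conn.getAdmin()) {
      for (long ts : new long[] { t1, t2, t3 }) {
        table.put(new Put(row).addColumn(cf, q, ts, Bytes.toBytes("v" + ts)));
      }
      table.delete(new Delete(row).addColumn(cf, q, t2)); // delete exactly the version at T2
      admin.majorCompact(TableName.valueOf("t"));          // asynchronous; a real test waits for it
      Result r = table.get(new Get(row).setMaxVersions());
      // Expect timestamps 3 and 1 here; swap the delete and the compaction to get only 3.
      r.getColumnCells(cf, q).forEach(c -> System.out.println(c.getTimestamp()));
    }
  }
}
```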
31.
Revisit the semantic of Delete
Problem: a Delete also affects a newer Put (one with a higher sequence id)
▶ Let timestamps T1 < T2
Order A:
1. Delete all versions less than T2
2. Major compaction
3. Put T1
Order B:
1. Delete all versions less than T2
2. Put T1
3. Major compaction
Result: T1 (order A) vs. nothing (order B). In order A the compaction purges the delete marker before the Put, so T1 stays visible; in order B the marker still covers timestamp T1, so the later Put is masked and then removed by the compaction. Order B is reproduced in the sketch below.
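Again a sketch with the standard client API, assuming the same hypothetical table 't' and family 'cf' as before:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Order B from the slide: a delete marker covering timestamps < T2, then a Put at T1
// (issued later, so with a higher sequence id), then a major compaction. The older
// delete still masks the newer Put, so the row reads back empty; run the compaction
// before the Put (order A) and T1 is visible instead.
public class DeleteMasksNewerPutDemo {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("cf"), q = Bytes.toBytes("q");
    long t1 = 1L, t2 = 2L;
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t"));
         Admin admin = conn.getAdmin()) {
      table.delete(new Delete(row).addColumns(cf, q, t2 - 1)); // all versions with ts < T2
      table.put(new Put(row).addColumn(cf, q, t1, Bytes.toBytes("v1")));
      admin.majorCompact(TableName.valueOf("t"));              // asynchronous; a real test waits
      System.out.println(table.get(new Get(row)).isEmpty());   // expect true: the Put at T1 is gone
    }
  }
}
```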
33.
Revisit the semantic of Delete
▶ Not a big problem? It depends.
▶ Major compaction is a low-frequency operation
▶ On a single cluster you effectively take just one of the two orders, so the result is deterministic
▶ But what if we use replication?
Eventual inconsistency: the two clusters compact at different times, so they can apply the same operations around different compaction points and end up with different data.
36.
Revisit the semantic of Delete
Solution: also consider the sequence id
▶ Once a value is invisible, it should never appear again
▶ A modified scanner also considers the sequence id when deciding visibility (see the sketch below)
▶ Trade-off: we can no longer use the max timestamp to exclude store files during a scan
▶ A Delete should not affect a Put with a higher sequence id
▶ Maybe a table-level config to turn this on
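A minimal sketch of the proposed visibility rule, using illustrative types rather than the actual scanner code: a delete marker masks a put only if the put's timestamp is covered and the put was written no later than the delete (lower or equal sequence id).

```java
// Illustrative sketch: compare sequence ids in addition to timestamps when deciding
// whether a delete marker masks a put, so an older Delete never hides a newer Put.
final class SeqIdAwareVisibilitySketch {
  static final class DeleteMarker {
    final long timestamp;   // versions at or below this timestamp are targeted
    final long sequenceId;  // WAL/mvcc sequence id of the delete
    DeleteMarker(long timestamp, long sequenceId) { this.timestamp = timestamp; this.sequenceId = sequenceId; }
  }

  static final class PutVersion {
    final long timestamp;
    final long sequenceId;
    PutVersion(long timestamp, long sequenceId) { this.timestamp = timestamp; this.sequenceId = sequenceId; }
  }

  /** A delete only masks puts written before it, never a put with a higher sequence id. */
  static boolean masks(DeleteMarker delete, PutVersion put) {
    return put.timestamp <= delete.timestamp && put.sequenceId <= delete.sequenceId;
  }
}
```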
37.
Revisit the semantic of Delete
▶ Enough?
▶ Not really, for replication
▶ The WAL entries for the same Cell should be shipped in ascending order of sequence id
▶ HBASE-2256, HBASE-8721, HBASE-8770...
40.
Multi-Tenancy Practice
Differences from the trunk HBase quota implementation:
▶ Requests are size-weighted when counted against the quota
▶ Quota is per user instead of per regionserver
▶ Assume the workload is evenly distributed across regions
▶ Soft qps limit, like DynamoDB (see the sketch below)
▶ Configurable qps quota limit for each regionserver
▶ A user can run at a higher qps than its quota if the regionserver has free quota
▶ Transparent client-side auto backoff when the quota is exceeded
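A rough sketch of the soft, size-weighted limit described above; all names and numbers are illustrative, not Xiaomi's actual implementation: a user may burst beyond its own quota as long as the regionserver as a whole still has headroom, and rejected requests trigger client-side backoff.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a soft, size-weighted per-user quota on one regionserver.
// Requests are charged by size; a user over its own quota is still admitted while the
// regionserver-wide budget has free capacity.
final class SoftQuotaSketch {
  private final long serverBudgetPerSecond;            // configured per-regionserver capacity
  private final Map<String, Long> userQuotaPerSecond;  // per-user quota (size units per second)
  private final Map<String, AtomicLong> usedThisSecond = new ConcurrentHashMap<>();
  private final AtomicLong serverUsedThisSecond = new AtomicLong();

  SoftQuotaSketch(long serverBudgetPerSecond, Map<String, Long> userQuotaPerSecond) {
    this.serverBudgetPerSecond = serverBudgetPerSecond;
    this.userQuotaPerSecond = userQuotaPerSecond;
  }

  /** Returns true if the request is admitted; the client backs off and retries otherwise. */
  boolean tryAcquire(String user, long requestSizeBytes) {
    long userUsed = usedThisSecond
        .computeIfAbsent(user, u -> new AtomicLong()).addAndGet(requestSizeBytes);
    long serverUsed = serverUsedThisSecond.addAndGet(requestSizeBytes);
    long userQuota = userQuotaPerSecond.getOrDefault(user, 0L);
    // Hard stop only when the whole regionserver is saturated; otherwise a user may
    // exceed its own quota ("soft" limit, similar in spirit to DynamoDB burst capacity).
    if (serverUsed > serverBudgetPerSecond && userUsed > userQuota) {
      usedThisSecond.get(user).addAndGet(-requestSizeBytes);
      serverUsedThisSecond.addAndGet(-requestSizeBytes);
      return false;
    }
    return true;
  }

  /** Called by a timer once per second to reset the accounting window. */
  void resetWindow() {
    usedThisSecond.clear();
    serverUsedThisSecond.set(0);
  }
}
```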
41.
Cross Data-Center Failover Practice
Modifications to HBase:
▶ HBase nameservice
▶ Read-write switch in the client configuration
▶ Dynamic configuration with ZooKeeper
▶ Record the last synced WAL write time when updating the replication log position
Failover steps (see the sketch below):
▶ Check and make sure replication is in sync
▶ Stop writes by updating the config in ZooKeeper
▶ Check and wait until replication is done by checking the sync time of the last replicated log
▶ Switch the master cluster and turn writes back on
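A sketch of the "wait until replication is done" step; the supplier below is a placeholder for reading the recorded last synced WAL write time from the replication metadata, not a real HBase API.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Sketch of the replication-drain check in the failover procedure.
final class FailoverDrainSketch {
  /**
   * @param writesStoppedAtMs   time when writes were disabled via the ZooKeeper config switch
   * @param lastSyncedWalTimeMs reads the last synced WAL write time recorded alongside the
   *                            replication log position for the standby cluster
   */
  static void waitUntilReplicationDrained(long writesStoppedAtMs, LongSupplier lastSyncedWalTimeMs)
      throws InterruptedException {
    // Every edit written before the stop has a WAL write time <= writesStoppedAtMs,
    // so once the replicated position passes that instant the peer is caught up.
    while (lastSyncedWalTimeMs.getAsLong() < writesStoppedAtMs) {
      TimeUnit.SECONDS.sleep(1);
    }
    // Now it is safe to switch the nameservice to the new master cluster and re-enable writes.
  }
}
```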
45.
Agenda
1. Current Status
2. Problems and Solutions
3. HBase as a service
50.
libsds
Formalized Data Model (see the sketch below)
▶ Entity Group: a group of records belonging to a single entity
▶ Primary Index: the primary index within an entity group
▶ Local Secondary Index: an index within a single entity group
▶ Eager index
▶ Lazy index
▶ Immutable index
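To illustrate how the model fits together, here is a hypothetical schema declaration; the names are invented for this sketch and are not the actual libsds API: an entity-group key prefixes every record, the primary index orders records inside the group, and local secondary indexes (eager, lazy or immutable) are scoped to one entity group.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical schema description for the model above; not the actual libsds API.
final class EntityGroupSchemaSketch {
  enum IndexConsistency { EAGER, LAZY, IMMUTABLE }

  static final class LocalIndex {
    final String name;
    final List<String> attributes;      // indexed attributes, scoped to one entity group
    final IndexConsistency consistency;
    LocalIndex(String name, List<String> attributes, IndexConsistency consistency) {
      this.name = name; this.attributes = attributes; this.consistency = consistency;
    }
  }

  static final class TableSpec {
    final List<String> entityGroupKey;  // all records sharing this key form one entity group
    final List<String> primaryKey;      // primary index within the entity group
    final List<LocalIndex> localIndexes;
    TableSpec(List<String> entityGroupKey, List<String> primaryKey, List<LocalIndex> localIndexes) {
      this.entityGroupKey = entityGroupKey; this.primaryKey = primaryKey; this.localIndexes = localIndexes;
    }
  }

  // Example: messages grouped by user, with a lazy index on send time within each user's group.
  static final TableSpec MESSAGES = new TableSpec(
      Arrays.asList("userId"),
      Arrays.asList("topicId", "messageId"),
      Arrays.asList(new LocalIndex("bySendTime", Arrays.asList("sendTime"), IndexConsistency.LAZY)));
}
```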