HBase at Xiaomi
Jianwei Cui, Shaohui Liu
{cuijianwei, liushaohui}@xiaomi.com
About Xiaomi
Sold 60M phones in 2014, 3X the 2013 total
Guinness World Record: selling 2.11M phones online in 24h
Our HBase Team
8 Developers
Honghua Feng
Liang Xie
Jianwei Cui
Liangliang He
YingChao Zhou
Qiming Cheng
Guanghao Zhang
Shaohui Liu
130 patches submitted in 2014, 82 committed
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Clusters and Scenarios
Mainland China
20 online clusters / 2 offline clusters in 3 data centers
AWS
4 online clusters / 1 offline cluster in 2 regions
Online Service
Mi Cloud, Mi Push, Galaxy, Mi Message,...
Offline Processing
User Profile, Trace, Recommendation, ...
Scenario A: Mi Cloud
Personal cloud storage for smartphones
Numbers
90+ million users, 3X growth in 2014
500 billion rows, 6X growth in 2014
1000+ regions in the largest table
See: https://i.mi.com
Scenario B: Mi Push
Push service on Android
Data stored in HBase
Pub-sub relations of topics and devices
Messages to each device
Numbers
200+ million users
Push 2 billion+ messages every day
200,000+ requests per second at peak
Deployment
Two clusters with master-master replication in different data centers
Clients switch clusters through configs stored on ZooKeeper (see the sketch below)
Canaries are used for availability checks and alerting
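As a rough illustration of the switching mechanism, the sketch below watches a config znode and exposes the currently active cluster. The znode path and payload format are assumptions for illustration, not Xiaomi's actual setup.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical client-side watcher: the znode names the active cluster's
    // ZooKeeper quorum; operators flip its value to redirect clients.
    class ClusterSwitchWatcher implements Watcher {
        static final String PATH = "/config/active-hbase-cluster"; // assumed path
        private final ZooKeeper zk;
        private volatile String activeQuorum;

        ClusterSwitchWatcher(ZooKeeper zk) throws Exception {
            this.zk = zk;
            refresh();
        }

        private void refresh() throws Exception {
            // Re-read the config and re-arm the watch in one call.
            activeQuorum = new String(zk.getData(PATH, this, null), "UTF-8");
        }

        @Override
        public void process(WatchedEvent event) {
            try { refresh(); } catch (Exception e) { /* retry with backoff */ }
        }

        String activeQuorum() { return activeQuorum; }
    }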
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Long Full GC Pauses for RegionServer
Problem: Long full GC pauses make ZooKeeper sessions expire
zookeeper.session.timeout = 30s
A full GC pause of a RegionServer with a 30GB heap can last 40s
Solution:
Multiple RegionServer instances per node
Move more memory off-heap using BucketCache (config sketch below)
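A minimal sketch of the off-heap BucketCache settings using the 0.98/1.0-era property names, shown as Configuration calls for brevity; in practice these go in hbase-site.xml, and the 16GB size is illustrative rather than Xiaomi's production value. hbase-env.sh must also raise -XX:MaxDirectMemorySize accordingly.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class BucketCacheConfig {
        // Serve most of the block cache off-heap so the Java heap,
        // and hence full GC pauses, can stay small.
        static Configuration offHeapCacheConf() {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.bucketcache.ioengine", "offheap"); // cache off the Java heap
            conf.setFloat("hbase.bucketcache.size", 16384f);   // capacity in MB (illustrative)
            return conf;
        }
    }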
Hotspot for Temporal Data
Problem: Writes of temporal data go to a small set of regions
Solution: Salted Table
Based on SaltedHTable, open-sourced by the Intel Hadoop team
See: https://github.com/intel-hadoop/SaltedHTable
Transparent to applications via table schema support (salting sketch below)
MapReduce support
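A minimal sketch of the salting idea, assuming a fixed bucket count and a one-byte hash-derived prefix; SaltedHTable's actual scheme may differ.

    import java.util.Arrays;

    class RowKeySalter {
        static final int SALT_BUCKETS = 16; // illustrative bucket count

        // Prefix the key with one salt byte derived from the key itself, so
        // sequential (e.g. time-ordered) keys spread over many regions while
        // point reads stay deterministic.
        static byte[] salt(byte[] originalKey) {
            int hash = Arrays.hashCode(originalKey) & 0x7fffffff;
            byte[] salted = new byte[originalKey.length + 1];
            salted[0] = (byte) (hash % SALT_BUCKETS);
            System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
            return salted;
        }
    }

A scan must then fan out over all 16 salt prefixes and merge the results, which is what the table schema and MapReduce support hide from applications.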
Coordinated Compaction
Problem: Compaction storm
Solution:
A compaction manager in HMaster coordinates all compactions in the cluster
Before a compaction starts, a RegionServer must acquire a compaction quota (sketched below)
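The sketch below illustrates the quota handshake with a local Semaphore standing in for the HMaster-side manager; all names here are hypothetical, not HBase APIs.

    import java.util.concurrent.Semaphore;

    class CoordinatedCompaction {
        // Illustrative limit: at most 4 compactions running cluster-wide.
        private final Semaphore quota = new Semaphore(4);

        void compact(Runnable compaction) throws InterruptedException {
            quota.acquire();       // regionserver asks the manager for a slot
            try {
                compaction.run();  // run the actual compaction
            } finally {
                quota.release();   // free the slot, avoiding a compaction storm
            }
        }
    }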
Exception Aggregation
Purpose: Find potential bugs in the clusters
Solution:
Write HMaster/RegionServer logs asynchronously to HDFS through Scribe
Use MapReduce to aggregate errors and exceptions across clusters (mapper sketch below)
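An illustrative mapper for the aggregation job: it keys each log line by exception class so a summing reducer can rank them. The log format and regex are assumptions.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ExceptionCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Pattern EXCEPTION =
                Pattern.compile("([\\w.]+(?:Exception|Error))");
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Matcher m = EXCEPTION.matcher(line.toString());
            if (m.find()) {
                context.write(new Text(m.group(1)), ONE); // (exception class, 1)
            }
        }
    }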
Table Based Replication Queue (in progress)
Problems:
Too much data stored on ZooKeeper
Over 200MB of replication data for a disabled peer
Too many writes to ZooKeeper
5k/s writes to ZooKeeper in a cluster with 100k/s writes (HBASE-12636)
Table Based Replication Queue (in progress)
Solution: Move the replication queue to a system table
Row key: server name + peer id + hlog name
One column records the offset up to which the log has been replicated (sketched below)
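A hedged sketch of what a queue-table write could look like; the family and qualifier names are assumptions, not the final design of the feature.

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    class ReplicationQueueTable {
        static final byte[] FAMILY = Bytes.toBytes("q");      // assumed family
        static final byte[] OFFSET = Bytes.toBytes("offset"); // assumed qualifier

        // Row key layout from the slide: server name + peer id + hlog name.
        static Put recordProgress(String serverName, String peerId,
                                  String hlogName, long offset) {
            byte[] rowKey = Bytes.toBytes(serverName + "," + peerId + "," + hlogName);
            Put put = new Put(rowKey);
            put.addColumn(FAMILY, OFFSET, Bytes.toBytes(offset));
            return put;
        }
    }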
Asynchronous Event Notification (in progress)
Purposes:
Incremental statistics of data in HBase
Table schema transformation
Asynchronous data indexing
Solution:
An asynchronous event notification framework on HBase (HBASE-12884)
Replication-based implementation:
Add a fake replication peer, which can receive the WAL edits from HBase clusters (endpoint sketch below)
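A hedged sketch of a fake peer built on the pluggable ReplicationEndpoint API (HBASE-11367); class and method details vary across HBase versions, and notifyListeners() is a hypothetical application hook, not part of HBase.

    import java.util.UUID;
    import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
    import org.apache.hadoop.hbase.wal.WAL;

    public class NotificationEndpoint extends BaseReplicationEndpoint {
        private final UUID id = UUID.randomUUID();

        @Override
        public UUID getPeerUUID() { return id; }

        @Override
        public boolean replicate(ReplicateContext context) {
            for (WAL.Entry entry : context.getEntries()) {
                notifyListeners(entry);  // fan WAL edits out to local consumers
            }
            return true;                 // acknowledge the batch to the source
        }

        @Override protected void doStart() { notifyStarted(); }
        @Override protected void doStop()  { notifyStopped(); }

        private void notifyListeners(WAL.Entry entry) { /* hypothetical hook */ }
    }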
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Cross-Row Transaction
Why cross-row transaction?
Cross-row data consistency
Rows in different regions / tables
Example
Music index building
Cross-Row Transaction
Features
ACID
No central coordinator
Integrated without HBase code change
Google’s Percolator
Large-scale Incremental Processing Using Distributed Transactions and Notifications, by Daniel Peng and Frank Dabek, 2010
Themis
https://github.com/Xiaomi/themis
Provides cross-row transactions on HBase based on Percolator
Themis Infrastructure
Timestamp server
Uses the timestamp of KeyValue internally
Timestamps must be globally increasing
Client
Coprocessor
Timestamp Server
Separates a long timestamp into two parts:
Higher 46 bits: Sync with system time
Lower 18 bits: Incremental counter
Hundreds of thousands of unique timestamps in each millisecond
Monotonically increasing within one timestamp server (sketched below)
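A minimal single-server sketch of the 46/18-bit split; it ignores clock regressions and the ZooKeeper-persisted bound covered on the next slide.

    class TimestampOracle {
        private static final int COUNTER_BITS = 18;
        private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1; // 262,143

        private long lastMillis = -1;
        private long counter = 0;

        synchronized long next() {
            long now = System.currentTimeMillis();
            if (now != lastMillis) {
                lastMillis = now;
                counter = 0;                 // new millisecond: reset the low bits
            } else if (++counter > COUNTER_MASK) {
                // 2^18 timestamps exhausted in this millisecond: wait for the next
                while ((now = System.currentTimeMillis()) == lastMillis) { }
                lastMillis = now;
                counter = 0;
            }
            return (now << COUNTER_BITS) | counter; // high 46 bits: time, low 18: counter
        }
    }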
Timestamp Server
Monotonically increasing across timestamp servers
Periodically save a future timestamp into ZooKeeper
Allocated timestamps must stay below the saved timestamp
Another server reads the saved timestamp when starting
High availability
High throughput: 600,000 RPCs per second
Batches concurrent requests into one RPC (bound sketch below)
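A hedged sketch of the persisted-bound rule, with an in-memory field standing in for the znode; the real server writes to ZooKeeper before handing out timestamps near the bound.

    class BoundedAllocator {
        private static final long WINDOW = 1_000_000L; // illustrative reservation size

        private long persistedMax = 0; // stands in for the value saved in ZooKeeper
        private long next = 0;

        synchronized long allocate() {
            if (next >= persistedMax) {
                // Real code: persist the new bound to ZooKeeper *before*
                // allocating past the old one.
                persistedMax = next + WINDOW;
            }
            return next++; // always below the persisted bound
        }

        // A failover server starts from the saved bound, so it can never
        // re-issue a timestamp the old server may have handed out.
        synchronized void recoverFrom(long savedBound) {
            next = savedBound;
            persistedMax = savedBound;
        }
    }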
Themis
Cross-row mutation example: cash table
Rows for Bob and Joe are in different regions
Transfer $3 from Joe to Bob atomically
Two auxiliary columns: lock column and commit column
Two-phase commit
Prewrite Phase
Commit Phase
HBase's checkAndMutate guarantees atomicity within a single row
Prewrite Phase
Fetch a prewrite timestamp from timestamp server (prewriteTs=99)
Select primary and secondary columns
Primary
Column: (Joe, f:c)
PrimaryLock: {secondaries: [(Bob, f:c)]}
Secondaries
Column: (Bob, f:c)
SecondaryLock: {primary: (Joe, f:c)}
Prewrite Phase
Prewrite primary column
Write the primary lock and data if no lock exists in the lock column
checkAndMutate of HBase guarantees the atomicity
Prevents other clients from mutating the same column concurrently (sketched below)
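A hedged sketch of the prewrite using checkAndPut's "expect no cell" form; the column layout (data family f from the slides, a separate lock family) is simplified from Themis' actual schema.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class PrewriteSketch {
        static final byte[] DATA_FAMILY = Bytes.toBytes("f"); // from the slides
        static final byte[] LOCK_FAMILY = Bytes.toBytes("L"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        // Returns false if another transaction already holds the lock.
        static boolean prewrite(Table table, byte[] row, byte[] value,
                                long prewriteTs, byte[] lockBytes) throws IOException {
            Put put = new Put(row);
            put.addColumn(DATA_FAMILY, COL, prewriteTs, value);     // data at prewriteTs
            put.addColumn(LOCK_FAMILY, COL, prewriteTs, lockBytes); // the lock
            // A null expected value means "succeed only if the lock cell does
            // not exist", making the check-and-write atomic on the server side.
            return table.checkAndPut(row, LOCK_FAMILY, COL, null, put);
        }
    }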
Prewrite Phase
Prewrite secondary columns
Follow the same steps as prewriting the primary column
Commit Phase
Fetch commit timestamp from timestamp server (commitTs=100)
Commit Primary
Delete the lock and write the commit column if the lock exists
checkAndMutate of HBase guarantees the atomicity
Decides the success or failure of the whole transaction (sketched below)
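A hedged sketch of committing the primary: one checkAndMutate verifies our lock is still in place, erases it, and writes the commit cell atomically. The column layout is simplified as before.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.util.Bytes;

    class CommitSketch {
        static final byte[] LOCK_FAMILY = Bytes.toBytes("L");   // assumed name
        static final byte[] COMMIT_FAMILY = Bytes.toBytes("C"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        // Returns false if our lock is gone: the transaction must roll back.
        static boolean commitPrimary(Table table, byte[] row, byte[] lockBytes,
                                     long prewriteTs, long commitTs) throws IOException {
            RowMutations rm = new RowMutations(row);
            Delete delLock = new Delete(row);
            delLock.addColumn(LOCK_FAMILY, COL, prewriteTs);    // erase our lock
            Put commit = new Put(row);
            // The commit cell at commitTs points back to the data at prewriteTs.
            commit.addColumn(COMMIT_FAMILY, COL, commitTs, Bytes.toBytes(prewriteTs));
            rm.add(delLock);
            rm.add(commit);
            // Both mutations apply only while the lock cell still equals ours.
            return table.checkAndMutate(row, LOCK_FAMILY, COL,
                    CompareOp.EQUAL, lockBytes, rm);
        }
    }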
Commit Phase
Commit secondaries
Delete the lock and write commit column atomically
Themis Read
Fetch a read timestamp (readTs=101)
Read commit columns with commitTs < readTs
(Joe, c:f#c) => (100 : 99)
(Bob, c:f#c) => (100 : 99)
Read the data column at prewriteTs
prewriteTs is the value stored in the commit column
(Joe, 99: $17) and (Bob, 99: $12)
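A hedged sketch of this read path, ignoring pending locks and their cleanup; the deck's commit column (c:f#c) is simplified here to a fixed family and qualifier, as in the earlier sketches.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class ReadSketch {
        static final byte[] DATA_FAMILY = Bytes.toBytes("f");
        static final byte[] COMMIT_FAMILY = Bytes.toBytes("C"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        static byte[] read(Table table, byte[] row, long readTs) throws IOException {
            // 1. Newest commit cell with commitTs < readTs (the time range max
            //    is exclusive, so [0, readTs) matches the slide's condition).
            Get commitGet = new Get(row);
            commitGet.addColumn(COMMIT_FAMILY, COL);
            commitGet.setTimeRange(0, readTs);
            Result commit = table.get(commitGet);
            if (commit.isEmpty()) return null; // nothing committed before readTs

            // 2. The commit cell's value is the prewriteTs of the visible data.
            long prewriteTs = Bytes.toLong(commit.getValue(COMMIT_FAMILY, COL));

            // 3. Read the data column at exactly that timestamp.
            Get dataGet = new Get(row);
            dataGet.addColumn(DATA_FAMILY, COL);
            dataGet.setTimeStamp(prewriteTs);
            return table.get(dataGet).getValue(DATA_FAMILY, COL);
        }
    }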
Performance Comparison
Single-Row Transaction
The worst case relative to raw HBase
One region server with 10GB heap memory
Write Performance: Preload 3 million rows, 256MB LRU cache
Read Performance: Preload 30GB of data
Performance Comparison
Single-Row Transaction
The worst case relative to raw HBase
One region server with 10GB heap memory
(Charts: write performance and read performance vs. raw HBase)
Performance Comparison
Themis vs. Percolator
Write optimizations
Set the lock column family to IN_MEMORY (sketched below)
Do not sync the WAL when writing prewrite-phase locks
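A minimal sketch of the lock-family tweak using the 1.0-era admin API; the family name is assumed.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    class LockFamilyTuning {
        static HColumnDescriptor inMemoryLockFamily() {
            HColumnDescriptor lock = new HColumnDescriptor("L"); // assumed name
            lock.setInMemory(true); // favor lock cells in the block cache
            return lock;
        }
    }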
Future Work of Themis
Generic transaction API: HBASE-11447
Support different isolation levels
Global secondary index
Thanks! Questions?
Contacts: {cuijianwei, liushaohui}@xiaomi.com
