HBase at Xiaomi
Jianwei Cui, Shaohui Liu
{cuijianwei, liushaohui}@xiaomi.com
About Xiaomi
Sold 60M phones in 2014, 3X the 2013 total
Guinness World Record: selling 2.11M phones online in 24h
Our HBase Team
8 Developers
Honghua Feng
Liang Xie
Jianwei Cui
Liangliang He
YingChao Zhou
Qiming Cheng
Guanghao Zhang
Shaohui Liu
130 patches submitted in 2014, 82 committed
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Clusters and Scenarios
Mainland China
20 online clusters / 2 offline clusters in 3 data centers
AWS
4 online clusters / 1 offline cluster in 2 regions
Online Service
Mi Cloud, Mi Push, Galaxy, Mi Message,...
Offline Processing
User Profile, Trace, Recommendation, ...
Scenario A: Mi Cloud
Personal cloud storage for smartphones
Numbers
90+ million users, 3X growth in 2014
500 billion rows, 6X growth in 2014
1000+ regions in the largest table
See: https://i.mi.com
Scenario B: Mi Push
Push service on Android
Data stored in HBase
Pub-sub relations of topics and devices
Messages to each device
Numbers
200+ million users
Push 2 billion+ messages every day
200,000+ requests per second at peak
Deployment
Two clusters with master-master replication in different data centers
Clients switch clusters through configs stored on ZooKeeper (see the sketch below)
Canaries are used for availability checks and alerting
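As a rough illustration of the switching mechanism, the sketch below watches a config znode and exposes the currently active cluster. The znode path and payload format are assumptions for illustration, not Xiaomi's actual setup.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical client-side watcher: the znode names the active cluster's
    // ZooKeeper quorum; operators flip its value to redirect clients.
    class ClusterSwitchWatcher implements Watcher {
        static final String PATH = "/config/active-hbase-cluster"; // assumed path
        private final ZooKeeper zk;
        private volatile String activeQuorum;

        ClusterSwitchWatcher(ZooKeeper zk) throws Exception {
            this.zk = zk;
            refresh();
        }

        private void refresh() throws Exception {
            // Re-read the config and re-arm the watch in one call.
            activeQuorum = new String(zk.getData(PATH, this, null), "UTF-8");
        }

        @Override
        public void process(WatchedEvent event) {
            try { refresh(); } catch (Exception e) { /* retry with backoff */ }
        }

        String activeQuorum() { return activeQuorum; }
    }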
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Long Full GC Pauses for RegionServer
Problem: Long full GC pauses make ZooKeeper sessions expire
zookeeper.session.timeout = 30s
A full GC pause of a RegionServer with a 30GB heap can last 40s
Solution:
Multiple RegionServer instances per node
Move more memory off-heap using BucketCache (config sketch below)
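A minimal sketch of the off-heap BucketCache settings using the 0.98/1.0-era property names, shown as Configuration calls for brevity; in practice these go in hbase-site.xml, and the 16GB size is illustrative rather than Xiaomi's production value. hbase-env.sh must also raise -XX:MaxDirectMemorySize accordingly.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    class BucketCacheConfig {
        // Serve most of the block cache off-heap so the Java heap,
        // and hence full GC pauses, can stay small.
        static Configuration offHeapCacheConf() {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.bucketcache.ioengine", "offheap"); // cache off the Java heap
            conf.setFloat("hbase.bucketcache.size", 16384f);   // capacity in MB (illustrative)
            return conf;
        }
    }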
Hotspot for Temporal Data
Problem: Writes of temporal data go to a small set of regions
Solution: Salted Table
Based on SaltedHTable, open-sourced by the Intel Hadoop team
See: https://github.com/intel-hadoop/SaltedHTable
Transparent to applications via table schema support (salting sketch below)
MapReduce support
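A minimal sketch of the salting idea, assuming a fixed bucket count and a one-byte hash-derived prefix; SaltedHTable's actual scheme may differ.

    import java.util.Arrays;

    class RowKeySalter {
        static final int SALT_BUCKETS = 16; // illustrative bucket count

        // Prefix the key with one salt byte derived from the key itself, so
        // sequential (e.g. time-ordered) keys spread over many regions while
        // point reads stay deterministic.
        static byte[] salt(byte[] originalKey) {
            int hash = Arrays.hashCode(originalKey) & 0x7fffffff;
            byte[] salted = new byte[originalKey.length + 1];
            salted[0] = (byte) (hash % SALT_BUCKETS);
            System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
            return salted;
        }
    }

A scan must then fan out over all 16 salt prefixes and merge the results, which is what the table schema and MapReduce support hide from applications.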
Coordinated Compaction
Problem: Compaction storm
Solution:
A compaction manager in HMaster coordinates all compactions in the cluster
Before a compaction starts, a RegionServer must acquire a compaction quota (sketched below)
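The sketch below illustrates the quota handshake with a local Semaphore standing in for the HMaster-side manager; all names here are hypothetical, not HBase APIs.

    import java.util.concurrent.Semaphore;

    class CoordinatedCompaction {
        // Illustrative limit: at most 4 compactions running cluster-wide.
        private final Semaphore quota = new Semaphore(4);

        void compact(Runnable compaction) throws InterruptedException {
            quota.acquire();       // regionserver asks the manager for a slot
            try {
                compaction.run();  // run the actual compaction
            } finally {
                quota.release();   // free the slot, avoiding a compaction storm
            }
        }
    }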
Exception Aggregation
Purpose: Find potential bugs in the clusters
Solution:
Write HMaster/RegionServer logs asynchronously to HDFS through Scribe
Use MapReduce to aggregate errors and exceptions across clusters (mapper sketch below)
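An illustrative mapper for the aggregation job: it keys each log line by exception class so a summing reducer can rank them. The log format and regex are assumptions.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ExceptionCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Pattern EXCEPTION =
                Pattern.compile("([\\w.]+(?:Exception|Error))");
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Matcher m = EXCEPTION.matcher(line.toString());
            if (m.find()) {
                context.write(new Text(m.group(1)), ONE); // (exception class, 1)
            }
        }
    }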
Table Based Replication Queue (in progress)
Problems:
Too much data stored on ZooKeeper
Over 200MB of replication data for a disabled peer
Too many writes to ZooKeeper
5k/s writes to ZooKeeper in a cluster with 100k/s writes (HBASE-12636)
Table Based Replication Queue (in progress)
Solution: Move the replication queue to a system table
Row key: server name + peer id + hlog name
One column records the offset up to which the log has been replicated (sketched below)
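A hedged sketch of what a queue-table write could look like; the family and qualifier names are assumptions, not the final design of the feature.

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    class ReplicationQueueTable {
        static final byte[] FAMILY = Bytes.toBytes("q");      // assumed family
        static final byte[] OFFSET = Bytes.toBytes("offset"); // assumed qualifier

        // Row key layout from the slide: server name + peer id + hlog name.
        static Put recordProgress(String serverName, String peerId,
                                  String hlogName, long offset) {
            byte[] rowKey = Bytes.toBytes(serverName + "," + peerId + "," + hlogName);
            Put put = new Put(rowKey);
            put.addColumn(FAMILY, OFFSET, Bytes.toBytes(offset));
            return put;
        }
    }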
Asynchronous Event Notification (in progress)
Purposes:
Incremental statistics of data in HBase
Table schema transformation
Asynchronous data indexing
Solution:
An asynchronous event notification framework on HBase (HBASE-12884)
Replication-based implementation:
Add a fake replication peer, which can receive the WAL edits from HBase clusters (endpoint sketch below)
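A hedged sketch of a fake peer built on the pluggable ReplicationEndpoint API (HBASE-11367); class and method details vary across HBase versions, and notifyListeners() is a hypothetical application hook, not part of HBase.

    import java.util.UUID;
    import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
    import org.apache.hadoop.hbase.wal.WAL;

    public class NotificationEndpoint extends BaseReplicationEndpoint {
        private final UUID id = UUID.randomUUID();

        @Override
        public UUID getPeerUUID() { return id; }

        @Override
        public boolean replicate(ReplicateContext context) {
            for (WAL.Entry entry : context.getEntries()) {
                notifyListeners(entry);  // fan WAL edits out to local consumers
            }
            return true;                 // acknowledge the batch to the source
        }

        @Override protected void doStart() { notifyStarted(); }
        @Override protected void doStop()  { notifyStopped(); }

        private void notifyListeners(WAL.Entry entry) { /* hypothetical hook */ }
    }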
Agenda
1. Current Status
2. Problems and Solutions
3. Themis
Cross-Row Transaction
Why cross-row transaction?
Cross-row data consistency
Rows in different regions / tables
Example
Music index building
Cross-Row Transaction
Features
ACID
No central coordinator
Integrated without HBase code change
Google’s Percolator
Large-scale Incremental Processing Using Distributed Transactions and Notifications, by Daniel Peng and Frank Dabek, 2010
Themis
https://github.com/Xiaomi/themis
Provides cross-row transactions on HBase based on Percolator
Themis Infrastructure
Timestamp server
Uses the timestamp of KeyValue internally
Timestamps must be globally increasing
Client
Coprocessor
Timestamp Server
Separates a long timestamp into two parts:
Higher 46 bits: Sync with system time
Lower 18 bits: Incremental counter
Hundreds of thousands of unique timestamps in each millisecond
Monotonically increasing within one timestamp server (sketched below)
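A minimal single-server sketch of the 46/18-bit split; it ignores clock regressions and the ZooKeeper-persisted bound covered on the next slide.

    class TimestampOracle {
        private static final int COUNTER_BITS = 18;
        private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1; // 262,143

        private long lastMillis = -1;
        private long counter = 0;

        synchronized long next() {
            long now = System.currentTimeMillis();
            if (now != lastMillis) {
                lastMillis = now;
                counter = 0;                 // new millisecond: reset the low bits
            } else if (++counter > COUNTER_MASK) {
                // 2^18 timestamps exhausted in this millisecond: wait for the next
                while ((now = System.currentTimeMillis()) == lastMillis) { }
                lastMillis = now;
                counter = 0;
            }
            return (now << COUNTER_BITS) | counter; // high 46 bits: time, low 18: counter
        }
    }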
Timestamp Server
Monotonically increasing across timestamp servers
Periodically save a future timestamp into ZooKeeper
Allocated timestamps must stay below the saved timestamp
Another server reads the saved timestamp when starting
High availability
High throughput: 600,000 RPCs per second
Batches concurrent requests into one RPC (bound sketch below)
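A hedged sketch of the persisted-bound rule, with an in-memory field standing in for the znode; the real server writes to ZooKeeper before handing out timestamps near the bound.

    class BoundedAllocator {
        private static final long WINDOW = 1_000_000L; // illustrative reservation size

        private long persistedMax = 0; // stands in for the value saved in ZooKeeper
        private long next = 0;

        synchronized long allocate() {
            if (next >= persistedMax) {
                // Real code: persist the new bound to ZooKeeper *before*
                // allocating past the old one.
                persistedMax = next + WINDOW;
            }
            return next++; // always below the persisted bound
        }

        // A failover server starts from the saved bound, so it can never
        // re-issue a timestamp the old server may have handed out.
        synchronized void recoverFrom(long savedBound) {
            next = savedBound;
            persistedMax = savedBound;
        }
    }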
Themis
Cross-row mutation example: cash table
Rows for Bob and Joe are in different regions
Transfer $3 from Joe to Bob atomically
Two auxiliary columns: lock column and commit column
Two-phase commit
Prewrite Phase
Commit Phase
HBase's checkAndMutate guarantees atomicity within a single row
Prewrite Phase
Fetch a prewrite timestamp from timestamp server (prewriteTs=99)
Select primary and secondary columns
Primary
Column: (Joe, f:c)
PrimaryLock: {secondaries: [(Bob, f:c)]}
Secondaries
Column: (Bob, f:c)
SecondaryLock: {primary: (Joe, f:c)}
Prewrite Phase
Prewrite primary column
Write the primary lock and data if no lock exists in the lock column
checkAndMutate of HBase guarantees the atomicity
Prevents other clients from mutating the same column concurrently (sketched below)
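A hedged sketch of the prewrite using checkAndPut's "expect no cell" form; the column layout (data family f from the slides, a separate lock family) is simplified from Themis' actual schema.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class PrewriteSketch {
        static final byte[] DATA_FAMILY = Bytes.toBytes("f"); // from the slides
        static final byte[] LOCK_FAMILY = Bytes.toBytes("L"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        // Returns false if another transaction already holds the lock.
        static boolean prewrite(Table table, byte[] row, byte[] value,
                                long prewriteTs, byte[] lockBytes) throws IOException {
            Put put = new Put(row);
            put.addColumn(DATA_FAMILY, COL, prewriteTs, value);     // data at prewriteTs
            put.addColumn(LOCK_FAMILY, COL, prewriteTs, lockBytes); // the lock
            // A null expected value means "succeed only if the lock cell does
            // not exist", making the check-and-write atomic on the server side.
            return table.checkAndPut(row, LOCK_FAMILY, COL, null, put);
        }
    }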
Prewrite Phase
Prewrite secondary columns
Follow the same steps as prewriting the primary column
Commit Phase
Fetch commit timestamp from timestamp server (commitTs=100)
Commit Primary
Delete the lock and write the commit column if the lock exists
checkAndMutate of HBase guarantees the atomicity
Decides the success or failure of the whole transaction (sketched below)
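A hedged sketch of committing the primary: one checkAndMutate verifies our lock is still in place, erases it, and writes the commit cell atomically. The column layout is simplified as before.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.util.Bytes;

    class CommitSketch {
        static final byte[] LOCK_FAMILY = Bytes.toBytes("L");   // assumed name
        static final byte[] COMMIT_FAMILY = Bytes.toBytes("C"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        // Returns false if our lock is gone: the transaction must roll back.
        static boolean commitPrimary(Table table, byte[] row, byte[] lockBytes,
                                     long prewriteTs, long commitTs) throws IOException {
            RowMutations rm = new RowMutations(row);
            Delete delLock = new Delete(row);
            delLock.addColumn(LOCK_FAMILY, COL, prewriteTs);    // erase our lock
            Put commit = new Put(row);
            // The commit cell at commitTs points back to the data at prewriteTs.
            commit.addColumn(COMMIT_FAMILY, COL, commitTs, Bytes.toBytes(prewriteTs));
            rm.add(delLock);
            rm.add(commit);
            // Both mutations apply only while the lock cell still equals ours.
            return table.checkAndMutate(row, LOCK_FAMILY, COL,
                    CompareOp.EQUAL, lockBytes, rm);
        }
    }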
Commit Phase
Commit secondaries
Delete the lock and write commit column atomically
Themis Read
Fetch a read timestamp (readTs=101)
Read commit columns with commitTs < readTs
(Joe, c:f#c) => (100 : 99)
(Bob, c:f#c) => (100 : 99)
Read the data column at prewriteTs
prewriteTs is the value stored in the commit column
(Joe, 99: $17) and (Bob, 99: $12)
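A hedged sketch of this read path, ignoring pending locks and their cleanup; the deck's commit column (c:f#c) is simplified here to a fixed family and qualifier, as in the earlier sketches.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class ReadSketch {
        static final byte[] DATA_FAMILY = Bytes.toBytes("f");
        static final byte[] COMMIT_FAMILY = Bytes.toBytes("C"); // assumed name
        static final byte[] COL = Bytes.toBytes("c");

        static byte[] read(Table table, byte[] row, long readTs) throws IOException {
            // 1. Newest commit cell with commitTs < readTs (the time range max
            //    is exclusive, so [0, readTs) matches the slide's condition).
            Get commitGet = new Get(row);
            commitGet.addColumn(COMMIT_FAMILY, COL);
            commitGet.setTimeRange(0, readTs);
            Result commit = table.get(commitGet);
            if (commit.isEmpty()) return null; // nothing committed before readTs

            // 2. The commit cell's value is the prewriteTs of the visible data.
            long prewriteTs = Bytes.toLong(commit.getValue(COMMIT_FAMILY, COL));

            // 3. Read the data column at exactly that timestamp.
            Get dataGet = new Get(row);
            dataGet.addColumn(DATA_FAMILY, COL);
            dataGet.setTimeStamp(prewriteTs);
            return table.get(dataGet).getValue(DATA_FAMILY, COL);
        }
    }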
Performance Comparison
Single-Row Transaction
The worst case relative to raw HBase
One region server with 10GB heap memory
Write Performance: Preload 3 million rows, 256MB LRU cache
Read Performance: Preload 30GB of data
Performance Comparison
Single-Row Transaction
The worst case relative to raw HBase
One region server with 10GB heap memory
(Charts: write performance and read performance vs. raw HBase)
Performance Comparison
Themis vs. Percolator
Write optimizations
Set the lock column family to IN_MEMORY (sketched below)
Do not sync the WAL when writing prewrite-phase locks
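A minimal sketch of the lock-family tweak using the 1.0-era admin API; the family name is assumed.

    import org.apache.hadoop.hbase.HColumnDescriptor;

    class LockFamilyTuning {
        static HColumnDescriptor inMemoryLockFamily() {
            HColumnDescriptor lock = new HColumnDescriptor("L"); // assumed name
            lock.setInMemory(true); // favor lock cells in the block cache
            return lock;
        }
    }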
Future Work of Themis
Generic transaction API: HBASE-11447
Support different isolation levels
Global secondary index
Thanks! Questions?
Contacts: {cuijianwei, liushaohui}@xiaomi.com
