HBaseCon 2015: HBase Operations at Xiaomi

In this session, you will learn about the work Xiaomi has done to improve the availability and stability of its HBase clusters, including cross-site data and service backup and a coordinated compaction framework. You'll also learn about the Themis framework, which supports cross-row transactions on HBase based on Google's Percolator algorithm, and its usage in Xiaomi's applications.

  1. 1. HBase at Xiaomi Jianwei Cui, Shaohui Liu {cuijianwei, liushaohui}@xiaomi.com
  2. 2. About Xiaomi Sold 60M phones in 2014, 3X of 2013 Guinness World Record: selling 2.11M phones online in 24h 2 / 35
  3. 3. Our HBase Team 8 Developers: Honghua Feng, Liang Xie, Jianwei Cui, Liangliang He, YingChao Zhou, Qiming Cheng, Guanghao Zhang, Shaohui Liu. 130 patches submitted in 2014, 82 committed 3 / 35
  4. 4. Agenda 1. Current Status 2. Problems and Solutions 3. Themis 4 / 35
  5. 5. Clusters and Scenarios Mainland China 20 online clusters / 2 offline clusters in 3 data centers AWS 4 online clusters / 1 offline cluster in 2 regions Online Service Mi Cloud, Mi Push, Galaxy, Mi Message,... Offline Processing User Profile, Trace, Recommendation, ... 5 / 35
  6. 6. Scenario A: Mi Cloud Personal cloud storage for smart phones Numbers: 90+ million users (3X increase in 2014); 500 billion rows (6X increase in 2014); 1000+ regions in the largest table. See: https://i.mi.com 6 / 35
  7. 7. Scenario B: Mi Push Push service on Android Data stored in HBase: pub-sub relations of topics and devices; messages to each device. Numbers: 200+ million users; 2 billion+ messages pushed every day; 200,000+ requests per second at peak 7 / 35
  8. 8. Deployment Two clusters with master-master replication in different data centers Client switches clusters through configs on ZooKeeper Using canary for availability check and alerts 8 / 35
  9. 9. Agenda 1. Current Status 2. Problems and Solutions 3. Themis 9 / 35
  10. 10. Long Full GC Pauses for RegionServer Problem: Long full GC pauses make the ZooKeeper session expire (zookeeper.session.timeout = 30s; a full GC pause of a RegionServer with a 30G heap can be 40s) Solution: Multiple regionserver instances per node; more off-heap memory using bucket cache 10 / 35
  11. 11. Hotspot for Temporal Data Problem: Writes of temporal data go to a small set of regions Solution: Salted table, based on SaltedHTable open-sourced by the Intel Hadoop team (see: https://github.com/intel-hadoop/SaltedHTable); transparent to applications through table schema support; MapReduce support 11 / 35
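
The salting idea on the slide above can be sketched as follows. This is a minimal illustration, not the SaltedHTable implementation: the bucket count, the `NN-` prefix layout, and the function names `salt_key`, `unsalt_key`, and `scan_prefixes` are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count, not taken from the slides


def salt_key(row_key: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the row key with a bucket derived from a hash of the key.

    Time-ordered keys would otherwise sort next to each other and land in
    the same region; the prefix spreads them over `buckets` key ranges.
    """
    bucket = zlib.crc32(row_key) % buckets
    return b"%02d-" % bucket + row_key


def unsalt_key(salted_key: bytes) -> bytes:
    """Strip the bucket prefix so the application sees the original key."""
    return salted_key.split(b"-", 1)[1]


def scan_prefixes(buckets: int = NUM_BUCKETS) -> list:
    """A range scan over original keys must fan out to every bucket prefix."""
    return [b"%02d-" % b for b in range(buckets)]
```

The trade-off in this kind of scheme is that point writes and reads stay cheap (the prefix is recomputed from the key), while a scan has to query every bucket and merge the results.
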
  12. 12. Coordinated Compaction Problem: Compaction storm Solution: A compaction manager in HMaster coordinates all the compactions in the cluster Before a compaction starts, regionserver needs to acquire a compaction quota 12 / 35
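
The quota handshake described on the slide above can be pictured with a small in-process sketch. The slides give no details of the protocol, so the class name `CompactionManager`, the fixed cluster-wide quota, and the `try_acquire` / `release` methods are hypothetical; in the real system the manager runs in HMaster and region servers would talk to it over RPC.

```python
import threading


class CompactionManager:
    """Toy stand-in for a master-side coordinator (all names hypothetical)."""

    def __init__(self, total_quota: int):
        self._free = total_quota          # compactions allowed cluster-wide
        self._running = {}                # server name -> quotas currently held
        self._lock = threading.Lock()

    def try_acquire(self, server: str) -> bool:
        """Called by a region server before it starts a compaction."""
        with self._lock:
            if self._free == 0:
                return False              # caller should defer the compaction
            self._free -= 1
            self._running[server] = self._running.get(server, 0) + 1
            return True

    def release(self, server: str) -> None:
        """Called by a region server when its compaction finishes."""
        with self._lock:
            if self._running.get(server, 0) == 0:
                raise ValueError("server holds no compaction quota")
            self._running[server] -= 1
            self._free += 1
```

With a quota of 2, a third region server asking for a slot is turned away until one of the first two releases, which is how a cluster-wide cap would keep many regions from compacting at once.
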
  13. 13. Exception Aggregation Purposes: Find the potential bugs in the clusters Solution: Write HMaster/RegionServer log asynchronously to HDFS through Scribe Using MapReduce to aggregate errors and exceptions of clusters 13 / 35
  14. 14. Table Based Replication Queue (in progress) Problems: Too much data stored on ZooKeeper Over 200MB replication data for a disabled peer Too many writes to ZooKeeper 5k/s writes to ZooKeeper in a cluster with 100k/s writes (HBASE-12636) 14 / 35
  15. 15. Table Based Replication Queue (in progress) Solution: Move replication queue to a system table Row key : server name + peer id + hlog name One column records the offset at which the log is replicated 15 / 35
  16. 16. Asynchronous Event Notification (in progress) Purposes: Incremental statistics of data in HBase Table schema transformation Asynchronous data indexing Solution: An asynchronous event notification framework on HBase (HBASE-12884) Replication based implementation: Add a fake replication peer, which can receive the WAL edits from HBase clusters 16 / 35
  17. 17. Agenda 1. Current Status 2. Problems and Solutions 3. Themis 17 / 35
  18. 18. Cross-Row Transaction Why cross-row transaction? Cross-row data consistency Rows in different regions / tables Example Music index building 18 / 35
  19. 19. Cross-Row Transaction Features: ACID; no central coordinator; integrated without HBase code change. Google's Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications", by Daniel Peng and Frank Dabek, 2010. Themis: https://github.com/Xiaomi/themis Provides cross-row transactions on HBase based on Percolator 19 / 35
  20. 20. Themis Infrastructure Timestamp server Use timestamp of KeyValue internally Timestamp must be globally incremental Client Coprocessor 20 / 35
  21. 21. Timestamp Server Separate a long type timestamp into two parts: Higher 46 bits: Sync with system time; Lower 18 bits: Incremental counter. Hundreds of thousands of unique timestamps in each millisecond. Incremental in one timestamp server 21 / 35
  22. 22. Timestamp Server Incremental across timestamp servers: Periodically save a future timestamp into ZooKeeper; allocated timestamps must be smaller than the saved timestamp; another server needs to read the saved timestamp when starting High availability High throughput: 600,000 RPCs per second; batch concurrent requests in one RPC 22 / 35
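
The 46-bit / 18-bit layout from the two timestamp server slides above can be sketched as below. This is an illustrative single-process allocator, not the Themis code: the class name `TimestampOracle`, the injectable `clock_ms` function, and the overflow handling (borrowing the next millisecond) are assumptions, and the ZooKeeper-persisted upper bound used for failover is not modelled.

```python
import threading
import time

COUNTER_BITS = 18                        # lower 18 bits: per-millisecond counter
MAX_COUNTER = (1 << COUNTER_BITS) - 1    # 262,143, so 262,144 values per ms


class TimestampOracle:
    """Hands out strictly increasing 64-bit timestamps (hypothetical sketch)."""

    def __init__(self, clock_ms=lambda: int(time.time() * 1000)):
        self._clock_ms = clock_ms
        self._lock = threading.Lock()
        self._last_ms = 0
        self._counter = 0

    def next_timestamp(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now > self._last_ms:
                # New millisecond: restart the counter.
                self._last_ms = now
                self._counter = 0
            else:
                # Same millisecond, or the clock stepped back: keep counting.
                self._counter += 1
                if self._counter > MAX_COUNTER:
                    # Counter exhausted: move on to the next millisecond value.
                    self._last_ms += 1
                    self._counter = 0
            return (self._last_ms << COUNTER_BITS) | self._counter


def split_timestamp(ts: int):
    """Return (milliseconds, counter) encoded in a timestamp."""
    return ts >> COUNTER_BITS, ts & MAX_COUNTER
```

An 18-bit counter allows 2^18 = 262,144 distinct values per millisecond, which matches the slide's "hundreds of thousands" figure.
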
  23. 23. Themis Cross-row mutation example : Cash Table Rows for Bob and Joe are in different regions Transfer $3 from Joe to Bob atomically Two auxiliary columns : lock column and commit column Two-phase commit Prewrite Phase Commit Phase checkAndMutate of HBase : guarantee the atomicity for a single row 23 / 35
  24. 24. Prewrite Phase Fetch a prewrite timestamp from timestamp server (prewriteTs=99) Select primary and secondary columns Primary Column : (Joe, f:c) PrimaryLock: {secondaries : [(Bob, f:c)]} Secondaries Column : (Bob, f:c) SecondaryLock: {primary : (Joe, f:c)} 24 / 35
  25. 25. Prewrite Phase Prewrite primary column Write primary lock and data if no lock exists in lock column checkAndMutate of HBase to guarantee the atomicity Prevent other clients mutating the same column concurrently 25 / 35
  26. 26. Prewrite Phase Prewrite secondary columns Follow the same steps of prewriting primary column 26 / 35
  27. 27. Commit Phase Fetch commit timestamp from timestamp server (commitTs=100) Commit Primary: Delete the lock and write the commit column if the lock exists; checkAndMutate of HBase guarantees the atomicity; this decides the success or failure of the whole transaction 27 / 35
  28. 28. Commit Phase Commit secondaries Delete the lock and write commit column atomically 28 / 35
  29. 29. Themis Read Fetch a read timestamp (readTs=101) Read commit columns with commitTs < readTs: (Joe, c:f#c) => (100 : 99), (Bob, c:f#c) => (100 : 99) Read data column with prewriteTs (prewriteTs is just the value of the commit column): (Joe, 99: $17) and (Bob, 99: $12) 29 / 35
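
The prewrite, commit, and read steps from the slides above can be traced with a toy in-memory model. This is a sketch under stated assumptions, not the Themis API: `ToyTable`, `seed`, `transfer`, and the other names are invented for the example, plain dictionaries stand in for the data, lock, and commit columns, and each method body stands in for one atomic single-row `checkAndMutate`. Lock cleanup after client failure and the other conflict checks from the Percolator paper are left out. The starting balances ($20 and $9) are inferred from the slide's final values.

```python
class ToyTable:
    """In-memory stand-in for one HBase table with lock and commit columns."""

    def __init__(self):
        self.data = {}     # (row, prewrite_ts) -> value
        self.lock = {}     # row -> {"primary": row, "ts": prewrite_ts}
        self.commit = {}   # row -> {commit_ts: prewrite_ts}

    def seed(self, row, value, prewrite_ts, commit_ts):
        """Install an already committed value (test setup only)."""
        self.data[(row, prewrite_ts)] = value
        self.commit.setdefault(row, {})[commit_ts] = prewrite_ts

    def prewrite(self, row, value, prewrite_ts, primary):
        """Write lock and data only if no lock exists on the row."""
        if row in self.lock:
            return False
        self.lock[row] = {"primary": primary, "ts": prewrite_ts}
        self.data[(row, prewrite_ts)] = value
        return True

    def rollback(self, row, prewrite_ts):
        """Undo a prewrite made by this transaction."""
        if self.lock.get(row, {}).get("ts") == prewrite_ts:
            del self.lock[row]
            del self.data[(row, prewrite_ts)]

    def commit_row(self, row, prewrite_ts, commit_ts):
        """Delete the lock and write the commit column if the lock exists."""
        held = self.lock.get(row)
        if held is None or held["ts"] != prewrite_ts:
            return False
        del self.lock[row]
        self.commit.setdefault(row, {})[commit_ts] = prewrite_ts
        return True

    def read(self, row, read_ts):
        """Find the newest commit below read_ts, then read data at its prewrite_ts."""
        visible = [c for c in self.commit.get(row, {}) if c < read_ts]
        if not visible:
            return None
        return self.data[(row, self.commit[row][max(visible)])]


def transfer(table, src, dst, amount, prewrite_ts, commit_ts):
    """Move `amount` from src to dst; src is used as the primary row."""
    new_src = table.read(src, prewrite_ts) - amount
    new_dst = table.read(dst, prewrite_ts) + amount
    if not table.prewrite(src, new_src, prewrite_ts, primary=src):
        return False
    if not table.prewrite(dst, new_dst, prewrite_ts, primary=src):
        table.rollback(src, prewrite_ts)
        return False
    # Committing the primary is the point that decides the transaction.
    if not table.commit_row(src, prewrite_ts, commit_ts):
        return False
    table.commit_row(dst, prewrite_ts, commit_ts)
    return True


table = ToyTable()
table.seed("Joe", 20, prewrite_ts=1, commit_ts=2)
table.seed("Bob", 9, prewrite_ts=1, commit_ts=2)
transfer(table, "Joe", "Bob", 3, prewrite_ts=99, commit_ts=100)
print(table.read("Joe", 101), table.read("Bob", 101))  # prints: 17 12
```

A reader with a read timestamp of 100 or lower still sees the old balances, because only commit entries with commitTs < readTs are considered.
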
  30. 30. Performance Comparison Single-Row Transaction The worst case compared with raw HBase One region server with 10GB heap memory Write Performance : Preload 3 million rows, 256MB LRU cache Read Performance : Preload 30GB data 30 / 35
  31. 31. Performance Comparison Single-Row Transaction The worst case compared with raw HBase One region server with 10GB heap memory Write Performance Read Performance 31 / 35
  32. 32. Performance Comparison Themis vs. Percolator Write optimization: Set the lock family to IN_MEMORY; do not sync the lock written in the prewrite phase 32 / 35
  33. 33. Future Work of Themis Generic transaction API: HBASE-11447 Support different isolation levels Global secondary index 33 / 35
  35. 35. Thanks! Questions? Contacts: {cuijianwei, liushaohui}@xiaomi.com
