HBase at Xiaomi

Speakers: Liang Xie and Honghua Feng (Xiaomi)

This talk covers the HBase environment at Xiaomi, including thoughts and practices around latency, hardware/OS/VM configuration, GC tuning, the use of a new write thread model and reverse scan, and block index optimization. It will also include some discussion of planned JIRAs based on these approaches.

Published in: Software, Technology

HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com

About Us
Liang Xie, Honghua Feng

Outline
- Introduction
- Latency practice
- Some patches we contributed
- Some ongoing patches
- Q&A

About Xiaomi
- Mobile internet company founded in 2010
- Sold 18.7 million phones in 2013
- Over $5 billion revenue in 2013
- Sold 11 million phones in Q1 2014

Hardware

Software

Internet Services

About Our HBase Team
- Founded in October 2012
- 5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng
- Resolved 130+ JIRAs so far

Our Clusters and Scenarios
- 15 clusters: 9 online / 2 processing / 4 test
- Scenarios: MiCloud, MiPush, MiTalk, Perf Counter

Our Latency Pain Points
- Java GC
- Stable page write in the OS layer
- Slow buffered IO (FS journal IO)
- Read/write IO contention

HBase GC Practice
- Bucket cache with off-heap mode
- Xmn / SurvivorRatio / MaxTenuringThreshold
- PretenureSizeThreshold & replication source size
- GC concurrent thread number
- GC time per day: [2500, 3000] s -> [300, 600] s !!!

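For reference, a minimal sketch of what the first bullet looks like in configuration terms; the values are illustrative, not Xiaomi's actual settings, and in production these keys normally live in hbase-site.xml on the region servers (the JVM flags named above would go into HBASE_REGIONSERVER_OPTS).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBucketCacheSketch {
  /** Illustrative off-heap bucket cache settings; sizes and values are examples only. */
  public static Configuration offHeapBucketCacheConf() {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.bucketcache.ioengine", "offheap"); // keep cached data blocks off the Java heap
    conf.setInt("hbase.bucketcache.size", 4096);       // example bucket cache size, in MB
    return conf;
  }
}
```
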
Write Latency Spikes

HBase client put
  -> HRegion.batchMutate
  -> HLog.sync
  -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync
  -> DFSOutputStream.waitForAckedSeqno   <Stuck here often!>

DataNode pipeline write, in BlockReceiver.receivePacket():
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)   // write the packet to the mirror
  -> out.write/flush             // write data to the local disk <- buffered IO

Added instrumentation (HDFS-6110) showed the stalled write() was the culprit; strace results confirmed it as well.

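The instrumentation added in HDFS-6110 timed the DataNode's disk writes; as an illustration of the approach (a simplified, hypothetical wrapper, not the actual patch), one can time the buffered write and log whenever it stalls:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper showing how a stalled buffered write() can be surfaced. */
class TimedOutputStream extends OutputStream {
  private static final long SLOW_WRITE_THRESHOLD_MS = 100; // illustrative threshold
  private final OutputStream delegate;

  TimedOutputStream(OutputStream delegate) { this.delegate = delegate; }

  @Override
  public void write(byte[] buf, int off, int len) throws IOException {
    long start = System.nanoTime();
    delegate.write(buf, off, len); // the buffered write that can block on dirty-page write-back
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    if (elapsedMs > SLOW_WRITE_THRESHOLD_MS) {
      System.err.println("Slow buffered write: " + elapsedMs + " ms for " + len + " bytes");
    }
  }

  @Override
  public void write(int b) throws IOException { delegate.write(b); }
}
```
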
Root Cause of Write Latency Spikes
- write() is expected to be fast
- But it is sometimes blocked by write-back!

Stable page write issue workaround
- Change the kernel:
  2.6.32-279 (6.3) -> 2.6.32-220 (6.2), or
  2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
- Avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!

Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]

XFS on recent kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy workloads like HBase + HDFS.

Write Latency Spikes Testing
- 8 YCSB threads; write 20 million rows, each 3 * 200 bytes; 3 DataNodes; kernel 3.12.17
- Count the stalled write() calls that take > 100 ms
- The largest write() latency on ext4: ~600 ms!

Hedged Read (HDFS-5776)

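The slide itself is image-only. For reference, hedged reads are enabled on the DFS client with the two properties introduced by HDFS-5776; a minimal sketch with illustrative values:

```java
import org.apache.hadoop.conf.Configuration;

public class HedgedReadConfSketch {
  /** Illustrative values; tune the pool size and threshold for your own latency profile. */
  public static Configuration withHedgedReads() {
    Configuration conf = new Configuration();
    // Thread pool dedicated to hedged reads; 0 (the default) disables the feature.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica has not responded within this threshold, issue a second
    // read to another replica and take whichever response arrives first.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
    return conf;
  }
}
```
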
Other Meaningful Latency Work
- Long first "put" issue (HBASE-10010)
- Token invalid (HDFS-5637)
- Retry/timeout settings in DFSClient
- Reduce write traffic? (HLog compression)
- HDFS IO priority (HADOOP-10410)

Wish List
- Real-time HDFS, especially priority related
- GC-friendly core data structures
- More off-heap; Shenandoah GC
- TCP/disk IO characteristic analysis
- Need more eyes on the OS; stay tuned…

Some Patches Xiaomi Contributed
- New write thread model (HBASE-8755)
- Reverse scan (HBASE-4811)
- Per table/CF replication (HBASE-8751)
- Block index key optimization (HBASE-7845)

1. New Write Thread Model

Old model (diagram): each of the WriteHandler threads (e.g. 256 of them) appends to the local buffer, writes to HDFS, and syncs to HDFS by itself.
Problem: the WriteHandler does everything, so there is severe lock contention!

New model (diagram): the WriteHandlers only append to the local buffer; a single AsyncWriter thread writes to HDFS, a few AsyncSyncer threads sync to HDFS, and an AsyncNotifier thread notifies the waiting writers.

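A minimal, self-contained sketch of the idea (not the HBASE-8755 code; `FakeWal` and `Edit` are made-up names): handler threads only enqueue edits and wait, while one writer thread appends to HDFS and a syncer thread group-syncs and wakes the waiting handlers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical WAL illustrating the handler / AsyncWriter / AsyncSyncer split. */
class FakeWal {
  static final class Edit {
    final byte[] payload;
    final CountDownLatch synced = new CountDownLatch(1); // the handler waits on this
    Edit(byte[] payload) { this.payload = payload; }
  }

  private final BlockingQueue<Edit> unwritten = new LinkedBlockingQueue<>();
  private final BlockingQueue<Edit> unsynced = new LinkedBlockingQueue<>();

  FakeWal() {
    Thread writer = new Thread(this::writeLoop, "AsyncWriter"); // one writer thread
    Thread syncer = new Thread(this::syncLoop, "AsyncSyncer");  // one here; a small pool in practice
    writer.setDaemon(true);
    syncer.setDaemon(true);
    writer.start();
    syncer.start();
  }

  /** Called concurrently by many handler threads: enqueue and block until the edit is durable. */
  void append(byte[] payload) throws InterruptedException {
    Edit e = new Edit(payload);
    unwritten.put(e);
    e.synced.await();
  }

  private void writeLoop() {
    try {
      while (true) {
        Edit e = unwritten.take();
        // a real implementation writes e.payload to the HDFS output stream here
        unsynced.put(e);
      }
    } catch (InterruptedException ignored) { }
  }

  private void syncLoop() {
    try {
      while (true) {
        List<Edit> batch = new ArrayList<>();
        batch.add(unsynced.take());
        unsynced.drainTo(batch); // group-sync everything written so far
        // a real implementation calls hflush()/sync() once for the whole batch here
        for (Edit e : batch) {
          e.synced.countDown(); // the "AsyncNotifier" role: wake the waiting handlers
        }
      }
    } catch (InterruptedException ignored) { }
  }

  public static void main(String[] args) throws InterruptedException {
    FakeWal wal = new FakeWal();
    wal.append("edit-1".getBytes()); // returns once the edit has been written and "synced"
    System.out.println("edit-1 durable");
  }
}
```

The point is that the contention moves from hundreds of handler threads to a couple of single-purpose threads, which is where the 3.5x gain under heavy load on the next slide comes from.
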
New Write Thread Model
- Low load: no improvement
- Heavy load: huge improvement (3.5x)

2. Reverse Scan

(Diagram: three store files, each holding sorted KVs for Row1 through Row6)
1. All scanners seek to the 'previous' rows (SeekBefore)
2. Figure out the next row: the max 'previous' row
3. All scanners seek to the first KV of that row (SeekTo)
Performance: ~70% of forward scan

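For reference, a minimal client-side sketch of a reverse scan (the table and row names are made up; this uses the newer Connection-based client API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("T1"))) { // hypothetical table
      Scan scan = new Scan();
      scan.setReversed(true);                  // walk rows in descending order
      scan.setStartRow(Bytes.toBytes("Row5")); // with a reversed scan, start is the larger row
      scan.setStopRow(Bytes.toBytes("Row1"));  // stop is the smaller, exclusive bound
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}
```
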
3. Per Table/CF Replication

(Diagram: a source cluster with T1: cfA, cfB and T2: cfX, cfY replicating to PeerA, a full backup, and to PeerB, which only wants T2:cfX)
- PeerB creates T2 only: replication can't work!
- PeerB creates both T1 and T2: all the data gets replicated!
Need a way to specify which data to replicate!

Per Table/CF Replication

(Diagram: the source now ships T1:cfA,cfB and T2:cfX,cfY to PeerA, but only T2:cfX to PeerB)
- add_peer 'PeerA', 'PeerA_ZK'
- add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'

4. Block Index Key Optimization

(Diagram: Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world")
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (any key with k1 < key <= k2 works)
- Reduces the block index size
- Avoids seeking back to the previous block when the search key falls in ["ac", "ah, hello world"]

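A minimal sketch of the idea behind the shorter index key (a hypothetical helper, not the HBASE-7845 code): pick the shortest byte string that sorts after the last key of the previous block and no later than the first key of the current block.

```java
import java.util.Arrays;

public class FakeKeyUtil {
  /**
   * Returns a short separator s with left < s <= right, assuming left < right
   * lexicographically. For left = "ab" and right = "ah, hello world" this yields "ac".
   */
  static byte[] shortestSeparator(byte[] left, byte[] right) {
    int i = 0;
    while (i < left.length && i < right.length && left[i] == right[i]) {
      i++; // skip the common prefix
    }
    if (i < left.length && i < right.length && (left[i] & 0xff) + 1 < (right[i] & 0xff)) {
      byte[] sep = Arrays.copyOf(left, i + 1);
      sep[i]++; // bump the first differing byte and drop everything after it
      return sep;
    }
    return right; // fall back to the block's real first key
  }

  public static void main(String[] args) {
    byte[] sep = shortestSeparator("ab".getBytes(), "ah, hello world".getBytes());
    System.out.println(new String(sep)); // prints "ac"
  }
}
```
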
Some ongoing patches
- Cross-table, cross-row transaction (HBASE-10999)
- HLog compactor (HBASE-9873)
- Adjusted delete semantic (HBASE-8721)
- Coordinated compaction (HBASE-9528)
- Quorum master (HBASE-10296)

1. Cross-Row Transaction: Themis
http://github.com/xiaomi/themis
- Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- Two-phase commit: strong cross-table/cross-row consistency
- Global timestamp server: globally, strictly incremental timestamps
- No touching of HBase internals: based on the HBase client and coprocessors
- Read: 90%, Write: 23% (the same downgrade as Google Percolator)
- More details: HBASE-10999

2. HLog Compactor

(Diagram: HLogs 1, 2, 3 holding edits for Region 1 … Region x, each region with its memstore and HFiles)
- Region x: few writes, but they are scattered across many HLogs
- PeriodicMemstoreFlusher: flushes old memstores forcefully
- 'flushCheckInterval' / 'flushPerChanges': hard to configure
- Results in 'tiny' HFiles
- HBASE-10499: a problematic region can't be flushed!

HLog Compactor

(Diagram: the same layout, now with HLogs 1, 2, 3, 4)
- Compact: HLog 1,2,3,4 -> HLog x
- Archive: HLog 1,2,3,4 (only HLog x remains)

3. Adjusted Delete Semantic

Scenario 1:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out

Scenario 2:
1. Write kvA at t0
2. Delete kvA at t0, flush to HFile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out

Fix: "a delete can't mask KVs with a larger mvcc (put later)"

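A minimal client-side sketch of scenario 1 (table, family and qualifier names are made up; this uses the newer Connection-based client API): once the delete marker at t0 has been flushed, a later put at the same timestamp stays masked until a major compaction removes the marker, which is the surprise the adjusted semantic removes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSemanticExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tn = TableName.valueOf("T1"); // hypothetical table
    byte[] row = Bytes.toBytes("rowA"), cf = Bytes.toBytes("cf"), q = Bytes.toBytes("q");
    long t0 = 1000L;
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tn);
         Admin admin = conn.getAdmin()) {
      table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v1"))); // 1. write kvA at t0
      table.delete(new Delete(row).addColumn(cf, q, t0));                // 2. delete kvA at t0
      admin.flush(tn);                                                   //    flush the marker to an HFile
      table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v2"))); // 3. write kvA at t0 again
      Result r = table.get(new Get(row).addColumn(cf, q));               // 4. read kvA
      System.out.println("value: " + (r.isEmpty() ? "<masked>" : Bytes.toString(r.getValue(cf, q))));
    }
  }
}
```
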
4. Coordinated Compaction

(Diagram: several RegionServers all compacting against the shared HDFS at once: a compaction storm!)
- Compaction uses a global resource (HDFS), while the decision to compact is made locally!

Coordinated Compaction

(Diagram: each RegionServer asks the master "Can I?" before compacting; the master replies OK or NO)
- Compactions are scheduled by the master, so there are no compaction storms any longer

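A minimal sketch of the coordination idea (a hypothetical class, not the HBASE-9528 implementation): the master hands out a bounded number of compaction permits, which caps the cluster-wide compaction load on the shared HDFS.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical master-side gate that caps concurrent compactions cluster-wide. */
class CompactionCoordinator {
  private final Semaphore permits;

  CompactionCoordinator(int maxConcurrentCompactions) {
    this.permits = new Semaphore(maxConcurrentCompactions);
  }

  /** Called (via RPC in a real system) by a RegionServer: "Can I?" */
  boolean tryAcquire() {
    return permits.tryAcquire(); // "OK" if a permit is free, "NO" otherwise
  }

  /** Called by the RegionServer once its compaction finishes. */
  void release() {
    permits.release();
  }
}
```
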
5. Quorum Master

(Diagram: an active and a standby HBase master behind a three-node ZooKeeper ensemble; RegionServers read info/states via ZooKeeper)
- While the active master serves, the standby master stays 'really' idle
- When the standby master becomes active, it needs to rebuild the in-memory status

Quorum Master

(Diagram: Master 1, 2 and 3 form a quorum and serve the RegionServers directly)
- Better master failover performance: no phase to rebuild the in-memory status
- No external (ZooKeeper) dependency
- No potential consistency issues
- Simpler deployment
- Better restart performance for BIG clusters (10K+ regions)

Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen

Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com
www.mi.com
