HBase at Xiaomi

Jun. 16, 2014


  1. HBase at Xiaomi {xieliang, fenghonghua} Liang Xie / Honghua Feng
  2. About Us: Honghua Feng, Liang Xie
  3. Outline: Introduction; Latency practice; Some patches we contributed; Some ongoing patches; Q&A
  4. About Xiaomi: Mobile internet company founded in 2010; sold 18.7 million phones in 2013; over $5 billion revenue in 2013; sold 11 million phones in Q1 2014
  5. Hardware
  6. Software
  7. Internet Services
  8. About Our HBase Team: Founded in October 2012; 5 members (Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng); resolved 130+ JIRAs so far
  9. Our Clusters and Scenarios: 15 clusters (9 online / 2 processing / 4 test); scenarios: MiCloud, MiPush, MiTalk, Perf Counter
  10. Our Latency Pain Points: Java GC; stable page write in the OS layer; slow buffered IO (FS journal IO); read/write IO contention
  11. HBase GC Practice: Bucket cache with off-heap mode; Xmn/SurvivorRatio/MaxTenuringThreshold; PretenureSizeThreshold & repl src size; GC concurrent thread number. GC time per day: [2500, 3000] -> [300, 600] s!
  12. Write Latency Spikes: HBase client put -> HRegion.batchMutate -> HLog.sync -> SequenceFileLogWriter.sync -> DFSOutputStream.flushOrSync -> DFSOutputStream.waitForAckedSeqno (often stuck here!). DataNode pipeline write, in BlockReceiver.receivePacket(): receiveNextPacket -> mirrorPacketTo(mirrorOut) (write packet to the mirror) -> out.write/flush (write data to local disk; buffered IO). Added instrumentation (HDFS-6110) showed the stalled write was the culprit; the strace result also confirmed it.
  13. Root Cause of Write Latency Spikes: write() is expected to be fast, but it is sometimes blocked by write-back!
  14. Stable page write issue workaround: Workaround : -> or -> Try to avoid deploying RHEL 6.3/CentOS 6.3 in an extremely latency-sensitive HBase cluster!
  15. Root Cause of Write Latency Spikes: ... 0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2] 0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2] 0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4] 0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4] 0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] 0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4] 0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4] 0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel] 0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel] 0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel] 0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4] 0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel] 0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel] 0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]. XFS in the latest kernel can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS
  16. Write Latency Spikes Testing: 8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DN; kernel: 3.12.17. Counted the stalled write() calls that cost > 100 ms. The largest write() latency in ext4: ~600 ms!
  17. Hedged Read (HDFS-5776)
  18. Other Meaningful Latency Work: Long first “put” issue (HBASE-10010); token invalid (HDFS-5637); retry/timeout setting in DFSClient; reduce write traffic? (HLog compression); HDFS IO priority (HADOOP-10410)
  19. Wish List: Real-time HDFS, esp. priority related; GC-friendly core data structures; more off-heap; Shenandoah GC; TCP/disk IO characteristic analysis. Need more eyes on the OS. Stay tuned…
  20. Some Patches Xiaomi Contributed: New write thread model (HBASE-8755); reverse scan (HBASE-4811); per table/CF replication (HBASE-8751); block index key optimization (HBASE-7845)
  21. 1. New Write Thread Model. Old model (thread counts shown in the diagram: 256, 256, 256): WriteHandler -> Local Buffer; WriteHandler: write to HDFS; WriteHandler: sync to HDFS. Problem: WriteHandler does everything, severe lock race!
  22. New Write Thread Model. New model (thread counts shown in the diagram: 256, 1, 1, 4): WriteHandler -> Local Buffer; AsyncWriter: write to HDFS; AsyncSyncer: sync to HDFS; AsyncNotifier: notify writers
  23. New Write Thread Model: Low load: no improvement; heavy load: huge improvement (3.5x)
  24. 2. Reverse Scan. Example KV layout from the diagram (three sorted groups): [Row2 kv2, Row3 kv1, Row3 kv3, Row4 kv2, Row4 kv5, Row5 kv2]; [Row1 kv2, Row3 kv2, Row3 kv4, Row4 kv4, Row4 kv6, Row5 kv3]; [Row1 kv1, Row2 kv1, Row2 kv3, Row4 kv1, Row4 kv3, Row6 kv1]. Steps: (1) all scanners seek to ‘previous’ rows (SeekBefore); (2) figure out the next row: the max ‘previous’ row; (3) all scanners seek to the first KV of the next row (SeekTo). Performance: 70% of forward scan
  25. 3. Per Table/CF Replication: Source (T1: cfA, cfB; T2: cfX, cfY) replicates T1:cfA,cfB; T2:cfX,cfY to PeerA (backup); what about PeerB (T2:cfX)? If PeerB creates T2 only: replication can’t work! If PeerB creates T1 & T2: all data is replicated! Need a way to specify which data to replicate!
  26. Per Table/CF Replication: Source (T1: cfA, cfB; T2: cfX, cfY) replicates T1:cfA,cfB; T2:cfX,cfY to PeerA and only T2:cfX to PeerB (T2:cfX). Commands: add_peer ‘PeerA’, ‘PeerA_ZK’; add_peer ‘PeerB’, ‘PeerB_ZK’, ‘T2:cfX’
  27. 4. Block Index Key Optimization: Block 1 ends with k1: “ab”; Block 2 starts with k2: “ah, hello world”. Before: ‘Block 2’ block index key = “ah, hello world/…”. Now: ‘Block 2’ block index key = “ac/…” (k1 < key <= k2). Reduces block index size; saves seeking the previous block if the search key is in [‘ac’, ‘ah, hello world’]
  28. Some ongoing patches: Cross-table cross-row transaction (HBASE-10999); HLog compactor (HBASE-9873); adjusted delete semantic (HBASE-8721); coordinated compaction (HBASE-9528); quorum master (HBASE-10296)
  29. 1. Cross-Row Transaction: Themis. Google Percolator: “Large-scale Incremental Processing Using Distributed Transactions and Notifications”; two-phase commit: strong cross-table/row consistency; global timestamp server: globally, strictly increasing timestamps; no touch to HBase internals: based on the HBase client and coprocessors; Read: 90%, Write: 23% (same downgrade as Google Percolator); more details: HBASE-10999
  30. 2. HLog Compactor (diagram labels: HLog 1, 2, 3; Region 1, Region 2, Region x; Memstore; HFiles). Region x: few writes, but scattered across many HLogs. PeriodicMemstoreFlusher: flushes old memstores forcefully; ‘flushCheckInterval’/‘flushPerChanges’: hard to configure; results in ‘tiny’ HFiles; HBASE-10499: problematic region can’t be flushed!
  31. HLog Compactor (diagram labels: HLog 1, 2, 3, 4; HLog x; Region 1, Region 2, Region x; Memstore; HFiles). Compact: HLog 1,2,3,4 -> HLog x; Archive: HLog 1,2,3,4
  32. 3. Adjusted Delete Semantic. Scenario 1: (1) write kvA at t0; (2) delete kvA at t0, flush to HFile; (3) write kvA at t0 again; (4) read kvA. Result: kvA can’t be read out. Scenario 2: (1) write kvA at t0; (2) delete kvA at t0, flush to HFile; (3) major compact; (4) write kvA at t0 again; (5) read kvA. Result: kvA can be read out. Fix: “delete can’t mask KVs with larger mvcc (put later)”
  33. 4. Coordinated Compaction: RS, RS, RS all compact against HDFS (global resource): compact storm! Compaction uses a global HDFS, while whether to compact is decided locally!
  34. Coordinated Compaction: Each RS asks the Master “Can I?”; the Master answers “OK” or “NO”. HDFS is a global resource. Compaction is scheduled by the master, no compact storm any longer
  35. 5. Quorum Master (diagram labels: ZooKeeper zk1, zk2, zk3; Master, Master; RS, RS, RS; read info/states). When the active master serves, the standby master stays ‘really’ idle; when the standby master becomes active, it needs to rebuild in-memory status
  36. Quorum Master (diagram labels: Master 1, Master 2, Master 3; RS, RS, RS). Better master failover perf: no phase to rebuild in-memory status; no external (ZooKeeper) dependency; no potential consistency issue; simpler deployment; better restart perf for BIG clusters (10+K regions)
  37. Acknowledgement: Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen
  38. Thank You!
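
The block index key optimization on slide 27 can be illustrated with a small sketch. This is not the HBASE-7845 code: the function name `short_index_key` is hypothetical, and it works on bare row bytes only, whereas a real HBase key also carries family, qualifier, timestamp, and type. It only shows how a short key satisfying k1 < key <= k2 might be chosen.

```python
def short_index_key(last_key: bytes, first_key: bytes) -> bytes:
    """Pick a short key k with last_key < k <= first_key.

    last_key  : row of the last KV in the previous block (k1 on the slide)
    first_key : row of the first KV in the next block (k2 on the slide)
    Hypothetical helper for illustration; rows only, no family/qualifier.
    """
    if not last_key < first_key:
        raise ValueError("blocks must be sorted: last_key < first_key")

    # Length of the common prefix of the two rows.
    n = min(len(last_key), len(first_key))
    i = 0
    while i < n and last_key[i] == first_key[i]:
        i += 1

    # If there is room between the first differing bytes, bump the byte
    # from last_key by one: the result sorts strictly between both rows.
    if i < n and last_key[i] + 1 < first_key[i]:
        return first_key[:i] + bytes([last_key[i] + 1])

    # Otherwise fall back to the shortest prefix of first_key that is
    # still greater than last_key.
    return first_key[: i + 1]


if __name__ == "__main__":
    # The example from slide 27: "ab" and "ah, hello world" give "ac".
    print(short_index_key(b"ab", b"ah, hello world"))
```

With the slide's example the 15-byte index key shrinks to 2 bytes, and a lookup for any row in ['ac', 'ah, hello world'] is routed directly to Block 2.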

Editor's Notes

  1. This is the throughput comparison against a single regionserver: when the write load is low, there is almost no improvement, but as the write load gets heavier and heavier, the improvement is pretty amazing, 3.5x at most. Actually, when the write load is very low, the new model has a small downgrade (about 10%); Michael Stack has fixed this downgrade in another patch. Thanks, Stack!
  2. The second one is reverse scan. Before explaining how reverse scan works, I want to point out an important fact that helps in understanding this patch: the granularity of a scan is the row, not the key-value. All key-values of a row are read out in order from the HFiles or the Memstore, assembled together as a result row in the RegionServer’s memory, and then returned to the client. This works the same way for both forward scan and reverse scan. So the difficulty of reverse scan is, when the current row is done, to figure out which row is next, then jump to that row and start to scan. Let’s see how we do it. Since there are two extra seek operations compared to forward scan, there is a 30% downgrade in performance compared to forward scan, almost the same as in LevelDB. Finally, thanks very much to Chunhui for porting our patch to trunk!
  3. This is the third patch: per table/CF replication. Suppose we have a source cluster with two tables and four column families, all of which can be replicated. For data safety we deployed a peer cluster for backup, and the source cluster replicates all the data to this backup cluster; that’s just what we want, and the replication works pretty well. Then, for some reason such as data analysis or experiments, we deployed another peer cluster, and our experimental program just needs data from cfX of table T2. What kind of replication do we expect? Ideally, only data from cfX of T2 is replicated… but replication can’t work that way! We have to create all tables and column families in PeerB, and all the data will be replicated. That’s really bad, both in terms of bandwidth between the source and PeerB and in terms of PeerB’s resource usage.
  4. So we implemented this feature: it allows specifying which data will be replicated to a peer cluster. For PeerA, the add_peer command is the same as before, since PeerA wants to replicate all the data. But for PeerB, add_peer takes an additional argument that specifies which tables or column families to replicate. The implementation change is quite straightforward: in the source cluster, when parsing the log entries, the replication source thread replicates only entries from cfX of table T2 and ignores all others.
  5. This is the fourth patch: block index key optimization. Its goal is to reduce the overall block index size. Suppose there are two contiguous blocks: the row of the last key-value of Block1 is “ab”, and the row of the first key-value of Block2 is “ah, hello world”. Before our patch, the block index key of Block2 is “ah, hello world” (the first key-value of Block2). After our patch, the block index key is “ac”: a fake key, the minimal key-value with the shortest row length that is larger than the last key-value of Block1 and less than or equal to the first key-value of Block2. The new block index key is much shorter than the old one.
  6. Now let’s continue to talk about some work items we are currently working on
  7. The second one is the HLog compactor. Its target is to keep as few HLogs as possible, so we can say its final target is to improve regionserver failover performance, since the fewer HLog files there are to split, the better the failover performance is. We know a regionserver typically serves many regions, and the write patterns of these regions can be quite different, so the flush frequency and timing of these regions can also be very different. Consider a region x whose memstore contains very few entries, with no flush triggered for a long time, and whose entries are scattered across many HLogs. Even though all other entries in these HLogs have been flushed to HFiles, the HLogs still can’t be archived, since they contain entries from region x… We do have a background flusher thread that flushes old memstores forcefully, but it has some obvious drawbacks: first, it’s hard to configure a good-enough flushCheckInterval and flushPerChanges; second, forceful flushes result in tiny HFiles; last, as in JIRA HBASE-10499, some problematic regions just can’t be flushed at all by this background flusher thread!
  8. Our patch works like this: we introduce another background thread, the HLog compactor. When the total HLog size is too large compared to the memstore size (which means we have flushed enough but not archived enough), we trigger the HLog compactor. It reads entries from all active HLog files; if an entry is still in some region’s memstore, it writes the entry to a new HLog file; if the entry is not in any memstore (which means it has been flushed to some HFile), it ignores the entry. After the compaction, we can archive all the old HLog files without flushing any memstore. We have finished this feature and are testing it in our test cluster; we’ll share the patch after the test.
  9. Let’s consider two scenarios. The first scenario: we write kvA at timestamp t0, then delete it and flush, then write it again, and finally try to read it. The result is that we can’t read it out, since both writes are masked by the delete. The second scenario is the same as the first one, except that before writing kvA for the second time we trigger a major compact. This time kvA can be read out, since the delete is collected by the major compact. This is inconsistent: major compact is transparent to the client, yet the read results differ depending on whether a major compact has occurred. The root cause is that a delete can mask even a key-value that is put later than it. The fix is simple: since mvcc represents the order in which all writes (including puts and deletes) enter HBase, we use it as an additional delete criterion to prevent a delete from masking a later put. We once had some heated discussion on this patch; personally, I still insist that it deserves further thinking and discussion.
  10. The fourth item is coordinated compaction. We talk about compact storms from time to time; now let’s check how they happen. When a regionserver wants to compact, it just triggers the compaction, which reads from HDFS and writes back to HDFS. A regionserver can trigger a new compaction no matter how overloaded the whole system is. So the problem is that what a compaction eventually uses is a global HDFS, but whether to trigger a compaction is a local decision made by each regionserver.
  11. What we propose is using the master as a coordinator for compaction scheduling. It works this way: when a regionserver wants to compact, it asks the master. If the master says yes, the regionserver can trigger the compaction. If the master thinks the system is loaded, it rejects all later compaction requests until the system is no longer loaded.
  12. The last item is quorum master. This is a master redesign, and there has been some discussion on it already. I noticed that Jimmy Xiang from Cloudera and Mikhail from WANdisco have put some effort into it. It’s great! The current master design has two problems. First, some system-wide metadata and status are maintained only in the active master; for master failover, this metadata and status are stored in ZooKeeper as well, and during master failover the new active master needs to read from ZooKeeper to rebuild the in-memory state. Second, ZooKeeper is used as the communication channel between the master and regionservers for the state machine of region assignment tasks; ZooKeeper’s asynchronous notification mechanism is just not suitable for state machine logic, and it is also the root cause of many tricky bugs found so far.
  13. We propose this new design: instead of storing in-memory status in ZooKeeper, we replicate it among all master instances using a consensus protocol such as Raft or Paxos. This way, when the active master fails, a new active master is elected via the consensus protocol among all live standby masters, and the new active master serves immediately without reading state from elsewhere. Quorum master has several advantages: better master failover performance; better restart performance for big clusters, since the communication between the master and ZooKeeper is the bottleneck when a large number of region assignment tasks happen concurrently; no external dependency on ZooKeeper; no more potential consistency issues; and simpler deployment.
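
The two delete scenarios from note 9 can be replayed with a toy model. This is a sketch under loud assumptions, not HBase code: `Cell`, `is_masked`, `read`, `major_compact`, and `scenario` are hypothetical names, a store is just a list of cells, and a delete is modeled as masking every put with a timestamp less than or equal to its own. The `mvcc_aware` flag switches on the proposed extra criterion that a delete cannot mask a put with a larger mvcc.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    kind: str        # "put" or "delete"
    ts: int          # timestamp supplied with the write (t0 in the notes)
    mvcc: int        # order in which the write entered the store
    value: str = ""


def is_masked(put: Cell, deletes: List[Cell], mvcc_aware: bool) -> bool:
    """Return True if some delete marker hides this put."""
    for d in deletes:
        # Old rule: timestamp only. New rule: the put must also be older
        # than the delete in mvcc order.
        if put.ts <= d.ts and (not mvcc_aware or put.mvcc < d.mvcc):
            return True
    return False


def read(cells: List[Cell], mvcc_aware: bool) -> Optional[str]:
    """Return the newest visible value, or None if every put is masked."""
    deletes = [c for c in cells if c.kind == "delete"]
    visible = [c for c in cells
               if c.kind == "put" and not is_masked(c, deletes, mvcc_aware)]
    if not visible:
        return None
    return max(visible, key=lambda c: (c.ts, c.mvcc)).value


def major_compact(cells: List[Cell], mvcc_aware: bool) -> List[Cell]:
    """Drop delete markers together with the puts they mask."""
    deletes = [c for c in cells if c.kind == "delete"]
    return [c for c in cells
            if c.kind == "put" and not is_masked(c, deletes, mvcc_aware)]


def scenario(compact_before_second_put: bool, mvcc_aware: bool) -> Optional[str]:
    """Write kvA at t0, delete it at t0, optionally compact, write again, read."""
    t0 = 100
    cells = [Cell("put", t0, mvcc=1, value="first"),
             Cell("delete", t0, mvcc=2)]
    if compact_before_second_put:
        cells = major_compact(cells, mvcc_aware)
    cells.append(Cell("put", t0, mvcc=3, value="second"))
    return read(cells, mvcc_aware)


if __name__ == "__main__":
    for mvcc_aware in (False, True):
        print("mvcc_aware =", mvcc_aware,
              "| scenario 1:", scenario(False, mvcc_aware),
              "| scenario 2:", scenario(True, mvcc_aware))
```

In this model the timestamp-only rule gives different answers for the two scenarios, while the mvcc-aware rule returns the second put in both, so the read result no longer depends on whether a major compact ran.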