Alibaba HBase Design and Practice in the Field of Search


This deck was presented at the Alibaba Technology Salon in Beijing on Apr. 26, 2014.
It is best viewed on a Retina MacBook with Keynote; otherwise some images may appear blurry.

  • Alibaba: nearly 1.5 billion; Taobao: more than 800 million; Tmall: more than 60 million; Etao: CRAWL more than 250 million, FEED nearly 100 million; OpenSearch: (INTERNAL) more than 800 million docs in more than 400 apps, (EXTERNAL) nearly 600 million docs in more than 3,000 apps; OpenCrawl: nearly 10 billion e-commerce related web pages and nearly 1 billion pivotal web attachments (images & JS) in nearly 60 apps
  • Total: nearly 400 nodes × 3 clusters; Etao cluster: 24,000 regions, 120T of storage
  • Based on Google's BigTable paper (2006); a distributed, column-oriented database; rows stored in sorted order; random, real-time read/write access to very large data
  • NodeManager (YARN), DataNode (HDFS), and HRegionServer (HBase) share the same node. The incremental flow needs a huge amount of random reads and writes. Java API: OpenCrawl; MR Job & iStream Service: all; Thrift: offline ETL and front-end real-time queries
  • Coprocessor: Observer & Endpoint. IC (based on Observer) features: writes a message to a certain HQueue when a Put matches the IC configuration; the message can be a raw rowkey or a JSON-formatted string with data from the original Put operation; the configuration supports a variety of OPERATORs, for example STRING_EQ, STRING_NE, INT_LT, INT_GE, FLOAT_EQ, FLOAT_LE, FLOAT_GT; IC can be installed/uninstalled on HTables dynamically, without restarting region servers or disabling tables
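The operator-based matching described above can be sketched as a small predicate evaluator. This is a minimal illustration: the operator names come from the talk, but the configuration format and the match_put() helper are hypothetical, not the real coprocessor API.

```python
# Sketch of Increment Coprocessor (IC) condition matching.
# Operator names (STRING_EQ, INT_LT, ...) are from the talk;
# the config format and match_put() helper are hypothetical.

OPERATORS = {
    "STRING_EQ": lambda a, b: str(a) == str(b),
    "STRING_NE": lambda a, b: str(a) != str(b),
    "INT_LT":    lambda a, b: int(a) < int(b),
    "INT_GE":    lambda a, b: int(a) >= int(b),
    "FLOAT_EQ":  lambda a, b: float(a) == float(b),
    "FLOAT_LE":  lambda a, b: float(a) <= float(b),
    "FLOAT_GT":  lambda a, b: float(a) > float(b),
}

def match_put(put_columns, conditions):
    """Return True if every configured condition matches the Put.

    put_columns: dict mapping 'family:qualifier' -> value
    conditions:  list of (column, operator, operand) tuples
    """
    for column, op, operand in conditions:
        value = put_columns.get(column)
        if value is None or not OPERATORS[op](value, operand):
            return False
    return True

# A Put that sets a status column and a price column:
put = {"info:status": "ONLINE", "info:price": "42"}
conds = [("info:status", "STRING_EQ", "ONLINE"),
         ("info:price", "INT_GE", "10")]
print(match_put(put, conds))  # True -> enqueue a message to the HQueue
```

When a Put matches, the coprocessor would emit the rowkey (or a JSON message built from the Put) to the configured HQueue.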
  • Add support for HQueue: Put, Get, Scan by partition id or by HQueue name. Add a scannerOpenWithTimeRange method to scan an HTable with both start/end rowkeys and start/end timestamps. Add a scanner auto-release mechanism to close timed-out scanners, preventing them from crashing the Thrift Server process. Add several scanner-related metrics to Ganglia, so that we can track the status of the Thrift server's scanners
  • (1) MemStore flush: hbase.hregion.max.filesize
    (3) HLog: io.file.buffer.size, hbase.regionserver.hlog.replication, hbase.regionserver.logroll.multiplier, hbase.regionserver.maxlogs
    (4) Compaction: hbase.regionserver.thread.compaction.large, hbase.regionserver.thread.compaction.small, hbase.regionserver.thread.compaction.throttle, hbase.hstore.compaction.ratio, hbase.hstore.compaction.ratio.offpeak, hbase.offpeak.start.hour, hbase.offpeak.end.hour, hbase.hstore.compaction.min (in 0.90: hbase.hstore.compactionThreshold), hbase.hstore.compaction.max, hbase.hstore.compaction.min.size, hbase.hstore.compaction.max.size, hbase.hregion.majorcompaction
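As an example of the compaction-related keys above, off-peak compaction can be configured in hbase-site.xml like this (the values are illustrative, not the ones used in production):

```xml
<!-- hbase-site.xml: allow a more aggressive compaction ratio at night -->
<property>
  <name>hbase.hstore.compaction.ratio</name>
  <value>1.2</value>
</property>
<property>
  <name>hbase.hstore.compaction.ratio.offpeak</name>
  <value>5.0</value>
</property>
<property>
  <!-- off-peak window starts at 1 AM -->
  <name>hbase.offpeak.start.hour</name>
  <value>1</value>
</property>
<property>
  <!-- off-peak window ends at 6 AM -->
  <name>hbase.offpeak.end.hour</name>
  <value>6</value>
</property>
```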
  • Append & Increment also return data, so they are counted as 'READ requests'. Write operations cannot be limited at the coprocessor layer, but most of the time, limiting reads indirectly limits writes

    1. Alibaba HBase Design and Practice in the Field of Search. Victor Xu, Senior Search R&D Engineer, Alibaba
    2. Who am I? • Hi, I'm Victor Xu (name in Chinese characters: '徐斌'), and I also have a nickname in Alibaba Group: '雨田' or 'Yu Tian'. • I have been working at Alibaba Group since I graduated from Huazhong University of Science and Technology, Wuhan, China in 2009. • I mainly focus on the Web Spider and HBase Storage fields in the Search Technology Department of Alibaba. Contacts: Email: Phone: +1(949) 288-0697 Skype: victorunique Weibo: 淘宝雨田
    3. Agenda • Overview • Improvements • Maintenance • Extensional Projects • Q & A
    4. Overview
    5. Upgrade History: 2010.08 HBase 0.20.5 → 2012.04 HBase 0.92.1 → 2013.03 HBase 0.94.5 → 2013.09 HBase 0.94.10 → NOW; TODO: HBase 0.98.X
    6. Architecture
    7. Features of our scenario: YARN, HDFS and HBase coexist; intensive random reads; various client types (Java API, MR Job, iStream, Thrift …)
    8. HBase Improvements
    9. Increment Coprocessor: an incremental trigger mechanism to transmit an application's real-time messages synchronously.
    10. Other Assistant Coprocessors
        • Compare: compare a Put's data with the corresponding stored data, and add additional data (a constant value) into the Put if it matches some conditions.
        • Trace: compare a Put's data with the corresponding stored data, and remove columns of the Put if they are the same as the ones in storage, leaving only the changed columns.
        • Copy: compare a column of a Put with some constant value (or another column in the same Put), and copy values to other columns according to the configuration.
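The Trace coprocessor's behavior, stripping unchanged columns from a Put, can be sketched as a simple diff. The dict-based representation of a Put is an illustration, not the real coprocessor API:

```python
def trace_filter(put_columns, stored_columns):
    """Keep only the columns of a Put whose values differ from storage.

    Mirrors the Trace coprocessor idea: columns identical to the stored
    data are dropped, so only the changed columns are actually written.
    """
    return {col: val for col, val in put_columns.items()
            if stored_columns.get(col) != val}

stored = {"cf:title": "HBase", "cf:price": "10"}
put    = {"cf:title": "HBase", "cf:price": "12"}
print(trace_filter(put, stored))  # {'cf:price': '12'}
```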
    11. Thrift Server • API improvements • Scanner auto-release • Metrics added to Ganglia. (Diagram: C/C++, Python, PHP, Java, … clients → Thrift Server → HBase Cluster)
    12. Constant Family Size Region Split Policy • Set a constant size limit for each family; if none is set, use the region max size limit instead. • A region split is triggered if any family reaches its size limit. • The split point is determined by the family that exceeds its size limit by the largest proportion. (Diagram: original vs. new region split policy across families F1, F2, F3.) For example: SizeOf(F1) = 5M, SizeOf(F2) = 15M, SizeOf(F3) = 10M; LimitOf(F1) = 10M, LimitOf(F2) = 14M, LimitOf(F3) = 8M
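The family-selection rule above can be sketched in a few lines (a minimal illustration; the function name and dict layout are hypothetical, and the numbers are the ones from the slide):

```python
def choose_split_family(sizes, limits):
    """Return the family to split on, or None if no family is over limit.

    A split is triggered when any family reaches its size limit; the
    split point comes from the family exceeding its limit by the
    largest proportion (size / limit).
    """
    ratios = {f: sizes[f] / limits[f] for f in sizes}
    worst = max(ratios, key=ratios.get)
    return worst if ratios[worst] >= 1.0 else None

# The example from the slide (sizes and limits in MB):
sizes  = {"F1": 5,  "F2": 15, "F3": 10}
limits = {"F1": 10, "F2": 14, "F3": 8}
print(choose_split_family(sizes, limits))  # F3: 10/8 = 1.25 exceeds most
```

F2 is also over its limit (15/14 ≈ 1.07), but F3 exceeds its limit by a larger proportion, so F3 determines the split point.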
    13. Local Region Scanner: open and read (scan) an online region on the local machine in 'READ ONLY' mode, so that we can debug or trace region issues without changing the region server's code. (Diagram: a client calls the Local Scanner API, which reads the region server's regions through an internal scanner.)
    14. Compaction Rate Limit: check the compaction process periodically; if it spent less time than expected to compact a certain amount of data (for example, 10M), force it to 'sleep' a little while to lower the total compaction rate. (HBASE-8329)
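The throttling arithmetic is simple: if the bytes compacted so far took less time than the rate cap allows, sleep for the difference. A minimal sketch (the function name and parameters are illustrative, not the HBASE-8329 API):

```python
def throttle_sleep(bytes_compacted, elapsed_s, max_rate_bytes_per_s):
    """Seconds to sleep so the effective compaction rate stays at or
    below max_rate_bytes_per_s (the HBASE-8329 idea in miniature)."""
    expected_s = bytes_compacted / max_rate_bytes_per_s
    return max(0.0, expected_s - elapsed_s)

# Compacted 10 MB in 0.2 s, but the cap is 20 MB/s: it should have
# taken at least 0.5 s, so sleep the remaining 0.3 s.
MB = 1024 * 1024
print(throttle_sleep(10 * MB, 0.2, 20 * MB))
```

If the compaction is already slower than the cap, the sleep is zero and the compaction proceeds untouched.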
    15. Online Reload Configurations: MemStoreFlush, Split, HLog, Compaction and more… (Diagram: Step 1: push the new hbase-site.xml to the HMaster and HRegionServers; Step 2: notify them via the HBase Shell to reload.)
    16. Web-based Query Tool: 1) entry on the Master's status page; 2) query by ROWKEY or URL; 3) supports multiple versions; 4) filter by column family
    17. Cluster Maintenance
    18. Rolling Upgrade. Why do we need this? HBase upgrades are always a nightmare for users and administrators; we needed a way to upgrade without a whole-cluster shutdown. When can we use it? Rolling upgrade means upgrading region servers one by one (or group by group), so any IPC protocol version difference between region servers, or between a region server and the master, will cause catastrophic failures. How did we use it? Select a group of 10–15 region servers, move all regions out of these nodes concurrently, shut the servers down, switch to the new version, restart the region servers, and move all regions back to their original locations.
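The group-by-group flow above can be sketched as a dry-run driver (the step names stand in for the real operations of moving regions, stopping processes, and switching binaries; nothing here is HBase API):

```python
def rolling_upgrade(servers, group_size=10):
    """Dry-run sketch of the rolling-upgrade flow described above.

    For each group of region servers: drain regions, stop and switch
    to the new version, restart, then move the regions back.
    """
    log = []
    for i in range(0, len(servers), group_size):
        group = servers[i:i + group_size]
        log.append(("move regions out", group))   # drain concurrently
        log.append(("stop + switch version", group))
        log.append(("restart", group))
        log.append(("move regions back", group))  # restore locality
    return log

steps = rolling_upgrade([f"rs{n}" for n in range(25)], group_size=10)
print(len(steps))  # 3 groups (10 + 10 + 5 servers) x 4 steps = 12
```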
    19. Cluster Availability Control • Blocks all client requests when the HBase cluster is shutting down. • Adds a 'cluster unavailable' znode on HBase's ZooKeeper; when it exists, all HBase clients (Java API, YARN apps, ThriftServer) are blocked. • Uses the master's coprocessor to control the znode. (Diagram: YARN, iStream, M/R, Java API and Thrift Server clients check ZooKeeper before accessing HBase.)
    20. HBase Monitor (1): collect the cluster's read & write request statistics, and display a QPS graph using OpenTSDB.
    21. HBase Monitor (2) • Use a 'region probe' (actually, a specially defined Scan operation) to request a single row from each region on each region server. • Collect the response times and tag them NORMAL or TIMEOUT. • Draw a graph of the proportion of NORMAL regions among all regions of a cluster using OpenTSDB. • Send out warning messages (via Email, AliWangWang, SMS or VoiceCall) if the proportion of TIMEOUT regions on a single region server (or in a single HTable) reaches a certain limit.
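The probe aggregation step can be sketched as follows. The 1 s timeout and 5% warning threshold are illustrative assumptions; the talk does not give the exact values:

```python
def probe_health(response_times_ms, timeout_ms=1000, warn_ratio=0.05):
    """Tag each region probe NORMAL/TIMEOUT and decide whether to warn.

    response_times_ms: dict mapping region name -> probe latency in ms.
    Returns (timeout_ratio, should_warn). The timeout and warning
    threshold defaults are assumptions, not values from the talk.
    """
    total = len(response_times_ms)
    timeouts = sum(1 for t in response_times_ms.values() if t > timeout_ms)
    ratio = timeouts / total
    return ratio, ratio >= warn_ratio

probes = {"r1": 12, "r2": 40, "r3": 2500, "r4": 8, "r5": 15}
print(probe_health(probes))  # (0.2, True) -> send Email/SMS warning
```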
    22. Request Profiler: a real-time profiler for any slow request, filtered by processing time or response size.
    23. Region Server Read Rate Limit • Launch region-level read rate control if the total size of the past 10 s of read operations reaches the RegionServer's read limit (100M/s). • Throw IOExceptions when a certain region reaches its own limit. • Cancel the region-level control if all regions obey their limits for the past 10 s. (Diagram: a Traffic Controller in the region server gates read requests (Scan, Get, Append, Increment) to regions A, B, C, …)
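The sliding-window mechanics can be sketched as below. The 10 s window and the server-level cap are from the slide; the per-region limit value and all class/method names are assumptions for illustration:

```python
import time
from collections import deque

class ReadRateLimiter:
    """Sketch of the region-server read rate control described above.

    Tracks read bytes in a sliding window; when the server-wide total
    exceeds its cap, region-level control kicks in and any region over
    its own limit gets an IOError (standing in for HBase's IOException).
    """
    def __init__(self, server_limit_per_s, region_limit_per_s, window_s=10):
        self.server_cap = server_limit_per_s * window_s
        self.region_cap = region_limit_per_s * window_s
        self.window_s = window_s
        self.events = deque()  # (timestamp, region, nbytes)

    def record_read(self, region, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, region, nbytes))
        # Drop events that fell out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        server_total = sum(n for _, _, n in self.events)
        if server_total > self.server_cap:  # region-level control is on
            region_total = sum(n for _, r, n in self.events if r == region)
            if region_total > self.region_cap:
                raise IOError(f"read rate limit exceeded for region {region}")

limiter = ReadRateLimiter(server_limit_per_s=100, region_limit_per_s=40)
limiter.record_read("A", 60, now=0.0)       # under the server cap: fine
try:
    limiter.record_read("A", 950, now=1.0)  # server and region over limit
except IOError as e:
    print(e)
```

Clients see the exception only while the server as a whole is over its cap, which matches the "cancel the region-level control" bullet: once the window drains, reads proceed normally again.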
    24. Extensional Projects
    25. HQueue (1). What's HQueue? It's a distributed and persistent message-oriented middleware based on HBase. Features:
        • HQueue is based on HTable, so it also supports auto-failover, multi-partition (HTable regions) and load balancing.
        • It's a persistent storage system. (HBase HLog & HDFS append)
        • High performance in both read and write.
        • Messages are classified by 'Topic'. (HBase column qualifier)
        • Supports a TTL mechanism. (HBase's KeyValue-level TTL)
        • As a lightweight wrapper over HTable, HQueue can be upgraded together with the HBase cluster seamlessly.
        • Allows map/reduce jobs to assign tasks based on locality.
        • Supports a subscribe mechanism.
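One plausible way to lay a queue over HTable's sorted rowkeys, given the partition and topic mapping above, is a partition-prefixed rowkey. This is an assumption for illustration; the deck does not show HQueue's actual encoding:

```python
import struct

def hqueue_rowkey(partition_id, message_id):
    """A plausible HQueue-on-HTable rowkey (hypothetical encoding):
    a 2-byte big-endian partition prefix maps messages to HTable
    regions/partitions, and an 8-byte message id keeps messages in
    write order within a partition, since HBase stores rows sorted."""
    return struct.pack(">HQ", partition_id, message_id)

# Messages within a partition sort by id, and partitions stay
# separated, so scanning one partition prefix reads its messages
# in order; the topic would go into the column qualifier.
k1 = hqueue_rowkey(3, 100)
k2 = hqueue_rowkey(3, 101)
k3 = hqueue_rowkey(4, 1)
print(k1 < k2 < k3)  # True
```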
    26. HQueue (2)
    27. OpenTSDB. What's OpenTSDB? OpenTSDB is an open-source, distributed time series database designed to monitor large clusters of commodity machines at an unprecedented level of granularity. It allows operations teams to keep track of all the metrics exposed by operating systems, applications and network equipment, and makes the data easily accessible. Performance: thanks to HBase's scalability, OpenTSDB allows you to collect thousands of metrics from tens of thousands of hosts and applications at a high rate (every few seconds); OpenTSDB will never delete or downsample data and can easily store hundreds of billions of data points. Where do we use it? Etao Search Dump Stats, Web Crawler Stats, HBase Monitor and more…
    28. Phoenix. What's Phoenix? Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data. Principle: Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. Performance: direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.