
HBaseCon 2013: Near Real Time Indexing for eBay Search

Presented by: Swati Agarwal (eBay) and Raj Tanneru (eBay)



  1. Near Real Time Indexing for eBay Search (Swati Agarwal & Raj Tanneru)
  2. AGENDA • Overview • Indexing Pipeline • Performance Enhancements • Search Data Acquisition • Challenges in Data Acquisition • Q&A
  3. OVERVIEW • Build indexes for all new updates and apply them in query servers in “near real time” • Freshness is important for a better search experience for end users (diagram labels: Writes, Updates, Queries)
  4. DESIGN GOALS • Reduce average time taken to propagate user updates • Handle large update volume • Reduce variability in performance • Allow horizontal scaling • Support distributed search servers • Improve monitoring
  5. WHY HBASE? • Scalable: supports storing large volumes of data • Feature rich: efficient scans and key-based lookups • No schema • Support for versioning • Data consistency • Good support from the open-source community
  6. INDEXING PIPELINE OVERVIEW (diagram): the eBay worldwide data store feeds the HBase item + seller data store through two paths, an event stream (event-based updates) and a bulk data loader (batch updates every few hours); a distributor then ships delta updates (every few mins) and the full index to the query servers
  7. HIGH LEVEL APPROACH • Building a full index takes hours due to data-set size • The number of items changed every minute is much smaller • Identify updates in time window t1 – t2 (time-range scan) • Build a ‘mini index’ on only the last X minutes of changes using Map-Reduce • Mini indices are copied to and consumed in near real time by query servers
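The approach above can be sketched in plain Python; the in-memory item store, field names, and shortened timestamps are illustrative assumptions, not eBay's actual code:

```python
# Sketch: build a "mini index" over only the items changed in a time window,
# instead of reindexing the whole data set.

def changed_items(store, t1, t2):
    """Yield (item_id, item) for items last modified in [t1, t2)."""
    for item_id, item in store.items():
        if t1 <= item["last_modified"] < t2:
            yield item_id, item

def build_mini_index(store, t1, t2):
    """Inverted index (term -> sorted item ids) over the changed items only."""
    index = {}
    for item_id, item in changed_items(store, t1, t2):
        for term in item["title"].lower().split():
            index.setdefault(term, set()).add(item_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Timestamps shortened for illustration: 315 stands for 3:15 pm.
store = {
    12357899: {"title": "iPod Nano", "last_modified": 315},
    14535788: {"title": "Xbox 360", "last_modified": 319},
    14535789: {"title": "Blue Shirt", "last_modified": 330},
}

# Only the two items changed between t=315 and t=320 are indexed.
mini = build_mini_index(store, 315, 320)
```

Because each mini index covers only a few minutes of changes, it can be built and shipped to query servers far faster than the hours-long full index build.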
  8. IDENTIFY UPDATES IN A TIME WINDOW • Column family to track last modified time • Utilize the ‘time range scan’ feature of HBase

     HBASE ITEM TABLE (rows with a change-set version between 3:15 – 3:20 pm are picked up):

     ROWKEY     | MAIN DATA (VERSION = 1)      | NRT_CHANGE_SET (VERSION = Inf, TTL)
                | ITEM #   SELLER   TITLE      | CHANGE_SET   TIME (VERSION)
     12357899   | 1234     4444     Ipod..     | ALL          3:15 pm
                |                              | BID          3:18 pm
     14535788   | 6776     3344     Xbox …     | ALL          3:19 pm
     14535788   | 4566     5553     Shirt …    | ALL          3:30 pm
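The time-range scan can be sketched as a plain-Python stand-in for HBase's `Scan.setTimeRange` over the versioned change-set family; timestamps are shortened so that 315 stands for 3:15 pm, and the third row key is adjusted to be unique for illustration:

```python
# Sketch: a time-range scan over a versioned change-set column family finds
# rows modified in a window. Every write to an item also appends a cell here,
# so the cell timestamp (version) is the item's modified time.

# cells: rowkey -> list of (timestamp, change_set_value)
change_set = {
    12357899: [(315, "ALL"), (318, "BID")],
    14535788: [(319, "ALL")],
    14535789: [(330, "ALL")],
}

def time_range_scan(cells, t1, t2):
    """Return row keys having at least one cell version in [t1, t2)."""
    return sorted(
        row for row, versions in cells.items()
        if any(t1 <= ts < t2 for ts, _ in versions)
    )

changed = time_range_scan(change_set, 315, 320)  # rows changed 3:15 - 3:20 pm
```

In real HBase the region server skips HFiles whose time ranges fall entirely outside the scan window, which is what makes this cheap compared with a full scan.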
  9. INDEXING USING MAP REDUCE (diagram)
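The map-reduce shape of the index build can be sketched in pure Python, as a stand-in for the actual Hadoop job: mappers tokenize changed items into (term, item id) pairs, the shuffle groups pairs by term, and reducers emit postings lists. The tokenization and data shapes are illustrative assumptions.

```python
# Pure-Python sketch of the map / shuffle / reduce phases of index building.
from collections import defaultdict

def map_phase(items):
    # Mapper: emit one (term, item_id) pair per title token.
    for item_id, title in items:
        for term in title.lower().split():
            yield term, item_id

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sort and dedup each postings list.
    return {term: sorted(set(ids)) for term, ids in grouped.items()}

items = [(1, "red shoes"), (2, "red hat"), (3, "blue shoes")]
postings = reduce_phase(shuffle(map_phase(items)))
```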
  10. JOB MONITORING • Counters – HBase scan time – HBase random read time – HDFS I/O times – CPU time, etc. • Job logs • Hadoop application monitoring system based on OpenTSDB • Cluster monitoring – Ganglia – Nagios alerts • Cloudera Manager (CDH4)
  11. UNSTABLE JOB PERFORMANCE (chart: job run time over time)
  12. PERFORMANCE DIAGNOSIS • Slow HBase – Excessive flushes – Too many HFiles – Major compaction at peak traffic hours • Bad nodes in the cluster – Machine restarts – Slow disks – Data node / region server / task tracker not running on the same machine • Slow HDFS – HBase RPC timeouts – HDFS I/O timeouts or slowness • Job scheduler – Even with the highest priority for NRT jobs, preemption time is on the order of minutes
  13. HBASE IMPROVEMENTS • HBase schema – Using version = 1 – Setting TTL where version ≠ 1 • HBase – Optimized read/write cache • hbase.regionserver.global.memstore.upperLimit = 0.25 (previously 0.1) • hbase.regionserver.global.memstore.lowerLimit = 0.24 (previously 0.09) – Optimized scanner performance • Increased hbase.regionserver.metahandler.count (prefetch region info) – Optimized flush size • hbase.hregion.memstore.flush.size = 500 MB • hbase.hregion.memstore.block.multiplier = 4 (previously 8) – Optimized major compaction • Major compaction at off-peak hours • Increased frequency of major compaction to decrease the number of HFiles
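The memstore and flush settings quoted above would normally live in hbase-site.xml; a fragment with the slide's values follows (the byte value for the 500 MB flush size is an illustrative conversion, since the setting takes bytes):

```xml
<!-- hbase-site.xml fragment with the tuned values from this slide -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.25</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.24</value>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>524288000</value> <!-- 500 MB, expressed in bytes -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
```

Note the narrow gap between the upper and lower global memstore limits (0.25 vs. 0.24): when a flush is triggered, only a small amount needs to drain before writes resume, which reduces write stalls.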
  14. FUTURE DIRECTION • Reduce map-reduce initialization overhead – Stand-alone framework to build Near Real Time indices – YARN (next-generation map-reduce) • Co-processors • Improved monitoring
  15. SEARCH DATA ACQUISITION
  16. SEARCH DATA ACQUISITION - NRT
  17. EVENT STREAM CONSUMER • Consumer receives events in batches • Event processing – Load item – Transform item – Write item – Read item • Event life cycle – Success – Failure/Abandon – Retry
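The event life cycle above (success, failure/abandon, retry) can be sketched in Python; the handler, batch shape, and retry limit are illustrative assumptions, not the actual consumer code:

```python
# Sketch: process a batch of events, retrying each up to a limit before
# abandoning it, as in the success / failure / abandon / retry life cycle.

MAX_ATTEMPTS = 3  # illustrative retry limit

def process_batch(events, handle):
    """Run the handler over a batch; return (succeeded, abandoned) event ids."""
    succeeded, abandoned = [], []
    for event in events:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handle(event)          # load -> transform -> write -> read
                succeeded.append(event["id"])
                break
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    abandoned.append(event["id"])  # give up after retries

    return succeeded, abandoned

# A handler that fails permanently for one poison event:
def handler(event):
    if event["id"] == "bad":
        raise RuntimeError("transform failed")

ok, dead = process_batch([{"id": "a"}, {"id": "bad"}, {"id": "b"}], handler)
```

Abandoned events would typically be parked somewhere inspectable (e.g. an audit table) rather than silently dropped, so they can be replayed after the underlying fault is fixed.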
  18. HBASE DATA MODEL • Three tables (active item, completed item, seller) • Up to four column families – Main – Partial Document – Change Set – Audit Trail • 100s of columns • Notion of compound and multi-value fields
  19. CHALLENGES IN DATA ACQUISITION • Multiple data centers – One cluster per data center – Independent of each other • High update rate • Event processing order – via source modified time • Handle two acquisition pipelines without collisions • Reload data with minimal impact to existing jobs
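Ordering by source modified time can be sketched as last-writer-wins on the source timestamp, so events arriving out of order from the two pipelines cannot regress an item; the field names and table shape are illustrative assumptions:

```python
# Sketch: apply an event only if its source-modified time is not older than
# what is already stored, so late-arriving stale events are ignored.

def apply_event(table, event):
    row = table.get(event["item_id"])
    if row is None or event["src_modified"] >= row["src_modified"]:
        table[event["item_id"]] = {
            "title": event["title"],
            "src_modified": event["src_modified"],
        }

table = {}
# Events for the same item arrive out of order across the two pipelines:
for ev in [
    {"item_id": 1, "title": "new title", "src_modified": 200},
    {"item_id": 1, "title": "old title", "src_modified": 100},  # stale: skipped
]:
    apply_event(table, ev)
```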
  20. OPTIMIZATIONS • Ensure there is no update while a record is being purged • Reduce HBase RPC timeout in the consumer • Wrapper script to detect idle/non-responsive region servers • Audit trail column family for debugging • HTable pool
  21. STATS • 1.2 billion completed items in HBase • 600 million active items in HBase • 1.4 terabytes of data processed per day • 400 million puts to HBase per day • 250 million search metrics per day
  22. Thank you! Questions?
