HBaseCon 2013: Near Real Time Indexing for eBay Search

Presented by: Swati Agarwal (eBay) and Raj Tanneru (eBay)

  • Site activity is continuous, but bulk indexes are built infrequently; a full index takes over 6 hours to build
  • Near real time indexing is needed to reflect user updates in indexes
    1. Near Real Time Indexing for eBay Search – Swati Agarwal & Raj Tanneru
    2. AGENDA
       • Overview
       • Indexing pipeline
       • Performance enhancements
       • Search data acquisition
       • Challenges in data acquisition
       • Q&A
    3. OVERVIEW
       • Build indexes for all new updates and apply them in query servers in "near real time"
       • Freshness is important for a better search experience for end users
       (Diagram: writes and updates flow into the index build; query servers answer queries from the fresh index)
    4. DESIGN GOALS
       • Reduce the average time taken to propagate user updates
       • Handle large update volume
       • Reduce variability in performance
       • Allow horizontal scaling
       • Support distributed search servers
       • Improve monitoring
    5. WHY HBASE?
       • Scalable – supports storing large volumes of data
       • Feature rich – efficient scans and key-based lookups
       • No schema
       • Support for versioning
       • Data consistency
       • Good support from the open-source community
    6. INDEXING PIPELINE OVERVIEW
       (Diagram: the worldwide eBay item + seller data store feeds HBase two ways – an event stream for event-based updates, and a bulk data loader for batch updates every few hours. From HBase, a distributor ships the full index and delta updates, every few mins, to the query servers.)
    7. HIGH LEVEL APPROACH
       • Building a full index takes hours due to data-set size
       • The number of items changed every minute is much smaller
       • Identify updates in a time window t1 – t2 (time-range scan)
       • Build a "mini index" on only the last X minutes of changes using MapReduce
       • Mini indices are copied and consumed in near real time by query servers
    8. IDENTIFY UPDATES IN A TIME WINDOW
       • Column family to track last modified time
       • Utilize the "time range scan" feature of HBase

       HBASE ITEM TABLE (time-range scan selects items changed between 3:15 – 3:20 pm):

       ROWKEY    | MAIN DATA (VERSION = 1)     | NRT_CHANGE_SET (VERSION = Inf, TTL)
                 | ITEM #   SELLER   TITLE     | CHANGE_SET  TIME (VERSION)
       12357899  | 1234     4444     Ipod…     | ALL 3:15 pm; BID 3:18 pm
       14535788  | 6776     3344     Xbox…     | ALL 3:19 pm
       14535788  | 4566     5553     Shirt…    | ALL 3:30 pm
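The time-range lookup on the NRT_CHANGE_SET column family can be modeled with plain data structures. A minimal sketch, assuming versioned change-set cells keyed by modification time (times encoded here as hhmm integers for brevity; rowkeys adapted from the slide; this is not the HBase client API):

```python
# Illustrative model of the change-set time-range scan.
# Each row holds versioned change-set cells: (timestamp, change_type).

def items_changed_between(table, t1, t2):
    """Return row keys that have at least one change-set cell in [t1, t2)."""
    changed = []
    for rowkey, cells in table.items():
        if any(t1 <= ts < t2 for ts, _ in cells):
            changed.append(rowkey)
    return changed

# Example rows, times encoded as hhmm (3:15 pm -> 315)
table = {
    "12357899": [(315, "ALL"), (318, "BID")],
    "14535788": [(319, "ALL")],
    "14535789": [(330, "ALL")],
}
```

In the real pipeline the same selection runs server-side via HBase's time-range scan, so only cells written inside the window are read at all.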
    9. INDEXING USING MAPREDUCE
    10. JOB MONITORING
       • Counters
         – HBase scan time
         – HBase random read time
         – HDFS I/O times
         – CPU time, etc.
       • Job logs
       • Hadoop application monitoring system based on OpenTSDB
       • Cluster monitoring
         – Ganglia
         – Nagios alerts
       • Cloudera Manager (CDH4)
    11. UNSTABLE JOB PERFORMANCE (chart: job run times over time)
    12. PERFORMANCE DIAGNOSIS
       • Slow HBase
         – Excessive flush
         – Too many HFiles
         – Major compaction at peak traffic hours
       • Bad nodes in the cluster
         – Machine restarts
         – Slow disk
         – Data node / region server / task tracker not running on the same machine
       • Slow HDFS
         – HBase RPC timeouts
         – HDFS I/O timeout or slowness
       • Job scheduler
         – Even with the highest priority for NRT jobs, preemption time is on the order of minutes
    13. HBASE IMPROVEMENTS
       • HBase schema
         – Using VERSION = 1
         – Setting TTL where VERSION ≠ 1
       • HBase
         – Optimized read/write cache
           • hbase.regionserver.global.memstore.upperLimit = 0.25 (previously 0.1)
           • hbase.regionserver.global.memstore.lowerLimit = 0.24 (previously 0.09)
         – Optimized scanner performance
           • Increased hbase.regionserver.metahandler.count (prefetch region info)
         – Optimized flush size
           • hbase.hregion.memstore.flush.size = 500 MB
           • hbase.hregion.memstore.block.multiplier = 4 (previously 8)
         – Optimized major compaction
           • Major compaction at off-peak hours
           • Increased frequency of major compaction to decrease the number of HFiles
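The memstore and flush tunings above correspond to hbase-site.xml entries. A sketch using the slide's values (flush size expressed in bytes, as the property requires; verify names and defaults against your HBase version):

```xml
<!-- Values from the slide, as hbase-site.xml overrides -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.25</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.24</value>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>524288000</value> <!-- 500 MB in bytes -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
```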
    14. FUTURE DIRECTION
       • Reduce MapReduce initialization overhead
         – Standalone framework to build near real time indices
         – YARN (next-generation MapReduce)
       • Co-processors
       • Improved monitoring
    15. SEARCH DATA ACQUISITION
    16. SEARCH DATA ACQUISITION – NRT
    17. EVENT STREAM CONSUMER
       • Consumer receives events in batches
       • Event processing
         – Load item
         – Transform item
         – Write item
         – Read item
       • Event life cycle
         – Success
         – Failure/abandon
         – Retry
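The event life cycle above (success, retry, failure/abandon) can be sketched as a simple batch loop; the names and the retry limit are illustrative, not eBay's production code:

```python
MAX_RETRIES = 3  # illustrative; the deck does not state a retry limit

def process_batch(events, handle):
    """Apply handle() (load -> transform -> write -> read) to each event.

    Failures are retried; after MAX_RETRIES attempts the event is
    abandoned rather than blocking the rest of the batch.
    """
    succeeded, abandoned = [], []
    for event in events:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                handle(event)
                succeeded.append(event)
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    abandoned.append(event)  # give up after the final retry
    return succeeded, abandoned
```

A real consumer would distinguish transient errors (retry) from permanent ones (abandon immediately), but the three outcomes match the life cycle on the slide.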
    18. HBASE DATA MODEL
       • Three tables (active item, completed item, seller)
       • Up to four column families
         – Main
         – Partial Document
         – Change Set
         – Audit Trail
       • 100s of columns
       • Notion of compound and multi-value fields
    19. CHALLENGES IN DATA ACQUISITION
       • Multiple data centers
         – One cluster per data center
         – Independent of each other
       • High update rate
       • Event processing order, via source modified time
       • Handle two acquisition pipelines without collisions
       • Reload data with minimal impact to existing jobs
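Ordering by source modified time can be sketched as last-writer-wins: an incoming event is applied only if its source timestamp is newer than what is already stored, so a lagging pipeline cannot overwrite fresher data. A minimal sketch with hypothetical names:

```python
def apply_update(store, item_id, payload, src_modified):
    """Apply the update only if it is newer than the stored version."""
    current = store.get(item_id)
    if current is not None and current[1] >= src_modified:
        return False  # stale event (e.g. from the other pipeline); drop it
    store[item_id] = (payload, src_modified)
    return True
```

This is one way two acquisition pipelines can write the same rows without collisions: whichever event carries the later source modified time wins, regardless of arrival order.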
    20. OPTIMIZATIONS
       • Ensure there is no update while a record is being purged
       • Reduce HBase RPC timeout in the consumer
       • Wrapper script to detect idle/non-responsive region servers
       • Audit trail column family for debugging
       • HTable pool
    21. STATS
       • 1.2 billion completed items in HBase
       • 600 million active items in HBase
       • 1.4 terabytes of data processed per day
       • 400 million HBase puts per day
       • 250 million search metrics per day
    22. Thank you. Questions?
