HBaseCon 2013: Near Real Time Indexing for eBay Search

Presented by: Swati Agarwal (eBay) and Raj Tanneru (eBay)


  • Site activity is continuous, but bulk indexes are built infrequently. It takes over 6 hours to build a full index.
  • to reflect user updates in indexes

Transcript

  • 1. Near Real Time Indexing for eBay Search. Swati Agarwal & Raj Tanneru
  • 2. AGENDA
    • Overview
    • Indexing pipeline
    • Performance Enhancements
    • Search Data Acquisition
    • Challenges in Data Acquisition
    • Q&A
  • 3. OVERVIEW
    • Build indexes for all new updates and apply them in query servers in “near real time”
    • Freshness is important for a better search experience for end users
    • (Diagram labels: Writes, Updates, Queries)
  • 4. DESIGN GOALS
    • Reduce average time taken to propagate user updates
    • Handle large update volume
    • Reduce variability in performance
    • Allow horizontal scaling
    • Support distributed search servers
    • Improved monitoring
  • 5. WHY HBASE?
    • Scalable: supports storing large volumes of data
    • Feature rich: efficient scans and key-based lookups
    • No schema
    • Support for versioning
    • Data consistency
    • Good support from open-source community
  • 6. INDEXING PIPELINE OVERVIEW
    (Diagram labels: eBay World Wide; Data Store; Event stream; Bulk Data Loader; HBase item + seller data store; Distributor; Query Servers; Event based update; Batch Update (every few hours); Delta updates (every few mins); Full Index)
  • 7. HIGH LEVEL APPROACH
    • Building a full index takes hours due to the size of the data set
    • The number of items changed each minute is much smaller
    • Identify updates in a time window t1 – t2 (time range scan)
    • Build a ‘mini index’ only on the last X minutes of changes using MapReduce
    • Mini indices are copied and consumed in near real time by query servers
  • 8. IDENTIFY UPDATES IN A TIME WINDOW
    • Column family to track last modified time
    • Utilize the ‘time range scan’ feature of HBase
    HBASE ITEM TABLE
    Column families: MAIN DATA (VERSION = 1); NRT_CHANGE_SET (VERSION = Inf, TTL)

    ROWKEY     ITEM #   SELLER   TITLE      CHANGE_SET   TIME (VERSION)
    12357899   1234     4444     Ipod..     ALL          3:15 pm
                                            BID          3:18 pm
    14535788   6776     3344     Xbox …     ALL          3:19 pm
    14535788   4566     5553     Shirt …    ALL          3:30 pm

    Items changed between 3:15 – 3:20 pm
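The time range scan on slide 8 can be illustrated with a small in-memory model. This is only a sketch, not the presenters' code: the `ChangeCell` type, the `items_changed_between` function, the `item-N` row keys and the calendar date are hypothetical. In the HBase Java client the corresponding call is `Scan.setTimeRange(minStamp, maxStamp)`, which selects cell versions in the half-open interval [minStamp, maxStamp); the sketch mirrors that interval.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeCell:
    row_key: str        # item row key (hypothetical values below)
    change_set: str     # e.g. "ALL" or "BID"
    modified: datetime  # stands in for the cell's version timestamp


def items_changed_between(cells, start, end):
    """Return row keys with at least one change in [start, end), in first-seen order."""
    changed = []
    for cell in cells:
        if start <= cell.modified < end and cell.row_key not in changed:
            changed.append(cell.row_key)
    return changed


def at(hour, minute):
    # The date is arbitrary; only the time of day matters for this example.
    return datetime(2013, 6, 13, hour, minute)


cells = [
    ChangeCell("item-1", "ALL", at(15, 15)),
    ChangeCell("item-1", "BID", at(15, 18)),
    ChangeCell("item-2", "ALL", at(15, 19)),
    ChangeCell("item-3", "ALL", at(15, 30)),
]

# Items changed between 3:15 and 3:20 pm feed the next mini index.
changed = items_changed_between(cells, at(15, 15), at(15, 20))
print(changed)
```

Because the interval is half-open, consecutive windows (3:15 to 3:20, then 3:20 to 3:25) neither overlap nor miss a change that lands exactly on a boundary.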
  • 9. INDEXING USING MAP REDUCE
  • 10. JOB MONITORING
    • Counters
      – HBase scan time
      – HBase random read time
      – HDFS I/O times
      – CPU time, etc.
    • Job logs
    • Hadoop application monitoring system based on OpenTSDB
    • Cluster monitoring
      – Ganglia
      – Nagios alerts
    • Cloudera Manager (CDH4)
  • 11. UNSTABLE JOB PERFORMANCE (chart; axis label: Time)
  • 12. PERFORMANCE DIAGNOSIS
    • Slow HBase
      – Excessive flush
      – Too many HFiles
      – Major compaction at peak traffic hours
    • Bad nodes in the cluster
      – Machine restarts
      – Slow disk
      – Data node / region server / task tracker not running on the same machine
    • Slow HDFS
      – HBase RPC timeouts
      – HDFS I/O timeout or slowness
    • Job scheduler
      – Even with highest priority for NRT jobs, preemption time is on the order of minutes
  • 13. HBASE IMPROVEMENTS
    • HBase schema
      – Using version = 1
      – Setting TTL where version ≠ 1
    • HBase
      – Optimized read/write cache
        hbase.regionserver.global.memstore.upperLimit = 0.25 (previously 0.1)
        hbase.regionserver.global.memstore.lowerLimit = 0.24 (previously 0.09)
      – Optimized scanner performance
        Increased hbase.regionserver.metahandler.count (prefetch region info)
      – Optimized flush size
        hbase.hregion.memstore.flush.size = 500MB
        hbase.hregion.memstore.block.multiplier = 4 (previously 8)
      – Optimized major compaction
        Major compaction at off-peak hours
        Increased frequency of major compaction to decrease the number of HFiles
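To see what the memstore settings on slide 13 mean in bytes, the arithmetic can be sketched as follows. The 16 GiB region server heap is an assumption made purely for illustration (the talk does not state a heap size), and `memstore_thresholds` is a hypothetical helper, not an HBase API. The interpretation used here follows the HBase configuration documentation: the global upper limit is the fraction of heap all memstores may occupy before updates are blocked, the lower limit is the level that forced flushing brings usage back down to, and a region blocks updates once its memstore reaches flush size times the block multiplier.

```python
MB = 1024 ** 2
GB = 1024 ** 3

HEAP_BYTES = 16 * GB  # assumed region server heap; not stated in the talk


def memstore_thresholds(heap_bytes, upper_limit, lower_limit,
                        flush_size_bytes, block_multiplier):
    """Translate the memstore settings into absolute byte thresholds."""
    return {
        # All memstores together: updates are blocked above this.
        "global_block_bytes": int(heap_bytes * upper_limit),
        # Forced flushing continues until usage falls to this.
        "global_flush_target_bytes": int(heap_bytes * lower_limit),
        # A single region's memstore is flushed at this size.
        "region_flush_bytes": flush_size_bytes,
        # A single region blocks updates at flush size * multiplier.
        "region_block_bytes": flush_size_bytes * block_multiplier,
    }


tuned = memstore_thresholds(
    HEAP_BYTES,
    upper_limit=0.25,
    lower_limit=0.24,
    flush_size_bytes=500 * MB,
    block_multiplier=4,
)

for name, value in tuned.items():
    print(f"{name}: {value / MB:.0f} MB")
```

Under the assumed heap, raising the upper limit from 0.1 to 0.25 moves the global ceiling from about 1.6 GiB to 4 GiB, and a region with a 500 MB flush size and a multiplier of 4 blocks writers at 2000 MB.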
  • 14. FUTURE DIRECTION
    • Reduce MapReduce initialization overhead
      – Standalone framework to build Near Real Time indices
      – YARN (next-generation MapReduce)
    • Co-processors
    • Improved monitoring
  • 15. SEARCH DATA ACQUISITION
  • 16. SEARCH DATA ACQUISITION - NRT
  • 17. EVENT STREAM CONSUMER
    • Consumer receives events in batches
    • Event processing
      – Load item
      – Transform item
      – Write item
      – Read item
    • Event life cycle
      – Success
      – Failure/Abandon
      – Retry
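The consumer loop on slide 17 might look roughly like the sketch below. Everything here is hypothetical: the `Outcome` enum, the `MAX_ATTEMPTS` retry budget, the dictionary standing in for the HBase table, and the reading of the final "Read item" step as a read-back check of the write are assumptions, not details given in the talk.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    ABANDON = "abandon"


MAX_ATTEMPTS = 3  # hypothetical retry budget, not from the talk


def process_event(event, store, load_item, transform_item):
    """Load, transform, write, then read the item back to confirm the write."""
    item = load_item(event["item_id"])
    document = transform_item(item)
    store[event["item_id"]] = document
    return store.get(event["item_id"]) == document


def consume_batch(events, store, load_item, transform_item):
    """Process one batch and return the life-cycle outcome per item id."""
    outcomes = {}
    for event in events:
        event["attempts"] = event.get("attempts", 0) + 1
        try:
            ok = process_event(event, store, load_item, transform_item)
        except Exception:  # a failure in any step counts as a failed attempt
            ok = False
        if ok:
            outcomes[event["item_id"]] = Outcome.SUCCESS
        elif event["attempts"] < MAX_ATTEMPTS:
            outcomes[event["item_id"]] = Outcome.RETRY
        else:
            outcomes[event["item_id"]] = Outcome.ABANDON
    return outcomes


source = {"item-1": {"title": "ipod"}}   # stand-in for the system of record
store = {}                                # stand-in for the HBase table
events = [
    {"item_id": "item-1"},
    {"item_id": "item-2"},                 # unknown item, first attempt
    {"item_id": "item-3", "attempts": 2},  # unknown item, final attempt
]
outcomes = consume_batch(
    events,
    store,
    load_item=lambda item_id: source[item_id],
    transform_item=lambda item: {"title": item["title"].upper()},
)
print(outcomes)
```

Keeping the attempt count on the event itself lets a retried event re-enter a later batch without the consumer holding extra state.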
  • 18. HBASE DATA MODEL
    • Three tables (active item, completed item, seller)
    • Up to four column families
      – Main
      – Partial Document
      – Change Set
      – Audit Trail
    • 100s of columns
    • Notion of compound and multi-value fields
  • 19. CHALLENGES IN DATA ACQUISITION
    • Multiple data centers
      – One cluster per data center
      – Independent of each other
    • High update rate
    • Event processing order
      – via source modified time
    • Handle two acquisition pipelines without collisions
    • Reload data with minimal impact to existing jobs
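Ordering events "via source modified time" (slide 19) can be sketched as a last-writer-wins rule keyed on the timestamp assigned by the source system rather than on arrival time. The talk does not say how this was implemented; the `apply_update` function and the dictionary-based table below are hypothetical. In HBase, one way to get a similar effect is to write each cell with an explicit timestamp taken from the source.

```python
def apply_update(table, row_key, fields, source_modified_ms):
    """Apply an update only if it is newer than what the table already holds.

    Returns True when the update was written, False when it was stale
    (or a duplicate) and therefore skipped.
    """
    current = table.get(row_key)
    if current is not None and current["modified_ms"] >= source_modified_ms:
        return False
    table[row_key] = {"fields": dict(fields), "modified_ms": source_modified_ms}
    return True


table = {}

# Events from two pipelines arrive out of order for the same item.
first = apply_update(table, "item-1", {"price": 10}, source_modified_ms=2000)
late = apply_update(table, "item-1", {"price": 9}, source_modified_ms=1000)
newer = apply_update(table, "item-1", {"price": 12}, source_modified_ms=3000)

print(first, late, newer, table["item-1"])
```

Because the decision depends only on the source timestamp, both acquisition pipelines can write the same row without coordinating with each other, and replaying old events during a reload cannot overwrite fresher data.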
  • 20. OPTIMIZATIONS
    • Ensure there is no update while a record is being purged
    • Reduce HBase RPC timeout in the consumer
    • Wrapper script to detect idle or non-responsive region servers
    • Audit trail column family for debugging
    • HTable pool
  • 21. STATS
    • 1.2 billion completed items in HBase
    • 600 million active items in HBase
    • 1.4 terabytes of data processed per day
    • 400 million puts in HBase per day
    • 250 million search metrics per day
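For a sense of scale, the daily figures on slide 21 can be converted to average per-second rates. This is back-of-the-envelope arithmetic only: it assumes load is spread evenly over the day (real traffic is likely peakier) and that "terabytes" means decimal terabytes; `per_second` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(per_day):
    """Average rate per second for a quantity measured per day."""
    return per_day / SECONDS_PER_DAY


puts_per_second = per_second(400_000_000)
bytes_per_second = per_second(1.4e12)  # assumes 1 TB = 10**12 bytes

print(f"{puts_per_second:.0f} puts/s, {bytes_per_second / 1e6:.1f} MB/s")
```

That works out to roughly 4,630 puts per second and about 16 MB/s on average.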
  • 22. Thank you Questions??