HBaseCon 2013: Near Real Time Indexing for eBay Search
1.
Near Real Time Indexing
for eBay Search
Swati Agarwal & Raj Tanneru
2.
AGENDA
• Overview
• Indexing pipeline
• Performance Enhancements
• Search Data Acquisition
• Challenges in Data Acquisition
• Q&A
3.
OVERVIEW
[Diagram: user updates arrive as writes; indexes are built from them and applied on the query servers, which serve queries]
Build indexes for all new updates and apply them on query servers in
“near real time”
Freshness is important for a better search experience for end users!
4.
DESIGN GOALS
• Reduce average time taken to propagate user updates
• Handle large update volume
• Reduce variability in performance
• Allow horizontal scaling
• Support distributed search servers
• Improve monitoring
5.
WHY HBASE ?
• Scalable – supports storing large volumes of data
• Feature rich -- efficient scans and key based lookups
• No schema
• Support for versioning
• Data consistency
• Good support from open-source community
6.
INDEXING PIPELINE OVERVIEW
[Diagram: the eBay worldwide data store feeds the HBase item + seller data store two ways: an event stream delivers event-based updates, and a bulk data loader applies batch updates every few hours. From HBase, a distributor ships the full index plus delta updates (every few mins) to the query servers.]
7.
HIGH LEVEL APPROACH
• Building a full index takes hours due to data-set size
• The number of items changed every minute is much smaller
• Identify updates in time window t1 – t2 (Timerange scan)
• Build a ‘mini index’ only on the last X minutes of changes using Map-Reduce
• Mini indices are copied and consumed in near real time by query servers
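The delta-index idea above can be sketched in a few lines of pure Python. This is illustrative only: the real pipeline is a Map-Reduce job scanning HBase, and the record layout and helper names here are invented.

```python
from datetime import datetime, timedelta

# Toy corpus: item id -> (title, last-modified time). In production this
# lives in the HBase item table; a dict stands in for it here.
now = datetime(2013, 6, 13, 15, 20)
items = {
    12357899: ("ipod nano", now - timedelta(minutes=3)),
    14535788: ("xbox 360", now - timedelta(minutes=1)),
    99999999: ("old listing", now - timedelta(hours=5)),
}

def changed_between(table, t1, t2):
    """Select only items modified in [t1, t2) -- the analogue of an
    HBase time-range scan."""
    return {k: title for k, (title, ts) in table.items() if t1 <= ts < t2}

def build_mini_index(docs):
    """Build an inverted index over just the changed documents."""
    index = {}
    for item_id, title in docs.items():
        for term in title.split():
            index.setdefault(term, set()).add(item_id)
    return index

delta = changed_between(items, now - timedelta(minutes=5), now)
mini_index = build_mini_index(delta)
```

Because the mini index covers only the delta, it stays small enough to build and ship to query servers every few minutes.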
8.
IDENTIFY UPDATES IN A TIME WINDOW
• Column Family to track last modified time
• Utilize ‘time range scan’ feature of HBase
HBASE ITEM TABLE

ROWKEY     | MAIN DATA (VERSION = 1)     | NRT_CHANGE_SET (VERSION = Inf, TTL)
           | ITEM #  SELLER  TITLE       | CHANGE_SET  TIME (VERSION)
-----------|-----------------------------|------------------------------------
12357899   | 1234    4444    Ipod..      | ALL         3:15 pm
           |                             | BID         3:18 pm
14535788   | 6776    3344    Xbox …      | ALL         3:19 pm
14535788   | 4566    5553    Shirt …     | ALL         3:30 pm

Items changed between 3:15 – 3:20 pm
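A pure-Python sketch of how a scan over the versioned NRT_CHANGE_SET family picks up deltas: every version of the change-set cell within the window is returned, and untouched rows are skipped entirely. The cell layout and helper names are invented; in HBase this corresponds to a scan with a time range set.

```python
# Each NRT_CHANGE_SET cell keeps every version (VERSION = Inf) as a
# (timestamp, change_set) pair; a TTL eventually expires old versions.
# Timestamps below are seconds, standing in for the 3:15 pm / 3:18 pm
# values on the slide.
change_sets = {
    12357899: [(915, "ALL"), (918, "BID")],
    14535788: [(919, "ALL")],
}

def time_range_scan(table, t1, t2):
    """Return rowkey -> cell versions whose timestamp falls in [t1, t2),
    mimicking an HBase time-range scan that skips unmodified rows."""
    hits = {}
    for rowkey, versions in table.items():
        in_window = [(ts, cs) for ts, cs in versions if t1 <= ts < t2]
        if in_window:
            hits[rowkey] = in_window
    return hits

delta = time_range_scan(change_sets, 915, 920)
```

Keeping all versions in the change-set family (rather than one) is what lets a single scan see both the ALL and BID changes for the same row inside one window.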
10.
JOB MONITORING
• Counters
– HBase Scan time
– HBase Random read time
– HDFS I/O times
– CPU time etc.
• Job logs
• Hadoop application monitoring system based on OpenTSDB
• Cluster monitoring
– Ganglia
– Nagios Alerts
• Cloudera Manager (CDH4)
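The counters listed above behave like Hadoop job counters: each task keeps local tallies that are summed into job-level totals at the end. A minimal stdlib sketch of that aggregation (the counter names are invented for illustration):

```python
from collections import Counter

# Each map task reports its own timing counters (milliseconds).
task_counters = [
    Counter({"HBASE_SCAN_MS": 1200, "HDFS_WRITE_MS": 300}),
    Counter({"HBASE_SCAN_MS": 900, "HBASE_RANDOM_READ_MS": 150}),
]

def aggregate(counters):
    """Sum counters across all tasks, as the framework does for a job."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

job_totals = aggregate(task_counters)
```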
12.
PERFORMANCE DIAGNOSIS
• Slow HBase
– Excessive flush
– Too many HFiles
– Major compaction at peak traffic hours
• Bad nodes in the cluster
– Machine restarts
– Slow disk
– Data node / Region server / task tracker not running on same machine
• Slow HDFS
– HBase RPC timeouts
– HDFS I/O timeout or slowness
• Job Scheduler
– Even with the highest priority for NRT jobs, preemption time is on the order of minutes
13.
HBASE IMPROVEMENTS
• HBase Schema
– Using version = 1
– Setting TTL where version ≠ 1
• HBase
– Optimized read/write cache
• hbase.regionserver.global.memstore.upperLimit = 0.25 (previously 0.1)
• hbase.regionserver.global.memstore.lowerLimit = 0.24 (previously 0.09)
– Optimized scanner performance
• Increased hbase.regionserver.metahandler.count (prefetch region info)
– Optimized flush size
• hbase.hregion.memstore.flush.size = 500MB
• hbase.hregion.memstore.block.multiplier = 4 (previously 8)
– Optimized Major Compaction
• Major compaction at off peak hours
• Increased frequency of major compaction to decrease number of HFiles
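The two schema settings above amount to two different cell-retention policies: the main data family keeps only the newest version, while the change-set family keeps every version but expires them by TTL. A stdlib sketch of that retention logic (the cell layout is invented for illustration):

```python
import time

def retain(versions, max_versions=None, ttl=None, now=None):
    """Apply HBase-style cell retention: keep at most max_versions of the
    newest cells, and drop cells older than ttl seconds."""
    now = time.time() if now is None else now
    kept = sorted(versions, key=lambda v: v[0], reverse=True)  # newest first
    if ttl is not None:
        kept = [(ts, val) for ts, val in kept if now - ts <= ttl]
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept

cells = [(100, "v1"), (200, "v2"), (300, "v3")]
# Main data family: VERSION = 1 -> only the newest cell survives.
main = retain(cells, max_versions=1, now=400)
# Change-set family: VERSION = Inf with a TTL -> old cells expire instead.
nrt = retain(cells, ttl=250, now=400)
```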
14.
FUTURE DIRECTION
• Reduce map reduce initialization overhead
– Standalone framework to build Near Real Time indices
– YARN (next generation map reduce)
• Co-Processors
• Improved monitoring
18.
HBASE DATA MODEL
• Three tables (active item, completed item, seller)
• Up to four column families
– Main
– Partial Document
– Change Set
– Audit Trail
• 100s of columns
• Notion of compound and multi value fields
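One way to realize the “compound and multi value fields” mentioned above is to pack several values into a single column value using separator bytes. The encoding below is an invented illustration of the idea, not eBay's actual format:

```python
MULTI_SEP = "\x01"   # separates values of a multi-value field
PAIR_SEP = "\x02"    # separates sub-fields inside a compound value

def encode_multi(values):
    """Serialize a multi-value field into one column value."""
    return MULTI_SEP.join(values)

def encode_compound(pairs):
    """Serialize (name, value) sub-fields into one compound column value."""
    return MULTI_SEP.join(f"{k}{PAIR_SEP}{v}" for k, v in pairs)

def decode_compound(raw):
    """Recover the (name, value) sub-fields from a compound value."""
    return [tuple(entry.split(PAIR_SEP, 1)) for entry in raw.split(MULTI_SEP)]

raw = encode_compound([("color", "red"), ("size", "M")])
```

Packing this way keeps the column count bounded even when a field has many values, at the cost of decoding on read.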
19.
CHALLENGES IN DATA ACQUISITION
• Multiple data centers
– One cluster per data center
– Independent of each other
• High update rate
• Event processing order – via source modified time
• Handle two acquisition pipelines without collisions
• Reload data with minimal impact to existing jobs
20.
OPTIMIZATIONS
• Ensure there is no update when a record is being purged
• Reduce HBase RPC timeout in the consumer
• Wrapper script to detect idle/non responsive region servers
• Audit trail column family for debugging
• HTable pool
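The table pool above helps because, in that era of HBase, per-table client handles were costly to create and not safely shared across threads. The same bounded-pool pattern in a stdlib sketch (the `Table` class is a stand-in, not the HBase API):

```python
import queue

class Table:
    """Stand-in for a per-thread HBase table handle (creation is costly)."""
    def __init__(self, name):
        self.name = name

class TablePool:
    """Bounded pool of reusable table handles, in the spirit of HTablePool."""
    def __init__(self, name, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(Table(name))

    def get(self):
        return self._pool.get()       # blocks if the pool is exhausted

    def put_back(self, table):
        self._pool.put(table)

pool = TablePool("item", size=2)
t = pool.get()
pool.put_back(t)                      # handle is reused, not recreated
```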
21.
STATS
• 1.2 billion completed items in HBase
• 600 million active items in HBase
• 1.4 terabytes of data processed per day
• 400 million puts in HBase per day
• 250 million search metrics per day