Slash n near real time indexing

A real time search index for e-commerce
Umesh Prasad
Thejus V M

E-commerce Index Attributes
[Diagram: a PRODUCT (aka SKU) and its LISTINGs, with index attributes supplied by the services below]
● Catalogue Service
● Promise Engine
● Availability Service
● Seller Rating
● Offer Engine
● Pricing Engine

Out Of Stock, but Why Show?
● Index has Stale Availability Data
● 234K Products

Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing

Challenge 1 : Update rates
Source             updates/sec (min)   updates/sec (max)   max updates/hr
text / catalogue   ~10                 ~100                ~100K
pricing            ~100                ~1K                 ~10 million
availability       ~100                ~10K                ~10 million
offer              ~100                ~10K                ~10 million
seller rating      ~10                 ~1K                 ~1 million
signal 6           ~10                 ~100                ~1 million
signal 7           ~100                ~10K                ~10 million
signal 8           ~100                ~10K                ~10 million

Challenge 2 : Lucene Index Update
● Lucene doesn’t support Partial Updates.
● Update = Delete Old Doc + Add New Document
– Recreate the entire document for every update
– Not friendly with multiple micro-services with different update rates
● Problem Compounded By MarketPlace
● Product + All Its Listings == SINGLE BLOCK
● BLOCK structure chosen for query performance (~100X better latencies)
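The cost of "Update = Delete Old Doc + Add New Document" on a block can be sketched with a toy append-only index. This is a minimal illustration in plain Python, not Lucene code; `ToyIndex`, `add_block` and `update_listing` are hypothetical names, and the product and listing values are made up. It shows that changing one listing's price marks every document of the block as deleted and re-adds all of them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Append-only toy index: stored documents are never modified in place."""
    docs: list = field(default_factory=list)    # doc id -> stored fields
    deleted: set = field(default_factory=set)   # doc ids marked as deleted
    blocks: dict = field(default_factory=dict)  # product id -> doc ids of its live block

    def add_block(self, product_id, product, listings):
        """Index a product and all its listings as one contiguous block."""
        ids = []
        for doc in [*listings, product]:        # listings first, product last
            self.docs.append(dict(doc))
            ids.append(len(self.docs) - 1)
        self.blocks[product_id] = ids
        return ids

    def update_listing(self, product_id, listing_id, **changes):
        """Change one listing: delete the whole block, then re-add every document."""
        old_ids = self.blocks[product_id]
        *listings, product = [dict(self.docs[i]) for i in old_ids]
        self.deleted.update(old_ids)            # delete old docs
        for listing in listings:
            if listing["id"] == listing_id:
                listing.update(changes)
        return self.add_block(product_id, product, listings)  # add new docs


index = ToyIndex()
index.add_block("P1", {"id": "P1", "brand": "Apple"},
                [{"id": "L1", "price": 45000}, {"id": "L2", "price": 46000}])
new_ids = index.update_listing("P1", "L1", price=44000)
print(new_ids, sorted(index.deleted))
```

One price change on one listing rewrites three documents here; with many listings per product and several services updating at different rates, the rewrite cost grows accordingly.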

Challenge 3 : Refresh Cycle
[Diagram: Ingestion pipeline → Batch of documents → Solr Master → Solr Slaves (six shown)]
● Solr Master: Commit, fsync
● Replication from the Solr Master to every Solr Slave
● Every Solr Slave: Open new Index

[Diagram: a Lucene Index is made of Lucene Segments; each segment holds Document ID Mappings, a Posting List (Inverted Index) and DocValues (columnar data)]
Documents
  ProductA   brand : Apple     availability : T   price : 45000
  ProductB   brand : Samsung   availability : F   price : 23000
  ProductC   brand : Apple     availability : T   price : 5000
Document ID Mappings
  0 ProductA
  1 ProductB
  2 ProductC
Posting List (Inverted Index): Terms → Sparse Bitsets
  availability : T   0 , 2
  brand : Samsung    1
  brand : Apple      0 , 2
DocValues (columnar data)
  Price   45000 23000 5000
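The three structures in the segment can be rebuilt from the three example products with a few lines of plain Python. This is only a sketch of the idea, not how Lucene lays data out on disk; `build_segment` and its parameters are hypothetical names.

```python
from collections import defaultdict

products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(products, indexed_fields, docvalue_fields):
    """Build doc id mappings, posting lists and columnar doc values."""
    doc_ids = {}                                   # doc id -> external id
    postings = defaultdict(list)                   # "field:value" -> ascending doc ids
    doc_values = {f: [] for f in docvalue_fields}  # field -> column indexed by doc id
    for doc_id, (external_id, fields) in enumerate(products):
        doc_ids[doc_id] = external_id
        for f in indexed_fields:
            postings[f"{f}:{fields[f]}"].append(doc_id)
        for f in docvalue_fields:
            doc_values[f].append(fields[f])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    products, indexed_fields=["brand", "availability"], docvalue_fields=["price"])
print(postings["brand:Apple"], doc_values["price"])
```

The posting lists answer "which documents contain this term", while the doc values column answers "what is this field's value for this document".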

Root Cause : Updating Data Structures
● POSTING LIST: Millions of Terms, each with its own BitSet (Term 1 → BitSet 1, Term 2 → BitSet 2, Term 3 → BitSet 3, ...)
● Each BitSet covers Millions of Documents
● Document: Thousands of Terms (Term1, Term2, Term3, Term4, ...)
● Posting List / Bit Set representations:
  Representation   Example                       Updatable ?
  D                0 1 0 0 0 0 1 0 0 0 0 0 0 1   Yes
  S                2,7,14                        May Be
  SE               2,5,7                         NO
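The three example rows can be reproduced in a short sketch. It assumes, as an interpretation of the example values, that D is a dense bitset, S is the sorted list of set positions (1-based, as on the slide) and SE stores each position as the gap from the previous one. All function names are illustrative.

```python
from bisect import insort


def to_sparse(dense):
    """Sorted 1-based positions of the set bits."""
    return [pos for pos, bit in enumerate(dense, start=1) if bit]


def gap_encode(positions):
    """Store each position as the gap from the previous position."""
    gaps, prev = [], 0
    for pos in positions:
        gaps.append(pos - prev)
        prev = pos
    return gaps


def gap_decode(gaps):
    """Rebuild absolute positions from gaps with a running sum."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def set_bit_dense(dense, pos):
    """Dense: flip one slot in place."""
    dense[pos - 1] = 1


def set_bit_sparse(positions, pos):
    """Sparse: insert into the sorted list, shifting later entries."""
    if pos not in positions:
        insort(positions, pos)


def set_bit_gaps(gaps, pos):
    """Gap encoded: decode, insert, then re-encode the list."""
    positions = gap_decode(gaps)
    set_bit_sparse(positions, pos)
    return gap_encode(positions)


dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
sparse = to_sparse(dense)
gaps = gap_encode(sparse)
print(sparse, gaps, set_bit_gaps(gaps, 4))
```

Setting one bit touches a single slot in the dense form, shifts entries in the sparse form, and changes the gaps that follow the insertion point in the gap-encoded form, which is in line with the Yes / May Be / NO rating.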

A Typical Search Flow
[Diagram: Query → Query Rewrite → Matching → Ranking / Faceting / Stats → Results]
● Lucene Segment
  – Posting List (Inverted Index)
  – Doc Values (Forward Index)
● NRT Store
● Other Components

NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
● Hook it to Lucene

NRT Store - Forward Index Naive
[Diagram: Lucene Segment → Lookup Engine → NRT Forward Index]
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
NRT Forward Index (keyed by ProductId)
  ProductId   Availability   Price
  ProductA    True           100
  ProductB    False          150
  ProductC    False          200
  ProductD    True           250
Lookup: DocId : 2, field : price → ProductId(2) = ProductC → <ProductC,price> → 200
Latency : ~10 secs for ~1 Million lookups
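The naive lookup path can be sketched as two hops per matched document: resolve the Lucene doc id to its product id, then look the product id up in a map. This is an illustrative sketch with zero-based doc ids and hypothetical names, not the authors' implementation.

```python
# Per segment: doc id -> external product id
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id (values from the slide)
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    """Hop 1: doc id -> product id. Hop 2: string-keyed map lookup."""
    product_id = segment_product_ids[doc_id]
    return nrt_forward_index[product_id][field]


print(naive_lookup(2, "price"))
```

Both hops are paid for every matched document, so a query with ~1 million matches performs ~1 million id resolutions and string-keyed lookups.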

NRT Store - Forward Index Optimized
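[Diagram: Lucene Segment → Lookup Engine → NRT Forward Index (Segment Independent)]
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
DocId - NrtId
  DocId   0   1   2   3
  NrtId   3   0   1   2
NRT Forward Index (Segment Independent), indexed by NrtId
  NrtId          0          1          2          3
  ProductId      ProductA   ProductC   ProductD   ProductB
  Availability   T          F          F          T
  Price          100        200        250        150
Lookup: DocId : 2, field : price → NrtId(2) = 1 → Price(1) = 200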
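The optimized layout replaces the string-keyed map with integer indirection: a per-segment DocId → NrtId array, plus one primitive column per field indexed by NrtId. A minimal sketch, assuming zero-based ids and using Python's `array` module as a stand-in for primitive arrays on the Java heap; the function names are illustrative.

```python
from array import array

# Per segment: doc id -> NrtId (rebuilt when a segment is opened)
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment independent columns, indexed by NrtId
price = array("i", [100, 200, 250, 150])
availability = array("b", [1, 0, 0, 1])


def lookup_price(doc_id):
    """Two array reads, no hashing and no string comparison."""
    return price[doc_to_nrt[doc_id]]


def update_price(nrt_id, new_price):
    """An update overwrites one slot; no Lucene segment is touched."""
    price[nrt_id] = new_price


print(lookup_price(2))
```

Because the columns are independent of any segment, a price change is a single in-place write, and it becomes visible to the next lookup without a commit or reopen.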

NRT Store - Invert index
[Diagram: NRT Forward Store → NRT Inverter → NRT Invert Store; the query Availability:T against a Lucene Segment yields a Matching BitSet]
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
NRT Invert Store
  Availability : T   0 3
  Offer : O1         2 3
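One way to picture the inverter is as a pass over a forward column that emits, for a given segment, the bitset of documents whose current value matches the query term. The sketch below assumes the per-segment DocId - NrtId mapping and the availability column from the optimized forward index; whether the store keeps NrtIds or doc ids is not stated on the slide, so resolving to doc ids here is an assumption, and all names are hypothetical.

```python
def invert(doc_to_nrt, column, value):
    """Bitset (one 0/1 entry per doc id) of docs whose NRT value equals `value`."""
    return [1 if column[nrt_id] == value else 0 for nrt_id in doc_to_nrt]


def intersect(a, b):
    """AND two per-segment bitsets, e.g. an NRT match with a Lucene text match."""
    return [x & y for x, y in zip(a, b)]


doc_to_nrt = [3, 0, 1, 2]                  # per segment: doc id -> NrtId
availability = [True, False, False, True]  # indexed by NrtId

available_docs = invert(doc_to_nrt, availability, True)
text_match = [1, 0, 1, 1]                  # made-up bitset from a Lucene posting list
print(available_docs, intersect(available_docs, text_match))
```

The resulting bitset can be combined with ordinary Lucene posting lists during matching, so filters on fast-changing fields do not need a reindex.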

Near Real Time Solr Architecture
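[Diagram: update sources feed an Ingestion pipeline, which reaches Solr over two paths]
● Sources: Catalogue, Pricing, Availability, Offers, Seller Quality, Others
● Text Updates: Ingestion pipeline → Solr Master → Commit + Replicate + Reopen → Solr (Lucene)
● NRT Updates: Ingestion pipeline → Kafka → Solr (NRT Forward Index, NRT Inverted store)
● Redis: Bootstrap
● Solr query components: Ranking, Matching, Faceting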

Accomplishments
● Real time consumption for Ranking Signals
● BBD saw up to ~30K updates/second
● Query latency comparable to DocValues
– Consistent 99% performance

A Typical Search Flow
[Diagram: Query → Query Rewrite → Matching → Ranking / Faceting / Stats → Results]
● Lucene Index
  – Posting List (Inverted Index)
  – Doc Values (Forward Index)
  – Schema
● NRT Store
  – Schema
● Other Components

Lucene Index
Documents
  docId   External ID   Brand    Availability   Price
  0       ProductA      Adidas   True           250
  1       ProductB      Adidas   False          230
  2       ProductC      Nike     True           500
Terms Dictionary and Posting List (inverted index)
  term ord   Term                 Posting List
  0          availability:true    0,2
  1          availability:false   1
  0          brand:adidas         0,1
  1          brand:nike           2
  1          price:230            1
  2          price:250            0
Doc Value (Forward index): term ord per docId
  field          0   1   2
  price          2   1   3
  brand          0   0   1
  availability   0   1   0
● Lucene Index = Multiple Mini Indexes aka Segments
● Lucene Segment
○ Write Once → Immutable Data structures
○ Posting List (Sparse encoded bitsets)
○ Doc Values (Columnar Data structures)

C5 : Lucene in-place update
● Only numeric / byte Array fields
● Updates to go through the entire refresh cycle
● Not exposed via Solr

Forward Index - API Hook
● Lucene API Hook
– ValueSource
● Input
– Lucene Internal Document Id
– Field Name
● Output
– Field Value
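The shape of such a hook can be sketched in Python: a source object is bound to a field, and for each segment it hands back a reader that maps a Lucene internal document id to the field value. This is an analog for illustration only, not Lucene's `ValueSource` class; `NrtValueSource`, `get_values` and the sample data are assumptions.

```python
class NrtValueSource:
    """Python analog of a per-field value hook backed by an NRT forward index."""

    def __init__(self, field_name, columns):
        self.field_name = field_name
        self.columns = columns  # field name -> column indexed by NrtId

    def get_values(self, doc_to_nrt):
        """Bind to one segment and return a doc id -> field value reader."""
        column = self.columns[self.field_name]

        def value(doc_id):
            return column[doc_to_nrt[doc_id]]

        return value


columns = {"price": [100, 200, 250, 150]}
price_source = NrtValueSource("price", columns)
price_of = price_source.get_values([3, 0, 1, 2])  # one segment's DocId - NrtId mapping
print(price_of(2))
```

Ranking and faceting code can then read NRT fields through the same kind of interface they use for doc values, without knowing where the value is stored.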

NRT Store - Inverted Index
● Input
– Lucene Segment
– query
• Field Name : Field Value
• offer : o1
● Output
– DocSet (posting list)
