A real time search index for e-commerce
Umesh Prasad
Thejus V M
Oh!! Out Of Stock
Damn !! Out of Stock
Damn !! Missed the Offer
E-commerce Index Attributes
[Diagram: the index documents, PRODUCT (aka SKU) and LISTING, take their attributes from the catalogue service, Promise Engine, Availability Service, Seller Rating, Offer Engine and Pricing Engine.]
Out Of Stock, but Why Show?
Index has Stale Availability Data
234K Products
Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing
Challenge 1 : Update rates
                     updates / sec         max updates / hr
                     min       max
text / catalogue     ~10       ~100        ~100K
pricing              ~100      ~1K         ~10 million
availability         ~100      ~10K        ~10 million
offer                ~100      ~10K        ~10 million
seller rating        ~10       ~1K         ~1 million
signal 6             ~10       ~100        ~1 million
signal 7             ~100      ~10K        ~10 million
signal 8             ~100      ~10K        ~10 million
Challenge 2 : Lucene Index Update
● Lucene doesn’t support Partial Updates.
● Update = Delete Old Doc + Add New Document
– Recreate the entire document for every update
– Not friendly with multiple micro-services with different update rates
● Problem Compounded By MarketPlace
● Product + All Its Listings == SINGLE BLOCK
● BLOCK structure chosen for query performance (~100X better latencies)
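A minimal sketch of why block updates are expensive, assuming a toy append-only index rather than Lucene itself. The class `ToyBlockIndex`, its methods and the sample product are hypothetical names chosen for illustration. The layout (listings first, product last) is an assumption modelled on block-join indexing. The point is only that changing one listing's price rewrites the product and every one of its listings.

```python
class ToyBlockIndex:
    """Append-only toy index; written documents are never modified in place."""

    def __init__(self):
        self.docs = []         # (block_key, doc) in write order; list position = doc id
        self.deleted = set()   # doc ids masked out by deletes
        self.docs_written = 0  # total documents written so far

    def add_block(self, block_key, product, listings):
        # Block layout assumed here: listings first, product last.
        for doc in [*listings, product]:
            self.docs.append((block_key, dict(doc)))
            self.docs_written += 1

    def live_block(self, block_key):
        # All documents of the block that have not been deleted.
        return [doc for doc_id, (key, doc) in enumerate(self.docs)
                if key == block_key and doc_id not in self.deleted]

    def update_listing(self, block_key, listing_id, **changes):
        *listings, product = self.live_block(block_key)
        listings = [{**doc, **changes} if doc["listing_id"] == listing_id else doc
                    for doc in listings]
        # Update = delete the old block + add the whole block again.
        for doc_id, (key, _) in enumerate(self.docs):
            if key == block_key:
                self.deleted.add(doc_id)
        self.add_block(block_key, product, listings)


index = ToyBlockIndex()
index.add_block(
    "P1",
    product={"product_id": "P1", "brand": "Apple"},
    listings=[{"listing_id": f"L{i}", "price": 45000, "available": True}
              for i in range(3)],
)
written_before = index.docs_written            # 3 listings + 1 product
index.update_listing("P1", "L1", price=44000)  # change one field of one listing
rewritten = index.docs_written - written_before
print(rewritten)  # number of documents rewritten for a single price change
```

With many listings per product and several services each changing one field, the rewrite cost per update grows with the block size, not with the size of the change.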
Challenge 3 : Refresh Cycle
[Diagram: Ingestion pipeline → batch of documents → Solr Master (Commit, fsync) → Replication → six Solr Slaves, each of which opens a new index.]
Documents
  ProductA   brand : Apple     availability : T   price : 45000
  ProductB   brand : Samsung   availability : F   price : 23000
  ProductC   brand : Apple     availability : T   price : 5000

Lucene Index → Lucene Segment
  Document ID Mappings
    0 ProductA
    1 ProductB
    2 ProductC
  Posting List (Inverted Index): Terms → Sparse Bitsets
    availability : T   → 0 , 2
    brand : Samsung    → 1
    brand : Apple      → 0 , 2
  DocValues (columnar data)
    Price: 45000 23000 5000
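The three structures on this slide can be sketched in a few lines. This is an illustration of the idea only, not Lucene's on-disk format; the function `build_segment` and its parameters are hypothetical.

```python
from collections import defaultdict

# The three documents from the slide, in the order they are written.
docs = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "F", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "T", "price": 5000}),
]


def build_segment(docs, inverted_fields, docvalue_fields):
    """Build doc-id mappings, posting lists and columnar doc values."""
    doc_ids = {}                                    # internal doc id -> external id
    postings = defaultdict(list)                    # (field, value) -> sorted doc ids
    doc_values = {f: [] for f in docvalue_fields}   # field -> column indexed by doc id
    for doc_id, (external_id, fields) in enumerate(docs):
        doc_ids[doc_id] = external_id
        for f in inverted_fields:
            postings[(f, fields[f])].append(doc_id)
        for f in docvalue_fields:
            doc_values[f].append(fields[f])
    return doc_ids, dict(postings), doc_values


doc_ids, postings, doc_values = build_segment(
    docs, inverted_fields=["brand", "availability"], docvalue_fields=["price"]
)
print(postings[("brand", "Apple")])  # doc ids matching brand : Apple
print(doc_values["price"])           # price column in doc-id order
```

Everything is keyed by the internal doc id, which is assigned in write order. That is why any change to a document's position or content touches all three structures.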
Root Cause : Updating Data Structures
[Diagram: a Document holds Thousands of Terms (Term1, Term2, Term3, Term4, ...). The POSTING LIST holds Millions of Terms (Term 1 → BitSet 1, Term 2 → BitSet 2, Term 3 → BitSet 3, ...), and each bitset spans Millions of Documents.]
Posting List / Bit Set                   Updatable ?
D  : 0 1 0 0 0 0 1 0 0 0 0 0 0 1         Yes
S  : 2,7,14                              May Be
SE : 2,5,7                               NO
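The three rows can be reproduced with a short sketch. It reads D as a dense bitset, S as a sorted list of set positions and SE as that list stored as gaps; these readings are an assumption based on the slide's numbers (positions counted from 1), and the helper names are hypothetical.

```python
from bisect import insort
from itertools import accumulate

# D: dense bitset from the slide, one entry per document (positions start at 1).
dense = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]


def dense_to_sparse(bits):
    """S: sorted positions of the set bits."""
    return [pos for pos, bit in enumerate(bits, start=1) if bit]


def delta_encode(sorted_ids):
    """SE: store each position as the gap from the previous one."""
    return [cur - prev for prev, cur in zip([0, *sorted_ids], sorted_ids)]


def delta_decode(gaps):
    return list(accumulate(gaps))


sparse = dense_to_sparse(dense)
gaps = delta_encode(sparse)
print(sparse, gaps)

# Adding document 4 to each representation:
dense[4 - 1] = 1      # D: flip one bit in place
insort(sparse, 4)     # S: insert and shift everything after it
gaps_after = delta_encode(sorted({*delta_decode(gaps), 4}))  # SE: decode, insert, re-encode
print(gaps_after)
```

The dense form updates in place, the sorted list needs a shift, and the gap-encoded form has to be decoded and re-encoded because every later gap depends on its predecessor. That dependency is what makes the compact encoding hard to update.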
Outline
❏ E-commerce search Challenge
❏ Challenges in Keeping an Inverted Index Updated
❏ Our approach to Near Real Time indexing
A Typical Search Flow
[Diagram: Query → Query Rewrite → Matching, Ranking, Faceting, Stats, Other Components → Results. The stages read from the Lucene Segment (Posting List = Inverted Index, Doc Values = Forward Index) and from the NRT Store.]
NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
● Hook it to Lucene
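To make the memory consideration concrete, here is a rough size estimate per column. It uses the per-type sizes listed in the speaker notes (boolean: 1 bit, enum: log(#enumerations) bits, int: 4 bytes). The function names and the document count of one million are hypothetical, and real heap usage would include object overhead that this ignores.

```python
import math


def bits_per_value(kind, cardinality=None):
    """Bits needed per document for one column, by data type."""
    if kind == "boolean":
        return 1
    if kind == "enum":
        # ceil(log2(cardinality)) computed with integer arithmetic, at least 1 bit
        return max(1, (cardinality - 1).bit_length())
    if kind == "int":
        return 32
    raise ValueError(f"unknown kind: {kind}")


def column_bytes(kind, num_docs, cardinality=None):
    """Packed size in bytes of one column over num_docs documents."""
    return math.ceil(bits_per_value(kind, cardinality) * num_docs / 8)


num_docs = 1_000_000  # hypothetical index size
print(column_bytes("boolean", num_docs))             # e.g. availability
print(column_bytes("enum", num_docs, cardinality=5)) # e.g. a 5-valued tag
print(column_bytes("int", num_docs))                 # e.g. price
```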
NRT Store - Forward Index Naive
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
NRT Forward Index
  ProductId   Availability   Price
  ProductA    True           100
  ProductB    False          150
  ProductC    False          200
  ProductD    True           250
Lookup Engine: (DocId : 3, field : price) → ProductId(3) → <ProductC,price> → 200
Latency : ~10 secs for ~1 Million lookups
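A minimal sketch of the naive lookup path, using the table from the slide. The names `segment_product_ids`, `nrt_store` and `lookup_naive` are hypothetical. Each lookup first resolves the Lucene doc id to a product id and then does a keyed lookup in the store, so every matched document pays for two hops and a string hash.

```python
# DocId -> ProductId, as stored in the Lucene segment.
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id (values from the slide's table).
nrt_store = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def lookup_naive(doc_id, field):
    product_id = segment_product_ids[doc_id]  # hop 1: doc id -> external id
    return nrt_store[product_id][field]       # hop 2: hash lookup by string key


price_of_product_c = lookup_naive(2, "price")
print(price_of_product_c)
```

At the 99th percentile quoted earlier (~1 million matches per query), this per-document indirection is repeated about a million times per query, which is the cost the latency figure on the slide refers to.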
NRT Store - Forward Index Optimized
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
DocId - NrtId
  0 → 3
  1 → 0
  2 → 1
  3 → 2
NRT Forward Index (Segment Independent), NrtId → ProductId
  0 ProductA
  1 ProductC
  2 ProductD
  3 ProductB
  Availability: T F F T
  Price:        100 200 250 150
Lookup Engine: (DocId : 3, field : price) → NrtId(3) → 2 → Price(2) → 200
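The optimized layout can be sketched as plain arrays: one DocId → NrtId array per segment, resolved once when the segment is opened, and one column per field indexed by NrtId. This is an illustrative sketch with hypothetical names (`build_doc_to_nrt`, `lookup`, `apply_update`), not the production implementation. The data follows the slide's tables.

```python
from array import array

# Segment-independent store: NrtId -> ProductId, plus one column per field.
nrt_product_ids = ["ProductA", "ProductC", "ProductD", "ProductB"]
nrt_id_of = {pid: nrt_id for nrt_id, pid in enumerate(nrt_product_ids)}
columns = {
    "price": array("i", [100, 200, 250, 150]),
    "availability": [True, False, False, True],
}

# DocId -> ProductId for one Lucene segment.
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]


def build_doc_to_nrt(segment_product_ids):
    """Resolve string ids once per segment, so query-time lookups avoid hashing."""
    return array("i", (nrt_id_of[pid] for pid in segment_product_ids))


doc_to_nrt = build_doc_to_nrt(segment_product_ids)


def lookup(doc_id, field):
    # Two array reads per lookup, no string handling.
    return columns[field][doc_to_nrt[doc_id]]


def apply_update(product_id, field, value):
    # Updates write into the column; the segment mapping is untouched.
    columns[field][nrt_id_of[product_id]] = value


before = lookup(2, "price")            # doc 2 is ProductC
apply_update("ProductC", "price", 180)
after = lookup(2, "price")
print(doc_to_nrt.tolist(), before, after)
```

Because the columns are independent of any segment, an update becomes visible to queries without a commit or an index reopen; only the small DocId → NrtId array has to be rebuilt when a new segment appears.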
NRT Store - Inverted Index
[Diagram: NRT Forward Store → NRT Inverter → NRT Invert Store]
Lucene Segment (DocId → ProductId)
  0 ProductB
  1 ProductA
  2 ProductC
  3 ProductD
NRT Invert Store
  Availability : T   → 0 3
  Offer : O1         → 2 3
Query Availability:T → Matching BitSet
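A sketch of what the inverter does, assuming it walks one segment in Lucene doc-id order and sets a bit per (field, value) pair. The per-product values in `forward_store` are assumptions chosen so the result matches the posting lists on the slide. The function names are hypothetical, and a Python integer stands in for a bitset.

```python
# DocId -> ProductId for one Lucene segment (from the slide).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# Forward data per product; values assumed for illustration.
forward_store = {
    "ProductA": {"availability": "F", "offer": None},
    "ProductB": {"availability": "T", "offer": None},
    "ProductC": {"availability": "F", "offer": "O1"},
    "ProductD": {"availability": "T", "offer": "O1"},
}


def invert(segment_product_ids, forward_store):
    """Build (field, value) -> bitset, with bit i set when doc id i matches."""
    inverted = {}
    for doc_id, product_id in enumerate(segment_product_ids):
        for field, value in forward_store[product_id].items():
            if value is None:
                continue
            key = (field, value)
            inverted[key] = inverted.get(key, 0) | (1 << doc_id)
    return inverted


def doc_ids(bitset):
    """Expand a bitset into a sorted list of doc ids."""
    return [i for i in range(bitset.bit_length()) if (bitset >> i) & 1]


inverted = invert(segment_product_ids, forward_store)
available = doc_ids(inverted[("availability", "T")])
on_offer = doc_ids(inverted[("offer", "O1")])
print(available, on_offer)
```

Since the bitsets follow the segment's internal doc order, they can be handed to the matching stage like any other posting list. The cost is that they must be rebuilt periodically as the forward data changes.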
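Near Real Time Solr Architecture
[Diagram: updates for Catalogue, Pricing, Availability, Offers, Seller Quality and Others enter the Ingestion pipeline. Text Updates go to the Solr Master and reach Solr through Commit + Replicate + Reopen of the Lucene index. NRT Updates flow through Kafka into Solr, where the NRT Forward Index and the NRT Inverted store serve Ranking, Matching and Faceting. Redis is used for Bootstrap.]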
Accomplishments
● Real time consumption for Ranking Signals
● BBD (Big Billion Day) saw up to ~30K updates/second
● Query latency comparable to DocValues
– Consistent 99% performance
Thank you
&
Questions
A Typical Search Flow
[Diagram: Query → Query Rewrite → Matching, Ranking, Faceting, Stats, Other Components → Results. The stages read from the Lucene Index (Posting List = Inverted Index, Doc Values = Forward Index) and from the NRT Store; a Schema is shown alongside each.]
Lucene Index

  docId   External ID   Brand    Availability   Price
  0       ProductA      Adidas   True           250
  1       ProductB      Adidas   False          230
  2       ProductC      Nike     True           500

Terms Dictionary and Posting List (inverted index)

  term ord   term                 posting list
  0          availability:true    0,2
  1          availability:false   1
  0          brand:adidas         0,1
  1          brand:nike           2
  1          price:230            1
  2          price:250            0

Doc Value (Forward index): term ords per docId

  field          0   1   2
  price          2   2   3
  brand          0   0   1
  availability   0   1   0

● Lucene Index = Multiple Mini Indexes aka Segments
● Lucene Segment
○ Write Once → Immutable Data structures
○ Posting List (Sparse encoded bitsets)
○ Doc Values (Columnar Data structures)
C5 : Lucene in-place update
● Only numeric / byte Array fields
● Updates still go through the entire refresh cycle
● Not exposed via Solr
Forward Index - API Hook
● Lucene API Hook
– ValueSource
● Input
– Lucene Internal Document Id
– Field Name
● Output
– Field Value
NRT Store - Inverted Index
● Input
– Lucene Segment
– query
• Field Name : Field Value
• offer : o1
● Output
– DocSet (posting list)

Slash n near real time indexing


Editor's Notes

  • #4 Going from a Page 1 to Page could be a matter of seconds on Sales Day ( Big Billion Day)
  • #6 Hierarchical documents (Product → Listing); highly structured: free text, numeric, tags; micro-services for individual field updates; different update rates; independently updating fields
  • #7 Availability has been used in ranking, but it is stale, hence OOS (out of stock). Explain challenge of 234K
  • #9 Means, the entire index will be recreated every hour
  • #10 Product documents + seller SKU documents; block-join index; block: composite document, with product and all its seller SKUs. Con: any update = delete + recreate entire block, which aggravates the delete + recreate problem
  • #11 Remove animation, don’t spend too much time on it.
  • #12 Posting =
  • #15 Keep the fast-changing data outside of the index; update this data independently of Solr updates; hooks in Lucene/Solr for retrieval: ValueSource, Filter, Collector
  • #17 Explain the API Hook
  • #18 Lucene APIs: internal document id. Columnar data structures; implementation dependent on data type; chosen for memory efficiency: boolean: 1 bit; enum: log(#enumerations) bits; int: 4 bytes; multi val: array of the above data structures
  • #19 Filter API of Lucene: DocIdSet getDocIdSet(LuceneIndex). Invert data to adhere to Lucene's internal order at regular intervals of time
  • #24 Extract segment structure in a different slide
  • #25 Extract segment structure in a different slide