
near real time search in e-commerce

Topic : Near Real Time Search in e-commerce
Event : Cloud xTech (Expedia's Internal Conference)
Employer : Consultant @ Lucidworks
Attendance Type : External Speaker @Expedia
Place : Taj Vivanta, Gurgaon
Credits : Flipkart (work done), Expedia, Lucidworks
Date : 13th June 2017


  1. OCTOBER 11-14, 2016 • BOSTON, MA
     EXPEDIA INTERNAL CONFERENCE, JUNE 13th 2017
  2. Near Real Time Indexing
     Building Real Time Search Index For E-Commerce
     Umesh Prasad, Advisory & Professional Services @Lucidworks (aka Techie/Individual Contributor)
     Yogesh Ahuja, Senior Search Architect @Lucidworks
  3. ● SOLR/LUCENE
       ○ User → Hacker → Contributor
       ○ [ Lucene 2.1 to 6.5 ]
     ● Advisory/Consulting @ Lucidworks (4 months)
     ● Search & Data Platform @ Flipkart
       ○ 4.5 years
     ● Payments @ Amazon
       ○ 1.4 years
     ● Vertical Search @ Verse Innovation & Naukri (4.8 years)
       ○ LUCENE 2.1
  4. Agenda
     • Ecommerce Search
     • Need for Real Time Search
     • SolrCloud Solution
     • Alternatives
     • First Principle approach
     • Q & A
  5. E-commerce Search
     50 main categories
     500 sub categories
     231 million docs
     - 90 million sku
     - 160 million listings
     - result collapsing
     drill down filters
     top positions at premium
  6. 800K active users
     160K requests per sec
     - 40K service
     - 10k solr
     median : 11 ms
     99th perc : 1.1 sec
  7. !! Flipkart [Sherlock] has BBD Deals[an Offer] ??[expired]
  8. !! Steal Deals !!- Stolen ..
     @Search/Sherlock Team : Please investigate
     Please read the code and review architecture very carefully
  9. CUSTOMER EXPERIENCE
  10. Product / Listing : Important Attributes
      [Diagram labels: Seller Rating Service; Catalogue Service; Promise Service; Availability Service; Offer Service; Pricing Service; Product aka SKU; Listings]
  11. Summary : Lucene Document
      • Product/SKU [Parent Document]
        – Listing [Child Document]
      • Query = Mostly SKU Attributes [Free Text]
      • Filters = SKU + Listing Attributes [Drill Down]
      • Ranking = SKU + Listing Attributes [Explicit/Relevance]
      • Index Time Join aka Block Join
        – [Best Performance]
  12. Out Of Stock, but Why Show?
      Index has Stale Availability Data
      234K Products
  13. S1.1 : SolrCloud : Principal Engineer Lens
      [Diagram labels: Ingestion pipeline; Batch of documents; Shard Leader; For each Document: Versioning, Update Log, Forward to Replica; Auto commit; Soft Commit; Shard Replica (×6), each with Re-open searcher]
  14. S1.2 : SolrCloud : Director Lens
  15. S1.3 : SolrCloud : Monitoring Lens [Very High Update Rates]
                          updates / sec         updates / hr
                          normal     peak
      text / catalogue    ~10        ~100       ~100K
      pricing             ~100       ~1K        ~10 million
      availability        ~100       ~10K       ~10 million
      offer               ~100       ~10K       ~10 million
      seller rating       ~10        ~1K        ~1 million
      signal 6            ~10        ~100       ~1 million
      signal 7            ~100       ~10K       ~10 million
      signal 8            ~100       ~10K       ~10 million
  16. S1.4 : SolrCloud : Principal Engineer Lens
      [Diagram labels: Catalogue, Pricing, Availability, Offers ...; Updates Stream 1, Updates Stream 2, Updates Stream 3; Ingestion pipeline; Change Propagation; Document Builder; Documents {L1,L2 … P1}; Solr/Lucene]
      ● Lucene doesn’t support Partial Updates
      ● Update = Delete + Add
  17. S1.5 : SolrCloud : Internal Email
      • Update = Delete + Add
        – Block Join Index ⇒ Update Whole Block (Product + Listings)
      • Updated Document gets streamed to all replicas in sync
        – Reduces indexing throughput
      • Soft commit is Not Free
        – Soft commit ⇒ In Memory Segment
        – Lots of Merges
        – Huge document churn / deletes
        – All caches still need to be re-generated
        – Filter Cache miss especially hurts performance
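
The "Update = Delete + Add" point on slide 17 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyBlockIndex` and its methods are hypothetical names and not Lucene or Solr APIs. It models an append-only document list with tombstones, and it writes children before the parent because a Lucene block-join block ends with its parent document.

```python
class ToyBlockIndex:
    """Hypothetical append-only index with tombstones (not a Lucene API)."""

    def __init__(self):
        self.docs = []   # (product_id, doc) in index order
        self.live = []   # live[i] becomes False once doc i is deleted

    def _append(self, product_id, doc):
        self.docs.append((product_id, doc))
        self.live.append(True)

    def delete_block(self, product_id):
        # Tombstone every live or dead doc that belongs to this product's block.
        for i, (pid, _) in enumerate(self.docs):
            if pid == product_id:
                self.live[i] = False

    def update_block(self, product, listings):
        # Update = Delete + Add: tombstone the old block, then re-add all of it.
        self.delete_block(product["id"])
        for listing in listings:               # child documents first
            self._append(product["id"], listing)
        self._append(product["id"], product)   # parent document closes the block

    def stats(self):
        live = sum(self.live)
        return {"total": len(self.docs), "live": live,
                "deleted": len(self.docs) - live}


index = ToyBlockIndex()
product = {"id": "P1", "brand": "Apple"}
listings = [{"id": "L1", "price": 45000}, {"id": "L2", "price": 46000}]
index.update_block(product, listings)

# One price change on listing L1 still rewrites the parent and both listings.
listings[0] = {"id": "L1", "price": 44000}
index.update_block(product, listings)
stats = index.stats()
print(stats)
```

After the single price change the toy index holds six documents, three of them deleted, which is the kind of document churn the slide attributes to high update rates.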
  18. S2xx : Alternatives
      1. Updatable DocValues
         a. [LUCENE-5189], [SOLR-5944], [under-the-hood], [benchmarking]
         b. Doesn’t scale to a large number of fields/multi valued fields
      2. Parallel Indexes : Basically Term Partitioned Indexes [Updates in Redis]
         a. ParallelReader : Warning: It is up to you to make sure all indexes are created and modified the same way.
         b. Prototype worked :
            i. Works for small indexes + lots of updates or huge index + daily build
            ii. Not for large index + streaming updates + lots of qps
            iii. Pulling Changes + Document Building killed it
      3. Lucene Codecs API :
         a. Works for Prototype : Research/Algorithms/small index
         b. Not for e-commerce marketplace use case. Index corruption/2 phase commit is difficult.
  19. S3xx : Drawing Board
  20. [Diagram: Lucene Index, Lucene Segment]
      Documents
        ProductA   brand : Apple     availability : T   price : 45000
        ProductB   brand : Samsung   availability : T   price : 23000
        ProductC   brand : Apple     availability : F   price : 5000
      Document ID Mappings
        0 ProductA
        1 ProductB
        2 ProductC
      Posting List (Inverted Index) : Terms → Sparse Bitsets
        brand : Apple      0 , 2
        brand : Samsung    1
        availability : T   0 , 1
      DocValues (columnar data)
        Price   45000   23000   5000
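
The three structures on slide 20 (document ID mapping, posting lists, DocValues column) can be rebuilt from the slide's own three products. The sketch below is a toy model, not Lucene code; `build_segment` and `search` are hypothetical helper names, and plain Python lists stand in for the sparse bitsets.

```python
from collections import defaultdict

# Documents exactly as on the slide; list position is the segment-local doc ID.
docs = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]


def build_segment(docs, indexed_fields, doc_value_fields):
    """Build the doc ID mapping, posting lists and columnar values in one pass."""
    doc_ids = {}
    postings = defaultdict(list)                 # (field, value) -> sorted doc IDs
    doc_values = {f: [] for f in doc_value_fields}
    for doc_id, (product_id, fields) in enumerate(docs):
        doc_ids[doc_id] = product_id
        for f in indexed_fields:
            postings[(f, fields[f])].append(doc_id)
        for f in doc_value_fields:
            doc_values[f].append(fields[f])      # column is indexed by doc ID
    return doc_ids, dict(postings), doc_values


def search(postings, doc_values, terms, sort_field):
    """AND the posting lists of `terms`, then sort by a DocValues column, descending."""
    matched = None
    for term in terms:
        ids = set(postings.get(term, []))
        matched = ids if matched is None else matched & ids
    return sorted(matched or [], key=lambda d: doc_values[sort_field][d], reverse=True)


doc_ids, postings, doc_values = build_segment(
    docs, indexed_fields=["brand", "availability"], doc_value_fields=["price"])
in_stock_by_price = search(postings, doc_values, [("availability", "T")], "price")
print(postings[("brand", "Apple")], in_stock_by_price)
```

Matching reads only the posting lists, while sorting reads only the price column. That split is what makes it possible to move fast-changing fields such as price and availability out of the segment later in the deck.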
  21. A Typical Search Flow
      [Diagram labels: Query; Query Rewrite; Matching; Ranking; Faceting; Stats; Other Components; Results; Lucene Segment; Posting List; Doc Values; Inverted Index; Forward Index; NRT Store]
      Query : samsung mobiles, Offer : exchange offer, price desc
      Rewrite : category : mobiles, brand : samsung, Offer : exchange offer
  22. NRT Forward Index - Considerations
      ● Lookup efficiency
        – 50th percentile : ~10K matches
        – 99th percentile : ~1 million matches
      ● Data on Java heap
        – Memory efficiency
  23. HashMap based Implementation
      Lucene Segment (DocId → ProductId)
        0 ProductB
        1 ProductA
        2 ProductC
        3 ProductD
      NRT Forward Index
        ProductId   Availability   Price
        ProductA    True           100
        ProductB    False          150
        ProductC    False          200
        ProductD    True           250
      Lookup Engine : DocId : 3, field : price → ProductId(3) = ProductD → <ProductD,price> → 250
      Latency : ~10 secs for ~1 Million lookups
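
The lookup path on slide 23 can be sketched as two steps: resolve the segment-local DocId to a ProductId, then read the field from a map keyed by ProductId. The names `lookup` and `apply_update` are hypothetical, and the data is the slide's. The sketch shows the access pattern only and makes no claim about where the reported ~10 secs goes, although resolving and hashing a string key on every lookup is a plausible contributor.

```python
# Segment-local DocId -> ProductId, as stored in the Lucene segment on the slide.
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index kept outside Lucene, keyed by ProductId.
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def lookup(doc_id, field):
    """DocId -> ProductId -> <ProductId, field> -> value."""
    product_id = segment_doc_to_product[doc_id]   # ProductId(3) is "ProductD"
    return nrt_forward_index[product_id][field]


def apply_update(product_id, field, value):
    """An update touches only the map; no Lucene delete, add or commit is involved."""
    nrt_forward_index[product_id][field] = value


before = lookup(3, "price")
apply_update("ProductD", "price", 240)   # hypothetical price change
after = lookup(3, "price")
print(before, after)
```

The update is visible to the very next lookup, which is the property the NRT forward index is after; the cost is one string-keyed map probe per matched document.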
  24. Foreign Key + Array Based Implementation
      Lucene Segment (DocId → ProductId)
        0 ProductB
        1 ProductA
        2 ProductC
        3 ProductD
      DocId - NrtId
        DocId   0   1   2   3
        NrtId   3   0   1   2
      NRT Forward Index (Segment Independent)
        NrtId          0          1          2          3
        ProductId      ProductA   ProductC   ProductD   ProductB
        Price          100        200        250        150
        Availability   T          F          F          T
        Status         01         10         01         00
      Lookup Engine : DocId : 3, field : price → NrtId(3) = 2 → Price(2) = 250
      Latency : ~100 ms for ~1 Million lookups
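
Slide 24 replaces the keyed map with two array reads. A minimal sketch, assuming the DocId-to-NrtId array is built once per segment from the ProductIds (the slides do not say when it is built) and using the slide's values; `build_doc_to_nrt`, `lookup_price` and `apply_price_update` are hypothetical names.

```python
from array import array

# Segment-local DocId -> ProductId, as on the slide.
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# Segment-independent NRT store: every product has a stable NrtId (the foreign key).
nrt_ids = {"ProductA": 0, "ProductC": 1, "ProductD": 2, "ProductB": 3}
price = array("i", [100, 200, 250, 150])   # column indexed by NrtId


def build_doc_to_nrt(doc_to_product, nrt_ids):
    """Resolve ProductIds to NrtIds once (assumed: per segment), not once per lookup."""
    return array("i", [nrt_ids[p] for p in doc_to_product])


doc_to_nrt = build_doc_to_nrt(segment_doc_to_product, nrt_ids)


def lookup_price(doc_id):
    """Two array reads: NrtId(doc_id), then Price(nrt_id)."""
    return price[doc_to_nrt[doc_id]]


def apply_price_update(product_id, new_price):
    """The string key is needed only on the update path, never while scoring."""
    price[nrt_ids[product_id]] = new_price


mapping = doc_to_nrt.tolist()
first = lookup_price(3)
apply_price_update("ProductD", 120)   # hypothetical price change
second = lookup_price(3)
print(mapping, first, second)
```

Because the price column is indexed by NrtId and not by DocId, it survives segment merges unchanged; only the small DocId-to-NrtId array depends on the segment.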
  25. NRT Store Filter - PostFilter
      PostFilter(Price:[100 TO 150])
      Lucene Segment (DocId → ProductId)
        0 ProductB
        1 ProductA
        2 ProductC
        3 ProductD
      DocId - NrtId
        DocId   0   1   2   3
        NrtId   3   0   1   2
      NRT Forward Index (Segment Independent)
        NrtId          0          1          2          3
        ProductId      ProductA   ProductC   ProductD   ProductB
        Price          100        200        250        150
        Availability   T          F          F          T
        Status         01         10         01         00
      DocId : 3 → NrtId(3) = 2 → Price(2) → Don’t Delegate
      for d in [matched-docs] collect d
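
The `for d in [matched-docs] collect d` loop on slide 25 can be sketched as a generator that checks each already-matched document against the NRT price column and passes it on only when the price falls inside the range. The function name `post_filter` is hypothetical and this is not Solr's PostFilter interface; the bounds are treated as inclusive, as the square brackets in `Price:[100 TO 150]` indicate.

```python
def post_filter(matched_docs, doc_to_nrt, values, low, high):
    """Yield (delegate) only the matched docs whose NRT value lies in [low, high]."""
    for doc_id in matched_docs:
        value = values[doc_to_nrt[doc_id]]   # NrtId(doc_id), then Price(nrt_id)
        if low <= value <= high:
            yield doc_id                     # collect d
        # otherwise: don't delegate, the doc is dropped from the results


# Slide data: DocId -> NrtId and the price column indexed by NrtId.
doc_to_nrt = [3, 0, 1, 2]
price = [100, 200, 250, 150]

collected = list(post_filter(range(4), doc_to_nrt, price, 100, 150))
print(collected)
```

DocId 3 resolves to a price of 250 and is dropped, matching the "Don’t Delegate" label. The filter runs once per matched document, so its cost grows with the size of the match set, which is consistent with the "Higher latency" note on slide 29.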
  26. NRT Store - Invert index
      [Diagram labels: NRT Forward Store; NRT Inverter; NRT Filter; Lucene Segment]
      Lucene Segment (DocId → ProductId)
        0 ProductB
        1 ProductA
        2 ProductC
        3 ProductD
      NRT DocIdSet Cache
        Availability : T   0 , 3
        Offer : O1         2 , 3
      Offer:O1 → DocIdSet
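
One way to read slide 26 is that an inverter scans the forward store and produces, per filter term, the set of matching DocIds for the current segment, which can then be cached and intersected like any other filter. The sketch below rests on several assumptions: `invert` and `filter_docs` are hypothetical names, sorted lists stand in for DocIdSets, and the availability and offer values are made up so that the resulting cache matches the one drawn on the slide.

```python
doc_to_nrt = [3, 0, 1, 2]   # DocId -> NrtId, as on the earlier slides

# Assumed forward-store columns indexed by NrtId (values chosen to reproduce the slide's cache).
availability = [False, False, True, True]
offers = [set(), {"O1"}, {"O1"}, set()]


def invert(doc_to_nrt, availability, offers):
    """Scan the forward store once and build term -> sorted DocId list for this segment."""
    cache = {}
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        if availability[nrt_id]:
            cache.setdefault("Availability:T", []).append(doc_id)
        for offer in sorted(offers[nrt_id]):
            cache.setdefault("Offer:" + offer, []).append(doc_id)
    return cache


def filter_docs(cache, terms):
    """Intersect the cached DocId sets of all requested terms."""
    sets = [set(cache.get(t, [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


cache = invert(doc_to_nrt, availability, offers)
both = filter_docs(cache, ["Availability:T", "Offer:O1"])
print(cache, both)
```

A cache built this way reflects the forward store only as of the last inversion, so a lookup can see a newer value than the filter does. That fits the "No consistency between lookup and filtering" caveat on slide 29.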
  27. Solr Integration Points
      • ValueSources
      • Filtering
        – Custom Filter Implementation for cached DocIdSet
        – Custom PostFilter
      • Query
        – Wrapper over Filter
      • Custom FacetComponent
  28. Near Real Time Solr Architecture
      [Diagram labels: Catalogue, Pricing, Availability, Offers, Seller Quality, Others; Ingestion pipeline; Kafka; NRT Updates; Lucene Updates; Solr Master; Commit + Replicate + Reopen; Redis; Bootstrap; Solr; Lucene; NRT Forward Index; NRT Inverted store; Matching; Ranking; Faceting]
  29. Accomplishments
      • Real time sorting
      • Real time filtering : PostFilter
        – Higher latency
      • Near real time filtering : cached DocIdSet
        – No consistency between lookup and filtering
      • Independent of lucene commits
      • Query latency comparable to DocValues
        – Consistent 99% performance
  30. Accomplishments @ Flipkart
      ● Real time consumption for ~150 Signals
      ● Reduction in shown out of stock products by 2X
      ● Production instances of ~50K updates/second real time
  31. Thank you & Questions
  32. Search @ Flipkart
      • Catalogue
        – ~ 50 main categories
        – ~ 5000 sub-categories
        – ~ 231 million documents
        – ~ 90 million SKUs
        – ~ 160 million listings
      • E-commerce Marketplace
        – ~ 100K Sellers
        – Local Sellers
        – Regional Availability
        – Logistics Constraints
  33. Professional LinkedIn through Logos
  34. Personal Journey through Pictures
  35. FUTURE DIRECTION
