OCTOBER 11-14, 2016 • BOSTON, MA
Near Real-Time Indexing
Building a Real-Time Search Index for E-Commerce
Umesh Prasad
Tech Lead @ Flipkart
Thejus V M
Data Architect @ Flipkart
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
Traffic @ Flipkart
• Peak Traffic
– ~ 800K active users
– ~ 160K requests per second
• Search Traffic
– ~ 40K searches per second (Service)
– ~ 10K searches per second (Solr)
• Latency
– Median : 11 ms
– 99th percentile : 1.1 second
Search @ Flipkart
• Catalogue
– ~ 50 main categories
– ~ 5000 sub-categories
– ~ 231 million documents
– ~ 90 million SKUs
– ~ 160 million listings
• E-commerce Marketplace
– ~ 100K Sellers
– Local Sellers
– Regional Availability
– Logistics Constraints
E-commerce Search
• Heavy usage of drill down filters
• Heavy usage of faceting
• Only top results matter
• Results grouped/collapsed by products
• Serviceability and delivery experience MATTERS
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
Sorry, Stock Over !!?
Damn !! Is Offer Over ??
What !! All Steal Deals Gone ??
Product / Listing: Important Attributes
• Seller Rating Service
• Catalogue Service
• Promise Service
• Availability Service
• Offer Service
• Pricing Service
Each service contributes attributes to the Product (aka SKU) and its Listings.
Summary : Lucene Document
• Product/SKU (Parent Document)
– Listing (Child Document)
• Query : Mostly SKU Attributes (Free Text)
• Filters : SKU + Listing Attributes (Drill Down)
• Ranking : SKU + Listing Attributes (Explicit/Relevance)
• Index Time Join aka Block Join (Best Performance)
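As a rough illustration of the parent/child block layout and index-time join (not the exact production schema; field names like docType, available and brand are hypothetical, and an open IndexWriter writer is assumed), a minimal Lucene block-join sketch:

    // imports: org.apache.lucene.document.*, org.apache.lucene.index.Term, java.util.Arrays,
    //          org.apache.lucene.search.*, org.apache.lucene.search.join.*
    // Children (listings) first, parent (SKU) last -- the block-join contract.
    Document listing = new Document();
    listing.add(new StringField("docType", "listing", Field.Store.NO));
    listing.add(new StringField("available", "T", Field.Store.NO));

    Document sku = new Document();
    sku.add(new StringField("docType", "product", Field.Store.NO));
    sku.add(new TextField("brand", "samsung", Field.Store.NO));

    writer.addDocuments(Arrays.asList(listing, sku));   // indexed as one block

    // Match listings, join up to the parent SKU.
    BitSetProducer parents =
        new QueryBitSetProducer(new TermQuery(new Term("docType", "product")));
    Query availableListings = new TermQuery(new Term("available", "T"));
    Query skuQuery = new ToParentBlockJoinQuery(availableListings, parents, ScoreMode.None);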
Out Of Stock, but Why Show?
Index has stale availability data: ~234K products.
Challenge 1 : High Update Rates

  Signal            updates/sec (normal)   updates/sec (peak)   updates/hr
  text / catalogue  ~10                    ~100                 ~100K
  pricing           ~100                   ~1K                  ~10 million
  availability      ~100                   ~10K                 ~10 million
  offer             ~100                   ~10K                 ~10 million
  seller rating     ~10                    ~1K                  ~1 million
  signal 6          ~10                    ~100                 ~1 million
  signal 7          ~100                   ~10K                 ~10 million
  signal 8          ~100                   ~10K                 ~10 million
Challenge 2 : Micro Services
● Each attribute comes from a separate micro service: Catalogue, Pricing, Availability, Offers, ...
● The ingestion pipeline consumes their change-propagation update streams (Stream 1, Stream 2, Stream 3, ...); the Document Builder assembles full documents {L1, L2 … P1} before pushing them to Solr/Lucene.
● Lucene doesn’t support Partial Updates
● Update = Delete + Add
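Concretely, a "partial update" on the Lucene side has to rebuild and re-index the whole document, or the whole SKU + listings block for a block-join index. A minimal sketch, where buildFullDocument / buildFullBlock are hypothetical helpers that re-assemble the document from all the micro services:

    // A one-field change (say, price) still requires delete-by-term + add:
    Document rebuilt = buildFullDocument(productId);          // hypothetical helper
    writer.updateDocument(new Term("id", productId), rebuilt);

    // For a block-join index, the entire parent+child block is replaced:
    List<Document> rebuiltBlock = buildFullBlock(productId);  // hypothetical helper
    writer.updateDocuments(new Term("id", productId), rebuiltBlock);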
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
SolrCloud for NRT
• The ingestion pipeline sends batches of documents to each shard leader.
• The leader writes to the update log (used for document versioning) and forwards the update to its replicas.
• Auto commit / soft commit then re-opens the searcher on every shard and replica.
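For reference, a minimal SolrJ sketch of this update-plus-soft-commit path (host, collection name and fields are illustrative, not Flipkart's actual setup):

    // imports: org.apache.solr.client.solrj.SolrClient,
    //          org.apache.solr.client.solrj.impl.HttpSolrClient,
    //          org.apache.solr.common.SolrInputDocument
    SolrClient client = new HttpSolrClient.Builder("http://solr-host:8983/solr").build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "SKU123");
    doc.addField("price", 23000);
    client.add("products", doc);

    // waitFlush=true, waitSearcher=true, softCommit=true: re-opens searchers
    // from in-memory segments without forcing an fsync of the index.
    client.commit("products", true, true, true);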
SolrCloud Evaluation
• Update = Delete + Add
– Block Join Index ⇒ Update Whole Block (Product + Listings)
• Updated Document gets streamed to all replicas in sync
– Reduces indexing throughput
• Soft commit is Not Free
– Soft commit ⇒ In Memory Segment
– Lots of Merges
– Huge document churn / deletes
– All caches still need to be re-generated
– Filter cache misses especially hurt performance
Agenda
• Search @ Flipkart
• Need for Real Time Index
• SolrCloud Solution
• Our approach
• Q & A
Lucene Segment / Lucene Index — example

  Documents
    ProductA : brand = Apple,   availability = T, price = 45000
    ProductB : brand = Samsung, availability = T, price = 23000
    ProductC : brand = Apple,   availability = F, price = 5000

  Document ID mappings
    0 → ProductA, 1 → ProductB, 2 → ProductC

  Posting list (inverted index) — terms → sparse bitsets
    brand : Apple     → 0, 2
    brand : Samsung   → 1
    availability : T  → 0, 1

  DocValues (columnar data)
    price : 45000, 23000, 5000
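A rough sketch of how a single document contributes to both halves of such a segment, using Lucene's standard field types (field names follow the example above; an open IndexWriter writer is assumed):

    // StringField entries feed the posting list (inverted index);
    // NumericDocValuesField feeds the columnar DocValues.
    Document doc = new Document();
    doc.add(new StringField("brand", "Apple", Field.Store.NO));
    doc.add(new StringField("availability", "T", Field.Store.NO));
    doc.add(new NumericDocValuesField("price", 45000L));
    writer.addDocument(doc);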
A Typical Search Flow
  Query: "samsung mobiles"; filters: category : mobiles, brand : samsung, Offer : exchange offer; sort: price desc
    → Query Rewrite
    → Matching                      (posting list — inverted index)
    → Ranking / Faceting / Stats    (doc values — forward index, NRT store, other components)
    → Results
NRT Forward Index - Considerations
● Lookup efficiency
– 50th percentile : ~10K matches
– 99th percentile : ~1 million matches
● Data on Java heap
– Memory efficiency
NRT Forward Index - Naive Implementation
● The Lucene segment only maps DocId → ProductId (0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD).
● The NRT forward index is a lookup engine keyed by ProductId, holding the fresh Availability and Price columns.
● Lookup path: DocId 3 → ProductD → <ProductD, price> → 250.
● Latency : ~10 secs for ~1 million lookups
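A minimal sketch of that naive layout, assuming a hash map keyed by the ProductId string (class, field and helper names are hypothetical):

    // imports: java.io.IOException, java.util.concurrent.ConcurrentHashMap,
    //          org.apache.lucene.index.LeafReader
    // Every matched DocId costs a DocId -> ProductId resolution against the segment
    // plus a string-keyed hash lookup -- which is why ~1M lookups take seconds.
    class NaiveNrtForwardIndex {
        static final class Row { volatile boolean available; volatile long price; }
        private final ConcurrentHashMap<String, Row> rows = new ConcurrentHashMap<>();

        long priceOf(LeafReader segmentReader, int docId) throws IOException {
            String productId = segmentReader.document(docId).get("productId"); // stored field
            return rows.get(productId).price;
        }
    }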
NRT Store - Forward Index Optimized
● Per Lucene segment, an int[] maps DocId → NrtId (e.g. DocIds 0, 1, 2, 3 → NrtIds 3, 0, 1, 2).
● The NRT forward index itself is segment independent: parallel columns indexed by NrtId.
    Product      : ProductA, ProductC, ProductD, ProductB
    Price        : 100, 200, 250, 150
    Availability : T, F, F, T
    Status       : 01, 10, 01, 00
● Lookup path: DocId 3 → NrtId 2 → Price(2) = 250.
● Latency : ~100 ms for ~1 million lookups
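A minimal sketch of the optimized layout under the same assumptions: a per-segment DocId → NrtId int array in front of segment-independent primitive columns (names illustrative):

    // imports: java.util.Map, java.util.concurrent.ConcurrentHashMap,
    //          org.apache.lucene.index.LeafReaderContext
    class NrtForwardIndex {
        // Segment-independent columns, indexed by NrtId.
        volatile long[] price;
        volatile boolean[] available;
        volatile long[] statusBits;   // packed per-listing flags

        // Per-segment DocId -> NrtId maps, rebuilt when a searcher is (re)opened.
        private final Map<LeafReaderContext, int[]> segmentMappings = new ConcurrentHashMap<>();

        int[] mappingFor(LeafReaderContext segment) { return segmentMappings.get(segment); }

        // Two array hops per lookup -- no string hashing, no object allocation.
        long priceOf(int[] docIdToNrtId, int docId) {
            return price[docIdToNrtId[docId]];
        }
    }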
NRT Store Filter - PostFilter
● Example: PostFilter(Price:[100 TO 150])
● For every DocId the query matches, the filter maps DocId → NrtId, reads the current price from the segment-independent NRT forward index, and simply doesn’t delegate the document to the next collector when it falls outside the range (e.g. DocId 3 → NrtId 2 → Price(2) = 250 → dropped).
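A hedged sketch of such a post filter using Solr's PostFilter / DelegatingCollector extension points, reusing the hypothetical NrtForwardIndex sketched above:

    // imports: java.io.IOException, org.apache.solr.search.*,
    //          org.apache.lucene.index.LeafReaderContext, org.apache.lucene.search.IndexSearcher
    public class NrtPriceRangeFilter extends ExtendedQueryBase implements PostFilter {
        private final NrtForwardIndex nrt;
        private final long min, max;

        public NrtPriceRangeFilter(NrtForwardIndex nrt, long min, long max) {
            this.nrt = nrt; this.min = min; this.max = max;
        }

        @Override public boolean getCache() { return false; }
        @Override public int getCost() { return Math.max(super.getCost(), 100); } // run post-query

        @Override
        public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
            return new DelegatingCollector() {
                private int[] docIdToNrtId;

                @Override
                public void doSetNextReader(LeafReaderContext context) throws IOException {
                    super.doSetNextReader(context);
                    docIdToNrtId = nrt.mappingFor(context);   // per-segment DocId -> NrtId map
                }

                @Override
                public void collect(int doc) throws IOException {
                    long price = nrt.priceOf(docIdToNrtId, doc);
                    if (price >= min && price <= max) {
                        super.collect(doc);   // in range: delegate down the chain
                    }                         // out of range: don't delegate
                }
            };
        }
    }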
NRT Store - Invert Index
● The NRT Inverter builds an NRT DocIdSet cache on top of the NRT forward store.
● Example: Availability : T → DocIdSet {0, 3}; Offer : O1 → DocIdSet {2, 3}.
● The Lucene segment itself still only maps DocId → ProductId (0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD).
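A minimal sketch of inverting one NRT column into a cached, per-segment DocIdSet with Lucene's FixedBitSet (cache name and fields are illustrative):

    // imports: org.apache.lucene.util.FixedBitSet, org.apache.lucene.util.BitDocIdSet
    // Re-run whenever the availability column has changed enough to matter.
    FixedBitSet bits = new FixedBitSet(segmentMaxDoc);
    for (int docId = 0; docId < segmentMaxDoc; docId++) {
        if (nrt.available[docIdToNrtId[docId]]) {
            bits.set(docId);
        }
    }
    nrtDocIdSetCache.put("availability:T", new BitDocIdSet(bits));   // hypothetical cache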
Solr Integration Points
• ValueSources
• Filtering
– Custom Filter Implementation for cached DocIdSet
– Custom PostFilter
• Query
– Wrapper over Filter
• Custom FacetComponent
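For the ValueSources hook, a hedged sketch that exposes the live NRT price to sorting and function queries (again building on the hypothetical NrtForwardIndex):

    // imports: java.util.Map, org.apache.lucene.index.LeafReaderContext,
    //          org.apache.lucene.queries.function.ValueSource,
    //          org.apache.lucene.queries.function.FunctionValues,
    //          org.apache.lucene.queries.function.docvalues.LongDocValues
    public class NrtPriceValueSource extends ValueSource {
        private final NrtForwardIndex nrt;
        public NrtPriceValueSource(NrtForwardIndex nrt) { this.nrt = nrt; }

        @Override
        public FunctionValues getValues(Map context, LeafReaderContext readerContext) {
            final int[] docIdToNrtId = nrt.mappingFor(readerContext);   // per-segment map
            return new LongDocValues(this) {
                @Override public long longVal(int doc) { return nrt.priceOf(docIdToNrtId, doc); }
            };
        }

        @Override public String description() { return "nrtPrice()"; }
        @Override public boolean equals(Object o) { return o instanceof NrtPriceValueSource; }
        @Override public int hashCode() { return NrtPriceValueSource.class.hashCode(); }
    }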
Near Real Time Solr Architecture
• The ingestion pipeline (Catalogue, Pricing, Availability, Offers, Seller Quality, others) publishes updates to Kafka.
• NRT updates flow from Kafka straight into the NRT forward index and the NRT inverted store inside Solr, feeding ranking, matching and faceting; Redis is used to bootstrap the NRT stores.
• Lucene updates go to the Solr master, which commits, replicates and reopens Lucene in the background.
Accomplishments
• Real time sorting
• Real time filtering : PostFilter
– Higher latency
• Near real time filtering : cached DocIdSet
– No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
– Consistent 99th-percentile performance
Accomplishments @ Flipkart
● Real time consumption of ~150 signals
● 2x reduction in out-of-stock products shown to users
● Production instances handling ~50K updates/second in real time
Thank you
&
Questions
