Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart

Presented at Lucene/Solr Revolution 2016

  1. 1. OCTOBER 11-14, 2016 • BOSTON, MA
  2. 2. Near Real Time Indexing: Building Real Time Search Index For E-Commerce
     Umesh Prasad, Tech Lead @ Flipkart
     Thejus V M, Data Architect @ Flipkart
  3. 3. Agenda
     • Search @ Flipkart
     • Need for Real Time Search
     • SolrCloud Solution
     • Our approach
     • Q & A
  4. 4. Traffic @ Flipkart
     • Peak Traffic
       – ~800K active users
       – ~160K requests per second
     • Search Traffic
       – ~40K searches per second (Service)
       – ~10K searches per second (Solr)
     • Latency
       – Median: 11 ms
       – 99th percentile: 1.1 seconds
  5. 5. Search @ Flipkart
     • Catalogue
       – ~50 main categories
       – ~5000 sub-categories
       – ~231 million documents
       – ~90 million SKUs
       – ~160 million listings
     • E-commerce Marketplace
       – ~100K Sellers
       – Local Sellers
       – Regional Availability
       – Logistics Constraints
  6. 6. E-commerce Search
     • Heavy usage of drill down filters
     • Heavy usage of faceting
     • Only top results matter
     • Results grouped/collapsed by products
     • Serviceability and delivery experience MATTERS
  7. 7. Agenda
     • Search @ Flipkart
     • Need for Real Time Search
     • SolrCloud Solution
     • Our approach
     • Q & A
  8. 8. Sorry, Stock Over !!?
  9. 9. Damn !! Is Offer Over ??
  10. 10. What !! All Steal Deals Gone ??
  11. 11. Product / Listing: Important Attributes
      Diagram labels: Seller Rating Service, Catalogue Service, Promise Service, Availability Service, Offer Service, Pricing Service, Product aka SKU, Listings
  12. 12. Summary: Lucene Document
      • Product/SKU (Parent Document)
        – Listing (Child Document)
      • Query: Mostly SKU Attributes (Free Text)
      • Filters: SKU + Listing Attributes (Drill Down)
      • Ranking: SKU + Listing Attributes (Explicit/Relevance)
      • Index Time Join aka Block Join (Best Performance)
  13. 13. Out Of Stock, but Why Show?
      Index has Stale Availability Data (234K Products)
  14. 14. Challenge 1: High Update Rates
      Signal             Updates/sec (normal)   Updates/sec (peak)   Updates/hr
      text / catalogue   ~10                    ~100                 ~100K
      pricing            ~100                   ~1K                  ~10 million
      availability       ~100                   ~10K                 ~10 million
      offer              ~100                   ~10K                 ~10 million
      seller rating      ~10                    ~1K                  ~1 million
      signal 6           ~10                    ~100                 ~1 million
      signal 7           ~100                   ~10K                 ~10 million
      signal 8           ~100                   ~10K                 ~10 million
  15. 15. Challenge 2: Micro Services
      Diagram labels: Updates Stream 1, Updates Stream 2, Updates Stream 3; Ingestion pipeline (Catalogue, Pricing, Availability, Offers, ...); Change Propagation; Document Builder; Documents {L1, L2 … P1}; Solr/Lucene
      ● Lucene doesn't support Partial Updates
      ● Update = Delete + Add
  16. 16. Agenda
      • Search @ Flipkart
      • Need for Real Time Search
      • SolrCloud Solution
      • Our approach
      • Q & A
  17. 17. SolrCloud for NRT
      Diagram labels: Ingestion pipeline; Batch of documents; Shard Leader; Update Log (For Document Versioning); Forward to Replica; Auto commit; Soft Commit; six Shard / Replica pairs, each labelled "Re-open searcher"
  18. 18. SolrCloud Evaluation
      • Update = Delete + Add
        – Block Join Index → Update Whole Block (Product + Listings)
      • Updated Document gets streamed to all replicas in sync
        – Reduces indexing throughput
      • Soft commit is Not Free
        – Soft commit → In Memory Segment
        – Lots of Merges
        – Huge document churn / deletes
        – All caches still need to be re-generated
        – Filter Cache miss especially hurts performance
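
The "Update = Delete + Add" cost for a block join index can be illustrated with a toy model. This is a minimal sketch, not Lucene or Solr code: `BlockIndex`, `add_block`, and `update_listing_price` are hypothetical names, and the only behaviour modelled is that documents are never edited in place and that a block stores the child listings followed by the parent product.

```python
class BlockIndex:
    """Toy append-only index: an update deletes the old block and re-adds it."""

    def __init__(self):
        self.docs = []    # every document ever added, in docId order
        self.live = []    # live[doc_id] turns False once the doc is deleted
        self.blocks = {}  # product id -> docIds of its current block

    def add_block(self, product_id, product, listings):
        # Mark the previous block (if any) as deleted; nothing is edited in place.
        for doc_id in self.blocks.get(product_id, []):
            self.live[doc_id] = False
        # Children first, parent last, stored contiguously as one block.
        block = [dict(listing) for listing in listings] + [dict(product)]
        start = len(self.docs)
        self.docs.extend(block)
        self.live.extend([True] * len(block))
        self.blocks[product_id] = list(range(start, start + len(block)))
        return len(block)  # number of documents written

    def update_listing_price(self, product_id, listing_id, price):
        # Changing one field of one listing rewrites the whole block.
        ids = self.blocks[product_id]
        listings = [dict(self.docs[i]) for i in ids[:-1]]
        product = dict(self.docs[ids[-1]])
        for listing in listings:
            if listing["id"] == listing_id:
                listing["price"] = price
        return self.add_block(product_id, product, listings)


index = BlockIndex()
index.add_block("P1", {"id": "P1"},
                [{"id": "L1", "price": 100}, {"id": "L2", "price": 120}])
written = index.update_listing_price("P1", "L2", 110)
```

In this model a single price change writes three documents and leaves three deleted ones behind, which is the document churn the slide refers to.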
  19. 19. Agenda
      • Search @ Flipkart
      • Need for Real Time Index
      • SolrCloud Solution
      • Our approach
      • Q & A
  20. 20. Lucene Index
      Documents:
        ProductA  brand: Apple    availability: T  price: 45000
        ProductB  brand: Samsung  availability: T  price: 23000
        ProductC  brand: Apple    availability: F  price: 5000
      Lucene Segment:
        Document ID Mappings: 0 ProductA, 1 ProductB, 2 ProductC
        Posting List (Inverted Index), Terms → Sparse Bitsets:
          availability: T → 0, 1
          brand: Samsung → 1
          brand: Apple → 0, 2
        DocValues (columnar data): Price → 45000, 23000, 5000
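
The three structures on this slide can be rebuilt from the three example products in a few lines. This is a simplified sketch in plain Python, not Lucene's actual on-disk format; the variable names are illustrative, and posting lists are plain lists of doc ids rather than sparse bitsets.

```python
# The three example products from the slide, in docId order.
products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]

doc_id_mapping = {}  # docId -> product id
postings = {}        # "field:value" term -> sorted list of docIds (inverted index)
price_column = []    # price_column[docId] -> price (columnar, DocValues-like)

for doc_id, (product_id, fields) in enumerate(products):
    doc_id_mapping[doc_id] = product_id
    for field in ("brand", "availability"):
        term = f"{field}:{fields[field]}"
        postings.setdefault(term, []).append(doc_id)
    price_column.append(fields["price"])
```

The inverted index answers "which documents contain this term", while the column answers "what is the value for this document", which is why ranking and faceting read the column and matching reads the postings.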
  21. 21. A Typical Search Flow
      Diagram labels: Query; Query Rewrite; Matching; Ranking; Faceting; Stats; Other Components; Results; Lucene Segment: Inverted Index (Posting List), Forward Index (Doc Values); NRT Store
      Example labels: samsung mobiles; Offer: exchange offer; price desc; category: mobiles; brand: samsung; Offer: exchange offer
  22. 22. NRT Forward Index - Considerations
      ● Lookup efficiency
        – 50th percentile: ~10K matches
        – 99th percentile: ~1 million matches
      ● Data on Java heap
        – Memory efficiency
  23. 23. NRT Forward Index - Naive Implementation
      Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      NRT Forward Index:
        ProductId   Availability   Price
        ProductA    True           100
        ProductB    False          150
        ProductC    False          200
        ProductD    True           250
      Lookup Engine: DocId: 3, field: price → ProductId(3) = ProductD → <ProductD, price> → 250
      Latency: ~10 secs for ~1 million lookups
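
The naive lookup path on this slide can be sketched as two hops per matched document: docId to product id, then a keyed lookup on `<product id, field>`. The code below is an illustration with the slide's example values; `naive_lookup` and the dictionary layout are assumptions, and the slide does not say which part of this path dominates the ~10 second figure.

```python
# Lucene segment: docId -> product id (order taken from the slide).
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by (product id, field), as in <ProductD, price>.
nrt_forward_index = {
    ("ProductA", "availability"): True,  ("ProductA", "price"): 100,
    ("ProductB", "availability"): False, ("ProductB", "price"): 150,
    ("ProductC", "availability"): False, ("ProductC", "price"): 200,
    ("ProductD", "availability"): True,  ("ProductD", "price"): 250,
}


def naive_lookup(doc_id, field):
    """Resolve the product id for a docId, then hash on (product id, field)."""
    product_id = segment_doc_to_product[doc_id]
    return nrt_forward_index[(product_id, field)]


price_of_doc_3 = naive_lookup(3, "price")
```

Every matched document pays for a string-keyed hash lookup, so the cost scales with the number of matches, which reaches ~1 million at the 99th percentile.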
  24. 24. NRT Store - Forward Index Optimized
      Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
      NRT Forward Index (Segment Independent):
        NrtId   ProductId   Price   Availability   Status
        0       ProductA    100     T              01
        1       ProductC    200     F              10
        2       ProductD    250     F              01
        3       ProductB    150     T              00
      Lookup Engine: DocId: 3, Field: price → NrtId(3) = 2 → Price(2) = 250
      Latency: ~100 ms for ~1 million lookups
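
The optimized layout replaces keyed lookups with two array reads: a per-segment docId to NrtId mapping, then a segment-independent column indexed by NrtId. A minimal sketch using the slide's values follows; the function names are hypothetical and Python's `array` module stands in for primitive arrays on the Java heap.

```python
from array import array

# Per-segment mapping: doc_to_nrt[docId] -> NrtId (values from the slide).
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent columns indexed by NrtId.
price = array("i", [100, 200, 250, 150])
availability = [True, False, False, True]


def lookup_price(doc_id):
    """Two array reads: docId -> NrtId -> price."""
    return price[doc_to_nrt[doc_id]]


def update_price(nrt_id, new_price):
    """An update overwrites one slot; no segment is rewritten."""
    price[nrt_id] = new_price


price_before = lookup_price(3)  # DocId 3 -> NrtId 2 -> Price(2)
update_price(2, 240)
price_after = lookup_price(3)
```

Because the columns are keyed by NrtId rather than docId, only the small docId to NrtId mapping depends on the Lucene segment; the values themselves can change without touching the index.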
  25. 25. NRT Store Filter - PostFilter
      PostFilter(Price:[100 TO 150])
      Lucene Segment: 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
      NRT Forward Index (Segment Independent):
        NrtId   ProductId   Price   Availability   Status
        0       ProductA    100     T              01
        1       ProductC    200     F              10
        2       ProductD    250     F              01
        3       ProductB    150     T              00
      DocId: 3 → NrtId(3) = 2 → Price(2) = 250 → Don't Delegate
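
The post-filtering step can be sketched as a check applied to each already-matched document, passing it on only when its live price falls in the range. This is a plain-Python illustration with the slide's values, not Solr's `PostFilter` API; `post_filter` is a hypothetical name, and yielding a docId stands in for delegating it to the next collector.

```python
from array import array

doc_to_nrt = array("i", [3, 0, 1, 2])     # docId -> NrtId (from the slide)
price = array("i", [100, 200, 250, 150])  # price column indexed by NrtId


def post_filter(matched_doc_ids, low, high):
    """Yield only the matched docIds whose current price lies in [low, high]."""
    for doc_id in matched_doc_ids:
        if low <= price[doc_to_nrt[doc_id]] <= high:
            yield doc_id  # "delegate" to the next stage
        # otherwise: don't delegate, the document is dropped


kept = list(post_filter([0, 1, 2, 3], 100, 150))
```

With these values docIds 0 and 1 (prices 150 and 100) pass, while docIds 2 and 3 (prices 200 and 250) are dropped, matching the "Don't Delegate" outcome shown for DocId 3.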
  26. 26. NRT Store - Invert index
      Diagram labels: NRT Forward Store; NRT Inverter; NRT Filter; Lucene Segment (0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD)
      NRT DocIdSet Cache:
        Availability: T → 0, 3
        Offer: O1 → 2, 3
      Offer:O1 → DocIdSet
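
One way to picture the inverter is as a pass over the forward store that produces a per-segment bitset for a given field value, with the result cached until the next refresh. The sketch below is an assumption about the mechanism rather than the presenters' implementation: `DocIdSetCache`, `invert`, and `refresh` are hypothetical names, the data is illustrative and does not reproduce the ids in the slide's figure, and a Python integer stands in for a bitset.

```python
def invert(doc_to_nrt, column, value):
    """Build a bitset (as an int) of docIds whose column value equals `value`."""
    bits = 0
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        if column[nrt_id] == value:
            bits |= 1 << doc_id
    return bits


class DocIdSetCache:
    """Caches inverted bitsets; entries go stale until refresh() is called."""

    def __init__(self, doc_to_nrt, columns):
        self.doc_to_nrt = doc_to_nrt
        self.columns = columns
        self.cache = {}

    def get(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.cache[key] = invert(self.doc_to_nrt, self.columns[field], value)
        return self.cache[key]

    def refresh(self):
        self.cache.clear()


doc_to_nrt = [3, 0, 1, 2]                               # docId -> NrtId
columns = {"availability": [True, False, False, True]}  # indexed by NrtId
cache = DocIdSetCache(doc_to_nrt, columns)

first = cache.get("availability", True)   # built from the forward store
columns["availability"][1] = True         # forward store changes in real time
stale = cache.get("availability", True)   # cached bitset does not see it yet
cache.refresh()
fresh = cache.get("availability", True)
```

The gap between `stale` and `fresh` is one reading of the later remark that cached DocIdSet filtering gives no consistency between lookup and filtering.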
  27. 27. Solr Integration Points
      • ValueSources
      • Filtering
        – Custom Filter Implementation for cached DocIdSet
        – Custom PostFilter
      • Query
        – Wrapper over Filter
      • Custom FacetComponent
  28. 28. Near Real Time Solr Architecture
      Diagram labels: Catalogue, Pricing, Availability, Offers, Seller Quality, Others; Ingestion pipeline; Kafka; NRT Updates; Lucene Updates; Solr Master; Commit + Replicate + Reopen; Redis Bootstrap; Solr: NRT Forward Index, NRT Inverted store, Lucene, Ranking, Matching, Faceting
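
The NRT update path in this architecture can be pictured as a consumer that drains an update stream and overwrites values in the forward index, with no commit or searcher reopen involved. The following is a hedged sketch only: a `deque` stands in for the Kafka topic, and `NrtForwardIndex`, `apply`, and `drain` are hypothetical names, not part of Solr, Kafka, or the presenters' code.

```python
from collections import deque


class NrtForwardIndex:
    """Columnar store keyed by NrtId; updates overwrite values in place."""

    def __init__(self, fields):
        self.nrt_ids = {}                    # product id -> NrtId
        self.columns = {f: [] for f in fields}

    def apply(self, product_id, field, value):
        nrt_id = self.nrt_ids.get(product_id)
        if nrt_id is None:
            # First time this product is seen: allocate a slot in every column.
            nrt_id = len(self.nrt_ids)
            self.nrt_ids[product_id] = nrt_id
            for column in self.columns.values():
                column.append(None)
        self.columns[field][nrt_id] = value

    def get(self, product_id, field):
        return self.columns[field][self.nrt_ids[product_id]]


def drain(stream, index):
    """Apply every pending (product id, field, value) update; return the count."""
    applied = 0
    while stream:
        index.apply(*stream.popleft())
        applied += 1
    return applied


stream = deque([
    ("ProductA", "price", 100),
    ("ProductA", "availability", True),
    ("ProductA", "price", 90),
])
index = NrtForwardIndex(["price", "availability"])
applied = drain(stream, index)
```

Each update touches a single slot, in contrast to the Lucene update path on the same diagram, which goes through commit, replicate, and reopen.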
  29. 29. Accomplishments
      • Real time sorting
      • Real time filtering: PostFilter
        – Higher latency
      • Near real time filtering: cached DocIdSet
        – No consistency between lookup and filtering
      • Independent of Lucene commits
      • Query latency comparable to DocValues
        – Consistent 99% performance
  30. 30. Accomplishments @ Flipkart
      ● Real time consumption for ~150 Signals
      ● Reduction in shown out of stock products by 2X
      ● Production instances of ~50K updates/second real time
  31. 31. Thank you & Questions