OCTOBER 11-14, 2016 • BOSTON, MA
  
Near Real Time Indexing
Building Real Time Search Index For E-Commerce

Umesh Prasad
Tech Lead @ Flipkart

Thejus V M
Data Architect @ Flipkart
  
	
  
	
  
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
  
Traffic @ Flipkart
• Peak Traffic
– ~800K active users
– ~160K requests per second
• Search Traffic
– ~40K searches per second (Service)
– ~10K searches per second (Solr)
• Latency
– Median: 11 ms
– 99th percentile: 1.1 seconds
  
Search @ Flipkart
• Catalogue
– ~50 main categories
– ~5000 sub-categories
– ~231 million documents
– ~90 million SKUs
– ~160 million listings
• E-commerce Marketplace
– ~100K Sellers
– Local Sellers
– Regional Availability
– Logistics Constraints
  	
  
E-commerce Search
• Heavy usage of drill down filters
• Heavy usage of faceting
• Only top results matter
• Results grouped/collapsed by products
• Serviceability and delivery experience MATTERS
  	
  
Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
  
Sorry, Stock Over !!?

Damn !! Is Offer Over ??

What !! All Steal Deals Gone ??
  
Product / Listing: Important Attributes
[Diagram: a Product (aka SKU) and its Listings, with attributes supplied by the Seller Rating Service, Catalogue Service, Promise Service, Availability Service, Offer Service and Pricing Service.]
  
Summary: Lucene Document
• Product/SKU (Parent Document)
– Listing (Child Document)
• Query: Mostly SKU Attributes (Free Text)
• Filters: SKU + Listing Attributes (Drill Down)
• Ranking: SKU + Listing Attributes (Explicit/Relevance)
• Index Time Join aka Block Join (Best Performance)
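The parent/child layout above can be illustrated with a minimal Python sketch of how a block join resolves a matching listing to its product. It relies on one property of Lucene block joins: the children of a block are indexed contiguously and the parent comes last. The SKU names, sellers and prices are made up for illustration, and `index_blocks` and `parent_of` are hypothetical helpers, not Lucene or Solr APIs.

```python
from bisect import bisect_left


def index_blocks(products):
    """Flatten {sku: [listing, ...]} into one doc list, children before parent."""
    docs, parent_ids = [], []
    for sku, listings in products.items():
        for listing in listings:
            docs.append({"type": "listing", **listing})
        docs.append({"type": "sku", "sku": sku})
        parent_ids.append(len(docs) - 1)  # the parent closes its block
    return docs, parent_ids


def parent_of(child_doc_id, parent_ids):
    """The parent of a child is the first parent doc id at or after it."""
    return parent_ids[bisect_left(parent_ids, child_doc_id)]


# Hypothetical catalogue: two SKUs with three listings in total.
docs, parent_ids = index_blocks({
    "phone-1": [{"seller": "s1", "price": 100}, {"seller": "s2", "price": 150}],
    "phone-2": [{"seller": "s3", "price": 200}],
})

# Filter on a listing attribute, then collapse the matches to their SKUs.
matching_children = [i for i, d in enumerate(docs)
                     if d["type"] == "listing" and d["price"] <= 150]
matching_skus = sorted({docs[parent_of(i, parent_ids)]["sku"]
                        for i in matching_children})
```

Because the join is a position lookup rather than a key lookup, it is cheap at query time, but any change to one listing means re-indexing the whole block.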
  
	
  
	
  
Out Of Stock, but Why Show?
Index has Stale Availability Data
234K Products
  
Challenge 1: High Update Rates

Signal            updates/sec (normal)  updates/sec (peak)  updates/hr
text / catalogue  ~10                   ~100                ~100K
pricing           ~100                  ~1K                 ~10 million
availability      ~100                  ~10K                ~10 million
offer             ~100                  ~10K                ~10 million
seller rating     ~10                   ~1K                 ~1 million
signal 6          ~10                   ~100                ~1 million
signal 7          ~100                  ~10K                ~10 million
signal 8          ~100                  ~10K                ~10 million
  
Challenge 2: Micro Services
[Diagram: Ingestion pipeline. Catalogue, Pricing, Availability, Offers, ... send Updates Stream 1, 2 and 3 to the Document Builder; Change Propagation; Documents {L1, L2 … P1} go to Solr/Lucene.]
●  Lucene doesn’t support Partial Updates
●  Update = Delete + Add
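To make "Update = Delete + Add" concrete, here is a small Python sketch of an append-only index fed by a document builder. It is a toy model under stated assumptions, not Lucene code: `AppendOnlyIndex` and `DocumentBuilder` are hypothetical classes, and the key `P1` and the field values are illustrative.

```python
class AppendOnlyIndex:
    """Toy stand-in for a Lucene index: documents are never modified in place."""

    def __init__(self):
        self.docs = []      # append-only storage
        self.live = []      # live[i] becomes False when doc i is deleted
        self.latest = {}    # key -> doc id of the current version

    def update(self, key, full_doc):
        old = self.latest.get(key)
        if old is not None:
            self.live[old] = False      # delete the previous version
        self.docs.append(full_doc)      # add the complete new version
        self.live.append(True)
        self.latest[key] = len(self.docs) - 1


class DocumentBuilder:
    """Keeps the latest fields per key so it can always emit a full document."""

    def __init__(self, index):
        self.index = index
        self.state = {}

    def on_change(self, key, fields):
        doc = self.state.setdefault(key, {})
        doc.update(fields)
        self.index.update(key, dict(doc))  # re-send everything, not just the delta


index = AppendOnlyIndex()
builder = DocumentBuilder(index)
builder.on_change("P1", {"brand": "Apple", "price": 45000})  # catalogue + pricing
builder.on_change("P1", {"availability": False})             # availability only
```

A one-field change from a single service still produces a full new document and leaves a deleted one behind, which is where the document churn comes from.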

Agenda
• Search @ Flipkart
• Need for Real Time Search
• SolrCloud Solution
• Our approach
• Q & A
  
SolrCloud for NRT
[Diagram: Ingestion pipeline → batch of documents → Shard Leader → Update Log (for document versioning) → forward to Replica; Auto commit / Soft Commit; six Shard/Replica pairs, each marked "Re-open searcher".]

SolrCloud Evaluation
• Update = Delete + Add
– Block Join Index → Update Whole Block (Product + Listings)
• Updated Document gets streamed to all replicas in sync
– Reduces indexing throughput
• Soft commit is Not Free
– Soft commit → In Memory Segment
– Lots of Merges
– Huge document churn / deletes
– All caches still need to be re-generated
– Filter Cache miss especially hurts performance
  
Agenda
• Search @ Flipkart
• Need for Real Time Index
• SolrCloud Solution
• Our approach
• Q & A
  
Lucene Index
Documents:
  ProductA  brand : Apple    availability : T  price : 45000
  ProductB  brand : Samsung  availability : T  price : 23000
  ProductC  brand : Apple    availability : F  price : 5000
Lucene Segment:
  Document ID Mappings: 0 → ProductA, 1 → ProductB, 2 → ProductC
  Posting List (Inverted Index), Terms → Sparse Bitsets:
    brand : Apple → 0, 2
    brand : Samsung → 1
    availability : T → 0, 1
  DocValues (columnar data):
    Price: 45000, 23000, 5000
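The three structures on this slide can be rebuilt from the three example products with a short Python sketch. Plain dicts and lists stand in for Lucene's on-disk formats, and the variable names are illustrative only.

```python
from collections import defaultdict

products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]

doc_ids = {}                   # document ID mapping: doc id -> product
postings = defaultdict(list)   # inverted index: (field, term) -> doc ids
price_doc_values = []          # columnar data: price indexed by doc id

for doc_id, (name, fields) in enumerate(products):
    doc_ids[doc_id] = name
    for field in ("brand", "availability"):
        postings[(field, fields[field])].append(doc_id)
    price_doc_values.append(fields["price"])

# Matching uses the inverted index; sorting uses the column.
matches = set(postings[("brand", "Apple")]) & set(postings[("availability", "T")])
ranked = sorted(matches, key=lambda d: price_doc_values[d], reverse=True)
results = [doc_ids[d] for d in ranked]
```

All three structures are keyed by the segment-local doc id, so they are only valid for the segment they were written with.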

A Typical Search Flow
[Diagram: Query ("samsung mobiles", Offer : exchange offer, price desc) → Query Rewrite (category : mobiles, brand : samsung, Offer : exchange offer) → Matching, Ranking, Faceting, Stats, Other Components → Results. Lucene Segment: Posting List (Inverted Index), Doc Values (Forward Index). NRT Store.]

NRT Forward Index - Considerations
●  Lookup efficiency
–  50th percentile: ~10K matches
–  99th percentile: ~1 million matches
●  Data on Java heap
–  Memory efficiency
	
  
NRT Forward Index - Naive Implementation
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
NRT Forward Index:
  ProductId  Availability  Price
  ProductA   True          100
  ProductB   False         150
  ProductC   False         200
  ProductD   True          250
Lookup Engine: (DocId : 3, field : price) → ProductId(3) = ProductD → <ProductD, price> → 250
Latency: ~10 secs for ~1 million lookups
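The naive lookup path on this slide can be sketched in a few lines of Python using the slide's own values. The dict-of-dicts layout and the name `naive_lookup` are assumptions for illustration; the talk does not show the actual data structure.

```python
# Lucene segment: doc id -> product id (values from the slide).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id (assumed layout).
nrt_store = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    """Resolve the product id for a doc, then do a keyed lookup per field."""
    product_id = segment_product_ids[doc_id]
    return nrt_store[product_id][field]


price_of_doc_3 = naive_lookup(3, "price")
```

Every matched document pays for one product-id resolution plus one keyed lookup, and at the 99th percentile a query touches about a million matches.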

NRT Store - Forward Index Optimized
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
NRT Forward Index (Segment Independent):
  NrtId  ProductId  Price  Availability  Status
  0      ProductA   100    T             01
  1      ProductC   200    F             10
  2      ProductD   250    F             01
  3      ProductB   150    T             00
Lookup Engine: (DocId : 3, Field : price) → NrtId(3) = 2 → Price(2) = 250
Latency: ~100 ms for ~1 million lookups
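A minimal Python sketch of the array-based lookup on this slide, using the slide's values. Python's `array` module stands in for primitive Java arrays on the heap; the names `doc_to_nrt` and `lookup_price` are illustrative, not taken from the talk.

```python
from array import array

# Per-segment mapping: Lucene doc id -> segment-independent NRT id.
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent columns, indexed by NRT id.
price = array("i", [100, 200, 250, 150])
availability = [True, False, False, True]


def lookup_price(doc_id):
    """Two array reads: doc id -> NRT id -> value."""
    return price[doc_to_nrt[doc_id]]


prices_by_doc = [lookup_price(d) for d in range(len(doc_to_nrt))]

# A price update touches only the column, never the Lucene segment.
price[2] = 240
```

Only the `doc_to_nrt` array depends on the segment, so a new or merged segment needs a new mapping while the value columns stay as they are.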

NRT Store Filter - PostFilter
PostFilter(Price:[100 TO 150])
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
DocId - NrtId: 0 → 3, 1 → 0, 2 → 1, 3 → 2
NRT Forward Index (Segment Independent):
  NrtId  ProductId  Price  Availability  Status
  0      ProductA   100    T             01
  1      ProductC   200    F             10
  2      ProductD   250    F             01
  3      ProductB   150    T             00
Example: DocId : 3 → NrtId(3) = 2 → Price(2) = 250 → Don’t Delegate
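The delegate-or-drop decision can be sketched in plain Python. In Solr a PostFilter supplies a DelegatingCollector that sees each matching document and chooses whether to pass it on; the classes below are a simplified analogue of that pattern with hypothetical names, not Solr code. The data comes from the slide.

```python
class ListCollector:
    """Terminal collector: records every doc id it is given."""

    def __init__(self):
        self.doc_ids = []

    def collect(self, doc_id):
        self.doc_ids.append(doc_id)


class PriceRangePostFilter:
    """Checks the live NRT price of each matched doc before delegating."""

    def __init__(self, delegate, doc_to_nrt, price, low, high):
        self.delegate = delegate
        self.doc_to_nrt = doc_to_nrt
        self.price = price
        self.low, self.high = low, high

    def collect(self, doc_id):
        value = self.price[self.doc_to_nrt[doc_id]]
        if self.low <= value <= self.high:
            self.delegate.collect(doc_id)   # in range: pass downstream
        # out of range: don't delegate


doc_to_nrt = [3, 0, 1, 2]
price = [100, 200, 250, 150]

sink = ListCollector()
post_filter = PriceRangePostFilter(sink, doc_to_nrt, price, 100, 150)
for doc_id in range(4):        # pretend all four docs matched the main query
    post_filter.collect(doc_id)
```

The filter reads the forward index at collection time, so it always sees the latest value, at the cost of one lookup per matched document.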

NRT Store - Invert index
Lucene Segment (DocId → ProductId): 0 → ProductB, 1 → ProductA, 2 → ProductC, 3 → ProductD
[Diagram: NRT Forward Store → NRT Inverter → NRT DocIdSet Cache → Offer:O1 DocIdSet → NRT Filter]
NRT DocIdSet Cache:
  Availability : T → 0, 3
  Offer : O1 → 2, 3
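One way to picture the inverter is a periodic pass that turns a forward column into cached doc id sets. The Python sketch below assumes that shape; `invert` is a hypothetical helper, and the values are illustrative rather than a reproduction of the sets shown on the slide.

```python
def invert(doc_to_nrt, column):
    """Build value -> sorted list of Lucene doc ids from one forward column."""
    inverted = {}
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        inverted.setdefault(column[nrt_id], []).append(doc_id)
    return inverted


doc_to_nrt = [3, 0, 1, 2]
availability = [True, False, False, True]      # indexed by NRT id

cache = {"availability": invert(doc_to_nrt, availability)}
available_before = list(cache["availability"][True])

# An update lands in the forward store immediately ...
availability[1] = True
# ... but the cached set is stale until the inverter runs again.
stale = list(cache["availability"][True])
cache["availability"] = invert(doc_to_nrt, availability)
available_after = list(cache["availability"][True])
```

Filtering against the cached set is fast, but between inverter runs a filter can disagree with a direct forward lookup on the same document.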

Solr Integration Points
• ValueSources
• Filtering
– Custom Filter Implementation for cached DocIdSet
– Custom PostFilter
• Query
– Wrapper over Filter
• Custom FacetComponent
  
Near Real Time Solr Architecture
[Diagram: sources Catalogue, Pricing, Availability, Offers, Seller Quality, Others → Ingestion pipeline. Lucene Updates → Solr Master → Commit + Replicate + Reopen. NRT Updates → Kafka → Solr; Redis (Bootstrap). Inside Solr: Lucene, NRT Forward Index and NRT Inverted store, serving Ranking, Matching and Faceting.]

Accomplishments
• Real time sorting
• Real time filtering: PostFilter
– Higher latency
• Near real time filtering: cached DocIdSet
– No consistency between lookup and filtering
• Independent of Lucene commits
• Query latency comparable to DocValues
– Consistent 99% performance
  
Accomplishments @ Flipkart
●  Real time consumption for ~150 Signals
●  Reduction in shown out of stock products by 2X
●  Production instances of ~50K updates/second real time
  
Thank you & Questions
  

Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
