Anatomy of an eCommerce Search Engine by Mayur Datar

In this talk, the Chief Data Scientist of Flipkart will cover the challenges of running an e-commerce search platform, such as scale, recency, update rates, and business shaping. He will also explain the overall system architecture of the search platform and go into the details of some of its sub-systems, including the query understanding and rewriting sub-system.

Published in: Software

  1. ● Search is one of the most important discovery tools in e-commerce.
     ● It powers other features such as merchandising (promotions) and recommendations.
     ● It accounts for a big fraction of the units sold and GMV.
  2. ● Important signals that affect search: price, offers, popularity, availability, serviceability, etc.
     ● These signals are used in the ranking of products.
     ● They are exposed as filters and sorts to end users.
     ● They are very dynamic, particularly during sales.
  3. ● E-commerce search != web search.
     ● Documents have a structure to them.
     ● Queries have an implicit structure.
     ● Challenges:
       ○ Large document collection with a long heavy tail
       ○ Extremely high rate of changes/updates (thousands per second)
       ○ Geo-specific ranking
       ○ Multi-objective optimization (GMV, units, ads revenue, long-term value)
     ● Opportunities:
       ○ Broad queries: personalization can play a huge role
  4. ● Queries per day: XXX Millions / week
     ● Latencies:
       ○ Average: ~100 ms
       ○ Median: ~50 ms
       ○ 90th percentile: ~500 ms
     ● Documents retrieved and scored from index:
       ○ Median: 1K to 10K
       ○ 95th percentile: 200K to 500K
       ○ 99th percentile: 500K to 3M+
     ● Search CTR: around 50%
  5. ● Architectural overview of the search platform
       ○ Serving and ingestion
       ○ Serving functional view
       ○ Serving architectural view
       ○ Ingestion architectural view
       ○ Example ingestion topology
     ● Search quality
       ○ Challenges
       ○ Life of a query: typical flow for query understanding
       ○ Illustrative problems
  6. ● 1,000,000 compute cores
     ● 2.56 petabytes RAM
     ● 120 petabytes disk storage
     ● 1 petabyte NVMe SSD
     ● 128 Tbps bisection bandwidth Clos network
  7. ● Query Rewriter: spell check, concept, NLP, intent, augmentation, retrieval/scoring query formulation
     ● Reverse Proxy: geo coding, user context, caching, isolation, rate limit, tee-off test framework
     ● Search Broker: distributed search across shards, blending of results from shards
     ● Searcher: matching, scoring, faceting, top-K retrieval (pass-1 ranking)
     ● Text index, NRT index, metadata
     ● Re-ranking (pass-2 ranking): ML model
     ● Pluggable ranking models
     ● Pluggable rewriter modules
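
The broker and two-pass ranking stages on this slide can be illustrated with a minimal sketch. It is not the platform's actual code: the document features (`text_match`, `popularity`), the function names, and the blended model are all hypothetical, and real shards would be queried in parallel over the network rather than in a loop.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, float]   # hypothetical document: feature name -> value
Hit = Tuple[float, str]  # (pass-1 score, doc_id)


def search_shard(shard: Dict[str, Doc], k: int) -> List[Hit]:
    """Searcher: cheap pass-1 score on one shard, keeping only the local top-k."""
    scored = ((doc["text_match"], doc_id) for doc_id, doc in shard.items())
    return heapq.nlargest(k, scored)


def broker(shards: List[Dict[str, Doc]], k: int) -> List[Hit]:
    """Search broker: query every shard, then blend the partial lists into a global top-k."""
    partials = [search_shard(shard, k) for shard in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))


def rerank(hits: List[Hit], docs: Dict[str, Doc], model: Callable[[Doc], float]) -> List[str]:
    """Pass 2: apply a costlier, pluggable model to the small candidate set only."""
    return sorted((doc_id for _, doc_id in hits), key=lambda d: model(docs[d]), reverse=True)


def blended_model(doc: Doc) -> float:
    """Hypothetical pass-2 model: equal-weight blend of two made-up features."""
    return 0.5 * doc["text_match"] + 0.5 * doc["popularity"]


# Made-up data: two shards with two documents each.
shards = [
    {"a": {"text_match": 0.9, "popularity": 0.1}, "b": {"text_match": 0.5, "popularity": 0.9}},
    {"c": {"text_match": 0.8, "popularity": 0.5}, "d": {"text_match": 0.1, "popularity": 1.0}},
]
all_docs = {doc_id: doc for shard in shards for doc_id, doc in shard.items()}

candidates = broker(shards, k=3)
final_order = rerank(candidates, all_docs, blended_model)
print(candidates)   # pass-1 order: a, c, b
print(final_order)  # pass-2 order: b, c, a
```

Because each shard returns its own top-k, the global top-k is guaranteed to be contained in the union of the partial lists, which is what lets the broker blend without seeing every document.
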
  8. Serving: architectural view
  9. ● Architectural overview of the search platform
       ○ Serving and ingestion
       ○ Serving functional view
       ○ Serving architectural view
       ○ Ingestion architectural view
       ○ Example ingestion topology
     ● Search quality
       ○ Challenges
       ○ Life of a query: typical flow for query understanding
       ○ Illustrative problems
  10. ● Marketplace
        ○ Catalog entries vary in quality from seller to seller. Spam is rampant.
      ● Diversity of users
      ● Mobile-heavy users: real estate on UI
      ● Poor internet connectivity
  11. ● Literacy/Internet awareness
      ● Language
      ● Economic power
      ● Regional preferences
      Abstraction: City-tier
      ● Query/Intent Solicitation
      ● Result Presentation
      ● Product Ranking
  12. 40% increase in proportion of tier-3 customers vis-a-vis metro
  13. ● Query: samsang. Relative ratio of query, Tier-3 vs Metro: 1.8
      ● Query: jins. Relative ratio of query, Tier-3 vs Metro: 2.2
  14. ● Normalisation (index time as well): string clean-up, lower
      ● Spell Correction: resource-based (term->term, query->query), online
      ● Phrasing (index time as well): frequent bi/tri grams
      ● Stemming (index time as well): core e-commerce stemmer, plurals
      ● Synonyms: resource-based
      ● Intent: deductions, tagging (CRF)
      ● Query Rewrite: best query selection, partial match
      ● Common MetaData Store (query level):
        ○ Raw data: metrics (CTR, Impression, NDCG…)
        ○ Derived data: Store, LM score, Features
      ● Other diagram labels: Query Scoring, Init Context, SOLR interface, Query Understanding Output Generator, Retrieval ranking logic, Store Classifier, Query LM, Feature Store, Classification
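
A few of the stages on this slide (normalisation, phrasing, stemming of plurals) can be chained as in the minimal sketch below. The rules are toy stand-ins, not the actual e-commerce stemmer or phrase resource, and the function names and the bigram set are hypothetical. Spell correction, synonyms, and intent are left out.

```python
import re
from typing import List, Set, Tuple


def normalise(query: str) -> str:
    """String clean-up and lowercasing: drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s-]", " ", query.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def phrase(tokens: List[str], known_bigrams: Set[Tuple[str, str]]) -> List[str]:
    """Greedily merge adjacent tokens that form a known frequent bigram."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_bigrams:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def stem_plural(token: str) -> str:
    """Toy plural stripper standing in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def understand(query: str, known_bigrams: Set[Tuple[str, str]]) -> List[str]:
    """Normalise, detect phrases, then stem only the tokens that are not part of a phrase."""
    tokens = normalise(query).split()
    units = phrase(tokens, known_bigrams)
    return [u if " " in u else stem_plural(u) for u in units]


# Hypothetical frequent-bigram resource with a single entry.
print(understand("  Red Tape SHOES!! ", {("red", "tape")}))  # ['red tape', 'shoe']
```

Phrasing runs before stemming in this sketch so that a multi-word unit such as a brand name is not altered token by token. The slide notes that normalisation, phrasing, and stemming also run at index time, which keeps query terms and indexed terms comparable.
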
  15. Approaches, ordered from specific to general:
      • Special patterns:
        – Segmented words: lgnexus5
      • Counting: “samsang” & no-click followed by “samsung” & click, a million times
        – Context-aware counting
      • Language modeling and edit distance
      • Term-to-vector models in deep learning
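
The counting idea on this slide (a query with no click, followed by a similar query with a click) can be sketched as follows. The session format, the thresholds, and the function names are assumptions made for illustration; the edit-distance filter is one simple way to keep only pairs that look like spelling variants.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, bool]  # (query, clicked) - hypothetical log format


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mine_corrections(
    sessions: Iterable[List[Event]], min_count: int = 2, max_dist: int = 2
) -> Dict[str, str]:
    """Count (no-click query -> clicked follow-up query) pairs that are close in edit distance."""
    pair_counts: Counter = Counter()
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if not clicked1 and clicked2 and q1 != q2 and edit_distance(q1, q2) <= max_dist:
                pair_counts[(q1, q2)] += 1
    corrections: Dict[str, str] = {}
    # most_common() yields pairs by descending count, so the first hit per query is its best rewrite.
    for (bad, good), n in pair_counts.most_common():
        if n >= min_count and bad not in corrections:
            corrections[bad] = good
    return corrections


# Made-up sessions for illustration.
sessions = [
    [("samsang", False), ("samsung", True)],
    [("samsang", False), ("samsung", True)],
    [("samsang", False), ("shoes", True)],  # unrelated follow-up, filtered by edit distance
    [("jins", False), ("jeans", True)],     # seen only once, below min_count
]
print(mine_corrections(sessions))  # {'samsang': 'samsung'}
```
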
  16. ● Intent: from query tokens to the (implicit) attributes that are represented by those tokens
      ● Examples:
        ○ “red tape shoes” -> (brand) “red tape” (store) “shoes”
        ○ “kids party dress 4-5 years pack of 2” -> (ideal_for) “kids” (occasion) “party” (store) “dress” (size) “4-5 years” (pack_of) “pack of 2”
        ○ “samsung e6 cases” -> (compatible_with) “samsung e6” (store) “cases”
      ● Techniques: memorization, language modeling, CRF
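
Of the three techniques listed, memorization is the simplest to illustrate: look up the longest known phrase at each position in a dictionary of phrase-to-attribute mappings. The sketch below assumes a tiny hand-written lexicon (hypothetical, including the extra "red" as a colour entry added to show why longest match matters); it does not cover the language-model or CRF approaches.

```python
from typing import Dict, List, Tuple

# Hypothetical memorized lexicon: phrase -> attribute.
LEXICON: Dict[str, str] = {
    "red tape": "brand",
    "red": "color",
    "shoes": "store",
    "samsung e6": "compatible_with",
    "cases": "store",
}


def tag_intent(query: str, lexicon: Dict[str, str], max_len: int = 3) -> List[Tuple[str, str]]:
    """Greedy longest-match tagging: returns (attribute, phrase) pairs in query order."""
    tokens = query.lower().split()
    tags: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so "red tape" wins over "red".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon:
                tags.append((lexicon[span], span))
                i += n
                break
        else:
            tags.append(("unknown", tokens[i]))
            i += 1
    return tags


print(tag_intent("red tape shoes", LEXICON))    # [('brand', 'red tape'), ('store', 'shoes')]
print(tag_intent("samsung e6 cases", LEXICON))  # [('compatible_with', 'samsung e6'), ('store', 'cases')]
```

A lookup like this cannot resolve tokens it has never seen or choose between readings by context, which is presumably where the language-modeling and CRF approaches on the slide come in.
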
  17. Users’ activity on the platform (past orders, product views) -> customised search ranking for user segment
  18. Price Personalization
      ● Users’ past activity on the platform: past orders, product views.
      ● 5 price ranges (1 to 5, economical to expensive) defined for each vertical, e.g. shoes, watches.
      ● User segments based on price affinities.
      ● Customised search ranking for each user segment.
      (Chart: # of users per price range.)
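
The price-personalization flow on this slide can be sketched under a few assumptions: each vertical has four hypothetical price boundaries that split it into five ranges, a user's segment is the range they interacted with most often, and ranking simply prefers products near that range. The slides do not say how segments are actually derived or used in ranking, so every name and number below is illustrative.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, List

# Hypothetical boundaries between price ranges 1..5 for each vertical.
PRICE_BOUNDS: Dict[str, List[float]] = {
    "shoes": [500, 1000, 2000, 4000],
    "watches": [1000, 2500, 5000, 10000],
}


def price_range(vertical: str, price: float) -> int:
    """Map a price to a range 1 (economical) .. 5 (expensive) for the given vertical."""
    return bisect_right(PRICE_BOUNDS[vertical], price) + 1


def user_segment(vertical: str, past_prices: List[float]) -> int:
    """Most frequent price range in the user's past orders/views; ties go to the cheaper range.

    Assumes at least one past price is available.
    """
    counts = Counter(price_range(vertical, p) for p in past_prices)
    return min(counts, key=lambda r: (-counts[r], r))


def personalised_rank(results: List[dict], vertical: str, segment: int) -> List[dict]:
    """Stable re-ordering: products whose price range is closest to the segment come first."""
    return sorted(results, key=lambda item: abs(price_range(vertical, item["price"]) - segment))


# Made-up user history and result list.
segment = user_segment("shoes", [450, 700, 800, 3000])
results = [{"id": "p1", "price": 5000}, {"id": "p2", "price": 900}, {"id": "p3", "price": 300}]
ranked = personalised_rank(results, "shoes", segment)
print(segment)                     # 2
print([r["id"] for r in ranked])   # ['p2', 'p3', 'p1']
```

Since `sorted` is stable, products at the same distance from the user's segment keep their original relative order, so the base relevance ranking still decides ties.
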