Role of Data Science
in eCommerce
Manojkumar Rangasamy Kannadasan
eBay Inc
June 2019
1
Agenda
● Background
● Fast Facts about eBay
● Data Science in eCommerce
● Data Science @ eBay Search
● Case Studies
2
Background
3
What is Data Science?
Data science is a multidisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data - Wikipedia
4
5
Reference:
https://datascience.berkeley.edu
/about/what-is-data-science/
Why Data Science?
● Empowering management to make better decisions
● Directing actions based on trends—which in turn help to define goals
● Challenging the staff to adopt best practices and focus on issues that matter
● Identifying opportunities
● Decision making with quantifiable, data-driven evidence
● Testing these decisions
● Identifying and refining target audiences
● Recruiting the right talent for the organization
6
Reference: https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
Fast Facts about eBay
7
8
9
10
Frequency of Product Purchases
11
Data Science in eCommerce
Objective
12
● Help users find and discover products to purchase
● Maximize revenue / profit per user session
Data Science in Different Departments
● Search
● SEO
● Trust / Fraud / Abuse
● Selling
● Shipping
● Pricing
● Merchandising
13
● Ads / Marketing
● Structured Data
● Inventory Management
● Machine Translation
● Coupons & Rewards
● Customer Service
● Infrastructure
14
Data Science @ eBay Search
Data Science @ eBay Search
15
● Text Search
● Faceted Search
● Image Search
● Voice Search
● Conversational Search
● Recommendations
Data Science @ eBay Search
16
● Text Search
● Faceted Search
● Image Search
● Voice Search
● Conversational Search
● Recommendations
17
Query Autocompletion
18
Query Understanding
19
Ranking
20
Faceted Search
21
Spell Correction
22
Recommendations
23
Recommendations
24
Recommendations
25
Recommendations
26
Recommendations
Questions?
27
28
Case Studies
29
Query Categorization
Team Members: M. Liu, X. Liu & E. Luo
What is Query Categorization?
30
● Predict relevant product categories given a query
● Use high-confidence predictions to filter product listings
● Use confidence scores of the predictions to influence ranking
Why?
● 1.2 Billion Listings
● ~20K Categories & ~35 Verticals
31
Deep Semantic Similarity Model
32
Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data”, CIKM, 2013
33
eBay Query Categorization
● Based on Convolutional Latent Semantic Model (CLSM)
○ Shen, He, Gao, Deng, Mesnil, “A latent semantic model with
convolutional-pooling structure for IR,” CIKM 2014
● Maximize the posterior probability of a category given a query
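For reference, the posterior being maximized has the DSSM/CLSM form from the papers cited above, where y_Q and y_C are the learned query and category vectors and gamma is a smoothing factor (a sketch of the formulation, not eBay's exact loss):

P(C \mid Q) = \frac{\exp\left(\gamma \cos(y_C, y_Q)\right)}{\sum_{C'} \exp\left(\gamma \cos(y_{C'}, y_Q)\right)}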
Training - Data Collection
● Test Data: Confident set from a future period
34
Query-product data (category, clicks, transactions) is split into:
● Confident set: queries with >= 90% of products in a single category
● Ambiguous set: the rest
The confident set is subsampled by popularity to form the train/validation data.
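A minimal sketch of how a confident/ambiguous split of this kind could be computed; the (query, category) input rows and the 90% threshold follow the slide, everything else is an illustrative assumption:

from collections import Counter, defaultdict

def split_confident_ambiguous(rows, threshold=0.9):
    """rows: iterable of (query, product_category) pairs from clicks/transactions."""
    per_query = defaultdict(Counter)
    for query, category in rows:
        per_query[query][category] += 1
    confident, ambiguous = {}, set()
    for query, counts in per_query.items():
        top_category, top_count = counts.most_common(1)[0]
        if top_count / sum(counts.values()) >= threshold:
            confident[query] = top_category   # single dominant category becomes the training label
        else:
            ambiguous.add(query)
    return confident, ambiguous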
Query Categorization in Action
35
● Directly use historic data if there is a sufficient amount
● Use an experimentally determined confidence score threshold to pick the top predictions
● Fall back to the parent category or the entire inventory when there are no high-confidence predictions (see the sketch after this slide)
● Baseline = ngrams + BM25 + attribute filtration
● Absolute scale obfuscated
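A minimal sketch of the serving-time logic described on this slide; the helper interfaces, threshold, and minimum-history values are hypothetical, not eBay's production code:

def categories_for_query(query, historic_lookup, model, parent_of, threshold=0.85, min_history=100):
    """historic_lookup: query -> (categories, sample_size); model.predict: query -> [(category, confidence), ...];
    parent_of: category -> parent category. All three are assumed interfaces."""
    categories, sample_size = historic_lookup.get(query, ([], 0))
    if sample_size >= min_history:
        return categories                                  # enough historic data: use it directly
    predictions = model.predict(query)
    confident = [c for c, p in predictions if p >= threshold]
    if confident:
        return confident                                   # high-confidence model predictions
    parents = {parent_of(c) for c, _ in predictions}
    return list(parents) or None                           # fall back to parent categories, or None = entire inventory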
FastCat - Faster Training & Inference
36
● Based on Joulin et al., "Bag of tricks for efficient text classification", arXiv, 2016
○ Shallow network, but still deep learning - no feature engineering
○ Bag of ngrams as input
○ Hierarchical softmax in the output layer: only log2(V) outputs to evaluate per example (see the sketch after this slide)
● Data collected as before
● Results: training time 20x faster, inference time < 1 ms, runs on commodity hardware, comparable accuracy
(Figure: shallow network - query words W1, W2, ..., Wn-1, Wn feed a single hidden layer that predicts the category)
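A minimal sketch of this recipe using the open-source fastText library, which implements the Joulin et al. approach the slide cites; the training file name, label format, and hyperparameters are illustrative assumptions, not eBay's production setup:

import fasttext

# Training file (hypothetical): one example per line, "__label__<categoryId> <query text>"
model = fasttext.train_supervised(
    input="query_category_train.txt",
    loss="hs",          # hierarchical softmax: ~log2(V) outputs evaluated per example
    wordNgrams=2,       # bag of word ngrams as input features
    dim=100,
    epoch=5,
)

labels, probs = model.predict("canon dslr camera", k=3)  # top-3 categories with confidence scores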
Questions?
37
38
Personalized Query
Autocompletion
Team Members: Manojkumar R Kannadasan, Grigor Aslanyan
39
Query Autocompletion
Why?
40
● Saves time for users
● Guides users to reach their products faster
● Avoids spelling errors
● Can help promote top products
Why is it Challenging & Fun?
● Millions of Users
● A humongous number of queries per second
● Show relevant suggestions to users
● Detect spelling errors and provide corrected suggestions
41
Most Popular Completions - Overview
42
User prefix and query data feed the Most Popular Completions (MPC) module, which returns the top N queries for the prefix.
Most Popular Completions - Naive Approach
● Show queries matching the prefix, ranked by popularity
● Popularity can be query frequency or sales (a minimal lookup sketch follows)
43
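A minimal sketch of this naive most-popular-completions lookup, with toy popularity data standing in for the real query logs:

from collections import defaultdict

def build_mpc_index(query_popularity, top_n=10):
    """Map every prefix to its top-N queries by popularity (frequency or sales)."""
    index = defaultdict(list)
    for query, score in query_popularity.items():
        for i in range(1, len(query) + 1):
            index[query[:i]].append((score, query))
    return {prefix: [q for _, q in sorted(cands, reverse=True)[:top_n]]
            for prefix, cands in index.items()}

mpc = build_mpc_index({"dslr camera": 900, "dslr lens": 300, "dress": 1200})
print(mpc["d"])     # ['dress', 'dslr camera', 'dslr lens']
print(mpc["dslr"])  # ['dslr camera', 'dslr lens']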
Personalized Query Autocompletion
● A user's queries in a session revolve around one or more intents
● Global (non-personalized) query completions may be suboptimal
44
Example session: "Dslr camera" → "Canon dslr camera" → "Canon 5D Mark IV" → "Canon lenses"
Personalized Re-Ranker Overview
45
User prefix and query data feed the Most Popular Completions (MPC) module, which returns the top N queries; a re-ranker then reorders these candidates using query features and user features.
Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, "Personalized Query Auto-Completion Through a Lightweight Representation of the User Context" [Under Review]
Data Collection
● Billions of User Sessions
● Capture user behavioral activity
○ Prefix
○ Query Clicked from Autocomplete
○ Previous Queries issued by user
○ Queries viewed and not clicked
○ Global performance of the query
46
Understanding User Context
47
Example session: "Dslr camera" → "Canon dslr camera" → "Canon 5D Mark IV" → "Canon lenses"; the user then starts typing "C"
Understanding User Context
● Features computed from the previous queries issued by the user
○ Textual features such as ngrams, number of terms, frequency, session-based signals, etc.
○ Similarity features based on text
○ Similarity features based on vector representations (see the sketch after this slide)
● Query Vectors can be learned by
○ Supervised - query transitions, queries from product co-clicks
○ Semi-Supervised - Word2Vec, fastText, GloVe
48
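A minimal sketch of a vector-based similarity feature of this kind; word_vectors is a hypothetical dictionary of pretrained embeddings (e.g. Word2Vec, fastText, or GloVe), and the averaging scheme is an illustrative choice:

import numpy as np

def embed(query, word_vectors):
    """Average the word vectors of a query; returns None if no word is in the vocabulary."""
    vecs = [word_vectors[w] for w in query.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def context_similarity(candidate, previous_queries, word_vectors):
    """Max cosine similarity between a candidate completion and the user's recent queries."""
    c = embed(candidate, word_vectors)
    sims = []
    for q in previous_queries:
        v = embed(q, word_vectors)
        if c is not None and v is not None:
            sims.append(float(np.dot(c, v) / (np.linalg.norm(c) * np.linalg.norm(v))))
    return max(sims) if sims else 0.0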
Model Training
● Positive Samples
○ Queries clicked in Autocomplete
● Negative Samples
○ Queries viewed and not clicked in Autocomplete
● Train a Machine Learned Ranking Model
○ Ref: https://en.wikipedia.org/wiki/Learning_to_rank
49
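A minimal sketch of a learning-to-rank setup of this shape, using LightGBM's LambdaRank as a stand-in ranker; the feature matrix, labels, and group sizes are synthetic placeholders (in practice they come from the logged prefix impressions described above):

import numpy as np
import lightgbm as lgb

# One row per (prefix impression, candidate completion) pair.
X = np.random.rand(1000, 12)             # query + user-context features (placeholder)
y = np.random.randint(0, 2, size=1000)   # 1 = clicked in autocomplete, 0 = viewed only
group = [10] * 100                       # 100 impressions, 10 suggestions each, in row order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200, learning_rate=0.05)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:10])          # score the 10 MPC candidates of one impression
reranked_order = np.argsort(-scores)     # personalized re-ranked order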
Evaluation
● MRR, Success Rate, MAP & nDCG
○ 20% - 30%** lift over MPC
○ 5% - 10%** lift over the non-personalized re-ranker
50
** Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight
Representation of the User Context” [Under Review]
Questions?
51
52
Spell Correction
Team Members: Utkarsh Porwal & Roberto Konow
Why?
53
● Product names can be difficult to spell
● Users will appreciate the help
● Sellers will appreciate the help
● It is challenging and fun!
54
Spell Correction
Why is it Challenging and Fun?
● Special - query spell correction over user-generated item information
● Big - millions of users, billions of items
● Efficiency - need to process a humongous number of queries per second
● Precision - suggest the right correction for the right query
55
Overview
56
Query → Candidate Generation (efficiency) → Language Model (big & special) + Error Model (big & special) → Ranking (precision) → Corrected Query
Mathematical Formulation
57
Reference: http://norvig.com/spell-correct.html
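For reference, the noisy-channel objective from the cited Norvig write-up, in the notation of the following slides (P(c) is the language model and P(q|c) the error model over candidates c for query q):

\hat{c} = \arg\max_{c \in \mathrm{candidates}(q)} P(c)\, P(q \mid c)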
Efficiency
58
Query: "top" - candidates at edit distance 1:
● n deletions
● n-1 transpositions
● 26n alterations
● 26(n+1) insertions
● Total: 54n+25 candidates, e.g. qop, op, sop, ..., thp, tap, tkp, ..., tpn, tops
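A minimal sketch of this edit-distance-1 candidate generation, following the Norvig article cited earlier (not eBay's production code):

import string

def edits1(word):
    """All strings at edit distance 1: deletes, transposes, replaces (alterations), inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]                            # n
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]  # n-1
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]      # 26n
    inserts = [L + c + R for L, R in splits for c in letters]                # 26(n+1)
    return set(deletes + transposes + replaces + inserts)                    # 54n+25 before deduplication

print(sorted(edits1("top"))[:5])  # ['aop', 'atop', 'bop', 'btop', 'cop']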
Efficiency
59
Query: "top" - generate only the ones we know, i.e. filter the edit-distance-1 candidates (qop, op, sop, ..., thp, tap, tkp, ..., tpn, tops) against a known vocabulary
Efficiency
60
Generate only the ones we know?
(Figure: a dictionary data structure over the example vocabulary tap, taps, top, tops. Source: Wikipedia)
Efficiency
61
Generate only the ones we know?
(Figure: a dictionary data structure over the example vocabulary tap, taps, top, tops. Source: Wikipedia)
Efficiency
62
Generate only the ones we know?
(Figure: a dictionary data structure over tap, taps, top, tops. Source: http://ajainarayanan.github.io/ctrlf/)
Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn, "Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees", EMNLP 2015
Efficiency - Which one?
● Naïve: Slow, no memory footprint, unnecessary candidates (?)
● Trie: Faster, Huge memory footprint
● DAWG: Even Faster, Not-that-huge memory footprint
● Suffix Trees (not compressed): Humongous memory footprint
● Suffix Trees (compressed): Slowest, very small memory footprint
63
Language Model
● How likely is the candidate - p(c)?
● p(c1 c2 c3 ... cn)? = p(levis blue jeans 32 in)?
● Naive algorithm - look up the number of occurrences of the given query
○ What if we have never seen the query?
○ Long queries will have low counts, leading to poor probability estimates
● Markov assumption - second order
○ p(c1 c2 c3 ... cn) = p(c1) p(c2|c1) p(c3|c1 c2) ... p(cn|cn-2 cn-1)
64
Language Model
● p(levis blue jeans 32 in) = p(levis) p(blue|levis) p(jeans|levis blue) p(32|blue jeans) p(in|jeans 32)
● p(blue|levis) = count(levis, blue) / count(levis)
● Now we only have to deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign them some probability
65
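A minimal sketch of such a count-based trigram model over a toy query log (no smoothing or backoff, so unseen ngrams get probability 0, which is exactly the issue noted above and revisited in the appendix slide):

from collections import Counter

queries = ["levis blue jeans 32 in", "levis blue jeans", "blue jeans 32 in"]  # hypothetical query log

bigrams, trigrams = Counter(), Counter()
for q in queries:
    w = q.split()
    bigrams.update(zip(w, w[1:]))
    trigrams.update(zip(w, w[1:], w[2:]))

def cond_prob(w1, w2, w3):
    """p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    denom = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / denom if denom else 0.0

print(cond_prob("levis", "blue", "jeans"))  # 1.0 in this toy log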
Error Model
● p(query|correction)?
● How likely is it that the user wanted to type the correction but typed the query?
● Multiple ways to estimate this
○ Keyboard distance
○ Phonetic distance
○ Mine your logs
66
Error Model
Industry approach
● To train an error model we need triples of (intended word, observed word,
count)
● We would expect
○ p(the|the) to be very high
○ p(teh|the) to be relatively high
○ p(hippopotamus|the) to be extremely low
67
Error Model
● Get the 10 million most frequent unigrams
● Get all candidates within a certain edit distance (depending on word length)
● This gives a huge list of tuples, e.g. <apple, applo>
● The assumption is that the top 10 million unigrams are generally spelled correctly
● Prune this list based on frequency - e.g. apple should be at least 10x more frequent than applo
68
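A minimal sketch of that mining step, reusing edits1() from the candidate-generation sketch above; the 10x ratio follows the slide, the rest is an illustrative assumption:

from collections import Counter

def mine_error_pairs(unigram_counts, ratio=10):
    """unigram_counts: Counter over the most frequent unigrams from the logs (hypothetical input).
    Returns (intended, observed) -> count triples for fitting p(query|correction)."""
    vocab = set(unigram_counts)
    pairs = Counter()
    for observed, observed_count in unigram_counts.items():
        for intended in edits1(observed) & vocab:        # edits1() defined in the earlier sketch
            if unigram_counts[intended] >= ratio * observed_count:
                pairs[(intended, observed)] += observed_count
    return pairs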
Questions?
69
Hiring @ eBay
70
Students & Recent Graduates
https://careers.ebayinc.com/join-our-team/students-recent-graduates/
71
Start your Career @ eBay
https://careers.ebayinc.com/join-our-team/start-your-search/
Q & A
mkannadasan@ebay.com
72
73
Language Model
● p(levis blue jeans 32 in) = p(levis) p(blue|levis) p(jeans|levis blue) p(32|blue jeans) p(in|jeans 32)
● p(blue|levis) = count(levis, blue) / count(levis)
● Now we only have to deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign them some probability
○ Adjust probabilities to demote high-frequency words - the, a, etc.
○ Backoff scores - KenLM (https://kheafield.com/code/kenlm/)
74