Data science plays an important role across many departments at eBay, including search, recommendations, fraud detection, and more. The document discusses three case studies:
1. Query categorization uses deep learning models to predict relevant product categories for queries to improve search results.
2. Personalized query autocompletion ranks suggestions based on a user's search history and context to provide more relevant recommendations.
3. Spell correction efficiently generates and ranks candidate corrections using language models and error models to identify the most likely corrections for queries.
4. What is Data Science?
Data science is a multidisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data - Wikipedia
6. Why Data Science?
● Empowering management to make better decisions
● Directing actions based on trends, which in turn helps to define goals
● Challenging the staff to adopt best practices and focus on issues that matter
● Identifying opportunities
● Decision making with quantifiable, data-driven evidence
● Testing these decisions
● Identification and refining of target audiences
● Recruiting the right talent for the organization
Reference: https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
30. What is Query Categorization?
● Predict relevant product categories given a query
● Use high-confidence predictions to filter product listings
● Use confidence scores of the predictions to influence ranking
32. Deep Semantic Similarity Model
Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data”, CIKM, 2013
33. eBay Query Categorization
● Based on Convolutional Latent Semantic Model (CLSM)
○ Shen, He, Gao, Deng, Mesnil, “A latent semantic model with convolutional-pooling structure for information retrieval”, CIKM 2014
● Maximize the posterior probability of a category given a query
34. Training - Data Collection
● Collect query–product data with category, clicks, and transactions
● Confident set: queries with >= 90% of products in a single category
● Ambiguous set: remaining queries, subsampled by popularity
● Split into train/validation data
● Test data: confident set from a future period
35. Query Categorization in Action
● Directly use historic data when there is a sufficient amount
● Use an experimentally determined confidence score threshold to pick top predictions
● Fall back to the parent category or the entire inventory when there are no high-confidence predictions
● Baseline = ngrams + BM25 + attribute filtration
● Absolute scale obfuscated
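The threshold-and-fallback logic above can be sketched as follows. This is a minimal illustration, not eBay's production code: the threshold value, category names, and the `parent_of` mapping are all hypothetical.

```python
# Sketch of confidence-threshold filtering with fallback.
# The threshold is hypothetical; the slides say it is chosen experimentally.
CONFIDENCE_THRESHOLD = 0.85

def pick_categories(predictions, parent_of, threshold=CONFIDENCE_THRESHOLD):
    """predictions: list of (category, confidence), sorted descending.
    Returns categories to filter listings by, falling back to the parent
    category or the entire inventory when nothing is confident."""
    confident = [cat for cat, score in predictions if score >= threshold]
    if confident:
        return confident
    # Fallback 1: parent category of the best guess, if known
    if predictions:
        best_cat, _ = predictions[0]
        parent = parent_of.get(best_cat)
        if parent is not None:
            return [parent]
    # Fallback 2: no category filter at all (entire inventory)
    return []

preds = [("Cameras > DSLR", 0.91), ("Cameras > Lenses", 0.06)]
print(pick_categories(preds, parent_of={"Cameras > DSLR": "Cameras"}))
# -> ['Cameras > DSLR']
```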
36. FastCat - Faster Training & Inference
● Based on (Joulin et al., “Bag of tricks for efficient text classification”, arXiv, 2016)
○ Shallow network but deep learning - no feature engineering
○ Bag of ngrams as input
○ Hierarchical softmax in the output layer: log₂ V outputs to evaluate
● Data collected as before
● Training time: 20X faster
● Inference time: < 1 ms
● Runs on commodity hardware
● Comparable accuracy
[Figure: query ngrams W1, W2, …, Wn-1, Wn feed into a hidden layer that outputs a category]
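As a minimal illustration of the "bag of ngrams as input" idea (not eBay's FastCat code, and with made-up bucket counts and ngram sizes), a query can be mapped to hashed character-ngram bucket IDs; in a fastText-style model each bucket would index an embedding row, and the averaged rows feed a (hierarchical) softmax:

```python
import zlib

def ngram_buckets(query, n_sizes=(2, 3), num_buckets=1_000_000):
    """Map a query to a bag of hashed character-ngram bucket IDs.
    Uses crc32 so the hashing is deterministic across runs."""
    text = f"<{query.lower()}>"  # boundary markers, as fastText uses
    buckets = []
    for n in n_sizes:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n].encode("utf-8")
            buckets.append(zlib.crc32(gram) % num_buckets)
    return buckets

# "<iphone case>" has 13 characters: 12 bigrams + 11 trigrams = 23 buckets
print(len(ngram_buckets("iphone case")))
# -> 23
```

The hierarchical softmax then only needs to evaluate on the order of log₂ V of the V category outputs per query, which is where most of the inference speedup comes from.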
40. Why?
● Saves time for users
● Guides users to reach their products faster
● Avoids spelling errors
● Can help promote top products
41. Why is it Challenging & Fun?
● Millions of users
● A humongous number of queries per second
● Show relevant suggestions to users
● Detect spelling errors and provide corrected suggestions
42. Most Popular Completions - Overview
[Diagram: user prefix → Most Popular Completions (MPC) over query data → get top N queries]
43. Most Popular Completions - Naive Approach
● Show queries matching the prefix, ranked by popularity
● Popularity can be frequency or sales
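The naive approach can be sketched over a toy query log (the queries and counts below are made up; a real system aggregates billions of queries offline):

```python
from collections import Counter

# Toy query log with popularity = raw frequency.
query_counts = Counter({
    "iphone case": 900, "iphone charger": 700,
    "iphone 12": 400, "ipad case": 300,
})

def most_popular_completions(prefix, n=10):
    """Return the top-n logged queries starting with the prefix,
    ranked by popularity."""
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda qc: -qc[1])
    return [q for q, _ in matches[:n]]

print(most_popular_completions("iphone", n=2))
# -> ['iphone case', 'iphone charger']
```

Production systems replace the linear scan with a prefix index (trie, DAWG, etc.) so lookups stay fast at this scale.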
44. Personalized Query Autocompletion
● A user's queries in a session revolve around one or more intents
● Global query completions may be sub-optimal
Example session: Dslr camera → Canon dslr camera → Canon 5D Mark IV → Canon lenses
45. Personalized Re-Ranker Overview
[Diagram: user prefix → Most Popular Completions (MPC) over query data → top N queries → re-ranker using query features and user features]
Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight Representation of the User Context” [Under Review]
46. Data Collection
● Billions of User Sessions
● Capture user behavioral activity
○ Prefix
○ Query Clicked from Autocomplete
○ Previous Queries issued by user
○ Queries viewed and not clicked
○ Global performance of the query
48. Understanding User Context
● Features computed based on previous queries issued by the user
○ Textual features like ngrams, # of terms, frequency, session-based features, etc.
○ Similarity features based on text
○ Similarity features based on Vector representations
● Query Vectors can be learned by
○ Supervised - query transitions, queries from product co-clicks
○ Unsupervised - Word2Vec, fastText, GloVe
49. Model Training
● Positive Samples
○ Queries clicked in Autocomplete
● Negative Samples
○ Queries viewed and not clicked in Autocomplete
● Train a Machine Learned Ranking Model
○ Ref: https://en.wikipedia.org/wiki/Learning_to_rank
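Using clicked suggestions as positives and viewed-but-skipped ones as negatives, a pairwise learning-to-rank dataset can be constructed as sketched below. This is an illustration with made-up log entries; feature extraction (the query and user features from the previous slides) is left out:

```python
def pairwise_examples(impressions):
    """Turn autocomplete impressions into pairwise LTR training data.
    impressions: list of (prefix, clicked_query, viewed_queries).
    Yields (prefix, positive, negative) triples: a ranking model is
    trained to score `positive` above `negative` for the same prefix."""
    for prefix, clicked, viewed in impressions:
        for other in viewed:
            if other != clicked:
                yield (prefix, clicked, other)

logs = [("iph", "iphone case", ["iphone case", "iphone 12", "ipad case"])]
print(list(pairwise_examples(logs)))
# -> [('iph', 'iphone case', 'iphone 12'), ('iph', 'iphone case', 'ipad case')]
```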
50. Evaluation
● MRR, Success Rate, MAP & nDCG
○ 20% - 30% lift over MPC**
○ 5% - 10% lift over non-personalized re-ranker**
** Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight Representation of the User Context” [Under Review]
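Mean Reciprocal Rank, one of the metrics listed above, rewards placing the clicked query near the top of the suggestion list; a standard implementation (the example sessions are made up):

```python
def mean_reciprocal_rank(results):
    """results: list of (ranked_suggestions, clicked_query) pairs.
    MRR = average of 1/rank of the clicked item (0 if absent)."""
    total = 0.0
    for ranked, clicked in results:
        if clicked in ranked:
            total += 1.0 / (ranked.index(clicked) + 1)
    return total / len(results)

sessions = [
    (["iphone case", "iphone 12"], "iphone 12"),   # rank 2 -> 0.5
    (["ipad case", "ipad pro"], "ipad case"),      # rank 1 -> 1.0
]
print(mean_reciprocal_rank(sessions))
# -> 0.75
```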
55. Why is it Challenging and Fun?
● Special - Query spell correction for user-generated item information
● Big - Millions of users, billions of items
● Efficiency - Need to process a humongous number of queries per second
● Precision - Suggest the right correction for the right query
62. Efficiency
● Generate only the candidates we know?
[Figure: a trie containing “tap”, “taps”, “top”, “tops”; source: http://ajainarayanan.github.io/ctrlf/]
Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn, “Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees”, EMNLP 2015
63. Efficiency - Which one?
● Naïve: Slow, no memory footprint, unnecessary candidates (?)
● Trie: Faster, Huge memory footprint
● DAWG: Even Faster, Not-that-huge memory footprint
● Suffix Trees (not compressed): Humongous memory footprint
● Suffix Trees (compressed): Slowest, very small memory footprint
64. Language Model
● How likely is the candidate - p(c)?
● p(c1 c2 c3 … cn)? = p(levis blue jeans 32 in)?
● Naive algorithm - look for the number of occurrences of the given query
○ What if we have never seen the query?
○ Long queries will have low counts, leading to poor probability estimates
● Markov assumption - second order
○ p(c1 c2 c3 … cn) = p(c1)p(c2|c1)p(c3|c1 c2) … p(cn|cn-2 cn-1)
65. Language Model
● p(levis blue jeans 32 in) = p(levis)p(blue|levis)p(jeans|levis blue)p(32|blue jeans)p(in|jeans 32)
● p(blue|levis) = count(levis,blue) / count(levis)
● Now we have to only deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign some probability
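The second-order Markov factorization above can be computed directly from ngram counts. A toy maximum-likelihood version is sketched below; the corpus is made up, and note that unseen ngrams still get zero probability, which is exactly the smoothing issue the slide raises:

```python
from collections import Counter

# Toy corpus; a real system counts ngrams over billions of queries.
corpus = ["levis blue jeans", "levis blue jacket", "blue jeans 32 in"]

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
    trigrams.update(zip(words, words[1:], words[2:]))

def p(query):
    """MLE probability under a second-order Markov assumption:
    p(c1..cn) = p(c1) p(c2|c1) prod_i p(ci|ci-2 ci-1)."""
    w = query.split()
    total = sum(unigrams.values())
    prob = unigrams[w[0]] / total
    if len(w) > 1:
        prob *= bigrams[(w[0], w[1])] / unigrams[w[0]] if unigrams[w[0]] else 0
    for i in range(2, len(w)):
        tri = trigrams[(w[i - 2], w[i - 1], w[i])]
        bi = bigrams[(w[i - 2], w[i - 1])]
        prob *= tri / bi if bi else 0
    return prob

# p(levis) * p(blue|levis) * p(jeans|levis blue) = 0.2 * 1.0 * 0.5
print(p("levis blue jeans"))
# -> 0.1
```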
66. Error Model
● p(query|correction)?
● How likely is it that the user intended to type the correction but typed the query?
● Multiple ways to estimate this
○ Keyboard distance
○ Phonetic distance
○ Mine your logs
67. Error Model
Industry approach
● To train an error model we need triples of (intended word, observed word, count)
● We would expect
○ p(the|the) to be very high
○ p(teh|the) to be relatively high
○ p(hippopotamus|the) to be extremely low
68. Error Model
● Get 10 million most frequent unigrams
● Get all the candidates at certain edit distance (depending on word length)
● This gives a huge list of tuples like <apple, applo>
● Assumption: the top 10 million unigrams are generally correct
● Prune this list based on frequency - e.g. apple should be at least 10x more frequent than applo
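Candidate pairs at a small edit distance can be generated Norvig-style; a sketch for distance 1 is below (the slides vary the allowed distance with word length, and the `known` set here is a tiny stand-in for the 10 million unigram list):

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1 from `word`
    (deletes, transposes, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

known = {"apple", "ample", "apply"}   # stand-in for the 10M unigram list
print(sorted(edits1("applo") & known))
# -> ['apple', 'apply']
```

Intersecting the generated candidates with the known-unigram list is what keeps the tuple list from exploding.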
71. Students & Recent Graduates
https://careers.ebayinc.com/join-our-team/students-recent-graduates/
Start your Career @ eBay
https://careers.ebayinc.com/join-our-team/start-your-search/
74. Language Model
● p(levis blue jeans 32 in) = p(levis)p(blue|levis)p(jeans|levis blue)p(32|blue jeans)p(in|jeans 32)
● p(blue|levis) = count(levis,blue) / count(levis)
● Now we have to only deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign some probability
○ Adjustment of probabilities to demote high-frequency words - the, a, etc.
○ Backoff scores - KenLM (https://kheafield.com/code/kenlm/)
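KenLM implements Kneser-Ney smoothed backoff; a much simpler illustration of the backoff idea is "stupid backoff" (Brants et al.), sketched here with made-up counts — when a trigram is unseen, fall back to the bigram score scaled by a constant penalty:

```python
from collections import Counter

# Toy ngram counts; real models are estimated from massive query logs.
unigrams = Counter({"levis": 5, "blue": 8, "jeans": 6})
bigrams = Counter({("levis", "blue"): 4, ("blue", "jeans"): 5})
trigrams = Counter({("levis", "blue", "jeans"): 3})
TOTAL = sum(unigrams.values())
ALPHA = 0.4  # backoff penalty suggested in the stupid-backoff paper

def score(w, context):
    """Stupid-backoff score of word w given up to two context words.
    Not a true probability, but usable for ranking candidates."""
    if len(context) == 2 and trigrams[(*context, w)]:
        return trigrams[(*context, w)] / bigrams[context]
    if context and bigrams[(context[-1], w)]:
        return ALPHA * bigrams[(context[-1], w)] / unigrams[context[-1]]
    return ALPHA * ALPHA * unigrams[w] / TOTAL

print(score("jeans", ("levis", "blue")))   # seen trigram: 3/4 = 0.75
print(score("jeans", ("denim", "blue")))   # backs off: 0.4 * 5/8 = 0.25
```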