Find it! Nail it! Boosting e-commerce search conversions with machine learning at scale
October 28, 2017
Giuseppe “Pino” Di Fabbrizio
Rakuten Institute of Technology – Boston
• Motivations
• Traditional information retrieval models
• Learning-to-rank models
• Relevance
• Ranking Metrics
• Algorithms
• Ranking optimization
• Use cases
• Summary
• What is next?
Disclaimer: unless otherwise specified, images in this presentation are published under a Creative Commons (CC) license.
• E-commerce growing faster than the traditional brick-and-mortar market ($4.06T by 2020)
• Mobile shopping adoption increasing worldwide (46% of shoppers in Asia and 28% in North America)
• Online catalogs offering broader selections and competitive products
• Electronic money transactions gaining consumers’ trust
• Massive data collected during web and mobile interactions providing the foundation for machine-learning-driven optimizations
Key figures: 1.61B shoppers · $1.86T sales · $150B* revenues
*2016 combined revenues for Amazon, Otto Group, and Rakuten
Source: https://www.statista.com/topics/871/online-shopping/
250M+ Products
40k+ Categories
How do we find the most relevant products for a search query?
[Screenshot: search results, www.rakuten.com, Oct 10, 2017]
Query → Ranking function → Documents
[Screenshot: ranked product grid, positions 1–9; www.rakuten.com, Nov 2016]
• Relevance is estimated by lexical matches of query terms with document terms
• Examples:
• Boolean models
• Vector space models
• Latent semantic indexing
• Okapi BM25
[Diagram: off-line, Documents → Indexer → Index; on-line, Query → Scoring model (over the Index) → Top-n retrieved documents]
Example: query Q = “iphone 7 case” matched against Document 1 (D1) and Document 2 (D2) (www.rakuten.com, Oct 10, 2017).

Term counts:

        iphone    7    case
  Q        1      1      1
  D1       2      2      2
  D2       3      1      0

[Plot: Q, D1, and D2 as term-frequency vectors]
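A minimal sketch of how a vector space model would score this example, assuming plain term counts and cosine similarity (the production scoring model is not shown in these slides):

```python
# Vector-space-model sketch: the query and documents become term-count
# vectors over the vocabulary (iphone, 7, case), and lexical relevance
# is approximated by cosine similarity. Illustrative only.
import numpy as np

q  = np.array([1.0, 1.0, 1.0])  # Q:  "iphone 7 case"
d1 = np.array([2.0, 2.0, 2.0])  # D1 term counts from the table above
d2 = np.array([3.0, 1.0, 0.0])  # D2 term counts from the table above

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(q, d1))  # 1.0   -> D1 points in the same direction as Q
print(cosine(q, d2))  # ~0.73 -> D2 is a weaker lexical match
```

Note that D1 achieves a perfect score simply by repeating every query term twice, which motivates the term-frequency penalties discussed next.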
• Basic ideas
  • Lexical similarity metrics
  • Penalizing repeated occurrences of the same term
  • Penalizing term frequency for longer documents (see the BM25 sketch below)
• Only a few features
• Feature weights manually hand-tuned based on heuristics
• Cannot include important search signals such as user’s feedback, product popularity, purchase history, etc.
• Fast and scalable
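These penalties are exactly what Okapi BM25 encodes. A self-contained sketch using the standard textbook formulation with default parameters (not Rakuten’s implementation):

```python
# Okapi BM25 sketch: k1 saturates repeated term occurrences, and b normalizes
# by document length so long documents are not rewarded for raw term counts.
import math

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n        # average document length
    score = 0.0
    for term in set(query_terms):
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * len(doc) / avgdl) # length penalty
        score += idf * tf * (k1 + 1) / (tf + norm) # saturating term frequency
    return score

corpus = [["iphone", "7", "case"], ["iphone", "iphone", "iphone", "7"]]
print([round(bm25(["iphone", "7", "case"], d, corpus), 3) for d in corpus])
```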
• Data-driven approach
• Directly optimizes product rank based on relevance (different from classification and regression ML tasks)
• Handles thousands of features
• Robust to noisy data
• Handles personalization
• Industry & research state of the art (Amazon, eBay, Microsoft, Yahoo!, Yandex, etc.)
A document is relevant if it contains the information the user was looking for when submitting the query.
Relevance is subjective and depends on many factors:
• context (what is displayed and how)
• task (purchase, search info, answer, etc.)
• novelty (unexpected data, ads, etc.)
• time and user’s effort involved
[Screenshot: search results annotated 1, 3, 2; www.rakuten.com, Nov 2016]
[Screenshot: user actions on search results (click, add, buy); www.rakuten.com, Nov 2016]
• Clickthrough data (user’s implicit feedback) as a source of relevance for search query / document pairs
• Pros
  • Abundant and easy to harvest
  • Always fresh
  • Unbiased
• Cons
  • Noisy
  • Long-tail queries
• Simple relevance mapping (see the labeling sketch below):
  • score = 0 (not relevant) up to score = 3 (highly relevant)
  • Purchase > cart > click > impression

  Score   User’s implicit feedback
    3     Product purchased
    2     Product added to the shopping cart
    1     Product clicked
    0     No clicks
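A minimal sketch of this mapping as training-label extraction; the event names are hypothetical, and real clickstream logs will use different schemas:

```python
# Label each (query, product) impression with its strongest implicit-feedback
# event, following the purchase > cart > click > impression ordering above.
RELEVANCE = {"purchase": 3, "add_to_cart": 2, "click": 1, "impression": 0}

def relevance_label(events):
    """Grade a (query, product) pair by its strongest observed user action."""
    return max((RELEVANCE[e] for e in events), default=0)

print(relevance_label(["impression", "click", "add_to_cart"]))  # 2
print(relevance_label(["impression"]))                          # 0
```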
[Diagram: browser viewport over a results page, splitting products into seen, potentially seen, and unseen relative to a click; www.rakuten.com, Aug 2017]
Normalized Discounted Cumulative Gain (NDCG)
[Plot: NDCG on the y-axis (0–1) versus document rank position on the x-axis (1–10)]
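For reference, a short sketch of the metric in its common graded-relevance formulation (2^rel − 1 gains with log2 position discounts, normalized by the ideal ordering):

```python
# DCG@k = sum_i (2^rel_i - 1) / log2(i + 2); NDCG divides by the ideal DCG so
# a perfect ranking scores 1.0 regardless of how many relevant items exist.
import math

def dcg(relevances):
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels (purchase=3, cart=2, click=1, none=0) for two ranked lists:
print(ndcg([3, 2, 0, 1]))  # ~0.99: nearly ideal ordering
print(ndcg([0, 1, 2, 3]))  # ~0.55: relevant items ranked too low
```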
• Tree ensemble method
• Handles sparse data
• Handles missing values and various value types
• Robust to outliers
• Learns higher-order feature interactions
• Invariant to feature scaling
• Highly scalable and optimized open-source implementation (XGBoost; see the sketch below)
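A hedged sketch of how such a model can be trained with XGBoost’s learning-to-rank objective; the data here is synthetic and the hyperparameters are placeholders (the depth/estimator settings actually evaluated appear in the results table later):

```python
# Gradient tree boosting for ranking with XGBoost: documents are grouped by
# query, and the rank:ndcg objective optimizes an NDCG-based ranking loss.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 100 query-document feature vectors
y = rng.integers(0, 4, size=100)      # graded relevance labels 0..3
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([10] * 10)           # 10 queries with 10 candidates each

params = {"objective": "rank:ndcg", "eta": 0.1,
          "max_depth": 3, "eval_metric": "ndcg@10"}
model = xgb.train(params, dtrain, num_boost_round=500)
scores = model.predict(dtrain)        # per-document re-ranking scores
```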
Point-wise ranking
• Input: single documents / Output: class labels or scores (Q, Di)
• Classify each document as relevant or non-relevant
• Adjust w to reduce classification errors

Pair-wise ranking
• Input: document pairs / Output: partial-order preferences (Q, Di > Dj)
• Classify pairs of documents: is Di > Dj?
• Adjust w to reduce discordant pairs (see the sketch below)

List-wise ranking
• Input: document collections / Output: ranked document lists (Q, Di > Dj > Dk)
• Score permutations: is {D1, D2, …} > {D1’, D2’, …}?
• Adjust w to directly maximize the ranking measure of interest (NDCG)
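To make the pair-wise objective concrete, a small illustrative sketch of the quantity it minimizes, the number of discordant pairs:

```python
# A pair (i, j) is discordant when document i is more relevant than j but the
# model scores it no higher; pair-wise rankers adjust w to reduce this count.
def discordant_pairs(scores, labels):
    return sum(1
               for i in range(len(scores))
               for j in range(len(scores))
               if labels[i] > labels[j] and scores[i] <= scores[j])

labels = [3, 2, 1, 0]                                   # true graded relevance
print(discordant_pairs([0.9, 0.8, 0.3, 0.1], labels))   # 0: perfect ordering
print(discordant_pairs([0.1, 0.8, 0.3, 0.9], labels))   # 5 discordant pairs
```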
Green = relevant; gray = not relevant
Blue arrows = boost from the pair-wise loss function; red arrows = boost from the list-wise loss function
(a) is the perfect ranking; (b) is a ranking with 10 pairwise errors; (c) is a ranking with 8 pairwise errors
• Relevance: user’s behavior signals
• Ranking metric: NDCG
• Machine learning algorithm: gradient tree boosting
• Ranking optimization: list-wise with the NDCG metric
[Diagram: learning-to-rank pipeline. Off-line: Documents → Indexer → Index; relevance labels and query features form the training data for learning to rank, producing a re-ranking model. On-line: Query → Scoring model → top-n ranked documents (n > 1M) → re-ranking model → top-m re-ranked documents (m < 1k)]
[Screenshot: www.rakuten.com, Mar 2017]
Search query: “40inch tv”
[Side-by-side screenshots: regular text search, with three results marked not relevant, vs. search with user’s signals and learning-to-rank models]
Conversion rate (simulation), over 10,000 simulated queries:

                 NDCG     CTR
  Relative gain  15.58%   7.50%

  Depth / Estimators        5 / 500   3 / 500   10 / 500   3 / 500
  NDCG                      0.687     0.688     0.685      0.689
  Relative gain             15.14%    15.41%    14.92%     15.58%
  Training time (56 cores)  2:45:48   1:20:57   35:25:44   1:58:07
[Timeline: deep learning milestones by field. Automatic speech recognition: 2011; computer vision: 2013; natural language processing: 2013-2015; information retrieval: 2017?]
Figures from: Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match using Local and Distributed Representations of Text for Web Search. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17).
• Traditional IR methods do not scale to modern e-commerce needs
• User’s implicit feedback is a proxy for search query / document pair relevance
• Learning-to-rank (LTR) methods scale to thousands of features and are robust to data noise
• LTR with a list-wise loss function substantially improves search relevance (15.6% NDCG increase on e-commerce data)
• NDCG improvements directly correlate with conversion rates (7.5% CTR increase on e-commerce data)
• DNN methods for IR are starting to outperform traditional ML methods