PhD Thesis
Information Retrieval Models
for Recommender Systems
Author: Daniel Valcarce
Advisors: Álvaro Barreiro & Javier Parapar
A Coruña, May 8th, 2019
Information Retrieval Lab
Computer Science Department
University of A Coruña
Outline
1. Introduction
2. Evaluation
3. Top-N recommendation
4. Other recommendation tasks
5. Pseudo-relevance feedback
6. Conclusions
1
Introduction
Research aim
Recommender Systems are active Information Filtering systems that
present items that their users may be interested in.
Information Retrieval systems obtain items of information relevant
to the users’ information needs.
Both Information Retrieval and Information Filtering fields:
⊚ cope with enormous amounts of information,
⊚ provide relevant information to their users,
⊚ can offer personalization.
This PhD Thesis revolves around the idea of exploiting Information
Retrieval models in Recommender Systems.
3
Information Retrieval vs Information Filtering
Information Retrieval (IR)
⊚ Goal: retrieve documents
relevant to the users’
information needs.
⊚ Systems: search engines
(web, multimedia...).
⊚ Input: the user’s query
(explicit).
Information Filtering (IF)
⊚ Goal: select relevant items
for the users from an
information stream.
⊚ Systems: spam filters,
recommender systems.
⊚ Input: the user’s profile
(implicit).
4
IR and IF: two sides of the same coin?
Some people consider them different fields:
⊚ U. Hanani, B. Shapira and P. Shoval: Information Filtering:
Overview of Issues, Research and Systems. User Modeling and
User-Adapted Interaction (2001).
While others consider them the same thing:
⊚ N. J. Belkin and W. B. Croft: Information filtering and information
retrieval: two sides of the same coin? Communications of the
ACM (1992).
What is undeniable is that they are closely related.
⊚ Why not apply techniques from one field to the other?
5
Overview of thesis contributions
Information Retrieval (IR)
⊚ Evaluation within the
Cranfield paradigm
⊚ Ad hoc retrieval
⊚ Pseudo-relevance feedback
Recommender Systems (RS)
⊚ Evaluation of top-N
recommendation
⊚ Neighborhood computation
⊚ Recommendation
Ranking metrics are commonly used in IR and RS.
Following previous work in IR, we study the robustness and
discriminative power of these metrics in recommendation.
Neighborhood-based techniques are a family of RS.
We show that ad hoc retrieval models can compute neighborhoods
effectively.
Pseudo-relevance feedback (PRF) provides automatic query
expansion.
We adapt PRF techniques to diverse recommendation tasks.
Sparse linear methods are very effective recommenders.
We propose a PRF model based on sparse linear methods that
achieves state-of-the-art effectiveness.
6
Evaluation
Top-N Recommendation
[Figure illustrating the top-N recommendation task.]
Recommender Systems evaluation
Online evaluation (e.g., A/B testing)
⊚ expensive,
⊚ measures real user behavior.
Offline evaluation (the approach followed in this thesis)
⊚ cheap,
⊚ highly reproducible,
⊚ usually constitutes the first step before deploying a
recommender system.
9
Offline evaluation of RS
When evaluating RS, which metric should we use?
⊚ Many types: error, ranking accuracy, diversity, novelty, etc.
⊚ Ranking accuracy metrics are the most popular.
⊚ These metrics have been traditionally used in IR.
⊚ However, IR and RS evaluation assumptions are quite different:
Information Retrieval
⊚ relevance is independent
of users,
⊚ relevance judgments are
(almost) complete.
Recommender Systems
⊚ relevance depends
on the users,
⊚ relevance judgments are
far from complete.
10
Evaluation
Study of metrics
Ranking metrics study
Precision, Recall, MAP, NDCG, MRR, BPref, InfAP...
Many ranking accuracy metrics have been studied in IR.
We now study their behavior in top-N recommendation.
Two perspectives:
⊚ discriminative power,
⊚ robustness to incompleteness:
◦ sparsity bias,
◦ popularity bias.
12
Robustness to incompleteness
Sparsity bias
⊚ Sparsity is intrinsic to the
recommendation task.
⊚ We take random
subsamples from the test
set to increase the bias.
Popularity bias
⊚ Missing-not-at-random
(long tail distribution).
⊚ We remove the most
popular items to study
the bias.
We measure the robustness of a metric by computing Kendall's correlation between system rankings when changing the amount of bias.
13
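As an illustration of this protocol, here is a minimal sketch (not taken from the thesis; the nDCG values and system names are made up) that computes Kendall's τ between the system ranking obtained on the full test set and the ranking obtained on a biased subsample:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical nDCG@100 values of five recommenders on the full test set
# and on a random 10% subsample of the test ratings (sparsity bias).
full = {"A": 0.49, "B": 0.48, "C": 0.42, "D": 0.30, "E": 0.22}
subsampled = {"A": 0.47, "B": 0.48, "C": 0.40, "D": 0.28, "E": 0.21}

systems = sorted(full)
# Kendall's tau between the orderings induced by the two sets of values.
tau, _ = kendalltau([full[s] for s in systems],
                    [subsampled[s] for s in systems])
print(f"Kendall's tau between system rankings: {tau:.2f}")
```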
Discriminative power
⊚ A metric is discriminative when its differences in value are
statistically significant.
⊚ We use the permutation test with difference in means as test
statistic.
⊚ We run the statistical test between all possible system pairs.
⊚ We plot the obtained p-values sorted by decreasing value.
14
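A minimal sketch of the paired permutation test with the difference in means as the test statistic; the per-user metric values are synthetic and only illustrate how each system pair is compared:

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=7):
    """Two-sided paired permutation test with the difference in means as the
    test statistic; a[i] and b[i] are the per-user metric values of two
    recommenders for the same user."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a) - np.asarray(b)
    observed = abs(diff.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.size)  # randomly swap each pair
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return hits / n_perm  # p-value

# Hypothetical per-user nDCG@100 values of two systems.
rng = np.random.default_rng(0)
ndcg_a = rng.uniform(0.2, 0.6, size=500)
ndcg_b = ndcg_a + rng.normal(0.01, 0.05, size=500)
print("p-value:", permutation_test(ndcg_a, ndcg_b))
```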
Evaluation
Experiments
Comparing cut-offs of the same metric (nDCG) 1/4
        @5   @10  @20  @30  @40  @50  @60  @70  @80  @90  @100
@5     1.00 0.95 0.93 0.92 0.92 0.92 0.92 0.91 0.90 0.90 0.90
@10    0.95 1.00 0.98 0.97 0.97 0.97 0.97 0.96 0.95 0.95 0.95
@20    0.93 0.98 1.00 0.99 0.99 0.99 0.99 0.98 0.97 0.97 0.97
@30    0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@40    0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@50    0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@60    0.92 0.97 0.99 1.00 1.00 1.00 1.00 0.99 0.98 0.98 0.98
@70    0.91 0.96 0.98 0.99 0.99 0.99 0.99 1.00 0.99 0.99 0.99
@80    0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
@90    0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
@100   0.90 0.95 0.97 0.98 0.98 0.98 0.98 0.99 1.00 1.00 1.00
Correlation between cut-offs of nDCG.
16
Comparing cut-offs of the same metric (nDCG) 2/4
[Line plot: Kendall's τ (y-axis, 0.85-1.00) vs % of ratings in the test set (x-axis, 100 down to 0), one curve per nDCG cut-off from @5 to @100.]
Kendall’s correlation among systems evaluated with nDCG when
increasing the sparsity bias.
17
Comparing cut-offs of the same metric (nDCG) 3/4
[Line plot: Kendall's τ (y-axis, 0.0-1.0) vs % of least popular items in the test set (x-axis, 100 down to 80), one curve per nDCG cut-off from @5 to @100.]
Kendall’s correlation among systems evaluated with nDCG when
changing the popularity bias.
18
Comparing cut-offs of the same metric (nDCG) 4/4
[p-value curves: p-value (y-axis, 0.0-1.0) vs pairs of recommender systems (x-axis, 0-25), one curve per nDCG cut-off from @5 to @100.]
Discriminative power of nDCG measured with p-value curves.
19
Comparing metrics with cut-off @100 1/4
           Precision Recall  MAP   nDCG  MRR   Bpref  InfAP
Precision    1.00     0.89   0.87  0.89  0.71  0.89   0.91
Recall       0.89     1.00   0.87  0.90  0.72  0.90   0.92
MAP          0.87     0.87   1.00  0.96  0.84  0.92   0.92
nDCG         0.89     0.90   0.96  1.00  0.82  0.94   0.96
MRR          0.71     0.72   0.84  0.82  1.00  0.80   0.80
Bpref        0.89     0.90   0.92  0.94  0.80  1.00   0.96
InfAP        0.91     0.92   0.92  0.96  0.80  0.96   1.00
Correlation between metrics at cut-off @100.
20
Comparing metrics with cut-off @100 2/4
[Line plot: Kendall's τ (0.85-1.00) vs % of ratings in the test set (100 down to 0), one curve per metric: Precision, Recall, MAP, nDCG, MRR, Bpref, InfAP.]
Kendall’s correlation among systems when increasing
the sparsity bias.
21
Comparing metrics with cut-off @100 3/4
[Line plot: Kendall's τ (0.0-1.0) vs % of least popular items in the test set (100 down to 80), one curve per metric.]
Kendall’s correlation among systems when increasing
the popularity bias.
22
Comparing metrics with cut-off @100 4/4
[p-value curves: p-value vs pairs of recommender systems, one curve per metric.]
Discriminative power measured with p-value curves.
23
Evaluation
Implications
Findings
⊚ Deep cut-offs offer greater robustness and discriminative power
than shallow cut-offs.
⊚ Precision offers high robustness to sparsity and popularity
biases and good discriminative power.
⊚ nDCG provides the best discriminative power, high robustness to the sparsity bias, and moderate robustness to the popularity bias.
25
Experimental settings: metrics
We measure three recommendation dimensions:
⊚ Ranking accuracy: nDCG@100.
◦ nDCG is robust and discriminative.
◦ nDCG models graded relevance.
⊚ Diversity: Gini@100.
◦ The Gini index measures item recommendation inequality.
⊚ Novelty: MSI@100.
◦ Mean self-information quantifies the unexpectedness of the recommendations.
26
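To make the three metrics concrete, here is a small sketch of one possible implementation; the exact discount, the Gini orientation and the self-information normalization are assumptions of the sketch and may differ from the thesis' exact definitions (items are assumed to be integer ids):

```python
import numpy as np
from collections import Counter

def ndcg_at_k(recommended, relevant, k=100):
    """Binary-relevance nDCG@k for one user (discount 1/log2(rank+1) assumed)."""
    dcg = sum(1.0 / np.log2(r + 2)
              for r, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

def gini_at_k(recommendations, n_items, k=100):
    """Gini index of the item exposure counts over all top-k lists
    (0 = every item recommended equally often, 1 = maximal concentration)."""
    counts = Counter(i for recs in recommendations for i in recs[:k])
    x = np.sort(np.array([counts.get(i, 0) for i in range(n_items)], float))
    cum = np.cumsum(x)
    return (len(x) + 1 - 2 * (cum / cum[-1]).sum()) / len(x) if cum[-1] > 0 else 0.0

def msi_at_k(recommendations, item_popularity, n_users, k=100):
    """Mean self-information: -log2 of the items' popularity, summed over each
    top-k list and averaged over users (one possible formulation)."""
    per_user = [sum(-np.log2(item_popularity[i] / n_users) for i in recs[:k])
                for recs in recommendations]
    return float(np.mean(per_user))
```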
Experimental settings: datasets
Dataset Users Items Ratings Density
MovieLens 100k 943 1682 100 000 6.305 %
MovieLens 1M 6040 3706 1 000 209 4.468 %
MovieLens 10M 71 567 10 681 10 000 054 1.308 %
R3-Yahoo 15 400 1000 365 703 2.375 %
LibraryThing 7279 37 232 749 401 0.277 %
BeerAdvocate 33 388 66 055 1 571 808 0.071 %
Ta-Feng 32 266 23 812 817 741 0.106 %
27
Top-N recommendation
Recommender Systems
Recommendation algorithms can be classified into:
⊚ Content-based: find items similar to those the target user liked,
using the item descriptions.
⊚ Collaborative filtering: relies on user-item interactions.
⊚ Hybrid: combination of content-based and collaborative
filtering approaches.
29
Collaborative filtering
Collaborative filtering (CF) exploits user-item feedback:
⊚ Explicit: ratings, reviews, etc.
⊚ Implicit: clicks, purchases, check-ins, etc.
Two main families of CF methods:
⊚ Model-based: learn a predictive model from the data.
⊚ Neighborhood-based (or memory-based): directly use the
user-item feedback to compute recommendations.
30
Neighborhood-based methods
Two perspectives:
⊚ User-based: recommend items that users with common
interests liked.
⊚ Item-based: recommend items similar to those you liked.
Similarity between items is computed using common users
among items (not the content!).
Two phases:
⊚ neighborhood computation,
⊚ recommendation generation.
31
Top-N recommendation
Pseudo-relevance feedback models
for recommendation
Pseudo-relevance feedback (PRF)

[Diagram: the user's information need is formulated as a query and submitted to the retrieval system; a query expansion step takes the query and the top retrieved results and produces an expanded query, which is submitted to the retrieval system again.]
PRF for Recommendation
Pseudo-relevance feedback        Neighborhood-based recommenders
User's query                     User's profile
most^1, populated^1, state^2     Titanic^2, Avatar^3, Watchmen^5
Documents                        Neighbors
Terms                            Items
34
Top-N recommendation
Relevance models
Relevance models
Relevance-based language models or, simply, relevance models (RM)
are state-of-the-art PRF methods [Lavrenko & Croft, SIGIR ’01]:
⊚ RM1: i.i.d. sampling,
⊚ RM2: conditional sampling.
RM has been adapted to user-based CF [Parapar et al., IPM ’13].
36
Relevance models for CF

RM2: p(i|R_u) ∝ p(i) ∏_{j∈I_u} ∑_{v∈V_u} [p(i|v) p(v) / p(i)] p(j|v)

⊚ I_u is the set of items rated by the user u
⊚ V_u is the neighborhood of the user u, computed with kNN and cosine similarity
⊚ the conditional probabilities p(i|v) are computed by smoothing the maximum likelihood estimate with the probability in the collection
⊚ p(i) and p(v) are the item and user priors
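A minimal sketch of how this estimate can be evaluated in log space once the smoothed probabilities and the priors are available; the names and data structures below are illustrative, not the thesis' implementation:

```python
import numpy as np

def rm2_scores(target_items, neighbors, p_item_given_user, p_user, p_item, candidates):
    """Minimal RM2 scoring for one target user (a sketch).

    target_items:       I_u, items rated by the target user
    neighbors:          V_u, the target user's neighborhood
    p_item_given_user:  dict (item, user) -> smoothed p(i|v)
    p_user, p_item:     user and item priors, e.g. uniform dictionaries
    candidates:         items to score (typically those not rated by u)
    """
    scores = {}
    for i in candidates:
        log_score = np.log(p_item[i])
        for j in target_items:
            s = sum(p_item_given_user.get((i, v), 0.0)
                    * p_user[v] / p_item[i]
                    * p_item_given_user.get((j, v), 0.0)
                    for v in neighbors)
            if s <= 0.0:          # the whole product vanishes for this candidate
                log_score = -np.inf
                break
            log_score += np.log(s)
        scores[i] = log_score     # log of p(i|R_u) up to a constant
    return scores
```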
Smoothing in RM2

RM2: p(i|R_u) ∝ p(i) ∏_{j∈I_u} ∑_{v∈V_u} [p(i|v) p(v) / p(i)] p(j|v)

To compute the conditional probabilities, we smooth the maximum likelihood estimate (MLE):

p_mle(i|u) = r(u,i) / ∑_{j∈I_u} r(u,j)

with the probability in the collection:

p(i|C) = ∑_{v∈U} r(v,i) / ∑_{j∈I} ∑_{v∈U} r(v,j)
38
Why use smoothing?
In IR [Zhai & Lafferty, TOIS 2004], smoothing provides:
⊚ a way to deal with data sparsity,
⊚ inverse document frequency (IDF) effect,
⊚ document length normalization.
In RS, we have the same problems:
⊚ data sparsity,
⊚ item popularity/specificity,
⊚ user profiles with different sizes.
39
Smoothing techniques

Jelinek-Mercer smoothing (JMS): linear interpolation controlled by λ.
p_λ(i|u) = (1 − λ) p_mle(i|u) + λ p(i|C)

Dirichlet priors smoothing (DPS): Bayesian analysis with parameter µ.
p_µ(i|u) = [r(u,i) + µ p(i|C)] / [µ + ∑_{j∈I_u} r(u,j)]

Absolute discounting smoothing (ADS): subtract a constant δ.
p_δ(i|u) = [max(r(u,i) − δ, 0) + δ |I_u| p(i|C)] / ∑_{j∈I_u} r(u,j)

Additive smoothing (AS): increase all the ratings by γ > 0.
p_γ(i|u) = [r(u,i) + γ] / [∑_{j∈I_u} r(u,j) + γ |I|]
40
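These four estimates translate directly into code; a sketch with the hyperparameters as keyword arguments (the default values are arbitrary):

```python
def p_jms(r_ui, user_sum, p_ic, lam=0.5):
    """Jelinek-Mercer: interpolate the MLE with the collection model."""
    mle = r_ui / user_sum if user_sum > 0 else 0.0
    return (1 - lam) * mle + lam * p_ic

def p_dps(r_ui, user_sum, p_ic, mu=100.0):
    """Dirichlet priors smoothing."""
    return (r_ui + mu * p_ic) / (mu + user_sum)

def p_ads(r_ui, user_sum, p_ic, profile_size, delta=0.1):
    """Absolute discounting: subtract delta and give that mass to the collection model."""
    return (max(r_ui - delta, 0.0) + delta * profile_size * p_ic) / user_sum

def p_as(r_ui, user_sum, n_items, gamma=0.1):
    """Additive smoothing: increase every rating by gamma."""
    return (r_ui + gamma) / (user_sum + gamma * n_items)
```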
IDF effect
In IR, the IDF effect:
⊚ measures term specificity in most weighting schemes,
⊚ was born as a heuristic but was given theoretical justification.
In RS, item specificity is related to item novelty.
IDF effect in recommendation
⊚ Let u be a user from the set of users U;
⊚ let Vu be their neighborhood;
⊚ given two items i1 and i2 with:
◦ the same ratings r(v, i1) = r(v, i2) ∀ v ∈ Vu,
◦ different popularity p(i1|C) < p(i2|C);
⊚ a recommender system that outputs p(i1|Ru) > p(i2|Ru) is said to
support the IDF effect.
41
Smoothing: axiomatic analysis of the IDF effect
We analyze axiomatically the IDF effect in RM2 when using different
smoothing methods:
Smoothing method IDF effect?
Jelinek-Mercer ×
Dirichlet priors ×
Absolute discounting ×
Additive ✓
We expect additive smoothing to offer better figures of novelty.
42
Smoothing: ranking accuracy
[Plot: nDCG@100 (0.28-0.50) vs smoothing parameter (δ, λ, µ × 10^3 on the lower x-axis; γ on the upper x-axis), one curve per smoothing method.]
Figure: nDCG@100 values of RM2 varying the smoothing method on
MovieLens 100k. Also evaluated in MovieLens 1M, R3-Yahoo and LibraryThing.
43
Smoothing: diversity
[Plot: Gini@100 (0.08-0.26) vs smoothing parameter, one curve per smoothing method.]
Figure: Gini@100 values of RM2 varying the smoothing method on
MovieLens 100k. Also evaluated in MovieLens 1M, R3-Yahoo and LibraryThing.
44
Smoothing: novelty
[Plot: MSI@100 (130-180) vs smoothing parameter, one curve per smoothing method.]
Figure: MSI@100 values of RM2 varying the smoothing method on MovieLens
100k. Also evaluated in MovieLens 1M, R3-Yahoo and LibraryThing.
45
Priors in RM2

RM2: p(i|R_u) ∝ p(i) ∏_{j∈I_u} ∑_{v∈V_u} [p(i|v) p(v) / p(i)] p(j|v)

p(i) and p(v) are the item and user priors:
⊚ they allow us to introduce a priori information into the model,
⊚ they provide a principled way of modeling business rules,
⊚ they are similar to document priors used in IR, such as:
◦ the linear document length prior [Kraaij et al., SIGIR ’02],
◦ the probabilistic document length prior [Blanco & Barreiro, ECIR ’08].
46
User prior estimators

Uniform:
p_U(u) = 1 / |U|

Linear:
p_L(u) = p(u|C) = ∑_{i∈I_u} r(u,i) / ∑_{v∈U} ∑_{j∈I_v} r(v,j)

Probabilistic prior using Jelinek-Mercer smoothing:
p_PJMS(u) = (1 − λ) + λ ∑_{i∈I_u} p(i|C)

Probabilistic prior using Dirichlet priors smoothing:
p_PDPS(u) = [∑_{i∈I_u} r(u,i) + µ ∑_{i∈I_u} p(i|C)] / [µ + ∑_{i∈I_u} r(u,i)]

Probabilistic prior using absolute discounting smoothing:
p_PADS(u) = [∑_{i∈I_u} max(r(u,i) − δ, 0) + δ |I_u| ∑_{i∈I_u} p(i|C)] / ∑_{j∈I_u} r(u,j)

Probabilistic prior using additive smoothing:
p_PAS(u) = [∑_{i∈I_u} r(u,i) + γ |I_u|] / [∑_{j∈I_u} r(u,j) + γ |I|]
47
Item prior estimators

Uniform:
p_U(i) = 1 / |I|

Linear:
p_L(i) = p(i|C) = ∑_{u∈U_i} r(u,i) / ∑_{j∈I} ∑_{v∈U_j} r(v,j)

Probabilistic prior using Jelinek-Mercer smoothing:
p_PJMS(i) = (1 − λ) + λ ∑_{u∈U_i} p(u|C)

Probabilistic prior using Dirichlet priors smoothing:
p_PDPS(i) = [∑_{u∈U_i} r(u,i) + µ ∑_{u∈U_i} p(u|C)] / [µ + ∑_{u∈U_i} r(u,i)]

Probabilistic prior using absolute discounting smoothing:
p_PADS(i) = [∑_{u∈U_i} max(r(u,i) − δ, 0) + δ |U_i| ∑_{u∈U_i} p(u|C)] / ∑_{v∈U_i} r(v,i)

Probabilistic prior using additive smoothing:
p_PAS(i) = [∑_{u∈U_i} r(u,i) + γ |U_i|] / [∑_{v∈U_i} r(v,i) + γ |U|]
48
Priors: evaluation
RM2 Metric ML 100k ML 1M R3-Yahoo LibraryThing
U-U
nDCG 0.4936 0.4242 0.0706 0.2206
Gini 0.2470 0.1352 0.3006 0.0390
MSI 175.94 172.14 303.87 331.05
U-PJMS
nDCG 0.4953* 0.4296* 0.0717* 0.2385*
Gini 0.2637 0.1637 0.4769 0.0319
MSI 180.45* 182.75* 339.65* 417.57*
Table: Comparison of RM2 method using uniform user and item priors (U-U)
or a uniform user prior and a probabilistic item prior estimate with
Jelinek-Mercer smoothing (U-PJMS). Best values in pink. Statistically
significant improvements (permutation test p < 0.05) with a *.
49
Top-N recommendation
Rocchio framework
Previous Work on Adapting PRF Methods to CF

Relevance models are very effective recommenders but have:
⊚ high computational cost,
⊚ several hyperparameters to tune,
⊚ different smoothing and prior choices to be made.

RM1: p(i|R_u) ∝ ∑_{v∈V_u} p(v) p(i|v) ∏_{j∈I_u} p(j|v)

RM2: p(i|R_u) ∝ p(i) ∏_{j∈I_u} ∑_{v∈V_u} [p(i|v) p(v) / p(i)] p(j|v)
51
Popular approaches to pseudo-relevance feedback
⊚ Relevance models
[Lavrenko & Croft, SIGIR ’01]
⊚ Scoring functions based on the Rocchio framework
[Rocchio, 1971; Carpineto et al., ACM TOIS ’01]
⊚ Divergence minimization model
[Zhai & Lafferty, SIGIR ’06]
⊚ Mixture models
[Tao & Zhai, SIGIR ’06]
52
Scoring functions from Rocchio framework

Rocchio Weights (RW):
p_RW(i|u) = ∑_{v∈V_u} r(v,i) / |V_u|

Robertson Selection Value (RSV):
p_RSV(i|u) = p(i|V_u) ∑_{v∈V_u} r(v,i) / |V_u|

CHI2:
p_CHI2(i|u) = [p(i|V_u) − p(i|C)]² / p(i|C)

Kullback–Leibler Divergence (KLD):
p_KLD(i|u) = p(i|V_u) log [ p(i|V_u) / p(i|C) ]
53
Probability estimators

Maximum likelihood estimate (MLE)
MLE of a multinomial distribution over the ratings:

p_mle(i|V_u) = ∑_{v∈V_u} r(v,i) / ∑_{v∈V_u} ∑_{j∈I} r(v,j)

p_mle(i|C) = ∑_{u∈U} r(u,i) / ∑_{u∈U} ∑_{j∈I} r(u,j)
54
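Using these estimates, the four Rocchio-style scoring functions reduce to a few lines; a sketch in which the probabilities and the neighborhood statistics are assumed to be precomputed:

```python
import numpy as np

def rocchio_score(p_i_Vu, p_i_C, r_sum_i, n_neighbors, scheme="KLD"):
    """One of the four scoring functions for an item i and a target user u.

    p_i_Vu:      p(i|V_u), MLE of i over the neighborhood ratings
    p_i_C:       p(i|C), MLE of i over the whole collection
    r_sum_i:     sum of the neighbors' ratings for i
    n_neighbors: |V_u|
    """
    if scheme == "RW":       # Rocchio weights
        return r_sum_i / n_neighbors
    if scheme == "RSV":      # Robertson selection value
        return p_i_Vu * r_sum_i / n_neighbors
    if scheme == "CHI2":
        return (p_i_Vu - p_i_C) ** 2 / p_i_C
    if scheme == "KLD":      # pointwise Kullback-Leibler divergence
        return p_i_Vu * np.log(p_i_Vu / p_i_C) if p_i_Vu > 0 else 0.0
    raise ValueError(f"unknown scheme: {scheme}")
```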
Neighborhood size normalization (I)
Neighborhoods are computed using clustering algorithms:
⊚ Hard clustering: every user appears in only one cluster. Clusters
may have different sizes. Example: k-means.
⊚ Soft clustering: each user has its own neighbors. When we set k
to a high value, we may find different amounts of neighbors.
Example: kNN algorithm.
Idea: why not consider the variability of neighborhood sizes?
⊚ Large neighborhoods are equivalent to queries with many results: the collection model is closer to the target user.
⊚ Small neighborhoods imply that neighbors are highly specific:
the collection is very different from the target user.
55
Neighborhood size normalization (II)

Normalized MLE (NMLE)
We bias the MLE to perform neighborhood size normalization:

p_nmle(i|V_u) ∝ (1/|V_u|) · ∑_{v∈V_u} r(v,i) / ∑_{v∈V_u} ∑_{j∈I} r(v,j)

p_nmle(i|C) ∝ (1/|U|) · ∑_{u∈U} r(u,i) / ∑_{u∈U} ∑_{j∈I} r(u,j)
56
Rocchio: efficiency
[Bar chart: recommendation time per user in seconds, log scale from 0.001 to 1, for RM2, RW, RSV, KLD and CHI2 on ML 100k, ML 1M and ML 10M.]
Figure: Recommendation time per user (in logarithmic scale) using RM2, RW,
RSV, CHI2 and KLD algorithms on the MovieLens 100k, 1M and 10M datasets.
57
Rocchio: ranking accuracy

Method       ML 100k        ML 1M          R3-Yahoo     LibraryThing
RM2          0.4953^bcdefg  0.4296^bcdefg  0.0717^bcd   0.2385^bcg
RW           0.4827^cdef    0.4114^cdef    0.0704^d     0.2182^c
RSV          0.4825^def     0.4112^def     0.0703^d     0.2180
CHI2-MLE     0.2916         0.2775         0.0628       0.2605^abcfg
CHI2-NMLE    0.4639^df      0.3966^df      0.0726^bcdf  0.2610^abcfg
KLD-MLE      0.4207^d       0.3393^d       0.0709^d     0.2543^abcg
KLD-NMLE     0.4839^def     0.4195^bcdef   0.0715^bcd   0.2337^bc

Table: Values of nDCG@100. Statistically significant improvements (permutation test p < 0.05) with respect to RM2, RW, RSV, CHI2-MLE, CHI2-NMLE, KLD-MLE and KLD-NMLE are superscripted with a, b, c, d, e, f and g, respectively. Best values in pink.
58
Rocchio: diversity
Method ML 100k ML 1M R3-Yahoo LibraryThing
RM2 0.2637 0.1637 0.4769 0.0319
RW 0.2341 0.1331 0.2937 0.0348
RSV 0.2338 0.1329 0.2940 0.0346
CHI2-MLE 0.3745 0.3895 0.4429 0.1496
CHI2-NMLE 0.2947 0.1677 0.4136 0.1128
KLD-MLE 0.3168 0.3190 0.6064 0.0891
KLD-NMLE 0.2806 0.1540 0.3037 0.0669
Table: Values of Gini@100. Best values in pink.
59
Rocchio: novelty
Method ML 100k ML 1M R3-Yahoo LibraryThing
RM2 180.45 182.75 339.65 417.57
RW 172.72 171.87 302.82 326.95
RSV 172.60 171.80 302.91 326.69
CHI2-MLE 233.63 262.21 333.12 442.55
CHI2-NMLE 190.77 188.34 327.74 400.18
KLD-MLE 199.23 237.88 371.56 396.31
KLD-NMLE 185.27 179.59 306.48 359.25
Table: Values of MSI@100. Best values in pink.
60
Top-N recommendation
Improving neighborhoods
Neighborhood-based methods
Neighborhood-based methods usually are:
⊚ simple,
⊚ efficient,
⊚ explainable.
But their effectiveness relies largely on the quality of the neighbors.
The most common approach is to compute the k nearest neighbors
(kNN algorithm) using a pairwise similarity.
62
Weighted sum recommender (WSR)

NNCosNgbr [Cremonesi et al., RecSys ’10]
r̂_{u,i} = b_{u,i} + ∑_{j∈J_i} shrunk_cosine(i,j) (r(u,j) − b_{u,i})

Item-based weighted sum recommender (WSR-IB)
r̂_{u,i} = ∑_{j∈J_i} cos(i,j) r(u,j)

User-based weighted sum recommender (WSR-UB)
r̂_{u,i} = ∑_{v∈V_u} cos(u,v) r(v,i)
63
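A sketch of WSR-UB on a dense rating matrix (the item-based variant is symmetric); a real implementation would use sparse matrices and a precomputed cosine similarity, so the names and layout below are only illustrative:

```python
import numpy as np

def wsr_user_based(R, sim_users, u, k=100):
    """User-based weighted sum: r_hat(u, i) = sum_v cos(u, v) * r(v, i),
    restricted to the k nearest neighbours of u.

    R:         |U| x |I| rating matrix (numpy array, 0 = unrated)
    sim_users: |U| x |U| cosine similarity matrix between user profiles
    """
    sims = sim_users[u].copy()
    sims[u] = 0.0                          # exclude the target user
    neighbors = np.argsort(-sims)[:k]      # the k most similar users (V_u)
    scores = sims[neighbors] @ R[neighbors, :]
    scores[R[u] > 0] = -np.inf             # do not re-recommend rated items
    return np.argsort(-scores)             # item ranking for user u
```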
Experiments with WSR
Method Metric ML 100k ML 1M R3-Yahoo LibraryThing
NNCosNgbr
nDCG 0.2227 0.1980 0.0567 0.0852
Gini 0.3438 0.2407 0.2341 0.0659
MSI 230.14 228.00 386.78 546.47
WSR-UB
nDCG 0.4857* 0.4138* 0.0705* 0.2213*
Gini 0.2375 0.1356 0.3208 0.0768
MSI 173.86 172.76 309.52 364.70
WSR-IB
nDCG 0.4833* 0.4035* 0.0727* 0.3085*
Gini 0.2560 0.1516 0.3356 0.2768
MSI 177.34 178.95 315.05 461.73
Table: Statistically significant improvements in nDCG@100 (permutation test
p < 0.05) with respect to NNCosNgbr are indicated with *. Best values of
nDCG@100 in pink.
64
Top-N recommendation
Improving cosine with an oracle
Room for improvement
WSR with kNN cosine works well in top-N recommendation.
What is the room for improvement of this similarity measure?
Let’s build an oracle that generates ideal neighborhoods:
⊚ Finding the best neighborhood is an NP-hard problem.
⊚ We build an approximate oracle using a greedy approach.
66
Greedy neighborhood oracle
[Two panels: nDCG@100 vs k (0-300) for WSR with the greedy-oracle neighborhoods (around 0.80-0.90) and with kNN cosine neighborhoods (around 0.40-0.50).]
Figure: Values of nDCG@100 of WSR when using the neighborhoods
produced by the greedy oracle and by kNN using cosine similarity on
MovieLens 100k.
67
Cosine-based neighborhood oracle
The neighborhoods produced by the greedy oracle may be
impossible to achieve with similarities based on co-occurrence.
We develop a simpler oracle based on cosine similarity:
⊚ We find the best neighborhoods that cosine similarity can
provide by tuning the value k for each user.
⊚ This oracle can be seen as an adaptive kNN algorithm that
uses the optimal k for each user.
68
Comparison against oracles
Method nDCG@100 Gini@100 MSI@100
kNN Cosine 0.4857 0.2375 173.86
Cosine-based Oracle 0.5298 0.2508 174.97
Greedy Oracle 0.8631 0.2664 168.08
Table: Values of nDCG@100, Gini@100 and MSI@100 using WSR with cosine
similarity and the two oracles on the MovieLens 100k dataset.
69
Cosine similarity improvements
By studying the properties of the neighborhoods provided by the
oracles, we modify cosine similarity:
⊚ We penalize the cosine similarity to add user profile size
normalization.
◦ Similar to the pivoted document length normalization in IR
[Singhal et al., SIGIR ’96].
⊚ We add the IDF effect to cosine similarity to increase the user
profile overlap of the neighbors.
◦ The IDF is a fundamental term specificity measure in IR.
70
Cosine similarity improvements: results

Method                     Metric  ML 100k   ML 1M     R3-Yahoo  LibraryThing
Cosine                     nDCG    0.4857    0.4138    0.0704    0.2255
                           Gini    0.2375    0.1356    0.3107    0.0417
                           MSI     173.86    172.76    305.26    333.50
Penalized Cosine           nDCG    0.4889*   0.4194*   0.0709    0.2266
                           Gini    0.2516    0.1446    0.2863    0.0471
                           MSI     177.97*   176.41*   302.39    339.05*
Penalized Cosine with IDF  nDCG    0.4927*†  0.4281*†  0.0721*†  0.2422*†
                           Gini    0.2517    0.1551    0.3376    0.0596
                           MSI     178.65*   180.41*†  312.08*†  354.46*†

Table: Statistically significant improvements in nDCG@100 and MSI@100 (permutation test p < 0.05) with respect to cosine and penalized cosine are indicated with * and †, respectively. Best values in pink.
71
Top-N recommendation
Language models for computing
neighborhoods
Alternatives to cosine similarity
So far, we have improved cosine similarity with ideas from IR.
Can we do better than with cosine similarity?
Let’s study cosine similarity from an IR perspective.
73
Cosine similarity and the vector space model
Recommender Systems
⊚ Target user
⊚ Rest of users
⊚ Items
Information Retrieval
⊚ Query
⊚ Documents
⊚ Terms
Computing neighborhoods using cosine similarity is equivalent to
search in the vector space model.
If we swap users and items, we can derive an analogous item-based
approach.
We can use sophisticated search techniques for finding neighbors!
74
Language models

Statistical language models are a state-of-the-art ad hoc retrieval framework [Ponte & Croft, SIGIR ’98].

Documents are ranked according to their posterior probability given the query:

p(d|q) = p(q|d) p(d) / p(q) ∝ p(q|d) p(d)

The query likelihood, p(q|d), is based on a unigram model:

p(q|d) = ∏_{t∈q} p(t|d)^{c(t,q)}

The document prior, p(d), is usually considered uniform.
75
Language models for finding neighborhoods (I)

Ad hoc retrieval:
p(d|q) ∝ p(d) ∏_{t∈q} p(t|d)^{c(t,q)}

User-based collaborative filtering:
p(v|u) ∝ p(v) ∏_{i∈I_u} p(i|v)^{r(u,i)}

Item-based collaborative filtering:
p(j|i) ∝ p(j) ∏_{u∈U_i} p(u|j)^{r(u,i)}
76
Language models for finding neighborhoods (II)

User-based collaborative filtering:
p(v|u) ∝ p(v) ∏_{i∈I_u} p(i|v)^{r(u,i)}

We assume a multinomial distribution over the counts of ratings:

p_mle(i|v) = r(v,i) / ∑_{j∈I_v} r(v,j)

However, it suffers from sparsity. We need smoothing!
⊚ Jelinek-Mercer smoothing (JMS)
⊚ Dirichlet priors smoothing (DPS)
⊚ Absolute discounting smoothing (ADS)
⊚ Additive smoothing (AS)
77
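For user-based neighborhoods with Jelinek-Mercer smoothing, the candidate-scoring step can be sketched as follows (log space, uniform user prior omitted; this assumes the target user's ratings play the role of the query term counts and is not the thesis' implementation):

```python
import numpy as np

def lm_jms_neighbor_score(u, v, R, lam=0.5):
    """log p(v|u) up to the user prior, with Jelinek-Mercer smoothing.
    R is a dense |U| x |I| rating matrix (0 = unrated); a sketch only."""
    p_i_c = R.sum(axis=0) / R.sum()              # collection model p(i|C)
    p_i_v = R[v] / R[v].sum()                    # MLE of the candidate neighbor v
    smoothed = (1 - lam) * p_i_v + lam * p_i_c
    rated = R[u] > 0                             # I_u, the "query terms"
    return float((R[u, rated] * np.log(smoothed[rated])).sum())
```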
Language models: ranking accuracy
[Plot: nDCG@100 (0.24-0.44) vs smoothing parameter (δ, λ, µ × 4 × 10^3; γ), one curve per smoothing method plus the cosine baseline.]
Figure: nDCG@100 values of WSR-UB varying the smoothing method on
MovieLens 1M. Also evaluated in MovieLens 100k, R3-Yahoo and LibraryThing.
78
Language models: diversity
[Plot: Gini@100 (0.04-0.20) vs smoothing parameter, one curve per smoothing method plus the cosine baseline.]
Figure: Gini@100 values of WSR-UB varying the smoothing method on
MovieLens 1M. Also evaluated in MovieLens 100k, R3-Yahoo and LibraryThing.
79
Language models: novelty
[Plot: MSI@100 (140-195) vs smoothing parameter, one curve per smoothing method plus the cosine baseline.]
Figure: MSI@100 values of WSR-UB varying the smoothing method on
MovieLens 1M. Also evaluated in MovieLens 100k, R3-Yahoo and LibraryThing.
80
Language models: ranking accuracy

Method          ML 100k      ML 1M         R3-Yahoo    LibraryThing
Cosine WSR-UB   0.4857^b     0.4138^b      0.0703      0.2255
Cosine WSR-IB   0.4790       0.4035        0.0727^a    0.3085^acdf
Cosine RM2      0.4953^ab    0.4322^abe    0.0717^a    0.2384^a
LM-JMS WSR-UB   0.4990^abc   0.4329^abe    0.0719^a    0.2370^a
LM-JMS WSR-IB   0.4989^abc   0.4232^ab     0.0731^a    0.3118^abcdf
LM-JMS RM2      0.5021^abcd  0.4392^abcde  0.0731^acd  0.2406^ad

Table: Ranking accuracy figures measured in nDCG@100. Statistically significant improvements (permutation test p < 0.05) indicated with superscripts. Best values in pink.
81
Language models: diversity
Method ML 100k ML 1M R3-Yahoo LibraryThing
Cosine WSR-UB 0.2375 0.1356 0.3107 0.0417
Cosine WSR-IB 0.2738 0.1516 0.3309 0.2768
Cosine RM2 0.2637 0.1533 0.4769 0.1278
LM-JMS WSR-UB 0.2645 0.1731 0.3566 0.0570
LM-JMS WSR-IB 0.2952 0.1854 0.3520 0.3368
LM-JMS RM2 0.2794 0.1825 0.4281 0.1285
Table: Diversity figures measured in Gini@100. Best values in pink.
82
Language models: novelty

Method          ML 100k       ML 1M         R3-Yahoo      LibraryThing
Cosine WSR-UB   173.86        172.76        305.26        333.50
Cosine WSR-IB   181.59^ac     178.95^a      314.12^a      461.74^acdf
Cosine RM2      180.45^a      179.39^a      339.64^abdef  417.56^ad
LM-JMS WSR-UB   180.59^a      186.15^abc    314.23^a      352.80^a
LM-JMS WSR-IB   190.23^abcdf  191.34^abcdf  318.00^abd    499.73^abcdf
LM-JMS RM2      184.29^abcd   189.27^abcd   332.49^abde   418.39^ad

Table: Novelty figures measured in MSI@100. Statistically significant improvements (permutation test p < 0.05) indicated with superscripts. Best values in pink.
83
Why does LM with JMS work?

Why do language models with Jelinek-Mercer smoothing work better than cosine similarity?
To explain this, we perform an axiomatic analysis.
We define the user specificity and item specificity properties.
84
User specificity
User specificity
⊚ Given the target user u,
⊚ and the candidate neighbors v and w such that:
◦ Iu ∩ Iv = Iu ∩ Iw ,
◦ r(u, i) = r(v, i) = r(w, i) ∀i ∈ Iu ∩ Iv ,
◦ |v| < |w|;
⊚ the user specificity property enforces sim(u, v) > sim(u, w).
85
Item specificity
Item specificity
⊚ Let u be the target user;
⊚ let v and w be two candidate users such that |v| = |w|;
⊚ let j and k be two items from the set of items I such that:
◦ j ∈ Iu ∩ Iv ,
◦ k ∈ Iu ∩ Iw ;
⊚ given:
◦ (I_u ∩ I_v) \ {j} = (I_u ∩ I_w) \ {k},
◦ r(u, j) = r(v, j) = r(u, k) = r(w, k),
◦ r(u, i) = r(v, i) = r(w, i) ∀i ∈ Iu ∩ Iv ∩ Iw ;
⊚ if |j| < |k|, then the item specificity property enforces
sim(u, v) > sim(u, w).
86
Language models: axiomatic analysis
We analyze axiomatically the user specificity and item specificity
properties in cosine similarity and in language models with
Jelinek-Mercer smoothing:
Neighborhood method User specificity Item specificity
Cosine similarity ∼ ∼
LM-JMS ✓ ✓
We think differences in effectiveness may be related to these
properties.
87
Other recommendation tasks
Other recommendation problems
Top-N recommendation is the most prominent task in RS.
However, recommendation technologies are used in many
industrial scenarios.
In this part, we focus on two less popular recommendation
problems:
⊚ long tail liquidation,
⊚ user-item group formation.
89
Other recommendation tasks
Long tail liquidation
Long tail liquidation
Item popularity follows a long tail distribution.
The excess of inventory or overstock is a source of revenue loss.
We formulate a recommendation task centered on the liquidation of
long tail items.
We propose an item-based adaptation of relevance models to deal
with this novel task.
91
Long tail liquidation problem

Let I′ ⊂ I be the items we want to liquidate. We aim to find a scoring function s′ : I′ × U → ℝ such that:
⊚ for each item i ∈ I′,
⊚ we can build a ranked list of n users L^n_i ∈ U^n,
⊚ that are most likely interested in such item i.
92
Long tail estimation

Least rated products:
I′ = { i ∈ I : |U_i| < c_1 }

Lowest rated products:
I′ = { i ∈ I : (∑_{u∈U_i} r_{u,i}) / |U_i| < c_2 }

Least recommended products:
I′ = { i ∈ I : i ∉ L^{c_3}_u, ∀u ∈ U }
93
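The first two estimations can be computed directly from the rating data; a small sketch (the third one requires running a recommender to obtain the lists L^{c_3}_u; thresholds and names are illustrative):

```python
def long_tail_items(ratings_by_item, c1=5, c2=2.5):
    """Two of the long-tail estimations above, as a sketch.
    ratings_by_item[i] is the list of ratings received by item i;
    c1 and c2 are illustrative threshold values."""
    least_rated = {i for i, rs in ratings_by_item.items() if len(rs) < c1}
    lowest_rated = {i for i, rs in ratings_by_item.items()
                    if rs and sum(rs) / len(rs) < c2}
    return least_rated, lowest_rated
```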
Item relevance models

IRM2:
p(u|R_i) ∝ p(u) ∏_{v∈U_i} ∑_{j∈J_i} p(v|j) [p(u|j) p(j) / p(u)]

MLE with additive smoothing:
p_γ(u|i) = [r(u,i) + γ] / [∑_{v∈U_i} r(v,i) + γ |U|]

Item neighborhoods:
J_i is computed using the kNN algorithm with cosine similarity.

User and item priors:
We use uniform estimators.
94
Long tail liquidation: results on LibraryThing

Method    Least rated       Lowest rated     Least recommended
Random    0.0024            0.0002           0.0030
Pop       0.0408^acd        0.0499^acd       0.0455^acd
kNN-UB    0.0018            0.0039           0.0026
kNN-IB    0.0255^ac         0.0061           0.0169^ac
UIR-IB    0.0890^abcd       0.0894^abcd      0.0876^abcd
HT        0.1431^abcdeg     0.1451^abcdeg    0.1477^abcdeg
PureSVD   0.0879^abcd       0.0919^abcd      0.1065^abcde
SLIM      0.2004^abcdefg    0.2029^abcdefg   0.2495^abcdefg
IRM2      0.2120^abcdefgh   0.2108^abcdefg   0.2522^abcdefg

Table: Values of nDCG@100 on LibraryThing for each long tail estimation. Superscripts indicate significant improvements. Best values in pink.
95
Long tail liquidation: results on Ta-Feng
[Plot: nDCG@100 (0.000-0.010) vs #buyers (1-10), one curve per method: Random, Pop, kNN-UB, kNN-IB, UIR-Item, HT, PureSVD, SLIM, IRM2.]
Figure: Values of nDCG@100 on the Ta-Feng dataset for liquidating long tail
items (those with no more than n buyers).
96
Other recommendation tasks
User-item group formation
User-item group formation
The user-item group formation (UIGF) problem aims to find the best
companions for a given item and a target user [Brilhante et al.,
ICMDM ’16].
IRM2:
⊚ estimates the relevance of a user given an item;
⊚ deals with long tail item liquidation with uniform priors.
We can model the user relationships with different prior estimators.
98
User-item group formation problem

UIGF as an item relevance modeling problem:
⊚ Given the target user u ∈ U,
⊚ the recommended item i ∈ I,
⊚ and an integer k;
⊚ the UIGF problem seeks the set F^G_{u,i} ⊆ U such that:

F^G_{u,i} = argmax_{F*} ∑_{v∈F*} p(v|R_i)   s.t. F* ⊆ U, |F*| = k
99
UIGF priors

Uniform prior (U):
p_U(v) = 1 / |F_u|

Common Friends (CF):
p_CF(v) ∝ 1 / |F_u ∩ F_v|

Common group friends (CGF):
p_CGF(v) ∝ 1 / |(∪_{w∈F^G_{u,i}} F_w) ∩ F_v|

Group closeness (GC):
p_GC(v) ∝ 1 / |F^G_{u,i} ∩ F_v|
100
UIGF evaluation
We used ground truth groups to evaluate UIGF approaches:
⊚ users who checked in the same place within 4 hours,
⊚ groups of at least 4 members,
⊚ each user must be friends with at least one other group member.
Evaluation protocol:
⊚ for each group, we select a random member as the target user
and the place where the group registered as the target item;
⊚ we ask the UIGF model to form a group of k friends for this
specific user and item;
⊚ we evaluate the precision of the recommended group against
the ground truth groups.
101
UIGF datasets
Dataset Users Items Links Check-ins Ratings
FS 2 138 367 83 999 27 098 472 1 021 966 2 809 580
FS-NYC 103 663 7813 1 890 844 157 064 330 043
Gowalla 196 591 1 280 969 1 900 654 6 442 892 −
Brightkite 58 228 772 966 428 156 4 747 281 −
Weeplaces 15 799 971 307 114 131 7 369 712 −
Table: Statistics of location-based social network datasets.
102
UIGF: results in Foursquare
[Plot: Precision (0.05-0.55) vs group size (4-12) on Foursquare, one curve per method: k-Top, DkSP (PAV), DkSP (PLM), GREEDY (PAV), GREEDY (PLM), k-NN (PAV), k-NN (PLM), IRM2-U, IRM2-CF, IRM2-CGF, IRM2-GC.]
103
UIGF: results in Brightkite
[Plot: Precision (0.10-0.40) vs group size (4-12) on Brightkite, same methods as above.]
104
Pseudo-relevance feedback
LiMe: Linear Methods for PRF
Linear methods such as SLIM have been successfully used in
recommendation [Ning & Karypis, ICDM ’11].
We adapt them to PRF. Our proposal LiMe:
⊚ models the PRF task as a matrix decomposition problem,
⊚ employs linear methods to provide a solution,
⊚ is able to learn inter-term or inter-document similarities,
⊚ jointly models the query and the pseudo-relevant set,
⊚ admits different feature schemes,
⊚ is agnostic to the retrieval model.
106
LiMe variants
Two variants:
⊚ DLiMe: learns inter-document similarities.
⊚ TLiMe: learns inter-term similarities.
107
TLiMe: matrix formulation

Let X ∈ ℝ^{m×n} be the extended pseudo-relevant set matrix. We aim to find an inter-term similarity matrix W ∈ ℝ^{n×n}_+ such that:

X = X × W

where the rows of X are the query Q and the pseudo-relevant documents D_1, ..., D_{m−1} (an m × n matrix) and W = (w_{kl}) is an n × n matrix,
s.t. diag(W) = 0, W ≥ 0.
108
LiMe: feature schemes

How do we fill the matrix X = [Q; D_1; ...; D_{m−1}] (m × n)?

x_{ij} = s(t_j, Q)         if i = 1 and f(t_j, Q) > 0,
         s(t_j, D_{i−1})   if i > 1 and f(t_j, D_{i−1}) > 0,
         0                 otherwise

⊚ s_tf-idf(t, D) = (1 + log_2 f(t, D)) × log_2 (|C| / df(t))
⊚ f(t, D): number of occurrences of term t in D (or Q)
109
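A sketch of how the matrix X can be assembled with the tf-idf scheme above; term-to-frequency dictionaries and a fixed vocabulary are assumptions made for simplicity:

```python
import numpy as np

def build_X(query_bag, doc_bags, vocab, n_collection_docs, df):
    """Build the extended pseudo-relevant set matrix X (m x n): the first row
    is the query Q, the remaining rows are the pseudo-relevant documents.

    query_bag / doc_bags: term -> frequency mappings (f(t, Q), f(t, D))
    vocab:                list of the n terms indexing the columns
    df:                   collection document frequency of each term
    """
    def s_tfidf(freq, term):
        return (1 + np.log2(freq)) * np.log2(n_collection_docs / df[term])

    rows = [query_bag] + list(doc_bags)
    X = np.zeros((len(rows), len(vocab)))
    for i, bag in enumerate(rows):
        for j, term in enumerate(vocab):
            freq = bag.get(term, 0)
            if freq > 0:
                X[i, j] = s_tfidf(freq, term)
    return X
```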
LiMe: optimization problem

Matrix optimization problem:

W* = argmin_W (1/2) ‖X − X W‖²_F + β_1 ‖W‖_{1,1} + (β_2/2) ‖W‖²_F
     s.t. diag(W) = 0, W ≥ 0                                          (1)

Column by column, this is a bound-constrained least squares problem with an elastic net (ℓ1 and ℓ2 regularization) penalty:

w*_{·j} = argmin_{w_{·j}} (1/2) ‖x_{·j} − X w_{·j}‖²_2 + β_1 ‖w_{·j}‖_1 + (β_2/2) ‖w_{·j}‖²_2
          s.t. w_{jj} = 0, w_{·j} ≥ 0                                 (2)
110
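Problem (2) can be solved with any non-negative elastic-net solver; a sketch using scikit-learn, where the mapping from (β1, β2) to sklearn's (alpha, l1_ratio) accounts for the 1/m factor in sklearn's objective, and the diag(W) = 0 constraint is enforced by zeroing the j-th column of the design matrix (an implementation assumption, not necessarily the solver used in the thesis):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_lime_W(X, beta1=0.1, beta2=0.01):
    """Solve problem (2) column by column with a non-negative elastic net."""
    m, n = X.shape
    alpha = (beta1 + beta2) / m          # sklearn divides the loss by n_samples
    l1_ratio = beta1 / (beta1 + beta2)
    W = np.zeros((n, n))
    for j in range(n):
        A = X.copy()
        A[:, j] = 0.0                    # forces w_jj = 0 (diag(W) = 0)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           positive=True, fit_intercept=False, max_iter=5000)
        model.fit(A, X[:, j])
        W[:, j] = model.coef_
    return W
```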
LiMe: query expansion

To expand the original query, we reconstruct the first row of X:

(Q′)_{1×n} = (Q)_{1×n} × W*,   i.e.,   x̂_{1·} = x_{1·} W*            (3)

We compute a probabilistic estimate of a term t_j given the feedback model θ_F:

p(t_j|θ_F) = x̂_{1j} / ∑_{t_v ∈ V_{F′}} x̂_{1v}   if t_j ∈ V_{F′},
             0                                    otherwise           (4)
111
LiMe: second retrieval

The second retrieval is performed by interpolating the original query model with the feedback model:

p(t|θ′_Q) = (1 − α) p(t|θ_Q) + α p(t|θ_F)                             (5)

⊚ The hyperparameter α controls the interpolation.
⊚ This is a standard procedure in state-of-the-art PRF techniques.
112
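Equations (3)-(5) chain together as follows; a sketch over dense vectors, where the vocabulary V_F′ is approximated by keeping the strongest expansion terms (this cut-off and the function names are assumptions of the sketch):

```python
import numpy as np

def expanded_query_model(x_query, W, query_model, alpha=0.5, top_terms=None):
    """Reconstruct the query row (eq. 3), normalize it into a feedback model
    (eq. 4) and interpolate it with the original query model (eq. 5)."""
    x_hat = x_query @ W                                  # eq. (3)
    if top_terms is not None:                            # approximate V_F'
        keep = np.argsort(-x_hat)[:top_terms]
        mask = np.zeros_like(x_hat)
        mask[keep] = 1.0
        x_hat = x_hat * mask
    total = x_hat.sum()
    p_feedback = x_hat / total if total > 0 else x_hat   # eq. (4)
    return (1 - alpha) * query_model + alpha * p_feedback  # eq. (5)
```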
LiMe: test collections

Collection   #docs     Avg doc length   Training topics   Test topics
AP88-89      165k      284.7            51-100            101-150
TREC-678     528k      297.1            301-350           351-400
Robust-04    528k      28.3             301-450           601-700
WT10G        1,692k    399.3            451-500           501-550
GOV2         25,205k   647.9            701-750           751-800
113
LiMe: results
Method Metric AP88-89 TREC-678 Robust-04 WT10G GOV2
LM
nDCG 0.5637 0.4518 0.5830 0.5212 0.6325
RI − − − − −
RFMF
nDCG 0.5749 0.4746 0.5884 0.5262 0.6453
RI 0.42 0.23 0.07 0.30 0.42
MEDMM
nDCG 0.5955 0.5115 0.6227 0.5324 0.6653
RI 0.42 0.26 0.32 0.36 0.66
RM3
nDCG 0.6005 0.4987 0.6251 0.5352 0.6618
RI 0.50 0.40 0.37 0.20 0.60
DLiMe
nDCG 0.6058 0.4936 0.6247 0.5290 0.6588
RI 0.52 0.44 0.32 0.26 0.72
TLiMe
nDCG 0.6085 0.5198 0.6294 0.5398 0.6698
RI 0.52 0.46 0.37 0.30 0.62
114
Conclusions
Conclusions (I)
We explored cross-pollination of ideas between IR and RS:
⊚ We studied the robustness and discriminative power of ranking
accuracy metrics. These findings influenced the evaluation of
this thesis.
⊚ We adapted different pseudo-relevance feedback models to
top-N recommendation as memory-based recommenders:
◦ relevance models offer highly accurate recommendations;
◦ techniques from the Rocchio framework are a very cost-effective
alternative.
⊚ We used ad hoc retrieval models to compute better
neighborhoods in collaborative filtering:
◦ neighborhood oracles provide insights for improvements;
◦ language models outperform cosine similarity.
116
Conclusions (II)
We explored cross-pollination of ideas between IR and RS:
⊚ We adapted relevance models to novel recommendation tasks:
◦ item-based relevance models can tackle long tail item liquidation;
◦ specific priors can be used to deal with the user-item group
formation problem.
⊚ We proposed a novel PRF framework inspired by a
recommendation method.
117
Conclusions
Future directions
Future directions
⊚ Extend our robustness and discriminative power analysis to other types of metrics, such as diversity or novelty metrics.
⊚ Study the adaptation of different pseudo-relevance feedback
models to top-N recommendation or other tasks.
⊚ Analyze other neighborhood computation techniques using the
methodology based on oracles.
⊚ Examine other ad hoc retrieval models to compute
neighborhoods.
⊚ Extend LiMe with richer features (based on Wikipedia, query
logs, etc.).
119
Conclusions
Publications
Conferences (I)
A. Landin, D. Valcarce, J. Parapar, Á. Barreiro. “PRIN: A Probabilistic
Recommender with Item Priors and Neural Models”. ECIR ’19, pp.
133-147, 2019.
D. Valcarce, A. Bellogín, J. Parapar, P. Castells. “On the Robustness and
Discriminative Power of IR Metrics for Top-N Recommendation”.
ACM RecSys ’18, pp. 260-268, 2018.
D. Valcarce, J. Parapar, Á. Barreiro. “LiMe: Linear Methods for
Pseudo-Relevance Feedback”. ACM SAC ’18, pp. 678-687, 2018.
D. Valcarce, J. Parapar, Á. Barreiro. “Combining Top-N Recommenders
with Metasearch Algorithms”. ACM SIGIR ’17, pp. 805-808, 2017.
D. Valcarce, J. Parapar, Á. Barreiro. “Additive Smoothing for
Relevance-Based Language Modelling of Recommender Systems”.
Conferences (II)
D. Valcarce, J. Parapar, Á. Barreiro. “Efficient Pseudo-Relevance
Feedback Methods for Collaborative Filtering Recommendation”.
ECIR ’16, pp. 602-613, 2016.
D. Valcarce, J. Parapar, Á. Barreiro. “Language Models for Collaborative
Filtering Neighbourhoods”. ECIR ’16, pp. 614-625, 2016.
D. Valcarce. “Exploring Statistical Language Models for Recommender
Systems”. ACM RecSys ’15, pp. 375-378, 2015.
D. Valcarce, J. Parapar, Á. Barreiro. “A Study of Priors for
Relevance-Based Language Modelling of Recommender Systems”.
ACM RecSys ’15, pp. 237-240, 2015.
D. Valcarce, J. Parapar, Á. Barreiro. “A Study of Smoothing Methods for
Relevance-Based Language Modelling of Recommender Systems”.
Journals
D. Valcarce, J. Parapar, Á. Barreiro. “Document-based and Term-based Linear Methods
for Pseudo-Relevance Feedback”. Applied Computing Review 18(4), pp. 5-17, 2018.
D. Valcarce, I. Brilhante, J.A. Macedo, F.M. Nardini, R. Perego, C. Renso. “Item-driven
group formation”. Online Social Networks and Media 8, pp. 17-31, 2018.
D. Valcarce, J. Parapar, Á. Barreiro. “Finding and Analysing Good Neighbourhoods to
Improve Collaborative Filtering”. Knowledge-Based Systems 159, pp. 193-202, 2018.
D. Valcarce, J. Parapar, Á. Barreiro. “A MapReduce implementation of posterior
probability clustering and relevance models for recommendation”. Engineering
Applications of Artificial Intelligence 75, pp. 114-124, 2018.
D. Valcarce, J. Parapar, Á. Barreiro. “Axiomatic Analysis of Language Modelling of
Recommender Systems”. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems 25(2), pp. 113-128, 2017.
D. Valcarce, J. Parapar, Á. Barreiro. “Item-Based Relevance Modelling of
Recommendations for Getting Rid of Long Tail Products”. Knowledge-Based Systems
103, pp. 41-51, 2016.
PhD Thesis
Information Retrieval Models
for Recommender Systems
Author: Daniel Valcarce
Advisors: Álvaro Barreiro & Javier Parapar
A Coruña, May 8th, 2019
Information Retrieval Lab
Computer Science Department
University of A Coruña
