In the following slides, we present our approach to tackling conflicting recommendation quality objectives in recommender systems using a genetic-based clustering algorithm. We studied users' tendencies toward diversity and proposed a pairwise similarity measure to quantify them. We then used the new similarity within a fitness function to create overlapping clusters and to produce recommendations balanced in terms of diversity and relevancy.
Study of relevancy, diversity, and novelty in recommender systems
1. Study of Diversity, Novelty, and Relevancy
in Recommender Systems
Chems Eddine BERBAGUE
Thesis supervisor: Pr. Hassina SERIDI
Thesis co-supervisor: Dr. Karabadji Nour El-Islem
Novembre, 2021
2. Table of Contents
1. Introduction
2. State of the Art
3. Contributions: user clustering and pairwise similarity.
4. Final Conclusion
5. Research questions
What makes collaborative filtering a good research choice?
Why is clustering one of the best techniques for dealing with recommendation issues?
How effective are bio-inspired clustering techniques?
6. Aims and objectives
Improve the scalability of the memory-based collaborative filtering
algorithm.
Improve the recommendation quality.
7. Thesis research axes
Memory-based collaborative filtering algorithms
Neighbor 2
Neighbor 1
Neighbor 3
Target User
Similarity 3
Similarity 1
Similarity 2
Figure 2: General scheme of user-based collaborative filtering.
9. Thesis research axes
Recommendation quality improvement
Yes, we care about the quality !
Diversity
Relevancy
Novelty
Figure 4: Different recommendation quality metrics.
15. How were evolutionary algorithms adapted to RS?
Inter-algorithmic use:
similarity calculation, recommendation ranking, clustering, latent factor models, etc.
Intra-algorithmic use:
hybridization, etc.
16. Bio-inspired algorithms
Algorithm             GA   ACO  ANN  ABC  BAT  FSS  PSO
Similarity            X    X    X    -    -    -    -
Weighting             X    -    X    -    X    X    -
Clustering            X    -    -    X    X    X    -
Re-ranking            X    -    -    -    -    -    X
Latent factor models  -    -    -    -    -    -    X
Graph-based models    -    X    -    X    -    -    X
Table 3: Summary of different uses of bio-inspired algorithms in RS.
17. Genetic-based multi-objective optimization
Genetic algorithms (GA):
A GA is a bio-inspired optimization technique. It explores a large search space to select, among the possible solutions, the most suitable (fittest) one.
Figure 7: General scheme of a genetic-based multi-objective optimization algorithm.
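The explore-and-select loop just described can be sketched as a minimal single-objective GA. The population size, mutation rate, and the toy OneMax fitness below are illustrative assumptions, not values from the thesis:

```python
import random

random.seed(0)  # illustrative seed, for reproducibility only

def genetic_optimize(fitness, n_genes, pop_size=30, generations=50,
                     mutation_rate=0.1):
    """Minimal GA sketch: binary chromosomes, elitism, one-point
    crossover, and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]  # elitism: keep the two fittest
        while len(next_pop) < pop_size:
            # mate two distinct parents drawn from the fitter half
            p1, p2 = random.sample(scored[:pop_size // 2], 2)
            cut = random.randrange(1, n_genes)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                 # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Toy usage: maximize the number of 1-bits (OneMax).
best = genetic_optimize(sum, n_genes=20)
```

In the thesis's setting, the chromosome would instead encode a clustering and the fitness would combine the quality metrics introduced later.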
18. Questions about using GAs in RSs
Which problem to target?
How to formally describe the problem?
How to define the quality of the solution?
24. Clustering is a solution!
Figure 10: Simple clustering example.
We propose the use of a genetic algorithm to:
• Find the best number of clusters.
• Explore the search space and select the best cluster representatives among its members.
• Control the border-overlap problem by specifying the cluster sizes.
25. Encoding scheme of GA-CLUS
k | C1 | α1 | ... | Cx | αx | ... | Ck | αk
k is an integer representing the number of clusters
Ci is an integer representing a profile as the center of cluster i
αi is an integer representing the number of profiles in cluster i
Figure 11: Encoding scheme of GA-CLUS
Figure 12: Samples of the GA-CLUS encoding scheme — panels (a)–(d) show example bit strings encoding k, followed by alternating cluster centers Ci and cluster sizes αi.
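Under this encoding, a chromosome is a bit string holding k followed by k (center, size) pairs. The field widths below are illustrative assumptions, not the thesis's exact values:

```python
def decode_chromosome(bits, k_bits=3, center_bits=5, size_bits=4):
    """Decode a GA-CLUS-style bit string into (k, [(center_i, size_i), ...]).

    Field widths (k_bits, center_bits, size_bits) are example choices.
    """
    k = int(bits[:k_bits], 2)          # number of clusters
    clusters, pos = [], k_bits
    for _ in range(k):
        center = int(bits[pos:pos + center_bits], 2)          # profile index C_i
        size = int(bits[pos + center_bits:
                        pos + center_bits + size_bits], 2)    # cluster size α_i
        clusters.append((center, size))
        pos += center_bits + size_bits
    return k, clusters

# '010' -> k = 2, then two (center, size) pairs follow.
k, clusters = decode_chromosome('010' + '00011' + '0100' + '01111' + '0110')
# k == 2; clusters == [(3, 4), (15, 6)]
```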
26. Fitness function
Minimizing the MAE of each cluster:

group_precision(ch) = (max(r) − min(r)) − ((1/k) × Σ_{i=1}^{k} MAE(G_i))    (1)

Diversifying the clusters' centers:

center_diversity(ch) = (1/(k × (k − 1))) × Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} (1 − sim(C_i, C_j))    (2)

Combination of centers' diversity and clusters' precision:

fitness(ch) = group_precision(ch) + center_diversity(ch)    (3)
28. Rating prediction comparison of GA-CLUS in terms
of the neighborhood size
Figure 13: Comparison of GA-CLUS to
KNN.
Figure 14: Comparison of GA-CLUS to
Kmeans and PCA-GAKM.
29. Relevancy comparison of GA-CLUS in terms of
the neighborhood size
Figure 15: Precision comparison of
GA-CLUS to other methods.
Figure 16: Recall comparison of
GA-CLUS to other methods.
30. Relevancy comparison of GA-CLUS in terms of
Top-N recommendations
Figure 17: Precision comparison of
GA-CLUS with different recommendation
length.
Figure 18: Recall comparison of
GA-CLUS with different recommendation
length.
31. Partial conclusions
+ We encoded the clustering problem so as to reduce the search space.
+ We optimized the quality of the clustering in a way that achieves more accurate results.
- Accuracy is an insufficient measure of user satisfaction.
32. Discussion
Users' similarity vs. recommendation diversity.
Figure 19: Conflicting recommendation diversity (users: low diversity; system: high popularity).
34. TS-IKNN:
First stage:
We sought to reduce the extent of the search space by applying an adapted KNN algorithm. In this stage, we modified a similarity measure to combine a pairwise user diversity measure and a similarity-based rating measure.
Second stage:
We employed a genetic algorithm to improve the neighborhood selection.
35. First stage:
Incorporating diversity into a similarity measure allows dual control over the user set, selecting users who are similar and diverse at the same time:
• The similarity definition:

new_sim(u1, u2) = α × sim(u1, u2) + (1 − α) × div(u1, u2),    (4)

• The diversity definition:

div(u1, u2) = Σ_{i ∈ I2 − I1} (1 − P(i)/|U|),    (5)
36. Second stage:
Figure 20: Chromosome encoding of the neighborhood optimization.
Combination of the diversity and the relevancy:
fitness(ch) = β × (1 − precision) + (1 − β) × (1 − diversity), (6)
38. Results of TS-IKNN
Figure 21: Coverage comparison of TS-IKNN to other methods (KNN with normal similarity, KNN with modified similarity, K-means best precision, K-means best coverage, the proposed algorithm, and LDA).
Method                   Precision
Normal similarity        0.6106
Adjusted similarity      0.6130
K-means best precision   0.6379
K-means best coverage    0.6321
Proposed approach        0.6524
LDA                      0.3368
Table 5: Precision comparison of TS-IKNN to other methods.
39. Partial conclusions
We presented an evolutionary algorithm that acts in two stages with the aim of striking a balance between coverage and precision. However, this method suffers from some limitations, which we tried to address:
• A neighborhood is assigned to a user according to binary weights, which may exclude explorer users.
• The size of the initial neighbor-candidate set is hard to fix: large sizes allow a better exploration of possibilities but increase complexity.
• The correlation between the novelty of recommendations and their coverage is not clarified.
40. Perspectives and future work
A:
• Statistics can be misleading.
• Clustering alleviates the curse of data dimensionality.
Q:
• How can the similarity calculation be improved?
• How can clustering target the novelty/diversity metrics?
• Can we benefit from more information to improve the clustering?
45. Principles of GA-DCLUS
We denote by U the set of users and by C the set of clusters. Our clustering scheme consists of creating a matrix W = {w_1, w_2, ..., w_|U|} of |U| rows and |C| columns. For each user u ∈ U, we define a belonging-weight vector w_u ∈ W of |C| values, denoted w_u = {w_(u,1), w_(u,2), ..., w_(u,|C|)}, so as to assign to u and a given cluster j ∈ C a belonging weight w_(u,j). Furthermore, we assign each user u to one main cluster and to zero or more secondary clusters, with respect to the belonging-weight vector w_u. The next slides explain how users are assigned.
46. Principles of GA-DCLUS
Main clusters set, denoted C^M: it consists of |U| values, C^M = {c^M_1, c^M_2, ..., c^M_|U|}, where each user u has a main cluster c^M_u in which the UCF algorithm is applied to generate their recommendations. The cluster c^M_u is selected by identifying, from w_u, the cluster with the highest belonging weight.
47. Principles of GA-DCLUS
Secondary clusters set, denoted C^S: it consists of |U| sub-vectors, C^S = {c^S_1, c^S_2, ..., c^S_|U|}. Each user u has a set c^S_u of n_u secondary clusters, denoted c^S_u = {c_1, c_2, ..., c_{n_u}}. User u participates in the clusters of c^S_u only as a candidate neighbor and does not receive any recommendations from them. To improve scalability, a minimum belonging threshold θ is added to the chromosome encoding.
48. Clustering encoding scheme of GA-DCLUS
... ...
CLUS NEI TH W1,1 W1,2 ... W1,|C|
Clustering
parameters
U1
W2,1 W2,2 ... W2,|C| W|U|,1 W|U|,2 ... W|U|,|C|
U2 U|U|
Figure 24: The chromosome encoding scheme.
49. Example of the clustering encoding scheme
CLUS  NEI  TH   | U1: W1,1 W1,2 W1,3 | U2: W2,1 W2,2 W2,3 | U3: W3,1 W3,2 W3,3 | U4: W4,1 W4,2 W4,3
2     48   0.8  | 0.95  0.85  0.70   | 0.55  0.92  0.88   | 0.90  0.81  0.91   | 0.70  0.94  0.74
Figure 25: Example of the proposed genetic encoding.
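Applying the assignment rules of the previous slides to this example (threshold TH = 0.8), each user's main cluster is the one with the highest belonging weight, and the secondary clusters are the remaining ones whose weight reaches the threshold. This is a sketch of the decoding step, not the thesis's code:

```python
def assign_clusters(weights, theta):
    """weights: one belonging-weight vector per user over |C| clusters.
    Returns a (main_cluster, secondary_clusters) pair per user, 1-indexed."""
    assignments = []
    for w in weights:
        main = max(range(len(w)), key=lambda j: w[j])       # highest weight
        secondary = [j + 1 for j in range(len(w))
                     if j != main and w[j] >= theta]        # above threshold
        assignments.append((main + 1, secondary))
    return assignments

# Weight matrix from Figure 25, TH = 0.8:
W = [[0.95, 0.85, 0.70],   # U1
     [0.55, 0.92, 0.88],   # U2
     [0.90, 0.81, 0.91],   # U3
     [0.70, 0.94, 0.74]]   # U4
assignments = assign_clusters(W, 0.8)
# U1 -> main cluster 1, secondary [2]; U4 -> main cluster 2, no secondaries.
```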
65. General conclusions
Genetic algorithms are good optimization tools that allow us to integrate more data sources, to hybridize more than one recommender, and to improve the similarity measures.
Recommendation quality can be driven, during the optimization, by real indicators within a fitness function, such as MAE or Precision.
GAs can handle more than one problem at once by adjusting the encoding of the solutions and the fitness function; the latter can be parametrized using controlling parameters.
66. Issues
GAs take time!
Many critical parameters, hard decoding, hard evaluation, unpredictable GA behaviour, etc.
The fitness function is hard to fix!
Which features, which metrics, which formulas, etc.
67. Future works
Divide the clustering task into subtasks!
Map-reduce paradigm, parallelization, better fitness functions, algorithm combination, etc.
Combine more data sources!
Feature weighting, feature extraction, etc.