Combining Content-based and
Collaborative Filtering
Department of Computer Science and
Engineering, Slovak University of Technology
polcicova@dcs.elf.stuba.sk
navrat@elf.stuba.sk
Gabriela Polčicová
Pavol Návrat
Overview
• Information Filtering and its Types
• Combined Method
• Experiment with Information
Filtering Methods
• Conclusions
Information Filtering (1)
– delivery of relevant information to the people who need it
• Types of Information Filtering
– Content-based - for textual documents
– Collaborative - for communities of users
• Interests
– information about interests - stored in profiles
– expressing opinions to documents - ratings
• Ratings {i, j, rij}
– for user i, item j, the value of rating rij
Information Filtering (2)
• Input
– Rated items {user, item, value}
– Unrated items {user, item}
• Filter
– Learning interests
– Estimating the value of rating
– Choosing recommendations
• Output
– Recommendations {user, item, estimation}
Content-based Filtering (1)
• Basic idea
– recommending documents based on content and
properties of document
• Profile
– consists of keywords with assigned weights
– only documents matching profile are recommended
• Recommendations
– based on objective measurable properties
Content-based Filtering (2)
• Documents, ratings
– Documents rated by the user
– Documents of interest
• PROFILE
– Keywords, phrases with weights
• Documents unrated by the user
– Documents matching profile => recommended documents
Collaborative Filtering (1)
• Basic idea
– automating “word of mouth”
– leverage opinions of like-minded users while making
decisions
• Schema
– collecting users’ opinions
– searching for like-minded users
– making recommendations
Collaborative Filtering (2)
• Profile of current user
• Profiles of users 1, 2, 3, 4, 5
• Documents from like-minded users’ profiles => recommended documents
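Collaborative Filtering (3)
• Similarity measure: Pearson Correlation Coefficient
$$k_{ci} = \frac{\sum_{j \in I_{ci}} (r_{cj} - \bar{r}_c)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j \in I_{ci}} (r_{cj} - \bar{r}_c)^2 \, \sum_{j \in I_{ci}} (r_{ij} - \bar{r}_i)^2}}$$
• Recommendations computation: weighted sum of ratings
$$\hat{r}_{cj} = \bar{r}_c + \frac{\sum_{i \in U_{cj}} (r_{ij} - \bar{r}_i)\, k_{ci}}{\sum_{i \in U_{cj}} |k_{ci}|}$$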
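The Pearson coefficient and the weighted sum of ratings on this slide can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: profiles are plain dicts {item: rating}, each user's mean is taken over all of that user's ratings, a neighbour counts as like-minded when the coefficient exceeds a hypothetical `threshold` parameter, and the function names are made up for this sketch.

```python
from math import sqrt


def mean_rating(profile):
    """Mean of all ratings in a profile {item: rating} (assumed non-empty)."""
    return sum(profile.values()) / len(profile)


def pearson(profile_c, profile_i):
    """Coefficient k_ci computed over I_ci, the items rated by both users.

    Returns None when the coefficient is undefined (no common items or
    zero variance on the common items).
    """
    common = set(profile_c) & set(profile_i)
    mean_c, mean_i = mean_rating(profile_c), mean_rating(profile_i)
    num = sum((profile_c[j] - mean_c) * (profile_i[j] - mean_i) for j in common)
    den = sqrt(
        sum((profile_c[j] - mean_c) ** 2 for j in common)
        * sum((profile_i[j] - mean_i) ** 2 for j in common)
    )
    return num / den if den else None


def estimate_rating(profile_c, other_profiles, item, threshold=0.0):
    """Weighted sum of ratings over U_cj: users who rated `item` and whose
    coefficient with the current user exceeds `threshold` (an assumption)."""
    num = den = 0.0
    for profile_i in other_profiles:
        if item not in profile_i:
            continue
        k = pearson(profile_c, profile_i)
        if k is None or k <= threshold:
            continue
        num += (profile_i[item] - mean_rating(profile_i)) * k
        den += abs(k)
    # No like-minded user rated the item: no estimate (this lowers coverage).
    return mean_rating(profile_c) + num / den if den else None
```

When no like-minded user has rated the item, the function returns None, which is the situation the Coverage metric is meant to capture.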
Combining Content-based and
Collaborative Filtering (1)
• Computing estimates of missing ratings for each user by the Content-based Filtering method
• Searching for like-minded users
– computing coefficient kci between current and i-th user
(only from ratings)
– computing coefficient kci’ between current and i-th user
(from both ratings and estimates)
• New recommendations computation
– weighted sum of ratings and estimates, in which ratings are weighted by coefficients kci and ratings with estimates by coefficients kci’
Datasets for Experiments
• Data:
– EachMovie - users’ ratings for movies
www.research.digital.com/SRC/eachmovie/
– IMDB - textual information for CBF (movies’ descriptions)
www.imdb.com/
• Datasets:
– A - ratings from the period up to Mar 1, 1996
(810 ratings from 71 users)
– B - ratings from the period up to Mar 15, 1996
(2407 ratings from 131 users)
– C - ratings from the period up to Apr 1, 1996
(12290 ratings from 651 users)
EachMovie Data and Constant Method
[Figure: Percentage of ratings in EachMovie (0% to 45%) for each rating value 1 to 6, datasets A, B and C]
• Constant Method rcj = 5
Experiments with Combination of Content-
based and Collaborative Filtering (2)
• Dataset
– Divide dataset into training set (90%) and test set (10%)
– Apply filtering methods and evaluate their performance
• Methods (input: test and training sets; output: recommendations)
– Content-based Filtering method
– Collaborative Filtering method
– Combined Filtering method
– Constant method (input: test set)
• Evaluation of methods’ performance
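The 90% / 10% division could be done along these lines. The slides do not say how ratings were assigned to the two sets, so the random shuffle, the `seed` parameter and the function name are assumptions of this sketch.

```python
import random


def split_ratings(ratings, test_fraction=0.1, seed=0):
    """Split a list of {user, item, value} triples into (training, test).

    Assumption: ratings are assigned to the test set uniformly at random.
    """
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Repeating the split with different seeds would give independent runs for each method and dataset.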
Metrics
• Coverage = percentage of items for which the method is able to compute estimates
• Accuracy = $\frac{|R \cap L| + |\bar{R} \cap \bar{L}|}{|L| + |\bar{L}|}$
• F-measure = $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
– Precision = $\frac{|R \cap L|}{|R|}$
– Recall = $\frac{|R \cap L|}{|L|}$
• NMAE = $\frac{\sum |r_{ij} - \hat{r}_{ij}|}{n \cdot s}$
R - set of recommended items
L - set of liked items
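A small Python sketch of these metrics, assuming R and L are given as sets of item identifiers drawn from a known set of test items, and that NMAE uses a scale with s = 6 values as on the EachMovie slide. The function and key names are illustrative only.

```python
def classification_metrics(recommended, liked, all_items):
    """Accuracy, precision, recall and F-measure from the sets R and L.

    Assumes `recommended` and `liked` are subsets of `all_items`.
    """
    R, L, U = set(recommended), set(liked), set(all_items)
    hits = len(R & L)                        # |R ∩ L|
    correct_rejections = len((U - R) & (U - L))  # complements of R and L
    precision = hits / len(R) if R else 0.0
    recall = hits / len(L) if L else 0.0
    total = precision + recall
    return {
        "accuracy": (hits + correct_rejections) / len(U),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / total if total else 0.0,
    }


def nmae(actual, estimated, scale_size=6):
    """Mean absolute error divided by the number of scale values s."""
    n = len(actual)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / (n * scale_size)
```

For example, with six test items, R = {1, 2, 3} and L = {2, 3, 4}, precision, recall and F-measure all equal 2/3, and accuracy is 4/6 because items 5 and 6 are correctly left out.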
Results of Experiments
[Figure: results for datasets A, B and C; series: CF, CBF, combined, constant; panels: Coverage (axis 0,8 to 1), Accuracy (axis 0,7 to 0,9), F-measure (axis 0,8 to 1), F-measure (axis 0,8 to 1)]
Conclusions
• Combination of content-based and collaborative filtering might help in the initial phase, when only a small number of ratings is available
Future work
• Weighting of coefficients in the weighted sum of ratings
• Comparing the method with other published methods
Content-based Filtering - Vector
Representation of Documents and Profiles
D = ( … , computer, … , learning, … , machine, … )
Document_j: TF-IDF weights for the words computer, learning, machine
$W_j = (0, \dots, 0, 0.5, 0, \dots, 0, 0.3, 0, \dots, 0, 0.2, 0, \dots, 0)$
$profile_i = \sum_{j=1}^{n} r_j \cdot w_{ij}$
$Sim(W, Profile) = \frac{W \cdot Profile}{|W| \cdot |Profile|}$
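The vector representation on this slide can be illustrated with a short Python sketch. The slide does not give the exact TF-IDF formula, so the variant used here (raw term count times log(N / document frequency)) is an assumption, as are the function names and the toy documents.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (dictionary D, list of vectors W_j).

    Assumed weighting: term count * log(N / document frequency).
    """
    dictionary = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * log(n / df[t]) for t in dictionary])
    return dictionary, vectors


def build_profile(vectors, ratings):
    """Profile as a rating-weighted sum of rated document vectors.

    ratings: {document index: rating given by the user}.
    """
    profile = [0.0] * len(vectors[0])
    for j, r in ratings.items():
        for k, w in enumerate(vectors[j]):
            profile[k] += r * w
    return profile


def cosine(w, profile):
    """Normalized dot product Sim(W, Profile)."""
    dot = sum(a * b for a, b in zip(w, profile))
    norm = sqrt(sum(a * a for a in w)) * sqrt(sum(b * b for b in profile))
    return dot / norm if norm else 0.0


docs = [
    ["machine", "learning"],
    ["machine", "learning", "computer"],
    ["cooking", "recipes"],
]
_, vecs = tfidf_vectors(docs)
profile = build_profile(vecs, {0: 5})  # the user rated document 0 with 5
```

In this toy collection the unrated document 1 shares words with the rated document and gets a positive similarity, while document 2 shares none and gets similarity 0.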
Collaborative Filtering - Example
A B C D E F G
current 1 4 5
1 3 5 1 2
2 1 3 2 5
3 5 1 4 5
4 1 4 2 4
5 2 4 2 5
2
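Combining Content-based and Collaborative Filtering (2)
• Similarity measure: Pearson Correlation Coefficient
$$k'_{ci} = \frac{\sum_{j \in I'_{ci}} (r^{CBF}_{cj} - \bar{r}_c)(r^{CBF}_{ij} - \bar{r}_i)}{\sqrt{\sum_{j \in I'_{ci}} (r^{CBF}_{cj} - \bar{r}_c)^2 \, \sum_{j \in I'_{ci}} (r^{CBF}_{ij} - \bar{r}_i)^2}}$$
• Recommendations computation: weighted sum of ratings and estimates
$$\hat{r}_{cj} = \bar{r}_c + \frac{\sum_{i \in U_{cj}} (r_{ij} - \bar{r}_i)\, k_{ci} + \sum_{i \in U'_{cj}} (r^{CBF}_{ij} - \bar{r}_i)\, k'_{ci}}{\sum_{i \in U_{cj}} |k_{ci}| + \sum_{i \in U'_{cj}} |k'_{ci}|}$$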
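A rough Python sketch of the combined estimate follows. Several details are not spelled out on the slides and are assumptions here: user means are taken over real ratings only, the second sum runs over every user who has either a rating or a CBF estimate for the item, a hypothetical `threshold` selects like-minded users, and the CBF estimates are passed in as a ready-made dict rather than computed.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def correlation(vals_c, vals_i, items, mean_c, mean_i):
    """Pearson-style coefficient over `items`; None if undefined."""
    num = sum((vals_c[j] - mean_c) * (vals_i[j] - mean_i) for j in items)
    den = sqrt(
        sum((vals_c[j] - mean_c) ** 2 for j in items)
        * sum((vals_i[j] - mean_i) ** 2 for j in items)
    )
    return num / den if den else None


def combined_estimate(c, item, ratings, cbf, threshold=0.0):
    """ratings: {user: {item: rating}}; cbf: {user: {item: CBF estimate}}."""
    mean_c = mean(ratings[c].values())
    # Real ratings take precedence over CBF estimates for the same item.
    filled_c = {**cbf.get(c, {}), **ratings[c]}
    num = den = 0.0
    for i, rated_i in ratings.items():
        if i == c:
            continue
        mean_i = mean(rated_i.values())
        filled_i = {**cbf.get(i, {}), **rated_i}
        if item in rated_i:  # first sum: real ratings, coefficient k_ci
            common = set(ratings[c]) & set(rated_i)
            k = correlation(ratings[c], rated_i, common, mean_c, mean_i)
            if k is not None and k > threshold:
                num += (rated_i[item] - mean_i) * k
                den += abs(k)
        if item in filled_i:  # second sum: ratings or estimates, k'_ci
            # I'_ci: items rated by at least one of the two users
            union = (set(ratings[c]) | set(rated_i)) & set(filled_c) & set(filled_i)
            k_prime = correlation(filled_c, filled_i, union, mean_c, mean_i)
            if k_prime is not None and k_prime > threshold:
                num += (filled_i[item] - mean_i) * k_prime
                den += abs(k_prime)
    return mean_c + num / den if den else None


ratings = {"c": {"A": 1, "B": 4, "C": 5}, "u": {"A": 1, "B": 4}}
cbf = {"c": {"F": 3.0}, "u": {"C": 5.0, "F": 2.0}}
```

With these toy values nobody has rated item F, so without CBF estimates no estimate can be computed for user c; with the estimates, user u contributes through the second sum, which illustrates how the combination can raise coverage.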
Experiments with Combination of Content-
based and Collaborative Filtering (1)
• Content-based Filtering Method (CBF)
– documents and profiles: vector representation - weighted
keywords (TF-IDF)
– estimation computation: normalized dot product of
document and profile vectors
• Collaborative Filtering (CF)
– Pearson correlation coefficient
– weighted sum of ratings
• Combination of CF and CBF
– Pearson correlation coefficients
– weighted sum of ratings and CBF estimations
• Constant Method (rcj = 5)


Editor's Notes

  • #2 I would like to present a method for information filtering that combines content-based and collaborative filtering techniques.
  • #3 First of all, I will talk about information filtering and its main types. Then I will introduce our method, which combines content-based and collaborative filtering. Next I will present the results of experiments comparing the performance of the two main methods and the proposed method. Finally, I will present some conclusions.
  • #4 The aim of information filtering is to deliver relevant information to the people who need it. This means either filtering out uninteresting items or recommending relevant ones. The two main types of information filtering methods are content-based and collaborative. To be able to recommend items (documents), it is necessary to model the user’s interests. Information about the user’s interests and opinions is stored in profiles and is expressed using the user’s (explicit) ratings. Ratings are triples {user (who rates), item (for which the rating is made), value of rating (a number from a given scale)}.
  • #5 For any of the main information filtering methods, the process of making recommendations may be divided into three phases: (1) Learning interests: collecting ratings and composing profiles. (2) Estimating ratings: the rating values of a given user’s unrated items are estimated by one of the information filtering techniques (content-based, collaborative). (3) Choosing recommendations -> typically only items with an estimated rating value greater than a given threshold are recommended.
  • #6 The first of the two main information filtering techniques is content-based filtering. This is a method for textual documents. Its basic idea is to recommend documents based on the content and properties of the document, which are represented by keywords selected from the documents. The user’s profile consists of keywords with assigned weights. These weights are computed on the basis of the keywords’ importance in the document. The recommended documents are those that match the profile. Matching is based on measurable properties such as the frequency of words in a document.
  • #7 At the beginning the user rates some documents. Some of these documents are rated highly; these are the documents of the user’s interest. The user’s profile is composed from the words of the rated documents, and a weight is assigned to each word in the profile. Then those unrated documents that match the profile are chosen and recommended.
  • #8 The second main type of information filtering is collaborative filtering. Its basic idea is automating “word of mouth”, which means that products, documents, or items are recommended by friends and people with similar taste. The content of the document is not taken into account; instead, the opinions of users in a community are used to make recommendations. The schema of collaborative filtering is as follows: collecting users’ opinions via ratings, then searching for like-minded users by comparing ratings, and finally making recommendations. These three steps are repeated.
  • #9 The profile of a user consists only of the user’s ratings of items (documents). When making recommendations for the current user, similar profiles are chosen from the profiles of other users (in the figure, the profiles of users 2, 4 and 5). Estimates of the ratings of unrated documents are computed as a weighted combination of the similar profiles. Documents with a high estimated rating are recommended to the user.
  • #10 We used the Pearson correlation coefficient to find like-minded users. It measures the linear relationship between two profiles. Users whose coefficient is greater than a given threshold are considered like-minded. A weighted sum of the ratings of like-minded users is used to estimate the rating for the current user. ---- Ici - set of items rated by both users (c and i). Ucj - set of users who rated item j and have kci greater than a given threshold.
  • #11 We proposed a method that combines content-based and collaborative filtering. The combined method consists of these steps: After collecting ratings, estimates of the missing ratings are computed for each user using the content-based filtering method. Then Pearson correlation coefficients kci and kci’ are computed between the profile of the current user and the profile of the i-th user (for all users i), using only ratings and ratings with estimates, respectively. Finally, for each user, new estimates are computed as a weighted sum in which users’ ratings are weighted by coefficient kci and ratings with estimates are weighted by coefficient kci’.
  • #12 We ran experiments to compare the performance of the proposed method and the two main methods. We used the EachMovie database, which consists of users’ ratings for movies collected from 1995 to 1997. In addition we used the Internet Movie Database, which provides movies’ descriptions. Thus we had both textual information for items and users’ ratings. Since we hypothesized that the combined method can make better recommendations in the initial phase, when only a small number of ratings is available, we considered three datasets. They consist of ratings collected: Dataset A: from the period up to March 1, 1996 (810 ratings from 71 users) Dataset B: from the period up to March 15, 1996 (2407 ratings from 131 users) Dataset C: from the period up to April 1, 1996 (12 290 ratings from 651 users)
  • #14 During the experiment each dataset (A, B, or C) was divided into a training set and a test set. Then all methods were applied to estimate the ratings in the test set using the ratings in the training set. Finally the methods’ performance was evaluated.
  • #15 In order to measure the performance of the methods we ran 10 experiments for each method (CF, CBF and combined) and each dataset (A, B, C). We used four metrics. Coverage measures the percentage of items for which the method is able to compute estimates. Accuracy and F-measure are expressed in terms of two sets: R (recommended items) and L (liked items). Accuracy measures the percentage of items that were correctly classified (relevant items that were recommended and irrelevant items that were not recommended). F-measure combines Precision and Recall, which are based on the same two sets. NMAE is the Normalized Mean Absolute Error. The sum goes through the ratings in the test set (i - index for user, j - index for item). rij is the user’s rating and rij with hat is the estimate of the user’s rating, n is the number of estimates and s is the number of values of the scale (6). Our analysis uses an ANOVA test at the 95% confidence level.
  • #16 Results show that the combined method (yellow) provides significantly better coverage than the CF method (blue) and a significantly smaller Normalized Mean Absolute Error than the CBF method (violet), with comparable results on the other metrics.
  • #17 Although much additional work has to be done, from the results of our preliminary experiment we conclude that combining content-based filtering and collaborative filtering provides the advantages of both methods. In future work we plan to study the weighting of coefficients in the weighted sum of ratings, and to compare our results with the results of other published methods.
  • #18 In our experiment we use a vector representation of documents and profiles. As a dictionary we use a vector of words D. The vector representing a document consists of weights for the words in the dictionary vector. These weights are computed as term frequency - inverse document frequency (TF-IDF), which expresses the importance of a word on the basis of its frequency in the document and in the given collection of documents. As mentioned previously, the profile is composed from words in rated documents. It is also represented by a vector of weights for the words of the dictionary vector. These weights are computed as a weighted sum of the rated documents. To find unrated documents that may interest a particular user we use the normalized dot product of the profile and the unrated document vector as a similarity measure.
  • #19 Let us look at a very simple example. In the table we can see the ratings of 6 users (the current user plus the 5 from the previous slide) for 7 items (A to G). Each line of the table represents the profile of one user. The goal is to estimate ratings for the current user. We can see that users 2, 4 and 5 have ratings similar to the current user. User 1 has very different ratings and user 3 rated different items. Thus, from the ratings of users 2, 4 and 5, we can estimate the current user’s rating for item F -> 2 (animation).
  • #20 The correlation coefficient kci is the same as the one used in the collaborative filtering method; kci’ is the same Pearson correlation coefficient, but it also takes the estimates made by CBF (animation) into account. We do not use all estimates, only those for items (j) that were rated by at least one of the two users: either the current user or user i. Thus at least one of rcj and rij with superscript CBF is a real rating, and Ici’ is the union of the items rated by the current user and by user i. To compute estimates, a weighted sum of ratings and of estimates made by the content-based filtering method is used.
  • #21 To summarize, in our experiments we used content-based filtering method with vector representation and to compute estimations of ratings we used normalized dot product of document and profile vectors. For collaborative filtering we used Pearson correlation coefficient as a similarity measure and weighted sum of ratings for estimation computations. And for combined method we used Pearson correlation coefficients and weighted sum of ratings and estimates made by content-based filtering method.