User-Based or Memory-Based Filtering
Hina Arora
Reference: Zacharski, A Programmer's Guide to Data Mining, Chapter 2
User-Based or Memory-Based Filtering
• We compare a user with every other user to find the closest matches
• Also called Memory-Based Filtering because we need to store all ratings
in order to make recommendations
• It’s a 3-step process. Let’s say we are trying to find item recommendations for User X:
o Find past item ratings from User X
o Find the “most similar” User Y (based on similarity of item ratings) from the remaining user corpus
o Recommend those items to User X that the “most similar” User Y has rated and that User X hasn’t rated yet
What measures of similarity can we use
between two users?
1) Distance Based (dis)similarity measures
a. Manhattan distance
b. Euclidean distance
c. Minkowski distance
2) Cosine-based similarity measure
3) Pearson Correlation-based similarity measure
4) K-nearest neighbor
Running Example for this Section
Let’s say we are trying to find item recommendations for Veronica:
• We already have Veronica’s past item ratings.
• Now if we can find the user who is “most similar” to Veronica based on their item ratings,
• then we can recommend those items to Veronica that are highly rated by that “most similar” user, and that Veronica hasn’t already discovered.
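The ratings table for this running example does not survive the extraction, so here it is as a Python dictionary. This is the dataset from Zacharski, Chapter 2; the entries agree with the worked distance calculations on the following slides (ratings not exercised by those calculations are taken from Zacharski's table).

```python
# Band ratings from Zacharski, "A Programmer's Guide to Data Mining", Chapter 2.
# Each user maps to {band: rating}; a missing key means the user hasn't rated it.
users = {
    "Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5,
                 "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5,
                 "Vampire Weekend": 2.0},
    "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0,
             "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
    "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0,
             "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 1.0},
    "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5,
            "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
            "Vampire Weekend": 2.0},
    "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0,
               "The Strokes": 4.0, "Vampire Weekend": 1.0},
    "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0,
               "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
               "Vampire Weekend": 4.0},
    "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0,
            "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
    "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
                 "Slightly Stoopid": 2.5, "The Strokes": 3.0},
}
```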
(1) Distance Based Measures
Distance-based (dis)similarity measures between User X and User Y based on n item ratings:
• Distance ≥ 0
• Most Similar ⇔ Shortest Distance
• An item rating is considered in the distance measure only if it exists for both users
• Three different distance-based measures:
a) Manhattan Distance: $\sum_{k=1}^{n} |x_k - y_k|$
b) Euclidean Distance: $\left(\sum_{k=1}^{n} (x_k - y_k)^2\right)^{1/2}$
c) Minkowski Distance: $\left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r}$
Note: Minkowski is a generalization of Manhattan (r = 1) and Euclidean (r = 2)
(Figure: example comparing 3 users across 2 items.)
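A minimal sketch of the three measures in Python, assuming the `users` dictionary above; the helper names are ours, not the book's. Each function sums only over items rated by both users:

```python
def shared_items(ratings_x, ratings_y):
    """Items rated by both users -- only these enter the distance."""
    return set(ratings_x) & set(ratings_y)

def minkowski(ratings_x, ratings_y, r):
    """Minkowski distance of order r over co-rated items."""
    common = shared_items(ratings_x, ratings_y)
    return sum(abs(ratings_x[k] - ratings_y[k]) ** r for k in common) ** (1 / r)

def manhattan(ratings_x, ratings_y):
    return minkowski(ratings_x, ratings_y, r=1)  # Manhattan is the r = 1 case

def euclidean(ratings_x, ratings_y):
    return minkowski(ratings_x, ratings_y, r=2)  # Euclidean is the r = 2 case

print(manhattan(users["Hailey"], users["Veronica"]))  # 2.0
```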
a) Manhattan Distance: $\sum_{k=1}^{n} |x_k - y_k|$
We want to find item recommendations for Veronica.
• Manhattan Distances between Veronica and all other users (only item ratings present for both users enter the distance measure):
o Angelica and Veronica = |3.5-3| + |4.5-5| + |5-4| + |1.5-2.5| + |2.5-3| = 3.5
o Bill and Veronica = |2-3| + |2-4| + |3.5-2.5| = 4
o Chan and Veronica = |5-3| + |3-5| + |5-4| + |1-2.5| = 6.5
o Dan and Veronica = |3-3| + |3-4| + |4.5-2.5| + |4-3| = 4
o Hailey and Veronica = |4-5| + |4-3| = 2
o Jordyn and Veronica = |5-5| + |5-4| + |4.5-2.5| + |4-3| = 4
o Sam and Veronica = |5-3| + |3-5| + |5-4| + |4-2.5| + |5-3| = 8.5
• User most similar (shortest distance) to Veronica: Hailey (Manhattan Distance 2)
o Hailey has rated three items that Veronica hasn’t: Broken Bells (Rating 4), Deadmau5 (Rating 1), Vampire Weekend (Rating 1)
o So we can make the following recommendation to Veronica: [('Broken Bells', 4.0), ('Deadmau5', 1.0), ('Vampire Weekend', 1.0)]
since these are items rated by the most similar user, Hailey, that Veronica hasn’t discovered yet (listed by Hailey’s rating)
Note: Ties broken arbitrarily
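A sketch of the full nearest-neighbor recommendation step, reusing the `users` dictionary and `manhattan` helper from above (again, the function names are ours); it reproduces the recommendation on this slide:

```python
def nearest_neighbor(username, users, distance):
    """The user closest to username under the given distance (ties broken arbitrarily)."""
    others = (u for u in users if u != username)
    return min(others, key=lambda u: distance(users[username], users[u]))

def recommend(username, users, distance):
    """Items the nearest neighbor has rated that username hasn't, best-rated first."""
    neighbor = nearest_neighbor(username, users, distance)
    unseen = [(item, rating) for item, rating in users[neighbor].items()
              if item not in users[username]]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)

print(recommend("Veronica", users, manhattan))
# [('Broken Bells', 4.0), ('Deadmau5', 1.0), ('Vampire Weekend', 1.0)]
```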
b) Euclidean Distance: $\left(\sum_{k=1}^{n} (x_k - y_k)^2\right)^{1/2}$
We want to find item recommendations for Veronica.
• Euclidean Distances between Veronica and all other users (only item ratings present for both users enter the distance measure):
o Angelica and Veronica = (|3.5-3|^2 + |4.5-5|^2 + |5-4|^2 + |1.5-2.5|^2 + |2.5-3|^2)^(1/2) ≈ 1.7
o Bill and Veronica = (|2-3|^2 + |2-4|^2 + |3.5-2.5|^2)^(1/2) ≈ 2.4
o Chan and Veronica = (|5-3|^2 + |3-5|^2 + |5-4|^2 + |1-2.5|^2)^(1/2) ≈ 3.4
o Dan and Veronica = (|3-3|^2 + |3-4|^2 + |4.5-2.5|^2 + |4-3|^2)^(1/2) ≈ 2.4
o Hailey and Veronica = (|4-5|^2 + |4-3|^2)^(1/2) ≈ 1.4
o Jordyn and Veronica = (|5-5|^2 + |5-4|^2 + |4.5-2.5|^2 + |4-3|^2)^(1/2) ≈ 2.4
o Sam and Veronica = (|5-3|^2 + |3-5|^2 + |5-4|^2 + |4-2.5|^2 + |5-3|^2)^(1/2) ≈ 3.9
• User most similar (shortest distance) to Veronica: Hailey (Euclidean Distance 1.4)
o Hailey has rated three items that Veronica hasn’t: Broken Bells (Rating 4), Deadmau5 (Rating 1), Vampire Weekend (Rating 1)
o So we can make the following recommendation to Veronica: [('Broken Bells', 4.0), ('Deadmau5', 1.0), ('Vampire Weekend', 1.0)],
since these are items rated by the most similar user, Hailey, that Veronica hasn’t discovered yet
Note: Ties broken arbitrarily
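The same sketch works for Euclidean distance; only the distance argument changes:

```python
print(round(euclidean(users["Hailey"], users["Veronica"]), 2))  # 1.41
print(recommend("Veronica", users, euclidean))
# [('Broken Bells', 4.0), ('Deadmau5', 1.0), ('Vampire Weekend', 1.0)]
```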
c) Minkowski Distance: $\left(\sum_{k=1}^{n} |x_k - y_k|^r\right)^{1/r}$
Try this for yourself to make recommendations for Veronica (with r=3)
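With the helpers above, the exercise is a one-line call (result left for you to verify):

```python
print(recommend("Veronica", users, lambda x, y: minkowski(x, y, r=3)))
```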
Which Distance-Based Measure to Use?
• For higher-dimensional vectors, you might find that lower-order Minkowski
Distances (Manhattan r=1, Euclidean r=2) work better than higher-order
Minkowski Distances (r > 2).
• This is because the higher the order, the more the distance is going to be
dominated by the dimension with the highest difference.
• At lower orders, all dimensions get to play a substantial role in the distance
measure.
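A quick numerical illustration of this effect: hold the per-dimension differences fixed and vary the Minkowski order r, and the distance collapses toward the single largest difference.

```python
diffs = [3.0, 1.0, 1.0, 1.0]  # one large per-dimension difference, three small ones
for r in (1, 2, 10):
    print(r, round(sum(d ** r for d in diffs) ** (1 / r), 3))
# 1 6.0    -> every dimension contributes fully
# 2 3.464  -> the largest difference starts to dominate
# 10 3.0   -> the distance is essentially max(diffs)
```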
When to use Distance-Based Measures?
• If your data is dense (not too many zero or missing attribute values) and the magnitude of the attribute values is important, use distance measures such as Euclidean or Manhattan. If the data is sparse, you can end up with spurious results:
o For instance, when you compute the distance between Hailey and Veronica, you notice they
only rated two bands in common (Norah Jones and The Strokes), whereas when you compute
the distance between Hailey and Jordyn, you notice they rated five bands in common.
o This will skew our distance measurement, since the Hailey-Veronica distance is in 2 dimensions
while the Hailey-Jordyn distance is in 5 dimensions.
o Adding 0s to missing ratings typically just exacerbates the problem.
• Use smaller r (r=1 or r=2) if you don't want the measure to be dominated by larger
differences
• You may need to scale data if the attributes are on very different scales
(2) Cosine Similarity Measure
Motivation:
• Let’s say we were trying to make song recommendations for a user based on what other “similar” users have played. Most users would have played most songs 0 times, and very few songs a non-0 number of times. So we wouldn’t want our similarity measure to be based on the number of shared 0 values, since any two users are likely to have “not played” many of the same songs. What we’d like is a Jaccard-style measure, but for non-binary vectors.
• Similarly, we’d only want to count a match when both users have played a song, rather than when one has and the other hasn’t. Otherwise, we’d be letting the similarity measure be overwhelmed by non-matches rather than matches.*
• Cosine similarity accounts for both of these by considering product terms (so the 0s fall off naturally).
(Figure: two rating vectors separated by angle θ.)
* Unless we want to “self-dampen” the similarity measure based on amount of overlap…
Cosine-based similarity measure between User X and User Y based on n
item ratings:
• Cosine similarity lies between -1 and 1
(-1 total opposites, 0 independent, 1 perfectly similar)
• Most Similar ⇔ Highest Cosine
• An item rating is considered in the cosine measure only if it exists for both users (that is, we consider the intersection).*
• $\cos(x,y) = \dfrac{x \cdot y}{\|x\|\,\|y\|} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
* Or we could consider the union if we’d like the similarity measure to self-dampen
based on amount of overlap.
Note: Ties broken arbitrarily. (The worked cosine example slide is not reproduced in this extraction.)
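A minimal sketch of the intersection variant described above, again assuming the `users` dictionary from earlier:

```python
from math import sqrt

def cosine(ratings_x, ratings_y):
    """Cosine similarity over co-rated items (assumes at least one shared item)."""
    common = set(ratings_x) & set(ratings_y)
    dot = sum(ratings_x[k] * ratings_y[k] for k in common)
    norm_x = sqrt(sum(ratings_x[k] ** 2 for k in common))
    norm_y = sqrt(sum(ratings_y[k] ** 2 for k in common))
    return dot / (norm_x * norm_y)

# Highest cosine = most similar, so take a max rather than a min:
best = max((u for u in users if u != "Veronica"),
           key=lambda u: cosine(users[u], users["Veronica"]))
```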
When to use Cosine Measure?
• If the data is sparse (too many zero or missing attribute values)
consider using Cosine Similarity since it ignores 0 matches.
(3) Pearson Correlation Measure
Motivation:
• Users often have different rating patterns. For instance, Bill seems to avoid extreme ratings: his ratings range from 2 to 4. Jordyn seems to like everything: her ratings range from 4 to 5. Hailey is a binary person, giving ratings of either 1 or 4.
• In other words, users often anchor their ratings at different scales. One user
might rate <bad, good, great> as <1, 2, 3>, whereas another user might rate
<bad, good, great> as <3, 4, 5>.
• So we need a way to be able to base similarity on similar trending of ratings,
rather than similar absolute ratings.
Pearson Correlation-based similarity measure between User X and User Y based on n item ratings:
• Correlation lies between -1 and 1
(-1 perfectly negatively correlated, 0 uncorrelated, 1 perfectly positively correlated)
• Most Similar ⇔ Highest Correlation
• An item rating is considered in the Pearson measure only if it exists for both users (intersection).
• Original formula:
$\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
• Modified formula (more efficient since it only requires a single pass through the data):
$\dfrac{\sum_{i=1}^{n}x_i y_i-\frac{\sum_{i=1}^{n}x_i\sum_{i=1}^{n}y_i}{n}}{\sqrt{\sum_{i=1}^{n}x_i^2-\frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}}\,\sqrt{\sum_{i=1}^{n}y_i^2-\frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}}}$
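A sketch of the single-pass (modified) formula above, over co-rated items; the function name and the zero-denominator handling are our choices:

```python
from math import sqrt

def pearson(ratings_x, ratings_y):
    """Pearson correlation over co-rated items via the single-pass formula."""
    common = set(ratings_x) & set(ratings_y)
    n = len(common)
    if n == 0:
        return 0.0
    sum_x = sum(ratings_x[k] for k in common)
    sum_y = sum(ratings_y[k] for k in common)
    sum_xy = sum(ratings_x[k] * ratings_y[k] for k in common)
    sum_x2 = sum(ratings_x[k] ** 2 for k in common)
    sum_y2 = sum(ratings_y[k] ** 2 for k in common)
    denom = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return (sum_xy - sum_x * sum_y / n) / denom if denom else 0.0
```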
NOTE
Recall:
• Cosine Similarity: $\dfrac{x \cdot y}{\|x\|\,\|y\|} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
• Pearson Correlation: $\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
Mean-centered or mean-adjusted cosine:
• If we center the data before taking the cosine similarity, and only consider items rated by both users, then it is exactly the same as Pearson.
• If we center the data before taking the cosine similarity, make unknowns 0, and consider all items, then the denominator will dampen the effect of too few overlaps. This in some sense factors in confidence related to the degree of overlap, and can sometimes perform better than Pearson Correlation.
https://grouplens.org/blog/similarity-functions-for-user-user-collaborative-filtering/
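A quick sketch checking the first bullet: mean-centering each user's co-rated ratings and then taking the cosine (reusing the `cosine` and `pearson` helpers from earlier) yields the Pearson value.

```python
def centered_cosine(ratings_x, ratings_y):
    """Mean-center each user's co-rated ratings, then take the cosine."""
    common = set(ratings_x) & set(ratings_y)
    mean_x = sum(ratings_x[k] for k in common) / len(common)
    mean_y = sum(ratings_y[k] for k in common) / len(common)
    cx = {k: ratings_x[k] - mean_x for k in common}
    cy = {k: ratings_y[k] - mean_y for k in common}
    return cosine(cx, cy)

a, b = users["Angelica"], users["Veronica"]
print(round(centered_cosine(a, b), 4), round(pearson(a, b), 4))  # identical values
```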
Note: Ties broken arbitrarily. (The worked Pearson example slide is not reproduced in this extraction.)
When to use Pearson Correlation Measure?
• If the data is subject to grade-inflation (different users may be using
different scales) use Pearson Similarity.
• Recall that Pearson correlation considers an item rating in the similarity measure only if it exists for both users. If you’d like to factor in confidence related to the degree of overlap, you can additionally “dampen” the similarity by multiplying it by a weighting factor such as $\dfrac{\min(|I_u \cap I_v|, 50)}{50}$
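As a sketch, the damping is a one-liner on top of the `pearson` helper from earlier; `cutoff` plays the role of the 50 in the formula:

```python
def damped_pearson(ratings_x, ratings_y, cutoff=50):
    """Pearson similarity shrunk toward 0 when the users share few rated items."""
    overlap = len(set(ratings_x) & set(ratings_y))
    return pearson(ratings_x, ratings_y) * min(overlap, cutoff) / cutoff
```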
(4) K-Nearest Neighbor Recommendation
Motivation:
• Note that in all of the above methods, we have relied on a single “most
similar” person.
• So any quirk that person has is passed on as a recommendation.
• One way of getting around this is to base our recommendations on more than
one person who is similar to our user.
K-Nearest Neighbor recommendation using a Pearson Correlation-based similarity measure between users:
• The projected rating for each item is calculated using the weighted average influence
(based on similarity) of the k-nearest neighbors.
• So let’s say the k-nearest neighbors to Ann are Sally, Eric and Amanda with Pearson
Scores of 0.8, 0.7 and 0.5.
o Then their influence (based on Pearson similarity) is 0.8/(0.8+0.7+0.5)=0.4, 0.7/(0.8+0.7+0.5)=0.35, and
0.5/(0.8+0.7+0.5)=0.25.
o Now suppose Sally, Eric, and Amanda rated the band Grey Wardens as 3.5, 5, 4.5, then the projected rating for
Ann would be 3.5*0.4+5*0.35+4.5*0.25=4.275.
• Note on calculating the weights: since Pearson Correlation can be negative, the above
method of calculating weights can create some interesting challenges. So when
calculating the weights, it is customary to first transform the Pearson Coefficient to a 0-1
scale using the following transformation: (pc + 1)/2.
• The best value for k is application specific—you will need to do some experimentation.
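A sketch of the projection described above, assuming the `pearson` helper from earlier; the hand-check at the bottom reproduces the slide's 4.275 figure (Sally, Eric, and Amanda are the slide's hypothetical neighbors, not part of the `users` data):

```python
def knn_projected_rating(username, item, users, k=3):
    """Project username's rating for item from the k most similar users who rated it,
    weighting each neighbor by its share of the total similarity."""
    candidates = [(pearson(users[username], users[u]), u)
                  for u in users if u != username and item in users[u]]
    # Rescale Pearson from [-1, 1] to [0, 1] so negative scores can't flip signs:
    neighbors = sorted((((pc + 1) / 2, u) for pc, u in candidates), reverse=True)[:k]
    total = sum(w for w, _ in neighbors)
    return sum(w / total * users[u][item] for w, u in neighbors)

# Hand-check with the slide's numbers for Sally, Eric, and Amanda:
sims, ratings = [0.8, 0.7, 0.5], [3.5, 5.0, 4.5]
print(round(sum(s / sum(sims) * r for s, r in zip(sims, ratings)), 3))  # 4.275
```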
Note: Ties broken arbitrarily. (The worked example slide, with K = 3 and Veronica’s ratings table, is not reproduced in this extraction.)