5. Recommender Systems
Neighborhood-based Collaborative Filtering
Basic Steps
Assign a weight to all users with respect to similarity
with the active user.
Select k users that have the highest similarity with the
active user – (neighborhood)
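The two steps above can be sketched in Python. This is a minimal illustration, not the method used by any particular system: it assumes Pearson correlation as the similarity measure, and the function names (`pearson`, `select_neighborhood`) and the toy ratings are hypothetical.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # too little overlap to correlate
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

def select_neighborhood(ratings, active_user, k):
    """Step 1: weight every other user. Step 2: keep the k most similar."""
    weights = {
        user: pearson(ratings[active_user], user_ratings)
        for user, user_ratings in ratings.items()
        if user != active_user
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy data (hypothetical): user -> {item: rating}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 3},
    "carol": {"A": 1, "B": 5, "C": 2},
    "dave": {"A": 5, "B": 3, "C": 5},
}
neighborhood = select_neighborhood(ratings, "alice", k=2)
print(neighborhood)
```

With this toy data, bob and dave form the neighborhood of alice, while carol, whose ratings run opposite to alice's, is left out.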
9. Recommender Systems
Item-based Collaborative Filtering
Proposed in 2003
DOES NOT match similar users
DOES match similar items
Leads to faster online systems
Results in improved recommendations
11. Recommender Systems
More Extensions
Highly correlated neighbors based on very few
co-rated items
Significance Weighting
multiply the similarity weight by a significance
weighting factor
Default Voting
assume a default value for the rating for items that
have not been explicitly rated
Inverse User Frequency
Universally loved/hated items are less useful for prediction
12. Recommender Systems
Model-based Collaborative Filtering
Uses statistical models for predictions
Based on data mining and machine learning
algorithms
Latent factor and Matrix factorization models
have emerged as a state-of-the-art methodology
Netflix Prize competition
13. Recommender Systems
Content-based Recommending
Pure collaborative filtering recommenders treat all
users and items as atomic units
Can make a better personalized recommendation
by knowing more about a user or an item
Demographic information
Movie genres
Literary genres
16. Recommender Systems
Hybrid Approaches
Used to leverage the strengths of content-based
and collaborative recommenders.
Merging the list results to produce a final list.
Content-boosted collaborative filtering
17. Recommender Systems
Evaluation Metrics
Evaluation metrics are used to measure the quality
of a recommender system.
These systems are typically measured using
predictive accuracy metrics
1. Mean Absolute Error (MAE)
2. Root Mean Squared Error (RMSE)
21. Recommender Systems
Sparsity
User ratings matrix is typically very sparse
Affects collaborative filtering systems
The problem occurs when
The system has a very high item-to-user ratio.
The system is in the initial stages of use.
Solution - making assumptions about the data
generation process
22. Recommender Systems
Cold-Start Problem
New items and new users pose a significant
challenge to recommender systems.
New item problem
Use a content-based approach to produce
recommendations for all items
New user problem
selecting items to be rated by a user so as
to rapidly improve recommendation
performance with the least user feedback
23. Recommender Systems
Fraud
Push attacks
Increase the rating of their own products
Nuke attacks
Lower the ratings of their competitors
Item-based collaborative filtering is more robust
to these attacks
Content-based methods are unaffected by
profile injection attacks.
24. Recommender Systems
Content-based or Collaborative Filtering
Advantages of CF over CB
CF can perform in domains where there is not
much content associated with items
CF can also perform when content is difficult for
a computer to analyze.
CF system has the ability to provide
serendipitous recommendations.
Editor's Notes
In neighborhood-based CF, every user should be considered in finding neighbors. When the number of users is small, neighborhood-based collaborative filtering works.
When the number of users is large, computational complexity is high and it is difficult to find neighbors. The alternative is item-based collaborative filtering.
Proposed in 2003 by Linden, Smith, and York. It does not match similar users, as neighborhood-based CF does. Instead, it matches a user's rated items to similar items. Research shows this leads to faster online systems and also results in improved recommendations.
Pearson correlation is used to find the similarity between two items i and j. U is the set of users who have rated both items i and j. r(u,i) is the rating of user u on item i. r'(i) is the average rating of item i across all the users. The rating of item i for user u is then predicted using a weighted average.
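The formula described above was not preserved in the transcript. A reconstruction from the definitions in the note is given here as a sketch. The symbol names w_{i,j} (similarity weight), p_{u,i} (predicted rating), and K (the set of items rated by user u that are most similar to item i) are assumptions, and \bar{r}_i stands for the average rating r'(i).

```latex
w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}
               {\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2}\,
                \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}

p_{u,i} = \frac{\sum_{j \in K} r_{u,j}\, w_{i,j}}{\sum_{j \in K} |w_{i,j}|}
```

The second expression is the weighted average mentioned in the note: each rating the user has already given is weighted by how similar its item is to item i.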
It is common for the active user to have highly correlated neighbors that are based on very few co-rated (overlapping) items. These neighbors based on a small number of overlapping items tend to be bad predictors. One approach to tackle this problem is to multiply the similarity weight by a significance weighting factor, which devalues the correlations based on few co-rated items. Another approach is to apply a default value to unrated items. One can then compute the correlation using the union of items rated by the users being matched, as opposed to the intersection. There may also be items which are universally loved or hated, and they are bad for predictions. A value called inverse user frequency is calculated and the original CF rating is multiplied by this value. All of these are neighborhood-based methods, which generate recommendations based on statistical notions of similarity between users or between items.
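Two of the extensions above can be sketched in a few lines of Python. Both formulas are assumptions rather than something the notes specify: the significance factor is taken to be min(n, threshold) / threshold with an illustrative threshold of 50 co-rated items, and inverse user frequency is taken to be log(n / n_j), where n is the number of users and n_j the number of users who rated item j. The function names are hypothetical.

```python
import math

def significance_weight(similarity, n_corated, threshold=50):
    """Devalue a similarity that rests on fewer than `threshold` co-rated items.

    The linear factor and the threshold of 50 are illustrative assumptions.
    """
    return similarity * min(n_corated, threshold) / threshold

def inverse_user_frequency(n_users, n_users_rated_item):
    """Assumed definition log(n / n_j): zero for an item every user has rated."""
    return math.log(n_users / n_users_rated_item)

# A correlation of 0.9 built on only 5 shared items is heavily discounted.
discounted = significance_weight(0.9, n_corated=5)
# The same correlation built on 80 shared items is left unchanged.
trusted = significance_weight(0.9, n_corated=80)
# An item rated by all 100 users carries no weight; a rarer item carries more.
common_item = inverse_user_frequency(100, 100)
rare_item = inverse_user_frequency(100, 10)
```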
Model-based CF uses statistical models for predictions. Latent factor models assume that the similarity between users and items is simultaneously induced by some hidden lower-dimensional structure in the data. For example, the rating that a user gives to a movie might be assumed to depend on a few implicit factors, such as the user's taste across various movie genres. These statistical models are developed based on data mining and machine learning algorithms. Currently the latent factor and matrix factorization models are widely used. Netflix, a popular movie website, held a competition (launched in 2006, with the grand prize awarded in 2009) to design the best collaborative filtering algorithm to predict user ratings for films. The grand prize of US$1,000,000 was given to the team which bested Netflix's own algorithm for predicting ratings by 10.06%. The final winning solution was a complex ensemble of different models, including several enhancements to basic matrix factorization models.
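To make the latent factor idea concrete, here is a minimal sketch of basic matrix factorization trained with stochastic gradient descent. It is not the Netflix Prize solution. The hyperparameters, the function names (`factorize`, `predict`, `train_rmse`), and the toy ratings are all illustrative assumptions.

```python
import math
import random

def factorize(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ rating."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_rmse(P, Q, ratings):
    """Root mean squared error over the observed (user, item, rating) triples."""
    sq = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))

# Toy observed ratings (hypothetical): (user index, item index, rating)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
P0, Q0 = factorize(ratings, n_users=4, n_items=3, epochs=0)  # untrained
P, Q = factorize(ratings, n_users=4, n_items=3)              # trained
print(train_rmse(P0, Q0, ratings), train_rmse(P, Q, ratings))
```

Once trained, `predict(P, Q, u, i)` can also be called for a user-item pair that has no observed rating, which is how the model fills in the sparse ratings matrix.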
So far we have discussed collaborative filtering. The second type of recommender system is content-based recommending. Pure CF techniques treat users and items as atomic units. They make predictions without regard to the specifics of individual users or items. But using underlying information about users or items, better predictions can be made. Examples include demographic information about users (age group, gender, ethnicity, languages, etc.) and movie genres such as action, comedy, horror, drama, and romance.
Assume that a particular user has liked Star Wars and Star Trek. When the content of those movies is analyzed, we find that the genre is sci-fi. Based on that, we can recommend another sci-fi movie to the user, such as Oblivion.
Content-based recommending is mainly focused on items with associated textual content, such as web pages, books, and movies. There are two approaches to tackle this problem. In the first, the recommendation problem is treated as an information retrieval task: the user's preferences are treated as a query, and the unrated documents are scored by relevance or similarity to this query. In the second, the recommendation problem is treated as a classification task: each example represents the content of an item, and a user's past ratings are used as labels for these examples.
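The information retrieval view can be sketched as follows. This is an assumption-laden illustration: it uses raw word counts and cosine similarity as the relevance score, where a real system might use TF-IDF or another weighting. The item names and descriptions are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The user's liked items form the query (hypothetical descriptions).
liked = [
    "space opera with rebels and starships",
    "starship crew explores deep space",
]
query = Counter()
for text in liked:
    query.update(bag_of_words(text))

# Unrated items are scored by similarity to the query and ranked.
unrated = {
    "scifi_film": "a lone starship pilot lost in deep space",
    "romance_film": "two strangers fall in love in paris",
}
scores = {name: cosine(query, bag_of_words(text)) for name, text in unrated.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The classification view would instead train a classifier on the same word-count vectors, using the user's past ratings as labels.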
In order to leverage the strengths of content-based and collaborative recommenders, people have come up with hybrid approaches which combine the two. A simple approach is to allow both content-based and collaborative filtering methods to produce separate ranked lists of recommendations, and then merge their results to produce one final list. To improve on this, the two predictions can be combined using an adaptive weighted average, where the weight of the collaborative component increases as the number of users accessing an item increases. In content-boosted collaborative filtering, content-based predictions are applied to convert a sparse user ratings matrix into a full ratings matrix, and then a CF method is used to provide recommendations.
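The adaptive weighted average can be sketched like this. The notes do not give the weighting function, so the form n / (n + c) and the constant `half_weight_at` are assumptions chosen only to make the collaborative weight grow with the number of users who have accessed the item.

```python
def hybrid_prediction(cf_pred, cb_pred, n_item_users, half_weight_at=10):
    """Blend collaborative (cf_pred) and content-based (cb_pred) predictions.

    The collaborative weight n / (n + half_weight_at) is an assumed form:
    it is 0 for a brand-new item and approaches 1 as more users access it.
    """
    w_cf = n_item_users / (n_item_users + half_weight_at)
    return w_cf * cf_pred + (1 - w_cf) * cb_pred

# New item: the content-based prediction is used as is.
new_item = hybrid_prediction(cf_pred=4.0, cb_pred=2.0, n_item_users=0)
# Item accessed by 10 users: the two predictions are weighted equally.
mid_item = hybrid_prediction(cf_pred=4.0, cb_pred=2.0, n_item_users=10)
# Popular item: the collaborative prediction dominates.
popular_item = hybrid_prediction(cf_pred=4.0, cb_pred=2.0, n_item_users=1000)
```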
The quality of a recommender system can be evaluated by comparing recommendations to a test set of known user ratings. These systems are typically measured using predictive accuracy metrics, where the predicted ratings are directly compared to actual user ratings. The most commonly used metrics are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE).
The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.
The RMSE is a quadratic scoring rule which measures the average magnitude of the error. Expressing the formula in words, the differences between forecast and corresponding observed values are each squared and then averaged over the sample. Finally, the square root of the average is taken. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.
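Both metrics follow directly from the descriptions above. The sketch below uses hypothetical predicted and actual ratings to show how a single large error affects the RMSE more than the MAE.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error: square, average, then take the square root."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical ratings: the errors are 0, 1, 0 and 3.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [4.0, 4.0, 5.0, 5.0]
print(mae(predicted, actual))   # 1.0
print(rmse(predicted, actual))  # sqrt(2.5), about 1.58
```

The single error of 3 pulls the RMSE well above the MAE, which is the sense in which RMSE penalizes large errors more heavily.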
New items and new users pose a significant challenge to recommender systems. Collectively these problems are referred to as the cold-start problem.
Stated simply, most users do not rate most items and, hence, the user ratings matrix is typically very sparse. This is a problem for collaborative filtering systems, since it decreases the probability of finding a set of users with similar ratings. This problem often occurs when a system has a very high item-to-user ratio, or the system is in the initial stages of use. It can be mitigated by using additional domain information about items (for example, when a new movie is added to the system), or by making assumptions about the data generation process that allow for high-quality imputation.
New items and new users pose a significant challenge to recommender systems. Collectively these problems are referred to as the cold-start problem. The first of these problems arises in collaborative filtering systems, where an item cannot be recommended unless some user has rated it before. The solution is a content-based approach: since content-based approaches do not rely on ratings from other users, they can be used to produce recommendations for all items, provided attributes of the items are available. The new-user problem is difficult to tackle, since without previous preferences of a user it is not possible to find similar users or to build a content-based profile. One solution is to select items to be rated by a user so as to rapidly improve recommendation performance with the least user feedback.
As recommender systems are being increasingly adopted by commercial websites, they have started to play a significant role in affecting the profitability of sellers. This has led to many vendors engaging in different forms of fraud, trying to increase their profits by manipulating recommender systems to their benefit. In push attacks they increase the ratings of their own products. In nuke attacks they lower the ratings of their competitors.
Now let us see which method is better. CF can perform in domains where there is not much content associated with items. CF can also perform when content is difficult for a computer to analyze. A CF system has the ability to provide serendipitous recommendations.