Recommender Systems
Collaborative Filtering & Content-Based Recommending

Recommender Systems
• Systems for recommending items (e.g. books, movies, CDs, web pages, newsgroup messages) to users based on examples of their preferences.
• Many on-line stores provide recommendations (e.g. Amazon, CDNow).
• Recommenders have been shown to substantially increase sales at on-line stores.

Book Recommender
[Figure: book titles (Red Mars, Jurassic Park, Lost World, 2001, Foundation, Difference Engine, Neuromancer, 2010) connected through a Machine Learning component and a User Profile.]

Utility matrix
In a recommendation-system application there are two
classes of entities, which we shall refer to as users and
items.
Users have preferences for certain items, and these
preferences must be teased out of the data. The data
itself is represented as a utility matrix, giving for each
user-item pair, a value that represents what is known
about the degree of preference of that user for that
item.
We assume that the matrix is sparse, meaning that most
entries are “unknown.” An unknown rating implies that
we have no explicit information about the user’s
preference for the item.
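A minimal sketch of a sparse utility matrix, assuming a dict-of-dicts layout in which a missing key stands for an "unknown" entry. The user names, item names and ratings are made up for illustration.

```python
# Sparse utility matrix: only known ratings are stored.
# User names, item names and ratings are hypothetical.
utility = {
    "alice": {"HP1": 4, "TW": 5, "SW1": 1},
    "bob": {"HP1": 5, "HP2": 5, "HP3": 4},
    "carol": {"TW": 2, "SW1": 4, "SW2": 5},
}


def rating(utility, user, item):
    """Return the known rating, or None when the entry is unknown."""
    return utility.get(user, {}).get(item)


def unknown_entries(utility):
    """List the (user, item) pairs a recommender would have to predict."""
    items = sorted({i for ratings in utility.values() for i in ratings})
    return [(u, i) for u in sorted(utility) for i in items if i not in utility[u]]


def sparsity(utility):
    """Fraction of user-item pairs with no known rating."""
    items = {i for ratings in utility.values() for i in ratings}
    total = len(utility) * len(items)
    known = sum(len(ratings) for ratings in utility.values())
    return 1 - known / total
```

Storing only the known entries keeps memory proportional to the number of ratings rather than to users times items, which matters when most entries are unknown.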

Example
The goal of a recommendation system is to
predict the blanks in the utility matrix.

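The Long-Tail Phenomenon
The distinction between the physical and on-line worlds has been called the long-tail phenomenon. A physical store has limited shelf space and can show only its most popular items, while an on-line store can offer everything, including the many items that few people choose.
The long-tail phenomenon forces on-line institutions to recommend items to individual users. It is not possible to present all available items to the user, the way physical institutions can.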

Populating the Utility Matrix
There are two ways to obtain the entries:
1. We can ask users to rate items.
2. We can make inferences from users’ behavior.
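A rough sketch of the second option, assuming that any observed interaction counts as a "like" and is stored as a 1. That is a common simplification rather than something these slides specify, and the event log below is hypothetical.

```python
def infer_ratings(events):
    """Turn (user, item, action) events into an implicit utility matrix.

    Every observed interaction is recorded as a 1; all other entries
    stay unknown (they are simply absent from the dictionaries).
    """
    utility = {}
    for user, item, _action in events:
        utility.setdefault(user, {})[item] = 1
    return utility


# Hypothetical behavior log.
events = [
    ("alice", "book_17", "purchase"),
    ("alice", "book_42", "view"),
    ("bob", "book_17", "purchase"),
]
implicit = infer_ratings(events)
```

A real system would likely weight actions differently (a purchase says more than a page view); this sketch ignores the action type.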

Approaches to Recommender Systems
There are two basic approaches to recommending:
1. Collaborative Filtering
2. Content-based Filtering

Collaborative Filtering
• Maintain a database of many users’ ratings of a variety of items.
• For a given user, find other similar users whose ratings strongly correlate with the current user.
• Recommend items rated highly by these similar users, but not rated by the current user.
• Almost all existing commercial recommenders use this approach (e.g. Amazon).

[Figure: the active user's ratings (A 9, B 3, C unrated, ..., Z 5) are correlated against each user in the user database; the best match (A 10, B 4, C 8, ..., Z 1) is used to extract a recommendation for item C.]

Method
• Weight all users with respect to similarity with the active user.
• Select a subset of the users (neighbors) to use as predictors.
• Normalize ratings and compute a prediction from a weighted combination of the selected neighbors’ ratings.
• Present items with highest predicted ratings as recommendations.
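The four steps can be sketched as follows. This is one possible reading, not the slides' definitive method: it assumes Pearson correlation as the similarity weight, keeps the k most similar users with positive weight as neighbors, and normalizes by subtracting each user's mean rating. The ratings are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two users over the items both have rated."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(utility, active, item, k=2):
    """Predict the active user's rating for an item, or None if impossible."""
    target = utility[active]
    # Step 1: weight every other user who rated the item by similarity.
    weights = [(pearson(target, r), u) for u, r in utility.items()
               if u != active and item in r]
    # Step 2: keep the k most similar users with positive weight.
    neighbors = sorted((w for w in weights if w[0] > 0), reverse=True)[:k]
    if not neighbors:
        return None
    # Step 3: weighted combination of the neighbors' mean-centered ratings.
    mean_active = sum(target.values()) / len(target)
    num = sum(w * (utility[u][item] - sum(utility[u].values()) / len(utility[u]))
              for w, u in neighbors)
    den = sum(w for w, _ in neighbors)
    return mean_active + num / den


def recommend(utility, active, n=3):
    """Step 4: rank the active user's unrated items by predicted rating."""
    items = {i for r in utility.values() for i in r} - set(utility[active])
    scored = [(predict(utility, active, i), i) for i in items]
    scored = [(s, i) for s, i in scored if s is not None]
    return [i for _, i in sorted(scored, reverse=True)[:n]]


# Hypothetical ratings.
utility = {
    "active": {"A": 9, "B": 3, "Z": 5},
    "u1": {"A": 10, "B": 4, "C": 8, "Z": 1},
    "u2": {"A": 5, "B": 3, "Z": 7},
    "u3": {"C": 9, "Z": 10},
}
suggestions = recommend(utility, "active")
```

Here only u1 both rated item C and shares enough items with the active user to get a positive weight, so C is the single recommendation.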

Measuring Similarity
The first question we must deal with is how to measure the similarity of users or items from their rows or columns in the utility matrix.
Jaccard Distance: Ignore the rating values and treat each user as the set of items that user has rated. Suppose users A and B have an intersection of size 1 and a union of size 5. Their Jaccard similarity is then 1/5 and their Jaccard distance is 4/5; i.e., they are very far apart.
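A small sketch of the Jaccard computation, using hypothetical item sets chosen so that the intersection has size 1 and the union has size 5.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1 - jaccard_similarity(a, b)


# Hypothetical sets of rated items for users A and B.
A = {"HP1", "TW", "SW1"}
B = {"HP1", "HP2", "HP3"}
```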

Measuring Similarity (continued)
Cosine Distance: We can treat blanks as a 0 value. This choice is questionable, since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it.
Rounding the Data: We could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings. For instance, we could consider ratings of 3, 4, and 5 as a “1” and consider ratings 1 and 2 as unrated.
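A sketch of both ideas on sparse rating dictionaries, with hypothetical ratings. The function returns the cosine of the angle between two users' vectors; a larger cosine means a smaller angle, and so a smaller cosine distance.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors; blanks count as 0."""
    dot = sum(a[i] * b[i] for i in a if i in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_ratings(ratings, threshold=3):
    """Map ratings at or above the threshold to 1; drop the rest as unrated."""
    return {item: 1 for item, r in ratings.items() if r >= threshold}


# Hypothetical ratings on a 1-5 scale.
A = {"HP1": 4, "TW": 5, "SW1": 1}
B = {"HP1": 5, "HP2": 5, "HP3": 4}
C = {"TW": 2, "SW1": 4, "SW2": 5}

raw_ac = cosine_similarity(A, C)
rounded_ac = cosine_similarity(round_ratings(A), round_ratings(C))
```

On the raw ratings A and C look somewhat similar even though they disagree on every movie they share; after rounding they have no item in common and their cosine drops to 0.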

Measuring Similarity (continued)
Normalizing the ratings: If we normalize ratings by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers.
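A minimal sketch of this normalization on one user's sparse ratings (hypothetical values). Unrated items stay absent and are not turned into zeros before the mean is taken.

```python
def normalize(ratings):
    """Subtract the user's average rating from each of that user's ratings."""
    if not ratings:
        return {}
    mean = sum(ratings.values()) / len(ratings)
    return {item: r - mean for item, r in ratings.items()}


# Hypothetical ratings on a 1-5 scale; the mean is 10/3.
A = {"HP1": 4, "TW": 5, "SW1": 1}
A_normalized = normalize(A)
```

After normalization, users with opposite opinions about the same movies get vectors pointing in opposite directions, so a cosine measure separates them cleanly.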

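User-Based Similarity
For example:
$$\mathrm{Sim}(\mathrm{User}_1, \mathrm{User}_2) = \frac{\sum_j \left| \mathrm{rating}(\mathrm{User}_1, \mathrm{Item}_j) - \mathrm{rating}(\mathrm{User}_2, \mathrm{Item}_j) \right|}{\text{number of items}}$$
Because this averages absolute rating differences, smaller values mean the two users rate items more alike.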
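A sketch of this measure. The slide does not say how unrated items are handled, so this version assumes the sum runs only over items both users have rated and divides by the number of such items. The ratings are hypothetical.

```python
def mean_abs_difference(a, b):
    """Average absolute rating difference over the items both users rated."""
    common = [i for i in a if i in b]
    if not common:
        return None  # no co-rated items, so the measure is undefined
    return sum(abs(a[i] - b[i]) for i in common) / len(common)


# Hypothetical ratings.
user1 = {"A": 9, "B": 3, "Z": 5}
user2 = {"A": 10, "B": 4, "C": 8, "Z": 1}
difference = mean_abs_difference(user1, user2)
```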

Problems with Collaborative Filtering
• Cold Start: There needs to be enough other users already in the system to find a match.
• Sparsity: If there are many items to be recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items.
• First Rater: Cannot recommend an item that has not been previously rated.
    • New items
    • Esoteric items
• Popularity Bias: Cannot recommend items to someone with unique tastes.
    • Tends to recommend popular items.

Content-Based Recommending
• Recommendations are based on information on the content of items rather than on other users’ opinions.
• Uses a machine learning algorithm to induce a profile of the user’s preferences.
• Some previous applications:
    • NewsWeeder (Lang, 1995)
    • Syskill and Webert (Pazzani et al., 1996)

Item Profile
In a content-based system, we must construct for each item a profile, which is a record or collection of records representing important characteristics of that item. For a movie, for example, relevant features include:
1. The set of actors of the movie. Some viewers prefer movies with their favorite actors.
2. The director. Some viewers have a preference for the work of certain directors.
3. The year in which the movie was made. Some viewers prefer old movies; others watch only the latest releases.
4. The genre or general type of movie. Some viewers like only comedies, others dramas or romances.
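One way to turn such features into a profile vector, sketched under the assumption that every feature value (including the year) is treated as a separate 0/1 component. The movie records and field names are hypothetical.

```python
def movie_features(movie):
    """Collect the movie's features as prefixed strings, e.g. 'genre:drama'."""
    feats = {f"actor:{a}" for a in movie["actors"]}
    feats.add(f"director:{movie['director']}")
    feats.add(f"year:{movie['year']}")
    feats.add(f"genre:{movie['genre']}")
    return feats


def build_vocabulary(movies):
    """Sorted list of every feature seen in the collection."""
    return sorted(set().union(*(movie_features(m) for m in movies)))


def item_profile(movie, vocabulary):
    """0/1 vector with one component per feature in the vocabulary."""
    feats = movie_features(movie)
    return [1 if f in feats else 0 for f in vocabulary]


# Hypothetical movie records.
movies = [
    {"title": "Movie A", "actors": ["Actor X", "Actor Y"],
     "director": "Director P", "year": 1999, "genre": "comedy"},
    {"title": "Movie B", "actors": ["Actor X"],
     "director": "Director Q", "year": 2010, "genre": "drama"},
]
vocabulary = build_vocabulary(movies)
profiles = {m["title"]: item_profile(m, vocabulary) for m in movies}
```

The prefixes keep an actor and a director with the same name from collapsing into one component. Treating the year as a category is a simplification; a numeric component would need scaling so it does not dominate the others.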

Discovering Features of Documents
There are other classes of items where it is not immediately apparent what the values of features should be. We shall consider two of them: document collections and images.
For documents, the features are the words that characterize each document:
• First, eliminate stop words: the several hundred most common words, which tend to say little about the topic of a document.
• For the remaining words, compute the TF.IDF score for each word in the document.
• The ones with the highest scores are the words that characterize the document.
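The three steps can be sketched as below. The slides do not define TF.IDF, so this assumes one common variant: term frequency divided by the count of the most frequent term in the document, times log2(N / number of documents containing the word). The stop-word list and the documents are tiny illustrative stand-ins.

```python
from collections import Counter
from math import log2

# Tiny illustrative stop-word list; a real one has several hundred entries.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tf_idf(documents):
    """Return one {word: TF.IDF score} dictionary per document."""
    counts = [Counter(w for w in doc.lower().split() if w not in STOP_WORDS)
              for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(w for c in counts for w in c)
    scores = []
    for c in counts:
        max_f = max(c.values(), default=1)
        scores.append({w: (f / max_f) * log2(n_docs / doc_freq[w])
                       for w, f in c.items()})
    return scores


def top_words(scores, n=3):
    """The n highest-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


# Hypothetical documents.
docs = [
    "the robot learns the rules of chess",
    "the robot explores mars",
    "a history of chess",
]
scores = tf_idf(docs)
```

Words that appear in several documents ("robot", "chess") score lower than words unique to one document, which is what makes the top-scoring words characteristic of that document.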

Obtaining Item Features From Tags
• The problem with images is that their data, typically an array of pixels, does not tell us anything useful about their features.
• There have been a number of attempts to obtain information about features of items by inviting users to tag the items by entering words or phrases that describe the item.

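User Profiles
We need to create vectors with the same components as the item profiles that describe the user’s preferences. The utility matrix represents the connection between users and items, so a user’s profile can be estimated by aggregating the profiles of the items that user has rated.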
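A minimal sketch of building a user profile, assuming a 0/1 utility matrix and taking the user's vector to be the average of the profiles of the items rated 1. Averaging is one plausible aggregation, not the only one; the profiles and ratings are hypothetical.

```python
def user_profile(ratings, item_profiles):
    """Average the profile vectors of the items the user rated 1."""
    liked = [item_profiles[i] for i, r in ratings.items()
             if r == 1 and i in item_profiles]
    if not liked:
        return []
    # zip(*liked) walks the vectors component by component.
    return [sum(column) / len(liked) for column in zip(*liked)]


# Hypothetical item profiles over three features, and one user's ratings.
item_profiles = {
    "m1": [1, 0, 1],
    "m2": [1, 1, 0],
    "m3": [0, 1, 1],
}
ratings = {"m1": 1, "m2": 1}
profile = user_profile(ratings, item_profiles)
```

Each component of the result is the fraction of the user's liked items that have that feature.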

Recommending Items to Users Based on Content
With profile vectors for both users and items, we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user’s and item’s vectors. A small distance (a cosine close to 1) indicates a likely preference.
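A sketch of ranking candidate items for one user, taking the cosine distance to be the angle between the two vectors. The vectors are hypothetical.

```python
from math import acos, degrees, sqrt


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Angle between the vectors in degrees; smaller means more similar."""
    c = max(-1.0, min(1.0, cosine(u, v)))  # guard against rounding error
    return degrees(acos(c))


def rank_items(user_vector, item_profiles):
    """Item names ordered from smallest to largest cosine distance."""
    return sorted(item_profiles,
                  key=lambda i: cosine_distance(user_vector, item_profiles[i]))


# Hypothetical user profile and candidate item profiles.
user_vector = [1.0, 0.5, 0.5]
candidates = {"m3": [0, 1, 1], "m4": [1, 0, 0], "m5": [0, 0, 1]}
ranking = rank_items(user_vector, candidates)
```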

Advantages of Content-Based Approach
• No need for data on other users.
    • No cold-start or sparsity problems.
• Able to recommend to users with unique tastes.
• Able to recommend new and unpopular items.
    • No first-rater problem.
• Can provide explanations of recommended items by listing content-features that caused an item to be recommended.

Disadvantages of Content-Based Method
• Requires content that can be encoded as meaningful features.
• Users’ tastes must be represented as a learnable function of these content features.
• Unable to exploit quality judgments of other users.
    • Unless these are somehow included in the content features.

LIBRA
Learning Intelligent Book Recommending Agent
• Content-based recommender for books using information about titles extracted from Amazon.
• Uses information extraction from the web to organize text into fields:
    • Author
    • Title
    • Editorial Reviews
    • Customer Comments
    • Subject terms
    • Related authors
    • Related titles

LIBRA System
[Figure: LIBRA architecture. Information Extraction turns Amazon Pages into the LIBRA Database; a Machine Learning Learner builds a User Profile from Rated Examples; a Predictor uses the profile to produce a ranked list of Recommendations.]

Sample Amazon Page
[Figure: Amazon page for The Age of Spiritual Machines.]

Sample Extracted Information
Title: <The Age of Spiritual Machines: When Computers Exceed Human Intelligence>
Author: <Ray Kurzweil>
Price: <11.96>
Publication Date: <January 2000>
ISBN: <0140282025>
Related Titles: <Title: <Robot: Mere Machine or Transcendent Mind>
Author: <Hans Moravec> >
…
Reviews: <Author: <Amazon.com Reviews> Text: <How much do we humans…> >
…
Comments: <Stars: <4> Author: <Stephen A. Haines> Text:<Kurzweil has …> >
…
Related Authors: <Hans P. Moravec> <K. Eric Drexler>…
Subjects: <Science/Mathematics> <Computers> <Artificial Intelligence> …

LIBRA Content Information
• LIBRA uses this extracted information to form “bags of words” for the following slots:
    • Author
    • Title
    • Description (reviews and comments)
    • Subjects
    • Related Titles
    • Related Authors

Implementation
• Stopwords removed from all bags.
• A book’s title and author are added to its own related title and related author slots.
• All probabilities are smoothed using Laplace estimation to account for small sample size.
• Lisp implementation is quite efficient:
    • Training: 20 examples in 0.4 seconds, 840 examples in 11.5 seconds
    • Test: 200 books per second
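A small illustration of Laplace (add-one) smoothing. The slides do not say which probabilities are smoothed, so this sketch assumes word probabilities estimated from a single bag of words; the bag and vocabulary are hypothetical, and this is not LIBRA's actual code.

```python
from collections import Counter


def laplace_probabilities(bag, vocabulary):
    """Estimate P(word) from a bag of words with add-one smoothing.

    Each word's count is increased by 1 and the total by the vocabulary
    size, so unseen words get a small non-zero probability.
    """
    counts = Counter(bag)
    total = sum(counts[w] for w in vocabulary)
    denominator = total + len(vocabulary)
    return {w: (counts[w] + 1) / denominator for w in vocabulary}


# Hypothetical bag of words for one slot, and a three-word vocabulary.
bag = ["robot", "robot", "mind"]
vocabulary = ["garden", "mind", "robot"]
probabilities = laplace_probabilities(bag, vocabulary)
```

Without smoothing, "garden" would get probability 0 from this small sample, and any product of probabilities that includes it would collapse to 0.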

Experimental Data
• Amazon searches were used to find books in various genres.
• Titles that have at least one review or comment were kept.
• Data sets:
    • Literature fiction: 3,061 titles
    • Mystery: 7,285 titles
    • Science: 3,813 titles
    • Science Fiction: 3,813 titles

Rated Data
• 4 users rated random examples within a genre by reviewing the Amazon pages about the title:
    • LIT1: 936 titles
    • LIT2: 935 titles
    • MYST: 500 titles
    • SCI: 500 titles
    • SF: 500 titles

Combining Content and Collaboration
• Content-based and collaborative methods have complementary strengths and weaknesses.
• Combine methods to obtain the best of both.
• Various hybrid approaches:
    • Apply both methods and combine recommendations.
    • Use collaborative data as content.
    • Use content-based predictor as another collaborator.
    • Use content-based predictor to complete collaborative data.

Content-Boosted
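[Figure: content-boosted pipeline. A Web Crawler collects IMDb data into a Movie Content Database; a Content-based Predictor uses it with the sparse EachMovie User Ratings Matrix to produce a Full User Ratings Matrix; Collaborative Filtering combines the full matrix with the Active User Ratings to produce Recommendations.]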

Content-Boosted CF - I
[Figure: the User-rated Items in a User-ratings Vector serve as Training Examples for the Content-Based Predictor, which rates the Unrated Items; the actual ratings and the Items with Predicted Ratings together form the Pseudo User-ratings Vector.]
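A sketch of how a pseudo user-ratings vector could be assembled: actual ratings are kept and every unrated item is filled in by a content-based predictor. The predictor here is a toy stand-in (the user's average rating within the same genre), not the learned predictor the slides refer to; the genres and ratings are hypothetical.

```python
def make_genre_predictor(user_ratings, genres):
    """Toy content-based predictor: the user's mean rating in the item's genre."""
    by_genre = {}
    for item, r in user_ratings.items():
        by_genre.setdefault(genres[item], []).append(r)
    overall = sum(user_ratings.values()) / len(user_ratings)

    def predict(item):
        seen = by_genre.get(genres[item])
        # Fall back to the user's overall mean for an unseen genre.
        return sum(seen) / len(seen) if seen else overall

    return predict


def pseudo_ratings(user_ratings, all_items, content_predictor):
    """Keep actual ratings; fill unrated items with predicted ratings."""
    return {item: user_ratings[item] if item in user_ratings
            else content_predictor(item)
            for item in all_items}


# Hypothetical genres and ratings.
genres = {"m1": "scifi", "m2": "scifi", "m3": "drama",
          "m4": "scifi", "m5": "comedy"}
user_ratings = {"m1": 5, "m2": 4, "m3": 2}
predictor = make_genre_predictor(user_ratings, genres)
pseudo = pseudo_ratings(user_ratings, sorted(genres), predictor)
```

Doing this for every user yields a dense matrix, so collaborative filtering no longer depends on users having rated the same items.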

Conclusions
• Recommending and personalization are important approaches to combating information overload.
• Machine Learning is an important part of systems for these tasks.
• Collaborative filtering has problems (cold start, sparsity, first rater, popularity bias).
• Content-based methods address these problems (but have problems of their own).
• Integrating both is best.
