1. UNIT 2: CONTENT-BASED RECOMMENDATION SYSTEMS
CONTENTS:
1. HIGH LEVEL ARCHITECTURE OF CONTENT BASED SYSTEMS
2. ITEM PROFILES
3. REPRESENTING ITEM PROFILES
4. METHODS FOR LEARNING USER PROFILES
5. SIMILARITY BASED RETRIEVAL
6. CLASSIFICATION ALGORITHMS
3. ITEM PROFILES
• Item Profile: In a content-based recommender,
we must build a profile for each item that
represents the important characteristics of
that item. For example, if the item is a movie,
its actors, director, release year, and genre
are among its most significant features.
4. • A profile is a set of features.
• E.g., books: title, author, publisher, etc.
• A vector is used to represent the item profile.
• The components of this vector are Boolean or real-valued.
• For a text document:
Profile = set of key (important) words
5. TF-IDF
(Term Frequency × Inverse Document Frequency)
• TF-IDF is a text-mining technique used to identify the keywords in a text document.
• Formula for the TF-IDF score of term i in document j:
$w_{ij} = TF_{ij} \times IDF_i$
$TF_{ij} = f_{ij} / \max_k f_{kj}$
$IDF_i = \log(N / n_i)$
where:
$f_{ij}$ = frequency of term (feature) i in document (item) j
$\max_k f_{kj}$ = highest frequency of any term in document j
$n_i$ = number of documents that mention term i
$N$ = total number of documents
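The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tf_idf`, the toy documents, and the choice of the natural logarithm are assumptions (the slide does not state the base of the log), and the documents are assumed to be tokenized already.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (tokenization is assumed to be done already).
    """
    n_docs = len(docs)  # N
    # n_i: number of documents that mention term i
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)                        # f_ij for this document j
        max_count = max(counts.values(), default=1)  # max_k f_kj
        weights.append({
            # w_ij = TF_ij * IDF_i, using the natural log (an assumption)
            term: (f / max_count) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    ["space", "alien", "space", "ship"],
    ["love", "story", "alien"],
    ["love", "love", "wedding"],
]
scores = tf_idf(docs)
```

In this toy corpus, "space" is the most frequent term of the first document and appears in no other document, so it receives the highest weight there; "alien" appears in two of the three documents and is weighted lower.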
6. REPRESENTING ITEM PROFILES
• Our ultimate goal for content-based
recommendation is to create both an item
profile consisting of feature-value pairs and a
user profile summarizing the preferences of
the user, based on their row of the utility
matrix.
Utility Matrix
Subject   Raja   Guna
RS        4      3
PSPP      4      4
7. • An item profile can be constructed as a vector
of 0’s and 1’s, where a 1 represents the
occurrence of a high-TF.IDF (term frequency-inverse
document frequency) word in the
document.
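As a rough sketch of this idea, the snippet below turns one document's TF.IDF scores into a 0/1 vector over a fixed vocabulary. The function name `boolean_profile`, the `top_n` cutoff for what counts as a "high" score, and the score values are all hypothetical choices made for illustration; the scores are assumed to have been computed beforehand.

```python
def boolean_profile(tfidf_scores, vocabulary, top_n=2):
    """Return a 0/1 list aligned with `vocabulary`.

    tfidf_scores: {word: TF.IDF score} for one document (assumed precomputed).
    top_n: how many top-scoring words count as "high" (an assumed cutoff).
    """
    ranked = sorted(tfidf_scores, key=tfidf_scores.get, reverse=True)
    keywords = set(ranked[:top_n])
    return [1 if word in keywords else 0 for word in vocabulary]


# Hypothetical vocabulary and scores, for illustration only.
vocabulary = ["alien", "love", "ship", "space", "wedding"]
doc_scores = {"space": 1.10, "ship": 0.55, "alien": 0.20}

profile = boolean_profile(doc_scores, vocabulary, top_n=2)
print(profile)  # [0, 0, 1, 1, 0]
```

A fixed score threshold could be used instead of a top-n cutoff; either way, only the presence of a keyword is recorded, not its score.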
8. • For example, if one feature of movies is the
set of actors, then imagine that there is a
component for each actor, with 1 if the actor
is in the movie, and 0 if not. Likewise, we can
have a component for each possible director,
and each possible genre.
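The encoding described above might look like the following sketch, with one component per known actor, director, and genre. The function name `encode_movie`, the dictionary layout of a movie, and the placeholder names are assumptions made for illustration.

```python
def encode_movie(movie, actors, directors, genres):
    """Concatenate 0/1 components for every known actor, director, and genre."""
    return (
        [1 if a in movie["actors"] else 0 for a in actors]
        + [1 if d == movie["director"] else 0 for d in directors]
        + [1 if g == movie["genre"] else 0 for g in genres]
    )


# Hypothetical catalogue of possible feature values.
actors = ["Actor A", "Actor B", "Actor C"]
directors = ["Director X", "Director Y"]
genres = ["comedy", "drama"]

movie = {
    "actors": {"Actor A", "Actor C"},
    "director": "Director Y",
    "genre": "drama",
}
vector = encode_movie(movie, actors, directors, genres)
print(vector)  # [1, 0, 1, 0, 1, 0, 1]
```

The first three components correspond to the actors, the next two to the directors, and the last two to the genres, so every movie maps to a vector of the same length.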
9. • All these features can be represented using
only 0’s and 1’s. There is another class of
features that is not readily represented by
Boolean vectors: those features that are
numerical. For instance, we might take the
average rating for movies to be a feature,
and this average is a real number.
10. • It does not make sense to have one
component for each of the possible average
ratings, and doing so would cause us to lose
the structure implicit in numbers. That is, two
ratings that are close but not identical should
be considered more similar than widely
differing ratings.
11. • Likewise, numerical features of products, such
as screen size or disk capacity for PCs, should
be considered similar if their values do not
differ greatly. Numerical features should be
represented by single components of vectors
representing items. These components hold
the exact value of that feature.
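The difference between the two encodings can be illustrated with a small sketch. It appends the average rating to a Boolean vector as a single real-valued component, and compares that with a one-component-per-rating encoding. The helper names, the sample ratings, and the use of squared Euclidean distance as the comparison measure are assumptions made for illustration.

```python
def with_rating(boolean_part, avg_rating):
    """Append the average rating as one real-valued component."""
    return boolean_part + [float(avg_rating)]


def one_hot(rating, levels):
    """One 0/1 component per possible rating (the encoding argued against)."""
    return [1 if rating == level else 0 for level in levels]


def squared_distance(u, v):
    """Squared Euclidean distance; an assumed measure for this comparison."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


flags = [1, 0, 1]  # hypothetical Boolean features shared by all three items
a = with_rating(flags, 4.0)
b = with_rating(flags, 4.1)
c = with_rating(flags, 1.5)

# Single numeric component: the close ratings are nearer than the distant ones.
print(squared_distance(a, b) < squared_distance(a, c))  # True

# One component per rating: every pair of different ratings is equally far apart.
levels = [1.5, 4.0, 4.1]
d_close = squared_distance(one_hot(4.0, levels), one_hot(4.1, levels))
d_far = squared_distance(one_hot(4.0, levels), one_hot(1.5, levels))
print(d_close == d_far)  # True
```

With the single numeric component, ratings of 4.0 and 4.1 end up closer than 4.0 and 1.5; with one component per rating, that structure is lost.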