movieRecommendation_FinalReport

Movie Recommendation System
Komal Khattar Mohit Juneja Nupur Kale Sohini Sarkar
College of Information Studies College of Information Studies College of Information Studies College of Information Studies
University of Maryland,
College Park
College Park
College Park
College Park
kkhattar@umd.edu mjuneja@umd.edu nkale@umd.edu ssarkar1@umd.edu
ABSTRACT
Recommendation Systems have become increasingly
important in e-commerce due to the large number of
choices that consumer face. If you have used services
like Netflix, IMDb, or Amazon, you would be aware
of their personalized recommendations suggesting
movies to watch or items to buy. In general, these
systems take a set of input such as user profiles or a
set of movie ratings, identify the similarities among
the input and finally pass the similar pairs for
prediction calculation. In this project, we will identify
the user demographics (such as occupation, age,
gender, etc.) and the movie-related parameters (like
type of genres), using different types of classifiers,
which can be useful in determining the ratings given
by the viewers to the movies they have watched and
then, go on to predict the ratings for the users for the
movies they have not watched. This will be done using
collaborative filtering technique, which is one of the
most promising approaches for building the
recommendation model. Lastly, we will recommend
movies to users using this model.
I. INTRODUCTION
Recommendation Systems are one of the important
applications of the e-commerce industry due to the
overwhelming amount of information available on the
Internet. This consequently, makes it impossible for
the consumers to explore and compare every possible
product. A movie recommendation system (sometimes
referred to as a recommender engine) predicts what
movie a user may like among a list of given items [5].
Such a recommendation engine generally use the
following techniques to make predictions:
• Content-Based Systems: These systems analyze
the properties of the items that a user likes to
determine what else the user may like [6]. For
instance, if an IMDb user has watched many
crime movies, then recommend a movie from the
“crime” genre. Such systems rely solely on the
content that the user itself accesses, and not on
the behavior of other users in the system.
• Collaborative Filtering Systems: It relies on the
likes and dislikes of other users, and recommends
items based on the similarity measures between
items
and/or users. The recommended items are
essentially drawn from those preferred by similar
users. Thus, this system can be constructed from
the behavior of other users who have similar traits
[6].
Figure 1: Similarities and differences used in Collaborative
Filtering
For instance, if an IMDb user has liked ‘The
Shawshank Redemption’, then ‘The Godfather’
and ‘Batman: The Dark Knight’ are also
recommended, because people who liked ‘The
Shawshank Redemption’ also liked ‘The
Godfather’ and ‘Batman: The Dark Knight’.
• Hybrids: The hybrid approaches combine
content-based and collaborative filtering to build
a much more robust recommendation systems.
Incorporating both the methods creates the
potential for a more accurate recommendation
[6].
II. DATA PREPARATION
In this project, we will be using the MovieLens
dataset, collected by the GroupLens Research Project
at the University of Minnesota. This data set consists
of 100,000 ratings (1-5) from 943 users on 1682
movies, wherein each user has rated at least 20
movies. The complete dataset consists of userData,
movieData, genreData, trainData and testData. The
userData contains the demographic information for the

users like user id, age, gender, occupation and zip
code. The movieData contains information about the
movies including the movie id, movie title, release
date, video release date, IMDb URL and genre of the
movie. The genreData consists of a list of the genres.
The complete dataset of 100,000 ratings has been split
into training set ‘trainData’ and test set ‘testData’ in
such a way that the test set consists of exactly 10
ratings per user in the test set. A full training dataset,
named as ‘fullTrain’, has been prepared using
trainData, userData and movieData, which consist of
all the available user-related (demographic info) and
movie-related information (title, genre, release date,
IMDb URL, etc.). Likewise, A full test dataset, named
as ‘fullTest’, has been prepared using testData,
userData and movieData. Additionally, a
‘unifiedMovieLensData’ has been prepared wherein
the genre field in the dataset consists of "multiple" as
the genre, if the movie has more than one genre.
Another dataset ‘unifiedMovieLensDataMultiple’ has
been created, which consists of multiple rows for
movies with two or more genres. This means that this
dataset contains duplicate combinations of user id and
movie name.

III. EXPLORATORY DATA ANALYSIS AND
FINDINGS
In this section, our aim is to explore the MovieLens
dataset for trends with movie preferences. It is to be
noted that the user needs to run the script
dataClean.R, to generate cleaned data sets,
unifiedMovieLensData.csv and
unifiedMovieLensDataMultiple.csv, on which the
exploratory data analysis has been done. The R
libraries used for the analysis include ggplot2,
RColorBrewer, plyr and grid. On investigating the
general features in our dataset, we determine that a
majority of the users have age between 20-30, and
also, there are a significant number of users in the late
forties.
Figure 2: Histogram Plot for Analysis of User Age
Next, we will investigate the user with respect to
profession within the dataset in order to determine
how different profession tends to rate the movies.
Figure 3: Bar Chart Plot for User with respect to Profession
(Gender biased)
It is evident from the above plot that a majority of
users are student, while there are very few doctors and
homemakers. It is probably difficult to say anything
about the minority groups with much confidence.
Interestingly, males make up for most of our dataset
and professions like engineer, scientist, executive and
entertainment are completely male dominated.
Figure 4: Violin Plot for Average Rating with respect to
Profession
Lastly, the different professions do not seem to rate
the movies evenly with the health care workers
having a very low average rating as compared to other
professions and executives giving very low movie
ratings at times.
Our next analysis involved determining the release
date of the movies in our dataset, followed by
computing the total number of movies of each genre;
first, with a specific genre counted single times, and

then with a specific genre counted multiple times for
multi-genre movies.

Figure 5: Histogram Plot for Release Date of the Movies
The plot shows that most movies in our dataset are
from the 1990's. On further investing the genre of
these movies, we found that a larger percentage of the
movies are multi-genre and there are a very few
number of movies with pure fantasy/pure film-
noir/pure animation/ pure adventure genre.
Figure 6a: Bar Chart Plot for Movie Genre
Next, we plotted the total number of movies with a
specific genre counted multiple times for multi-genre
movies and found that documentaries no longer seem
to be a high count genre (as compared to the previous
plot).
Figure 6b: Bar Chart Plot for Movie Genre (With specific
genre counted multiple times for multi-genre movies)
It can also be interpreted that majority of the movies
that belong to the Documentaries genre, typically do
not have other genre associated with them. It is also
worth noting that movies with animation genre are no
longer a small number.
IV. DIFFERENT CLASSIFIER ALGORITHM
IMPLEMENTATIONS
Conditional Inference Trees: Conditional inference
trees classifier is a tree-based classifier used for
recursive partitioning of response variables in a
conditional inference framework. This class of tree
classifier can be applied to all kinds of problems,
including nominal, ordinal, numeric and multivariate
response variables. The package party in R provides
the c-tree function and allows recursive partitioning
[3]. Recursive partitioning is considered as a basic tool
in data mining. It helps to explore the structure of
dataset, and helps to develop decision rules for
predicting a categorical or continuous output [4].
Rpart is also a tree classifier that performs recursive
partitioning and univariate split but we preferred c-tree
as conditional inference tree are considered as biased
free predictor selection classifiers. C-tree uses a
covariate selection scheme based on permutation
based significance test while the Rpart has a selection
bias towards covariates which allow many possible
splits or have many missing values, or the ones that
maximizes an information measure [7].
For our dataset, we split our training dataset in the
ratio of 80:20 and considered the 80 portion as the
train subset and 20 as the test subset. We then used
seven unique features or variables combination and
predicted the accuracy using the c-tree function.
Feature Combination Accuracy Percentage
Age + Gender + Occupation
+ Genre 0.3628
Age + Occupation + Genre 0.3574
Age + Gender + Genre 0.3489
Gender + Occupation +
Genre 0.3498
Gender + Genre 0.3499
Occupation + Genre 0.3483
Age + Genre 0.3462
Note: Genre = Action+ Adventure + Animation +
Children + Comedy + Crime + Documentary + Drama
+ Fantasy + Film_Noir + Horror + Musical + Mystery
+ Romance +Sci_Fi + Thriller + War + Western
The feature age, gender, occupation and genre
received the highest accuracy of 0.3628. On plotting a
tree for this feature combination we get the below tree.

Figure 7: Random-Forest Plot
N represents the total count of ratings for that node
and y represents the base probability for each user
rating from 1 to 5.
Random Forest: Random forests are regarded as an
ensemble method used for classification and
regression. Random forest classifier uses multiple
decision trees, in order to improve the classification or
accuracy rate. They are implemented in R using
the randomForest package. The way this classifier
works is that it induces additional randomness by
sampling and averaging which diversifies the trees,
resulting in increased search area and noise profile for
better accuracy in prediction.
Based on random samples of variable, Random forests
generate large number of bootstrapped trees, trained at
different parts of the training data and then classify to
predict the final outcome by combining the results
across all the trees of the forest. This process of
bootstrapping, aggregating or averaging helps to
increase the stability, accuracy of the classifier. Trees
that grow deep, or are grown for large complex dataset
tend to produce irregular patterns and cause overfitting
in training sets which results in low bias and high
variance. In such cases the approach of averaging
multiple deep decision trees in random forests helps to
reduce the variance and boosts the performance of the
final model [2]. Furthermore, as many samples are
selected in a process this classifier provides the
measure of importance of each variable in the model
and helps in variable selection for models built on
datasets having numerous predictor variables.
On the 80/20 % split data, we used seven unique
features or variables combination and predicted the
accuracy using the random forest function.
Feature Combination Accuracy Percentage
Age + Gender + Occupation
+ Genre 0.3662
Age + Occupation + Genre 0.3676
Age + Gender + Genre 0.3524
Gender + Occupation +
Genre 0.3541
Gender + Genre 0.3491
Occupation + Genre 0.3553
Age + Genre 0.3518
Note: Genre = Action+ Adventure + Animation +
Children + Comedy + Crime + Documentary + Drama
+ Fantasy + Film_Noir + Horror + Musical + Mystery
+ Romance + Sci_Fi + Thriller + War + Western
The feature age, occupation and genre received the
highest accuracy of 0.3676. We also plotted a graph
for error rate over number of trees. Here the numbers
of trees considered are 100.
Figure 8: Error rate over Number of Trees
The graph displays different color lines indicating
error rate for different user rating (1-5). The black line
indicates the overall out-of-bag error or means
prediction error, which is 63.24 %. From the graph it
can be observed that the error rate for user rating 4
decreases with the increase in number of trees. In
other words it can be said that the accuracy for
predicting user rating 4 increases with the increase in
number of trees. It also proves that more sampling,
more averaging of the trees in random forest result in
higher accuracy in prediction.
Naïve Bayes: The Naïve Bayes classifier algorithm is
based on Bayesian theorem, which assumes
independence between the different features. The
basic idea behind, Bayesian classifier is that if an
agent knows the class it can predict the values of the
other different features, else it uses Bayes rule to
predict the class given the feature values. One of the
major areas of application for Naïve Bayes classifier is

for text analytics. On applying the Naïve Bayes
classifier model for the movie dataset we had a
maximum accuracy of 31.32 % for the feature
combination of age, occupation, genre.
k-NN Algorithm: This algorithm basically stores all
the available cases and classifies the new cases based
on a similarity measure (e.g.. Distance functions like
Euclidean distance). The cases are classified by
calculating a majority vote of its neighbors. The
following are the basic steps for the k-NN algorithm:
• To compute the distances between the new sample
and all previous samples, that has already been
classified into clusters.
• To sort the distances in increasing order and select
the k samples with the smallest distance values
• To apply the voting principle.
On applying the k-NN classifier model for the movie
dataset we had a maximum accuracy of 36.36 % for
the feature combination of age, gender, occupation,
and genre.
K-Means Clustering: This is a method of cluster
analysis in data mining. K-means clustering aims to
partition n observations into k clusters in which each
observation belongs to the cluster with the nearest
mean, serving as a prototype of the cluster. K-means is
a type of item based classification technique in our
case. To carry out k-means clustering for our dataset,
we created a subset of the movie dataset, which
consists of all the movies from the dataset and the
information about their genre. Using this information,
we created clusters based on genre similarity of our
movie dataset. The clusters are formed based on
characteristic features of the movies.
Figure 9: Plot of within/between ratio against k
The first step in carrying out k-means clustering is
choosing the number of clusters to separate the
movies. For this, we chose to apply the elbow method
to decide the number of clusters for our clustering data
mining. We needed to choose the number of clusters
in such a way that we minimize the within/between
ratio.
We plotted the graph of the number of clusters against
the within/between ratio for these clusters. We
observed the ratio monotonically decreases. The
elbow occurred at k = 10. After that even an increase
of the ratio occurred probably because of randomness.
Hence, we chose k = 10 to be the number of clusters.
From our data, a smaller cluster number would have
higher values around itself; so it maybe gives low
values because of randomness and in another scenario
could give higher ratio.
Figure 10: Cluster plot of movies after k-means
After running the k-means clustering function in R on
our dataset, we derived 10 clusters. Each cluster has
movies that fall under similar genre.
Multinomial Logistic Regression: Multinomial
Logistic Regression is the linear regression analysis to
conduct when the dependent variable is nominal with
more than two levels. Thus it is an extension of
logistic regression, which analyzes dichotomous
(binary) dependents. Multinomial logistic regression is
a type of predictive analysis regression method.
We computed coefficients of multinomial regression
for the model age + occupation + genre for predicting
ratings of a movie. We got the following regression
coefficients:
(Intercept) -0.3407205
age 0.03921094
occupationartist 0.1029380
occupationdoctor 0.2490433
occupationeducator -0.17566946
occupationengineer 0.03247879
occupationentertainment -0.28149450
occupationexecutive -1.136975
occupationhealthcare -2.2304297
occupationhomemaker -0.70603732
occupationnone 0.20311825
occupationlawyer -0.1625073
occupationlibrarian -0.25623202

occupationmarketing 1.2161332
occupationother 0.1225862
occupationprogrammer 0.04265024
occupationretired -1.4250644
occupationsalesman -0.1073120
occupationscientist -0.08942207
occupationstudent 0.2849286
occupationtechnician 0.0133927
occupationwriter -0.45672898
Action -0.190309482
Adventure 0.3373544
Animation 1.0798400
Children -0.6189434
Comedy -0.20248454
Crime 0.19747754
Documentary 0.3946927
Drama 0.7341593
Fantasy -0.4277778
Film_Noir 1.6413046
Horror -0.37720373
Musical 0.2171512
Mystery 0.3038321
Romance 0.27989829
Sci_Fi Thriller 0.1479345
War 0.76651665
Western 0.6193790
We got the maximum accuracy for this model of
logistic regression, which was 0.3576.
Random Forest Classifier on Test Data: The feature
combination ‘Age + Occupation + Genre’ which gives
the highest accuracy of 0.3676 for Random Forest
Classifier is applied on test data. We also plotted a
graph for error rate over number of trees. Here, the
number of trees considered is 500. From the graph it
can be observed that the error rate for user rating 4
decreases with the increase in number of trees. So here
also it can be said that the accuracy for predicting user
rating 4 increases with the increase in number of trees.
Figure 11: Error rate over Number of Trees for Test Data
To predict the accuracy we calculated the value of
Mean Absolute Error (MAE) and Root Mean Squared
Error (RMSE). We got RMSE= 1.158 and MAE=
0.826. The MAE and the RMSE values can be used to
analyze the variation in the errors in a set of forecasts.
The greater the difference between them, the greater
the variance will be in the errors present in the sample.
As in this case their low difference and lower values
support that the model predicts the rating with high
accuracy.
V. RECOMMENDATION SYSTEM
Predicting ratings and creating personalized
recommendations is something that almost every
recommendation system does. The approach that
recommendation systems use can broadly be classified
into two categories.
• Content-based approach
• Collaborative filtering approach
Content based approach is based on the idea that if we
can understand the preference structure of a customer
(user) concerning product (movie) attributes then we
can recommend movies which rank high for the user’s
most desirable attributes. However, for our
recommendation system we have used collaborative
filtering approach (recommenderlab package in R).
The basic idea of collaborative filtering being given
rating data by many users for multiple movies, one can
predict a user’s rating for a movie that he/she has
never watched. As a result, helping to create a
recommendation list of top – N movies based on the
predicted ratings. For our project we are using the R
extension package recommenderlab.
While designing our recommendation system we had
the dataset that consisted details related to ratings
provided by many users for many movies as the basis
for predicting missing ratings. That is ; we have a set
of users U = {u1, u2, . . . , um} and a set of items I =
{i1, i2, . . . , in}. Ratings are stored in a m × n user-
item rating matrix R = (rjk) where each row represent
a user uj with 1 ≥ j ≥ m and columns represent items ik
with 1 ≥ k ≥ n. rjk represents the rating of user uj for
item ik. Typically only a small fraction of ratings are
known and for many cells in R the values are missing.
Predicting the missing ratings on a scale of 1 – 5 (as
was in the train data), is more of a regression problem
that is being solved by the recommendation system.
The next step involves, creating the top N
recommendation list based on all the predicted ratings.
In theory, while dealing with large datasets predicting
ratings for each and every user-movie pair becomes
computationally expensive. Thus there are rule based
approached that predict s the top N recommendation
items directly.
Collaborative filtering approaches can be broadly
divided into two groups:
• Memory – based collaborative filtering
• Model – based collaborative filtering

In memory based collaborative filtering the whole user
dataset is used to create recommendation. The most
common example of memory based collaborative
filtering is the user based collaborative filtering
algorithm. In user based collaborative filtering, we
essentially assume that individuals with similar
preference will rate items similarly. In this approach
the missing rating for users is predicted by first finding
a neighborhood of similar users and then aggregate the
ratings of these users to compute a prediction. The
neighborhood is defined using the similarity score for
different users (calculated using cosine similarity),
consisting of most similar user or users having
similarity score greater than a given threshold. Thus,
to summarize the neighborhood for an active user can
be selected by either a threshold on the similarity or by
considering k nearest neighbor.
VI. CONCLUSION
Out of all the classifier built, Random forest classifier
gives the highest prediction accuracy. Seven feature
combinations were used to test the accuracy of each
classifier built. The feature combination ‘Age +
Occupation + Genre’ gives the highest accuracy of
0.3676. RMSE and MAE were calculated to predict
the accuracy of the classifiers. When the above feature
combination was applied in full test data the RMSE
value was equal to 1.172 and MAE value was 0.849.
Similarly, when the RMSE and MAE value was
calculated for the recommendation system built using
unsupervised learning, the value of RMSE was equal
to 1.06 and MAE was equal to 0.76. As the difference
between the two in less for the case of the
recommender system, we can say the accuracy
increases with collaborative filtering. Even the
limitations of the dataset cannot be neglected. All the
classifiers built give a very low accuracy percentage,
which could be because the data had more user details,
as compared to the movie details. If movie details such
as movie director, actor, and duration were included in
the data frame, probably the prediction of accuracy
would have been higher. Moreover, Having more
movie details can make the data suitable to build a
recommendation system using item-based
collaborative filtering, which is a more sophisticated
approach.
REFERENCES
1. https://cran.r-
project.org/web/packages/recommenderlab/vignett
es/recommenderlab.pdf
2. Bhalla, D. (2015). Random Forests Explained in
Simple Terms. Retrieved from:
http://www.listendata.com/2014/11/random-
forest-with-r.html
3. Hothorn, T., Hornik, K., Zeileis, A. (2014). ctree:
Conditional Inference Trees. Retrieved from:
https://cran.r-
project.org/web/packages/partykit/vignettes/ctree.
pdf
4. Hothorn, T., Hornik, K., Zeileis, A. (2006).
Unbiased Recursive Partitioning: A Conditional
Inference Framework. Retrieved from:
http://eeecon.uibk.ac.at/~zeileis/papers/Hothorn+
Hornik+Zeileis-2006.pdf
5. Jones, M. (2013, December 12). Introduction to
approaches and algorithms. Retrieved from
http://www.ibm.com/developerworks/library/os-
recommender1/index.html
6. Marafi, S. (2014, April 26). Collaborative
Filtering with R. Retrieved from:
http://www.salemmarafi.com/code/collaborative-
filtering-r/
7. Ridwan, M. (n.d.). Predicting Likes: Inside A
Simple Recommendation Engine's Algorithms.
Retrieved from:
http://www.toptal.com/algorithms/predicting-
likes-inside-a-simple-recommendation-engine
8. Wolf, R. (2011). Conditional inference tress vs
traditional decision trees. Retrieved from:
http://stats.stackexchange.com/questions/12140/co
nditional-inference-trees-vs-traditional-decision-
trees
9. Wikipedia. Random Forest. Retrieved from:
https://en.wikipedia.org/wiki/Random_forest
10. http://rstudio-pubs-
static.s3.amazonaws.com/9893_4cc5f31ec224446
d89c5865936c8afee.html
11. http://www.statisticssolutions.com/mlr/

movieRecommendation_FinalReport

Recommended

Recommended

More Related Content

Similar to movieRecommendation_FinalReport

Similar to movieRecommendation_FinalReport (20)

movieRecommendation_FinalReport