Movie Recommendation System
Komal Khattar, Mohit Juneja, Nupur Kale, Sohini Sarkar
College of Information Studies, University of Maryland, College Park
kkhattar@umd.edu, mjuneja@umd.edu, nkale@umd.edu, ssarkar1@umd.edu
ABSTRACT
Recommendation systems have become increasingly
important in e-commerce due to the large number of
choices consumers face. Anyone who has used services
like Netflix, IMDb, or Amazon will be familiar with
personalized recommendations suggesting movies to
watch or items to buy. In general, these systems take
inputs such as user profiles or sets of movie ratings,
identify similarities among them, and pass the similar
pairs on for prediction. In this project, we use user
demographics (such as occupation, age, and gender) and
movie-related parameters (such as genre) with different
types of classifiers to model the ratings viewers give
to the movies they have watched, and then predict the
ratings users would give to movies they have not
watched. This is done using collaborative filtering,
one of the most promising approaches for building
recommendation models. Lastly, we recommend movies to
users using this model.
I. INTRODUCTION
Recommendation systems are one of the important
applications in the e-commerce industry because of the
overwhelming amount of information available on the
Internet, which makes it impossible for consumers to
explore and compare every possible product. A movie
recommendation system (sometimes referred to as a
recommender engine) predicts which movies a user may
like among a list of given items [5]. Such an engine
generally uses one of the following techniques to make
predictions:
• Content-Based Systems: These systems analyze the
properties of the items a user likes to determine
what else the user may enjoy [6]. For instance, if
an IMDb user has watched many crime movies, the
system recommends another movie from the “crime”
genre. Such systems rely solely on the user's own
viewing history, not on the behavior of other users
in the system.
• Collaborative Filtering Systems: These systems rely
on the likes and dislikes of other users and
recommend items based on similarity measures
between items and/or users. The recommended items
are essentially drawn from those preferred by
similar users; thus, such a system can be
constructed from the behavior of other users with
similar traits [6].
Figure 1: Similarities and differences used in Collaborative
Filtering
For instance, if an IMDb user has liked ‘The
Shawshank Redemption’, then ‘The Godfather’
and ‘Batman: The Dark Knight’ are also
recommended, because people who liked ‘The
Shawshank Redemption’ also liked ‘The
Godfather’ and ‘Batman: The Dark Knight’.
• Hybrids: Hybrid approaches combine content-based
and collaborative filtering to build a more robust
recommendation system. Incorporating both methods
creates the potential for more accurate
recommendations [6].
II. DATA PREPARATION
In this project, we will be using the MovieLens
dataset, collected by the GroupLens Research Project
at the University of Minnesota. This data set consists
of 100,000 ratings (1-5) from 943 users on 1682
movies, wherein each user has rated at least 20
movies. The complete dataset consists of userData,
movieData, genreData, trainData and testData. userData
contains demographic information for each user: user
id, age, gender, occupation and zip code. movieData
contains information about each movie: movie id, title,
release date, video release date, IMDb URL and genre.
genreData is the list of genres.
The complete set of 100,000 ratings has been split into
a training set 'trainData' and a test set 'testData'
such that the test set contains exactly 10 ratings per
user. A full training dataset, 'fullTrain', has been
prepared by joining trainData, userData and movieData,
and contains all the available user-related
(demographic) and movie-related (title, genre, release
date, IMDb URL, etc.) information. Likewise, a full
test dataset, 'fullTest', has been prepared from
testData, userData and movieData. Additionally, a
dataset 'unifiedMovieLensData' has been prepared in
which the genre field is set to "multiple" when a movie
has more than one genre. Another dataset,
'unifiedMovieLensDataMultiple', contains one row per
genre for movies with two or more genres; it therefore
contains duplicate combinations of user id and movie
name.
  
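The joins that produce a fullTrain-style table can be sketched as follows. This is a minimal Python illustration with made-up rows and hypothetical column names; the report's actual preparation is done in R by dataClean.R.

```python
# Sketch of building a "fullTrain"-style table by joining ratings with
# user demographics and movie info. All ids, rows, and column names
# below are hypothetical, for illustration only.
ratings = [  # (user_id, movie_id, rating)
    (1, 10, 4), (1, 11, 3), (2, 10, 5),
]
users = {1: {"age": 24, "gender": "M", "occupation": "student"},
         2: {"age": 53, "gender": "F", "occupation": "writer"}}
movies = {10: {"title": "Toy Story (1995)", "genre": "Animation"},
          11: {"title": "GoldenEye (1995)", "genre": "Action"}}

# One joined row per rating: rating fields plus the matching user
# demographics and movie attributes.
full_train = [
    {"user_id": u, "movie_id": m, "rating": r, **users[u], **movies[m]}
    for (u, m, r) in ratings
]
print(full_train[0]["occupation"], full_train[0]["title"])
```

The same join, repeated with testData in place of trainData, yields the fullTest analogue.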
III. EXPLORATORY DATA ANALYSIS AND
FINDINGS
In this section, our aim is to explore the MovieLens
dataset for trends in movie preferences. Note that the
user needs to run the script dataClean.R to generate
the cleaned data sets, unifiedMovieLensData.csv and
unifiedMovieLensDataMultiple.csv, on which the
exploratory data analysis was done. The R libraries
used for the analysis include ggplot2, RColorBrewer,
plyr and grid. Investigating the general features of
our dataset, we find that a majority of users are
between 20 and 30 years old, with another significant
group in their late forties.
Figure 2: Histogram Plot for Analysis of User Age
Next, we examine users by profession in order to
determine how different professions tend to rate the
movies.
Figure 3: Bar Chart Plot for User with respect to Profession
(Gender biased)
It is evident from the above plot that a majority of
users are students, while there are very few doctors
and homemakers; it is difficult to say much about these
minority groups with confidence. Interestingly, males
make up most of our dataset, and professions like
engineer, scientist, executive and entertainment are
completely male dominated.
Figure 4: Violin Plot for Average Rating with respect to
Profession
Lastly, the different professions do not rate the
movies evenly: health care workers have a very low
average rating compared to other professions, and
executives at times give very low ratings.
Our next analysis looked at the release dates of the
movies in our dataset, followed by a count of the
movies in each genre: first with each multi-genre movie
counted once, and then with such movies counted once
for each of their genres.
Figure 5: Histogram Plot for Release Date of the Movies
The plot shows that most movies in our dataset are from
the 1990s. On further investigating the genres of these
movies, we found that a large percentage of the movies
are multi-genre, and that there are very few movies
that are purely fantasy, film-noir, animation or
adventure.
Figure 6a: Bar Chart Plot for Movie Genre
Next, we plotted the total number of movies with a
specific genre counted multiple times for multi-genre
movies and found that documentaries no longer seem
to be a high count genre (as compared to the previous
plot).
Figure 6b: Bar Chart Plot for Movie Genre (With specific
genre counted multiple times for multi-genre movies)
It can also be seen that most movies in the Documentary
genre typically have no other genre associated with
them. It is also worth noting that the Animation genre
is no longer a small category.
IV. DIFFERENT CLASSIFIER ALGORITHM
IMPLEMENTATIONS
Conditional Inference Trees: The conditional inference
tree classifier is a tree-based classifier that
performs recursive partitioning of response variables
within a conditional inference framework. It can be
applied to many kinds of problems, including nominal,
ordinal, numeric and multivariate response variables.
The party package in R provides the ctree function for
this kind of recursive partitioning [3]. Recursive
partitioning is a basic tool in data mining: it helps
to explore the structure of a dataset and to develop
decision rules for predicting a categorical or
continuous output [4]. rpart is another tree classifier
that performs recursive partitioning with univariate
splits, but we preferred ctree because conditional
inference trees are considered unbiased in predictor
selection. ctree uses a covariate selection scheme
based on permutation-based significance tests, whereas
rpart has a selection bias towards covariates that
allow many possible splits, have many missing values,
or maximize an information measure [7].
For our dataset, we split the training data 80:20,
treating the 80 % portion as the training subset and
the 20 % portion as the test subset. We then evaluated
seven feature combinations and computed the accuracy of
each using the ctree function.
Feature Combination                  Accuracy
Age + Gender + Occupation + Genre    0.3628
Age + Occupation + Genre             0.3574
Age + Gender + Genre                 0.3489
Gender + Occupation + Genre          0.3498
Gender + Genre                       0.3499
Occupation + Genre                   0.3483
Age + Genre                          0.3462
Note: Genre = Action + Adventure + Animation +
Children + Comedy + Crime + Documentary + Drama +
Fantasy + Film_Noir + Horror + Musical + Mystery +
Romance + Sci_Fi + Thriller + War + Western
The combination of age, gender, occupation and genre
achieved the highest accuracy, 0.3628. Plotting the
tree for this feature combination gives the tree below.
Figure 7: Conditional Inference Tree Plot
In the plot, n represents the total count of ratings at
that node and y the estimated probability of each user
rating from 1 to 5.
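The 80/20 evaluation procedure itself is generic and can be sketched as follows. This Python illustration uses synthetic data and a simple majority-class predictor as a stand-in for the ctree model that the report fits in R; only the split-and-score mechanics are the point here.

```python
import random
from collections import Counter

random.seed(42)
# Hypothetical labeled data: (features, rating) pairs with random ratings.
data = [({"age": random.randint(18, 60)}, random.randint(1, 5))
        for _ in range(1000)]

random.shuffle(data)
cut = int(0.8 * len(data))            # 80/20 split, as in the report
train, test = data[:cut], data[cut:]

# Stand-in "classifier": always predict the most common training rating.
# (The report fits a conditional inference tree here instead.)
majority = Counter(r for _, r in train).most_common(1)[0][0]

# Accuracy = fraction of held-out ratings predicted exactly.
accuracy = sum(1 for _, r in test if r == majority) / len(test)
print(round(accuracy, 4))
```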
Random Forest: Random forests are an ensemble method
used for classification and regression. A random forest
classifier combines multiple decision trees in order to
improve classification accuracy, and is implemented in
R by the randomForest package. The classifier injects
additional randomness through bootstrap sampling and
random feature selection, which diversifies the trees,
and then averages their predictions for better
accuracy.
Random forests generate a large number of bootstrapped
trees from random samples of the variables, train them
on different parts of the training data, and predict
the final outcome by combining the results across all
the trees of the forest. This process of bootstrapping
and aggregating (averaging) increases the stability and
accuracy of the classifier. Trees that grow deep, or
that are grown on large complex datasets, tend to fit
irregular patterns and overfit their training sets,
giving low bias but high variance. Averaging many deep
decision trees in a random forest reduces this variance
and boosts the performance of the final model [2].
Furthermore, because many samples are drawn during the
process, the classifier provides a measure of the
importance of each variable in the model, which aids
variable selection for models built on datasets with
numerous predictor variables.
On the same 80/20 split, we evaluated the same seven
feature combinations and computed the accuracy of each
using the random forest function.
Feature Combination                  Accuracy
Age + Gender + Occupation + Genre    0.3662
Age + Occupation + Genre             0.3676
Age + Gender + Genre                 0.3524
Gender + Occupation + Genre          0.3541
Gender + Genre                       0.3491
Occupation + Genre                   0.3553
Age + Genre                          0.3518
Note: Genre = Action + Adventure + Animation +
Children + Comedy + Crime + Documentary + Drama +
Fantasy + Film_Noir + Horror + Musical + Mystery +
Romance + Sci_Fi + Thriller + War + Western
The combination of age, occupation and genre achieved
the highest accuracy, 0.3676. We also plotted the error
rate against the number of trees; here 100 trees were
used.
Figure 8: Error rate over Number of Trees
The graph displays colored lines indicating the error
rate for each user rating (1-5). The black line
indicates the overall out-of-bag (OOB) error, i.e. the
mean prediction error, which is 63.24 %. The graph
shows that the error rate for user rating 4 decreases
as the number of trees increases; in other words, the
accuracy of predicting rating 4 improves with more
trees. This is consistent with the idea that more
sampling and more averaging across the trees of a
random forest yields higher prediction accuracy.
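The bootstrap-and-vote idea can be sketched as follows. In this Python toy, each "tree" is reduced to the majority rating of its bootstrap resample (a deliberate stand-in, not a real decision tree); the forest then aggregates by majority vote across trees, which is the stabilizing mechanism described above.

```python
import random
from collections import Counter

random.seed(0)

def bootstrap(sample, rng):
    """Draw a bootstrap resample (with replacement) of the training set."""
    return [rng.choice(sample) for _ in sample]

# Hypothetical training ratings, skewed towards 4.
train = [random.choice([3, 4, 4, 4, 5]) for _ in range(200)]

# Each "tree" is a stand-in predictor: the majority rating of its resample.
rng = random.Random(1)
trees = [Counter(bootstrap(train, rng)).most_common(1)[0][0]
         for _ in range(100)]

# Aggregation step: the forest predicts by majority vote across trees.
prediction = Counter(trees).most_common(1)[0][0]
print(prediction)
```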
Naïve Bayes: The Naïve Bayes classifier is based on
Bayes' theorem and assumes independence between the
different features. The basic idea behind a Bayesian
classifier is that if an agent knows the class, it can
predict the values of the other features; otherwise, it
uses Bayes' rule to predict the class given the feature
values. One of the major areas of application for the
Naïve Bayes classifier is text analytics. Applying a
Naïve Bayes model to the movie dataset, we obtained a
maximum accuracy of 31.32 % for the feature combination
of age, occupation and genre.
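The underlying use of Bayes' rule can be illustrated in miniature with hypothetical probabilities: the posterior of each rating is proportional to its prior times the likelihood of the observed feature, under the independence assumption.

```python
# Naive Bayes in miniature: P(rating | feature) is proportional to
# P(rating) * P(feature | rating). All probabilities are hypothetical.
priors = {4: 0.35, 5: 0.20}                 # P(rating)
likelihood = {                              # P(occupation=student | rating)
    4: {"student": 0.30},
    5: {"student": 0.45},
}

def score(rating, occupation):
    return priors[rating] * likelihood[rating][occupation]

# Unnormalized scores, then normalize into a posterior distribution.
scores = {r: score(r, "student") for r in priors}
total = sum(scores.values())
posterior = {r: s / total for r, s in scores.items()}

best = max(posterior, key=posterior.get)    # predicted rating
print(best, round(posterior[best], 3))
```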
k-NN Algorithm: This algorithm stores all the available
cases and classifies new cases based on a similarity
measure (e.g., a distance function such as Euclidean
distance). A case is classified by a majority vote of
its neighbors. The basic steps of the k-NN algorithm
are:
• Compute the distances between the new sample and
all previously classified samples.
• Sort the distances in increasing order and select
the k samples with the smallest distance values.
• Apply the voting principle.
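The three steps can be sketched in Python (made-up numeric samples; the report's classifiers were built in R):

```python
import math
from collections import Counter

# Labeled samples: (feature vector, rating). Values are hypothetical,
# e.g. (age, some binary indicator).
samples = [((25, 0), 4), ((30, 0), 4), ((50, 1), 2),
           ((45, 1), 2), ((28, 0), 5)]
new_point = (27, 0)
k = 3

# Step 1: distances from the new sample to all classified samples.
dists = [(math.dist(new_point, x), rating) for x, rating in samples]
# Step 2: sort ascending and keep the k nearest.
nearest = sorted(dists)[:k]
# Step 3: majority vote among the k neighbours.
prediction = Counter(r for _, r in nearest).most_common(1)[0][0]
print(prediction)
```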
On applying the k-NN classifier model for the movie
dataset we had a maximum accuracy of 36.36 % for
the feature combination of age, gender, occupation,
and genre.
K-Means Clustering: This is a method of cluster
analysis in data mining. K-means clustering aims to
partition n observations into k clusters in which each
observation belongs to the cluster with the nearest
mean, which serves as a prototype of the cluster. In
our case, k-means acts as an item-based technique. To
carry out k-means clustering on our dataset, we created
a subset of the movie dataset consisting of all the
movies and their genre information. Using this
information, we created clusters based on the genre
similarity of the movies; the clusters are thus formed
from the characteristic features of the movies.
Figure 9: Plot of within/between ratio against k
The first step in k-means clustering is choosing the
number of clusters into which to separate the movies.
For this, we applied the elbow method, choosing the
number of clusters so as to minimize the
within/between sum-of-squares ratio. We plotted the
number of clusters against this ratio and observed that
it decreases roughly monotonically, with the elbow
occurring at k = 10; beyond that point the ratio
occasionally increased, most likely due to the
randomness of centroid initialization. Hence, we chose
k = 10 as the number of clusters. Because the
initialization is random, individual runs at small k
can produce somewhat lower or higher ratios from run to
run.
Figure 10: Cluster plot of movies after k-means
After running the k-means clustering function in R on
our dataset, we derived 10 clusters, each containing
movies of similar genres.
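The within/between ratio used for the elbow plot can be sketched as follows, with hypothetical one-dimensional data and fixed cluster assignments: the within-cluster sum of squares measures spread around each centroid, the between-cluster sum of squares measures spread of the centroids around the grand mean.

```python
# Within/between ratio for a clustering, as used in the elbow plot.
# Data and cluster assignments are hypothetical 1-D values.
clusters = {0: [1.0, 1.2, 0.8], 1: [5.0, 5.5, 4.5]}

def mean(xs):
    return sum(xs) / len(xs)

all_points = [x for xs in clusters.values() for x in xs]
grand = mean(all_points)

# Within-cluster sum of squares: spread of points around their centroid.
wss = sum((x - mean(xs)) ** 2
          for xs in clusters.values() for x in xs)
# Between-cluster sum of squares: spread of centroids around the grand mean.
bss = sum(len(xs) * (mean(xs) - grand) ** 2
          for xs in clusters.values())

ratio = wss / bss   # smaller means tighter, better-separated clusters
print(round(ratio, 4))
```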
Multinomial Logistic Regression: Multinomial logistic
regression is the regression analysis to conduct when
the dependent variable is nominal with more than two
levels. It is thus an extension of logistic regression,
which analyzes dichotomous (binary) dependent
variables, and is a type of predictive regression
method.
We computed coefficients of multinomial regression
for the model age + occupation + genre for predicting
ratings of a movie. We got the following regression
coefficients:
(Intercept)              -0.3407205
age                       0.03921094
occupationartist          0.1029380
occupationdoctor          0.2490433
occupationeducator       -0.17566946
occupationengineer        0.03247879
occupationentertainment  -0.28149450
occupationexecutive      -1.136975
occupationhealthcare     -2.2304297
occupationhomemaker      -0.70603732
occupationnone            0.20311825
occupationlawyer         -0.1625073
occupationlibrarian      -0.25623202
occupationmarketing       1.2161332
occupationother           0.1225862
occupationprogrammer      0.04265024
occupationretired        -1.4250644
occupationsalesman       -0.1073120
occupationscientist      -0.08942207
occupationstudent         0.2849286
occupationtechnician      0.0133927
occupationwriter         -0.45672898
Action                   -0.190309482
Adventure                 0.3373544
Animation                 1.0798400
Children                 -0.6189434
Comedy                   -0.20248454
Crime                     0.19747754
Documentary               0.3946927
Drama                     0.7341593
Fantasy                  -0.4277778
Film_Noir                 1.6413046
Horror                   -0.37720373
Musical                   0.2171512
Mystery                   0.3038321
Romance                   0.27989829
Sci_Fi Thriller           0.1479345
War                       0.76651665
Western                   0.6193790
The maximum accuracy obtained for this logistic
regression model was 0.3576.
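How such coefficients turn into a predicted rating can be sketched with hypothetical numbers: multinomial logistic regression computes one linear score per non-baseline rating level, converts the scores to probabilities with a softmax, and predicts the level with the highest probability. All coefficients below are made up; the report fit the actual model in R.

```python
import math

# Hypothetical coefficients per non-baseline rating level:
# (intercept, coefficient for age, coefficient for a Drama indicator).
coef = {
    2: (-0.3, 0.01, 0.2),
    3: (0.1, 0.02, 0.5),
    4: (0.4, 0.03, 0.7),
    5: (-0.2, 0.02, 0.9),
}
age, drama = 30, 1

scores = {1: 0.0}  # baseline rating level has linear score 0
for level, (b0, b_age, b_drama) in coef.items():
    scores[level] = b0 + b_age * age + b_drama * drama

# Softmax (shifted by the max score for numerical stability).
z = max(scores.values())
exp = {lvl: math.exp(s - z) for lvl, s in scores.items()}
total = sum(exp.values())
probs = {lvl: e / total for lvl, e in exp.items()}

predicted = max(probs, key=probs.get)
print(predicted)
```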
Random Forest Classifier on Test Data: The feature
combination 'Age + Occupation + Genre', which gave the
highest accuracy (0.3676) for the random forest
classifier, was applied to the test data. We again
plotted the error rate against the number of trees;
here 500 trees were used. The graph shows that the
error rate for user rating 4 decreases as the number of
trees increases, so here too the accuracy of predicting
rating 4 improves with more trees.
Figure 11: Error rate over Number of Trees for Test Data
To assess prediction quality we calculated the Mean
Absolute Error (MAE) and Root Mean Squared Error
(RMSE), obtaining RMSE = 1.158 and MAE = 0.826.
Comparing MAE and RMSE indicates the variation in the
errors of a set of forecasts: the greater the
difference between them, the greater the variance of
the individual errors in the sample. The small
difference and relatively low values here suggest the
model predicts ratings reasonably well.
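The two error measures can be computed directly (ratings below are hypothetical):

```python
import math

# MAE and RMSE over predicted vs. actual ratings (values hypothetical).
actual    = [4, 3, 5, 2, 4]
predicted = [3.5, 3.0, 4.0, 3.0, 4.5]

errors = [p - a for p, a in zip(predicted, actual)]
mae  = sum(abs(e) for e in errors) / len(errors)          # mean |error|
rmse = math.sqrt(sum(e * e for e in errors) / len(errors)) # root mean sq.
print(round(mae, 3), round(rmse, 3))
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE, and the gap between them grows with the variance of the individual errors, which is what the comparison above exploits.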
V. RECOMMENDATION SYSTEM
Predicting ratings and creating personalized
recommendations is something that almost every
recommendation system does. The approach that
recommendation systems use can broadly be classified
into two categories.
• Content-based approach
• Collaborative filtering approach
The content-based approach rests on the idea that if we
can understand the preference structure of a customer
(user) with respect to product (movie) attributes, then
we can recommend movies that rank high on the user's
most desirable attributes. For our recommendation
system, however, we used the collaborative filtering
approach, via the recommenderlab package in R. The
basic idea of collaborative filtering is that, given
rating data from many users for many movies, one can
predict a user's rating for a movie he or she has never
watched, and from the predicted ratings construct a
top-N recommendation list.
While designing our recommendation system, we had a
dataset of ratings provided by many users for many
movies as the basis for predicting the missing ratings.
That is, we have a set of users U = {u1, u2, . . . ,
um} and a set of items I = {i1, i2, . . . , in}.
Ratings are stored in an m × n user-item rating matrix
R = (rjk), where each row represents a user uj with
1 ≤ j ≤ m and each column represents an item ik with
1 ≤ k ≤ n; rjk is the rating of user uj for item ik.
Typically only a small fraction of the ratings are
known, and for many cells in R the values are missing.
Predicting the missing ratings on a scale of 1-5 (as in
the training data) is essentially a regression problem
solved by the recommendation system. The next step is
to create the top-N recommendation list from all the
predicted ratings. With large datasets, predicting a
rating for every user-movie pair becomes
computationally expensive, so there are also rule-based
approaches that predict the top-N items directly.
Collaborative filtering approaches can be broadly
divided into two groups:
• Memory – based collaborative filtering
• Model – based collaborative filtering
In memory-based collaborative filtering, the whole user
dataset is used to create recommendations. The most
common example is the user-based collaborative
filtering algorithm, which assumes that individuals
with similar preferences will rate items similarly. In
this approach, a user's missing ratings are predicted
by first finding a neighborhood of similar users and
then aggregating the ratings of those users into a
prediction. The neighborhood is defined using a
similarity score between users (calculated here using
cosine similarity) and consists of the most similar
users, or of all users whose similarity score exceeds a
given threshold. To summarize, the neighborhood for an
active user can be selected either by a threshold on
the similarity or by taking the k nearest neighbors.
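The neighborhood-based prediction can be sketched as follows: a minimal Python illustration with a hypothetical rating matrix, cosine similarity over co-rated items, and a similarity-weighted average of the neighbours' ratings. recommenderlab implements this (and much more) in R; this sketch only shows the core idea.

```python
import math

# user -> {movie: rating}; missing keys mean unrated. Values hypothetical.
ratings = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "u3": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def cosine(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = math.sqrt(sum(a[i] ** 2 for i in common)) * \
          math.sqrt(sum(b[i] ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, movie):
    """Similarity-weighted average of neighbours' ratings for `movie`."""
    neigh = [(cosine(ratings[active], r), r[movie])
             for u, r in ratings.items() if u != active and movie in r]
    return sum(s * x for s, x in neigh) / sum(s for s, _ in neigh)

print(round(predict("u1", "m4"), 3))
```

Here u1's tastes track u2's, so u2's high rating of m4 dominates the weighted average and the predicted rating lands well above the midpoint.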
VI. CONCLUSION
Of all the classifiers built, the random forest
classifier gives the highest prediction accuracy. Seven
feature combinations were used to test the accuracy of
each classifier, and the combination 'Age + Occupation
+ Genre' gave the highest accuracy, 0.3676. RMSE and
MAE were calculated to assess the accuracy of the
classifiers: when this feature combination was applied
to the full test data, the RMSE was 1.172 and the MAE
was 0.849. When RMSE and MAE were calculated for the
recommendation system built using unsupervised
learning, the RMSE was 1.06 and the MAE was 0.76. As
the difference between the two is smaller for the
recommender system, we can say that accuracy improves
with collaborative filtering. The limitations of the
dataset cannot be neglected, however. All of the
classifiers give a very low accuracy, which may be
because the data had more user detail than movie
detail. If movie details such as director, actors, and
duration were included in the data frame, prediction
accuracy would probably have been higher. Moreover,
richer movie details would make the data suitable for
building a recommendation system using item-based
collaborative filtering, a more sophisticated approach.
REFERENCES
1. https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
2. Bhalla, D. (2015). Random Forests Explained in Simple Terms. Retrieved from: http://www.listendata.com/2014/11/random-forest-with-r.html
3. Hothorn, T., Hornik, K., Zeileis, A. (2014). ctree: Conditional Inference Trees. Retrieved from: https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf
4. Hothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Retrieved from: http://eeecon.uibk.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
5. Jones, M. (2013, December 12). Introduction to approaches and algorithms. Retrieved from: http://www.ibm.com/developerworks/library/os-recommender1/index.html
6. Marafi, S. (2014, April 26). Collaborative Filtering with R. Retrieved from: http://www.salemmarafi.com/code/collaborative-filtering-r/
7. Ridwan, M. (n.d.). Predicting Likes: Inside A Simple Recommendation Engine's Algorithms. Retrieved from: http://www.toptal.com/algorithms/predicting-likes-inside-a-simple-recommendation-engine
8. Wolf, R. (2011). Conditional inference trees vs traditional decision trees. Retrieved from: http://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
9. Wikipedia. Random Forest. Retrieved from: https://en.wikipedia.org/wiki/Random_forest
10. http://rstudio-pubs-static.s3.amazonaws.com/9893_4cc5f31ec224446d89c5865936c8afee.html
11. http://www.statisticssolutions.com/mlr/

Such a recommendation engine generally uses the following techniques to make predictions:

• Content-Based Systems: These systems analyze the properties of the items that a user likes to determine what else the user may like [6]. For instance, if an IMDb user has watched many crime movies, the system recommends another movie from the “crime” genre. Such systems rely solely on the content that the user accesses, not on the behavior of other users in the system.

• Collaborative Filtering Systems: These systems rely on the likes and dislikes of other users, and recommend items based on similarity measures between items and/or users. The recommended items are essentially drawn from those preferred by similar users; thus, the system can be constructed from the behavior of other users who have similar traits [6].

Figure 1: Similarities and differences used in Collaborative Filtering

For instance, if an IMDb user has liked ‘The Shawshank Redemption’, then ‘The Godfather’ and ‘Batman: The Dark Knight’ are also recommended, because people who liked ‘The Shawshank Redemption’ also liked those two movies.

• Hybrids: Hybrid approaches combine content-based and collaborative filtering to build a much more robust recommendation system. Incorporating both methods creates the potential for a more accurate recommendation [6].

II. DATA PREPARATION

In this project, we use the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota. This dataset consists of 100,000 ratings (1-5) from 943 users on 1,682 movies, where each user has rated at least 20 movies. The complete dataset consists of userData, movieData, genreData, trainData and testData. The userData contains the demographic information for the
users: user id, age, gender, occupation and zip code. The movieData contains information about the movies, including movie id, title, release date, video release date, IMDb URL and genre. The genreData consists of a list of the genres.

The complete dataset of 100,000 ratings has been split into a training set ‘trainData’ and a test set ‘testData’, such that the test set contains exactly 10 ratings per user. A full training dataset, named ‘fullTrain’, has been prepared from trainData, userData and movieData, and consists of all the available user-related (demographic) and movie-related (title, genre, release date, IMDb URL, etc.) information. Likewise, a full test dataset, named ‘fullTest’, has been prepared from testData, userData and movieData. Additionally, a ‘unifiedMovieLensData’ dataset has been prepared, in which the genre field is set to "multiple" if the movie has more than one genre. Another dataset, ‘unifiedMovieLensDataMultiple’, contains multiple rows for movies with two or more genres, which means that it contains duplicate combinations of user id and movie name.

III. EXPLORATORY DATA ANALYSIS AND FINDINGS

In this section, our aim is to explore the MovieLens dataset for trends in movie preferences. Note that the user needs to run the script dataClean.R to generate the cleaned data sets, unifiedMovieLensData.csv and unifiedMovieLensDataMultiple.csv, on which the exploratory data analysis has been done. The R libraries used for the analysis include ggplot2, RColorBrewer, plyr and grid.

On investigating the general features of our dataset, we find that a majority of the users are aged between 20 and 30, and that there is also a significant number of users in their late forties.
Figure 2: Histogram Plot for Analysis of User Age

Next, we investigate the users with respect to profession, in order to determine how different professions tend to rate the movies.

Figure 3: Bar Chart Plot for User with respect to Profession (Gender biased)

It is evident from the above plot that a majority of users are students, while there are very few doctors and homemakers; it is therefore difficult to say anything about these minority groups with much confidence. Interestingly, males make up most of our dataset, and professions like engineer, scientist, executive and entertainment are completely male dominated.

Figure 4: Violin Plot for Average Rating with respect to Profession

Lastly, the different professions do not seem to rate the movies evenly: health care workers have a very low average rating compared to other professions, and executives at times give very low movie ratings.

Our next analysis involved determining the release dates of the movies in our dataset, followed by computing the total number of movies of each genre; first, with a specific genre counted a single time, and
then with a specific genre counted multiple times for multi-genre movies.

Figure 5: Histogram Plot for Release Date of the Movies

The plot shows that most movies in our dataset are from the 1990s. On further investigating the genre of these movies, we found that a large percentage of the movies are multi-genre, and that there are very few movies with a pure fantasy, pure film-noir, pure animation or pure adventure genre.

Figure 6a: Bar Chart Plot for Movie Genre

Next, we plotted the total number of movies with a specific genre counted multiple times for multi-genre movies and found that documentaries no longer seem to be a high-count genre (as compared to the previous plot).

Figure 6b: Bar Chart Plot for Movie Genre (With specific genre counted multiple times for multi-genre movies)

It can also be interpreted that the majority of movies belonging to the documentary genre typically do not have another genre associated with them. It is also worth noting that movies with the animation genre are no longer a small number.

IV. DIFFERENT CLASSIFIER ALGORITHM IMPLEMENTATIONS

Conditional Inference Trees: The conditional inference trees classifier is a tree-based classifier used for recursive partitioning of response variables in a conditional inference framework. This class of tree classifier can be applied to all kinds of problems, including nominal, ordinal, numeric and multivariate response variables. The package party in R provides the ctree function and allows recursive partitioning [3]. Recursive partitioning is considered a basic tool in data mining: it helps to explore the structure of a dataset and to develop decision rules for predicting a categorical or continuous output [4]. Rpart is also a tree classifier that performs recursive partitioning with univariate splits, but we preferred ctree because conditional inference trees are considered bias-free in their predictor selection.
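To make the idea of recursive partitioning concrete, the following minimal Python sketch (entirely synthetic data; not the ctree algorithm, which scores splits with permutation-based significance tests rather than accuracy) chooses the single best binary split of one numeric feature:

```python
from collections import Counter

def best_split(xs, ys):
    """Return (threshold, accuracy) of the best binary split of xs w.r.t. ys.

    Each candidate threshold t partitions the samples into x <= t and x > t;
    the split is scored by majority-class accuracy over both sides.
    """
    best = (None, 0.0)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        correct = sum(Counter(p).most_common(1)[0][1] for p in (left, right) if p)
        acc = correct / len(ys)
        if acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical toy data: user ages against a binary "liked the movie" label.
ages = [18, 22, 25, 33, 41, 47, 52]
liked = [1, 1, 1, 0, 0, 0, 0]
print(best_split(ages, liked))  # -> (25, 1.0)
```

A full tree would apply this step recursively to each side until a stopping rule holds; ctree's contribution is replacing the split-scoring criterion so that covariates with many possible splits are not favored.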
ctree uses a covariate selection scheme based on permutation significance tests, whereas rpart has a selection bias towards covariates which allow many possible splits, have many missing values, or maximize an information measure [7].

For our dataset, we split the training dataset in the ratio 80:20 and used the 80% portion as the train subset and the 20% portion as the test subset. We then used seven unique feature (variable) combinations and measured the accuracy of the ctree function:

Feature Combination                    Accuracy
Age + Gender + Occupation + Genre      0.3628
Age + Occupation + Genre               0.3574
Age + Gender + Genre                   0.3489
Gender + Occupation + Genre            0.3498
Gender + Genre                         0.3499
Occupation + Genre                     0.3483
Age + Genre                            0.3462

Note: Genre = Action + Adventure + Animation + Children + Comedy + Crime + Documentary + Drama + Fantasy + Film_Noir + Horror + Musical + Mystery + Romance + Sci_Fi + Thriller + War + Western

The feature combination age, gender, occupation and genre achieved the highest accuracy, 0.3628. Plotting a tree for this feature combination gives the tree below.
Figure 7: Random-Forest Plot

N represents the total count of ratings for that node and y represents the base probability for each user rating from 1 to 5.

Random Forest: Random forests are an ensemble method used for classification and regression. The random forest classifier uses multiple decision trees in order to improve the classification accuracy, and is implemented in R using the randomForest package. The classifier induces additional randomness by sampling and averaging, which diversifies the trees, resulting in an increased search area and a better noise profile for more accurate prediction. Based on random samples of the variables, random forests generate a large number of bootstrapped trees, trained on different parts of the training data, and then classify a new observation by combining the results across all the trees of the forest. This process of bootstrapping and aggregating (averaging) helps to increase the stability and accuracy of the classifier.

Trees that grow deep, or are grown on large, complex datasets, tend to produce irregular patterns and overfit their training sets, resulting in low bias and high variance. In such cases the random forest approach of averaging multiple deep decision trees helps to reduce the variance and boosts the performance of the final model [2]. Furthermore, as many samples are drawn during the process, this classifier provides a measure of the importance of each variable in the model, which helps with variable selection for models built on datasets with numerous predictor variables.

On the 80/20 split data, we used the same seven feature combinations and measured the accuracy of the random forest function.
Feature Combination                    Accuracy
Age + Gender + Occupation + Genre      0.3662
Age + Occupation + Genre               0.3676
Age + Gender + Genre                   0.3524
Gender + Occupation + Genre            0.3541
Gender + Genre                         0.3491
Occupation + Genre                     0.3553
Age + Genre                            0.3518

Note: Genre = Action + Adventure + Animation + Children + Comedy + Crime + Documentary + Drama + Fantasy + Film_Noir + Horror + Musical + Mystery + Romance + Sci_Fi + Thriller + War + Western

The feature combination age, occupation and genre achieved the highest accuracy, 0.3676. We also plotted the error rate against the number of trees; here, the number of trees considered is 100.

Figure 8: Error rate over Number of Trees

The graph displays colored lines indicating the error rate for each user rating (1-5). The black line indicates the overall out-of-bag error, or mean prediction error, which is 63.24%. From the graph it can be observed that the error rate for user rating 4 decreases as the number of trees increases; in other words, the accuracy for predicting user rating 4 increases with the number of trees. This supports the idea that more sampling and more averaging of trees in a random forest result in higher prediction accuracy.

Naïve Bayes: The Naïve Bayes classifier algorithm is based on Bayes' theorem and assumes independence between the different features. The basic idea behind a Bayesian classifier is that if an agent knows the class, it can predict the values of the other features; otherwise, it uses Bayes' rule to predict the class given the feature values. One of the major areas of application for the Naïve Bayes classifier is
text analytics. On applying the Naïve Bayes classifier to the movie dataset, we obtained a maximum accuracy of 31.32% for the feature combination of age, occupation and genre.

k-NN Algorithm: This algorithm stores all the available cases and classifies new cases based on a similarity measure (e.g., a distance function such as Euclidean distance). A case is classified by a majority vote of its neighbors. The basic steps of the k-NN algorithm are:

• Compute the distances between the new sample and all previous samples that have already been classified into clusters.
• Sort the distances in increasing order and select the k samples with the smallest distance values.
• Apply the voting principle.

On applying the k-NN classifier to the movie dataset, we obtained a maximum accuracy of 36.36% for the feature combination of age, gender, occupation and genre.

K-Means Clustering: This is a method of cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. In our case, k-means is used as an item-based technique. To carry out k-means clustering on our dataset, we created a subset of the movie dataset consisting of all the movies and the information about their genres. Using this information, we created clusters based on the genre similarity of the movies; the clusters are formed based on the characteristic features of the movies.

Figure 9: Plot of within/between ratio against k

The first step in carrying out k-means clustering is choosing the number of clusters into which to separate the movies. For this, we applied the elbow method: we needed to choose the number of clusters in such a way that we minimize the within/between ratio.
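The three k-NN steps listed above can be sketched in a few lines of Python (synthetic data with hypothetical features; the report's own experiments were run in R):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    # Step 1: distances to every stored sample; Step 2: keep the k smallest.
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    # Step 3: majority vote among the k nearest labels.
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Hypothetical (age, number-of-genres) features with rating labels.
train = [((20, 2), 4), ((22, 3), 4), ((45, 1), 2), ((50, 2), 2)]
print(knn_predict(train, (21, 2)))  # -> 4
```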
We plotted the number of clusters against the within/between ratio for these clusters and observed that the ratio decreases roughly monotonically, with an elbow at k = 10; beyond that point the ratio occasionally even increased, probably because of randomness in the initialization. Hence, we chose k = 10 as the number of clusters.

Figure 10: Cluster plot of movies after k-means

After running the k-means clustering function in R on our dataset, we derived 10 clusters, each containing movies that fall under similar genres.

Multinomial Logistic Regression: Multinomial logistic regression is the regression analysis to conduct when the dependent variable is nominal with more than two levels; it is thus an extension of logistic regression, which analyzes dichotomous (binary) dependent variables. Multinomial logistic regression is a type of predictive regression analysis. We computed the coefficients of a multinomial regression for the model age + occupation + genre for predicting the rating of a movie, and obtained the following regression coefficients:

(Intercept)               -0.3407205
age                        0.03921094
occupationartist           0.1029380
occupationdoctor           0.2490433
occupationeducator        -0.17566946
occupationengineer         0.03247879
occupationentertainment   -0.28149450
occupationexecutive       -1.136975
occupationhealthcare      -2.2304297
occupationhomemaker       -0.70603732
occupationnone             0.20311825
occupationlawyer          -0.1625073
occupationlibrarian       -0.25623202
occupationmarketing        1.2161332
occupationother            0.1225862
occupationprogrammer       0.04265024
occupationretired         -1.4250644
occupationsalesman        -0.1073120
occupationscientist       -0.08942207
occupationstudent          0.2849286
occupationtechnician       0.0133927
occupationwriter          -0.45672898
Action                    -0.190309482
Adventure                  0.3373544
Animation                  1.0798400
Children                  -0.6189434
Comedy                    -0.20248454
Crime                      0.19747754
Documentary                0.3946927
Drama                      0.7341593
Fantasy                   -0.4277778
Film_Noir                  1.6413046
Horror                    -0.37720373
Musical                    0.2171512
Mystery                    0.3038321
Romance                    0.27989829
Sci_Fi
Thriller                   0.1479345
War                        0.76651665
Western                    0.6193790

The maximum accuracy we obtained for this logistic regression model was 0.3576.

Random Forest Classifier on Test Data: The feature combination ‘Age + Occupation + Genre’, which gives the highest accuracy of 0.3676 for the random forest classifier, was applied to the test data. We again plotted the error rate against the number of trees; here, the number of trees considered is 500. From the graph it can be observed that the error rate for user rating 4 decreases as the number of trees increases, so here too the accuracy for predicting user rating 4 increases with the number of trees.

Figure 11: Error rate over Number of Trees for Test Data

To assess the accuracy we calculated the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE), obtaining RMSE = 1.158 and MAE = 0.826. The MAE and RMSE values can be used to analyze the variation of the errors in a set of forecasts: the greater the difference between them, the greater the variance of the errors in the sample. In this case, their small difference and low values support that the model predicts the rating with high accuracy.

V. RECOMMENDATION SYSTEM

Predicting ratings and creating personalized recommendations is something that almost every recommendation system does. The approaches that recommendation systems use can broadly be classified into two categories:
• Content-based approach
• Collaborative filtering approach

The content-based approach is based on the idea that if we can understand the preference structure of a customer (user) concerning product (movie) attributes, then we can recommend movies which rank high on the user's most desirable attributes. For our recommendation system, however, we have used the collaborative filtering approach (the recommenderlab package in R). The basic idea of collaborative filtering is that, given rating data by many users for multiple movies, one can predict a user's rating for a movie that he/she has never watched, and thereby create a recommendation list of the top-N movies based on the predicted ratings.

While designing our recommendation system, we had a dataset consisting of the ratings provided by many users for many movies as the basis for predicting the missing ratings. That is, we have a set of users U = {u1, u2, . . . , um} and a set of items I = {i1, i2, . . . , in}. Ratings are stored in an m × n user-item rating matrix R = (rjk), where each row represents a user uj with 1 ≤ j ≤ m and each column represents an item ik with 1 ≤ k ≤ n; rjk represents the rating of user uj for item ik. Typically only a small fraction of the ratings are known, and the values of many cells in R are missing.

Predicting the missing ratings on a scale of 1-5 (as in the training data) is essentially a regression problem that is solved by the recommendation system. The next step involves creating the top-N recommendation list based on all the predicted ratings. In theory, when dealing with large datasets, predicting ratings for each and every user-movie pair becomes computationally expensive; there are therefore rule-based approaches that predict the top-N recommended items directly.
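The user-item rating matrix R described above can be illustrated with a small synthetic example (hypothetical users and items; None marks a missing rating rjk):

```python
# Rows are users u1..u3, columns are items i1..i4; None marks an unknown rating.
users = ["u1", "u2", "u3"]
items = ["i1", "i2", "i3", "i4"]
ratings = {("u1", "i1"): 5, ("u1", "i3"): 3,
           ("u2", "i1"): 4, ("u2", "i2"): 2,
           ("u3", "i3"): 4, ("u3", "i4"): 1}

R = [[ratings.get((u, i)) for i in items] for u in users]
print(R)
# -> [[5, None, 3, None], [4, 2, None, None], [None, None, 4, 1]]
```

The recommender's job is to fill in the None cells; sorting a user's filled-in row by predicted rating then yields that user's top-N list.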
Collaborative filtering approaches can be broadly divided into two groups:

• Memory-based collaborative filtering
• Model-based collaborative filtering
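As an illustration of the memory-based variant, here is a minimal user-based collaborative filtering sketch in Python (entirely synthetic ratings; our actual system uses recommenderlab in R): cosine similarity over co-rated items, then a similarity-weighted average of the neighbors' ratings.

```python
import math

def cosine(u, v):
    """Cosine similarity over the items both users have rated (None = missing)."""
    co = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    num = sum(a * b for a, b in co)
    den = math.sqrt(sum(a * a for a, _ in co)) * math.sqrt(sum(b * b for _, b in co))
    return num / den if den else 0.0

def predict(R, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    sims = [(cosine(R[user], R[v]), R[v][item])
            for v in range(len(R)) if v != user and R[v][item] is not None]
    norm = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / norm if norm else None

R = [[5, 3, None, 1],   # active user: rating for item 2 is missing
     [4, None, 4, 1],   # very similar neighbor who rated item 2 with 4
     [1, 1, 2, 5]]      # dissimilar neighbor who rated item 2 with 2
print(round(predict(R, 0, 2), 2))  # -> 3.41, pulled towards the similar user
```

Restricting the sum in `predict` to the k largest similarities (or to similarities above a threshold) gives exactly the neighborhood selection described below.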
In memory-based collaborative filtering, the whole user dataset is used to create recommendations. The most common example is the user-based collaborative filtering algorithm, in which we essentially assume that individuals with similar preferences will rate items similarly. In this approach, a user's missing rating is predicted by first finding a neighborhood of similar users and then aggregating the ratings of these users to compute a prediction. The neighborhood is defined using a similarity score between users (calculated using cosine similarity), and consists of the most similar users, or the users whose similarity score exceeds a given threshold. To summarize, the neighborhood for an active user can be selected either by a threshold on the similarity or by considering the k nearest neighbors.

VI. CONCLUSION

Of all the classifiers built, the random forest classifier gives the highest prediction accuracy. Seven feature combinations were used to test the accuracy of each classifier, and the combination ‘Age + Occupation + Genre’ gives the highest accuracy of 0.3676. RMSE and MAE were calculated to assess the accuracy of the classifiers: when the above feature combination was applied to the full test data, the RMSE value was 1.172 and the MAE value was 0.849. Similarly, when RMSE and MAE were calculated for the recommendation system built using unsupervised learning, the RMSE was 1.06 and the MAE was 0.76. As the difference between the two is smaller in the case of the recommender system, we can say that accuracy increases with collaborative filtering.

The limitations of the dataset cannot be neglected either. All the classifiers built give a very low accuracy percentage, which could be because the data had more user details than movie details.
If movie details such as director, actors and duration had been included in the data frame, the prediction accuracy would probably have been higher. Moreover, having more movie details would make the data suitable for building a recommendation system using item-based collaborative filtering, which is a more sophisticated approach.

REFERENCES

1. https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
2. Bhalla, D. (2015). Random Forests Explained in Simple Terms. Retrieved from: http://www.listendata.com/2014/11/random-forest-with-r.html
3. Hothorn, T., Hornik, K., Zeileis, A. (2014). ctree: Conditional Inference Trees. Retrieved from: https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf
4. Hothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Retrieved from: http://eeecon.uibk.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf
5. Jones, M. (2013, December 12). Introduction to approaches and algorithms. Retrieved from: http://www.ibm.com/developerworks/library/os-recommender1/index.html
6. Marafi, S. (2014, April 26). Collaborative Filtering with R. Retrieved from: http://www.salemmarafi.com/code/collaborative-filtering-r/
7. Ridwan, M. (n.d.). Predicting Likes: Inside A Simple Recommendation Engine's Algorithms. Retrieved from: http://www.toptal.com/algorithms/predicting-likes-inside-a-simple-recommendation-engine
8. Wolf, R. (2011). Conditional inference trees vs traditional decision trees. Retrieved from: http://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
9. Wikipedia. Random Forest. Retrieved from: https://en.wikipedia.org/wiki/Random_forest
10. http://rstudio-pubs-static.s3.amazonaws.com/9893_4cc5f31ec224446d89c5865936c8afee.html
11. http://www.statisticssolutions.com/mlr/