Movie Recommendation Engine Using Artificial Intelligence
Project Report (Wenhuan Wei)
1 title
Movie Recommendation using model-based collaborative filtering
2 names and roles
name: Wenhuan Wei
role: leader, programmer
3 what the program does
The program receives a personal rating text file containing that person's ratings for
certain movies and outputs the top 50 movies recommended for them.
4 key AI technique and brief description
The key AI technique of this program is called model-based collaborative filtering.
Before moving into the definition of model-based collaborative filtering, let’s look at the
definition of collaborative filtering first.
Collaborative filtering is a method of making automatic predictions (filtering) about the
interests of a user by collecting preferences or taste information from many users
(collaborating).
Model-based collaborative filtering first computes a feature vector for each user and
each item, then produces a prediction by taking the dot product of the corresponding
user feature vector and item feature vector.
Below is a simple example explaining how model-based collaborative filtering
works. Suppose we have a 2*3 rating matrix, in which each row holds one user's
ratings for all movies and each column holds one movie's ratings from all users.
As the example below indicates, model-based collaborative filtering decomposes the
rating matrix into two "simple" matrices called the user matrix (U) and the item
matrix (M). Each row of the user matrix is the feature vector for that user, and each
row of the item matrix is the feature vector for that item (movie). The values in the
user matrix and item matrix are learned by the alternating least squares (ALS)
algorithm based on the loss function:

L(U, M) = sum over known ratings (i, j) of (r_ij - u_i . m_j)^2
          + lambda * (sum_i ||u_i||^2 + sum_j ||m_j||^2)

The first part of the loss function minimizes the distance between each prediction
and the corresponding rating in the rating matrix; the second part is a regularization
term, used in machine learning to avoid overfitting.
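The alternating least squares idea above can be sketched in plain NumPy. This is an illustration under simplified assumptions (dense matrices, a known-ratings mask, small fixed lambda and feature size), not the project's code; the project itself uses Spark MLlib's ALS. With M fixed, each user row of U has a closed-form ridge-regression solution, and vice versa:

```python
# Minimal ALS sketch in plain NumPy (illustration only; the project
# uses Spark MLlib's ALS). Alternates closed-form regularized least
# squares solves for U (with M fixed) and for M (with U fixed).
import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """R: rating matrix; mask: 1 where a rating is known, 0 otherwise."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.random((n_users, k))
    M = rng.random((n_items, k))
    for _ in range(iters):
        # Solve for each user's feature vector with M fixed:
        # minimizes sum_j (r_ij - u_i . m_j)^2 + lam * ||u_i||^2
        for i in range(n_users):
            idx = mask[i] > 0
            A = M[idx].T @ M[idx] + lam * np.eye(k)
            b = M[idx].T @ R[i, idx]
            U[i] = np.linalg.solve(A, b)
        # Solve for each item's feature vector with U fixed
        for j in range(n_items):
            idx = mask[:, j] > 0
            A = U[idx].T @ U[idx] + lam * np.eye(k)
            b = U[idx].T @ R[idx, j]
            M[j] = np.linalg.solve(A, b)
    return U, M

# The 2*3 example from the report: user1's rating for movie3 is unknown
R = np.array([[5.0, 3.0, 0.0],
              [2.0, 4.0, 5.0]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
U, M = als(R, mask)
pred = U @ M.T
# Known entries are approximately reproduced; the (user1, movie3)
# entry is the model's prediction for the missing rating.
```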
R      | movie1 | movie2 | movie3
user1  |   5    |   3    |
user2  |   2    |   4    |   5

U      | amount of action | complexity of characters
user1  |        1         |           0.1
user2  |       0.2        |           1

M      | amount of action | complexity of characters
movie1 |        5         |           0
movie2 |        3         |           3
movie3 |        0         |           5
5 sample session
A. input (personal rating text file):
0::1::1::1400000000::Toy Story (1995)
0::780::1::1400000000::Independence Day (a.k.a. ID4) (1996)
0::590::4::1400000000::Dances with Wolves (1990)
0::1210::4::1400000000::Star Wars: Episode VI - Return of the Jedi (1983)
0::648::4::1400000000::Mission: Impossible (1996)
0::344::5::1400000000::Ace Ventura: Pet Detective (1994)
0::165::5::1400000000::Die Hard: With a Vengeance (1995)
0::153::5::1400000000::Batman Forever (1995)
0::597::2::1400000000::Pretty Woman (1990)
0::1580::5::1400000000::Men in Black (1997)
0::231::3::1400000000::Dumb & Dumber (1994)
The text file has the format: userId::movieId::rate::timestamp::movieName
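A line in this format can be split on the "::" separator; below is a minimal parsing sketch (the function name is my own, not from the project):

```python
# Sketch of parsing one line of the ratings file; field names follow
# the userId::movieId::rate::timestamp::movieName format above.
def parse_rating(line):
    user_id, movie_id, rate, timestamp, movie_name = line.strip().split("::")
    return int(user_id), int(movie_id), float(rate), int(timestamp), movie_name

record = parse_rating("0::1::1::1400000000::Toy Story (1995)")
# record == (0, 1, 1.0, 1400000000, 'Toy Story (1995)')
```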
B. output (movies recommended):
Movies recommended for you:
1: Teddy Bear (Mis) (1981)
2: Bat People, The (1974)
3: FLCL (2000)
4: Maradona by Kusturica (2008)
5: Freud: The Secret Passion (1962)
6: Class of Nuke 'Em High Part II: Subhumanoid Meltdown (1991)
7: Longshots, The (2008)
8: Naked Man, The (1998)
9: Waterboys (2001)
10: Stargate: The Ark of Truth (2008)
11: Carmen (1983)
12: RocknRolla (2008)
13: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
C. some intermediate results:
Got 10000054 rates from 69878 users on 10677 movies.
Training: 6002484 validation: 1999675 test: 1997906
The optimal model was trained with size of feature vector = 20 and regularization
parameter = 0.1 and number of iterations = 30 and its Loss on the test data is
0.812826304717
6 brief demo instructions:
First, install Spark (http://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm) on your computer.
Second, download the data files "ratings.dat", "movies.dat", and "personalRatings.txt" to your
local computer and note the directory containing them.
Third, modify the data file paths in the program so that it can load the data.
Fourth, you can edit the ratings in "personalRatings.txt" (the third column) to reflect your own
preferences for those movies. These ratings will be used to train a model, and at the end of
the run you will receive 50 movies strongly recommended for you.
Finally, type "spark-submit --master local MRecommendation.py" in the terminal.
7 key code
import itertools
from pyspark.mllib.recommendation import ALS

## Combinations of parameters and initialization
nFeature = [6, 10, 14, 20]
pRegularization = [0.1, 1.0, 5.0, 10.0]
nIterations = [10, 20, 30]
optimalModel = None
optimalValidationLoss = 10000000.0
optimalNFeature = 0
optimalNRegularization = 0.0
optimalNIterations = 0
## Try different combinations of the parameters
for nFe, pRe, nIt in itertools.product(nFeature, pRegularization, nIterations):
    ## Train a model using the ALS algorithm
    model = ALS.train(training, nFe, nIt, pRe)
    ## Compute the loss of the model on the validation data
    validationLoss = computeLoss(model, validation, nValidation)
    print("Loss = " + str(validationLoss)
          + " for the model trained with size of feature vector = " + str(nFe)
          + " and regularization parameter = " + str(pRe)
          + " and number of iterations = " + str(nIt))
    ## Keep the optimal model based on validation loss
    if validationLoss < optimalValidationLoss:
        optimalModel = model
        optimalValidationLoss = validationLoss
        optimalNFeature = nFe
        optimalNRegularization = pRe
        optimalNIterations = nIt
Above is the key code in this program, used to train the optimal model (user matrix
and item matrix) based on the training and validation data. The training and validation
data were split earlier.
The first part of this code lists the candidate values for each model parameter.
Training, in short, amounts to picking the optimal parameters for the model: try
different combinations of parameters, compute the value of each trained model's loss
function, and pick the best one.
The second part of this code tunes the parameters: it trains a model with a given
combination of parameters on the training data, computes that model's loss on the
validation data, and keeps the model with the lowest loss.
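The computeLoss helper is not shown in the report; given the reported test loss of about 0.81 on 1-5 star ratings, it is plausibly a root-mean-square error (RMSE) between predicted and actual ratings. A minimal sketch of that metric, assuming this definition:

```python
# Hypothetical sketch of the loss metric as RMSE (the report's
# computeLoss is not shown, so this exact definition is an assumption).
import math

def rmse(predictions, actuals):
    """Root-mean-square error between two equal-length rating lists."""
    assert len(predictions) == len(actuals) and predictions
    se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(se / len(predictions))
```

In the Spark program this would be computed by predicting ratings for the (user, movie) pairs of the validation set and comparing them against the actual ratings.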
8 what I learned in this project
Understanding of a distributed system called Hadoop
Processing data (load, split, join, reduce, partition) and implementing machine
learning (ALS training, prediction) through Spark pair RDDs
The difference between memory-based and model-based collaborative filtering
The concept of model-based collaborative filtering
The alternating least squares (ALS) algorithm
9 what to add
A more interactive input, such as a console, instead of a text file
A web application instead of a standalone Python program
10 references
All the data are from:
http://grouplens.org/datasets/movielens/
The program logic is based on:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html