Project Report (Wenhuan Wei)
1 title
Movie Recommendation using model-based collaborative filtering
2 names and roles
name: Wenhuan Wei
role: leader, programmer
3 what the program does
The program reads a text file containing one person's ratings for certain movies and
outputs the top 50 movies recommended for that person.
4 key AI technique and brief description
The key AI technique of this program is model-based collaborative filtering.
Before defining model-based collaborative filtering, let's look at the definition of
collaborative filtering.
Collaborative filtering is a method of making automatic predictions (filtering) about the
interests of a user by collecting preferences or taste information from many users
(collaborating).
Model-based collaborative filtering first computes a feature vector for each user and
each item. It then predicts a rating as the dot product of the corresponding user feature
vector and item feature vector.
Below is a simple example of how model-based collaborative filtering works. Suppose we
have a 2×3 rating matrix: each row holds one user's ratings of all movies, and each
column holds one movie's ratings from all users.
As the example below shows, model-based collaborative filtering decomposes the rating
matrix (R) into two "simple" matrices, the user matrix (U) and the item matrix (M). Each
row of the user matrix is the feature vector for that user, and each row of the item
matrix is the feature vector for that item (movie). The values in the user matrix and
item matrix are learned by the alternating least squares (ALS) algorithm, which minimizes
a loss function.
The first part of the loss function minimizes the distance between each prediction and
the corresponding rating in the rating matrix. The second part is a regularization term,
used in machine learning to avoid overfitting.
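A loss with these two parts is commonly written in the form below. This is a sketch under assumptions, not necessarily the exact formula the program optimizes: the symbols r_ij (observed rating of movie j by user i), u_i and m_j (rows of U and M), λ (regularization parameter) and K (the set of observed user/movie pairs) are notation introduced here, and Spark's ALS may weight the regularization term differently.

```latex
\min_{U, M} \;
\sum_{(i,j) \in K} \left( r_{ij} - u_i^{\top} m_j \right)^2
\; + \;
\lambda \left( \sum_{i} \lVert u_i \rVert^2 + \sum_{j} \lVert m_j \rVert^2 \right)
```

The first sum is the prediction error over observed ratings only; the second term penalizes large feature values.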
R        movie1   movie2   movie3
user1    5        3
user2    2        4        5

U        amount of action   complexity of characters
user1    1                  0.1
user2    0.2                1

M        amount of action   complexity of characters
movie1   5                  0
movie2   3                  3
movie3   0                  5
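As a small illustration of the dot-product prediction, the sketch below multiplies the U and M values from the example. The function and variable names are hypothetical and are not taken from the program.

```python
# Feature vectors copied from the example: (amount of action, complexity of characters)
USER_FEATURES = {
    "user1": [1.0, 0.1],
    "user2": [0.2, 1.0],
}
MOVIE_FEATURES = {
    "movie1": [5.0, 0.0],
    "movie2": [3.0, 3.0],
    "movie3": [0.0, 5.0],
}


def predict_rating(user_vector, movie_vector):
    """Predicted rating = dot product of a user vector and a movie vector."""
    return sum(u * m for u, m in zip(user_vector, movie_vector))


def predict_all(user_features, movie_features):
    """Return {user: {movie: predicted rating}} for every user/movie pair."""
    return {
        user: {
            movie: predict_rating(user_vector, movie_vector)
            for movie, movie_vector in movie_features.items()
        }
        for user, user_vector in user_features.items()
    }


if __name__ == "__main__":
    for user, row in predict_all(USER_FEATURES, MOVIE_FEATURES).items():
        print(user, {movie: round(value, 1) for movie, value in row.items()})
```

With these values the predictions are about 5, 3.3 and 0.5 for user1 and about 1, 3.6 and 5 for user2, so the factors only approximate R.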
5 sample session
A. input (personal ratings text file):
0::1::1::1400000000::Toy Story (1995)
0::780::1::1400000000::Independence Day (a.k.a. ID4) (1996)
0::590::4::1400000000::Dances with Wolves (1990)
0::1210::4::1400000000::Star Wars: Episode VI - Return of the Jedi (1983)
0::648::4::1400000000::Mission: Impossible (1996)
0::344::5::1400000000::Ace Ventura: Pet Detective (1994)
0::165::5::1400000000::Die Hard: With a Vengeance (1995)
0::153::5::1400000000::Batman Forever (1995)
0::597::2::1400000000::Pretty Woman (1990)
0::1580::5::1400000000::Men in Black (1997)
0::231::3::1400000000::Dumb & Dumber (1994)
Each line of the text file has the format userId::movieId::rate::timestamp::movieName
B. output (recommended movies):
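A line in this format could be parsed as in the sketch below. The `Rating` type and `parse_rating` function are hypothetical names used for illustration; the program's own loading code is not shown in this report.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int
    title: str


def parse_rating(line: str) -> Rating:
    """Parse one 'userId::movieId::rate::timestamp::movieName' line."""
    # Split at most 4 times so the movie name stays in one piece.
    user_id, movie_id, rating, timestamp, title = line.strip().split("::", 4)
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp), title)


if __name__ == "__main__":
    print(parse_rating("0::1580::5::1400000000::Men in Black (1997)"))
```

Single colons inside a title, as in "Star Wars: Episode VI", are not affected because only the double-colon separator is split on.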
Movies recommended for you:
1: Teddy Bear (Mis) (1981)
2: Bat People, The (1974)
3: FLCL (2000)
4: Maradona by Kusturica (2008)
5: Freud: The Secret Passion (1962)
6: Class of Nuke 'Em High Part II: Subhumanoid Meltdown (1991)
7: Longshots, The (2008)
8: Naked Man, The (1998)
9: Waterboys (2001)
10: Stargate: The Ark of Truth (2008)
11: Carmen (1983)
12: RocknRolla (2008)
13: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
C. some intermediate results:
Got 10000054 rates from 69878 users on 10677 movies.
Training: 6002484 validation: 1999675 test: 1997906
The optimal model was trained with size of feature vector = 20 and regularization
parameter = 0.1 and number of iterations = 30 and its Loss on the test data is
0.812826304717
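The counts above correspond to roughly 60% training, 20% validation and 20% test data. The report does not say how the split is made, so the sketch below assumes a simple random split; `split_ratings` and its parameters are hypothetical names.

```python
import random


def split_ratings(ratings, train_share=0.6, validation_share=0.2, seed=42):
    """Randomly assign each rating to training, validation or test.

    Assumption: a random split with the given shares; the remaining
    share (0.2 by default) goes to the test set.
    """
    rng = random.Random(seed)
    training, validation, test = [], [], []
    for rating in ratings:
        draw = rng.random()
        if draw < train_share:
            training.append(rating)
        elif draw < train_share + validation_share:
            validation.append(rating)
        else:
            test.append(rating)
    return training, validation, test


if __name__ == "__main__":
    parts = split_ratings(range(10000))
    print([len(part) for part in parts])
```

Every rating lands in exactly one of the three sets, so the test loss is measured on data the model never saw during training or tuning.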
6 brief demo instructions:
First, install Spark (http://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm)
on your computer.
Second, download the data files "ratings.dat", "movies.dat" and "personalRatings.txt" to
your local computer and note the directory that holds them.
Third, modify the data file directory in the program so that the program can load the
data.
Fourth, edit the ratings in "personalRatings.txt" to reflect your own preferences for
those movies by changing the third column in the file. These ratings are used later to
train the model, and at the end of the run you receive the 50 movies most strongly
recommended for you.
Finally, type "spark-submit --master local MRecommendation.py" in a terminal.
7 key code
## Imports needed by this excerpt
import itertools
from pyspark.mllib.recommendation import ALS

## Combinations of parameters and initialization
## (training, validation, nValidation and computeLoss are defined elsewhere in the program)
nFeature = [6, 10, 14, 20]
pRegularization = [0.1, 1.0, 5.0, 10.0]
nIterations = [10, 20, 30]
optimalModel = None
optimalValidationLoss = 10000000.0
optimalNFeature = 0
optimalNRegularization = 0.0
optimalNIterations = 0

## Try different combinations of the parameters
for nFe, pRe, nIt in itertools.product(nFeature, pRegularization, nIterations):
    ## Train model using ALS algorithm (arguments: ratings, rank, iterations, lambda)
    model = ALS.train(training, nFe, nIt, pRe)
    ## Compute the loss of the model on the validation data
    validationLoss = computeLoss(model, validation, nValidation)
    print("Loss = " + str(validationLoss)
          + " for the model trained with size of feature vector = " + str(nFe)
          + " and regularization parameter = " + str(pRe)
          + " and number of iterations = " + str(nIt))
    ## Pick optimal model based on loss
    if validationLoss < optimalValidationLoss:
        optimalModel = model
        optimalValidationLoss = validationLoss
        optimalNFeature = nFe
        optimalNRegularization = pRe
        optimalNIterations = nIt
The code above is the key part of the program: it trains the optimal model (user matrix
and item matrix) using the training data and validation data, which were split
beforehand.
The first part of the code lists the candidate values for each model parameter. In
short, training here means picking the optimal parameters by trying different
combinations, computing the loss of each trained model, and keeping the best one.
The second part of the code tunes the parameters: it trains a model with each parameter
combination on the training data, computes that model's loss on the validation data, and
picks the model with the lowest validation loss.
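The body of computeLoss is not shown in this report. One plausible choice, assumed here purely for illustration, is the root mean squared error (RMSE) between predicted and actual ratings. The sketch below uses plain Python lists instead of Spark RDDs, and `rmse` and `pick_best` are hypothetical names.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def pick_best(candidates, actual):
    """Return (params, loss) for the candidate with the lowest validation loss.

    candidates maps a parameter tuple to that model's predictions
    on the validation ratings.
    """
    best_params, best_loss = None, float("inf")
    for params, predicted in candidates.items():
        loss = rmse(predicted, actual)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


if __name__ == "__main__":
    validation_ratings = [5.0, 3.0, 4.0]
    candidates = {
        (6, 0.1, 10): [4.0, 2.0, 3.0],   # off by 1 everywhere
        (20, 0.1, 30): [5.0, 3.5, 4.0],  # closer predictions
    }
    print(pick_best(candidates, validation_ratings))
```

Selecting on validation loss rather than training loss keeps the comparison between parameter combinations from rewarding overfitting.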
8 What I learned in this project
Understanding of a distributed system called Hadoop
Processing data (load, split, join, reduce, partition) and implementing machine
learning (ALS training, prediction) through Spark pair RDDs
Difference between memory-based and model-based collaborative filtering
Concept of model-based collaborative filtering
Alternating least squares algorithm
9 What to add
A more interactive input, such as a console, instead of a text file
A website application instead of a standalone Python program
10 references
All the data are from:
http://grouplens.org/datasets/movielens/
The program logic is based on:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
