Recommendation System on Amazon Music Reviews
Han Li
hanli1@cs.stonybrook.edu
Luping Su
luping.su@stonybrook.edu
Jiewen Zheng
jiewen.zheng@stonybrook.edu
Abstract
In this project we implement a variety of recommendation system models on music review data from amazon.com. The global averaging method produces a root-mean-square error (RMSE) of 0.91, which serves as a baseline. The hybrid model achieves an average RMSE of 0.77 over 5 subgroups. The latent factor model reduces the RMSE from 0.76 to 0.69 when increasing rank=10, iteration=10 to rank=20, iteration=20. The item-item based collaborative filtering (CF) model yields RMSE of 0.39 and 0.35 without and with baseline ratings, respectively. The ensemble model, which combines the outputs of the latent factor and item-item CF models, reduces the RMSE further to 0.33 when the weights are calculated with the least-squares method.
1 Introduction
Recommendation systems (RSs) consist of algorithms and techniques used to suggest to users the items they are most likely to be interested in (Shapira et al., 2011). RSs are especially important for e-commerce businesses, which usually offer an overwhelming number of items to choose from, creating the long-tail phenomenon (Leskovec et al., 2014). RSs have also become popular in other diverse fields such as social tagging, research article recommendation, and search queries (Aggarwal, 2016).
Popular RS approaches include item-item collaborative filtering, content-based filtering, latent factor models, and ensemble models (see next section) (Shapira et al., 2011). In addition, personalized PageRank has attracted attention in recent years; it models the user-item relation as a bipartite graph and makes recommendations based on the weight of each item node (Gori and Pucci, 2007). In this project we implement each of the above RS models and compare their performance.
2 Background
Popular approaches of RS include:
1. Content Based Filtering
Content-based systems extract properties of the items to be recommended. In a content-based system, the key task is building item profiles and user profiles from these properties. An item is recommended to a user based on the similarity between the user profile and the item profile.
2. Collaborative Filtering
Collaborative filtering uses user-item interaction information, such as text reviews, numeric ratings of items, and purchase frequency, from which the system builds a utility matrix. This approach is more popular than content-based filtering when not enough item profiles are available. Collaborative filtering includes two main approaches: the item-item method and the latent factor model.
3. Hybrid Recommendation Systems
Combining content-based filtering and collaborative filtering often yields more effective results. Possible combinations include running content-based and collaborative filtering separately and merging their results, or using content-based filtering to narrow the scope and collaborative filtering to make the accurate prediction.
3 Data and toolchain
The dataset used in this project consists of user music reviews from amazon.com (size 6 GB). The data contains 6396350 reviews. Each review includes product and user information, a rating, and a plain-text review. The item description file (size 1.8 GB) includes productId and product descriptions and is used in our hybrid model. Details of the data source can be found at https://snap.stanford.edu/data/web-Amazon.html.
Spark (MLlib, Graphframes, GraphX) is used for data mining. Pandas and NumPy are used for data post-processing. Matplotlib is used for data visualization and analysis.
4 Methods
4.1 Global Average Prediction
We start with the simplest and most intuitive method. The purpose of global average prediction is to predict rxi, the missing rating of music item i from user x. In this part, we randomly divide the dataset into training and prediction parts.
• training process
We calculate three averages: µ, bx, and bi, where µ is the global average rating, bx is the rating deviation of user x (bx = average rating of user x − µ), and bi is the rating deviation of item i (bi = average rating of item i − µ).
• prediction process
The missing value rxi is calculated with the following equation:
rxi = µ + bx + bi (1)
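The baseline above can be sketched with pandas; the column names user, item, and rating are illustrative assumptions, not the actual schema of the review dump.

```python
import pandas as pd

def fit_baseline(train):
    # mu: global average rating; b_user/b_item: per-user/per-item deviations
    mu = train["rating"].mean()
    b_user = train.groupby("user")["rating"].mean() - mu
    b_item = train.groupby("item")["rating"].mean() - mu
    return mu, b_user, b_item

def predict_baseline(mu, b_user, b_item, user, item):
    # Eq. (1): r_xi = mu + b_x + b_i; unseen users/items get zero deviation
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)
```

For a cold-start user or item, the corresponding deviation falls back to zero, so the prediction degrades gracefully toward the global average.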
4.2 Collaborative filtering: latent factor
As with global average prediction, the purpose of the latent factor model is to predict rxi. Here, user-item interactions are described by a set of latent factors, and all missing values can be predicted by the product of the corresponding factor vectors. We randomly divide the dataset into training and prediction parts, then use spark.mllib.recommendation.ALS, which implements the alternating least squares (ALS) algorithm, to learn these latent factors.
• training process
Train a MatrixFactorizationModel on the training dataset with spark.mllib.recommendation.ALS.
• prediction process
Predict the missing values with the trained model.
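Spark's ALS is used as a black box here; to illustrate what it optimizes, the sketch below is a minimal dense-matrix version of alternating least squares (our own toy code under that assumption, not the Spark API). Fixing the item factors V, each user row of U is a small ridge regression, and vice versa.

```python
import numpy as np

def als(R, mask, rank=2, iterations=20, reg=0.1, seed=0):
    """Toy alternating least squares: find U (users x rank) and
    V (items x rank) so that R ~ U @ V.T on observed entries (mask == 1)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iterations):
        for x in range(n_users):          # fix V, ridge-solve each user row
            idx = mask[x] > 0
            U[x] = np.linalg.solve(V[idx].T @ V[idx] + eye,
                                   V[idx].T @ R[x, idx])
        for i in range(n_items):          # fix U, ridge-solve each item row
            idx = mask[:, i] > 0
            V[i] = np.linalg.solve(U[idx].T @ U[idx] + eye,
                                   U[idx].T @ R[idx, i])
    return U, V
```

A higher rank keeps more information, and more iterations tighten the fit, which matches the accuracy trend reported in the results section.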
4.3 Collaborative filtering: item-item based
Item-item based collaborative filtering is implemented here with the intuition that items are simpler than users (who often have multiple tastes) (Leskovec et al., 2014). The procedure is as follows.
1. If a user rated an item multiple times, use the average of all the ratings;
2. From there, keep only items associated with at least 30 distinct users and users associated with at least 30 distinct items, and ignore an item if all of its ratings are the same;
3. Randomly sample 3% of the items for validation; the remaining items are used for training;
4. For each item in the validation set, calculate its cosine similarity with the items in the training set (normalizing each item by subtracting its average):

Sxy = Σ_{s∈Sxy} (rxs − r̄x)(rys − r̄y) / ( √(Σ_{s∈Sxy} (rxs − r̄x)²) · √(Σ_{s∈Sxy} (rys − r̄y)²) )   (2)

where the sums run over Sxy, the set of users who rated both items x and y, and r̄x is the average rating of item x;
5. Choose the 50 nearest items in the training set (if fewer than 50 are available, use all of them) and calculate the predicted rating with:

rxi = Σ_{j∈N(i;x)} Sij · rxj / Σ_{j∈N(i;x)} Sij   (3)

6. Besides the method described above, we use a revised version that incorporates baseline ratings:

rxi = bxi + Σ_{j∈N(i;x)} Sij · (rxj − bxj) / Σ_{j∈N(i;x)} Sij   (4)
bxi = µ + bx + bi   (5)

where µ is the global average rating, bxi is the baseline estimate for rxi, bx is the rating deviation of user x (bx = average rating of user x − µ), and bi is the rating deviation of item i (bi = average rating of item i − µ).
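Equations (2) and (3) can be sketched as follows; for brevity this toy version assumes dense rating vectors in which 0 marks a missing rating, which is not how the real utility matrix is stored.

```python
import numpy as np

def centered_cosine(a, b):
    """Eq. (2): cosine similarity between two items' rating vectors after
    subtracting each item's mean, over users who rated both (nonzero here)."""
    common = (a > 0) & (b > 0)
    if not common.any():
        return 0.0
    ac = a[common] - a[a > 0].mean()
    bc = b[common] - b[b > 0].mean()
    denom = np.linalg.norm(ac) * np.linalg.norm(bc)
    return float(ac @ bc / denom) if denom else 0.0

def predict(sims, ratings, k=50):
    """Eq. (3): similarity-weighted average over the k nearest neighbors."""
    order = np.argsort(sims)[::-1][:k]
    s, r = sims[order], ratings[order]
    return float(s @ r / s.sum()) if s.sum() else 0.0
```

The revised method of equation (4) differs only in weighting the residuals rxj − bxj instead of the raw ratings and adding bxi back at the end.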
4.4 Link Analysis: Personalized Page Rank
• Global Page Rank
Page Rank was originally developed for rating the significance of web pages based on their link relationships. Each node represents a website and each directed edge represents a link from the source node to the destination node. Global Page Rank yields a 'stationary' distribution over the nodes via the following iteration (Brin and Page, 1998):

x′ = (1 − α) · A x + α · E   (6)

where x′ is the distribution vector at the next iteration, x is the distribution vector at the current iteration, α is the teleport probability of the random walk, and A is the transition matrix. In global Page Rank, E is a vector of equal values summing to one.
• Personalized Page Rank
Personalized Page Rank (PPR) is a specific version of global Page Rank which jumps back to one or more starting nodes rather than teleporting uniformly across the whole graph. The surfing route in PPR therefore tends to stay around the starting node(s). Compared with global Page Rank, PPR performs a localized random walk, as shown in Figure 1 left.
In PPR, α in equation 6 is still a constant representing the teleport probability. However, the elements of E now carry the source information: all elements of E are zero except for the starting node(s) we are interested in.
Figure 1: (a) Personalized Page Rank – localized random walk. (b) PPR graph for a recommendation system.
• Personalized Page Rank for Recommendation
When using PPR for a recommendation system, each item m or user u is represented by a node. If there is an item-user interaction, we add bidirectional edges between the item and the user. This construction results in a bipartite graph, as in Figure 1 right. We can compute the PPR vector of any user or item, but only the item weights resulting from a PPR run started at a user node give a meaningful recommendation result. Generally, most values in the result vector x will be close to zero; the remaining, clearly weighted item nodes are the ones the system recommends to the starting user node (Gori and Pucci, 2007).
The last question is what counts as an item-user interaction. Since the weight of each node is determined by the number of related links, the item-user interaction should reflect a user's positive attitude toward an item. Given the music reviews dataset, we only keep the (Userx, Itemi, rxi) triples in which rxi − Rx is greater than or equal to 0.2, where Rx is the average rating of user x. For each valid triple, we add bidirectional edges between the user and the item.
• training process
Randomly sample 10 percent of the valid user-item edges (the corresponding items are the ones the system should recommend) and construct the graph with the remaining 90 percent of valid edges.
Run PPR starting from user i and obtain the weight of each item node in the graph.
• prediction process
Recommend the items with the top weights to user i.
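A minimal power-iteration sketch of equation (6) follows; this is a numpy stand-in for the Spark GraphX/Graphframes run, and the column-stochastic transition matrix and node indexing are illustrative assumptions.

```python
import numpy as np

def personalized_pagerank(A, source, alpha=0.25, iterations=100):
    """Iterate Eq. (6): x' = (1 - alpha) * A @ x + alpha * E.

    A: column-stochastic transition matrix of the bipartite user-item graph;
    E: teleport vector with all mass on the source node;
    alpha: teleport probability back to the source."""
    n = A.shape[0]
    E = np.zeros(n)
    E[source] = 1.0
    x = np.full(n, 1.0 / n)
    for _ in range(iterations):
        x = (1 - alpha) * A @ x + alpha * E
    return x
```

Starting from a user node, the item entries of the returned vector are ranked and the top-weighted items are recommended.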
4.5 Ensemble method
Ensemble-based methods have proven successful in previous contests such as the Netflix Grand Prize (Koren, 2009). In this project, we implement an ensemble method using the results from the latent factor model and the CF item-item based model. The predicted rating is calculated as a weighted linear sum over the two models:

ˆr = Σ_{i=1}^{n} wi · ˆri   (7)

where n = 2 in this case. w1 and w2 represent the weights for the latent factor model and the CF item-item based model, respectively.
Figure 2: Ensemble method: weights are calculated using the least-squares method.
Two different approaches are implemented (Adomavicius and Kwon, 2007). The first is to simply use the average of the two models' predictions as the new prediction, i.e., w1 = 0.5 and w2 = 0.5. The second is to calculate the weights using the least-squares method (Figure 2). To achieve this, the predicted ratings from both models are randomly split into two groups (25%/75%). We use the 25% group to solve the equation:

A × w = b   (8)

where A is an m×2 matrix and m is the number of predictions. The first and second columns of A are the predicted ratings from the latent factor and CF item-item based models, respectively, and b is the vector of original ratings. w is solved with the least-squares method. Next we calculate new predicted ratings using w on the 75% group and evaluate their performance by calculating the RMSE (see next section).
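The weight-fitting step of equation (8) reduces to an ordinary least-squares solve, which can be sketched as follows (function names are ours, for illustration):

```python
import numpy as np

def ensemble_weights(pred_lf, pred_cf, truth):
    """Solve A w = b (Eq. 8) for the blending weights by least squares.
    Columns of A are the two models' predictions on the held-out 25% split;
    b holds the corresponding original ratings."""
    A = np.column_stack([pred_lf, pred_cf])
    w, *_ = np.linalg.lstsq(A, truth, rcond=None)
    return w

def blend(pred_lf, pred_cf, w):
    """Eq. (7): weighted linear sum of the two models' predictions."""
    return w[0] * np.asarray(pred_lf) + w[1] * np.asarray(pred_cf)
```

The fitted w is then applied unchanged to the remaining 75% of predictions before computing the RMSE.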
4.6 Evaluation metric
• Root-mean-square error

RMSE = √( (1/n) Σ_{i=1}^{n} (ˆri − ri)² )   (9)
• Top recommended ratio
The top recommended ratio is only used in the link analysis evaluation.

Rt = nppr / nval,  0 ≤ Rt ≤ 1   (10)

nval is the number of items to be recommended before PPR; such items correspond to the item nodes of the deleted edges. nppr is the number of items actually recommended after PPR that are also among the original nval to-be-recommended items. nppr shows how many correct recommendation decisions have been made.
4.7 Hybrid Recommendation: combining content-based and collaborative filtering
The idea of the hybrid recommendation used in this project is: use content-based filtering to group the data and narrow down the calculation range, then use item-item collaborative filtering on the subgroup to which the target item belongs (Li and Kim, 2003). This method consists of four steps, shown below.
1. Group items based on item descriptions
The purpose of this step is to cluster the items into several groups, which narrows down the calculation range for item-item collaborative filtering.
We use three steps to finish the grouping. The first step preprocesses the item descriptions, including removing stop words, tokenizing, and stemming the texts. The second step trains a tf-idf model and calculates the tf-idf values as term weights of the term-document matrix. Finally, we apply singular value decomposition (SVD) to the term-document matrix to obtain relation values and assign each item to a group based on those values.
• Step 1: Preprocessing Data
(a) Input: music item descriptions
(b) Tokenize and lowercase the input
(c) Remove stopwords from the input
(d) Stem the input
(e) Output: preprocessed text
• Step 2: Train tf-idf Model
(a) Train a tf-idf model on the preprocessed text
(b) Convert the weights in the item-document matrix into tf-idf values through the trained model
(c) Output: item-document tf-idf matrix
• Step 3: Train LSI model and group data
(a) Use LsiModel to do a rank-10 SVD on the item-document tf-idf matrix
(b) Get the document-topic relation matrix
(c) For each document, choose the group number with the highest relation index
(d) Assign each document to its group number
(e) Output: list of (group number, document)
2. Item-item collaborative filtering on grouped data
After assigning each document to a subgroup, we implement item-item collaborative filtering on the grouped data.
• Step 4: item-item collaborative filtering on grouped data
(a) Find the subgroup to which the target item belongs
(b) Use item-item CF on the chosen subgroup
(c) Output: the predicted value for the target item
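Steps 2-3 above can be sketched with a plain numpy stand-in (the project uses a trained tf-idf model and gensim's LsiModel; the raw-count tf, fixed vocabulary, and absolute-value topic assignment below are simplifying assumptions):

```python
import numpy as np

def tfidf_matrix(docs, vocab):
    """Step 2: document-term tf-idf weights over a fixed vocabulary.
    docs are lists of tokens; tf is a raw count, idf = log(N / df)."""
    tf = np.array([[doc.count(t) for t in vocab] for doc in docs], float)
    df = (tf > 0).sum(axis=0)
    idf = np.log(len(docs) / np.maximum(df, 1))
    return tf * idf

def group_items(docs, vocab, n_groups=5):
    """Step 3: rank-k SVD of the tf-idf matrix; each document is assigned
    to the latent topic with the highest absolute relation value."""
    X = tfidf_matrix(docs, vocab)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    topics = U[:, :n_groups] * s[:n_groups]   # document-topic relations
    return np.abs(topics).argmax(axis=1)
```

Documents sharing vocabulary end up in the same latent group, which is then used to pick the subgroup for item-item CF in step 4.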
5 Results and discussion
5.1 Global Average Prediction
In the global average prediction part, we randomly divide the dataset into an 80 percent training part and a 20 percent prediction part. Root-mean-square error (RMSE) is used to evaluate the accuracy. We regard global average prediction as a baseline model, which gives an RMSE of 0.91. Starting from this reasonable result, further improvements are introduced in the following sections.
5.2 Collaborative filtering: latent factor
As with global average prediction, we randomly divide the dataset into 80%/20% training/validation.
As shown in Table 1, the accuracy increases with higher rank and iteration numbers. This is reasonable since a higher rank keeps more information during the matrix factorization process. Similarly, a higher iteration number results in a more accurate factorization and restores more accurate concepts. However, there is no free lunch: blindly increasing the rank or iterations may overwhelm memory and easily cause stack overflow in Spark.
The result is shown in Figure 3. The four subplots show the results of different rank-iteration combinations. The prediction errors (prediction error = predicted rating − original rating) are spread out around the ground-truth value. As the original rating values increase, most prediction errors change from positive to negative, a tendency common to all subplots. The latent factor model has no bias toward predicting high or low rating values.

rank  iteration  RMSE
10    10         0.76
15    10         0.73
15    15         0.70
20    20         0.69
Table 1: Latent factor result summary

Method                   RMSE
Without baseline rating  0.39
With baseline rating     0.35
Table 2: CF item-item based model results. A total of 803 items are used for evaluation.
Figure 3: Latent factor outputs with rank-iteration of: (a) 10-10, (b) 15-
10, (c) 15-15, (d) 20-20.
5.3 Collaborative filtering: item-item based
The results of the CF item-item based model are shown in Table 2. The RMSE is reduced by 10% by adding baseline ratings. The statistics of the two methods are shown in Figure 4. Original ratings that are not integers (e.g., 2.5) constitute less than 10% of the total ratings and are not shown in the result. Both methods have large deviations from the original ratings for items with ratings 1 and 2. For the method without baseline ratings, all of the outliers under-predict at rating=5 (Figure 4(a)). After adding baseline ratings, the outliers at rating=5 are distributed in regions both less than and greater than 5, and the average ratings at rating=1 and 2 are brought closer to the original ratings (Figure 4(b)). Hence adding the baseline rating can be interpreted as a way to reduce the noise in the CF item-item model.
Figure 4: Results of CF item-item based model: (a) without baseline
rating, (b) with baseline rating. Red squares indicate average predicted
ratings. Blue dots indicate outliers outside of [5, 95] percentiles. Thick
red lines indicate 1:1 ratio.
5.4 Hybrid Recommendation: combining
content-based and collaborative filtering
The music data includes 29476 distinct items. The item description file covers 154310 distinct items, but only 16035 music items are included. Our content-based step analyzes these 16035 music item descriptions and groups them into 5 groups.
Content-based filtering narrows the calculation range for item-item collaborative filtering down to at most 31.8% (the largest group). The hybrid system achieves an average RMSE of 0.77 over the 5 subgroups, as shown in Table 3.
5.5 Ensemble method
The ensemble method results are shown in Table 4. Because the CF and latent factor models use different random sampling methods, the outputs of the two models share 393 common items, which are used for evaluation. The CF item-item model with baseline produces a lower RMSE than the latent factor model (Table 4). The ensemble model with the averaging-weight method outputs RMSE=0.36, which lies between the CF item-item and latent factor models.

Value name      Value
rank            5
group1 percent  16.1%
group2 percent  31.8%
group3 percent  20.3%
group4 percent  17.9%
group5 percent  13.9%
avg RMSE        0.77
Table 3: group percentages and average RMSE for the groups

Method                           RMSE
CF item-item with baseline       0.34
Latent factor: iter=20, rank=20  0.47
Ensemble: averaging weight       0.36
Ensemble: least-squares weight   0.33
Table 4: Ensemble method results. A total of 393 items are used for evaluation.
Figure 5: Comparison of latent factor and CF item-item based models.
A comparison of the predicted ratings between the CF item-item and latent factor models is shown in Figure 5. Both models' predicted ratings deviate from the 1:1 line. This indicates that simply averaging the predicted ratings may not improve the ensemble results, so we resort to calculating the weights with the least-squares method. The weights are 0.82 and 0.18 for the CF item-item and latent factor models, respectively. This means the CF item-item model contributes more to making better predictions, consistent with its lower RMSE value. With these weights, the ensemble model achieves an RMSE of 0.33, lower than both the CF item-item and latent factor models.
users  α     nppr  nval  Rt
300    0.1   880   1658  0.53
300    0.25  933   1586  0.59
300    0.5   610   1749  0.35
1000   0.1   2886  5691  0.51
1000   0.25  3209  5624  0.57
1000   0.5   2073  5495  0.38
Table 5: PPR result summary
5.6 Link Analysis: Personalized Page Rank
In this part, we run one personalized page rank per user. The system recommends the top weighted items with weight greater than 0.0001 out of a total of 29476 items. Typically there are fewer than 100 nodes with weight greater than 0.001; restricted to item nodes, the recommendation scope is even smaller.
nppr is the number of correct recommendations the system provides. nval is the maximum number of correct recommendations we could get, which is also the number of user-item edges of the starting user nodes deleted in the training process. Rt is the ratio of nppr to nval, representing the percentage of right recommendations. In this project, a right recommendation means the original rating rxi from user x to item i is at least 0.2 greater than Rx, the average rating of user x.
As shown in Table 5, the 300-user and 1000-user runs (PPR with 300 and 1000 different starting nodes) show similar results. α = 0.25 gives the best recommendations. When α is too small, in our case α = 0.1, personalized page rank reduces to general page rank, which captures global popularity rather than popularity specific to a certain user. When α is too large, in our case α = 0.5, PPR cannot gather enough link information from the graph: the large α forces the random walk back to the source node so often that it loses useful information from the larger neighborhood.
6 Conclusion
In this project, we use the global average and latent factor models as baseline methods to predict numerical ratings. Starting from these, item-item collaborative filtering and result ensembling show a significant accuracy improvement. Our content-based method narrows down the calculation range. PPR focuses on predicting the right items rather than numerical ratings. Our experiments demonstrate that α matters a great deal to the model, both practically and theoretically.
References
[Adomavicius and Kwon2007] Gediminas Adomavicius and YoungOk Kwon. 2007. New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3):48–55.
[Aggarwal2016] Charu C Aggarwal. 2016. Recommender Systems: The Textbook. Springer.
[Brin and Page1998] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, pages 107–117.
[Gori and Pucci2007] Marco Gori and Augusto Pucci. 2007. ItemRank: A random-walk based scoring algorithm for recommender engines. IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence, pages 2766–2771.
[Koren2009] Yehuda Koren. 2009. The BellKor solution to the Netflix Grand Prize. Netflix prize documentation, 81:1–10.
[Leskovec et al.2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.
[Li and Kim2003] Qing Li and Byeong Man Kim. 2003. An approach for combining content-based and collaborative filters. In Proceedings of the sixth international workshop on Information retrieval with Asian languages, Volume 11, pages 17–24. Association for Computational Linguistics.

More Related Content

What's hot

IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET -  	  Movie Genre Prediction from Plot Summaries by Comparing Various C...IRJET -  	  Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...IRJET Journal
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...ijcsit
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationIJECEIAES
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithmsswapnac12
 
Sca a sine cosine algorithm for solving optimization problems
Sca a sine cosine algorithm for solving optimization problemsSca a sine cosine algorithm for solving optimization problems
Sca a sine cosine algorithm for solving optimization problemslaxmanLaxman03209
 
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoHridyesh Bisht
 
A comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalA comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalcsandit
 
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...CSCJournals
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATA
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATABINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATA
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATAacijjournal
 
Feature selection using modified particle swarm optimisation for face recogni...
Feature selection using modified particle swarm optimisation for face recogni...Feature selection using modified particle swarm optimisation for face recogni...
Feature selection using modified particle swarm optimisation for face recogni...eSAT Journals
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
 
Multiview Alignment Hashing for Efficient Image Search
Multiview Alignment Hashing for Efficient Image SearchMultiview Alignment Hashing for Efficient Image Search
Multiview Alignment Hashing for Efficient Image Search1crore projects
 

What's hot (17)

IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET -  	  Movie Genre Prediction from Plot Summaries by Comparing Various C...IRJET -  	  Movie Genre Prediction from Plot Summaries by Comparing Various C...
IRJET - Movie Genre Prediction from Plot Summaries by Comparing Various C...
 
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
A Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity PropagationA Preference Model on Adaptive Affinity Propagation
A Preference Model on Adaptive Affinity Propagation
 
Genetic algorithms
Genetic algorithmsGenetic algorithms
Genetic algorithms
 
Sca a sine cosine algorithm for solving optimization problems
Sca a sine cosine algorithm for solving optimization problemsSca a sine cosine algorithm for solving optimization problems
Sca a sine cosine algorithm for solving optimization problems
 
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demo
 
A comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalA comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrieval
 
I AM SAM web app
I AM SAM web appI AM SAM web app
I AM SAM web app
 
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATA
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATABINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATA
BINARY SINE COSINE ALGORITHMS FOR FEATURE SELECTION FROM MEDICAL DATA
 
Feature selection using modified particle swarm optimisation for face recogni...
Feature selection using modified particle swarm optimisation for face recogni...Feature selection using modified particle swarm optimisation for face recogni...
Feature selection using modified particle swarm optimisation for face recogni...
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Multiview Alignment Hashing for Efficient Image Search
Multiview Alignment Hashing for Efficient Image SearchMultiview Alignment Hashing for Efficient Image Search
Multiview Alignment Hashing for Efficient Image Search
 

Viewers also liked (13)

Music videos pitch presentation (BradleyBarnes16)
Ada 1
Elem of design unit 10 module 3 document creation (kateridrex)
How I went about creating our Artists Facebook Page ... (BradleyBarnes16)
Digipak analysis ½
Fund of design unit 9 module 2 how to create a visual pulse (kateridrex)
Kieu Thien Van - Individual Challenge 2 - Young Marketers Elite Development P... (kieuthienvan)
economia de panama (marcoan04)
Young Marketers Elite 3 Individual Graduation Case 2 - Nguyễn Trường Liêm (Nguyen Liem)
Approximate nearest neighbor methods and vector models – NYC ML meetup (Erik Bernhardsson)
Music recommendations @ MLConf 2014 (Erik Bernhardsson)
Algorithmic Music Recommendations at Spotify (Chris Johnson)
From Idea to Execution: Spotify's Discover Weekly (Chris Johnson)

Similar to Recommendation Systems for Amazon Music Reviews (20)

Item Based Collaborative Filtering Recommendation Algorithms (nextlib)
A Modified KS-test for Feature Selection (IOSR Journals)
Item basedcollaborativefilteringrecommendationalgorithms (Aravindharamanan S)
Recommendation Systems (Robin Reni)
RBHF_SDM_2011_Jie (MDO_Lab)
(Gaurav sawant & dhaval sawlani)bia 678 final project report (Gaurav Sawant)
A report on designing a model for improving CPU Scheduling by using Machine L... (MuskanRath1)
Study on Evaluation of Venture Capital Based on Interactive Projection Algorithm (inventionjournals)
Analysis of Rayleigh Quotient in Extrapolation Method to Accelerate the Compu... (IOSR Journals)
Building the Professional of 2020: An Approach to Business Change Process Int...
Download (butest)
Download (butest)
International Journal of Computational Engineering Research(IJCER) (ijceronline)
Performance Comparision of Machine Learning Algorithms (Dinusha Dilanka)
Understanding the Applicability of Linear & Non-Linear Models Using a Case-Ba... (ijaia)
Summer internship 2014 report by Rishabh Misra, Thapar University (Rishabh Misra)
A Novel Collaborative Filtering Algorithm by Bit Mining Frequent Itemsets (Loc Nguyen)
pdf
Machine learning - session 3 (Luis Borbon)
SemiBoost: Boosting for Semi-supervised Learning (butest)

Recommendation Systems for Amazon Music Reviews

Popular approaches of RS include:
1. Content-Based Filtering
Content-based systems extract properties of the items to be recommended. The key component of a content-based system is building item profiles and user profiles from these properties; an item is then recommended to a user based on the similarity between the user profile and the item profile.

2. Collaborative Filtering
Collaborative filtering uses user-item interaction information, such as text reviews, numeric ratings, and purchase frequency, from which the system builds a utility matrix. This approach is more popular than content-based filtering when not enough item profiles are available. Collaborative filtering includes two families of methods: item-item methods and latent factor models.

3. Hybrid Recommendation Systems
Combining content-based filtering and collaborative filtering often yields more effective results. Possible combinations include running the content-based and collaborative methods separately and merging their results, or using the content-based method to narrow the scope and collaborative filtering to make the final prediction.

3 Data and toolchain

The dataset used in this project consists of user music reviews from amazon.com (size 6 GB) and contains 6,396,350 reviews. Each review includes product and user information, a rating, and a plain-text review. The item description file (size 1.8 GB) includes productId and product descriptions, which is used in our hybrid model. Details of the data source can be found at https://snap.stanford.edu/data/web-Amazon.html.

Spark (MLlib, GraphFrames, GraphX) is used for data mining, Pandas and NumPy for data post-processing, and Matplotlib for visualization and analysis.

4 Methods

4.1 Global Average Prediction

We start with the simplest and most intuitive method. The purpose of global average prediction is to predict r_{xi}, the missing rating of music item i from user x. We randomly divide the dataset into training and prediction parts.

• Training process: we calculate three averages, \mu, b_x, and b_i, where \mu is the global average rating, b_x is the rating deviation of user x (b_x = average rating of user x − \mu), and b_i is the rating deviation of item i (b_i = average rating of item i − \mu).

• Prediction process: the missing value r_{xi} is estimated by the following equation:

r_{xi} = \mu + b_x + b_i    (1)

4.2 Collaborative filtering: latent factor

As in global average prediction, the goal of the latent factor model is to predict r_{xi}. Here, user-item interactions are described by a set of latent factors, and each missing value is predicted by the product of the corresponding factor vectors. We randomly divide the dataset into training and prediction parts, then use spark.mllib.recommendation.ALS, which implements the alternating least squares (ALS) algorithm, to learn these latent factors.

• Training process: train a MatrixFactorizationModel on the training dataset with spark.mllib.recommendation.ALS.

• Prediction process: predict missing values with the trained model.

4.3 Collaborative filtering: item-item based

Item-item based collaborative filtering is implemented here with the intuition that items are simpler than users (who often have multiple tastes) (Leskovec et al., 2014). The procedure is as follows.

1. If a user rated an item multiple times, use the average of those ratings;
2. From there, keep only items associated with at least 30 distinct users and users associated with at least 30 distinct items, and ignore an item if all of its ratings are identical;
3. Randomly sample 3% of the items for validation; the remaining items are used for training;
4. For each item in the validation set, calculate its cosine similarity with each item in the training set (normalizing each item by subtracting its average rating):

s_{xy} = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \, \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}    (2)

where S_{xy} is the set of users who rated both items x and y;

5.
Choose the 50 nearest items in the training set (if fewer than 50 are available, use all of them) and calculate the predicted rating with:

r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}    (3)

6. Besides the method described above, we use a revised version to make predictions:

r_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \, (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}    (4)

b_{xi} = \mu + b_x + b_i    (5)

where \mu is the global average rating, b_{xi} is the baseline estimate for r_{xi}, b_x is the rating deviation of user x (b_x = average rating of user x − \mu), and b_i is the rating deviation of item i (b_i = average rating of item i − \mu).

4.4 Link Analysis: Personalized Page Rank

• Global Page Rank
Page Rank was originally developed for rating the significance of web pages based on their link relationships. Each node represents a website, and each directed edge represents a reference from the source node to the destination node. Global Page Rank yields a stationary distribution over the nodes through the following iteration (Brin and Page, 1998):

x' = (1 - \alpha) A x + \alpha E    (6)

where x' is the distribution vector of the next iteration, x is the distribution vector of the current iteration, \alpha is the probability of a random jump, and A is the transition matrix. In global Page Rank, E is a vector of equal values summing to one.

• Personalized Page Rank
Personalized Page Rank (PPR) is a variant of global Page Rank that jumps back to one or more
starting nodes rather than wandering "worldwide". The surfing route in PPR tends to stay near the starting node(s): compared with global Page Rank, PPR performs a localized random walk, as shown in Figure 1(a). In PPR, \alpha in Equation 6 is still a constant representing the jump probability, but the elements of E now carry the source information: all elements of E are zero except those for the starting node(s) we are interested in.

Figure 1: (a) Personalized Page Rank – localized random walk. (b) PPR graph (users u_1, ..., u_n and items m_1, ..., m_k) for the recommendation system.

• Personalized Page Rank for Recommendation
When using PPR for a recommendation system, each item m and each user u is represented by a node. If there is an item-user interaction, we add bidirectional edges between the item and the user. This construction results in a bipartite graph, as in Figure 1(b). We can compute the PPR vector of each user or item, but only the item weights resulting from a PPR run started at a user node give a meaningful recommendation. Generally, most values in the result vector x will be close to zero; the remaining distinguished item nodes are the ones the system recommends to the starting user node (Gori and Pucci, 2007).

The last question is what counts as an item-user interaction. Since the weight of each node is determined by the number of related links, an item-user interaction should reflect the user's positive attitude towards the item. Given the music reviews dataset, we only keep the (User_x, Item_i, r_{xi}) triples in which r_{xi} − \bar{R}_x is greater than or equal to 0.2, where \bar{R}_x is the average rating of user x. For each valid triple, we add bidirectional edges between the user and the item.

• Training process
Randomly sample 10 percent of the valid user-item edges (the corresponding items are the ones the system should recommend) and construct the graph with the remaining 90 percent of valid edges.
Run PPR starting from user i and obtain the weighted value of each item node in the graph.

• Prediction process
Recommend the items with the top weights to user i.

4.5 Ensemble method

Ensemble-based methods have proven successful in previous contests such as the Netflix Grand Prize (Koren, 2009). In this project, we implement an ensemble method using the results from the latent factor model and the item-item CF model. The predicted rating is calculated as a linear weighted sum over the two models:

\hat{r} = \sum_{i=1}^{n} w_i \hat{r}_i    (7)

where n = 2 in this case, and w_1 and w_2 are the weights for the latent factor model and the item-item CF model, respectively.

Figure 2: Ensemble method: the weights are fitted with the least-squares method on a 25% split of the predictions and evaluated on the remaining 75%.

Two different approaches are implemented (Adomavicius and Kwon, 2007). The first simply uses the average of the two models' predictions as the new prediction, i.e., w_1 = 0.5 and w_2 = 0.5. The second calculates the weights with the least-squares method (Figure 2). To achieve this, the predicted ratings from both models are randomly split into two groups (25%/75%). We use the 25% group to solve the equation:

A w = b    (8)

where A is an m × 2 matrix and m is the number of predictions. The first and second columns of A are the predicted ratings from the latent factor and item-item CF models, respectively, and b holds the original ratings. w is solved with the least-squares method. We then compute new predicted ratings with w on the 75% group and evaluate the performance by calculating the RMSE (see next section).
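The least-squares weighting of Equation 8 can be sketched in a few lines of NumPy. The ratings below are invented toy values for illustration, not the project's data:

```python
import numpy as np

# Toy predicted ratings (hypothetical values):
# column 0 = latent factor model, column 1 = item-item CF model.
A = np.array([[4.1, 4.6],
              [2.8, 2.2],
              [3.9, 4.8],
              [1.5, 1.1],
              [4.4, 5.0]])
b = np.array([4.5, 2.0, 5.0, 1.0, 5.0])  # original ratings (the held-out split)

# Solve A @ w ~= b in the least-squares sense (Eq. 8).
w, *_ = np.linalg.lstsq(A, b, rcond=None)

# Blend the two models' predictions with the fitted weights (Eq. 7).
ensemble = A @ w
rmse = np.sqrt(np.mean((ensemble - b) ** 2))
print(w, rmse)
```

By construction, the fitted weights can never do worse (on the fitting split) than the fixed 0.5/0.5 average, since w = (0.5, 0.5) is one of the candidates the least-squares solver considers.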
4.6 Evaluation metrics

• Root-mean-square error:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{r}_i - r_i)^2}    (9)

• Top recommended ratio (used only in the link-analysis evaluation):

R_t = \frac{n_{ppr}}{n_{val}}, \quad 0 \le R_t \le 1    (10)

where n_{val} is the number of items that should be recommended before PPR (these items correspond to the item nodes of the deleted edges), and n_{ppr} is the number of items actually recommended after PPR that are among the original n_{val} to-be-recommended items. n_{ppr} therefore counts how many correct recommendation decisions have been made.

4.7 Hybrid Recommendation: combining content-based and collaborative filtering

The idea of the hybrid recommendation used in this project is to use a content-based method to group the data and narrow down the calculation range, and then apply item-item collaborative filtering on the subgroup to which the target item belongs (Li and Kim, 2003). The method consists of four steps, described below.

1. Group items based on item descriptions
The purpose of this step is to cluster the items, which narrows down the calculation range for item-item collaborative filtering. Grouping proceeds in three stages. First, we preprocess the item descriptions: removing stop words, tokenizing, and stemming the texts. Second, we train a TF-IDF model and use the TF-IDF values as the term weights of the term-document matrix. Finally, we apply singular value decomposition (SVD) to the term-document matrix to obtain a relation value and assign each item to a group based on that value.
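The grouping pipeline just outlined (preprocess, TF-IDF, SVD) can be sketched on a toy example. The item descriptions below are invented, and plain NumPy stands in for the LsiModel used in the project:

```python
import numpy as np

# Hypothetical mini "item descriptions" standing in for the real ones.
docs = [
    "classic piano sonata recording",
    "piano concerto classic recording",
    "classic piano recital recording",
    "heavy metal guitar album",
    "metal guitar live album",
]

# Build a term-document count matrix.
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# TF-IDF weighting: tf * log(N / df).
df = (counts > 0).sum(axis=0)
tfidf = counts * np.log(len(docs) / df)

# Truncated SVD (the project uses rank 10; rank 2 suffices for this toy data).
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
topics = U[:, :2] * s[:2]          # document-topic relation matrix

# Assign each document to the topic with the highest |relation| value.
groups = np.abs(topics).argmax(axis=1)
print(groups)
```

On this toy corpus the "piano" descriptions end up in one group and the "metal" descriptions in the other, mirroring how the real pipeline partitions the 16,035 music items into subgroups.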
• Step 1: Preprocess data
(a) Input: music item descriptions
(b) Tokenize and lowercase the input
(c) Remove stop words from the input
(d) Stem the input
(e) Output: preprocessed text

• Step 2: Train the TF-IDF model
(a) Train a TF-IDF model on the preprocessed text
(b) Convert the weights in the item-document matrix into TF-IDF values with the trained model
(c) Output: item-document TF-IDF matrix

• Step 3: Train the LSI model and group the data
(a) Use LsiModel to perform a rank-10 SVD on the item-document TF-IDF matrix
(b) Obtain the document-topic relation matrix
(c) For each document, choose the group number with the highest relation index
(d) Assign each document to its group number
(e) Output: list of (group number, document)

2. Item-item collaborative filtering on grouped data
After assigning each document to a subgroup, we run item-item collaborative filtering on the grouped data.

• Step 4: Item-item collaborative filtering on grouped data
(a) Find the subgroup to which the target item belongs
(b) Run item-item CF on the chosen subgroup
(c) Output: the predicted value for the target item

5 Results and discussion

5.1 Global Average Prediction

For global average prediction, we randomly divide the dataset into an 80 percent training part and a 20 percent prediction part, and use the root-mean-square error (RMSE) to evaluate accuracy. We regard global average prediction as the baseline model, which gives an RMSE of 0.91. Starting from this baseline, the following sections introduce several improvements.

5.2 Collaborative filtering: latent factor

As for global average prediction, we randomly divide the dataset into an 80%/20% training/validation split.

As shown in Table 1, accuracy increases with higher rank and iteration numbers. This is reasonable, since a higher rank keeps more information during the matrix factorization process; similarly, a higher iteration number results in more accurate matrix factorization and recovers more accurate concepts.
There is no free lunch, however: blindly increasing the rank or iteration count may exhaust memory and cause stack overflows in Spark.

The results are shown in Figure 3. The four subplots show the results of the different rank-iteration combinations. The
rank  iteration  RMSE
10    10         0.76
15    10         0.73
15    15         0.70
20    20         0.69

Table 1: Latent factor result summary.

Method                   RMSE
Without baseline rating  0.39
With baseline rating     0.35

Table 2: Item-item CF model results. A total of 803 items are used for evaluation.

prediction errors (prediction error = predicted rating − original rating) are spread out around the ground-truth value. As the original rating values increase, most prediction errors change from positive to negative, a tendency common to all subplots. The latent factor model has no bias towards predicting high or low rating values.

Figure 3: Latent factor outputs with rank-iteration of: (a) 10-10, (b) 15-10, (c) 15-15, (d) 20-20.

5.3 Collaborative filtering: item-item based

The results of the item-item CF model are shown in Table 2. Adding baseline ratings reduces the RMSE by about 10%. The statistics of the two methods are shown in Figure 4. Original ratings that are not integers (e.g., 2.5) constitute less than 10% of the total ratings and are not shown in the result. Both methods deviate substantially from the original ratings for items rated 1 and 2. For the method without baseline ratings, all of the outliers underpredict at rating = 5 (Figure 4(a)). After adding baseline ratings, the outliers at rating = 5 are distributed both below and above 5, and the average predictions at ratings 1 and 2 are brought closer to the original ratings (Figure 4(b)). Hence adding the baseline rating can be interpreted as a way to reduce noise in the item-item CF model.

Figure 4: Results of the item-item CF model: (a) without baseline rating, (b) with baseline rating. Red squares indicate average predicted ratings. Blue dots indicate outliers outside of the [5, 95] percentiles. Thick red lines indicate the 1:1 ratio.

5.4 Hybrid Recommendation: combining content-based and collaborative filtering

The music data includes 29,476 distinct items.
The item description file covers 154,310 distinct items, of which only 16,035 are music items. Our content-based method analyzes these 16,035 music item descriptions and groups them into 5 groups. Content-based filtering thus narrows the calculation range for item-item collaborative filtering down to at most 31.8% of the items, and the hybrid system achieves an average RMSE of 0.77 over the 5 subgroups, as shown in Table 3.

5.5 Ensemble method

The ensemble method results are shown in Table 4. Because the CF and latent factor models use different random sampling methods, the outputs of the two models share 393 items, which are used for evaluation. The item-item CF model with baseline produces a lower RMSE than
Value name      Value
rank            5
group1 percent  16.1%
group2 percent  31.8%
group3 percent  20.3%
group4 percent  17.9%
group5 percent  13.9%
avg RMSE        0.77

Table 3: Group percentages and average RMSE over the groups.

Method                           RMSE
CF item-item with baseline       0.34
Latent factor: iter=20, rank=20  0.47
Ensemble: averaging weight       0.36
Ensemble: least-squares weight   0.33

Table 4: Ensemble method results. A total of 393 items are used for evaluation.

the latent factor model (Table 4). The ensemble model with the averaging-weight method yields RMSE = 0.36, which lies between the item-item CF and latent factor models.

Figure 5: Comparison of the latent factor and item-item CF models.

A comparison of the predicted ratings of the item-item CF and latent factor models is shown in Figure 5. Both models' predicted ratings deviate from the 1:1 line, which indicates that simply averaging the predicted ratings may not improve the ensemble results. We therefore calculate the weights with the least-squares method. The resulting weights are 0.82 and 0.18 for the item-item CF and latent factor models, respectively, meaning the item-item CF model contributes more to making better predictions, consistent with its lower RMSE. With these weights, the ensemble model achieves RMSE = 0.33, lower than both the item-item CF and latent factor models.

users  α     n_ppr  n_val  R_t
300    0.1   880    1658   0.53
300    0.25  933    1586   0.59
300    0.5   610    1749   0.35
1000   0.1   2886   5691   0.51
1000   0.25  3209   5624   0.57
1000   0.5   2073   5495   0.38

Table 5: PPR result summary.

5.6 Link Analysis: Personalized Page Rank

In this part, we run one personalized PageRank per user. The system recommends the top-weighted items with weight greater than 0.0001 out of the total 29,476 items. Commonly there are fewer than 100 nodes with weight greater than 0.001, and in terms of item nodes the recommendation scope is even smaller. n_ppr is the number of correct recommendations the system provides.
n_val is the maximum number of correct recommendations we can get, which is also the number of user-item edges of the starting user nodes that were deleted during training. R_t is the ratio of n_ppr to n_val, representing the proportion of correct recommendations. In this project, a correct recommendation means the original rating r_{xi} from user x to item i is at least 0.2 greater than the average rating \bar{R}_x of user x.

As shown in Table 5, the 300-user and 1000-user runs (running PPR from 300 and 1000 different starting nodes) show similar results. α = 0.25 gives the best recommendations. When α is too small (in our case α = 0.1), personalized PageRank reduces to general PageRank, which captures global popularity rather than popularity specific to a certain user. When α is too large (in our case α = 0.5), PPR cannot gather enough link information from around the node: the large α forces the random walk back to the source node so often that it loses useful information at larger scales.

6 Conclusion

In this project, we use the global average and latent factor models as baseline methods to predict numerical ratings. Starting from them, item-item collaborative filtering and result ensembling show a significant accuracy improvement. Our content-based method narrows down the calculation range. PPR focuses on predicting the right items rather than numerical ratings, and our experiments demonstrate that α matters a great deal to the model, both practically and theoretically.
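The role of α can be made concrete with a minimal power-iteration sketch of Equation 6 on a tiny, invented user-item bipartite graph (pure NumPy; the graph and node layout are hypothetical):

```python
import numpy as np

def personalized_pagerank(A, start, alpha=0.25, iters=100):
    """Power iteration for x' = (1 - alpha) * A @ x + alpha * E (Eq. 6),
    where E is a one-hot restart vector at the starting node."""
    n = A.shape[0]
    E = np.zeros(n)
    E[start] = 1.0
    x = E.copy()
    for _ in range(iters):
        x = (1 - alpha) * A @ x + alpha * E
    return x

# Hypothetical bipartite graph: users u0, u1 (nodes 0-1), items m0-m2 (nodes 2-4).
# Bidirectional edges: u0-m0, u0-m1, u1-m1, u1-m2.
adj = np.zeros((5, 5))
for u, m in [(0, 2), (0, 3), (1, 3), (1, 4)]:
    adj[u, m] = adj[m, u] = 1.0
A = adj / adj.sum(axis=0)          # column-stochastic transition matrix

x = personalized_pagerank(A, start=0)
# Items ranked for user u0: m0 and m1 (linked to u0) should outrank m2.
print(x[2:])
```

Raising α pulls the walk back to the starting user more often, concentrating weight on nearby items; lowering it toward zero recovers the global popularity ranking, matching the behavior observed in Table 5.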
References

[Adomavicius and Kwon 2007] Gediminas Adomavicius and YoungOk Kwon. 2007. New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3):48–55.

[Aggarwal 2016] Charu C. Aggarwal. 2016. Recommender Systems: The Textbook. Springer.

[Brin and Page 1998] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, pages 107–117.

[Gori and Pucci 2007] Marco Gori and Augusto Pucci. 2007. ItemRank: A random-walk based scoring algorithm for recommender engines. In IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2766–2771.

[Koren 2009] Yehuda Koren. 2009. The BellKor solution to the Netflix Grand Prize. Netflix Prize documentation, 81:1–10.

[Leskovec et al. 2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.

[Li and Kim 2003] Qing Li and Byeong Man Kim. 2003. An approach for combining content-based and collaborative filters. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, Volume 11, pages 17–24. Association for Computational Linguistics.

[Shapira et al. 2011] Bracha Shapira, Francesco Ricci, Paul B. Kantor, and Lior Rokach. 2011. Recommender Systems Handbook. Springer.