Recommendation System on Amazon Music Reviews
Han Li
hanli1@cs.stonybrook.edu
Luping Su
luping.su@stonybrook.edu
Jiewen Zheng
jiewen.zheng@stonybrook.edu
Abstract
In this project we implement a variety of recommendation system models on music review data from amazon.com. The global averaging method produces a root-mean-square error (RMSE) of 0.91, which serves as a baseline. The hybrid model outputs an average RMSE of 0.77 over 5 subgroups. The latent factor model reduces RMSE from 0.76 to 0.69 when increasing rank=10, iteration=10 to rank=20, iteration=20. The item-item based collaborative filtering (CF) model generates RMSE of 0.39 and 0.35 without and with baseline ratings, respectively. The ensemble model, which combines outputs from the latent factor and item-item CF models, reduces the RMSE further down to 0.33 when the weights are calculated with the least-squares method.
1 Introduction
Recommendation systems (RSs) consist of algorithms and techniques used to suggest to users the items they are most likely to be interested in (Shapira et al., 2011). RSs are especially important for e-commerce businesses, which usually offer an overwhelming number of items for users to choose from, creating the long tail phenomenon (Leskovec et al., 2014). RSs have also become popular in other diverse fields such as social tagging, research article recommendation, and search queries (Aggarwal, 2016).
Popular RS approaches include item-item collaborative filtering, content-based filtering, latent factor models, and ensemble models (see next section) (Shapira et al., 2011). In addition, personalized page rank has attracted attention in recent years; it models the user-item relation as a bipartite graph and makes recommendations based on the weight of each item node (Gori and Pucci, 2007). In this project we implement each of the above RS models and compare their performance.
2 Background
Popular approaches of RS include:
1. Content Based Filtering
Content-based systems extract properties of the items to be recommended. In a content-based system, the key part is building item profiles and user profiles from these properties. An item is recommended to a user based on the similarity between the user's profile and the item's profile.
2. Collaborative Filtering
Collaborative filtering uses user-item interaction information, such as text reviews, numeric ratings of items, and purchase frequency, from which the system builds a utility matrix. This approach is more popular than content-based filtering when not enough item profile information is available. Collaborative filtering includes the following two variants: the item-item method and the latent factor model.
3. Hybrid Recommendation Systems
Combining content-based filtering and collaborative filtering often yields more effective results. Possible combinations include running content-based and collaborative methods separately and merging their results, or using the content-based method to narrow the scope and collaborative filtering to make the accurate prediction.
3 Data and toolchain
The dataset used in this project consists of user music reviews from amazon.com (size 6 GB). The data contains 6,396,350 reviews. Each review includes product and user information, a rating, and a plain text review. The item description file (size 1.8 GB) includes productId and product descriptions, which are used in our hybrid model. Details of the data source can be found at https://snap.stanford.edu/data/web-Amazon.html.
Spark (MLlib, Graphframes, GraphX) is used for data mining. Pandas and numpy are used for data post-processing. Matplotlib is used for data visualization and analysis.
4 Methods
4.1 Global Average Prediction
We start with the simplest and most intuitive method. The purpose of global average prediction is to predict rxi, the missing rating of music i from user x. In this part, we randomly divide the dataset into training and prediction parts.
• training process
We calculate three kinds of averages, µ, bx, and bi, where µ is the global average rating, bx is the rating deviation of user x (bx = average rating of user x − µ), and bi is the rating deviation of item i (bi = average rating of item i − µ).
• prediction process
The missing value rxi is calculated by the following equation:

rxi = µ + bx + bi (1)
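The training and prediction steps above can be sketched in plain Python (the toy rating triples and the dictionary layout are our own illustration; the project computes the same averages in Spark):

```python
from collections import defaultdict

# (user, item, rating) triples; toy data standing in for the Amazon reviews
ratings = [("u1", "m1", 5.0), ("u1", "m2", 3.0),
           ("u2", "m1", 4.0), ("u2", "m3", 2.0)]

mu = sum(r for _, _, r in ratings) / len(ratings)  # global average rating

user_ratings, item_ratings = defaultdict(list), defaultdict(list)
for u, i, r in ratings:
    user_ratings[u].append(r)
    item_ratings[i].append(r)

# deviations b_x and b_i relative to the global average
b_x = {u: sum(rs) / len(rs) - mu for u, rs in user_ratings.items()}
b_i = {i: sum(rs) / len(rs) - mu for i, rs in item_ratings.items()}

def predict(u, i):
    """Equation (1): r_xi = mu + b_x + b_i."""
    return mu + b_x.get(u, 0.0) + b_i.get(i, 0.0)
```

Unseen users or items fall back toward the global average µ, since their deviation defaults to zero.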
4.2 Collaborative filtering: latent factor
As with global average prediction, the purpose of the latent factor model is to predict rxi. In this part, user-item interactions are described by a set of latent factors, and every missing value can be predicted by the product of the corresponding factor vectors. We randomly divide the dataset into training and prediction parts, then use spark.mllib.recommendation.ALS, which implements the alternating least squares (ALS) algorithm, to learn these latent factors.
• training process
Train a MatrixFactorizationModel on the training dataset with spark.mllib.recommendation.ALS.
• prediction process
Predict the missing values with the trained model.
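The project relies on Spark's ALS implementation; as a hedged illustration of the underlying algorithm, here is a minimal numpy sketch of alternating least squares on a toy utility matrix (the matrix, rank, regularization term lam, and iteration count are all our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy utility matrix: rows are users, columns are items, 0 marks a missing rating
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 2.0],
              [0.0, 4.0, 3.0]])
mask = R > 0
rank, lam = 2, 0.1            # number of latent factors, ridge regularization

U = rng.normal(scale=0.1, size=(R.shape[0], rank))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], rank))  # item factors

for _ in range(20):  # alternate: fix V and solve for U, then fix U and solve for V
    for x in range(R.shape[0]):
        Vx = V[mask[x]]
        U[x] = np.linalg.solve(Vx.T @ Vx + lam * np.eye(rank),
                               Vx.T @ R[x, mask[x]])
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(rank),
                               Ui.T @ R[mask[:, i], i])

pred = U @ V.T  # every r_xi, including the missing ones
```

Each inner solve is a small ridge regression, which is why increasing the rank or the iteration count raises memory and compute cost, as noted in the results.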
4.3 Collaborative filtering: item-item based
Item-item based collaborative filtering is implemented here with the intuition that items are simpler than users (who often have multiple tastes) (Leskovec et al., 2014). The procedure is described as follows.
1. If a user rated an item multiple times, use the average of all the ratings;
2. From there, filter to items associated with at least 30 distinct users and to users associated with at least 30 distinct items; ignore an item if all of its ratings are the same;
3. Randomly sample 3% of the items for validation; the rest of the items are used for training;
4. For each item in the validation set, calculate its cosine similarity with the items in the training set (normalizing each item by subtracting its average rating):
Sxy = Σs∈Sxy (rxs − r̄x)(rys − r̄y) / ( √(Σs∈Sxy (rxs − r̄x)²) · √(Σs∈Sxy (rys − r̄y)²) ) (2)
5. Choose the 50 nearest items in the training set (if fewer than 50 are available, use all) and calculate the predicted rating with:
rxi = ( Σj∈N(i;x) Sij × rxj ) / ( Σj∈N(i;x) Sij ) (3)
6. Besides the method described above, we use a revised
version to make predictions:
rxi = bxi + ( Σj∈N(i;x) Sij × (rxj − bxj) ) / ( Σj∈N(i;x) Sij ) (4)

bxi = µ + bx + bi (5)
where µ is the global average rating, bxi is the baseline estimate for rxi, bx is the rating deviation of user x (bx = average rating of user x − µ), and bi is the rating deviation of item i (bi = average rating of item i − µ).
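Equations (2)–(5) can be sketched with numpy on a toy item-by-user matrix (the matrix and the in-memory layout are our assumptions; the project applies the same logic at scale in Spark):

```python
import numpy as np

# toy utility matrix: rows are items, columns are users; np.nan marks missing
R = np.array([[5.0, 4.0, np.nan, 3.0],
              [4.0, np.nan, 4.0, 2.0],
              [np.nan, 5.0, 3.0, 4.0]])

mu = np.nanmean(R)                                  # global average
b_i = np.nanmean(R, axis=1) - mu                    # item deviations
b_x = np.nanmean(R, axis=0) - mu                    # user deviations

def sim(a, b):
    """Equation (2): centered cosine over users who rated both items."""
    common = ~np.isnan(R[a]) & ~np.isnan(R[b])
    ra = R[a, common] - np.nanmean(R[a])
    rb = R[b, common] - np.nanmean(R[b])
    denom = np.linalg.norm(ra) * np.linalg.norm(rb)
    return float(ra @ rb / denom) if denom > 0 else 0.0

def predict(i, x, k=50):
    """Equation (4): baseline estimate plus similarity-weighted deviations."""
    b_xi = mu + b_x[x] + b_i[i]                     # equation (5)
    neighbors = [(sim(i, j), j) for j in range(R.shape[0])
                 if j != i and not np.isnan(R[j, x])]
    neighbors = sorted(neighbors, reverse=True)[:k]
    num = sum(s * (R[j, x] - (mu + b_x[x] + b_i[j])) for s, j in neighbors)
    den = sum(s for s, _ in neighbors)
    return b_xi + num / den if den != 0 else b_xi
```

Note that in equation (2) the indices x, y run over items and s over the users who rated both of them.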
4.4 Link Analysis: Personalized Page Rank
• Global Page Rank
Page Rank was originally developed for rating the significance of web pages based on their link relationships. Each node represents a website and each directed edge represents a reference from the source node to the destination node. Global Page Rank produces a 'stationary' distribution over the nodes by iterating the following equation (Brin and Page, 1998):

x′ = (1 − α) × Ax + α × E (6)

where x′ is the distribution vector of the next iteration, x is the distribution vector of the current iteration, α is the probability of jumping (teleporting), and A is the transition matrix. In global Page Rank, E is a vector of equal values summing to one.
• Personalized Page Rank
Personalized Page Rank (PPR) is a specific version of global Page Rank in which the walker jumps back to one or more starting nodes rather than teleporting "worldwide". The surfing route in PPR tends to stay around the starting node(s); compared with global Page Rank, PPR performs a localized random walk, as shown in Figure 1(a).
In PPR, α in equation 6 is still a constant representing the jump probability. However, the elements of E now have different values, which represent the source information: all elements of E are zero except those for the starting node(s) we are interested in.
Figure 1: (a) Personalized Page Rank – localized random walk. (b) PPR graph for a recommendation system.
• Personalized Page Rank for Recommendation
When using PPR for a recommendation system, each item m and each user u is represented by a node. If there is an item-user interaction, we add bidirectional edges between the item and the user. Such a construction results in a bipartite graph, as in Figure 1(b). We can compute the PPR vector of each user or item, but only the item weights resulting from a PPR run starting at a user node give a meaningful recommendation result. Generally, most of the values in the result vector x will be close to zero; the remaining distinguished item nodes are the ones the system will recommend to the starting user node (Gori and Pucci, 2007).
The last question is what counts as an item-user interaction. Since the weighted value of each node is determined by the number of related links, the item-user interaction should reflect a user's positive attitude towards an item. Given the music reviews dataset, we only keep the (Userx, Itemi, rxi) triples in which rxi − Rx is greater than or equal to 0.2, where Rx is the average rating of user x. For each valid triple, we add bidirectional edges between the user and the item.
• training process
Randomly sample 10 percent of the valid user-item edges (the corresponding items are the ones the system should recommend) and construct the graph with the remaining 90 percent of valid edges. Run PPR starting from user i and get the weighted value of each item node in the graph.
• prediction process
Recommend the items with top weighted values to user i.
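The training-time walk can be sketched as a small power iteration of equation (6) on a toy bipartite graph (the edge list, iteration cap, and tolerance are our assumptions; the project runs PPR through Spark GraphX/Graphframes):

```python
import numpy as np

# nodes 0-1 are users, nodes 2-4 are items; bidirectional user-item edges
edges = [(0, 2), (0, 3), (1, 3), (1, 4)]
n, alpha = 5, 0.25            # node count, jump-back probability

# column-stochastic transition matrix A built from the bidirectional edges
A = np.zeros((n, n))
for u, m in edges:
    A[m, u] = A[u, m] = 1.0
A /= A.sum(axis=0)

E = np.zeros(n)
E[0] = 1.0                    # teleport only to the starting user node

x = np.full(n, 1.0 / n)
for _ in range(100):          # iterate equation (6) to convergence
    x_next = (1 - alpha) * A @ x + alpha * E
    if np.abs(x_next - x).sum() < 1e-10:
        x = x_next
        break
    x = x_next

# recommend the items (nodes 2-4) with the largest PPR weight
best_item = 2 + int(np.argmax(x[2:]))
```

On this toy graph, item node 3, which user 0 rated and which is also endorsed by user 1, ends up with the largest weight among the item nodes.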
4.5 Ensemble method
Ensemble-based methods have proven successful in previous contests such as the Netflix Grand Prize (Koren, 2009). In this project, we implement an ensemble method using the results from the latent factor model and the CF item-item based model. The predicted rating is calculated as a linear weighted sum over the two models:
r̂ = Σi=1..n wi × r̂i (7)
where n = 2 in this case. w1 and w2 represent the weights for the latent factor model and the CF item-item based model, respectively.
Figure 2: Ensemble method: weights are calculated using the least-squares method.
Two different approaches are implemented (Adomavicius and Kwon, 2007). The first is to simply use the average of the two models' predictions as the new prediction, i.e., w1 = 0.5 and w2 = 0.5. The second is to calculate the weights using the least-squares method (Figure 2). To achieve this, the predicted ratings from both models are randomly split into two groups (25%/75%). We use 25% of the results to solve the equation:

A × w = b (8)

where A is an m×2 matrix and m is the number of predictions. The first and second columns of A are the predicted ratings from the latent factor and CF item-item based models, respectively. b is the vector of original ratings. w is solved by the least-squares method. Next we calculate the new predicted ratings using w on the 75% group and evaluate its performance by calculating the RMSE (see next section).
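The least-squares weight fitting of equation (8) can be sketched with numpy (the synthetic predictions and the fixed split indices are our own; the project splits randomly):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy ground-truth ratings and two imperfect model predictions
b = rng.uniform(1, 5, size=200)              # original ratings
latent = b + rng.normal(0, 0.5, size=200)    # latent factor predictions
itemcf = b + rng.normal(0, 0.3, size=200)    # CF item-item predictions

fit = slice(0, 50)                           # 25% used to fit the weights
holdout = slice(50, 200)                     # 75% used for evaluation

A = np.column_stack([latent[fit], itemcf[fit]])
w, *_ = np.linalg.lstsq(A, b[fit], rcond=None)  # solve A w ≈ b (equation 8)

ensemble = np.column_stack([latent[holdout], itemcf[holdout]]) @ w
rmse = float(np.sqrt(np.mean((ensemble - b[holdout]) ** 2)))
```

The solver naturally assigns the larger weight to the less noisy model, which mirrors the 0.82/0.18 split reported in the results.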
4.6 Evaluation metric
• Root-mean-square error
RMSE = √( (1/n) × Σi=1..n (r̂i − ri)² ) (9)
• Top recommended ratio
The top recommended ratio will only be used in the link analysis evaluation.
Rt = nppr / nval,  0 ≤ Rt ≤ 1 (10)
nval is the number of items to be recommended before PPR; such items correspond to the item nodes of the deleted edges. nppr is the number of actually recommended items after PPR that are also among the original nval to-be-recommended items. nppr shows how many correct recommendation decisions have been made.
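Both metrics reduce to a few lines; a sketch in plain Python (the function names are our own):

```python
import math

def rmse(predicted, original):
    """Equation (9): root-mean-square error over n predictions."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, original)) / n)

def top_recommended_ratio(recommended, held_out):
    """Equation (10): fraction of held-out items that PPR actually recommends."""
    n_ppr = len(set(recommended) & set(held_out))
    return n_ppr / len(held_out)
```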
4.7 Hybrid Recommendation: combining content-based and collaborative filtering
The idea of the hybrid recommendation used in this project is: use the content-based method to group the data and narrow down the calculation range, then use item-item collaborative filtering on the subgroup to which the target item belongs (Li and Kim, 2003). This method consists of 4 steps, as shown below.
1. Group items based on item descriptions
The purpose of this step is to cluster the items into several groups, which narrows down the calculation range for item-item collaborative filtering.
We use three steps to finish the grouping. The first step preprocesses the item descriptions, including removing stop words, tokenizing, and stemming the texts. The second step trains a tf-idf model and calculates the tf-idf values as the term weights of the term-document matrix. Finally, we apply singular value decomposition (SVD) to the term-document matrix to get relation values, and assign each item to a group based on those values.
• Step 1: Preprocessing Data
(a) Input: music item descriptions
(b) Tokenize and lowercase input
(c) Remove stopwords from input
(d) Stem input
(e) Output: preprocessed text
• Step 2: Train tf-idf Model
(a) Train the tf-idf model with the preprocessed text
(b) Convert the weights in the item-document matrix into tf-idf values through the trained model
(c) Output: item-document tf-idf matrix
• Step 3: Train LSI model and group data
(a) Use LsiModel to do a rank-10 SVD on the item-document tf-idf matrix
(b) Get the document-topic relation matrix
(c) For each document, choose the group number with the highest relation index
(d) Assign each document to its group number
(e) Output: list of (group number, document)
2. Item-item collaborative filtering on grouped data
After assigning each document to a subgroup, we implement item-item collaborative filtering on the grouped data.
• Step 4: item-item collaborative filtering on grouped data
(a) Find the subgroup to which the target item belongs
(b) Use item-item CF on the chosen subgroup
(c) Output: the predicted value for the target item
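A numpy-only sketch of the grouping idea in Steps 2–3, where a hand-rolled tf-idf weighting and a rank-2 SVD stand in for the trained tf-idf and LsiModel pipeline (the toy corpus and the rank are our assumptions; the project uses rank 10):

```python
import numpy as np

# toy preprocessed item descriptions (already tokenized and stemmed)
docs = [["jazz", "piano", "album"],
        ["rock", "guitar", "album"],
        ["jazz", "sax", "album"],
        ["rock", "drum", "live"]]

vocab = sorted({t for d in docs for t in d})
tf = np.array([[d.count(t) for t in vocab] for d in docs], dtype=float)

# tf-idf weighting of the term-document matrix
df = (tf > 0).sum(axis=0)
tfidf = tf * np.log(len(docs) / df)

# rank-2 SVD; rows of U give each document's relation to the 2 latent topics
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
topics = np.abs(U[:, :2])

# each document is assigned to the topic with the highest relation value
groups = topics.argmax(axis=1)
```

On this toy corpus the two "jazz" descriptions land in one group and the two "rock" descriptions in the other.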
5 Results and discussion
5.1 Global Average Prediction
In the global average prediction part, we randomly divide the dataset into an 80 percent training part and a 20 percent prediction part. Root-mean-square error (RMSE) is used to evaluate the accuracy. We regard global average prediction as a baseline model, which gives us 0.91 RMSE. Building on this reasonable result, further improvements are introduced in the following parts.
5.2 Collaborative filtering: latent factor
As with global average prediction, we randomly divide the dataset into 80%/20% training/validation.
As shown in Table 1, the accuracy increases with higher rank and iteration numbers. This is reasonable since a higher rank keeps more information during the matrix factorization process. Similarly, a higher iteration number results in a more accurate matrix factorization and restores more accurate concepts. However, there is no free lunch: blindly increasing the rank or iteration count may overwhelm the memory and easily cause stack overflow in Spark.
The results are shown in Figure 3, whose four subplots correspond to the rank-iteration combinations in Table 1. The prediction errors (prediction error = predicted rating − original rating) are spread out around the ground truth value. As the original rating values increase, most prediction errors change from positive to negative, a tendency common to all subplots. The latent factor model shows no bias toward predicting high or low rating values.

rank  iteration  RMSE
10    10         0.76
15    10         0.73
15    15         0.70
20    20         0.69

Table 1: latent factor result summary

Method                     RMSE
Without baseline rating    0.39
With baseline rating       0.35

Table 2: CF item-item based model results. A total of 803 items are used for evaluation.
Figure 3: Latent factor outputs with rank-iteration of: (a) 10-10, (b) 15-10, (c) 15-15, (d) 20-20.
5.3 Collaborative filtering: item-item based
The results of the CF item-item based model are shown in Table 2. The RMSE is reduced by 10% by adding baseline ratings. The statistics of the two methods are shown in Figure 4. Original ratings that are not integers (e.g., 2.5) constitute less than 10% of the total ratings and are not shown in the result. Both methods have large deviations from the original ratings for items with ratings 1 and 2. For the method without baseline ratings, all of the outliers underpredict at rating=5 (Figure 4(a)). After adding baseline ratings, the outliers at rating=5 are distributed in regions both less than and greater than 5, and the average ratings at rating=1 and 2 are brought closer to the original ratings (Figure 4(b)). Hence adding the baseline rating can be interpreted as a method that reduces the noise in the CF item-item model.
Figure 4: Results of CF item-item based model: (a) without baseline
rating, (b) with baseline rating. Red squares indicate average predicted
ratings. Blue dots indicate outliers outside of [5, 95] percentiles. Thick
red lines indicate 1:1 ratio.
5.4 Hybrid Recommendation: combining
content-based and collaborative filtering
The music data includes 29476 distinct items. The item description file covers 154310 distinct items, but only 16035 music items are included. Our content-based method analyzes these 16035 music item descriptions and groups them into 5 groups.
Content-based filtering narrows the calculation range for item-item collaborative filtering down to at most 31.8% of the items. The hybrid system achieves an average RMSE of 0.77 over the 5 subgroups, as shown in Table 3.
5.5 Ensemble method
The ensemble method results are shown in Table 4. Since the CF and latent factor models use different random sampling methods, the outputs from the two models share 393 items, which are used for evaluation. The CF item-item model with baseline produces a lower RMSE than the latent factor model (Table 4). The ensemble model with the averaging weight method outputs RMSE=0.36, which lies between the CF item-item and latent factor models.

Value name        Value
rank              5
group1 percent    16.1%
group2 percent    31.8%
group3 percent    20.3%
group4 percent    17.9%
group5 percent    13.9%
avg RMSE          0.77

Table 3: group percentages and average RMSE for the groups

Method                            RMSE
CF item-item with baseline        0.34
Latent factor: iter=20, rank=20   0.47
Ensemble: averaging weight        0.36
Ensemble: least-squares weight    0.33

Table 4: Ensemble method results. A total of 393 items are used for evaluation.
Figure 5: Comparison of latent factor and CF item-item based models.
A comparison of the predicted ratings between the CF item-item and latent factor models is shown in Figure 5. Both models have predicted ratings that deviate from the 1:1 line. This indicates that simply averaging the predicted ratings may not improve the ensemble results. We therefore resort to calculating the weights using the least-squares method. The weights are 0.82 and 0.18 for the CF item-item and latent factor models, respectively. This means the CF item-item model contributes more to making better predictions, consistent with its lower RMSE value. With these weights, the ensemble model achieves an RMSE of 0.33, lower than both the CF item-item and latent factor models.
users α nppr nval Rt
300 0.1 880 1658 0.53
300 0.25 933 1586 0.59
300 0.5 610 1749 0.35
1000 0.1 2886 5691 0.51
1000 0.25 3209 5624 0.57
1000 0.5 2073 5495 0.38
Table 5: PPR result summary
5.6 Link Analysis: Personalized Page Rank
In this part, we run one personalized page rank per user. The system recommends the top weighted items with weight greater than 0.0001 out of a total of 29476 items. Commonly there are fewer than 100 nodes with weight greater than 0.001; restricted to item nodes, the recommendation scope is even smaller.
nppr is the number of correct recommendations the system provides. nval is the maximum number of correct recommendations we can get, which is also the number of user-item edges of the starting user nodes deleted during the training process. Rt is the ratio of nppr to nval, representing the percentage of right recommendations. In this project, a right recommendation means the original rating rxi from user x to item i is at least 0.2 greater than the average rating of user x, Rx.
As shown in Table 5, the 300-user and 1000-user runs (running PPR with 300 and 1000 different starting nodes) show similar results. α = 0.25 gives us the best recommendations. When α is too small, in our case α = 0.1, personalized page rank reduces to general page rank, which captures global popularity rather than popularity specific to a certain user. When α is too large, in our case α = 0.5, PPR cannot gather enough link information from the graph: the large α forces the random walk to go back to the source node too often, so it loses useful information at larger scope.
6 Conclusion
In this project, we use the global average and latent factor models as baseline methods to predict numerical ratings. Starting from them, item-item collaborative filtering and result ensembling show a significant accuracy improvement. Our content-based method narrows down the calculation range. PPR focuses on predicting the right items rather than numerical ratings. Our experiments demonstrate that α matters a lot to the model both practically and theoretically.
References
[Adomavicius and Kwon 2007] Gediminas Adomavicius and YoungOk Kwon. 2007. New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3):48–55.
[Aggarwal 2016] Charu C Aggarwal. 2016. Recommender Systems: The Textbook. Springer.
[Brin and Page 1998] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, pages 107–117.
[Gori and Pucci 2007] Marco Gori and Augusto Pucci. 2007. ItemRank: A random-walk based scoring algorithm for recommender engines. IJCAI'07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2766–2771.
[Koren 2009] Yehuda Koren. 2009. The BellKor solution to the Netflix Grand Prize. Netflix Prize documentation, 81:1–10.
[Leskovec et al. 2014] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.
[Li and Kim 2003] Qing Li and Byeong Man Kim. 2003. An approach for combining content-based and collaborative filters. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, Volume 11, pages 17–24. Association for Computational Linguistics.