Book Recommendation Engine

What should I read next?
A Book Recommendation Engine
Based on GoodReads Ratings
Team 10:
Shravani Bheema, Coco Huang, Sharon Heber, Chen Zhou, Mohit Gupta

About the dataset
Goodreads is an American social cataloging website and a
subsidiary of Amazon that allows individuals to search its
database of books, annotations, quotes, and reviews.
With a Goodreads account, you can keep track of the books
you've read, the books you're reading, and the books you want
to read.
You can also follow friends and authors to see what they're
reading, leave reviews, and comment on reviews written by
others.

About the dataset
Three separate files:
● Users: Contains basic information regarding the reader.
○ UserID, Location, Age
● Books: Contains basic information regarding the books.
○ ISBN, Title, Author, Year of publication, Publisher
● Rating: Contains all of the user rating information.
○ UserID, ISBN, Rating (1-10, 10 being the highest)

About the dataset
Full dataset:
● 1 million ratings
● 300k users
● 300k books
● 16k publishers
Columns: name of the book, author, publisher, published year, ISBN,
rating, user ID, user age, user location.
Each row is the rating given by a particular user for a specific book.
For the recommender system, we removed books with fewer than 50 ratings
and users who have rated fewer than 20 books, leaving:
● 200k ratings
● 6k users
● 5k books
● 600 publishers
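The filtering step can be sketched in pandas as follows. This is a minimal sketch, not the team's actual code: the column names `UserID`, `ISBN`, and `Rating` follow the file description, and the order of the two filters (books first, then users) is an assumption.

```python
import pandas as pd

def filter_ratings(ratings, min_book_ratings=50, min_user_ratings=20):
    """Drop rarely rated books, then drop users with few remaining ratings."""
    # Number of ratings each book received, aligned to every row.
    book_counts = ratings.groupby("ISBN")["Rating"].transform("count")
    kept = ratings[book_counts >= min_book_ratings]
    # Number of ratings each user gave among the remaining books.
    user_counts = kept.groupby("UserID")["Rating"].transform("count")
    return kept[user_counts >= min_user_ratings]
```

Filtering in the other order, or repeating both filters until nothing changes, would give slightly different counts.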

Project Goal
Explore and compare different approaches to recommending the
most relevant books to users based on their interactions
(ratings) with other books in the past or based on other similar
users' interactions with books.

Preparing Data (Data clean up)
● Data was provided in 3 separate files, and had to be merged based on ISBN and UserID before starting analysis.
● Many null values in the Age column of users (about ⅓ of users).
● Data contained random errors. For example, some publisher names were entered into the Year of Publication column,
and there were also 0s entered for Year of Publication.
● Location information was written as one long string. It was broken up into separate columns containing City,
State, and Country information for cleaner analysis.
● Each file included a lot of duplicated rows. All duplicates were removed before starting.
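The cleanup steps above could look roughly like this in pandas. It is a sketch under assumptions: the column names (`Year`, `Location`, and so on), the "city, state, country" layout of the location string, and the helper names `split_location` and `prepare` are all hypothetical.

```python
import pandas as pd

def split_location(location):
    """Split 'city, state, country' into three stripped parts (None if missing)."""
    if pd.isna(location):
        return [None, None, None]
    parts = [part.strip() for part in str(location).split(",", 2)]
    return parts + [None] * (3 - len(parts))

def prepare(users, books, ratings):
    # Remove exact duplicate rows from every file.
    users = users.drop_duplicates().copy()
    books = books.drop_duplicates().copy()
    ratings = ratings.drop_duplicates()

    # Publisher names in the year column become NaN; a year of 0 is treated as missing.
    year = pd.to_numeric(books["Year"], errors="coerce")
    books["Year"] = year.where(year > 0)

    # Break the location string into City, State, and Country columns.
    location = pd.DataFrame(
        users["Location"].map(split_location).tolist(),
        columns=["City", "State", "Country"],
        index=users.index,
    )
    users = pd.concat([users.drop(columns="Location"), location], axis=1)

    # One row per rating, with book and user details attached.
    return ratings.merge(books, on="ISBN").merge(users, on="UserID")
```

The default inner joins drop ratings whose ISBN or UserID has no match in the other files; whether the team kept such rows is not stated in the slides.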

Exploratory Data Analysis
The most popular book is
not the most well-liked book

The top 25 most popular authors show that not all
popular authors are well liked. Many authors whose
books are widely read have very low ratings for those books.
For example, Rich Shapero's book has a very low
rating because his style is loved by some
but hated by many.

Average rating of a book
per user is 7.62

62% of users are
between the ages
of 18 and 40

Recommendation systems
● A recommender system is a subclass of
information filtering system that seeks to
predict the “preference” a user would give to an
item.
What is a good recommendation?
● The one that is personalized (relevant to that
user)
● The one that is diverse (includes different user
interests)
● The one that doesn't recommend the same
items to users a second time

Collaborative Filtering
Collaborative filtering (CF) is a method for generating recommendations by calculating a user's preference score for
an item from the historical preferences for that item of other, similar users in the database. This algorithm takes
account of the explicit interaction with the item irrespective of the attributes of the item, so it is domain agnostic as
long as we have sufficient historical interaction data.
Two types of collaborative filtering (CF):
1. Memory-based collaborative filtering: uses user rating data to compute the similarities between users or items.
This technique relies heavily on simple similarity measures, such as cosine similarity, to match similar people
or items together.
2. Model-based collaborative filtering: models are developed using different data mining and machine learning
algorithms to predict users' ratings of unrated items. Popular techniques include Bayesian networks and singular
value decomposition.

Item Based Collaborative Filtering
“Since you liked this, you may also like..”
Item-item collaborative filtering is a form of collaborative filtering for
recommender systems based on the similarity between items, calculated
using people's ratings of those items. In this method, similar items form
neighbourhoods based on the behaviour of users.
What makes 2 items or books similar?
If User_A likes 3 books (or rates them highly), these 3 books are
considered similar. This process is iterated across thousands of books and
users.
This method is not based on the features/contents of the items. Similarity
scores between items are calculated using a similarity measure
such as Euclidean distance, Pearson correlation, or cosine similarity.

The Sparse Matrix
A sparse matrix or sparse array is a matrix in which
most of the elements are zero.
There is no strict definition regarding the proportion
of zero-value elements for a matrix to qualify as
sparse, but a common criterion is that the number of
non-zero elements is roughly equal to the number of
rows or columns.
The sparsity of our
matrix is 98.3%.
On average, a user has only read
28 books among 6k available.
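Sparsity can be measured directly from the book-by-user pivot table as the share of empty cells. The sketch below assumes the `UserID`, `ISBN`, and `Rating` column names from the file description; the function names are hypothetical.

```python
import pandas as pd

def rating_matrix(ratings):
    """Books as rows, users as columns, NaN where a user has not rated a book."""
    return ratings.pivot_table(index="ISBN", columns="UserID", values="Rating")

def sparsity(matrix):
    """Fraction of (book, user) cells that hold no rating."""
    return float(matrix.isna().to_numpy().mean())
```

For example, 3 ratings spread over 2 books and 2 users leave 1 of 4 cells empty, a sparsity of 0.25.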

Cosine Similarity: Evaluating Closeness of 2 Items or Users
Cosine similarity is a measure of similarity between two
sequences of numbers. The sequences here are viewed
as vectors.
The cosine similarity is defined as the cosine of the
angle between them, that is, the dot product of the
vectors divided by the product of their
lengths.
In our case, we can view the vectors in 2 ways. Each
book is a vector of n dimensions, where n is the number
of users, OR each user is a vector of n dimensions,
where n is the number of books.
                                    User_85  User_2245  User_1134  User_92
The Da Vinci Code                         .         10          .        7
A Walk to Remember                        9          5          .        5
Angels & Demons                           5          7          .        .
Life of Pi                                .          .          .        8
The Alchemist                             7          .          6        .
The Hobbit                                .          .         10        .
Harry Potter & The Goblet of Fire         .          9          8        7
(. = not rated)
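The book-to-book computation can be sketched with NumPy. The matrix below uses the example ratings, with 0 for an unrated cell; the placement of ratings in cells is inferred from the centered table later in the deck, so treat it as an assumption. The function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = vectors / norms
    return unit @ unit.T

# Rows are books, columns are User_85, User_2245, User_1134, User_92.
book_ratings = np.array([
    [0, 10, 0, 7],   # The Da Vinci Code
    [9, 5, 0, 5],    # A Walk to Remember
    [5, 7, 0, 0],    # Angels & Demons
    [0, 0, 0, 8],    # Life of Pi
    [7, 0, 6, 0],    # The Alchemist
    [0, 0, 10, 0],   # The Hobbit
    [0, 9, 8, 7],    # Harry Potter & The Goblet of Fire
], dtype=float)

book_similarity = cosine_similarity_matrix(book_ratings)
```

Each row of `book_similarity` holds one book's similarity to every other book. Passing the transposed matrix instead gives user-to-user similarities.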

Cosine Similarity: Evaluating Closeness of 2 Items
The output is an array of similarity scores for
each book with every other book

Similarity Score Distributions for Sample Items
[Figure: similarity score distributions for Pride and Prejudice, The Da Vinci Code, and The Hobbit]

k nearest neighbors
The k-nearest neighbors (KNN) algorithm is a data classification
method for estimating the likelihood that a data point will
become a member of one group or another based on what group
the data points nearest to it belong to.
The principle behind nearest neighbor methods is to find a
predefined number of samples closest in distance to the new
point, and predict the label from these.
The number of samples (k) can be a user-defined constant
(k-nearest neighbor learning), or vary based on the local density
of points (radius-based neighbor learning).

Working Principle of KNN
● Choose the value of K.
● Calculate the distance between the new data point and every training point.
● Sort the computed distances in ascending order.
● Choose the first K distances from the sorted list.
● Take the mode or mean of the labels associated with those K distances:
for classification, take the mode of the classes; for regression, take the mean of the
values.
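The steps above can be sketched in a few lines of NumPy. This is a minimal illustration using Euclidean distance, not the implementation used in the project; the function name and the `task` parameter are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3, task="classification"):
    """Predict a label for `query` from its k nearest training points."""
    # Distance from the query to every training point.
    distances = np.linalg.norm(np.asarray(train_x, dtype=float) - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbours = [train_y[i] for i in nearest]
    if task == "classification":
        # Mode of the neighbours' classes.
        return Counter(neighbours).most_common(1)[0][0]
    # Regression: mean of the neighbours' values.
    return float(np.mean(neighbours))
```

A small K makes the prediction sensitive to noise, while a large K smooths over local structure, so K is usually tuned on held-out data.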

Centered Cosine Similarity: Penalizing Opposite Ratings
Instead of treating missing ratings as zero
ratings, this method treats them as average
ratings (since the mean of each row is zero after centering).
It puts strict raters and liberal raters on the same scale.
Also known as the Pearson correlation.
Ratings are normalized by subtracting the row mean.
Each book rating is now centred around 0: positive
ratings indicate that the user liked the book more than
their average, and negative ratings that it
was below their average, taking their own
personal rating system into consideration.
            The Da      A Walk to  Angels &  Life   The        The     Harry Potter &
            Vinci Code  Remember   Demons    of Pi  Alchemist  Hobbit  The Goblet of Fire
User_85     .           2          -2        .      0          .       .
User_2245   2.25        -2.75      -0.75     .      .          .       1.25
User_1134   .           .          .         .      -2         2       0
User_92     0.25        -1.75      .         1.25   .          .       0.25
(. = not rated)
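The centering step can be sketched in NumPy as follows. Unrated cells are stored as NaN, each user's mean is taken over rated cells only, and unrated cells become 0 after centering. The raw ratings are the ones implied by the centered table, and the function names are hypothetical.

```python
import numpy as np

def center_rows(ratings):
    """Subtract each row's mean over rated cells; unrated (NaN) cells become 0."""
    means = np.nanmean(ratings, axis=1, keepdims=True)
    return np.nan_to_num(ratings - means, nan=0.0)

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

nan = np.nan
# Columns: Da Vinci Code, A Walk to Remember, Angels & Demons, Life of Pi,
# The Alchemist, The Hobbit, Harry Potter & The Goblet of Fire.
user_ratings = np.array([
    [nan, 9, 5, nan, 7, nan, nan],    # User_85
    [10, 5, 7, nan, nan, nan, 9],     # User_2245
    [nan, nan, nan, nan, 6, 10, 8],   # User_1134
    [7, 5, nan, 8, nan, nan, 7],      # User_92
])

centered = center_rows(user_ratings)
```

With centered ratings, `cosine(centered[0], centered[1])` comes out negative because User_85 and User_2245 disagree about A Walk to Remember, which is the "penalizing opposite ratings" effect. Comparing columns of `centered` instead of rows gives book-to-book similarities.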

Item Based Collaborative Filtering

Item-Based Filtering Pros and Cons
Pros
- The item-based method provides more
consistent recommendation results compared to
others, because there is high consistency among
similarities between books compared to that
between users.
- It can be used to recommend books for new
users and those with limited rating history.
Cons
- Item-based methods might sometimes
recommend obvious items or items that are
not novel from previous user experiences.

User-Based
Collaborative Filtering
“Users similar to you also liked..”
A technique used to predict the items
that a user might like based on ratings
given to that item by the other users who
have similar tastes.

Steps for User-Based Filtering -I
1. Filter out users with fewer than 50 ratings.
2. Create user-items matrix (pivot table).

Steps for User-Based Filtering -II
3. Calculate similarity scores for each pair of users.
4. Create a function to retrieve top three book choices from similar users.

Steps for User-Based Filtering -III
5. Use a predefined function to identify the users with the highest similarity scores to the target user and
return their top book choices.

Steps for User-Based Filtering -IV
5a. Alternatively, we can use KNN to identify the users with the highest similarity scores and their
“distances” to the target user.
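Steps 2 to 5 can be sketched as a single function. This is an illustration under assumptions, not the team's code: the column names follow the file description, the function name and its parameters are hypothetical, unrated books are filled with 0, and the ratings are assumed to be filtered already (step 1).

```python
import numpy as np
import pandas as pd

def recommend_from_similar_users(ratings, target, n_neighbours=2, n_books=3):
    """Return up to n_books ISBNs the target has not rated, taken from similar users."""
    # Step 2: user-item matrix (users as rows, books as columns).
    matrix = ratings.pivot_table(index="UserID", columns="ISBN", values="Rating").fillna(0)
    values = matrix.to_numpy(dtype=float)

    # Step 3: cosine similarity between the target and every other user.
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = values / norms
    row = matrix.index.get_loc(target)
    sims = pd.Series(unit @ unit[row], index=matrix.index).drop(target)

    # Steps 4 and 5: take the most similar users and rank the books they rated
    # that the target has not rated yet.
    neighbours = sims.nlargest(n_neighbours).index
    unread = matrix.columns[(matrix.loc[target] == 0).to_numpy()]
    scores = matrix.loc[neighbours, unread].mean(axis=0)
    return scores[scores > 0].nlargest(n_books).index.tolist()
```

Averaging with zeros for unrated books favours books that several neighbours rated; weighting each neighbour by similarity is a common refinement.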

Similarity Score Distributions for Sample Users
[Figure: similarity score distributions for User 187517 (631 ratings), User 141902 (200 ratings), and User 153662 (5814 ratings)]

User-Based Filtering Results Exhibition I-IV
[Figures: recommendation results]

User-Based Filtering Pros and Cons
Pros
- The performance of the recommendations
will keep improving as the size of the
neighborhood grows.
- It requires only user ratings to make
recommendations, so it is independent of
user demographic features.
- It tends to generate more diverse results
because users have varied tastes.
Cons
- Only a small percentage of users on
Goodreads provided rating scores.
- We have very limited information on new
users to calculate similarity scores.
- The computation of user neighborhoods
needs to be performed more frequently with
the addition of new users.

Item-Based vs. User-Based Methods
- In theory, user-user and item-item are dual approaches with similar expected performance. In
practice, item-based outperforms user-based in many cases.
- Users have changing tastes, while two items would always remain similar. Users have varied
tastes, while items belong to a small set of “genres”.
- Incremental maintenance of the recommendation model is more challenging in the case of
user-based methods compared to item-based methods.

Content Based Filtering
Content-based filtering uses item features to recommend
other items similar to what the user likes, based on their
previous actions or explicit feedback (in this case, the
rating).
Features used to develop a content-based model include
author and publisher. Collaborative filtering, in contrast,
does not take item attributes/features into consideration.
Content-based filtering works by representing a profile
vector of the user in the same dimensions as the item
attribute vector and calculating the weights based on the
user's historical interaction with the items.

Steps for Content-Based Method
1. Filter out books with fewer than 200 ratings.
2. Use a predefined function to identify items with similar author, publisher and publishing year to
generate recommendation results.
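The slides do not say how similarity over author, publisher, and publishing year is scored. One simple possibility is to count matching attributes. The scoring rule, the function name, and the column names below are all assumptions made for illustration.

```python
import pandas as pd

def similar_books(books, isbn, n=3):
    """Rank other books by how many of author, publisher, and year match."""
    target = books.set_index("ISBN").loc[isbn]
    others = books[books["ISBN"] != isbn].copy()
    # Hypothetical score: one point per matching attribute.
    others["score"] = (
        (others["Author"] == target["Author"]).astype(int)
        + (others["Publisher"] == target["Publisher"]).astype(int)
        + (others["Year"] == target["Year"]).astype(int)
    )
    ranked = others.sort_values("score", ascending=False, kind="stable")
    return ranked.head(n)["ISBN"].tolist()
```

With so few attributes, many books tie on score, which fits the limitation noted in the pros and cons: the model can only be as good as the available book features.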

Content-Based Recommendation Results I-III
[Figures: a source book and the books the model recommends for it]

Content-Based Filtering Pros and Cons
Pros
- We don't need a long rating history of the
user to make the recommendation, nor do we
need any demographic info.
- It can capture the niche interests of a user.
Cons
- This technique requires a lot of domain
knowledge. The model can only be as good as
the hand-engineered features. In our case, for
example, content-based filtering wouldn't be
as good as the other two techniques, because
we have very limited information on book
features.

K-means Clustering
Cluster 1: Mostly international users. Would recommend books that are popular worldwide, rather than just in the USA.
Cluster 3: Users who tend to read books that are the most popular (most rated), even if they may not be highly rated (AvgBook.Rating = 3.2).
Would recommend the most popular books regardless of ratings.

K-means Clustering
Cluster 2: Users who are interested only in books that are highly rated, regardless of their popularity. Would recommend other highly rated
books to these users.
Cluster 5: Users who tend to read books that are published by the Big 5 Publishing Houses: Penguin Books, HarperCollins, Hachette Livre,
Macmillan, and Simon & Schuster. Would recommend any book published under these houses.

K-means Clustering
● There is some interpretability:
○ If a user falls into a high AvgBook.Rating cluster, we can recommend them only books that received high ratings, etc.
● Can be a first step, but the recommendation system should be improved through other methodologies.
○ Collaborative filtering
○ Gather more feature data for each book such as genre, themes, etc. and cluster again.
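For reference, the clustering itself can be sketched as plain k-means (Lloyd's algorithm) in NumPy. This is a generic sketch, not the project's code: it assumes each user has already been summarized as a numeric feature vector (for example an average book rating and a popularity measure), and the function name and parameters are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Cluster the rows of `points` into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if it has none).
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Features on different scales should be standardized first, otherwise the feature with the largest range dominates the distance.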

Business Value
To recommend the most relevant books to users based on their interactions (ratings) with
other books in the past or based on other similar users' interactions with books.
How Our Recommendation System Can Drive Business Value
● Help Goodreads provide an improved customer experience and gain a competitive advantage. This will
improve customer retention and, in turn, reduce acquisition costs.
● Drive customer engagement with the Goodreads website through the new recommendation engine.
● Increase product awareness by helping all types of books reach new customers (the strategic value of Amazon
owning Goodreads).
● Improve the product design process of the Goodreads website. Recommendation systems can help Goodreads
make design decisions by surfacing the most relevant products to any given user.

Future Improvements
● The cold-start problem: collaborative filtering systems rely on available data from similar users. If you are
building a brand new recommendation system, you would have no user data to start with. You can use content-based filtering
first and then move on to the collaborative filtering approach.
● Input data may not always be accurate because all ratings are self-reported. User behavior is more important than ratings.
● A strong recommendation engine will be able to identify changes (or signs of an impending change) in customers' preferences
and behavior, and constantly retrain itself in real time in order to serve relevant recommendations.

Future Improvements
Content Boosted Collaborative
Filtering for recommender systems:
CBCF is a type of hybrid
recommendation technique that uses
a combination of content-based
filtering and collaborative filtering.
Its main idea is to overcome the
sparsity problem that degrades the
performance of collaborative
filtering algorithms by using item
content to make the user-item
interaction matrix dense.

Future Improvements
Singular Value Decomposition: SVD is
a matrix factorization technique
that is usually used to reduce the
number of features of a data set by
reducing space dimensions from N to
K where K < N. The matrix
factorization is done on the
user-item ratings matrix. From a high
level, matrix factorization can be
thought of as finding 2 matrices
whose product approximates the original matrix.
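A truncated SVD can be sketched with `numpy.linalg.svd`. SVD itself returns three factors (U, the singular values, and V transposed); folding the singular values into U gives the two-matrix view described above. The function name is hypothetical, and how missing ratings are filled before factorizing (for example with 0 or with user means) is left open here.

```python
import numpy as np

def low_rank_factors(matrix, k):
    """Return (user_factors, item_factors) whose product is the rank-k approximation."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    user_factors = u[:, :k] * s[:k]   # shape: n_users x k
    item_factors = vt[:k, :]          # shape: k x n_items
    return user_factors, item_factors
```

The product `user_factors @ item_factors` fills every cell of the user-item matrix, and the values in originally empty cells can be read as predicted ratings.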
