3. Data Sources and Processing Flow
[Flow diagram] 20M reviews + genres → 2,800 movies & 1,000 books (each with 20+ reviews and no missing attributes) → similarity scores → book recommendations for each movie, displayed with book cover images
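The filtering step in the flow (keep items with 20+ reviews and no missing attributes) could be sketched as below. The data and field names are hypothetical toy stand-ins, not the deck's actual schema, and the review threshold is lowered to fit the toy sample:

```python
from collections import Counter

# Hypothetical tables; field names are assumptions, not from the deck.
reviews = [("m1", "u1"), ("m1", "u2"), ("m2", "u3"),
           ("b1", "u1"), ("b1", "u2"), ("b1", "u4")]  # (item_id, user_id)
genres = {"m1": ["drama"], "m2": None, "b1": ["drama", "romance"]}

MIN_REVIEWS = 2  # the deck uses 20+; lowered for this toy sample

# Keep items with enough reviews and no missing genre attribute.
counts = Counter(item for item, _ in reviews)
kept = sorted(item for item, n in counts.items()
              if n >= MIN_REVIEWS and genres.get(item) is not None)
print(kept)  # ['b1', 'm1']
```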
4. Collaborative filtering using user rating scores?
Unfortunately, the data is too sparse: performance is poor even after SVD, and 80% of movie-book pairs have no common user.
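The sparsity claim can be checked directly by counting, for each movie-book pair, the users who rated both. A minimal sketch with hypothetical rater sets (the real sets come from the 20M reviews):

```python
# Users who rated each item (toy data, not the actual dataset).
movie_raters = {"m1": {"u1", "u2"}, "m2": {"u3"}}
book_raters = {"b1": {"u1"}, "b2": {"u4"}}

# Fraction of movie-book pairs with no rater in common.
pairs = [(m, b) for m in movie_raters for b in book_raters]
no_overlap = sum(1 for m, b in pairs
                 if not movie_raters[m] & book_raters[b])
frac_empty = no_overlap / len(pairs)
print(frac_empty)  # 0.75: 3 of 4 toy pairs share no rater
```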
5. Similarity Metrics for Movie-Book Pairs
• Review text → cosine similarity (C)
• Genres → Jaccard similarity (J)
• C and J are combined into a final similarity score
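The two metrics can be sketched as follows. The texts, genre sets, and the weight combining C and J are all assumptions for illustration (the deck does not say how the final score is formed), and the cosine here uses plain bag-of-words counts rather than whatever vectorization the project used:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    wa, wb = Counter(a.split()), Counter(b.split())
    dot = sum(wa[t] * wb[t] for t in wa)
    norm = sqrt(sum(v * v for v in wa.values())) * sqrt(sum(v * v for v in wb.values()))
    return dot / norm

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

# Hypothetical review snippets and genre sets.
C = cosine("a quiet drama about grief and family",
           "a family drama exploring grief")
J = jaccard({"drama"}, {"drama", "fiction"})

alpha = 0.5  # hypothetical weight; the deck does not state how C and J combine
score = alpha * C + (1 - alpha) * J
```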
6. Validation
[Venn diagram: users who liked movie A, users who liked book B, and their overlap]
• Based on rating scores from users who rated both movies and books
• For each movie, calculate the Jaccard index between the movie and:
– Jrec: its recommended books
– Jbase: all the books
• Median(Jrec/Jbase) = 26: people are 26x more likely to like a Movie2Books recommendation than the random baseline
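The validation above can be sketched for a single movie: compare the Jaccard overlap between users who liked the movie and users who liked (a) its recommended books vs. (b) all books. The user sets are hypothetical:

```python
def jaccard(a, b):
    """Jaccard index of two user sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical user sets for one movie.
liked_movie = {"u1", "u2", "u3"}
liked_recommended = {"u1", "u2", "u9"}                      # liked the recommended books
liked_any_book = {"u1", "u4", "u5", "u6", "u7", "u8", "u9"}  # liked any book (baseline)

J_rec = jaccard(liked_movie, liked_recommended)
J_base = jaccard(liked_movie, liked_any_book)
lift = J_rec / J_base  # the deck reports a median lift of ~26x over all movies
```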
9. Out of 20 million reviews from 3.7 million users, about half of the reviews were provided by 10% of the users.
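That concentration statistic can be computed by sorting users by review count and taking the share contributed by the top decile. The per-user counts here are a hypothetical toy sample, not the 3.7M-user dataset:

```python
# Reviews per user (hypothetical small sample).
reviews_per_user = [100, 40, 5, 3, 2, 2, 1, 1, 1, 1]

counts = sorted(reviews_per_user, reverse=True)
top_k = max(1, len(counts) // 10)            # top 10% of users
share = sum(counts[:top_k]) / sum(counts)    # their share of all reviews
print(f"{share:.0%}")
```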
[Charts: fraction of reviews from the top 10% of users, for books and for movies]
Some fun stuff…
10. Is this a highly rated movie at Amazon?
[Histogram of the movie's ratings, from “Don’t like it” to “Really like it”]
11. Is this a highly rated movie at Amazon?
[Panels: ratings of the movie, ratings of all movies, and the re-scaled scores]
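One natural reading of the re-scaling (an assumption; the slide only shows the movie's rating distribution next to the distribution over all movies) is a percentile rank of the movie's mean rating against all movies' mean ratings:

```python
# Hypothetical mean ratings across movies; not the actual Amazon data.
all_movie_means = [4.9, 4.8, 4.7, 4.5, 4.2, 4.0, 3.6, 3.1]
movie_mean = 4.2

# Percentile rank: fraction of movies rated no higher than this one.
rescaled = sum(m <= movie_mean for m in all_movie_means) / len(all_movie_means)
print(rescaled)  # 0.5
```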
12. Most vs Least Reviewed Items
• Both have very skewed rating distributions, with a mode of 5.
• The most reviewed items have a higher fraction of 5s: popular products are indeed better liked.
[Rating histograms for the most and least reviewed books and movies]
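The comparison behind that slide can be reproduced by splitting items into most- and least-reviewed groups and comparing their fraction of 5-star ratings. The ratings below are hypothetical:

```python
# Hypothetical ratings for a heavily and a lightly reviewed item.
ratings_by_item = {
    "popular_item": [5, 5, 5, 4, 5, 3, 5, 5],
    "obscure_item": [5, 1, 3],
}

def frac_fives(ratings):
    """Fraction of 5-star ratings."""
    return ratings.count(5) / len(ratings)

for item, r in ratings_by_item.items():
    print(item, round(frac_fives(r), 2))
```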
13. Most vs Least Active Users
• The least active users give more bad ratings (score = 1): are they more likely to write a review only when they really don't like the product?
[Rating histograms for the most and least active users, for books and movies]
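Similarly, the hypothesis about the least active users can be checked by grouping users by review count and comparing the fraction of 1-star ratings in each group. The user data and the activity cutoff are hypothetical:

```python
# Hypothetical (review count, ratings) per user; ratings truncated for the toy.
users = [
    (50, [5, 4, 5, 3, 5]),  # very active user
    (1,  [1]),              # single-review user
    (2,  [1, 5]),           # near-inactive user
]

def frac_ones(ratings):
    """Fraction of 1-star ratings."""
    return ratings.count(1) / len(ratings)

# Hypothetical cutoff of 10 reviews between "active" and "casual".
active = [r for n, rs in users if n >= 10 for r in rs]
casual = [r for n, rs in users if n < 10 for r in rs]
print(frac_ones(active), frac_ones(casual))
```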