Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+
Large-scale Parallel Collaborative Filtering and Clustering
using MapReduce for Recommender Engines
Varad Meru
Software Development Engineer,
Orzota, Inc.
© Varad Meru, 2013

+
Outline
 Introduction
 Introduction to Recommendation Engines
 Algorithms for Recommendation Engines
 Challenges in Recommendation Engines
 What is Hadoop MapReduce?
 What is the Netflix Prize?
 Block diagram
 System requirements
 Conclusion

+
Recommender Systems
Introduction and Project Scope

+
Introduction
 The scope of our project is to build a recommender engine using
clustering.
 Recommender engines are used in e-commerce and other
settings to recommend items to end users.
 Widely used by companies and services such as
Amazon, Netflix, Flipkart, Google News, and many others.
 Collaborative filtering algorithms, clustering, and matrix
decomposition are used for finding recommendations.

+
Recommender
System Example

+
Some other Recommender Systems
Here are some snapshots of the widely used recommendation engines
at Amazon.

+
Collaborative Filtering in Action
{“Algorithms” : “Recommender Systems”, “id” : “Example”}
Binary values (rows: users, columns: movies)
        Movie 1  Movie 2  Movie 3  Movie 4
Alice      0        1        1        1
Bob        1        0        1        1
John       0        1        0        0
Jane       1        0        1        1
Bill       1        1        1        1
Steve      1        0        1        1
Larry      1        0        0        0
Don        1        1        1        0
Jack       1        1        0        1
 Assume each of the named users has seen some of
the movies.
 Let 1 denote seen
 Let 0 denote not seen

+
{“Algorithms” : “Recommender Systems”, “Similarity” : “Tanimoto”}
Tanimoto coefficient between each pair of movies
         Movie 1        Movie 2        Movie 3        Movie 4
Movie 1  1              1/3 = 0.33     5/8 = 0.625    5/8 = 0.625
Movie 2  1/3 = 0.33     1              3/8 = 0.375    3/8 = 0.375
Movie 3  5/8 = 0.625    3/8 = 0.375    1              5/7 = 0.714
Movie 4  5/8 = 0.625    3/8 = 0.375    5/7 = 0.714    1
Tanimoto coefficient: T(A, B) = Nc / (NA + NB - Nc)
NA: number of customers who bought product A
NB: number of customers who bought product B
Nc: number of customers who bought both product A and product B
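The Tanimoto values on this slide can be reproduced from the binary matrix on the previous slide. The following is a minimal Python sketch, assuming the rows are users and the columns are the four movies; the names `SEEN` and `tanimoto` are illustrative and do not come from the original deck.

```python
# Binary user-by-movie matrix from the "Collaborative Filtering in Action" slide:
# one row per user, one column per movie (1 = seen, 0 = not seen).
SEEN = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def tanimoto(matrix, a, b):
    """Tanimoto coefficient between columns a and b: Nc / (NA + NB - Nc)."""
    n_a = sum(row[a] for row in matrix)           # users who saw item a
    n_b = sum(row[b] for row in matrix)           # users who saw item b
    n_c = sum(row[a] & row[b] for row in matrix)  # users who saw both
    union = n_a + n_b - n_c
    return n_c / union if union else 0.0


if __name__ == "__main__":
    # Print the full item-item similarity matrix, one row per movie.
    for a in range(4):
        print("  ".join(f"{tanimoto(SEEN, a, b):.3f}" for b in range(4)))
```

Comparing columns rather than rows gives item-item similarity, which is what the similarity matrix on this slide tabulates.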

+
{“Algorithms” : “Recommender Systems”, “Similarity” : “Cosine”}
Cosine coefficient between each pair of movies
         Movie 1  Movie 2  Movie 3  Movie 4
Movie 1  1        0.507    0.772    0.772
Movie 2  0.507    1        0.548    0.548
Movie 3  0.772    0.548    1        0.833
Movie 4  0.772    0.548    0.833    1
Cosine coefficient: cos(A, B) = Nc / sqrt(NA × NB)
NA: number of customers who bought product A
NB: number of customers who bought product B
Nc: number of customers who bought both product A and product B
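A minimal Python sketch of the cosine coefficient, computed for each pair of movie columns of the binary matrix from the "Collaborative Filtering in Action" slide. The names `SEEN` and `cosine` are illustrative assumptions, not part of the original deck.

```python
import math

# Binary user-by-movie matrix: one row per user, one column per movie.
SEEN = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def cosine(matrix, a, b):
    """Cosine of the angle between columns a and b of the matrix."""
    col_a = [row[a] for row in matrix]
    col_b = [row[b] for row in matrix]
    dot = sum(x * y for x, y in zip(col_a, col_b))
    norm = math.sqrt(sum(x * x for x in col_a)) * math.sqrt(sum(y * y for y in col_b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    for a in range(4):
        print("  ".join(f"{cosine(SEEN, a, b):.3f}" for b in range(4)))
```

For 0/1 data the dot product is the number of users who saw both movies, so the general cosine formula reduces to Nc / sqrt(NA × NB).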

+
MinHash Clustering in Action
 We will be implementing a variation of this algorithm for our project.
 It is a technique for estimating how similar two sets are.
 The scheme was invented by Andrei Broder (1997)¹
 The simplest version of the MinHash scheme uses k different
hash functions, where k is a fixed integer parameter, and
represents each set S by the k values of h_min(S) for these k
functions.
 Google is known to have used this method to cluster news
articles in order to recommend news that matches each user’s tastes²
¹ Broder, Andrei Z. (1997), “On the resemblance and containment of documents”.
² Abhinandan Das, Mayur Datar, et al. (2007), “Google News Personalization: Scalable Online Collaborative Filtering”.
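The k-hash-function version described on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: items are non-negative integer ids smaller than `prime`, and the linear hash family, the seed, and all function names are hypothetical choices.

```python
import random


def make_hash_functions(k, seed=0, prime=2_147_483_647):
    """Return k hash functions of the form h(x) = (a * x + b) mod prime."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(k)]
    # Bind a and b as defaults so each lambda keeps its own parameters.
    return [lambda x, a=a, b=b: (a * x + b) % prime for a, b in params]


def minhash_signature(items, hash_functions):
    """Represent a non-empty set S by the k values of h_min(S)."""
    return [min(h(x) for x in items) for h in hash_functions]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the signatures agree (estimates Jaccard similarity)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hash_functions(k=200)
    a = minhash_signature({1, 2, 3, 4}, hashes)
    b = minhash_signature({2, 3, 4, 5}, hashes)
    # The true Jaccard similarity of the two sets is 3/5; the estimate approximates it.
    print(estimated_similarity(a, b))
```

A larger k gives a more accurate estimate at the cost of a longer signature per set.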

+
MinHash Clustering Flow
Start
1. Get a random permutation of the product catalog, R.
2. Define a hash function h such that
h(Ui) = the minimum-ranked product in R.
Ui: all the interactions performed by user i.
An interaction can be a click, purchase, like, etc.
3. Pass each user through the hash function to get
the cluster number.
4. After the clusters have been formed, use covisitation to
find recommendations.
5. Cache the recommendations in memory.
Stop
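The flow above can be sketched in a single process as follows. This is a simplified illustration, not the project's MapReduce implementation: it uses one random permutation, and it approximates covisitation by counting how often other members of the same cluster interacted with each product. All function and variable names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def minhash_clusters(interactions, catalog, seed=0):
    """Group users by the minimum-ranked product they interacted with in a random permutation R."""
    rng = random.Random(seed)
    permuted = list(catalog)
    rng.shuffle(permuted)  # R: a random permutation of the product catalog
    rank = {product: position for position, product in enumerate(permuted)}
    clusters = defaultdict(list)
    for user, products in interactions.items():
        if not products:
            continue  # users without interactions get no cluster
        cluster_id = min(products, key=rank.__getitem__)  # h(Ui)
        clusters[cluster_id].append(user)
    return dict(clusters)


def recommend(user, interactions, clusters, top_n=3):
    """Suggest products that other members of the user's cluster interacted with, most frequent first."""
    for members in clusters.values():
        if user in members:
            counts = Counter(
                product
                for other in members
                if other != user
                for product in interactions[other]
                if product not in interactions[user]
            )
            return [product for product, _ in counts.most_common(top_n)]
    return []


if __name__ == "__main__":
    interactions = {
        "alice": {"p1", "p2"},
        "bob": {"p1", "p2", "p3"},
        "jane": {"p4"},
    }
    clusters = minhash_clusters(interactions, ["p1", "p2", "p3", "p4"])
    # Cache the recommendations in memory, as in the last step of the flow.
    cache = {user: recommend(user, interactions, clusters) for user in interactions}
    print(cache)
```

Users with identical interaction sets always land in the same cluster; users with overlapping sets do so with probability equal to their Jaccard similarity.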

+
Some Recommender Systems Available
 Apache Mahout¹
 Easyrec²
 University of Minnesota’s SUGGEST³
 Other research implementations, such as UniRecSys and
Taste
¹ http://mahout.apache.org
² http://easyrec.org/
³ http://www-users.cs.umn.edu/~karypis/suggest/

+
MapReduce Paradigm
MapReduce and Hadoop

+
MapReduce Programming Paradigm
 A core idea behind MapReduce is mapping your data set into a
collection of key-value pairs, and then reducing over all pairs
with the same key.
 Hadoop MapReduce is an open-source implementation of the
MapReduce framework, modeled on Google’s MapReduce
software framework.
 Used for writing applications that rapidly process vast amounts of
data in parallel on large clusters of compute nodes.
 A Hadoop MapReduce job mainly consists of two user-defined
functions: map and reduce.

+
map() function
 A list of data elements is passed, one at a time, to map()
functions, which transform each data element into an individual
output data element.
 A map() produces one or more intermediate <key, value>
pair(s) from the input list.
Input list:
(k1, V1)  (k2, V2)  (k3, V3)  (k4, V4)  (k5, V5)  (k6, V6)  ……
MAP  MAP  MAP  MAP
Intermediate output list:
(k’1, V’1)  (k’2, V’2)  (k’3, V’3)  (k’4, V’4)  (k’5, V’5)  (k’6, V’6)  ……

+
reduce() function
 After the map phase finishes, the intermediate values with the same
output key are reduced into one or more final values.
Intermediate map output:
(k’1, V’1)  (k’2, V’2)  (k’3, V’3)  (k’4, V’4)  (k’5, V’5)  (k’6, V’6)  ……
Reduce  Reduce  Reduce
Final result:
(F1, R1)  (F2, R2)  (F3, R3)  ……
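The map and reduce steps can be illustrated with a small single-process Python simulation. This sketch does not use the Hadoop API; `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the job (counting how many users saw each movie) is chosen only as an example.

```python
from collections import defaultdict


def map_fn(record):
    """Emit one intermediate (key, value) pair per input record."""
    user, movie = record
    yield movie, 1


def reduce_fn(key, values):
    """Combine all intermediate values that share a key into one final result."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Simulate map, group-by-key, and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            grouped[key].append(value)     # group values by intermediate key
    # The reduce phase starts only after every record has been mapped.
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    records = [("alice", "m1"), ("bob", "m1"), ("bob", "m2")]
    print(run_job(records, map_fn, reduce_fn))
```

In a real cluster the grouping step is the shuffle performed by the framework, and the map and reduce calls run on different nodes.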

+
Parallelism
 map() functions run in parallel, creating different intermediate
values from different input data elements
 reduce() functions also run in parallel, each working on its assigned
output key
 All values are processed independently
 The reduce phase can’t start until the map phase has completely
finished.
 It is, in a way, a data-parallel implementation and thus scales to
very large amounts of data.

+
Hadoop
 Started by Doug Cutting, and then carried forward by enterprises
such as Yahoo! and Facebook
 It is a collection of three frameworks: Hadoop Common, MapReduce,
and HDFS.
 Free and open source under the Apache License
 Largest cluster at the time of writing: 4,000 nodes (at Yahoo!)
 A whole ecosystem has been built around it to process large amounts of
data (on the order of GBs, TBs, and PBs)

+
Netflix Dataset
 This dataset was released by Netflix on October 2, 2006, for a
SIGKDD challenge to build the world’s best recommender for
Netflix.
 Netflix provided a training data set of 100,480,507 ratings that
480,189 users gave to 17,770 movies.
 Each training rating is a quadruplet of the form
<user, movie, date of grade, grade>
 Used heavily in research on recommender engines¹.
 Used in our project to compare our implementation of the
algorithm with other implementations, e.g. Mahout
¹ Google Scholar: about 3,190 results for the search term “netflix prize”
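A rating quadruplet can be modeled as a small record type. The Python sketch below assumes a hypothetical comma-separated line layout (`user,movie,YYYY-MM-DD,grade`), which is not necessarily how the distributed files are laid out; `Rating` and `parse_rating` are illustrative names.

```python
from datetime import date
from typing import NamedTuple


class Rating(NamedTuple):
    """One training rating: <user, movie, date of grade, grade>."""
    user: int
    movie: int
    graded_on: date
    grade: int


def parse_rating(line):
    """Parse 'user,movie,YYYY-MM-DD,grade' (an assumed layout) into a Rating."""
    user, movie, graded_on, grade = line.strip().split(",")
    return Rating(int(user), int(movie), date.fromisoformat(graded_on), int(grade))


if __name__ == "__main__":
    print(parse_rating("6,1,2005-09-06,3"))
```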

+
High-level Architecture
 MapReduce implementation of
clustering algorithms such as K-Means
and MinHash clustering.
 Comparative analysis with
existing frameworks
such as Apache Mahout (see
references 1, 2, and 3)

+
Requisites
 2 Linux machines (required; preferred OS: Ubuntu)
 Pentium 4 or better (recommended: Core 2 Duo, 2.53 GHz or faster)
 1 GB RAM per machine (recommended: 4 GB per machine)
 Apache Hadoop (from http://hadoop.apache.org)
 Apache Mahout (from http://mahout.apache.org)
 Java IDE (Eclipse preferred)
 Java SDK 1.6+

+
References
1. “Scalable Similarity-Based Neighborhood Methods with
MapReduce” by Sebastian Schelter, Christoph Boden and
Volker Markl. – RecSys 2012.
2. “Case Study Evaluation of Mahout as a Recommender Platform”
by Carlos E. Seminario and David C. Wilson - Workshop on
Recommendation Utility Evaluation: Beyond RMSE (RUE 2012)
3. http://mahout.apache.org/ - Apache Mahout Project Page
4. http://www.ibm.com/developerworks/java/library/j-mahout/ -
Introducing Apache Mahout
5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.
