Machine Learning Recommendations with Spark

Collaborative filtering algorithms recommend items (this is
the filtering part) based on preference information from many
users (this is the collaborative part). The collaborative filtering
approach is based on similarity; the basic idea is people who
liked similar items in the past will like similar items in the future.
In the example shown, Ted likes movies A, B, and C. Carol likes
movies B and C. Bob likes movie B. To recommend a movie to
Bob, we calculate that users who liked B also liked C, so C is a
possible recommendation for Bob. Of course, this is a tiny
example. In real situations, we would have much more data to
work with.
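The Ted, Carol, and Bob example can be sketched in a few lines of plain Python. This is only a minimal illustration of the "users who liked B also liked C" idea. The `recommend` function and its counting rule are assumptions made for this sketch, not the ALS model discussed later.

```python
from collections import Counter

# Likes taken from the example: Ted likes A, B, C; Carol likes B, C; Bob likes B.
likes = {
    "Ted": {"A", "B", "C"},
    "Carol": {"B", "C"},
    "Bob": {"B"},
}

def recommend(user, likes):
    """Rank items the user has not liked by how many similar users liked them.

    A "similar user" here is simply anyone who shares at least one liked item;
    this is an illustrative simplification (hypothetical helper).
    """
    seen = likes[user]
    counts = Counter()
    for other, items in likes.items():
        if other != user and seen & items:
            counts.update(items - seen)
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(counts, key=lambda item: (-counts[item], item))

recommendations = recommend("Bob", likes)
print(recommendations)
```

Both Ted and Carol share movie B with Bob and both liked C, so C ranks ahead of A, which only Ted liked.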

The goal of a collaborative filtering algorithm is to take
preference data from users and create a model that can be
used for recommendations or predictions.
For example, Ted likes movies A, B, and C, and Carol likes movies B and C. We
take this data and run it through an algorithm to build a model.
Then, when we have new data, such as Bob liking movie B, we
use the model to predict that C is a possible recommendation
for Bob.

ALS approximates the sparse U×I user-item rating matrix as the
product of two dense rank-K matrices, the User and Item factor
matrices of size U×K and I×K (see picture below). The
factor matrices are also called latent feature models. The factor
matrices represent hidden features which the algorithm tries to
discover. One matrix tries to describe the latent or hidden
features of each user, and one tries to describe latent
properties of each movie.
ALS is an iterative algorithm. In each iteration, the algorithm
alternately fixes one factor matrix and solves for the other, and
this process continues until it converges. This alternation
between which matrix to optimize is where the "alternating" in
the name comes from.
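The alternation can be sketched in plain Python for the simplest case, K = 1, where each user and each item has a single latent factor and every update has a closed-form solution. The ratings, the regularization value `reg`, and the helper names are all assumptions made for illustration. This is not Spark's implementation.

```python
# Observed ratings as (user, item) -> rating; missing pairs are unknown.
# The numbers are made up for illustration.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0, (0, 2): 1.0,
    (1, 0): 4.0, (1, 1): 5.0,
    (2, 1): 1.0, (2, 2): 5.0,
}
num_users, num_items = 3, 3
reg = 0.1  # regularization strength (an assumed value)

def objective(u, v):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))

def solve_side(fixed, side, count):
    """Solve one side's factors in closed form while the other side is fixed.

    side=0 updates user factors, side=1 updates item factors.
    """
    out = []
    for idx in range(count):
        # Pair each observed rating with the fixed factor on the other side.
        obs = [(fixed[key[1 - side]], r)
               for key, r in ratings.items() if key[side] == idx]
        num = sum(f * r for f, r in obs)
        den = reg + sum(f * f for f, _ in obs)
        out.append(num / den)
    return out

users = [1.0] * num_users
items = [1.0] * num_items
history = [objective(users, items)]
for _ in range(10):
    users = solve_side(items, 0, num_users)  # fix items, solve for users
    items = solve_side(users, 1, num_items)  # fix users, solve for items
    history.append(objective(users, items))

predicted = users[1] * items[2]  # estimate for a pair with no observed rating
print(history[0], history[-1], predicted)
```

Because each half-step exactly minimizes the objective over one set of factors, the values in `history` never increase from one iteration to the next.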

A typical machine learning workflow is shown. We will perform
the following steps:
1. Load the sample data.
2. Parse the data into the input format for the ALS algorithm.
3. Split the data into two parts, one for building the model and one
   for testing the model.
4. Run the ALS algorithm to build/train a user-product matrix
   model.
5. Make predictions with the training data and observe the results.
6. Test the model with the test data.

Spark is especially useful for parallel processing of distributed
data with iterative algorithms. Spark tries to keep things in
memory, whereas MapReduce involves more reading and
writing from disk. As shown in the image below, for each
MapReduce job, data is read from an HDFS file for a mapper,
written to and read back from a SequenceFile in between, and then
written to an output file from a reducer. When a chain of
multiple jobs is needed, Spark can execute much faster by
keeping data in memory.

Spark’s primary abstraction is a distributed collection of items
called a Resilient Distributed Dataset (RDD). RDDs can be
created from Hadoop InputFormats (such as HDFS files) or by
transforming other RDDs.

An RDD is simply a distributed collection of elements. You can think
of it as something like an array or list in a single-machine
program, except that it is spread out across multiple nodes
in the cluster.
In Spark all work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on RDDs to
compute a result. Under the hood, Spark automatically distributes
the data contained in RDDs across your cluster and parallelizes the
operations you perform on them.
So, Spark gives you APIs and functions that let you operate on
the whole collection in parallel using all the nodes.
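As a rough single-machine analogy, and not Spark's API, the idea of a collection split into partitions that are transformed in parallel and then combined into a result can be sketched as follows. The helper names `partition`, `parallel_map`, and `collect_sum` are hypothetical, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a list into roughly equal chunks, standing in for RDD partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def parallel_map(func, partitions):
    """Apply func to every element, one worker per partition (a transformation)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [func(x) for x in part], partitions))

def collect_sum(partitions):
    """Combine per-partition partial sums into one result (an action)."""
    return sum(sum(part) for part in partitions)

numbers = list(range(1, 11))
parts = partition(numbers, 3)                    # spread the data out
squared = parallel_map(lambda x: x * x, parts)   # new collection from an old one
total = collect_sum(squared)                     # compute a single result
print(total)
```

In Spark, the partitions live on different nodes and the framework handles distribution and scheduling. The sketch only mirrors the shape of the programming model.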

We use the org.apache.spark.mllib.recommendation.Rating class for
parsing the ratings.dat file. Later we will use the Rating class as
input for the ALS run method.
Then we use the map transformation on ratingText, which will
apply the parseRating function to each element in ratingText
and return a new RDD of Rating objects. We cache the ratings
data, since we will use this data to build the matrix model.
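The parsing step can be sketched in plain Python. The sketch assumes each line of ratings.dat has the form `UserID::MovieID::Rating::Timestamp`, which the notes do not state. The `Rating` tuple below is a stand-in that mirrors the user, product, and rating fields of the MLlib class, and a list plays the role of the ratingText RDD.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    """Plain-Python stand-in with the same fields as MLlib's Rating class."""
    user: int
    product: int
    rating: float

def parse_rating(line: str) -> Rating:
    """Parse one line, assuming the form UserID::MovieID::Rating::Timestamp."""
    fields = line.strip().split("::")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))

# Hypothetical sample lines; a list plays the role of the ratingText RDD.
rating_text = ["1::1193::5::978300760", "1::661::3::978302109"]

# Equivalent in spirit to ratingText.map(parseRating): one Rating per line.
ratings = [parse_rating(line) for line in rating_text]
print(ratings[0])
```

The timestamp field is dropped because only the user, the item, and the rating value are needed as input for the model.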

Next, we split the data into two parts, one for building the
model and one for testing the model.
Then we run the ALS algorithm to build/train a user-product
matrix model.
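A random split of this kind can be sketched in plain Python. The 80/20 proportion, the seed, and the synthetic ratings are assumptions for illustration. The notes do not say which proportions were used.

```python
import random

def random_split(records, train_fraction, seed):
    """Assign each record to the training or test set independently at random."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    training, test = [], []
    for record in records:
        (training if rng.random() < train_fraction else test).append(record)
    return training, test

# Synthetic (user, movie, rating) triples, made up for this sketch.
ratings = [(user, movie, float((user + movie) % 5 + 1))
           for user in range(10) for movie in range(10)]

training, test = random_split(ratings, train_fraction=0.8, seed=0)
print(len(training), len(test))
```

Every rating lands in exactly one of the two sets, so the model is never tested on a rating it was trained on.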

Next, we get predicted movie ratings for the test data by calling
model.predict with the test user ID and movie ID pairs as input.
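Conceptually, a matrix factorization model predicts a rating as the dot product of the user's and the movie's latent factor vectors. The sketch below shows that calculation in plain Python with made-up factor values. The dictionaries and the `predict` helper are hypothetical and are not the MLlib API.

```python
# Hypothetical learned factors with K = 2 latent features per user and movie.
user_factors = {1: [1.2, 0.5], 2: [0.3, 1.8]}
movie_factors = {10: [2.0, 1.0], 20: [0.5, 2.0]}

def predict(user, movie):
    """Predicted rating = dot product of the user and movie factor vectors."""
    return sum(u * m for u, m in zip(user_factors[user], movie_factors[movie]))

# Test input is just (user ID, movie ID) pairs; the true rating is left out.
test_pairs = [(1, 10), (2, 20)]
predictions = [(user, movie, predict(user, movie)) for user, movie in test_pairs]
print(predictions)
```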

Next, we compare the test ratings to the predicted ratings for each
test (user ID, movie ID) pair.

Here we create key-value pairs keyed by (user ID, movie ID), with the
rating as the value, so that we can join the test ratings with the
predicted ratings and compare them.
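The keying and joining step can be sketched with plain Python dictionaries. The sample numbers are made up, and the dictionary-based inner join is an analogy for a join between two key-value RDDs, not Spark code.

```python
# (user ID, movie ID, rating) triples; the values are made up for illustration.
test_ratings = [(1, 10, 4.0), (1, 20, 1.0), (2, 10, 5.0)]
predictions = [(1, 10, 3.7), (1, 20, 4.2), (3, 30, 2.0)]

def key_by_user_movie(rows):
    """Turn (user, movie, rating) triples into {(user, movie): rating}."""
    return {(user, movie): rating for user, movie, rating in rows}

test_by_key = key_by_user_movie(test_ratings)
pred_by_key = key_by_user_movie(predictions)

# Inner join: keep only keys present on both sides, pairing the two ratings.
joined = {
    key: (test_by_key[key], pred_by_key[key])
    for key in test_by_key.keys() & pred_by_key.keys()
}
print(joined)
```

Each joined value holds the test rating first and the predicted rating second, which makes the two directly comparable.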

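Here we compare the test ratings with the predicted ratings by filtering
for cases where the test rating is <= 1 and the predicted rating is
>= 4. These are false positives: movies the model predicted the user
would like but that the user actually rated poorly.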
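The filter can be sketched in plain Python over joined (test rating, predicted rating) pairs. The sample data is made up for illustration.

```python
# {(user ID, movie ID): (test rating, predicted rating)}; values are made up.
joined = {
    (1, 10): (4.0, 3.7),
    (1, 20): (1.0, 4.2),
    (2, 10): (1.0, 1.5),
    (2, 30): (5.0, 4.8),
}

# Keep pairs where the user rated the movie 1 or lower
# but the model predicted 4 or higher.
false_positives = {
    key: (actual, predicted)
    for key, (actual, predicted) in joined.items()
    if actual <= 1 and predicted >= 4
}
print(false_positives)
```

Only the pair for user 1 and movie 20 passes both conditions here. Counting such pairs gives a simple measure of how often the model is badly wrong.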

We register the DataFrame as a table. Registering it as a table
allows us to use it in subsequent SQL statements.
Now we can inspect the data.

https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark