Page1 © Hortonworks Inc. 2014
Advanced Analytics with Apache Spark
and Apache Zeppelin in HDP
Hortonworks. We do Hadoop.
Alex Zeltov
Solutions Engineer
@azeltov
Page2 © Hortonworks Inc. 2014
In this workshop
• Introduction to HDP and Spark
• Build a Data analytics application:
- Spark Programming: Scala, Python, R
- Core Spark: working with RDDs, DataFrames
- Spark SQL: structured data access
- Spark MLlib: predictive analytics
- Spark Streaming: real time data processing
• Develop Recommendation Engine - Using the “Collaborative
Filtering” method
• Conclusion and Q/A
Page3 © Hortonworks Inc. 2014
Introduction to HDP and Spark
http://hortonworks.com/hadoop/spark/
Page4 © Hortonworks Inc. 2014
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.3
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
• Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
• Deployment choice: Linux, Windows, on-premises, cloud
Page5 © Hortonworks Inc. 2014
Spark Components
Spark allows you to do data processing, ETL, machine learning,
stream processing, and SQL querying from a single framework
Page6 © Hortonworks Inc. 2014
Emerging Spark Patterns
• Spark as query federation engine
 Bring data from multiple sources to join/query in Spark
• Use multiple Spark libraries together
 Common to see Core, ML & SQL used together
• Use Spark with various Hadoop ecosystem projects
 Use Spark & Hive together
 Spark & HBase together
Page7 © Hortonworks Inc. 2014
More Data Sources APIs
18/03/2016
Page8 © Hortonworks Inc. 2014
Spark Deployment Modes
• Spark Standalone Cluster:
– For developing Spark apps against a local Spark cluster (similar to developing/deploying in an IDE)
• Spark on YARN, in two modes:
– Spark driver (SparkContext) in the client (yarn-client): the Spark Driver runs in the
client process outside of the YARN cluster, and the ApplicationMaster is only used to
negotiate resources from the ResourceManager
– Spark driver (SparkContext) in the YARN AM (yarn-cluster): the Spark Driver runs in an
ApplicationMaster spawned by a NodeManager on a slave node
Page9 © Hortonworks Inc. 2014
Spark on YARN
YARN RM
App Master
Monitoring UI
Page10 © Hortonworks Inc. 2014
Spark UI
Page11 © Hortonworks Inc. 2014
Interacting with Spark
Page12 © Hortonworks Inc. 2014
Interacting with Spark
• Spark’s interactive REPL shell (in Python or Scala)
• Web-based Notebooks:
• Zeppelin: A web-based notebook that enables interactive data
analytics.
• Jupyter: Evolved from the IPython Project
• SparkNotebook: forked from the scala-notebook
Page13 © Hortonworks Inc. 2014
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backend
• Multi-purpose Notebook is the place for all your
needs
 Data Ingestion
 Data Discovery
 Data Analytics
 Data Visualization
 Collaboration
Page14 © Hortonworks Inc. 2014
Zeppelin- Multiple language backend
Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.
Page15 © Hortonworks Inc. 2014
Zeppelin – Dependency Management
• Load libraries recursively from Maven repository
• Load libraries from local filesystem
• %dep
• // add maven repository
• z.addRepo("RepoName").url("RepoURL”)
• // add artifact from filesystem
• z.load("/path/to.jar")
• // add artifact from maven repository, with no dependency
• z.load("groupId:artifactId:version").excludeAll()
Page16 © Hortonworks Inc. 2014
Spark & Zeppelin Pace of Innovation
Spark GA releases in HDP:
• HDP 2.2.4 – Spark 1.2.1 GA
• HDP 2.3.0 – Spark 1.3.1 GA
• HDP 2.3.2 – Spark 1.4.1 GA
• HDP 2.3.4 – Spark 1.5.2* GA (Dec 2015)
• HDP 2.4.0 – Spark 1.6 GA (March 1st 2016)
• HDP 2.5.x – Spark 1.6.1* GA (Q1, 2016)
Spark tech previews:
• Spark 1.3.1 TP – 5/2015
• Spark 1.4.1 TP – 8/2015
• Spark 1.5.1 TP – Nov/2015
• Spark 1.6 TP – Jan/2015 (last awareness session)
Apache Zeppelin:
• Zeppelin TP – Oct/2015
• Zeppelin TP refresh – March 1st 2016
• Zeppelin GA – Q1, 2016
Page17 © Hortonworks Inc. 2014
Spark in HDP customer base - 2015
[Chart: unique number of customers filing Spark tickets/questions per quarter, Q1–Q4 2015; y-axis 0–70]
Customers that filed Spark tickets in 2015: 132
Page18 © Hortonworks Inc. 2014
Programming Spark
Page19 © Hortonworks Inc. 2014
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs
and performing calculations upon them
• Persistence
• Long-term storage of an RDD’s state
Page20 © Hortonworks Inc. 2014
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
• You construct RDDs
» by parallelizing existing collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
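For reference, a minimal sketch of these three construction methods in the Scala shell (the HDFS path is a placeholder):
// 1. Parallelize an existing collection
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
// 2. Transform an existing RDD into a new one
val doubled = nums.map(_ * 2)
// 3. Load from files in HDFS (or any other Hadoop-supported storage)
val lines = sc.textFile("hdfs:///tmp/sample.txt")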
Page21 © Hortonworks Inc. 2014
RDDs
• Programmer specifies the number of partitions for an RDD (a default value is used if unspecified)
• More partitions = more parallelism
[Diagram: an RDD of 25 items split into 5 partitions, distributed across 3 workers, each running a Spark executor]
Page22 © Hortonworks Inc. 2014
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformed RDD is executed when action runs on it
• Persist (cache) RDDs in memory or disk
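A small sketch of this laziness in the Scala shell (the path is illustrative): the transformations and cache() only record lineage; the count() action triggers the actual work.
val lines  = sc.textFile("hdfs:///tmp/sample.log")   // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))       // transformation: still lazy
errors.cache()                                       // mark for in-memory persistence
val n = errors.count()                               // action: triggers the computation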
Page23 © Hortonworks Inc. 2014
Example RDD Transformations
•map(func)
•filter(func)
•distinct()
• All create a new dataset (RDD) from an existing one
• The new dataset is not computed until an action is performed (lazy)
• Each element in an RDD is passed to the target function and the
result forms a new RDD
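A quick sketch of these transformations against a small parallelized collection (the sample data is made up):
val words       = sc.parallelize(Seq("spark", "hive", "spark", "hbase"))
val upper       = words.map(_.toUpperCase)          // map(func)
val sWords      = words.filter(_.startsWith("s"))   // filter(func)
val uniqueWords = words.distinct()                  // distinct()
// Nothing has run yet – these execute only when an action such as uniqueWords.collect() is called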
Page24 © Hortonworks Inc. 2014
Example Action Operations
•count()
•reduce(func)
•collect()
•take()
• Either:
• Returns a value to the driver program
• Exports state to external system
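The same actions sketched against a small parallelized collection:
val nums = sc.parallelize(1 to 10)
nums.count()         // 10  – returns a value to the driver
nums.reduce(_ + _)   // 55  – aggregates with the given function
nums.collect()       // Array(1, 2, ..., 10) – brings the whole RDD to the driver
nums.take(3)         // Array(1, 2, 3)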
Page25 © Hortonworks Inc. 2014
Example Persistence Operations
•persist() -- takes options
•cache() -- only one option: in-memory
• Stores RDD Values
• in memory (what doesn’t fit is recalculated when necessary)
• Replication is an option for in-memory
• to disk
• blended
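A sketch of the persistence options (an RDD's storage level can only be set once, so the alternatives below use separate RDDs; the path is a placeholder):
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///tmp/sample.log")
logs.cache()                                        // shorthand for persist(StorageLevel.MEMORY_ONLY)

val lowered = logs.map(_.toLowerCase)
lowered.persist(StorageLevel.MEMORY_AND_DISK)       // blended: spills to disk when memory is full

val nonEmpty = logs.filter(_.nonEmpty)
nonEmpty.persist(StorageLevel.MEMORY_ONLY_2)        // in-memory, replicated on two executors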
Page26 © Hortonworks Inc. 2014
Spark Applications
Are a definition in code of
• RDD creation
• Actions
• Persistence
Results in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each Stage is executed as a series of Tasks
• Each Task operates in parallel on assigned partitions
Page27 © Hortonworks Inc. 2014
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
• are automatically created and exposed as the variables 'sc', 'sqlContext' and 'z', respectively, in both the Scala and Python environments in Zeppelin
• IPython and standalone programs must use a constructor to create a new SparkContext
Note: the Scala and Python environments share the same SparkContext, SQLContext and ZeppelinContext instances.
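Outside Zeppelin, a standalone application builds its own context. A minimal sketch (the application name and master setting are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
    val sc   = new SparkContext(conf)

    val data = sc.parallelize(1 to 100)
    println(data.sum())

    sc.stop()
  }
}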
Page28 © Hortonworks Inc. 2014
1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")
v.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _, 3)
 .collect()

Lineage: textFile → flatMap → map → reduceByKey → collect
Types along the graph: RDD[String], RDD[List[String]], RDD[(String, Int)], RDD[(String, Int)], Array[(String, Int)]
Page29 © Hortonworks Inc. 2014
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
Page30 © Hortonworks Inc. 2014
Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
MappedRDD[1] at textFile at <console>:12
HadoopRDD[0] at textFile at <console>:12
Page31 © Hortonworks Inc. 2014
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions as can
be seen in the code:
flatMap( line => line.split(" ") )
Here, line is the argument to the function and line.split(" ") is the body of the function.
Page32 © Hortonworks Inc. 2014
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )
flatMap((line:String) => line.split(" "))
flatMap(_.split(" "))
32
Argument to the
function (type inferred)
Body of the function
Argument to the
function (explicit type)
Body of the
function
No Argument to the
function declared
(placeholder) instead
Body of the function includes placeholder _ which allows for exactly one use of
one arg for each _ present. _ essentially means ‘whatever you pass me’
Page33 © Hortonworks Inc. 2014
And Finally – the Formal ‘def’
def myFunc(line:String): Array[String]={
return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I’m home.").foreach(println)
Argument to the function: line: String
Return type of the function: Array[String]
Body of the function: return line.split(",")
Page34 © Hortonworks Inc. 2014
LAB: Spark RDD & Data Frames Demo –
Philly Crime Data Set
http://sandbox.hortonworks.com:8081/#/notebook/2B6HKTZDK
Page35 © Hortonworks Inc. 2014
Spark DataFrames
Page36 © Hortonworks Inc. 2014
What are DataFrames?
• Distributed collection of data organized in columns
• Equivalent to tables in databases or data frames in R/Python
• Much richer optimization than any other implementation of DF
• Can be constructed from a wide variety of sources and APIs
Why DataFrames?
• Greater accessibility
• Declarative rather than imperative
• Catalyst Optimizer
Page37 © Hortonworks Inc. 2014
Writing a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select ("First Name").show()
df.select("First Name","Age").show()
df.filter(df("age")>40).show()
df.groupBy("age").count().show()
Page38 © Hortonworks Inc. 2014
Querying RDD Using SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// column names for the people.txt rows (e.g. name and age)
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

val people = sc.textFile("/tmp/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")

val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
Page39 © Hortonworks Inc. 2014
Querying RDD Using SQL
// SQL statements can be run on RDDs registered as tables
val teenagers =
  sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ)
val teenagers =
  people.where('age >= 10).where('age <= 19).select('name)
Page40 © Hortonworks Inc. 2014
Dataframes for Apache Spark
[Chart: time to aggregate 10 million integer pairs (in seconds) for DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python and RDD Scala]
DataFrames can be significantly faster than RDDs. And they perform the same, regardless of language.
Page41 © Hortonworks Inc. 2014
Dataframes – Transformations & Actions
Transformations: filter, select, drop, join
Actions: count, collect, show, take
Transformations contribute to the query plan but nothing is executed until an action is called
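A short sketch of that behavior, reusing the people DataFrame from the earlier slide (assuming it has name and age fields): select and filter only build the plan; show() and count() execute it.
val df = sqlContext.jsonFile("/tmp/people.json")

val plan = df.select("name", "age")   // transformation
             .filter(df("age") > 21)  // transformation

plan.show()    // action: executes the plan and prints rows
plan.count()   // action: executes and returns the row count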
Page42 © Hortonworks Inc. 2014
LAB: DataFrames
http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7
http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM
DataFrames + SQL
DataFrames JSON
Page43 © Hortonworks Inc. 2014
DataFrames and JDBC
val jdbc_attendees = sqlContext.load("jdbc", Map("url" ->
"jdbc:mysql://localhost:3306/db1?user=root&password=xxx","dbtable" -> "attendees"))
jdbc_attendees.show()
jdbc_attendees.count()
jdbc_attendees.registerTempTable("jdbc_attendees")
val countall = sqlContext.sql("select count(*) from jdbc_attendees")
countall.map(t=>"Records count is "+t(0)).collect().foreach(println)
Page44 © Hortonworks Inc. 2014
Code ‘select count’
Equivalent SQL Statement:
SELECT COUNT(*) FROM pagecounts WHERE state = 'FL'
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line =>
line.contains("fl")).count()
scala> println(numFL)
1. Load the page as an RDD
2. Filter the lines of the page
eliminating any that do not
contain “fl“
3. Count those lines that
remain
4. Print the value of the
counted lines containing ‘fl’
Page45 © Hortonworks Inc. 2014
Spark SQL
Page46 © Hortonworks Inc. 2014
Platform APIs
• Joining Data from Different
Sources
• Access Data using DataFrames /
SQL
Page47 © Hortonworks Inc. 2014
Platform APIs
• Community Plugins
• 100+ connectors
http://spark-packages.org/
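As one illustration (not from the original slides), the widely used spark-csv package from spark-packages.org can be used like this; the coordinates, version and path are examples only, and the package must already be on the classpath (e.g. the shell started with --packages com.databricks:spark-csv_2.10:1.5.0):
val flights = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds column names
  .option("inferSchema", "true")
  .load("/tmp/flights.csv")

flights.printSchema()
flights.registerTempTable("flights")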
Page48 © Hortonworks Inc. 2014
LAB: JDBC and 3rd party packages
http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
Page49 © Hortonworks Inc. 2014
What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
Page50 © Hortonworks Inc. 2014
More Integration With Hive:
scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT
5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
Page51 © Hortonworks Inc. 2014
ORC at Spotify
• IO: 16x less HDFS read when using ORC versus Avro.
• CPU: 32x less CPU when using ORC versus Avro.
Page52 © Hortonworks Inc. 2014
LAB: HIVE ORC
http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
Page53 © Hortonworks Inc. 2014
Spark Streaming
Page54 © Hortonworks Inc. 2014
MicroBatch Spark Streams
Page55 © Hortonworks Inc. 2014
Physical Execution
Page56 © Hortonworks Inc. 2014
Spark Streaming 101
• Spark has significant library support for streaming applications
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
• Allows combining Streaming with batch/ETL, SQL & ML
• Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources
• Chop the input data stream into batches
• Spark processes the batches & publishes results in batches
• The fundamental unit is the Discretized Stream (DStream)
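A minimal DStream sketch (socket source, host and port are illustrative): word counts over 5-second batches.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()         // print a few elements of each batch

ssc.start()            // start receiving and processing
ssc.awaitTermination()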
Page57 © Hortonworks Inc. 2014
Spark MLLib
Page58 © Hortonworks Inc. 2014
Spark MLlib – Algorithms Offered
• Classification: logistic regression, linear SVM,
– naïve Bayes, least squares, classification tree
• Regression: generalized linear models (GLMs),
– regression tree
• Collaborative filtering: alternating least squares (ALS),
– non-negative matrix factorization (NMF)
• Clustering: k-means
• Decomposition: SVD, PCA
• Optimization: stochastic gradient descent, L-BFGS
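As a small taste of the MLlib API, a k-means sketch (the input file of space-separated doubles is a placeholder):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data   = sc.textFile("/tmp/kmeans_data.txt")
val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(parsed, 2, 20)   // k = 2 clusters, 20 iterations
println("Within Set Sum of Squared Errors = " + model.computeCost(parsed))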
Page59 © Hortonworks Inc. 2014
ML - Pipelines
• New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600], Bisecting KMeans [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for Linear Models [SPARK-7685]
• New transformers (close to parity with scikit-learn): CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455]
• Calling into single-machine solvers (coming soon as a package)
Page60 © Hortonworks Inc. 2014
Twitter Language Classifier
Goal: connect to the real-time Twitter stream and print only
those tweets whose language matches our chosen language.
Main issue: how to detect the language at run time?
Solution: build a language-classifier model offline that can detect the
language of a tweet (MLlib). Then apply it to the real-time
Twitter stream and do the filtering (Spark Streaming).
Page61 © Hortonworks Inc. 2014
Spark External Datasources
Page62 © Hortonworks Inc. 2014
Spark External Datasources
You can load datasets from various external sources:
• Local Filesystem
• HDFS
• HDFS using custom InputFormat
• Amazon S3
• Relational Databases (RDBMS)
• Apache Cassandra, Mongo DB, etc.
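A sketch of how the source is picked by URI scheme (paths, bucket names and credentials are placeholders; RDBMS and NoSQL sources go through the connectors shown elsewhere in this deck):
val localFile = sc.textFile("file:///tmp/local_data.txt")
val hdfsFile  = sc.textFile("hdfs:///user/data/input.txt")
val s3File    = sc.textFile("s3n://my-bucket/input.txt")   // needs AWS keys in the Hadoop configuration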
Page63 © Hortonworks Inc. 2014
LABS: data load from MongoDB or Cassandra
Page64 © Hortonworks Inc. 2014
Recommendation Engine - ALS
Page65 © Hortonworks Inc. 2014
Step 1: Data Ingest
• Using the MovieLens 10M data set
• http://grouplens.org/datasets/movielens/
• Ratings: UserID::MovieID::Rating::Timestamp
• 10,000,000 ratings on 10,000 movies by 72,000 users
• ratings.dat.gz
• Movies: MovieID::Title::Genres
• 10,000 movies
• movies.dat
Page66 © Hortonworks Inc. 2014
Step 1: Data Ingest
• Some simple python code followed by the creation of the first RDD
import sys
import os

baseDir = os.path.join('movielens')
ratingsFilename = os.path.join(baseDir, 'ratings.dat.gz')
moviesFilename = os.path.join(baseDir, 'movies.dat')

numPartitions = 2
rawRatings = sc.textFile(ratingsFilename).repartition(numPartitions)
rawMovies = sc.textFile(moviesFilename)
Page67 © Hortonworks Inc. 2014
Step 2: Feature Extraction
• Transform the string data in tuples of useful data and remove
unwanted pieces
• Ratings: UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109 => [(1, 1193, 5.0), (1, 914, 3.0), …]
• Movies: MovieID::Title::Genres
1::Toy Story (1995):: Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
=> [(1, 'Toy Story (1995)'), (2, u'Jumanji (1995)'), …]
Page68 © Hortonworks Inc. 2014
Step 2: Feature Extraction
def get_ratings_tuple(entry):
    items = entry.split('::')
    return int(items[0]), int(items[1]), float(items[2])

def get_movie_tuple(entry):
    items = entry.split('::')
    return int(items[0]), items[1]

ratingsRDD = rawRatings.map(get_ratings_tuple).cache()
moviesRDD = rawMovies.map(get_movie_tuple).cache()
Page69 © Hortonworks Inc. 2014
Step 2: Feature Extraction
• Inspect an RDD using collect()
• Careful: make sure the whole dataset fits in the memory of the driver
[Diagram: the driver submits the job; tasks run on the executors and results come back to the driver]
• Use take(num)
• Safer: takes only a num-size subset

print 'Ratings: %s' % ratingsRDD.take(2)
Ratings: [(1, 1193, 5.0), (1, 914, 3.0)]

print 'Movies: %s' % moviesRDD.take(2)
Movies: [(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)')]
Page70 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• Recommend movies with the highest average rating
• Need a tuple containing the movie name and its average rating
• Only consider movies with at least 500 ratings
• The tuple must contain the number of ratings for the movie
• The tuple we need should be of the following form:
( averageRating, movieName, numberOfRatings )
Page71 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• Calculate the average rating of a movie
• From the ratingsRDD, we create tuples containing all the ratings for a movie:
– Remember: ratingsRDD = (UserID, MovieID, Rating)
movieIDsWithRatingsRDD = (ratingsRDD
.map(lambda (user_id,movie_id,rating): (movie_id,[rating]))
.reduceByKey(lambda a,b: a+b))
• This is simple map-reduce in Spark:
• Map: (UserID, MovieID, Rating) => (MovieID, [Rating])
• Reduce: (MovieID1, [Rating1]), (MovieID1, [Rating2]) => (MovieID1, [Rating1,Rating2])
Page72 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• Next map the data to an RDD with average and number of ratings
def getCountsAndAverages(RatingsTuple):
    total = 0.0
    for rating in RatingsTuple[1]:
        total += rating
    return ( RatingsTuple[0],
             ( len(RatingsTuple[1]), total / len(RatingsTuple[1]) ) )

movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages)

• Note that the new key-value tuples have MovieID as key and a nested tuple (ratings, average) as value: [ (2, (332, 3.174698795180723) ), … ]
Page73 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• Only the movie name is still missing from the tuple
• The name of the movie was not present in the ratings data. It must
be joined in from the movie data
movieNameWithAvgRatingsRDD = ( moviesRDD
.join(movieIDsWithAvgRatingsRDD)
.map(lambda ( movieid,(name,(ratings, average)) ):
(average, name, ratings)) )
• The join creates tuples that still contain the movieID and ends up nested three deep:
(Key , (Left_Value, Right_value) )
• A simple map() solves that problem and produces the tuple we need
Page74 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• The RDD now contains tuples of the correct form
print movieNameWithAvgRatingsRDD.take(3)
[
(3.68181818181818, 'Happiest Millionaire, The (1967)', 22),
(3.04682274247491, 'Grumpier Old Men (1995)', 299),
(2.88297872340425, 'Hocus Pocus (1993)', 94)
]
Page75 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• Now we can easily filter out all the movies with less than 500 ratings,
sort the RDD by average rating and show the top 20
movieLimitedAndSortedByRatingRDD = ( movieNameWithAvgRatingsRDD
    .filter(lambda (average, name, ratings): ratings > 500)
    .sortBy(sortFunction, ascending=False)
)
Page76 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
• sortFunction makes sure the tuples are sorted using both key and value, which ensures a consistent sort even if a key appears more than once

def sortFunction(tuple):
    key = unicode('%.3f' % tuple[0])
    value = tuple[1]
    return (key + ' ' + value)
Page77 © Hortonworks Inc. 2014
Step 3: Create Model – The naïve approach
print 'Movies with highest ratings: %s' %
movieLimitedAndSortedByRatingRDD.take(20)
Movies with highest ratings: [
1447),
Page78 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• The naïve approach will recommend
the same movies to everybody,
regardless of their personal
preferences.
• Collaborative Filtering will look for
people with similar tastes and use
their ratings to give recommendations
fit to your personal preferences.
Image from Wikipedia:
https://en.wikipedia.org/wiki/Collaborative_filtering
Page79 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• We have a matrix where every row is the ratings for one user for all
movies in the database.
• Since not every user rated every movie, this matrix is incomplete.
• Predicting the missing ratings is exactly what we need to do in order
to give the user good recommendations
• The algorithm that is usually applied to solve recommendation
problems is “Alternating Least Squares” which takes an iterative
approach to finding the missing values in the matrix.
• Spark’s MLlib has a module for Alternating Least Squares recommendation, aptly called “ALS”
Page80 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Machine Learning workflow: split the full dataset into a training set, a validation set and a test set; train the model on the training set, tune it against the validation set, check accuracy (over-fitting test) on the test set, then use the model for prediction
Page81 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Randomly split the dataset we have in multiple groups for training,
validating and testing using randomSplit(weights, seed=None)
trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)

print 'Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(),
                                                    validationRDD.count(),
                                                    testRDD.count())
Training: 292716, validation: 96902, test: 98032
Page82 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Before we start training the model, we need a way to calculate how
good a model is, so we can compare it against other tries
• Root Mean Square Error (RMSE) is often used to compute the error of
a model
• RMSE compares the values predicted by the trained model with
the real values present in the validation set. By squaring the
differences, averaging those squared values, and taking the square
root of that average, we get a single number that represents the
error of the model
Page83 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
import math

def computeError(predictedRDD, actualRDD):
    predictedReformattedRDD = (predictedRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    actualReformattedRDD = (actualRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    squaredErrorsRDD = (predictedReformattedRDD
        .join(actualReformattedRDD)
        .map(lambda (k, (a, b)): math.pow((a - b), 2)))
    totalError = squaredErrorsRDD.reduce(lambda a, b: a + b)
    numRatings = squaredErrorsRDD.count()
    return math.sqrt(float(totalError) / numRatings)
Page84 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Create a trained model using the ALS.train() method from Spark MLlib
• Rank is the most important parameter to tune
• The number of rows and columns in the matrix used
• A lower rank will mean higher error, a high rank may lead to overfitting
ALS.train(
trainingRDD,
rank, # We’ll try 3 ranks: 4, 8, 12
seed = 5L,
iterations = 5,
lambda_ = 0.1
)
Page85 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Use the trained model to predict the missing ratings in the validation set
• Create a new RDD from the validation set where the ratings are removed
• Call the predictAll() method using the trained model on that RDD

validationForPredictRDD = validationRDD.map(
    lambda (UserID, MovieID, Rating): (UserID, MovieID))

predictedRatingsRDD = model.predictAll(validationForPredictRDD)
Page86 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Finally use our computeError() method to calculate the error of our
trained model by comparing the predicted ratings with the real ones
error = computeError(predictedRatingsRDD, validationRDD)
Page87 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Import the ALS module, create the “empty” validation RDD for prediction and set up some variables

from pyspark.mllib.recommendation import ALS

validationForPredictRDD = (validationRDD
    .map(lambda (UserID, MovieID, Rating): (UserID, MovieID)))

ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
minError = float('inf')
bestRank = -1
bestIteration = -1
Page88 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=5L,
                      iterations=5, lambda_=0.1)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < minError:
        minError = error
        bestRank = rank
Page89 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• The model that was trained with rank 8 has the lowest error (RMSE)
print 'The best model was trained with rank %s' % bestRank
For rank 4 the RMSE is 0.892734779484
For rank 8 the RMSE is 0.890121292255
For rank 12 the RMSE is 0.890216118367
The best model was trained with rank 8
Page90 © Hortonworks Inc. 2014
Step 4: Test Model
• So we have now found the best model, but now we still need to test
if the model is actually good
• Testing using the same validation set is not a good test since it may
leave us vulnerable to overfitting
• The model is so fit to the validation set, that it only produces good results for that set
• This is why we split off a test set at the start of the Machine
Learning process
• We will use the best rank result we obtained to train a model and
then predict the ratings for the test set
• Calculating the RMSE for the test set predictions should tell us if our
model is usable
Page91 © Hortonworks Inc. 2014
Step 4: Test Model
• We recreate the model, remove all the ratings present in the test set
and run the predictAll() method
myModel = ALS.train(trainingRDD, 8, seed=5L,
                    iterations=5, lambda_=0.1)

testForPredictingRDD = testRDD.map(
    lambda (UserID, MovieID, Rating): (UserID, MovieID))

predictedTestRDD = myModel.predictAll(testForPredictingRDD)

testRMSE = computeError(testRDD, predictedTestRDD)
Page92 © Hortonworks Inc. 2014
Step 4: Test Model
• The RMSE is good. Our model does not suffer from overfitting and is
usable.
• The RMSE of the validation set was 0.890121292255, only slightly better
print 'The model had a RMSE on the test set of %s' %
testRMSE
The model had a RMSE on the test set of 0.891048561304
Page93 © Hortonworks Inc. 2014
Step 5: Use the model
• Let’s get some movie predictions!
• First I need to give the data set some ratings so it has something to deduce my taste
myRatedMovies = [ # Rating
(0, 845,5.0), # Blade Runner (1982) - 5.0/5
(0, 789,4.5), # Good Will Hunting (1997) - 4.5/5
(0, 983,4.8), # Christmas Story, A (1983) - 4.8/5
(0, 551,2.0), # Taxi Driver (1976) - 2.0/5
(0,1039,2.0), # Pulp Fiction (1994) - 2.0/5
(0, 651,5.0), # Dr. Strangelove (1963) - 5.0/5
(0,1195,4.0), # Raiders of the Lost Ark (1981) - 4.0/5
(0,1110,5.0), # Sixth Sense, The (1999) - 4.5/5
(0,1250,4.5), # Matrix, The (1999) - 4.5/5
(0,1083,4.0) # Princess Bride, The (1987) - 4.0/5
]
myRatingsRDD = sc.parallelize(myRatedMovies)
Page94 © Hortonworks Inc. 2014
Step 5: Use the model
• Then we add my ratings to the data set
• since we now have more ratings, let’s train our model again
• and make sure the RMSE is still OK (re-using the test set RDDs from the previous step)
trainingWithMyRatingsRDD = myRatingsRDD.union(trainingRDD)
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, 8,
seed=5L, iterations=5, lambda_=0.1)
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(testRDD,
predictedTestMyRatingsRDD)
Page95 © Hortonworks Inc. 2014
Step 5: Use the model
• And of course, check the RMSE again... We’re good
print 'The model had a RMSE on the test set of %s' %
testRMSEMyRatings
The model had a RMSE on the test set of 0.892023318284
Page96 © Hortonworks Inc. 2014
Step 5: Use the model
• Now we need an RDD with only the movies I did not rate, to run
predictAll() on. (my userid is set to zero)
• [(0, movieID1), (0, movieID2), (0, movieID3), …]
myUnratedMoviesRDD = (moviesRDD
    .map(lambda (movieID, name): movieID)
    .filter(lambda movieID: movieID not in [mine[1] for mine in myRatedMovies])
    .map(lambda movieID: (0, movieID)))

predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)
Page97 © Hortonworks Inc. 2014
Step 5: Use the model
• From the predicted RDD, get the top 20 predicted ratings, but only
for movies that had at least 75 ratings in total
• Re-use the RDD we created in the naïve approach that had the average ratings and
number of ratings. (movieIDsWithAvgRatingsRDD)
• Map it to tuples of form (movieID, number_of_ratings)
• Strip the userid from the predicted RDD
• Map it to tuples (movieID, predicted_rating)
• Join those two and add the movie names from the original movies
data and clean up the result
• The resulting tuple is (predicted_rating, name, number_of_ratings)
• Filter out all movies that had less than 75 ratings
Page98 © Hortonworks Inc. 2014
Step 5: Use the model
movieCountsRDD = (movieIDsWithAvgRatingsRDD
    .map(lambda (movie_id, (ratings, average)): (movie_id, ratings)))

predictedRDD = (predictedRatingsRDD
    .map(lambda (uid, movie_id, rating): (movie_id, rating)))

predictedWithCountsRDD = (predictedRDD.join(movieCountsRDD))

ratingsWithNamesRDD = (predictedWithCountsRDD
    .join(moviesRDD)
    .map(lambda (movie_id, ((pred, ratings), name)): (pred, name, ratings))
    .filter(lambda (pred, name, ratings): ratings > 75))
Page99 © Hortonworks Inc. 2014
Step 5: Use the model
• And finally get the top 20 recommended movies for myself
predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])

print ('My highest rated movies as predicted:\n%s' %
       '\n'.join(map(str, predictedHighestRatedMovies)))
Page100 © Hortonworks Inc. 2014
Step 5: Use the model
My highest rated movies as predicted:
(4.823536053603062, 'Once Upon a Time in the West (1969)', 82)
(4.743456934724456, 'Texas Chainsaw Massacre, The (1974)', 111)
(4.452221024980805, 'Evil Dead II (Dead By Dawn) (1987)', 305)
(4.387531237859994, 'Duck Soup (1933)', 279)
(4.373821653377477, 'Citizen Kane (1941)', 527)
(4.344480264132989, 'Cabin Boy (1994)', 95)
(4.332264360095111, 'Shaft (1971)', 85)
(4.217371529794628, 'Night of the Living Dead (1968)', 352)
(4.181318251399025, 'Yojimbo (1961)', 110)
(4.171790272807383, 'Naked Gun: From the Files of Police Squad', 435)
…
Apache Spark on HDP 2.3
[Diagram: a user submits ratings from the Spark shell; the model is trained and persisted, predictions are returned as recommendations, and new ratings are used to improve the model]
Page102 © Hortonworks Inc. 2014
Conclusion and Q&A
Page103 © Hortonworks Inc. 2014
Learn More Spark + Hadoop Perfect Together
HDP Spark General Info:
http://hortonworks.com/hadoop/spark/
Learn more about our Focus on Spark:
http://hortonworks.com/hadoop/spark/#section_6
Get the HDP Spark 1.5.1 Tech Preview:
http://hortonworks.com/hadoop/spark/#section_5
Get started with Spark and Zeppelin and download the Sandbox:
http://hortonworks.com/sandbox
Try these tutorials:
http://hortonworks.com/hadoop/spark/#tutorials
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Learn more about GeoSpatial Spark processing with Magellan:
http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
Spark Advanced Analytics NJ Data Science Meetup - Princeton University

  • 1.
    Page1 © HortonworksInc. 2014 Advanced Analytics with Apache Spark and Apache Zeppelin in HDP Hortonworks. We do Hadoop. Alex Zeltov Solutions Engineer @azeltov
  • 2.
    Page2 © HortonworksInc. 2014 In this workshop • Introduction to HDP and Spark • Build a Data analytics application: - Spark Programming: Scala, Python, R - Core Spark: working with RDDs, DataFrames - Spark SQL: structured data access - Spark MlLib: predictive analytics - Spark Streaming: real time data processing • Develop Recommendation Engine - Using the “Collaborative Filtering” method • Conclusion and Q/A
  • 3.
    Page3 © HortonworksInc. 2014 Introduction to HDP and Spark http://hortonworks.com/hadoop/spark/
  • 4.
    Page4 © HortonworksInc. 2014 Spark is certified as YARN Ready and is a part of HDP. Hortonworks Data Platform 2.3 GOVERNANCE OPERATIONSBATCH, INTERACTIVE & REAL-TIME DATA ACCESS YARN: Data Operating System (Cluster Resource Management) MapReduce Apache Falcon Apache Sqoop Apache Flume Apache Kafka ApacheHive ApachePig ApacheHBase ApacheAccumulo ApacheSolr ApacheSpark ApacheStorm 1 • • • • • • • • • • • • • • • • • • • • • • • HDFS (Hadoop Distributed File System) Apache Ambari Apache ZooKeeper Apache Oozie Deployment Choice Linux Windows On-premises Cloud Apache Atlas Cloudbreak SECURITY Apache Ranger Apache Knox Apache Atlas HDFS Encryption ISVEngines
  • 5.
    Page5 © HortonworksInc. 2014 Spark Components Spark allows you to do data processing, ETL, machine learning, stream processing, SQL querying from one framework
  • 6.
    Page6 © HortonworksInc. 2014 Emerging Spark Patterns • Spark as query federation engine  Bring data from multiple sources to join/query in Spark • Use multiple Spark libraries together  Common to see Core, ML & Sql used together • Use Spark with various Hadoop ecosystem projects  Use Spark & Hive together  Spark & HBase together
  • 7.
    Page7 © HortonworksInc. 2014 More Data Sources APIs 18/03/2016
  • 8.
    Page8 © HortonworksInc. 2014 Spark Deployment Modes • Spark Standalone Cluster: – For developing Spark apps against a local Spark (similar to develop/deploying in IDE) • Spark on YARN in two modes: – Spark driver (SparkContext) in local (yarn-client): Spark Driver runs in the client process outside of YARN cluster, and ApplicationMaster is only used to negotiate resources from Resoure manager – Spark driver (SparkContext) in YARN AM(yarn-cluster): Spark Driver runs in ApplicationMaster spawned by NodeManager on a slave node
  • 9.
    Page9 © HortonworksInc. 2014 Spark on YARN YARN RM App Master Monitoring UI
  • 10.
    Page10 © HortonworksInc. 2014 Spark UI
  • 11.
    Page11 © HortonworksInc. 2014 Interacting with Spark
  • 12.
    Page12 © HortonworksInc. 2014 Interacting with Spark • Spark’s interactive REPL shell (in Python or Scala) • Web-based Notebooks: • Zeppelin: A web-based notebook that enables interactive data analytics. • Jupyter: Evolved from the IPython Project • SparkNotebook: forked from the scala-notebook
  • 13.
    Page13 © HortonworksInc. 2014 Apache Zeppelin • A web-based notebook that enables interactive data analytics. • Multiple language backend • Multi-purpose Notebook is the place for all your needs  Data Ingestion  Data Discovery  Data Analytics  Data Visualization  Collaboration
  • 14.
    Page14 © HortonworksInc. 2014 Zeppelin- Multiple language backend Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown and Shell.
  • 15.
    Page15 © HortonworksInc. 2014 Zeppelin – Dependency Management • Load libraries recursively from Maven repository • Load libraries from local filesystem • %dep • // add maven repository • z.addRepo("RepoName").url("RepoURL”) • // add artifact from filesystem • z.load("/path/to.jar") • // add artifact from maven repository, with no dependency • z.load("groupId:artifactId:version").excludeAll()
  • 16.
    Page16 © HortonworksInc. 2014 Spark & Zeppelin Pace of Innovation HDP 2.2.4 Spark 1.2.1 GA HDP 2.3.2 Spark 1.4.1 GA HDP 2.3.0 Spark 1.3.1 GA HDP 2.3.4 Spark 1.5.2* GA Spark Spark 1.3.1 TP 5/2015 Spark 1.4.1 TP 8/2015 Spark 1.5.1 TP Nov/2015 Now Zeppelin TP Oct/2015 Apache Zeppelin Zeppelin TP Refresh March 1st 2016 Dec 2015 HDP 2.4.0 Spark 1.6 GA Zeppelin GA Q1, 2016 Last Awareness Session Spark 1.6 TP Jan/2015 March 1st 2016 HDP 2.5.x Spark 1.6.1* GA Q1, 2016
  • 17.
    Page17 © HortonworksInc. 2014 Spark in HDP customer base - 2015 0 10 20 30 40 50 60 70 Q1 Q2 Q3 Q4 Unique # of customers filing Spark tickets/Qs Customers that filed Spark tickets in 2015 132
  • 18.
    Page18 © HortonworksInc. 2014 Programming Spark
  • 19.
    Page19 © HortonworksInc. 2014 How Does Spark Work? • RDD • Your data is loaded in parallel into structured collections • Actions • Manipulate the state of the working model by forming new RDDs and performing calculations upon them • Persistence • Long-term storage of an RDD’s state
  • 20.
    Page20 © HortonworksInc. 2014 Resilient Distributed Datasets • The primary abstraction in Spark » Immutable once constructed » Track lineage information to efficiently recompute lost data » Enable operations on collection of elements in parallel • You construct RDDs » by parallelizing existing collections (lists) » by transforming an existing RDDs » from files in HDFS or any other storage system
  • 21.
    Page21 © HortonworksInc. 2014 item-1 item-2 item-3 item-4 item-5 item-6 item-7 item-8 item-9 item-10 item-11 item-12 item-13 item-14 item-15 item-16 item-17 item-18 item-19 item-20 item-21 item-22 item-23 item-24 item-25 more partitions = more parallelism Worker Spark executor Worker Spark executor Worker Spark executor RDDs • Programmer specifies number of partitions for an RDD (Default value used if unspecified) RDD split into 5 partitions
  • 22.
    Page22 © HortonworksInc. 2014 RDDs • Two types of operations:transformations and actions • Transformations are lazy (not computed immediately) • Transformed RDD is executed when action runs on it • Persist (cache) RDDs in memory or disk
  • 23.
    Page23 © HortonworksInc. 2014 Example RDD Transformations •map(func) •filter(func) •distinct(func) • All create a new DataSet from an existing one • Do not create the DataSet until an action is performed (Lazy) • Each element in an RDD is passed to the target function and the result forms a new RDD
  • 24.
    Page24 © HortonworksInc. 2014 Example Action Operations •count() •reduce(func) •collect() •take() • Either: • Returns a value to the driver program • Exports state to external system
  • 25.
    Page25 © HortonworksInc. 2014 Example Persistence Operations •persist() -- takes options •cache() -- only one option: in-memory • Stores RDD Values • in memory (what doesn’t fit is recalculated when necessary) • Replication is an option for in-memory • to disk • blended
  • 26.
    Page26 © HortonworksInc. 2014 Spark Applications Are a definition in code of • RDD creation • Actions • Persistence Results in the creation of a DAG (Directed Acyclic Graph) [workflow] • Each DAG is compiled into stages • Each Stage is executed as a series of Tasks • Each Task operates in parallel on assigned partitions
  • 27.
    Page27 © HortonworksInc. 2014 Spark Context • A Spark program first creates a SparkContext object • Tells Spark how and where to access a cluster • Use SparkContext to create RDDs • SparkContext, SQLContext, ZeppelinContext: • are automatically created and exposed as variable names 'sc', 'sqlContext' and 'z', respectively, both in scala and python environments using Zeppelin • iPython and programs must use a constructor to create a new SparkContext Note: that scala / python environment shares the same SparkContext, SQLContext, ZeppelinContext instance.
  • 28.
    Page28 © HortonworksInc. 2014 1. Resilient Distributed Dataset [RDD] Graph val v = sc.textFile("hdfs://…some-hdfs-data") mapmap reduceByKey collecttextFile v.flatMap(line=>line.split(" ")) .map(word=>(word, 1))) .reduceByKey(_ + _, 3) .collect() RDD[String] RDD[List[String]] RDD[(String, Int)] Array[(String, Int)] RDD[(String, Int)]
  • 29.
    Page29 © HortonworksInc. 2014 Processing A File in Scala //Load the file: val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv") //Trim away any empty rows: val fltr = file.filter(_.length > 0) //Print out the remaining rows: fltr.foreach(println) 29
  • 30.
    Page30 © HortonworksInc. 2014 Looking at the State in the Machine //run debug command to inspect RDD: scala> fltr.toDebugString //simplified output: res1: String = FilteredRDD[2] at filter at <console>:14 MappedRDD[1] at textFile at <console>:12 HadoopRDD[0] at textFile at <console>:12 30
  • 31.
    Page31 © HortonworksInc. 2014 A Word on Anonymous Functions Scala programmers make great use of anonymous functions as can be seen in the code: flatMap( line => line.split(" ") ) 31 Argument to the function Body of the function
  • 32.
    Page32 © HortonworksInc. 2014 Scala Functions Come In a Variety of Styles flatMap( line => line.split(" ") ) flatMap((line:String) => line.split(" ")) flatMap(_.split(" ")) 32 Argument to the function (type inferred) Body of the function Argument to the function (explicit type) Body of the function No Argument to the function declared (placeholder) instead Body of the function includes placeholder _ which allows for exactly one use of one arg for each _ present. _ essentially means ‘whatever you pass me’
  • 33.
    Page33 © HortonworksInc. 2014 And Finally – the Formal ‘def’ def myFunc(line:String): Array[String]={ return line.split(",") } //and now that it has a name: myFunc("Hi Mom, I’m home.").foreach(println) Return type of the function) Body of the function Argument to the function)
  • 34.
    Page34 © HortonworksInc. 2014 LAB: Spark RDD & Data Frames Demo – Philly Crime Data Set http://sandbox.hortonworks.com:8081/#/notebook/2B6HKTZDK
  • 35.
    Page35 © HortonworksInc. 2014 Spark DataFrames
  • 36.
    Page36 © HortonworksInc. 2014 What are DataFrames? • Distributed Collection of Data organized in Columns • Equivalent to Tables in Databases or DataFrame in R/PYTHON • Much richer optimization than any other implementation of DF • Can be constructed from a wide variety of sources and APIs • Greater accessiblity • Declarative rather thanimperative • Catalyst Optimizer Why DataFrames?
  • 37.
    Page37 © HortonworksInc. 2014 Writing a DataFrame val df = sqlContext.jsonFile("/tmp/people.json") df.show() df.printSchema() df.select ("First Name").show() df.select("First Name","Age").show() df.filter(df("age")>40).show() df.groupBy("age").count().show()
  • 38.
    Page38 © HortonworksInc. 2014 Querying RDD Using SQL import org.apache.spark.sql.types.{StructType,StructField,StringType} val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val people = sc.textFile("/tmp/people.txt") val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)) val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema) peopleDataFrame.registerTempTable("people") val results = sqlContext.sql("SELECT name FROM people") results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 39.
    Page39 © HortonworksInc. 2014 Querying RDD Using SQL // SQL statements can be run directly on RDD’s val teenagers = sqlC.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") // The results of SQL queries are SchemaRDDs and support // normal RDD operations: val nameList = teenagers.map(t => "Name: " + t(0)).collect() // Language integrated queries (ala LINQ) val teenagers = people.where('age >= 10).where('age <= 19).select('name)
  • 40.
    Page40 © HortonworksInc. 2014 Dataframes for Apache Spark DataFrame SQL DataFrame R DataFrame Python DataFrame Scala RDD Python RDD Scala Time to aggregate 10 million integer pairs (in seconds) DataFrames can be significantly faster than RDDs. And they perform the same, regardless of language.
  • 41.
    Page41 © HortonworksInc. 2014 Transformations Actions filter count select collect drop show join take Transformations contribute to the query plan but nothing is executed until an action is called Dataframes – Transformation & Actions
  • 42.
    Page42 © HortonworksInc. 2014 LAB: DataFrames http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7 http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM DataFrames + SQL DataFrames JSON
  • 43.
    Page43 © HortonworksInc. 2014 DataFrames and JDBC val jdbc_attendees = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx","dbtable" -> "attendees")) jdbc_attendees.show() jdbc.attendees.count() jdbc_attendees.registerTempTable("jdbc_attendees") val countall = sqlContext.sql("select count(*) from jdbc_attendees") countall.map(t=>"Records count is "+t(0)).collect().foreach(println)
  • 44.
    Page44 © HortonworksInc. 2014 Code ‘select count’ Equivalent SQL Statement: Select count(*) from pagecounts WHERE state = ‘FL’ Scala statement: val file = sc.textFile("hdfs://…/log.txt") val numFL = file.filter(line => line.contains("fl")).count() scala> println(numFL) 44 1. Load the page as an RDD 2. Filter the lines of the page eliminating any that do not contain “fl“ 3. Count those lines that remain 4. Print the value of the counted lines containing ‘fl’
  • 45.
    Page45 © HortonworksInc. 2014 Spark SQL 45
  • 46.
    Page46 © HortonworksInc. 2014 46 Platform APIs • Joining Data from Different Sources • Access Data using DataFrames / SQL
  • 47.
    Page47 © HortonworksInc. 2014 47 Platform APIs • Community Plugins • 100+ connectors http://spark-packages.org/
  • 48.
    Page48 © HortonworksInc. 2014 LAB: JDBC and 3rd party packages http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
  • 49.
    Page49 © HortonworksInc. 2014 What About Integration With Hive? scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println) … [omniture] [omniturelogs] [orc_table] [raw_products] [raw_users] … 49
  • 50.
    Page50 © HortonworksInc. 2014 More Integration With Hive: scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println) [swid,string,null] [birth_date,string,null] [gender_cd,string,null] scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT 5").collect().foreach(println) [0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F] [00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F] [00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F] [000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F] [00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F] 50
  • 51.
    Page51 © HortonworksInc. 2014 ORC at Spotify 16x less HDFS read when using ORC versus Avro.(5) IOi 32x less CPU when using ORC versus Avro.(5) CPUi [2]
  • 52.
    Page52 © HortonworksInc. 2014 LAB: HIVE ORC http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
  • 53.
    Page53 © HortonworksInc. 2014 Spark Streaming
  • 54.
    Page54 © HortonworksInc. 2014 MicroBatch Spark Streams
  • 55.
    Page55 © HortonworksInc. 2014 Physical Execution
  • 56.
    Page56 © HortonworksInc. 2014 Spark Streaming 101 • Spark has significant library support for streaming applications val ssc = new StreamingContext(sc, Seconds(5)) val tweetStream = TwitterUtils.createStream(ssc, Some(auth)) • Allows to combine Streaming with Batch/ETL,SQL & ML • Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom. • Chop input data stream into batches • Spark processes batches & results published in batches • Fundamental unit is Discretized Streams (DStreams)
  • 57.
    Page57 © HortonworksInc. 2014 Spark MLLib
  • 58.
    Page58 © HortonworksInc. 2014 Spark MLlib – Algorithms Offered • Classification: logistic regression, linear SVM, – naïve Bayes, least squares, classification tree • Regression: generalized linear models (GLMs), – regression tree • Collaborative filtering: alternating least squares (ALS), – non-negative matrix factorization (NMF) • Clustering: k-means • Decomposition: SVD, PCA • Optimization: stochastic gradient descent, L-BFGS
  • 59.
    Page59 © HortonworksInc. 2014 59 ML - Pipelines • New algorithms KMeans [SPARK-7879], Naive Bayes [SPARK- 8600], Bisecting KMeans • [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for • Linear Models [SPARK-7685] • New transformers (close to parity with SciKit learn): CountVectorizer [SPARK-8703], • PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455] • Calling into single machine solvers (coming soon as a package)
  • 60.
    Page60 © HortonworksInc. 2014 Twitter Language Classifier Goal: connect to real time twitter stream and print only those tweets whose language match our chosen language. Main issue: how to detect the language during run time? Solution: build a language classifier model offline capable of detecting language of tweet (Mlib). Then, apply it to real time twitter stream and do filtering (Spark Streaming).
  • 61.
    Page61 © HortonworksInc. 2014 Spark External Datasources
  • 62.
    Page62 © HortonworksInc. 2014 Spark External Datasources You can load datasets from various external sources: • Local Filesystem • HDFS • HDFS using custom InputFormat • Amazon S3 • Relational Databases (RDBMS) • Apache Cassandra, Mongo DB, etc.
  • 63.
    Page63 © HortonworksInc. 2014 LABS: data load from MongoDB or Cassandra
  • 64.
    Page64 © HortonworksInc. 2014 Recommendation Engine - ALS
  • 65.
    Page65 © HortonworksInc. 2014 Step 1: Data Ingest • Using the MovieLens 10M data set • http://grouplens.org/datasets/movielens/ • Ratings: UserID::MovieID::Rating::Timestamp • 10.000.000 ratings on 10.000 movies by 72.000 users • ratings.dat.gz • Movies: MovieID::Title::Genres • 10.000 movies • movies.dat
  • 66.
    Page66 © HortonworksInc. 2014 baseDir = os.path.join('movielens') ratingsFilename = os.path.join(baseDir, 'ratings.dat.gz') moviesFilename = os.path.join(baseDir, 'movies.dat') numPartitions = 2 rawRatings = sc.textFile(ratingsFilename).repartition(numPartitions) rawMovies = sc.textFile(moviesFilename) Step 1: Data Ingest • Some simple python code followed by the creation of the first RDD import sys import os
  • 67.
    Page67 © HortonworksInc. 2014 Step 2: Feature Extraction • Transform the string data in tuples of useful data and remove unwanted pieces • Ratings: UserID::MovieID::Rating::TImestamp 1::1193::5::978300760 1::661::3::978302109 => [(1, 1193, 5.0), (1, 914, 3.0), …] • Movies: MovieID::Title::Genres 1::Toy Story (1995):: Animation|Children’s|Comedy 2::Jumanji (1995)::Adventure|Children’s|Fantasy => [(1, 'Toy Story (1995)'), (2, u'Jumanji (1995)'), …]
  • 68.
    Page68 © HortonworksInc. 2014 def get_ratings_tuple(entry): float(items[2]) items = entry.split('::') return int(items[0]), int(items[1]), def get_movie_tuple(entry): items = entry.split('::') return int(items[0]), items[1] ratingsRDD = rawRatings.map(get_ratings_tuple).cache() moviesRDD = rawMovies.map(get_movie_tuple).cache() Step 2: Feature Extraction
  • 69.
    Page69 © HortonworksInc. 2014 Step 2: Feature Extraction • Inspect RDD using collect() • Careful: make sure the whole dataset fits in the memory of the driver Driver job Executor Task Executor Task • Use take(num) • Safer: takes a num-size subset print 'Ratings: %s' % ratingsRDD.take(2) Ratings: [(1, 1193, 5.0), (1, 914, 3.0)] print 'Movies: %s' % moviesRDD.take(2) Movies: [(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)')]
  • 70.
    Page70 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach • Recommend movies with the highest average rating • Need a tuple containing the movie name and it’s average rating • Only consider movies with at least 500 ratings • Tuple must contain the number of ratings for the movie • The tuple we need should be of the folowing form: ( averageRating, movieName, numberOfRatings )
  • 71.
    Page71 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach • Calculate the average rating of a movie • From the ratingsRDD, we create tuples containing all the ratings for a movie: – Remember: ratingsRDD = (UserID, MovieID, Rating) movieIDsWithRatingsRDD = (ratingsRDD .map(lambda (user_id,movie_id,rating): (movie_id,[rating])) .reduceByKey(lambda a,b: a+b)) • This is simpele map-reduce in spark: • Map: (UserID, MovieID, Rating) => (MovieID, [Rating]) • Reduce: (MovieID1, [Rating1]), (MovieID1, [Rating2]) => (MovieID1, [Rating1,Rating2])
  • 72.
    Page72 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach ( len(RatingsTuple[1]), total/len(RatingsTuple[1])) ) movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages) • Note that the new key-value tuples have MovieID as key and a nested tuple (ratings,average) as value: [ (2, (332, 3.174698795180723) ), … ] • Next map the data to an RDD with average and number of ratings def getCountsAndAverages(RatingsTuple): total = 0.0 for rating in RatingsTuple[1]: total += rating return ( RatingsTuple[0],
  • 73.
    Page73 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach • Only the movie name is still missing from the tuple • The name of the movie was not present in the ratings data. It must be joined in from the movie data movieNameWithAvgRatingsRDD = ( moviesRDD .join(movieIDsWithAvgRatingsRDD) .map(lambda ( movieid,(name,(ratings, average)) ): (average, name, ratings)) ) • The join creates tuples that still contain the movieID and ends up nested three deep: (Key , (Left_Value, Right_value) ) • A simple map() solves that problem and produces the tuple we need
  • 74.
    Page74 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach • The RDD now contains tuples of the correct form Print movieNameWithAvgRatingsRDD.take(3) [ (3.68181818181818, 'Happiest Millionaire, The (1967)', 22), (3.04682274247491, 'Grumpier Old Men (1995)', 299), (2.88297872340425, 'Hocus Pocus (1993)', 94) ]
  • 75.
    Page75 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach • Now we can easily filter out all the movies with less than 500 ratings, sort the RDD by average rating and show the top 20 movieLimitedAndSortedByRatingRDD = ( movieNameWithAvgRatingsRDD .filter( name, ratings): ratings > 500lambda (average, ) .sortBy(sortFunction, ascending=False) )
  • 76.
    Page76 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach value = return tuple[1] (key + ' ' + value) • sortFunction makes sure the tuples are sorted using both key and value which insures a consistent sort, even if a key appears more than once def sortFunction(tuple): key = unicode('%.3f' % tuple[0])
  • 77.
    Page77 © HortonworksInc. 2014 Step 3: Create Model – The naïve approach print 'Movies with highest ratings: %s' % movieLimitedAndSortedByRatingRDD.take(20) Movies with highest ratings: [ 1447),
  • 78.
    Page78 © HortonworksInc. 2014 Step 3: Create Model – Collaborative Filtering • The naïve approach will recommend the same movies to everybody, regardless of their personal preferences. • Collaborative Filtering will look for people with similar tastes and use their ratings to give recommendations fit to your personal preferences. Image from Wikipedia: https://en.wikipedia.org/wiki/Collaborative_filtering
  • 79.
    Page79 © HortonworksInc. 2014 Step 3: Create Model – Collaborative Filtering • We have a matrix where every row is the ratings for one user for all movies in the database. • Since every user did not rate every movie, this matrix is incomplete. • Predicting the missing ratings is exactly what we need to do in order to give the user good recommendations • The algorithm that is usually applied to solve recommendation problems is “Alternating Least Squares” which takes an iterative approach to finding the missing values in the matrix. • Spark’s mllib has a module for Alternating Least Square recommendation, aptly called “ALS”
Page80 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Machine Learning workflow (diagram with: Full Dataset, Training Set, Validation Set, Test Set, Model, Prediction, Accuracy / over-fitting test)
Page81 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Randomly split the dataset we have into multiple groups for training, validating and testing using randomSplit(weights, seed=None)

trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)
print 'Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(), validationRDD.count(), testRDD.count())

Training: 292716, validation: 96902, test: 98032
Page82 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Before we start training the model, we need a way to calculate how good a model is, so we can compare it against other tries
• Root Mean Square Error (RMSE) is often used to compute the error of a model
• RMSE compares the predicted values with the real values present in the validation set. By squaring the differences, taking the average of those squares, and then taking the square root, we get a single number that represents the error of the model (see the small example after this slide)
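A quick standalone illustration (not from the deck) of that calculation on plain Python lists — the predicted and actual values below are made up:

import math

predicted = [3.5, 4.0, 2.0]   # toy predicted ratings
actual    = [3.0, 4.5, 2.0]   # toy actual ratings

squaredErrors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(squaredErrors) / float(len(squaredErrors)))

print 'RMSE: %s' % rmse   # sqrt((0.25 + 0.25 + 0.0) / 3) ≈ 0.408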
Page83 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering

import math

def computeError(predictedRDD, actualRDD):
    predictedReformattedRDD = (predictedRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    actualReformattedRDD = (actualRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    squaredErrorsRDD = (predictedReformattedRDD
        .join(actualReformattedRDD)
        .map(lambda (k, (a, b)): math.pow((a - b), 2)))
    totalError = squaredErrorsRDD.reduce(lambda a, b: a + b)
    numRatings = squaredErrorsRDD.count()
    return math.sqrt(float(totalError) / numRatings)
Page84 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Create a trained model using the ALS.train() method from Spark MLlib
• Rank is the most important parameter to tune
 – The number of latent factors, i.e. the length of the feature vector learned for each user and each movie
 – A lower rank will mean higher error, a high rank may lead to overfitting (see the small check after this slide)

ALS.train(
    trainingRDD,
    rank,           # We’ll try 3 ranks: 4, 8, 12
    seed = 5L,
    iterations = 5,
    lambda_ = 0.1 )
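A small illustrative check (not from the deck) that the rank is the length of the learned factor vectors; the model and variable names here are hypothetical and the training run is only for demonstration, re-using trainingRDD from the split above:

from pyspark.mllib.recommendation import ALS

demoModel = ALS.train(trainingRDD, 4, seed=5L, iterations=5, lambda_=0.1)

firstUserFactor  = demoModel.userFeatures().first()     # (UserID, vector of 4 floats)
firstMovieFactor = demoModel.productFeatures().first()  # (MovieID, vector of 4 floats)

print len(firstUserFactor[1])    # 4 == rank
print len(firstMovieFactor[1])   # 4 == rank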
Page85 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Use the trained model to predict the missing ratings in the validation set
• Create a new RDD from the validation set where the ratings are removed
• Call the predictAll() method using the trained model on that RDD

validationForPredictRDD = validationRDD.map(lambda (UserID, MovieID, Rating): (UserID, MovieID))
predictedRatingsRDD = model.predictAll(validationForPredictRDD)
Page86 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Finally use our computeError() method to calculate the error of our trained model by comparing the predicted ratings with the real ones

error = computeError(predictedRatingsRDD, validationRDD)
Page87 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• Import the ALS module, create the “empty” validation RDD for prediction and set up some variables

from pyspark.mllib.recommendation import ALS

validationForPredictRDD = (validationRDD
    .map(lambda (UserID, MovieID, Rating): (UserID, MovieID)))

ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
minError = float('inf')
bestRank = -1
bestIteration = -1
Page88 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering

for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=5L, iterations=5, lambda_=0.1)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < minError:
        minError = error
        bestRank = rank
Page89 © Hortonworks Inc. 2014
Step 3: Create Model – Collaborative Filtering
• The model that was trained with rank 8 has the lowest error (RMSE)

print 'The best model was trained with rank %s' % bestRank

For rank 4 the RMSE is 0.892734779484
For rank 8 the RMSE is 0.890121292255
For rank 12 the RMSE is 0.890216118367
The best model was trained with rank 8
Page90 © Hortonworks Inc. 2014
Step 4: Test Model
• So we have now found the best model, but we still need to test whether the model is actually good
• Testing using the same validation set is not a good test since it may leave us vulnerable to overfitting
 – The model is so fit to the validation set that it only produces good results for that set
• This is why we split off a test set at the start of the Machine Learning process
• We will use the best rank result we obtained to train a model and then predict the ratings for the test set
• Calculating the RMSE for the test set predictions should tell us if our model is usable
Page91 © Hortonworks Inc. 2014
Step 4: Test Model
• We recreate the model, remove all the ratings present in the test set and run the predictAll() method

myModel = ALS.train(trainingRDD, 8, seed=5L, iterations=5, lambda_=0.1)
testForPredictingRDD = testRDD.map(lambda (UserID, MovieID, Rating): (UserID, MovieID))
predictedTestRDD = myModel.predictAll(testForPredictingRDD)
testRMSE = computeError(testRDD, predictedTestRDD)
Page92 © Hortonworks Inc. 2014
Step 4: Test Model
• The RMSE is good. Our model does not suffer from overfitting and is usable.
• The RMSE of the validation set was 0.890121292255, only slightly better

print 'The model had a RMSE on the test set of %s' % testRMSE

The model had a RMSE on the test set of 0.891048561304
Page93 © Hortonworks Inc. 2014
Step 5: Use the model
• Let’s get some movie predictions!
• First I need to give the data set some ratings so it has something to deduce my taste from

myRatedMovies = [
    # (UserID, MovieID, Rating)
    (0,  845, 5.0),  # Blade Runner (1982) - 5.0/5
    (0,  789, 4.5),  # Good Will Hunting (1997) - 4.5/5
    (0,  983, 4.8),  # Christmas Story, A (1983) - 4.8/5
    (0,  551, 2.0),  # Taxi Driver (1976) - 2.0/5
    (0, 1039, 2.0),  # Pulp Fiction (1994) - 2.0/5
    (0,  651, 5.0),  # Dr. Strangelove (1963) - 5.0/5
    (0, 1195, 4.0),  # Raiders of the Lost Ark (1981) - 4.0/5
    (0, 1110, 5.0),  # Sixth Sense, The (1999) - 4.5/5
    (0, 1250, 4.5),  # Matrix, The (1999) - 4.5/5
    (0, 1083, 4.0),  # Princess Bride, The (1987) - 4.0/5
]
myRatingsRDD = sc.parallelize(myRatedMovies)
Page94 © Hortonworks Inc. 2014
Step 5: Use the model
• Then we add my ratings to the data set
• Since we now have more ratings, let’s train our model again
• And make sure the RMSE is still OK (re-using the test set RDDs from the previous step)

trainingWithMyRatingsRDD = myRatingsRDD.union(trainingRDD)
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, 8, seed=5L, iterations=5, lambda_=0.1)
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(testRDD, predictedTestMyRatingsRDD)
Page95 © Hortonworks Inc. 2014
Step 5: Use the model
• And of course, check the RMSE again... We’re good

print 'The model had a RMSE on the test set of %s' % testRMSEMyRatings

The model had a RMSE on the test set of 0.892023318284
Page96 © Hortonworks Inc. 2014
Step 5: Use the model
• Now we need an RDD with only the movies I did not rate, to run predictAll() on (my userid is set to zero)
• [(0, movieID1), (0, movieID2), (0, movieID3), …]

myUnratedMoviesRDD = (moviesRDD
    .map(lambda (movieID, name): movieID)
    .filter(lambda movieID: movieID not in [mine[1] for mine in myRatedMovies])
    .map(lambda movieID: (0, movieID)))

predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)
Page97 © Hortonworks Inc. 2014
Step 5: Use the model
• From the predicted RDD, get the top 20 predicted ratings, but only for movies that had at least 75 ratings in total
• Re-use the RDD we created in the naïve approach that had the average ratings and number of ratings (movieIDsWithAvgRatingsRDD)
 – Map it to tuples of form (movieID, number_of_ratings)
• Strip the userid from the predicted RDD
 – Map it to tuples (movieID, predicted_rating)
• Join those two and add the movie names from the original movies data and clean up the result
 – The resulting tuple is (predicted_rating, name, number_of_ratings)
• Filter out all movies that had less than 75 ratings
Page98 © Hortonworks Inc. 2014
Step 5: Use the model

movieCountsRDD = movieIDsWithAvgRatingsRDD.map(lambda (movie_id, (ratings, average)): (movie_id, ratings))

predictedRDD = predictedRatingsRDD.map(lambda (uid, movie_id, rating): (movie_id, rating))

predictedWithCountsRDD = (predictedRDD.join(movieCountsRDD))

ratingsWithNamesRDD = (predictedWithCountsRDD
    .join(moviesRDD)
    .map(lambda (movie_id, ((pred, ratings), name)): (pred, name, ratings))
    .filter(lambda (pred, name, ratings): ratings > 75))
Page99 © Hortonworks Inc. 2014
Step 5: Use the model
• And finally get the top 20 recommended movies for myself

predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])
print ('My highest rated movies as predicted:\n%s' % '\n'.join(map(str, predictedHighestRatedMovies)))
Page100 © Hortonworks Inc. 2014
Step 5: Use the model
My highest rated movies as predicted:
(4.823536053603062, 'Once Upon a Time in the West (1969)', 82)
(4.743456934724456, 'Texas Chainsaw Massacre, The (1974)', 111)
(4.452221024980805, 'Evil Dead II (Dead By Dawn) (1987)', 305)
(4.387531237859994, 'Duck Soup (1933)', 279)
(4.373821653377477, 'Citizen Kane (1941)', 527)
(4.344480264132989, 'Cabin Boy (1994)', 95)
(4.332264360095111, 'Shaft (1971)', 85)
(4.217371529794628, 'Night of the Living Dead (1968)', 352)
(4.181318251399025, 'Yojimbo (1961)', 110)
(4.171790272807383, 'Naked Gun: From the Files of Police Squad', 435)
…
Apache Spark on HDP 2.3
(Workflow diagram with the elements: User, Spark Shell, Submit Rating, Train Model, Persist, Predict, Recommendation, Improve Model)
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Page102 © Hortonworks Inc. 2014
Conclusion and Q&A
Page103 © Hortonworks Inc. 2014
Learn More
Spark + Hadoop Perfect Together
HDP Spark General Info: http://hortonworks.com/hadoop/spark/
Learn more about our Focus on Spark: http://hortonworks.com/hadoop/spark/#section_6
Get the HDP Spark 1.5.1 Tech Preview: http://hortonworks.com/hadoop/spark/#section_5
Get started with Spark and Zeppelin and download the Sandbox: http://hortonworks.com/sandbox
Try these tutorials:
http://hortonworks.com/hadoop/spark/#tutorials
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Learn more about GeoSpatial Spark processing with Magellan:
http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/