Introduction to Spark ML

Introduction to Spark ML
Machine learning at scale
Alpine Academy Spark Workshop #2
Hella-Legit
Preview!

Who am I?
Holden
I prefer she/her for pronouns
Co-author of the Learning Spark book
Software Engineer at IBM’s Spark Technology Center
@holdenkarau
http://www.slideshare.net/hkarau
https://www.linkedin.com/in/holdenkarau

Who are the TAs?
Anya
Pranav
Anandha
Vasudev

What we are going to explore together!
Who I think you all are
Spark’s two different ML APIs
Running through a simple example with one
A brief detour into some codegen funtimes
Exercises!
Model save/load
Discussion of “serving” options

The different pieces of Spark
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark ML
bagel &
Grah X
MLLib
Community
Packages

Who do I think you all are?
Nice people*
Some knowledge of Apache Spark core & maybe SQL
Interested in using Spark for Machine Learning
Familiar-ish with Scala or Java or Python
Amanda

Skipping intro & set-up time :)

But maybe time to upgrade...
Spark 1.5+ (Spark 1.6 would be best!)
(built with Hive support if building from source)
Amanda

Some pages to keep open:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-
workshop
Dwight Sipler

Getting some data for working with:
census data: https://archive.ics.uci.edu/ml/datasets/Adult
Goal: predict income > 50k
Also included in the github repo
Download that now if you haven’t already
We will add a header to the data
http://pastebin.ca/3318687
PROTill
Westermayer

So what are the two APIs?
Traditional and Pipeline
Pipeline is the new shiny future which will fix all problems*
Traditional API works on RDDs
Data preparation work is generally done in traditional Spark
transformations
Pipeline API works on DataFrames
Often we want to apply some transformations to our data before feeding
to the machine learning algorithm
Makes it easy to chain these together
(*until replaced by a newer shinier future)
Steve Jurvetson

So what are DataFrames?
Spark SQL’s version of RDDs of the world (its for more
than just SQL)
Restricted data types, schema information, compile time
untyped
Restricted operations (more relational style)
Allow lots of fun extra optimizations
Tungsten etc.
We’ll talk more about them (& Datasets) when we do the
Spark SQL component of this workshop

Transformers, Estimators and Pipelines
Transformers transform a DataFrame into another
Estimators can be trained on a DataFrame to produce a
transformer
Pipelines chain together multiple transformers and
estimators

Let’s start with loading some data
We’ve got some CSV data, we could use textfile and
parse by hand
spark-packages can save by providing the spark-csv
package by Hossein Falaki
If we were building a Java project we can include maven coordinates
For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
Jess Johnson

Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
option(“key”, “value”)
spark-csv ones we will use are header & inferSchema
format(“formatName”)
built in formats include parquet, jdbc, etc. today we will use
com.databricks.spark.csv
load(“path”)
Jess Johnson

Loading with sparkSQL & spark-csv
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("resources/adult.data")
Jess Johnson

Lets explore training a Decision Tree
Step 1: Data loading (done!)
Step 2: Data prep (select features, etc.)
Step 3: Train
Step 4: Predict

Data prep / cleaning
We need to predict a double (can be 0.0, 1.0, but type
must be double)
We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler,
StringIndexer
from pyspark.ml import Pipeline
Huang
Yun
Chung

Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
outputCol="feautres")
# String indexer converts a set of strings into doubles
indexer =
StringIndexer(inputCol="category")
.setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang
Yun
Chung

So a bit more about that pipeline
Each of our previous components has “fit” & “transform”
stage
Constructing the pipeline this way makes it easier to work
with (only need to call one fit & one transform)
Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey

What does our pipeline look like so far?
Input Data Assembler
Input Data
+ Vectors StringIndexer
Input Data
+Cat ID
+ Vectors
While not an ML learning
algorithm this still needs to
be fit
This is a regular
transformer - no fitting
required.

Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol = "category-index",
featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer,
dt])
pipeline_model = pipeline_and_model.fit(df)

And predict the results on the same data:
pipeline_model.transform(df).select("prediction",
"category-index").take(20)

Exercise 1:
Go from the index to something useful
We could manually look up the labels and then write a
select statement
Or we could look at the features on the
StringIndexerModel and use IndexToString
Our pipeline has an array of stages we can use for this

Solution:
from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction",
outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select("predict
ion-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar

So what could we do for other types of
data?
org.apache.spark.ml.feature has a lot of options
HashingTF
Tokenizer
Word2Vec
etc.

Exercise 2: Add more features to your tree
Finished quickly? Help others!
Or tell me if adding these features helped or not…
We can download a reserve “test” dataset but how would we know if we
couldn’t do that?
cobra libre

And not just for getting data into doubles...
Maybe a customers cat food preference only matters if
the owns_cats boolean is true
Maybe the scale is _way_ off
Maybe we’ve got stop words
Maybe we know one component has a non-linear relation
etc.

Cross-validation
because saving a test set is effort
Automagically* fit your model params
Because thinking is effort
org.apache.spark.ml.tuning has the tools
(not in Python yet so skipping for now)
Jonathan Kotta

Pipeline API has many models:
org.apache.spark.ml.classification
BinaryLogisticRegressionClassification, DecissionTreeClassification,
GBTClassifier, etc.
org.apache.spark.ml.regression
DecissionTreeRegression, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
org.apache.spark.ml.recommendation
ALS
PROcarterse Follow

Exercise 3: Train a new model type
Your choice!
If you want to do regression change what we are
predicting

So serving...
Generally refers to using your model online
Generating recommendations...
In batch mode you can “just” save & use the Spark bits
Spark’s “native” formats (often parquet w/metadata)
Understood by Spark libraries and thats pretty much it
If you are serving in JVM can load these but need Spark dependencies
(albeit often not a Spark cluster)
Some models support PMML export
https://github.com/jpmml/openscoring etc.
We can also write our own export & serving by hand :(
Ambernectar 13

So what models are PMML exportable?
Right now “old” style models
KMeans, LinearRegresion, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression
However if we look in the code we can sometimes find converters to the
old style models and use this to export our “new” style model
Waiting on https://issues.apache.org/jira/browse/SPARK-
11171 / https://github.com/apache/spark/pull/9207 for
pipeline models

How to PMML export
toPMML
returns a string or
takes a path to local fs and saves results or
takes a SparkContext & a distributed path and saves or
takes a stream and writes result to stream

Optional* exercise time
Take a model you trained and save it to PMML
You will have to dig around in the Spark code to be able to do this
Look at the file
Load it into a serving system and try some predictions
Note: PMML export currently only includes the model -
not any transformations beforehand
Also: you might need to train a new model
If you don’t get it don’t worry - hints to follow :)

Introduction to Spark ML

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (12)

Similar to Introduction to Spark ML

Similar to Introduction to Spark ML (20)

Recently uploaded

Recently uploaded (20)

Introduction to Spark ML

Editor's Notes