Introduction to Spark ML
Machine learning at scale
Alpine Academy Spark Workshop #2
Hella-Legit
Preview!
Who am I?
Holden
I prefer she/her for pronouns
Co-author of the Learning Spark book
Software Engineer at IBM’s Spark Technology Center
@holdenkarau
http://www.slideshare.net/hkarau
https://www.linkedin.com/in/holdenkarau
Who are the TAs?
Anya
Pranav
Anandha
Vasudev
What we are going to explore together!
Who I think you all are
Spark’s two different ML APIs
Running through a simple example with one
A brief detour into some codegen funtimes
Exercises!
Model save/load
Discussion of “serving” options
The different pieces of Spark
Apache Spark
SQL & DataFrames
Streaming
Language APIs: Scala, Java, Python, & R
Graph Tools: Bagel & GraphX
Spark ML & MLlib
Community Packages
Who do I think you all are?
Nice people*
Some knowledge of Apache Spark core & maybe SQL
Interested in using Spark for Machine Learning
Familiar-ish with Scala or Java or Python
Skipping intro & set-up time :)
But maybe time to upgrade...
Spark 1.5+ (Spark 1.6 would be best!)
(built with Hive support if building from source)
Some pages to keep open:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
http://bit.ly/sparkMLGuide
https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Getting some data for working with:
census data: https://archive.ics.uci.edu/ml/datasets/Adult
Goal: predict income > 50k
Also included in the github repo
Download that now if you haven’t already
We will add a header to the data
http://pastebin.ca/3318687
So what are the two APIs?
Traditional and Pipeline
Pipeline is the new shiny future which will fix all problems*
Traditional API works on RDDs
Data preparation work is generally done in traditional Spark
transformations
Pipeline API works on DataFrames
Often we want to apply some transformations to our data before feeding
to the machine learning algorithm
Makes it easy to chain these together
(*until replaced by a newer shinier future)
So what are DataFrames?
Spark SQL’s counterpart to RDDs (it’s for more
than just SQL)
Restricted data types, schema information, untyped at
compile time
Restricted operations (more relational style)
Allow lots of fun extra optimizations
Tungsten etc.
We’ll talk more about them (& Datasets) when we do the
Spark SQL component of this workshop
Transformers, Estimators and Pipelines
Transformers transform a DataFrame into another
Estimators can be trained on a DataFrame to produce a
transformer
Pipelines chain together multiple transformers and
estimators
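To make the three roles concrete, here is a minimal sketch in plain Python with no Spark involved. Rows are dicts standing in for DataFrame rows, and every class name below (AddLength, MeanCenter, SimplePipeline and friends) is hypothetical, chosen for illustration rather than taken from Spark. It only shows the shape of the idea: a transformer maps data to data, an estimator's fit returns a transformer, and a pipeline fits its stages in order.

```python
# Plain-Python sketch of the Transformer / Estimator / Pipeline roles.
# Rows are dicts standing in for DataFrame rows; all class names are hypothetical.

class AddLength:
    """Transformer: needs no fitting, just maps rows to rows."""
    def transform(self, rows):
        return [dict(r, length=len(r["text"])) for r in rows]


class MeanCenterModel:
    """Transformer produced by fitting MeanCenter."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, centered=r["length"] - self.mean) for r in rows]


class MeanCenter:
    """Estimator: must see the data (fit) before it can transform."""
    def fit(self, rows):
        mean = sum(r["length"] for r in rows) / len(rows)
        return MeanCenterModel(mean)


class SimplePipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class SimplePipeline:
    """Chains stages; one fit call fits every estimator in sequence."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the data as transformed so far;
            # plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return SimplePipelineModel(fitted)


rows = [{"text": "a"}, {"text": "abc"}]
pipeline_model = SimplePipeline([AddLength(), MeanCenter()]).fit(rows)
result = pipeline_model.transform(rows)
```

The fitted pipeline model can be reused on new rows without refitting, which is the property the later slides rely on.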
Let’s start with loading some data
We’ve got some CSV data; we could use textFile and
parse by hand
spark-packages can save us the effort by providing the spark-csv
package by Hossein Falaki
If we were building a Java project we could include the Maven coordinates
For the Spark shell, when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
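For comparison, this is roughly what the "parse by hand" route involves, sketched with Python's standard csv module rather than Spark. The two sample rows are made up, and the three column names simply mirror the ones used later in the workshop code. The UCI file puts a space after each comma, which is why skipinitialspace is set.

```python
import csv
import io

# Made-up sample in the shape the workshop uses: a header row, then records.
sample = io.StringIO(
    "age,education-num,category\n"
    "39, 13, <=50K\n"
    "52, 9, >50K\n"
)


def parse_rows(lines):
    """Parse CSV text by hand: read the header, then cast the numeric fields."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        yield {
            "age": float(row["age"]),
            "education-num": float(row["education-num"]),
            "category": row["category"],
        }


rows = list(parse_rows(sample))
```

Every cast here is something spark-csv's inferSchema option would otherwise work out for us.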
Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
option("key", "value")
spark-csv ones we will use are header & inferSchema
format("formatName")
built-in formats include parquet, jdbc, etc.; today we will use
com.databricks.spark.csv
load("path")
Loading with sparkSQL & spark-csv
# Parentheses let the chained calls span several lines in Python
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Let's explore training a Decision Tree
Step 1: Data loading (done!)
Step 2: Data prep (select features, etc.)
Step 3: Train
Step 4: Predict
Data prep / cleaning
We need to predict a double (can be 0.0, 1.0, but type
must be double)
We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"], outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category", outputCol="category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
So a bit more about that pipeline
Each of our previous components has a “fit” and/or a “transform”
stage
Constructing the pipeline this way makes it easier to work
with (only need to call one fit & one transform)
Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
What does our pipeline look like so far?
Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors
Assembler: this is a regular transformer - no fitting required.
StringIndexer: while not a machine learning algorithm, this still needs to be fit.
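Why does an indexer need fitting at all? It has to see the whole column before it can decide which string maps to which number. The sketch below imitates that in plain Python; it is not Spark's implementation, and the function names are hypothetical. Spark's StringIndexer orders labels by descending frequency, so the most frequent string gets index 0.0. The alphabetical tie-break used here is an assumption made only to keep the sketch deterministic. The last helper shows the reverse lookup, which is the idea behind IndexToString.

```python
from collections import Counter


def fit_string_indexer(values):
    """'Fit' step: learn the label list, most frequent first.

    Ties are broken alphabetically here (an assumption for this sketch).
    """
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def index_strings(values, labels):
    """'Transform' step: replace each string with its index as a double."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def index_to_string(indices, labels):
    """Inverse mapping from indices back to the original strings."""
    return [labels[int(i)] for i in indices]


categories = ["<=50K", ">50K", "<=50K", "<=50K"]
labels = fit_string_indexer(categories)
indexed = index_strings(categories, labels)
restored = index_to_string(indexed, labels)
```

Because the label list is learned once and then stored, the same mapping can be applied to future data, and inverted later to turn predictions back into readable labels.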
Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data:
pipeline_model.transform(df).select("prediction", "category-index").take(20)
Exercise 1:
Go from the index to something useful
We could manually look up the labels and then write a
select statement
Or we could look at the labels on the
StringIndexerModel and use IndexToString
Our pipeline has an array of stages we can use for this
Solution:
from pyspark.ml.feature import IndexToString
# stages[1] is the fitted StringIndexerModel; labels is a property, not a method
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction", outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select("prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
So what could we do for other types of
data?
org.apache.spark.ml.feature has a lot of options
HashingTF
Tokenizer
Word2Vec
etc.
Exercise 2: Add more features to your tree
Finished quickly? Help others!
Or tell me if adding these features helped or not…
We can download a reserve “test” dataset but how would we know if we
couldn’t do that?
And not just for getting data into doubles...
Maybe a customer’s cat food preference only matters if
the owns_cats boolean is true
Maybe the scale is _way_ off
Maybe we’ve got stop words
Maybe we know one component has a non-linear relation
etc.
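As one illustration of the "scale is way off" case, here is a small plain-Python sketch of standardizing a numeric column to zero mean and unit standard deviation. The function names and the sample ages are made up for the example. Note that it follows the same fit-then-transform split as the pipeline stages: the mean and standard deviation are learned once, then applied.

```python
import statistics


def fit_standard_scaler(column):
    """Learn the mean and (sample) standard deviation of a column."""
    return statistics.mean(column), statistics.stdev(column)


def scale(column, mean, std):
    """Apply the learned statistics to rescale every value."""
    return [(x - mean) / std for x in column]


ages = [20.0, 30.0, 40.0]  # made-up values
mean, std = fit_standard_scaler(ages)
scaled = scale(ages, mean, std)
```

Reusing the statistics learned on the training data when scaling later data keeps both on the same footing.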
Cross-validation
because saving a test set is effort
Automagically* fit your model params
Because thinking is effort
org.apache.spark.ml.tuning has the tools
(not in Python yet so skipping for now)
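Since the workshop skips the tuning package, here is a rough plain-Python sketch of what k-fold cross-validation does underneath. It is not the org.apache.spark.ml.tuning API; the function names, the toy labels, and the majority-vote "model" are all assumptions made for illustration. Each row is held out exactly once, and the scores across folds are averaged.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


def cross_validate(rows, k, fit, score):
    """Average a score over k folds; fit and score are supplied by the caller."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores)


def majority_label(train):
    """Toy 'model': always predict the most common training label."""
    return max(set(train), key=train.count)


def accuracy(model, test):
    """Fraction of held-out labels the toy model gets right."""
    return sum(1 for y in test if y == model) / len(test)


labels = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # made-up labels
splits = list(k_fold_indices(len(labels), 3))
mean_accuracy = cross_validate(labels, 3, majority_label, accuracy)
```

Running the same loop once per candidate parameter setting and keeping the best average score is the basic idea behind fitting model params automatically.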
Pipeline API has many models:
org.apache.spark.ml.classification
LogisticRegression (binary), DecisionTreeClassifier,
GBTClassifier, etc.
org.apache.spark.ml.regression
DecisionTreeRegressor, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
org.apache.spark.ml.recommendation
ALS
Exercise 3: Train a new model type
Your choice!
If you want to do regression change what we are
predicting
So serving...
Generally refers to using your model online
Generating recommendations...
In batch mode you can “just” save & use the Spark bits
Spark’s “native” formats (often parquet w/metadata)
Understood by Spark libraries and that’s pretty much it
If you are serving in JVM can load these but need Spark dependencies
(albeit often not a Spark cluster)
Some models support PMML export
https://github.com/jpmml/openscoring etc.
We can also write our own export & serving by hand :(
So what models are PMML exportable?
Right now “old” style models
KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression
However if we look in the code we can sometimes find converters to the
old style models and use this to export our “new” style model
Waiting on https://issues.apache.org/jira/browse/SPARK-11171 /
https://github.com/apache/spark/pull/9207 for
pipeline models
How to PMML export
toPMML
returns a string or
takes a path to local fs and saves results or
takes a SparkContext & a distributed path and saves or
takes a stream and writes result to stream
Optional* exercise time
Take a model you trained and save it to PMML
You will have to dig around in the Spark code to be able to do this
Look at the file
Load it into a serving system and try some predictions
Note: PMML export currently only includes the model -
not any transformations beforehand
Also: you might need to train a new model
If you don’t get it don’t worry - hints to follow :)

More Related Content

What's hot

Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
Ilya Ganelin
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
Data Con LA
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Databricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Lillian Pierson
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa
 

What's hot (20)

Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
Extending Spark's Ingestion: Build Your Own Java Data Source with Jean George...
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
PySaprk
PySaprkPySaprk
PySaprk
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 

Viewers also liked

Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with Elasticsearch
Holden Karau
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
Holden Karau
 
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツJP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
Holden Karau
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016
Holden Karau
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
Holden Karau
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
Holden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
 

Viewers also liked (12)

Spark with Elasticsearch
Spark with ElasticsearchSpark with Elasticsearch
Spark with Elasticsearch
 
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)Effective testing for spark programs scala bay preview (pre-strata ny 2015)
Effective testing for spark programs scala bay preview (pre-strata ny 2015)
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
 
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツJP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
JP version - Beyond Shuffling - Apache Spark のスケールアップのためのヒントとコツ
 
Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016Beyond parallelize and collect - Spark Summit East 2016
Beyond parallelize and collect - Spark Summit East 2016
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法  ※講演は翻訳資料にて行います。 - Getting the Best...
PySparkによるジョブを、より速く、よりスケーラブルに実行するための最善の方法 ※講演は翻訳資料にて行います。 - Getting the Best...
 
Getting started contributing to Apache Spark
Getting started contributing to Apache SparkGetting started contributing to Apache Spark
Getting started contributing to Apache Spark
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
 

Similar to Introduction to Spark ML

Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Data Con LA
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Holden Karau
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
Holden Karau
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
Databricks
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
sparktc
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
DataWorks Summit
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API Servers
DataWorks Summit
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
Juantomás García Molina
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
Cepoi Eugen
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
Mateusz Dymczyk
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
Salesforce Lightning Web Components Overview
Salesforce Lightning Web Components OverviewSalesforce Lightning Web Components Overview
Salesforce Lightning Web Components Overview
Nagarjuna Kaipu
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
Roger Rafanell Mas
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedChao Chen
 

Similar to Introduction to Spark ML (20)

Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Introduction to and Extending Spark ML
Introduction to and Extending Spark MLIntroduction to and Extending Spark ML
Introduction to and Extending Spark ML
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API Servers
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
Spark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scaleSpark + H20 = Machine Learning at scale
Spark + H20 = Machine Learning at scale
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
Salesforce Lightning Web Components Overview
Salesforce Lightning Web Components OverviewSalesforce Lightning Web Components Overview
Salesforce Lightning Web Components Overview
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 

Introduction to Spark ML

  • 1. Introduction to Spark ML Machine learning at scale Alpine Academy Spark Workshop #2 Hella-Legit Preview!
  • 2. Who am I? Holden I prefer she/her for pronouns Co-author of the Learning Spark book Software Engineer at IBM’s Spark Technology Center @holdenkarau http://www.slideshare.net/hkarau https://www.linkedin.com/in/holdenkarau
  • 3. Who are the TAs? Anya Pranav Anandha Vasudev
  • 4. What we are going to explore together! Who I think you all are Spark’s two different ML APIs Running through a simple example with one A brief detour into some codegen funtimes Exercises! Model save/load Discussion of “serving” options
  • 5. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML Bagel & GraphX MLlib Community Packages
  • 6. Who do I think you all are? Nice people* Some knowledge of Apache Spark core & maybe SQL Interested in using Spark for Machine Learning Familiar-ish with Scala or Java or Python Amanda
  • 7. Skipping intro & set-up time :)
  • 8. But maybe time to upgrade... Spark 1.5+ (Spark 1.6 would be best!) (built with Hive support if building from source) Amanda
  • 9. Some pages to keep open: http://bit.ly/sparkDocs http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc http://bit.ly/sparkMLGuide https://github.com/holdenk/spark-intro-ml-pipeline-workshop Dwight Sipler
  • 10. Getting some data for working with: census data: https://archive.ics.uci.edu/ml/datasets/Adult Goal: predict income > 50k Also included in the github repo Download that now if you haven’t already We will add a header to the data http://pastebin.ca/3318687 Till Westermayer
  • 11. So what are the two APIs? Traditional and Pipeline Pipeline is the new shiny future which will fix all problems* Traditional API works on RDDs Data preparation work is generally done in traditional Spark transformations Pipeline API works on DataFrames Often we want to apply some transformations to our data before feeding to the machine learning algorithm Makes it easy to chain these together (*until replaced by a newer shinier future) Steve Jurvetson
  • 12. So what are DataFrames? Spark SQL’s version of RDDs of the world (it’s for more than just SQL) Restricted data types, schema information, compile time untyped Restricted operations (more relational style) Allow lots of fun extra optimizations Tungsten etc. We’ll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
  • 13. Transformers, Estimators and Pipelines Transformers transform a DataFrame into another Estimators can be trained on a DataFrame to produce a transformer Pipelines chain together multiple transformers and estimators
  • 14. Let’s start with loading some data We’ve got some CSV data; we could use textFile and parse by hand. spark-packages can save us by providing the spark-csv package by Hossein Falaki. If we were building a Java project we could include the Maven coordinates. For the Spark shell, when launching add: --packages com.databricks:spark-csv_2.10:1.3.0 Jess Johnson
  • 15. Loading with sparkSQL & spark-csv sqlContext.read returns a DataFrameReader We can specify general properties & data specific options option(“key”, “value”) spark-csv ones we will use are header & inferSchema format(“formatName”) built in formats include parquet, jdbc, etc. today we will use com.databricks.spark.csv load(“path”) Jess Johnson
  • 16. Loading with sparkSQL & spark-csv df = sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") .option("inferSchema", "true") .load("resources/adult.data") Jess Johnson
  • 17. Lets explore training a Decision Tree Step 1: Data loading (done!) Step 2: Data prep (select features, etc.) Step 3: Train Step 4: Predict
  • 18. Data prep / cleaning We need to predict a double (can be 0.0, 1.0, but type must be double) We need to train with a vector of features Imports:
        from pyspark.mllib.linalg import Vectors
        from pyspark.ml.classification import DecisionTreeClassifier
        from pyspark.ml.param import Param, Params
        from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
        from pyspark.ml import Pipeline
        Huang Yun Chung
  • 19. Data prep / cleaning continued
        # Combines a list of double input features into a vector
        assembler = VectorAssembler(inputCols=["age", "education-num"], outputCol="features")
        # String indexer converts a set of strings into doubles
        indexer = StringIndexer(inputCol="category").setOutputCol("category-index")
        # Can be used to combine pipeline components together
        pipeline = Pipeline().setStages([assembler, indexer])
        Huang Yun Chung
  • 20. So a bit more about that pipeline Each of our previous components has “fit” & “transform” stages Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform) Can re-use the fitted model on future data
        model = pipeline.fit(df)
        prepared = model.transform(df)
        Andrey
  • 21. What does our pipeline look like so far? Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors. StringIndexer: while not a machine learning algorithm, this still needs to be fit. Assembler: this is a regular transformer, no fitting required.
  • 22. Let's train a model on our prepared data:
        # Specify model
        dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
        # Fit it
        dt_model = dt.fit(prepared)
        # Or as part of the pipeline
        pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
        pipeline_model = pipeline_and_model.fit(df)
  • 23. And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
  • 24. Exercise 1: Go from the index to something useful We could manually look up the labels and then write a select statement Or we could look at the features on the StringIndexerModel and use IndexToString Our pipeline has an array of stages we can use for this
  • 25. Solution:
        from pyspark.ml.feature import IndexToString
        # labels is a property on the fitted StringIndexerModel (stage 1 of the pipeline)
        labels = list(pipeline_model.stages[1].labels)
        inverter = IndexToString(inputCol="prediction", outputCol="prediction-label", labels=labels)
        inverter.transform(pipeline_model.transform(df)).select("prediction-label", "category").take(20)
        # Pre Spark 1.6 use SQL if/else or similar
  • 26. So what could we do for other types of data? org.apache.spark.ml.feature has a lot of options HashingTF Tokenizer Word2Vec etc.
  • 27. Exercise 2: Add more features to your tree Finished quickly? Help others! Or tell me if adding these features helped or not… We can download a reserve “test” dataset but how would we know if we couldn’t do that? cobra libre
  • 28. And not just for getting data into doubles... Maybe a customer’s cat food preference only matters if the owns_cats boolean is true Maybe the scale is _way_ off Maybe we’ve got stop words Maybe we know one component has a non-linear relation etc.
  • 29. Cross-validation because saving a test set is effort Automagically* fit your model params Because thinking is effort org.apache.spark.ml.tuning has the tools (not in Python yet so skipping for now) Jonathan Kotta
  • 30. Pipeline API has many models: org.apache.spark.ml.classification LogisticRegression (binary), DecisionTreeClassifier, GBTClassifier, etc. org.apache.spark.ml.regression DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc. org.apache.spark.ml.recommendation ALS carterse
  • 31. Exercise 3: Train a new model type Your choice! If you want to do regression change what we are predicting
  • 32. So serving... Generally refers to using your model online Generating recommendations... In batch mode you can “just” save & use the Spark bits Spark’s “native” formats (often parquet w/metadata) Understood by Spark libraries and that’s pretty much it If you are serving in JVM can load these but need Spark dependencies (albeit often not a Spark cluster) Some models support PMML export https://github.com/jpmml/openscoring etc. We can also write our own export & serving by hand :( Ambernectar 13
  • 33. So what models are PMML exportable? Right now “old” style models: KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary LogisticRegression However if we look in the code we can sometimes find converters to the old style models and use this to export our “new” style model Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
  • 34. How to PMML export toPMML returns a string or takes a path to local fs and saves results or takes a SparkContext & a distributed path and saves or takes a stream and writes result to stream
  • 35. Optional* exercise time Take a model you trained and save it to PMML You will have to dig around in the Spark code to be able to do this Look at the file Load it into a serving system and try some predictions Note: PMML export currently only includes the model - not any transformations beforehand Also: you might need to train a new model If you don’t get it don’t worry - hints to follow :)
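
As a rough illustration of the Transformer / Estimator / Pipeline contract from slides 13 and 19 to 21, and of mapping an index back to its label as in slide 25, here is a toy sketch in plain Python with no Spark dependency. Every `Toy*` class and the three sample rows are hypothetical names and data made up for this example; this is not how Spark implements these stages, only the shape of the fit/transform idea. The frequency-based ordering mirrors how StringIndexer assigns index 0.0 to the most frequent label; the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


class ToyAssembler:
    """Plain transformer: packs numeric columns into one list. No fitting needed."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        return [
            {**row, self.output_col: [float(row[c]) for c in self.input_cols]}
            for row in rows
        ]


class ToyStringIndexer:
    """Estimator: must see the data to learn which strings exist."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label first; ties broken alphabetically (sketch assumption).
        labels = sorted(counts, key=lambda label: (-counts[label], label))
        return ToyStringIndexerModel(self.input_col, self.output_col, labels)


class ToyStringIndexerModel:
    """Transformer produced by fitting ToyStringIndexer."""

    def __init__(self, input_col, output_col, labels):
        self.input_col = input_col
        self.output_col = output_col
        self.labels = labels

    def transform(self, rows):
        index = {label: float(i) for i, label in enumerate(self.labels)}
        return [{**row, self.output_col: index[row[self.input_col]]} for row in rows]


class ToyPipeline:
    """Chains stages; fitting replaces each estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimator: fit on the data seen so far
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Fitted pipeline: only transformers remain, so one transform call suffices."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up rows using the column names from the slides.
rows = [
    {"age": 39, "education-num": 13, "category": "<=50K"},
    {"age": 50, "education-num": 13, "category": "<=50K"},
    {"age": 52, "education-num": 9, "category": ">50K"},
]

pipeline = ToyPipeline([
    ToyAssembler(["age", "education-num"], "features"),
    ToyStringIndexer("category", "category-index"),
])
model = pipeline.fit(rows)          # one fit ...
prepared = model.transform(rows)    # ... and one transform

# Going from an index back to a label, in the spirit of IndexToString:
labels = model.stages[1].labels
recovered = [labels[int(row["category-index"])] for row in prepared]
```

The fitted `model` can be reused on new rows without refitting, which is the point slide 20 makes about calling one fit and one transform.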

Editor's Notes

  1. TAs please stand up so people can recognize who you all are :)
  2. https://www.flickr.com/photos/spakattacks/1260950297/in/photolist-2VqGAn-arDiRL-GD51C-rg7ncX-9uVkX8-iGU4uD-cme4Yd-sgrS68-34dFWm-9mESpD-5PMbBV-6p6uqS-vDf2Z3-7ykJQ-dXUhBz-3TAXXp-6wMHQY-8p2Hyo-4t6Hj-4Er9ay-atwnU8-dRCvih-7nsnVG-58iGvR-naChvU-5gihUo-5abdvj-3ngY3K-p5E5nd-5gdaKX-56JnvK-5nhXVC-9rnD3y-cNxexS-yWTBY4-bTx32M-eiiUNV-iuw6-p8B8u-yvMnL-yvLni-5GqaS3-3KwsTY-yvME5-4BTb4C-bxdXxs-9M9Sit-c2CBL9-a4UXA4-b97BtF
  3. https://www.flickr.com/photos/nevernotfocused/14710283621/in/photolist-opU3PF-7Wjaig-7WnqV5-dzDfCv-9kpT5T-7EZN75-49t4dW-6cuYDv-dGbcz9-96Ec1M-2GgYZ5-9GJcmP-SCQc9-5dDnBa-9QozHB-7B8eqJ-3b58nt-4x9EG1-8c59U5-3HmbHE-8MSWuX-8XPUSh-eb2Rbx-N8FNU-qWfLm3-7Wzmsp-634wj8-8NpFnx-dGboCE-aE78Bz-hHHpJd-9ABFAu-oSuKdc-d3YZmJ-ePsq4E-7rXnPp-eyEw6e-8PnfCu-cdG9Sb-srVSE-5UYwzW-8Lfgpr-64xVvw-4NHLes-64pCsr-64pDSi-7WCzAE-dMMjrK-TE6Nv-dtGqLF
  4. https://www.flickr.com/photos/spakattacks/1260950297/in/photolist-2VqGAn-arDiRL-GD51C-rg7ncX-9uVkX8-iGU4uD-cme4Yd-sgrS68-34dFWm-9mESpD-5PMbBV-6p6uqS-vDf2Z3-7ykJQ-dXUhBz-3TAXXp-6wMHQY-8p2Hyo-4t6Hj-4Er9ay-atwnU8-dRCvih-7nsnVG-58iGvR-naChvU-5gihUo-5abdvj-3ngY3K-p5E5nd-5gdaKX-56JnvK-5nhXVC-9rnD3y-cNxexS-yWTBY4-bTx32M-eiiUNV-iuw6-p8B8u-yvMnL-yvLni-5GqaS3-3KwsTY-yvME5-4BTb4C-bxdXxs-9M9Sit-c2CBL9-a4UXA4-b97BtF
  5. https://www.flickr.com/photos/photofarmer/423926200/in/photolist-DsJuq-qsS2ka-arDiRL-4Bw32o-rdr7va-cBF7G5-4SN2dz-xzHrew-nx4TMP-npEcgc-9Z7xW2-5conzx-rYonmx-zK7wZ-Kt4uB-ayHfL9-deMnHj-oAQBEu-dUwzBb-wjP4WA-5nmbMV-dU5emw-5obV3u-551mE3-aC192Q-7c68UV-pq6bKF-7Z2S1D-zAMgg-4C9q9q-wavuvy-GVQQw-nHGEyG-7ADWSP-jQHCzK-np5zHL-ayEDk4-4kpMFp-4Rh78-3opP7V-oR8fMn-551oaq-tMcZWc-b4Z57Z-ayHa3b-6VCQac-fub4S7-4j4nW5-8Mvana-495vNa
  6. https://www.flickr.com/photos/tillwe/75684450/in/photolist-7FUmj-6pQSAv-9ddxwE-jW5vcr-dXaZ7j-mUZSq-89QrWo-grvK9-btJDEK-iCjZL-3g1jR-fTy1P-7nXFj-67bSSJ-asHJSr-dGhuAX-5H88ML-fej2NR-eLPabK-62i8fu-7eVJa6-3vy9a-hsPTYp-kPZSjJ-dXaZqQ-5RLTru-7piqP8-asLnaW-asLkmj-acEK6v-4xKgG1-cKn2bm-9z82Q2-drJs5U-nxeDa6-cKn13Q-naP6zr-asLkPj-e3dWXe-asHJfx-asLmc5-73k4v1-4VqP5n-9616VH-62Pdug-dtNfNc-3c1gh-dwxKVL-nW9BZ3-8JtUQu
  7. https://www.flickr.com/photos/jurvetson/8501129832/in/photolist-dXdvJ1-iWzTq-aEHcUe-8QqpWV-bo28M9-4EcPJn-55SSbg-4EgZod-i2jdJA-57RUHD-4EgYQw-F8ymu-2Y7wuB-BiQSUH-oQYM9-ArZ1av-bi96MD-7J8CMY-7qDmUU-wzJfE-NwV3-dDS8PM-6QxLb-4QLNG8-fej2NR-5GEjGP-om318-5jF3uM-5jebEq-S2rsn-b8C23i-9ykTzN-cECyio-5NwfH-8YgMCF-8BMnBT-9DFyFj-F5N5D-pzPHwp-ELQvK-3TbQEg-5tQxT4-BbDyCM-6c5sMN-6dBqGd-S2rsx-wgdLP-aFbsta-49E3V4-9QCaGp
  8. https://www.flickr.com/photos/westmidlandspolice/23765249535/in/photolist-Cd47HD-bkpYMD-bkpYtV-bq2y5T-bkpYG4-bq2xVi-bq2y2T-bQahtx-aDdcAU-dehhyn-bBfBJJ-aDdcAd-bQafgz-eJZBh1-9RJWR7-bTJHsT-5JDnP4-5wn1Yx-4tTago-aF2peg-bq2AM2-btSNYa-btSQ1a-btSPnc-btSPW8-btSPrx-9m5Cts-pYxGVU-uoPc4-btSNJ6-bq2xLX-bq2xJ8-bq2xRZ-btSR6c-btSRLz-btSRCi-bkpZc4-bkpYYR-btSQRk-btSRaH-btSS12-btSQJX-btSSM4-btSPvn-btSQrP-btSP7z-btSPS8-btSPD8-btSPyV-btSQ52
  9. https://www.flickr.com/photos/hyc/122643306/in/photolist-bQzzL-7Jd4VQ-6GHHbB-xfBFy-bQwKT-pyEuFR-bQCzf-fRWyLU-bQCaV-5KPina-bQyEN-aoTEiU-9fRxRC-63j3uo-r7ZoEA-bQwmg-93NCZM-phsK23-5E3jw-kVHuU-ht6Gp-4RDvR1-xfBkB-qM5M7-bQCqL-ox8Py-bQvVn-ByUasK-b6CjC-4TCPph-wNxqK-66Yosr-PoxA-49Gaji-ofQfGw-4svq18-fzCrBP-bNj7Cn-bhomEe-8KTkAm-cmLC17-4Lexyy-adYKN-4wFtZT-8v5CKo-qdgSwb-7zViMM-xfBzu-bQx4y-xfBoe
  10. https://www.flickr.com/photos/akras/2234513430/in/photolist-4pssZC-vLLhbz-gkDMe-6kAwip-CQi5w-fQuhF2-fR2yLS-2CiLyA-6pXUhu-5KXSkW-g1vScw-aXCQux-4DocBQ-fQuhAF-NojT5-4dtmkB-yqu47-HqCnf-aZWDs2-9ffnQL-2e5kP-646DaV-4rK8V1-9Wbir2-3oUjdL-3oPJnF-3oUewf-3oUgW7-3oUdSA-3oUfbN-3oPGFD-3oPLDZ-3oUcRm-3oUhy7-3oPL8g-3oUfP7-8VTaMe-4Mr6bi-3oPDLk-72mYKP-vtLyso-4qESv8-34guhE-3pV7aF-6L4p6-62YGih-oEwVYo-BueW7f-9kpnGs-sqBipi
  11. See, aren’t pipelines so much nicer?
  12. https://www.flickr.com/photos/cobra/151666057/in/photolist-epk32-4qfPrM-5L12Lk-5Zasc8-5L13JK-4qjTWq-6yzaDH-aji5zJ-6mJ2Ez-5BX3oi-5C2rwW-5BXfdt-5BX53V-2z5KC-5C2xmu-5C2tsf-5C2upY-dqNhpG-5C2vo9-5BX4he-aWMuz4-aMZz8K-5C2sw9-5BX6CT-5C2woL-8ZPUW-57sSHU-5BWEuc-5C1YNo-5BWPm2-5C1Dq7-5BWEFB-5C1Z1U-5BWEgF-5BWkZP-5BWksp-5C1D4y-5BWjRP-5C1Cuw-5BWjx8-6sgPiS-5C1ESC-5BWoVz-5BWp8v-5C1RVG-5BWZQD-5C1LZy-5C1JAL-5C1Hbu-5C2hZq
  13. https://www.flickr.com/photos/australianshepherds/2633058619/in/photolist-51F7Ap-61b4Eg-5MY3iq-5NUhT-742jTF-66vhos-5UjwWm-9nftSL-acYKvg-4r5Srg-7nHSsK-7ZP3ZG-5Z1ZtF-qtxkg-q2PAe6-m66KCz-5ULFVh-nWx3i-nMLr6-nCbof-nxV5x-nmoYY-nmbLE-nu5g9-nmoYK-nWx3p-nmbLT-nnshD-nu5gd-nCbnU-brmjTS-nr1pt-nGCMe-nGCM8-nwZMAr-bKPbvg-7HrxSz-nbjA5-9Sbudm-bmSA39-awWfvv-jK85WG-bKPa3c-ny2th-bEgfaa-bEgjCp-nCbob-biKuoR-egd1fj-5oK6kn