ML Pipelines with Apache
Spark & a little Apache Beam
Ottawa Reactive Meetup
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC (think committer with tenure)
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare
● Linkedin
● Github
● Related Spark Videos
Who do I think you all are?
● Nice people*
● Mostly engineers
● Familiar with one of Java, Scala, or Python
● May or may not know Apache Spark
What is in store for our adventure?
● Why train models on distributed systems
○ Have big data? It’s better than down sampling (more data often wins
over better algorithms)
○ Have small data? Enjoy fast hyper parameter tuning and taking an
early coffee break (but not too long)
● Why Apache Spark and Apache Beam in general
○ Answer questions in minutes instead of days* (*some of the time)
● “Classic” ML pipelines & then some deep learning
Ada Doglace
What is out of scope for today:
● The details backing these algorithms
○ If you ask I’ll just say “gradient descent” and run away after throwing a
smoke bomb on the floor.
● Questions about stack traces (j/k)
Ada Doglace
So why might you be doing this?
● Maybe you’ve built a system with “hand tuned” weights
● Your static list of [X, Y, Z] no longer cuts it
● Your system is overwhelmed with abuse & your budget
for handling it is less than an intern.
● You want a new job and ML sounds nicer than Perl on
the resume nowadays
Why did I get into this?
● I built a few search systems
● We spent a lot of time… guessing… I mean tuning
● We hired some smart people from Google
● Added ML magic
● Things went downhill from there (and then uphill at
another local maximum later)
What tools are we going to use today?
● Apache Spark - Model Training (plus fits into your ETL)
● emacs/vim - looking at random output
● spark-testing-base - You still need unit tests
● (sort of) spark-validation - Validating your jobs
● csv files - hey at least it’s not XML
● XML - ahhh crap
Demos will be in Scala (but I’ll try and avoid the odd things
like _s) & you can stop me if it’s confusing.
Mohammed Mustafa
What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most
● Much faster than Hadoop
● Good when too big for a single machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
When we say distributed we mean...
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
Plus a little magic :)
Steven Saus
The different pieces of Spark
● Apache Spark
● SQL, DataFrames & Datasets
● Python, &
● Spark ML
● bagel & Graph X
Paul Hudson
Required: Word count (in python)
lines = sc.textFile(src)
# Split each line into words
words = lines.flatMap(lambda x: x.split(" "))
# Pair each word with a 1, then sum the 1s per word
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
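If you don't have a Spark cluster handy, here is a rough plain-Python stand-in for the same three steps. It is only a sketch: `flat_map` and `reduce_by_key` are made-up helper names that mimic what the RDD methods compute on one machine, not Spark APIs, and the input lines are invented.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))

def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

lines = ["to be or not", "to be"]
words = flat_map(lambda x: x.split(" "), lines)
pairs = [(x, 1) for x in words]          # same as map(lambda x: (x, 1))
word_count = reduce_by_key(lambda x, y: x + y, pairs)
print(word_count)
```

The difference in Spark is that each step runs partition by partition across the cluster, and nothing executes until an action asks for a result.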
Photo By: Will
Companion notebook funtimes:
● Small companion Jupyter notebook to explore with:
○ Python:
○ Scala:
● If you want to use it you will need access to Apache Spark
○ Install from
○ Or get access to one of the online notebook environments (Google
Dataproc, Databricks Cloud, Microsoft Spark HDInsights Cluster
Notebook, etc.)
David DeHetre
Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another DataFrame
● Estimators can be trained on a DataFrame to produce a transformer
● Pipelines chain together multiple transformers and estimators
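To make the three roles concrete, here is a toy sketch in plain Python over lists of dicts. None of these classes are Spark classes: `Doubler`, `MeanCenterer`, and the column names are invented for illustration. The point is the shape: an estimator's `fit` returns a transformer, and fitting a pipeline fits each estimator on the output of the stages before it.

```python
class Doubler:
    """Plain transformer: nothing to learn, so no fit method."""
    def __init__(self, column):
        self.column = column

    def transform(self, rows):
        return [{**row, self.column + "_x2": row[self.column] * 2} for row in rows]

class MeanCentererModel:
    """Transformer produced by fitting a MeanCenterer."""
    def __init__(self, column, mean):
        self.column, self.mean = column, mean

    def transform(self, rows):
        name = self.column + "_centered"
        return [{**row, name: row[self.column] - self.mean} for row in rows]

class MeanCenterer:
    """Estimator: learns the column mean and returns a fitted transformer."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        return MeanCentererModel(self.column, sum(values) / len(values))

class PipelineModel:
    """Fitted pipeline: just runs every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains stages; estimators are fit on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)

train = [{"age": 20.0}, {"age": 40.0}]
model = Pipeline([Doubler("age"), MeanCenterer("age_x2")]).fit(train)
result = model.transform([{"age": 30.0}])
print(result)
```

One `fit` and one `transform` on the pipeline replace a call per stage, and the fitted model can be reused on data it has never seen.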
Let’s start with loading some data
● Genuine big data, doesn’t fit on a floppy disk
○ It’s ok if your inputs do fit on a floppy disk, buuuuut more data
generally works better
● In all seriousness, not a bad practice to down-sample
first while you’re building your pipeline so you can find
your errors fast (Spark pipelines discard some type
information -- sorry!)
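A minimal sketch of the down-sampling idea in plain Python (not Spark code): keep each row with a fixed probability and a fixed seed, so every rerun while debugging sees the same small sample. The name `down_sample`, the fraction, and the seed are assumptions for illustration.

```python
import random

def down_sample(rows, fraction, seed=42):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))
sample = down_sample(rows, fraction=0.1)
print(len(sample), "of", len(rows), "rows kept")
```

Once the pipeline runs cleanly on the sample, switch back to the full input for the real training run.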
Jess Johnson
Loading with sparkSQL & spark-csv returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema
● format("formatName")
○ built in formats include parquet, jdbc, etc. today we will use
● load("path")
Jess Johnson
Loading with sparkSQL & spark-csv
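// Assumes a SparkSession named `spark`; the reader expression and the input path were lost in extraction
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("path") // placeholder path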
Jess Johnson
Let’s explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2a: Data prep (produce features, remove complete
garbage, etc.) -
● Step 2b: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features**
** There is work to allow images and other things too.
Data prep / cleaning continued
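import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// Combines a list of double input features into a vector
// (the rest of the input column list was lost in extraction)
val assembler = new VectorAssembler().setInputCols(Array("age" /* , ... */))
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category").setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))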
So a bit more about that pipeline
● Each of our previous components has “fit” & “transform”
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
prepared = model.transform(df)
What does our pipeline look like so far?
Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors
● Assembler: this is a regular transformer - no fitting
● StringIndexer: while not an ML learning algorithm this still needs to be fit
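Roughly what those two stages compute, sketched in plain Python rather than Spark. The indexer has to see the data first (it learns a label-to-index mapping, more frequent labels first), while the assembler just packs columns into a list. The `hours` column, the sample rows, and the alphabetical tie-break are assumptions made up for this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Learn a label -> index mapping; more frequent labels get lower indices."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that is a choice made for this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def assemble(row, input_cols):
    """Pack the chosen numeric columns into one feature vector (a list)."""
    return [float(row[col]) for col in input_cols]

rows = [
    {"age": 25, "hours": 40, "category": "b"},
    {"age": 38, "hours": 50, "category": "a"},
    {"age": 52, "hours": 45, "category": "b"},
]
index = fit_string_indexer(r["category"] for r in rows)   # the "fit" step
prepared = [
    {**r,
     "features": assemble(r, ["age", "hours"]),
     "category-index": index[r["category"]]}
    for r in rows
]
print(prepared[1])
```

The learned `index` is the part worth saving: applying the same mapping later keeps labels consistent between training and prediction.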
Let's train a model on our prepared data:
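# Specify model (any further arguments were lost in extraction)
dt = DecisionTreeClassifier(labelCol="category-index")
# Fit it (assumes the `prepared` DataFrame from the pipeline above)
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
# Assumes `df` is the DataFrame loaded earlier
pipeline_model = pipeline_and_model.fit(df)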
And predict the results on the same data:
What does our tree look like?
val tree =
What does our tree look like?
[info] If (feature 1 <= 12.5)
[info]  If (feature 0 <= 33.5)
[info]   If (feature 0 <= 26.5)
[info]    If (feature 0 <= 23.5)
[info]     If (feature 0 <= 21.5)
[info]      Predict: 0.0
[info]     Else (feature 0 > 21.5)
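Reading that output: each If/Else tests one feature index against a threshold, and a Predict line is a leaf. Here is a small plain-Python sketch of walking such a tree. Only the root split (feature 1 at 12.5), the feature 0 split at 21.5, and the 0.0 leaf come from the printed output; the intermediate splits are collapsed and the other leaves are invented, because the printed tree is cut off.

```python
def predict(node, features):
    """Follow If/Else splits until a leaf with a prediction is reached."""
    while "prediction" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["prediction"]

# Hypothetical, heavily simplified tree for illustration only.
tree = {
    "feature": 1, "threshold": 12.5,
    "left": {
        "feature": 0, "threshold": 21.5,
        "left": {"prediction": 0.0},
        "right": {"prediction": 1.0},   # invented leaf
    },
    "right": {"prediction": 1.0},       # invented leaf
}

print(predict(tree, [20.0, 10.0]))
```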
Win G
And predict the results on new data
// Option 1: Add empty/place-holder label data
// (withColumn needs a Column, so wrap the literal with lit from the Spark SQL functions)
pipeline_model.transform(df.withColumn("category", lit("dne")))
I guess that looks ok? Let’s serve it!
● Waaaaaait - why is evaluate only on a dataframe?
● Ewwww - embedding Spark local mode in our webapp
○ The jar conflicts: they burn! & the performance will burn “later” :p
● See if your company has a model server, write export
function to match that glorious 90s C/C++ code base
● Write our own serving code & copy n’ paste the predict
● Use someone else’s copy n’ paste project
Ambernectar 13
But wait Spark has PMML support right?
● Spark 2.4 timeframe for pipelines -- I’m sorry (but the
code’s in master)
● Limited support in both models & general data prep
○ No general whole pipeline export yet either
● Serving options: write your own, license something, or
AGPL code
The state of serving is generally a mess
● One project which aims to improve this is KubeFlow
○ Goal is unifying training & serving experiences
● Despite the name targeting more than just TensorFlow
● Doesn’t work with Spark yet, but it’s on my TODO list.
Pipeline API has many models:
○ BinaryLogisticRegressionClassification, DecisionTreeClassification,
GBTClassifier, etc.
○ DecisionTreeRegression, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
carterse
It’s not always a standalone microservice:
● Linear regression is awesome because I can “serve”* it
inside as an embedding in my elasticsearch / solr query
● Batch prediction is pretty OK too for some things
○ Videos you may be interested in etc.
● Sometimes hybrid systems
○ Off-line expensive models + on-line inexpensive models
○ At this point you should probably hire a data scientist though
because saving a test set is effort
● Automagically* fit your model params
○ Things like max tree depth, min info gain, and other regularization.
● Because thinking is effort
● has the tools
● If you’re going to use this for auto-tuning please please
save a test set
● Otherwise your models will look awesome and perform
like a Ford Pinto
Jonathan Kotta
because saving a test set is effort
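// ParamGridBuilder constructs an Array of parameter combinations
val paramGrid: Array[ParamMap] = new ParamGridBuilder()
  .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
  .build()
// The setters and the fit call below are assumptions; the originals were lost in extraction
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
val cvModel = cv.fit(df)
val bestModel = cvModel.bestModel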
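What the cross-validator is doing, as a rough plain-Python sketch: split the tuning data into k folds, score every grid value on each held-out fold, and keep the value with the best average. The toy 1-D nearest-neighbour model, the data, and all the names here are assumptions for illustration, not Spark code. Note the test set is carved off first and never shown to the tuning loop.

```python
def k_folds(rows, k):
    """Yield (train, validation) pairs; fold i holds every k-th row starting at i."""
    for i in range(k):
        validation = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        yield train, validation

def knn_predict(train, x, neighbours):
    """Majority vote among the `neighbours` closest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:neighbours]
    votes = sum(label for _, label in nearest)
    return 1.0 if 2 * votes > len(nearest) else 0.0

def accuracy(train, rows, neighbours):
    hits = sum(1 for x, label in rows if knn_predict(train, x, neighbours) == label)
    return hits / len(rows)

def cross_validate(rows, grid, k=3):
    """Return the grid value with the best mean validation accuracy."""
    def mean_score(neighbours):
        scores = [accuracy(train, validation, neighbours)
                  for train, validation in k_folds(rows, k)]
        return sum(scores) / len(scores)
    return max(grid, key=mean_score)

data = [(float(x), 1.0 if x >= 10 else 0.0) for x in range(20)]
data[3] = (3.0, 1.0)                       # one deliberately noisy label
test_set = data[::5]                       # saved up front, never used for tuning
tuning_set = [row for i, row in enumerate(data) if i % 5 != 0]

grid = [1, 3, 5]
best = cross_validate(tuning_set, grid)
test_accuracy = accuracy(tuning_set, test_set, best)
print("best neighbours:", best, "test accuracy:", test_accuracy)
```

The number to report is `test_accuracy`, not the cross-validation score that picked the winner, since that score was already used to make a choice.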
Jonathan Kotta
False sense of security:
● A/B test please even if CV says many many $s
● Rank based things can have training bias with previous
● Non-displayed options: unlikely to be chosen
● Sometimes can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding
human biases but stronger
TensorFlowOnSpark, everyone loves mnist!
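# Reconstructed call: the `TFCluster.run(sc` prefix and the final input-mode argument were lost in extraction
cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                        args.cluster_size, num_ps, args.tensorboard,
                        TFCluster.InputMode.SPARK)
if args.mode == "train":
    cluster.train(dataRDD, args.epochs)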
Enter: TF.Transform
● For pre-processing of your data
○ e.g. where you spend 90% of your dev time anyways
● Integrates into serving time :D
● Runs on top of Apache Beam, but current release not yet outside of GCP
○ On master this can run on Flink, but probably has bugs currently.
○ Please don’t use this in production today unless you’re on
Kathryn Yengel
● I’m serious, I don’t want to die or cause the next
financial meltdown with software I’m a part of
● By Today I mean August 15 2018, but it’s probably
going to not be great for at least a “little while”
Vladimir Pustovit
Tambako The Jaguar
Ooor from the Chicago taxi data...
# (the loop header for this first block was lost in extraction)
# Preserve this feature as a dense float, setting nan's to the mean.
outputs[key] = transform.scale_to_z_score(inputs[key])
for key in taxi.VOCAB_FEATURE_KEYS:
    # Build a vocabulary for this feature.
    outputs[key] = transform.string_to_int(
        inputs[key], top_k=taxi.VOCAB_SIZE)  # further arguments lost in extraction
for key in taxi.BUCKET_FEATURE_KEYS:
    outputs[key] = transform.bucketize(inputs[key])  # bucket-count argument lost in extraction
Defining a Transform processing function
def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_int = tft.string_to_int(s)
    return {'x_centered': x_centered,
            'y_normalized': y_normalized, 's_int': s_int}
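The trick in a function like that is that it mixes two kinds of work: full passes over the dataset (mean, min/max, vocabulary) and per-row arithmetic that only needs those results. A plain-Python sketch of that split, with no TensorFlow or Beam involved; `analyze`, `transform_row`, the sample rows, and the alphabetical vocabulary order are all assumptions made up for illustration.

```python
def analyze(rows):
    """Full pass over the data: compute the statistics the transform needs."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    vocab = sorted({r["s"] for r in rows})   # alphabetical order, chosen for this sketch
    return {
        "x_mean": sum(xs) / len(xs),
        "y_min": min(ys),
        "y_max": max(ys),
        "s_index": {s: i for i, s in enumerate(vocab)},
    }

def transform_row(row, stats):
    """Instance-to-instance: needs only this row plus the precomputed stats."""
    return {
        "x_centered": row["x"] - stats["x_mean"],
        "y_normalized": (row["y"] - stats["y_min"]) / (stats["y_max"] - stats["y_min"]),
        "s_int": stats["s_index"][row["s"]],
    }

rows = [
    {"x": 1.0, "y": 10.0, "s": "hello"},
    {"x": 2.0, "y": 20.0, "s": "world"},
    {"x": 3.0, "y": 30.0, "s": "hello"},
]
stats = analyze(rows)                       # done once, at training time
transformed = [transform_row(r, stats) for r in rows]
print(transformed[1])
```

Because `stats` is computed once and then frozen, the same per-row function can run at serving time on a single request, which is the "integrates into serving time" point.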
● Reduce (full pass), e.g. mean, stddev
○ Implemented as a distributed data pipeline
● Instance-to-instance (don’t change batch dimension)
○ Pure TensorFlow
Some common use-cases...
● Scale to ...
● Bag of Words / N-Grams
● Bucketization
● Feature Crosses
BEAM Beyond the JVM: Current release
● Non JVM BEAM doesn’t work outside of Google’s environment yet
● tl;dr : uses grpc / protobuf
○ Similar to the common design but with more efficient representations (often)
● But exciting new plans to unify the runners and ease the support of different
languages (called SDKs)
○ See
● If this is exciting, you can come join me on making BEAM work in Python3
○ Yes we still don’t have that :(
○ But we're getting closer & you can come join us on BEAM-2874 :D
BEAM Beyond the JVM: Master w/ experiments
So what does that look like?
Worker 1
Worker K
Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to
● Iterative batches: automatically train on new data,
deploy model, and A/B test
● But A/B testing isn’t enough -- bad data can result in
wrong or even illegal results (ask me after a bud light
So why should you test & validate
Results from: Testing with Spark survey
● For now checking file sizes & execution time seem like the most common best
practice (from survey)
● spark-validator is still in early stages and not ready for production use but
interesting proof of concept
● Doesn’t need to be done in your Spark job (can be done in your scripting
language of choice with whatever job control system you are using)
● Sometimes your rules will misfire and you’ll need to manually approve a job
- that is ok!
● Remember those property tests? Could be great Validation rules!
Photo by:
Paul Schadler
Using a Spark accumulator for validation:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
// `input` is an assumed name for the source RDD; the original receiver was lost in extraction
val records = input.map { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
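The same counting rule can live outside Spark too. Here is a minimal plain-Python sketch, assuming a simple list of raw records; `parse_with_validation`, `BadDataError`, and the sample inputs are invented names. Unlike the slide code, this version skips invalid records instead of parsing them anyway.

```python
class BadDataError(Exception):
    """Raised when too many records fail validation."""

def parse_with_validation(raw_records, is_valid, parse, max_bad_ratio=0.1):
    """Parse valid records, count both kinds, and fail if the bad share is too high."""
    ok = bad = 0
    parsed = []
    for record in raw_records:
        if is_valid(record):
            ok += 1
            parsed.append(parse(record))
        else:
            bad += 1
    # Same rule as the slide: more than 10% as many bad records as good ones is a failure.
    if bad > max_bad_ratio * ok:
        raise BadDataError("bad data - do not use results")
    return parsed

mostly_clean = [str(i) for i in range(20)] + ["oops"]
records = parse_with_validation(mostly_clean, str.isdigit, int)
print(len(records), "records parsed")

try:
    parse_with_validation(["1", "x", "y"], str.isdigit, int)
    rejected = False
except BadDataError:
    rejected = True
print("junk input rejected:", rejected)
```

The check runs only after the whole input has been counted, which mirrors the reason the Spark version waits for an action before reading the accumulators.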
Found Animals Foundation
Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
new AbsoluteSparkCounterValidationRule("recordsRead", Some(30),
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
//Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
Common ML Specific Validation
● Number of iterations
○ did I converge super quickly or slowly compared to last time? Could indicate junk data.
● CV model performance versus previous run
● Performance on a “fixed” test set (periodically manually refresh)
● Shadow run model on input stream - % of failures or missing results
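These checks are easy to script outside the Spark job. A hedged sketch in plain Python: compare this run's numbers with the previous run's and list which rules fired. The metric names and every threshold here are made-up assumptions; pick your own from what your jobs normally look like.

```python
def validate_run(current, previous, max_iteration_ratio=2.0,
                 max_metric_drop=0.05, max_shadow_failure_rate=0.01):
    """Return the names of the rules that fired (an empty list means the run looks OK)."""
    problems = []
    # Converging far faster or slower than last time could indicate junk data.
    ratio = current["iterations"] / previous["iterations"]
    if ratio > max_iteration_ratio or ratio < 1.0 / max_iteration_ratio:
        problems.append("iterations")
    # Performance on the fixed test set versus the previous run.
    if previous["test_metric"] - current["test_metric"] > max_metric_drop:
        problems.append("test_metric")
    # Share of failures or missing results when shadow-running on the input stream.
    if current["shadow_failure_rate"] > max_shadow_failure_rate:
        problems.append("shadow_failure_rate")
    return problems

previous = {"iterations": 40, "test_metric": 0.91, "shadow_failure_rate": 0.001}
healthy = {"iterations": 45, "test_metric": 0.90, "shadow_failure_rate": 0.002}
suspicious = {"iterations": 3, "test_metric": 0.70, "shadow_failure_rate": 0.2}

print(validate_run(healthy, previous))
print(validate_run(suspicious, previous))
```

A non-empty result does not have to block the deploy automatically; it can just route the job to a human for manual approval.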
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
High Performance Spark!
You can buy it today on
Not a lot of ML focus but some!
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print
rather than e-book.
What about the code lab?
● Chocodyno
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
I need to give a testing talk next
month, help a “friend” out.
Will tweet results
“eventually” @holdenkarau
Do you want more realistic
benchmarks? Share your UDFs!
Pssst: Have feedback on the presentation? Give me a
shout ( if you feel comfortable doing
so :)
Give feedback on this presentation

ML Pipelines with Apache Spark and Apache Beam - Ottawa Reactive Meetup, August 16 2018

An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Data Con LA
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Holden Karau
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Holden Karau
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
Holden Karau
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016Testing and validating spark programs - Strata SJ 2016
Testing and validating spark programs - Strata SJ 2016
Holden Karau
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
Holden Karau
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017Debugging PySpark - Spark Summit East 2017
Debugging PySpark - Spark Summit East 2017
Holden Karau
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
Holden Karau
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen

Similar to Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup August 16 2018 (19)

An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
ML Pipelines with Apache Spark and Apache Beam - Ottawa Reactive Meetup, August 16, 2018

  • 1. ML Pipelines with Apache Spark & a little Apache Beam Ottawa Reactive Meetup 2018 Hella-Legit
  • 2. Who am I? ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC (think committer with tenure) ● Contributor to a lot of other projects (including BEAM) ● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● Co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare ● Linkedin ● Github ● Related Spark Videos
  • 4. Who do I think you all are? ● Nice people* ● Mostly engineers ● Familiar with one of Java, Scala, or Python ● May or may not know Apache Spark
  • 5. What is in store for our adventure? ● Why train models on distributed systems ○ Have big data? It’s better than downsampling (more data often wins over better algorithms) ○ Have small data? Enjoy fast hyperparameter tuning and taking an early coffee break (but not too long) ● Why Apache Spark and Apache Beam in general ○ Answer questions in minutes instead of days* (*some of the time) ● “Classic” ML pipelines & then some deep learning
  • 6. What is out of scope for today: ● The details backing these algorithms ○ If you ask I’ll just say “gradient descent” and run away after throwing a smoke bomb on the floor. ● Questions about stack traces (j/k)
  • 7. So why might you be doing this? ● Maybe you’ve built a system with “hand tuned” weights ● Your static list of [X, Y, Z] no longer cuts it ● Your system is overwhelmed with abuse & your budget for handling it is less than an intern. ● You want a new job and ML sounds nicer than Perl on the resume nowadays
  • 8. Why did I get into this? ● I built a few search systems ● We spent a lot of time… guessing… I mean tuning ● We hired some smart people from Google ● Added ML magic ● Things went downhill from there (and then uphill at another local maximum later)
  • 9. What tools are we going to use today? ● Apache Spark - Model Training (plus fits into your ETL) ● emacs/vim - looking at random output ● spark-testing-base - You still need unit tests ● (sort of) spark-validation - Validating your jobs ● csv files - hey at least it’s not XML ● XML - ahhh crap Demos will be in Scala (but I’ll try and avoid the odd things like _s) & you can stop me if it’s confusing.
  • 10. What is Spark? ● General purpose distributed system ○ Built in Scala with an FP inspired API ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 11. When we say distributed we mean...
  • 12. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
  • 13. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
  • 14. Plus a little magic :)
  • 15. The different pieces of Spark: ● Apache Spark ● SQL, DataFrames & Datasets ● Structured Streaming ● Streaming ● Spark ML ● MLLib ● Bagel & GraphX ● GraphFrames ● Language APIs: Scala, Java, Python, & R (some pieces Scala, Java, Python only)
  • 16. Required: Word count (in Python)
        lines = sc.textFile(src)
        words = lines.flatMap(lambda x: x.split(" "))
        word_count = (words.map(lambda x: (x, 1))
                      .reduceByKey(lambda x, y: x + y))
        word_count.saveAsTextFile("output")
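For readers without a cluster handy, the three steps of the word count (flatMap, map to pairs, reduceByKey) can be sketched in plain Python. This only illustrates what the job computes, not how Spark distributes it; the function name `word_count` and the sample lines are made up for the example.

```python
from collections import defaultdict

def word_count(lines):
    """Local, single-machine analogue of the Spark word count."""
    # flatMap: split every line into words and flatten the result
    words = [w for line in lines for w in line.split(" ")]
    # map: pair each word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical input, standing in for the lines of a text file
sample = ["spark is fast", "beam is portable", "spark and beam"]
result = word_count(sample)
```

In Spark the per-key summing happens in parallel across partitions, which is why `reduceByKey` takes an associative function such as addition.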
  • 17. Companion notebook funtimes: ● Small companion Jupyter notebook to explore with: ○ Python: ○ Scala: ● If you want to use it you will need access to Apache Spark ○ Install from ○ Or get access to one of the online notebook environments (Google Dataproc, Databricks Cloud, Microsoft Spark HDInsight Cluster Notebook, etc.)
  • 18. Transformers, Estimators and Pipelines ● Transformers transform a DataFrame into another ● Estimators can be trained on a DataFrame to produce a transformer ● Pipelines chain together multiple transformers and estimators
  • 19. Let’s start with loading some data ● Genuine big data, doesn’t fit on a floppy disk ○ It’s ok if your inputs do fit on a floppy disk, buuuuut more data generally works better ● In all seriousness, not a bad practice to down-sample first while you’re building your pipeline so you can find your errors fast (Spark pipelines discard some type information -- sorry!)
  • 20. Loading with sparkSQL & spark-csv: read returns a DataFrameReader. We can specify general properties & data specific options ● option("key", "value") ○ spark-csv ones we will use are header & inferSchema ● format("formatName") ○ built in formats include parquet, jdbc, etc. today we will use com.databricks.spark.csv ● load("path")
  • 21. Loading with sparkSQL & spark-csv
        // `spark` is assumed to be the active SparkSession
        val df = spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("resources/")
  • 22. Let’s explore training a Decision Tree ● Step 1: Data loading (done!) ● Step 2a: Data prep (produce features, remove complete garbage, etc.) ● Step 2b: Data prep (select features, etc.) ● Step 3: Train ● Step 4: Predict
  • 23. Data prep / cleaning ● We need to predict a double (can be 0.0, 1.0, but type must be double) ● We need to train with a vector of features** ** There is work to allow images and other things too.
  • 24. Data prep / cleaning continued
        import org.apache.spark.ml.Pipeline
        import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

        // Combines a list of double input features into a vector
        val assembler = new VectorAssembler()
          .setInputCols(Array("age", "education-num"))
          .setOutputCol("features")
        // String indexer converts a set of strings into doubles
        val indexer = new StringIndexer()
          .setInputCol("category")
          .setOutputCol("category-index")
        // Can be used to combine pipeline components together
        val pipeline = new Pipeline().setStages(Array(assembler, indexer))
  • 25. So a bit more about that pipeline ● Each of our previous components has “fit” & “transform” stages ● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform) ● Can re-use the fitted model on future data
        prepared = model.transform(df)
  • 26. What does our pipeline look like so far? Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors ● Assembler: this is a regular transformer - no fitting required. ● StringIndexer: while not an ML learning algorithm this still needs to be fit.
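The fit/transform split can be sketched without Spark. The classes below are toy stand-ins written for this example (they are not Spark's classes, and rows are plain dicts rather than a DataFrame); the column names are borrowed from the slides. The point is only that the assembler transforms directly, the indexer must be fit first, and the pipeline hides that difference behind one fit and one transform.

```python
from collections import Counter

class Assembler:
    """Transformer: no fitting needed, just combines columns into a vector."""
    def __init__(self, input_cols, output_col):
        self.input_cols, self.output_col = input_cols, output_col

    def transform(self, rows):
        return [{**r, self.output_col: [float(r[c]) for c in self.input_cols]}
                for r in rows]

class IndexerModel:
    """Transformer produced by fitting an Indexer."""
    def __init__(self, mapping, input_col, output_col):
        self.mapping, self.input_col, self.output_col = mapping, input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: self.mapping[r[self.input_col]]}
                for r in rows]

class Indexer:
    """Estimator: has to look at the data to learn the string -> double mapping."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = Counter(r[self.input_col] for r in rows)
        # Most frequent label gets 0.0, the next 1.0, ... (ties broken alphabetically)
        ordered = sorted(counts, key=lambda k: (-counts[k], k))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return IndexerModel(mapping, self.input_col, self.output_col)

class PipelineModel:
    """Fitted pipeline: every stage is now a plain transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains stages; fit() turns every estimator into a fitted transformer."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)

# Made-up rows for illustration
data = [
    {"age": 39, "education-num": 13, "category": "<=50K"},
    {"age": 50, "education-num": 13, "category": "<=50K"},
    {"age": 52, "education-num": 9, "category": ">50K"},
]
pipeline = Pipeline([Assembler(["age", "education-num"], "features"),
                     Indexer("category", "category-index")])
model = pipeline.fit(data)
prepared = model.transform(data)
```

Because the fitted model keeps the learned mapping, the same `model.transform` call can be reused on future data without refitting.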
  • 27. Let's train a model on our prepared data:
        # Specify model
        dt = DecisionTreeClassifier(labelCol="category-index", featuresCol="features")
        # Fit it
        dt_model = dt.fit(prepared)
        # Or as part of the pipeline
        pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
        pipeline_model = pipeline_and_model.fit(df)
  • 28. And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
  • 29. What does our tree look like?
        val tree = pipeline_model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
        println(tree.toDebugString)
  • 30. What does our tree look like? ooooh
        [info] If (feature 1 <= 12.5)
        [info]  If (feature 0 <= 33.5)
        [info]   If (feature 0 <= 26.5)
        [info]    If (feature 0 <= 23.5)
        [info]     If (feature 0 <= 21.5)
        [info]      Predict: 0.0
        [info]     Else (feature 0 > 21.5)
  • 31. And predict the results on new data
        import org.apache.spark.sql.functions.lit

        // Option 1: Add empty/place-holder label data
        pipeline_model.transform(df.withColumn("category", lit("dne")))
          .take(20)
  • 32. I guess that looks ok? Let’s serve it! ● Waaaaaait - why is evaluate only on a dataframe? ● Ewwww - embedding Spark local mode in our webapp ○ The jar conflicts: they burn! & the performance will burn “later” :p Options: ● See if your company has a model server, write export function to match that glorious 90s C/C++ code base ● Write our own serving code & copy n’ paste the predict code ● Use someone else’s copy n’ paste project
  • 33. But wait Spark has PMML support right? ● Spark 2.4 timeframe for pipelines -- I’m sorry (but the code’s in master) ● Limited support in both models & general data prep ○ No general whole pipeline export yet either ● Serving options: write your own, license something, or AGPL code
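As a rough idea of what "write our own serving code" means for a decision tree, here is a minimal sketch in plain Python. The nested-dict layout and the `predict` helper are assumptions made for this example, not a Spark export format; the thresholds echo the debug output shown earlier, while the leaf values are invented.

```python
# Hypothetical tree: thresholds on features 1 and 0 echo the toDebugString
# output; the structure is simplified and the leaf predictions are invented.
TREE = {
    "feature": 1, "threshold": 12.5,
    "left": {"feature": 0, "threshold": 33.5,
             "left": {"predict": 0.0},
             "right": {"predict": 1.0}},
    "right": {"predict": 1.0},
}

def predict(node, features):
    """Walk the tree: go left when feature <= threshold, otherwise right."""
    while "predict" not in node:
        branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["predict"]

# features[0] is age, features[1] is education-num in this sketch
young_low_edu = predict(TREE, [25.0, 10.0])
older_low_edu = predict(TREE, [40.0, 10.0])
high_edu = predict(TREE, [25.0, 14.0])
```

The catch the slide hints at still applies: the feature preparation (assembling, indexing) has to be reimplemented identically on the serving side, which is where copy n' paste serving tends to drift from training.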
  • 34. The state of serving is generally a mess ● One project which aims to improve this is KubeFlow ○ Goal is unifying training & serving experiences ● Despite the name targeting more than just TensorFlow ● Doesn’t work with Spark yet, but it’s on my TODO list.
  • 35. Pipeline API has many models: ● org.apache.spark.ml.classification ○ BinaryLogisticRegressionClassification, DecisionTreeClassification, GBTClassifier, etc. ● org.apache.spark.ml.regression ○ DecisionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc. ● org.apache.spark.ml.recommendation ○ ALS
  • 36. It’s not always a standalone microservice: ● Linear regression is awesome because I can “serve”* it inside as an embedding in my elasticsearch / solr query ● Batch prediction is pretty OK too for some things ○ Videos you may be interested in etc. ● Sometimes hybrid systems ○ Off-line expensive models + on-line inexpensive models ○ At this point you should probably hire a data scientist though
  • 37. Cross-validation because saving a test set is effort ● Automagically* fit your model params ○ Things like max tree depth, min info gain, and other regularization. ● Because thinking is effort ● org.apache.spark.ml.tuning has the tools ● If you’re going to use this for auto-tuning please please save a test set ● Otherwise your models will look awesome and perform like a Ford Pinto
  • 41. Cross-validation because saving a test set is effort
        // ParamGridBuilder constructs an Array of parameter combinations.
        val paramGrid: Array[ParamMap] = new ParamGridBuilder()
          .addGrid(nb.smoothing, Array(0.1, 0.5, 1.0, 2.0))
          .build()
        // An evaluator (setEvaluator) must also be set before fit is called
        val cv = new CrossValidator()
          .setEstimator(pipeline)
          .setEstimatorParamMaps(paramGrid)
        val cvModel = cv.fit(df)
        val bestModel = cvModel.bestModel
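What the cross-validator does with a parameter grid, and why a separate test set still matters, can be sketched in plain Python. Everything here is a toy assumption for illustration: a one-dimensional ridge regression through the origin, made-up data following y = 2x, and a hypothetical `cross_validate` helper. It is not Spark's implementation.

```python
def fit(train, lam):
    """1-D ridge regression through the origin: w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)

def mse(w, rows):
    """Mean squared error of the model y = w * x on the given rows."""
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)

def cross_validate(rows, grid, k=3):
    """Return the grid value with the lowest mean validation error over k folds."""
    scores = {}
    for lam in grid:
        errs = []
        for i in range(k):
            valid = rows[i::k]  # every k-th row, starting at offset i
            train = [r for j, r in enumerate(rows) if j % k != i]
            errs.append(mse(fit(train, lam), valid))
        scores[lam] = sum(errs) / k
    return min(scores, key=scores.get)

data = [(float(x), 2.0 * x) for x in range(1, 13)]  # toy data: y = 2x
test_set, train_set = data[:3], data[3:]            # hold the test set out first
best_lam = cross_validate(train_set, [0.0, 1.0, 10.0])
# The test set is touched exactly once, after tuning is finished
test_error = mse(fit(train_set, best_lam), test_set)
```

The validation folds are used to pick the parameter, so their scores are optimistic by construction; only the held-out `test_set` gives an estimate that the tuning never saw.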
  • 42. False sense of security: ● A/B test please even if CV says many many $s ● Rank based things can have training bias with previous orders ● Non-displayed options: unlikely to be chosen ● Sometimes can find previous formulaic corrections ● Sometimes we can “experimentally” determine ● Other times we just hope it’s better than nothing ● Try and make sure your ML isn’t evil or re-encoding human biases but stronger
  • 43. TensorFlowOnSpark, everyone loves mnist!
        cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                                args.cluster_size, num_ps, args.tensorboard,
                                TFCluster.InputMode.SPARK)
        if args.mode == "train":
            cluster.train(dataRDD, args.epochs)
  • 44. Enter: TF.Transform ● For pre-processing of your data ○ e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Runs on top of Apache Beam, but current release not yet outside of GCP ○ On master this can run on Flink, but probably has bugs currently. ○ Please don’t use this in production today unless you’re on GCP/Dataflow
  • 45. DO NOT USE THIS IN PRODUCTION TODAY ● I’m serious, I don’t want to die or cause the next financial meltdown with software I’m a part of ● By Today I mean August 15 2018, but it’s probably going to not be great for at least a “little while”
  • 46. Ooor from the Chicago taxi data...
        for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
            # Preserve this feature as a dense float, setting nan's to the mean.
            outputs[key] = transform.scale_to_z_score(inputs[key])
        for key in taxi.VOCAB_FEATURE_KEYS:
            # Build a vocabulary for this feature.
            outputs[key] = transform.string_to_int(
                inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)
        for key in taxi.BUCKET_FEATURE_KEYS:
            outputs[key] = transform.bucketize(inputs[key],
  • 47. Defining a Transform processing function
        def preprocessing_fn(inputs):
            x = inputs['x']
            y = inputs['y']
            s = inputs['s']
            x_centered = x - tft.mean(x)
            y_normalized = tft.scale_to_0_1(y)
            s_int = tft.string_to_int(s)
            return {
                'x_centered': x_centered,
                'y_normalized': y_normalized,
                's_int': s_int}
  • 48. ● Analyzers (mean, stddev, quantiles): reduce (full pass), implemented as a distributed data pipeline ● Transforms (normalize, multiply, bucketize): instance-to-instance (don’t change batch dimension), pure TensorFlow
  • 50. Some common use-cases... ● Scale to ...: tft.scale_to_z_score, ... ● Bag of Words / N-Grams: tf.string_split, tft.ngrams, tft.string_to_int ● Bucketization: tft.quantiles, tft.apply_buckets ● Feature Crosses: tf.string_join, tft.string_to_int
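The analyzer/transform split can be illustrated in plain Python with a z-score. This is a sketch of the idea only, not the TF.Transform API: `analyze` and `scale_to_z_score` are names chosen for the example, and the data is made up. The analyzer needs a full pass over the dataset; the transform then works one instance at a time using only the constants the analyzer produced, which is what lets the same logic run at serving time.

```python
import math

def analyze(values):
    """Analyzer phase: one full pass over the dataset to compute constants."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "stddev": math.sqrt(variance)}

def scale_to_z_score(value, stats):
    """Transform phase: per-instance, uses only the precomputed constants."""
    return (value - stats["mean"]) / stats["stddev"]

# Made-up training column
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = analyze(train)            # done once, over all the training data
z = scale_to_z_score(9.0, stats)  # can be applied to a single new instance
```

Note that the new instance never triggers a recomputation of the mean or standard deviation; those are frozen at training time.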
  • 51. BEAM Beyond the JVM: Current release ● Non JVM BEAM doesn’t work outside of Google’s environment yet ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKs) ○ See ● If this is exciting, you can come join me on making BEAM work in Python3 ○ Yes we still don’t have that :( ○ But we're getting closer & you can come join us on BEAM-2874 :D
  • 52. BEAM Beyond the JVM: Master w/ experiments (figure: several entries, including portability, are marked *ish)
  • 53. So what does that look like? Driver ↔ Worker 1 (Docker, grpc) ... Worker K (Docker, grpc)
  • 54. Updating your model ● The real world changes ● Online learning (streaming) is super cool, but hard to version ● Iterative batches: automatically train on new data, deploy model, and A/B test ● But A/B testing isn’t enough -- bad data can result in wrong or even illegal results (ask me after a bud light lime)
  • 55. So why should you test & validate? Results from: Testing with Spark survey
  • 56. Validation ● For now checking file sizes & execution time seem like the most common best practice (from survey) ● spark-validator is still in early stages and not ready for production use but interesting proof of concept ● Doesn’t need to be done in your Spark job (can be done in your scripting language of choice with whatever job control system you are using) ● Sometimes your rules will misfire and you’ll need to manually approve a job - that is ok! ● Remember those property tests? Could be great Validation rules!
  • 57. Using a Spark accumulator for validation:
        val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
        // `input` stands for the RDD being parsed (name assumed)
        val records = input.map { x =>
          if (isValid(x)) ok += 1 else bad += 1
          // Actual parse logic here
        }
        // An action (e.g. count, save, etc.)
        if (bad.value > 0.1 * ok.value) {
          throw new Exception("bad data - do not use results")
          // Optional cleanup
        }
        // Mark as safe
        P.S: If you are interested in this check out spark-validator (still early stages).
  • 58. Validating records read matches our expectations:
        val vc = new ValidationConf(tempPath, "1", true,
          List[ValidationRule](
            new AbsoluteSparkCounterValidationRule("recordsRead", Some(30), Some(1000))))
        val sqlCtx = new SQLContext(sc)
        val v = Validation(sc, sqlCtx, vc)
        // Business logic goes here
        assert(v.validate(5) === true)
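The same counting rule can be tried locally in plain Python, with two integers playing the role of the accumulators. The names `parse_with_validation` and `BadDataError` are hypothetical, as are the sample records; only the 10% threshold comes from the slide.

```python
class BadDataError(Exception):
    """Raised when too many input records fail validation."""

def parse_with_validation(raw_records, is_valid, parse, max_bad_ratio=0.1):
    """Parse valid records, count the rest, and refuse the batch if too many are bad."""
    ok, bad = 0, 0  # play the role of the two accumulators
    parsed = []
    for record in raw_records:
        if is_valid(record):
            ok += 1
            parsed.append(parse(record))
        else:
            bad += 1
    # Same rule as the slide: fail if bad records exceed 10% of the good ones
    if bad > max_bad_ratio * ok:
        raise BadDataError(f"bad data - do not use results ({bad} bad, {ok} ok)")
    return parsed

clean = parse_with_validation(["1", "2", "3"], str.isdigit, int)
```

One difference worth keeping in mind: in Spark the accumulator values are only meaningful after an action has run, whereas this local loop is eager, so the check can sit right after it.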
  • 59. Common ML Specific Validation ● Number of iterations ○ did I converge super quickly or slowly compared to last time? Could indicate junk data. ● CV model performance versus previous run ● Performance on a “fixed” test set (periodically manually refresh) ● Shadow run model on input stream - % of failures or missing results
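A few of these checks can be sketched as a comparison between the current training run and the previous one. The dictionary keys, the thresholds, and the `validate_run` helper are all assumptions made up for this example; sensible limits depend entirely on the job.

```python
def validate_run(current, previous, max_iter_ratio=2.0,
                 max_metric_drop=0.05, max_shadow_failures=0.01):
    """Return a list of warnings; an empty list means the run looks plausible."""
    warnings = []
    # Converging much faster or slower than last time could indicate junk data
    ratio = current["iterations"] / previous["iterations"]
    if ratio > max_iter_ratio or ratio < 1.0 / max_iter_ratio:
        warnings.append("iteration count changed sharply")
    # Performance on the fixed test set should not fall off a cliff
    if previous["test_metric"] - current["test_metric"] > max_metric_drop:
        warnings.append("performance on the fixed test set dropped")
    # Share of failures or missing results from the shadow run
    if current["shadow_failure_rate"] > max_shadow_failures:
        warnings.append("too many failures in the shadow run")
    return warnings

# Made-up metrics for two runs
previous = {"iterations": 40, "test_metric": 0.90, "shadow_failure_rate": 0.001}
current = {"iterations": 5, "test_metric": 0.80, "shadow_failure_rate": 0.002}
problems = validate_run(current, previous)
```

Returning warnings rather than raising leaves room for the manual approval step mentioned earlier, when a rule misfires on a run that is actually fine.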
  • 60. ● Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  • 61. High Performance Spark! You can buy it today! Not a lot of ML focus but some! Cats love it* *Or at least the box it comes in. If buying for a cat, get print rather than e-book.
  • 62. What about the code lab? ● Chocodyno
  • 63. k thnx bye :) If you care about Spark testing and don’t hate surveys: I need to give a testing talk next month, help a “friend” out. Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! Pssst: Have feedback on the presentation? Give me a shout (if you feel comfortable doing so :) Give feedback on this presentation