Apache® Spark™ MLlib 2.x: Migrating ML Workloads to DataFrames

Webinar Logistics
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and
Apache Spark Committer & PMC member
working on Machine Learning at
Databricks. Previously, he was a postdoc
at UC Berkeley after receiving his Ph.D. in
Machine Learning from Carnegie Mellon in
2013.
About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark
Community Evangelist with Databricks. He
is a hands-on developer with over 15 years
of experience and has worked at leading
companies building large-scale distributed
systems.
Databricks
Founded by the creators of Apache Spark in 2013.
Share of Spark code contributed by Databricks in 2014: 75%
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
[Diagram: Spark Core, with the Spark Streaming, Spark SQL, MLlib, and GraphX libraries on top]
Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap ahead during 2.x
Intro to MLlib in 2.x
A bit of MLlib history
Spark 0.8: RDD-based API
• Fast, scale-out ML
Challenges
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs
Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)
Major improvements
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
MLlib trajectory
[Chart: commits per release for v0.8 through v2.0 (y-axis 0–1000), annotated with the introduction of the Scala/Java API (the primary API for MLlib), the Python API, the R API, and the DataFrame-based API for MLlib.]
DataFrame-based API for MLlib
DataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows and automating hyperparameter tuning.
Learn more about ML Pipelines:
http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
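To make this concrete, here is a minimal Pipeline sketch in Scala. It assumes DataFrames named training and test with "text" and "label" columns; the column names and parameter values are illustrative, not taken from the webinar demo.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // Two Transformers and one Estimator, chained into a single workflow.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Fitting the Pipeline returns a PipelineModel, itself a Transformer.
  val model = pipeline.fit(training)
  val predictions = model.transform(test)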
DataFrame-based API for MLlib
In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted on by the community
• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
Migrating an ML workload to DataFrames
Why migrate to DataFrames? (1 of 3: DataFrames)
DataFrames & Datasets are the new “core” API for Spark.
• Data sources & ETL (see the sketch below)
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming
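As a small illustration of that integration, the same DataFrame API handles both ETL and ML input. A sketch, assuming a hypothetical Parquet dataset with "label" and "text" columns:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("etl-to-ml").getOrCreate()

  // Read a data source and do light ETL; the path and columns are illustrative.
  val df = spark.read.parquet("s3n://bucket/events.parquet")
  val training = df.filter(col("label").isNotNull).select("label", "text")
  // `training` can feed directly into an ML Pipeline (see the earlier sketch).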
Why migrate to DataFrames? (2 of 3: Language APIs)
Standardized across Scala, Java, Python, and R
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)
Why migrate to DataFrames? (3 of 3: Pipelines)
Specify complex ML workflows
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning (see the sketch below)
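A minimal tuning sketch, reusing the pipeline, hashingTF, and lr values from the earlier example; the grid values and number of folds are illustrative:

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Grid over hyperparameters of the Pipeline's stages.
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 10000))
    .addGrid(lr.regParam, Array(0.01, 0.1))
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // Fits once per fold and parameter combination; returns the best model found.
  val cvModel = cv.fit(training)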
Demo migration
Convert a notebook from the RDD-based API to the DataFrame-based API.
Key points
• Works with single models or complex Pipelines
• Migration can be incremental
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)
Warning: this demo is for experts!
Demo recap: migration process
Separate the two migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames
Migrate the ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
  • Certain Param and model methods
  • run() → fit() (see the sketch below)
• Tips:
  • Use explainParams()
  • Compare the API docs if you hit issues!
Migrate the data APIs: RDDs → DataFrames
• Tip: Get familiar with the conversion syntax in both directions.
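A before-and-after sketch of the run() → fit() change, assuming an existing RDD[LabeledPoint] named labeledPoints and an equivalent DataFrame trainingDF with "label" and "features" columns:

  // Old RDD-based API (spark.mllib): training happens via run().
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  val oldModel = new LogisticRegressionWithLBFGS().run(labeledPoints)

  // New DataFrame-based API (spark.ml): training happens via fit().
  import org.apache.spark.ml.classification.LogisticRegression
  val lr = new LogisticRegression().setMaxIter(100)
  println(lr.explainParams())  // lists every Param with docs & current values
  val newModel = lr.fit(trainingDF)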
Demo recap: migration process
Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
  • Tip: Check intermediate results.
• Gotcha: There are separate Vector (and Matrix) types in spark.mllib and spark.ml.
  • Relevant for the Spark 1.6 → 2.0 migration
  • Tip: Watch for buried errors: MatchError and mentions of “vector”
  • Tip: Use the helper methods for conversion (sketched below):
    • org.apache.spark.mllib.linalg.Vector.asML
    • org.apache.spark.mllib.linalg.Vectors.fromML
• Migration guide: http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
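A small conversion sketch. The single-vector helpers are the ones named above; MLUtils.convertVectorColumnsToML, for converting whole DataFrame columns, is an additional Spark 2.x utility (the DataFrame df and its "features" column are assumptions):

  import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
  import org.apache.spark.mllib.util.MLUtils

  // Convert a single vector in each direction.
  val oldVec = OldVectors.dense(1.0, 2.0, 3.0)
  val newVec = oldVec.asML                   // spark.mllib -> spark.ml
  val backAgain = OldVectors.fromML(newVec)  // spark.ml -> spark.mllib

  // Convert a whole DataFrame column of old-style vectors.
  val converted = MLUtils.convertVectorColumnsToML(df, "features")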
Future benefits of migration
Currently: ML training is implemented on RDDs.
Goal: Port the implementations to DataFrames, to benefit from DataFrame optimizations (Catalyst, Tungsten).
[Diagram: the Spark stack, with MLlib running on Core (RDDs) alongside Spark SQL (DataFrames, Datasets, SQL).]
Future benefits of migration
Status: The first published implementation is in GraphFrames (a Spark package for graph processing).
Ongoing work: DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
[Diagram: the Spark stack, with MLlib running on Spark SQL (DataFrames, Datasets, SQL), which runs on Core (RDDs).]
ML persistence
Why ML persistence?
Data Science: prototype in Python/R; create a model.
Software Engineering: re-implement the model for production (Java); deploy it.
Why ML persistence?
Data Science: prototype in Python/R; create a Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make predictions
Software Engineering: re-implement the Pipeline for production (Java); deploy it.
Costs of re-implementation:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: prototype in Python/R; create a Pipeline; persist the model or Pipeline:
model.save("s3n://...")
Software Engineering: load the Pipeline in Scala/Java; deploy it in production:
Model.load("s3n://…")
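A persistence sketch in Scala, assuming the pipeline and fitted model from the earlier examples and a hypothetical S3 path; both the unfitted "recipe" and the fitted "result" can be saved:

  import org.apache.spark.ml.{Pipeline, PipelineModel}

  // Save the fitted model; overwrite() replaces any existing directory.
  model.write.overwrite().save("s3n://bucket/models/lr-pipeline")

  // Later, possibly from another language or Spark deployment:
  val sameModel = PipelineModel.load("s3n://bucket/models/lr-pipeline")
  val predictions = sameModel.transform(test)

  // Unfitted Pipelines persist the same way.
  pipeline.write.overwrite().save("s3n://bucket/recipes/lr-pipeline")
  val sameRecipe = Pipeline.load("s3n://bucket/recipes/lr-pipeline")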
ML persistence status
[Diagram: an example Pipeline of text preprocessing, feature generation, and a random forest, with model tuning. Persistence covers both the unfitted Pipeline (the “recipe”) and the fitted Model (the “result”).]
ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
ML persistence: pending issues
Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala
• Issue: the R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: load the underlying PipelineModel from a subfolder of the saved model directory (sketched below).
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post: http://databricks.com/blog/2016/05/31
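A sketch of that workaround, with a loud caveat: the saved-directory layout is an implementation detail, and the "pipeline" subfolder name used here is an assumption that may differ across Spark versions; inspect the saved directory to confirm.

  import org.apache.spark.ml.PipelineModel

  // A model saved from SparkR, e.g. write.ml(model, "s3n://bucket/r-model").
  // Assumption: the wrapped PipelineModel lives under a "pipeline" subfolder.
  val inner = PipelineModel.load("s3n://bucket/r-model/pipeline")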
Roadmap during 2.x
Goals for MLlib in 2.x
Major initiatives
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for the DataFrame-based API. Missing items:
  • Frequent Pattern Mining
  • Certain methods in models
  • Developer APIs
For an overview of MLlib in 2.0, see
http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements
Coming in 2.1
Multiclass logistic regression (SPARK-7159)
Locality sensitive hashing (SPARK-5992)
More ML in SparkR (SPARK-16442)
• ALS
• Isotonic Regression
• Multilayer Perceptron Classifier
• Random Forest
• Gaussian Mixture Model
• LDA
• Multiclass Logistic Regression
• Gradient Boosted Trees
Various speed & scalability improvements
• Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others
Spark 2.1 status: release candidates are under QA.
For the release schedule, see http://spark.apache.org/versioning-policy.html
Get started
Get involved in the community
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html
Get involved in development
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition: http://databricks.com/try
Many thanks to the Apache Spark community!
https://spark-summit.org/east-2017/
Thank you!
Twitter: @jkbatcmu
Editor's Notes
  • Abstract: In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
    Migrating ML workloads to Spark DataFrames and Datasets gives users simpler APIs plus speed and scalability improvements. As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of a Spark data processing workload. We will give a tutorial covering best practices and some of the immediate and future benefits to expect.
    ML persistence is one of the biggest improvements in the DataFrame-based API. With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.
    At the end of the webinar, we will discuss major roadmap items: API coverage, major speed and scalability improvements to certain algorithms, and integration with Structured Streaming.
  • “A bit of MLlib history”: Original Pipeline JIRA: http://issues.apache.org/jira/browse/SPARK-3530
  • “Demo recap: migration process”: “Certain Param and model methods” is an algorithm-specific issue. All other issues are general across MLlib.
  • “With ML persistence...”: Note this is loading into Spark.
  • “ML persistence status”: Saving & loading covers ML types: Models, both unfitted (“recipe”) & fitted, and complex Pipelines, both unfitted (“workflow”) & fitted.
  • “Demo: ML persistence”: (Demo)