Apache® Spark™ MLlib 2.x: Migrating ML Workloads to DataFrames

Webinar Logistics
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and
Apache Spark Committer & PMC member
working on Machine Learning at
Databricks. Previously, he was a postdoc
at UC Berkeley after receiving his Ph.D. in
Machine Learning from Carnegie Mellon in
2013.
About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark
Community Evangelist with Databricks. He
is a hands-on developer with over 15 years
of experience and has worked at leading
companies building large-scale distributed
systems.
Databricks
Founded by the creators of Apache Spark in 2013.
Share of Spark code contributed by Databricks in 2014: 75%
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
[Diagram: Spark Core, with the Spark Streaming, Spark SQL, MLlib, and GraphX libraries on top]
Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Outline
Intro to MLlib in 2.x
Migrating an ML workload to DataFrames
ML persistence
Roadmap ahead during 2.x
Intro to MLlib in 2.x
A bit of MLlib history
Spark 0.8: RDD-based API
• Fast, scale-out ML
Challenges
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs
Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)
Major improvements
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
MLlib trajectory
[Chart: commits per release for v0.8 through v2.0 (y-axis 0–1000), annotated with the introduction of the Scala/Java API (the primary API for MLlib), the Python API, the R API, and the DataFrame-based API for MLlib.]
DataFrame-based API for MLlib
DataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows and automating hyperparameter tuning.
Learn more about ML Pipelines:
http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
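To make this concrete, here is a minimal Pipeline sketch in Scala. It assumes DataFrames named training and test with "text" and "label" columns; the column names and parameter values are illustrative, not taken from the webinar demo.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // Two Transformers and one Estimator, chained into a single workflow.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Fitting the Pipeline returns a PipelineModel, itself a Transformer.
  val model = pipeline.fit(training)
  val predictions = model.transform(test)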
DataFrame-based API for MLlib
In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted on by the community
• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
Migrating an ML workload to DataFrames
Why migrate to DataFrames? (1 of 3: DataFrames)
DataFrames & Datasets are the new “core” API for Spark.
• Data sources & ETL (see the sketch below)
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming
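As a small illustration of that integration, the same DataFrame API handles both ETL and ML input. A sketch, assuming a hypothetical Parquet dataset with "label" and "text" columns:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("etl-to-ml").getOrCreate()

  // Read a data source and do light ETL; the path and columns are illustrative.
  val df = spark.read.parquet("s3n://bucket/events.parquet")
  val training = df.filter(col("label").isNotNull).select("label", "text")
  // `training` can feed directly into an ML Pipeline (see the earlier sketch).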
Why migrate to DataFrames? (2 of 3: Language APIs)
Standardized across Scala, Java, Python, and R
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)
Why migrate to DataFrames? (3 of 3: Pipelines)
Specify complex ML workflows
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning (see the sketch below)
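A minimal tuning sketch, reusing the pipeline, hashingTF, and lr values from the earlier example; the grid values and number of folds are illustrative:

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Grid over hyperparameters of the Pipeline's stages.
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 10000))
    .addGrid(lr.regParam, Array(0.01, 0.1))
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // Fits once per fold and parameter combination; returns the best model found.
  val cvModel = cv.fit(training)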
Demo migration
Convert a notebook from the RDD-based API to the DataFrame-based API.
Key points
• Works with single models or complex Pipelines
• Migration can be incremental
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)
Warning: this demo is for experts!
Demo recap: migration process
Separate the two migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames
Migrate the ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
  • Certain Param and model methods
  • run() → fit() (see the sketch below)
• Tips:
  • Use explainParams()
  • Compare the API docs if you hit issues!
Migrate the data APIs: RDDs → DataFrames
• Tip: Get familiar with the conversion syntax in both directions.
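A before-and-after sketch of the run() → fit() change, assuming an existing RDD[LabeledPoint] named labeledPoints and an equivalent DataFrame trainingDF with "label" and "features" columns:

  // Old RDD-based API (spark.mllib): training happens via run().
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  val oldModel = new LogisticRegressionWithLBFGS().run(labeledPoints)

  // New DataFrame-based API (spark.ml): training happens via fit().
  import org.apache.spark.ml.classification.LogisticRegression
  val lr = new LogisticRegression().setMaxIter(100)
  println(lr.explainParams())  // lists every Param with docs & current values
  val newModel = lr.fit(trainingDF)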
Demo recap: migration process
Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
  • Tip: Check intermediate results.
• Gotcha: There are separate Vector (and Matrix) types in spark.mllib and spark.ml.
  • Relevant for the Spark 1.6 → 2.0 migration
  • Tip: Watch for buried errors: MatchError and mentions of “vector”
  • Tip: Use the helper methods for conversion (sketched below):
    • org.apache.spark.mllib.linalg.Vector.asML
    • org.apache.spark.mllib.linalg.Vectors.fromML
• Migration guide: http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
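A small conversion sketch. The single-vector helpers are the ones named above; MLUtils.convertVectorColumnsToML, for converting whole DataFrame columns, is an additional Spark 2.x utility (the DataFrame df and its "features" column are assumptions):

  import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
  import org.apache.spark.mllib.util.MLUtils

  // Convert a single vector in each direction.
  val oldVec = OldVectors.dense(1.0, 2.0, 3.0)
  val newVec = oldVec.asML                   // spark.mllib -> spark.ml
  val backAgain = OldVectors.fromML(newVec)  // spark.ml -> spark.mllib

  // Convert a whole DataFrame column of old-style vectors.
  val converted = MLUtils.convertVectorColumnsToML(df, "features")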
Future benefits of migration
Currently: ML training is implemented on RDDs.
Goal: Port the implementations to DataFrames, to benefit from DataFrame optimizations (Catalyst, Tungsten).
[Diagram: the Spark stack, with MLlib running on Core (RDDs) alongside Spark SQL (DataFrames, Datasets, SQL).]
Future benefits of migration
Status: The first published implementation is in GraphFrames (a Spark package for graph processing).
Ongoing work: DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
[Diagram: the Spark stack, with MLlib running on Spark SQL (DataFrames, Datasets, SQL), which runs on Core (RDDs).]
ML persistence
Why ML persistence?
Data Science: prototype in Python/R; create a model.
Software Engineering: re-implement the model for production (Java); deploy it.
Why ML persistence?
Data Science: prototype in Python/R; create a Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make predictions
Software Engineering: re-implement the Pipeline for production (Java); deploy it.
Costs of re-implementation:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: prototype in Python/R; create a Pipeline; persist the model or Pipeline:
model.save("s3n://...")
Software Engineering: load the Pipeline in Scala/Java; deploy it in production:
Model.load("s3n://…")
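A persistence sketch in Scala, assuming the pipeline and fitted model from the earlier examples and a hypothetical S3 path; both the unfitted "recipe" and the fitted "result" can be saved:

  import org.apache.spark.ml.{Pipeline, PipelineModel}

  // Save the fitted model; overwrite() replaces any existing directory.
  model.write.overwrite().save("s3n://bucket/models/lr-pipeline")

  // Later, possibly from another language or Spark deployment:
  val sameModel = PipelineModel.load("s3n://bucket/models/lr-pipeline")
  val predictions = sameModel.transform(test)

  // Unfitted Pipelines persist the same way.
  pipeline.write.overwrite().save("s3n://bucket/recipes/lr-pipeline")
  val sameRecipe = Pipeline.load("s3n://bucket/recipes/lr-pipeline")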
ML persistence status
[Diagram: an example Pipeline of text preprocessing, feature generation, and a random forest, with model tuning. Persistence covers both the unfitted Pipeline (the “recipe”) and the fitted Model (the “result”).]
ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
ML persistence: pending issues
Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala
• Issue: the R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: load the underlying PipelineModel from a subfolder of the saved model directory (sketched below).
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post: http://databricks.com/blog/2016/05/31
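A sketch of that workaround, with a loud caveat: the saved-directory layout is an implementation detail, and the "pipeline" subfolder name used here is an assumption that may differ across Spark versions; inspect the saved directory to confirm.

  import org.apache.spark.ml.PipelineModel

  // A model saved from SparkR, e.g. write.ml(model, "s3n://bucket/r-model").
  // Assumption: the wrapped PipelineModel lives under a "pipeline" subfolder.
  val inner = PipelineModel.load("s3n://bucket/r-model/pipeline")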
Roadmap during 2.x
Goals for MLlib in 2.x
Major initiatives
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for the DataFrame-based API. Missing items:
  • Frequent Pattern Mining
  • Certain methods in models
  • Developer APIs
For an overview of MLlib in 2.0, see
http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements
Coming in 2.1
Multiclass logistic regression (SPARK-7159)
Locality sensitive hashing (SPARK-5992)
More ML in SparkR (SPARK-16442)
• ALS
• Isotonic Regression
• Multilayer Perceptron Classifier
• Random Forest
• Gaussian Mixture Model
• LDA
• Multiclass Logistic Regression
• Gradient Boosted Trees
Various speed & scalability improvements
• Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others
Spark 2.1 status: release candidates are under QA.
For the release schedule, see http://spark.apache.org/versioning-policy.html
Get started
Get involved in the community
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html
Get involved in development
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition: http://databricks.com/try
Many thanks to the Apache Spark community!
https://spark-summit.org/east-2017/
Thank you!
Twitter: @jkbatcmu
Editor's Notes
  • Abstract: In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
    Migrating ML workloads to Spark DataFrames and Datasets gives users simpler APIs plus speed and scalability improvements. As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of a Spark data processing workload. We will give a tutorial covering best practices and some of the immediate and future benefits to expect.
    ML persistence is one of the biggest improvements in the DataFrame-based API. With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.
    At the end of the webinar, we will discuss major roadmap items: API coverage, major speed and scalability improvements to certain algorithms, and integration with Structured Streaming.
  • “A bit of MLlib history”: Original Pipeline JIRA: http://issues.apache.org/jira/browse/SPARK-3530
  • “Demo recap: migration process”: “Certain Param and model methods” is an algorithm-specific issue. All other issues are general across MLlib.
  • “With ML persistence...”: Note this is loading into Spark.
  • “ML persistence status”: Saving & loading covers ML types: Models, both unfitted (“recipe”) & fitted, and complex Pipelines, both unfitted (“workflow”) & fitted.
  • “Demo: ML persistence”: (Demo)