Practical Machine Learning
Pipelines with MLlib
Joseph K. Bradley
March 18, 2015
Spark Summit East 2015
About Spark MLlib
Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
Currently (Spark 1.3)
• Contributions from 50+ orgs, 100+ individuals
• Good coverage of algorithms
classification
regression
clustering
recommendation
feature extraction, selection
frequent itemsets
statistics
linear algebra
MLlib’s Mission
How can we move beyond this list of algorithms and help users develop real ML workflows?
MLlib’s mission is to make practical machine learning easy and scalable.
• Capable of learning from large-scale datasets
• Easy to build machine learning applications
Outline
ML workflows
Pipelines
Roadmap
Example: Text Classification
Goal: Given a text document, predict its topic.
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help somewhat
but nothing will remove deep
scratches without making it
worse than it already is.
McQuires will do something...
Label: 1 = about science; 0 = not about science
Features: text, image, vector, ...; CTR, inches of rainfall, ...
Dataset: “20 Newsgroups”
From UCI KDD Archive
Training & Testing
Training Testing/Production
Given labeled data:
RDD of (features, label)
Subject: Re: Lexan Polish?
Suggest McQuires #1 plastic
polish. It will help...
Subject: RIPEM FAQ
RIPEM is a program which
performs Privacy Enhanced...
...
Label 0
Label 1
Learn a model.
Given new unlabeled data:
RDD of features
Subject: Apollo Training
The Apollo astronauts also
trained at (in) Meteor...
Subject: A demo of Nonsense
How can you lie about
something that no one...
Use model to make predictions.
Label 1
Label 0
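As a minimal sketch (not from the slides), this is what that train-then-predict loop looks like against the Spark 1.3-era spark.mllib RDD API; trainAndPredict and both input RDDs are illustrative names:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Learn a model from labeled (features, label) data, then
// use it to predict labels for new, unlabeled feature vectors.
def trainAndPredict(training: RDD[LabeledPoint], newData: RDD[Vector]): RDD[Double] = {
  val model = new LogisticRegressionWithLBFGS().run(training)
  model.predict(newData)
}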
Example ML Workflow
Training
Load data
labels + plain text
Extract features
labels + feature vectors
Train model
labels + predictions
Evaluate
Create many RDDs
val labels: RDD[Double] = data.map(_.label)
val features: RDD[Vector]
val predictions: RDD[Double]
Explicitly unzip & zip RDDs
labels.zip(predictions).map { case (label, pred) =>
  if (label == pred) ...
}
Pain point
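Spelled out a little further, a sketch of that bookkeeping; data, toFeatureVector, and model are hypothetical stand-ins:

// Every intermediate result is its own RDD; evaluation means
// zipping two of them back together.
val labels: RDD[Double] = data.map(_.label)
val features: RDD[Vector] = data.map(doc => toFeatureVector(doc.text))
val predictions: RDD[Double] = model.predict(features)

// Zip labels with predictions to compute accuracy by hand.
val accuracy = labels.zip(predictions)
  .filter { case (label, pred) => label == pred }
  .count.toDouble / labels.count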
Example ML Workflow
Write as a script
Pain point
• Not modular
• Difficult to re-use workflow
Training
Load data
labels + plain text
Extract features
labels + feature vectors
Train model
labels + predictions
Evaluate
Example ML Workflow
Training
Load data
labels + plain text
Extract features
labels + feature vectors
Train model
labels + predictions
Evaluate
Testing/Production
Load new data
plain text
Extract features
feature vectors
Predict using model
predictions
Act on predictions
Almost identical workflow
Example ML Workflow
Training
Load data
labels + plain text
Extract features
labels + feature vectors
Train model
labels + predictions
Evaluate
Pain point
Parameter tuning
• Key part of ML
• Involves training many models
• For different splits of the data
• For different sets of parameters
Pain Points
• Create & handle many RDDs and data types
• Write as a script
• Tune parameters
Enter... Pipelines! (in Spark 1.2 & 1.3)
Outline
ML workflows
Pipelines
Roadmap
Key Concepts
DataFrame: The ML Dataset
Abstractions: Transformers, Estimators, &
Evaluators
Parameters: API & tuning
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
label text words features
0 This is ... [“This”, “is”, …] [0.5, 1.2, …]
0 When we ... [“When”, ...] [1.9, -0.8, …]
1 Knuth was ... [“Knuth”, …] [0.0, 8.7, …]
0 Or you ... [“Or”, “you”, …] [0.1, -0.6, …]
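For concreteness, a small sketch of building a DataFrame with the label/text part of this schema (assumes a spark-shell style environment with sc and sqlContext; the words and features columns are appended later by feature transformers):

import sqlContext.implicits._

// Each row is a labeled document; sample rows are illustrative.
case class LabeledDocument(label: Double, text: String)

val data = sc.parallelize(Seq(
  LabeledDocument(1.0, "Knuth was ..."),
  LabeledDocument(0.0, "Or you ...")
)).toDF()

data.printSchema()  // label: double, text: string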
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
Named columns with types
Domain-Specific Language
# Select science articles
sciDocs = data.filter(data["label"] == 1)

# Scale labels
data.select(data["label"] * 0.5)
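The same two operations in the Scala flavor of the DSL, as a sketch (data is a DataFrame):

// Select science articles
val sciDocs = data.filter(data("label") === 1)
// Scale labels
val scaled = data.select(data("label") * 0.5)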
DataFrame: The ML Dataset
DataFrame: RDD + schema + DSL
• Shipped with Spark 1.3
• APIs for Python, Java & Scala (+R in dev)
• Integration with Spark SQL
• Data import/export
• Internal optimizations
Named columns with types
Domain-Specific Language
Pain point: Create & handle
many RDDs and data types
BIG data
Abstractions
Training
Load data
Extract features
Train model
Evaluate
Abstraction: Transformer
Training
Extract features
Train model
Evaluate
def transform(DataFrame): DataFrame
label: Double
text: String
label: Double
text: String
features: Vector
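A concrete sketch using HashingTF, one of the built-in Transformers (column names and wordsDF are illustrative):

import org.apache.spark.ml.feature.HashingTF

// Maps a column of word sequences to a column of feature vectors.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")

// Returns a new DataFrame with the "features" column appended.
val featurized = hashingTF.transform(wordsDF)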
Abstraction: Estimator
Training
Extract features
Train model
Evaluate
label: Double
text: String
features: Vector
LogisticRegression
Model
def fit(DataFrame): Model
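A sketch of this pattern against the spark.ml API (training is a hypothetical DataFrame with label and features columns):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// fit() consumes a DataFrame and produces a Model (itself a Transformer).
val model = lr.fit(training)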
Abstraction: Evaluator
Training
Extract features
Train model
Evaluate
label: Double
text: String
features: Vector
prediction: Double
Metric:
accuracy
AUC
MSE
...
def evaluate(DataFrame): Double
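A sketch matching the signature above; BinaryClassificationEvaluator reports AUC, and predictions is a hypothetical DataFrame of labels and predictions:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Reduces a DataFrame of labels + predictions to a single metric.
val evaluator = new BinaryClassificationEvaluator()
val auc: Double = evaluator.evaluate(predictions)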
Abstraction: Model
Model is a type of Transformer
def transform(DataFrame): DataFrame
text: String
features: Vector
Testing/Production
Extract features
Predict using model
Act on predictions
text: String
features: Vector
prediction: Double
(Recall) Abstraction: Estimator
Training
Load data
Extract features
Train model
Evaluate
label: Double
text: String
features: Vector
LogisticRegression
Model
def fit(DataFrame): Model
Abstraction: Pipeline
Training
Load data
Extract features
Train model
Evaluate
label: Double
text: String
PipelineModel
Pipeline is a type of Estimator
def fit(DataFrame): Model
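A sketch of assembling and fitting a Pipeline from the stages used in the demo (stage variables as built earlier; training is hypothetical):

import org.apache.spark.ml.Pipeline

// A Pipeline chains Transformers and Estimators into a single Estimator.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order and returns a PipelineModel.
val pipelineModel = pipeline.fit(training)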
Abstraction: PipelineModel
text: String
PipelineModel is a type of Transformer
def transform(DataFrame): DataFrame
Testing/Production
Load data
Extract features
Predict using model
text: String
features: Vector
prediction: Double
Act on predictions
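A sketch of applying the fitted PipelineModel to new data (testData is hypothetical and needs only a text column):

// The fitted PipelineModel replays the whole workflow on new data,
// appending words, features, and prediction columns along the way.
val predictions = pipelineModel.transform(testData)
predictions.select("text", "prediction").show()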
Abstractions: Summary
Training
Load data
Extract features
Train model
Evaluate
Testing
Load data
Extract features
Predict using model
Evaluate
Legend: Transformer, DataFrame, Estimator, Evaluator
Demo
Legend: Transformer, DataFrame, Estimator, Evaluator
Current data schema
label: Double
text: String
words: Seq[String]
features: Vector
prediction: Double
Training
Load data
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
Pain point: Write as a script
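Pulling the demo together, a condensed sketch of the whole pipeline against the Spark 1.3-era spark.ml API; variable and column names are illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer()    // Transformer: text -> words
  .setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()    // Transformer: words -> features
  .setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression()  // Estimator: (label, features) -> model
  .setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// One fit() call replaces the hand-written script.
val model = pipeline.fit(training)
val auc = new BinaryClassificationEvaluator().evaluate(model.transform(test))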
Parameters
Standard API
• Typed
• Defaults
• Built-in doc
• Autocomplete

> hashingTF.numFeatures
org.apache.spark.ml.param.IntParam = numFeatures: number of features (default: 262144)

> hashingTF.setNumFeatures(1000)
> hashingTF.getNumFeatures
Parameter Tuning
Given:
• Estimator
• Parameter grid
• Evaluator
Find best parameters
lr.regParam: {0.01, 0.1, 0.5}
hashingTF.numFeatures: {100, 1000, 10000}
Tokenizer
HashingTF
LogisticRegression
BinaryClassificationEvaluator
CrossValidator
Pain point: Tune parameters
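A sketch of this tuning setup (pipeline, hashingTF, and lr as built above; the fold count is illustrative):

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// The grid from this slide: 3 x 3 = 9 parameter settings.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Trains one model per fold per grid point; cvModel.bestModel
// holds the best parameter setting refit on the full training data.
val cvModel = cv.fit(training)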
Pipelines: Recap
Inspirations: scikit-learn (+ Spark DataFrame, Param API), MLBase (Berkeley AMPLab); ongoing collaborations.
Pain point → Solution
• Create & handle many RDDs and data types → DataFrame
• Write as a script → Abstractions
• Tune parameters → Parameter API
Also
• Python, Scala, Java APIs
• Schema validation
• User-Defined Types*
• Feature metadata*
• Multi-model training optimizations*
* Groundwork done; full support WIP.
Outline
ML workflows
Pipelines
Roadmap
Roadmap
spark.mllib: Primary ML package
spark.ml: High-level Pipelines API for algorithms in spark.mllib (experimental in Spark 1.2-1.3)
Near future
• Feature attributes
• Feature transformers
• More algorithms under Pipeline API
Farther ahead
• Ideas from AMPLab MLBase (auto-tuning models)
• SparkR integration
Thank you!
Outline
• ML workflows
• Pipelines
• DataFrame
• Abstractions
• Parameter tuning
• Roadmap
Spark documentation
http://spark.apache.org/
Pipelines blog post
https://databricks.com/blog/2015/01/07


Editor's Notes

  • #3 Contributions estimated from GitHub commit logs, with some effort to de-duplicate entities.
  • #7 Dataset source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html *Data from UCI KDD Archive, originally donated to archive by Tom Mitchell (CMU).
  • #9 Handling multiple RDDs and data types: split loaded data into featuresRDD and labelsRDD; transform featuresRDD → words RDD → feature-vector RDD; zip labels with feature vectors to create the final RDD.
  • #10 It is possible for programmers to abstract workflows by putting their workflows into methods or callable scripts. However, that makes it hard to do exploratory work or do rapid iterative tweaking-and-testing of workflows.
  • #11 Feature extraction in particular can be long and complicated, and using the same workflow is vital.
  • #12 Parameter tuning is doable in essentially any ML library, but it is often done by hand and involves a lot of repetitive but difficult-to-abstract code.
  • #13 This API shipped with Spark 1.2 and 1.3 but is still experimental.
  • #16 SchemaRDD + DSL (SchemaRDD is now called DataFrame, mentioned in Michael’s talk earlier in the day). Introduced in Spark 1.3; integrates with pandas DataFrames; the Catalyst optimizer handles column materialization. Other: built-in data sources & 3rd-party extensions, optimizations & codegen, APIs for Python, Java & Scala (+R in dev).
  • #20 Columns which are not needed do not need to be materialized, so there is almost no penalty for keeping the columns around for later use. Default transformer behavior is to append columns.
  • #33 Easy for users to create their own Transformers and Estimators to plug into Pipelines.