Building, Debugging, & Tuning
Spark ML Pipelines
Joseph K. Bradley
June 2015
Spark Summit
Who am I?
• Joseph K. Bradley
•  Ph.D. in Machine Learning from CMU, postdoc at Berkeley
•  Apache Spark committer
•  Software Engineer @ Databricks Inc.
About Spark MLlib
•  Started at Berkeley (Spark 0.8)
•  Now (Spark 1.4)
•  Contributions from 50+ orgs, 100+ individuals
•  Growing coverage of distributed algorithms
[Diagram: the Spark stack, Spark core with Spark SQL, Streaming, MLlib, and GraphX on top]
Algorithm Coverage
Classification
•  Logistic regression
•  Naive Bayes
•  Streaming logistic regression
•  Linear SVMs
•  Decision trees
•  Random forests
•  Gradient-boosted trees
Regression
•  Ordinary least squares
•  Ridge regression
•  Lasso
•  Isotonic regression
•  Decision trees
•  Random forests
•  Gradient-boosted trees
•  Streaming linear methods
Frequent itemsets
•  FP-growth
Clustering
•  Gaussian mixture models
•  K-Means
•  Streaming K-Means
•  Latent Dirichlet Allocation
•  Power Iteration Clustering
Statistics
•  Pearson correlation
•  Spearman correlation
•  Online summarization
•  Chi-squared test
•  Kernel density estimation
Linear algebra
•  Local dense & sparse vectors & matrices
•  Distributed matrices
•  Block-partitioned matrix
•  Row matrix
•  Indexed row matrix
•  Coordinate matrix
•  Matrix decompositions
Recommendation
•  Alternating Least Squares
Feature extraction & selection
•  Word2Vec
•  Chi-Squared selection
•  Hashing term frequency
•  Inverse document frequency
•  Normalizer
•  Standard scaler
•  Tokenizer
•  One-Hot Encoder
•  StringIndexer
•  VectorIndexer
•  VectorAssembler
•  Binarizer
•  Bucketizer
•  ElementwiseProduct
•  PolynomialExpansion
Model import/export
Pipelines
List based on Spark 1.4
ML Workflows are complex
[Diagram: a simple workflow, Load data → Extract features → Train model → Evaluate]
ML Workflows are complex
[Diagram: the same workflow with several datasources (Datasource 1, Datasource 2, …) feeding Extract features → Train model → Evaluate]
ML Workflows are complex
[Diagram: the workflow grows, several datasources, two Extract features steps, and Feature transforms 1, 2, and 3 feeding Train model → Evaluate]
ML Workflows are complex
[Diagram: the full workflow, several datasources, feature extraction and transforms, Train model 1 and Train model 2 combined in an Ensemble, then Evaluate]
ML Workflows are complex
→ Specify pipeline
→ Inspect & debug
→ Re-run on new data
→ Tune parameters
Example: Text Classification
•  Goal: Given a text document, predict its topic.
Example document:
  Subject: Re: Lexan Polish?
  Suggest McQuires #1 plastic polish. It will help somewhat but nothing
  will remove deep scratches without making it worse than it already is.
  McQuires will do something...
Labels: 1 = about science, 0 = not about science
Dataset: "20 Newsgroups", from the UCI KDD Archive
DataFrames
Data grouped into named columns:

  dept   age   name
  Bio    48    H Smith
  CS     54    A Turing
  Bio    43    B Jones
  Chem   61    M Kennedy

DSL for common tasks:
•  Project, filter, aggregate, join, …
•  Metadata
•  UDFs
ML Workflow
[Diagram: Load data → Extract features → Train model → Evaluate]
Load Data
[Diagram: the workflow, highlighting the Load data step]
Data sources for DataFrames: built-in (e.g. JSON, JDBC), external packages, and more …
Load Data
[Diagram: the workflow, highlighting Load data]
Current data schema: label: Int, text: String
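A minimal sketch of loading such data; this and the later sketches build one running example against the Spark 1.4 spark.ml Scala API, assuming the `sc` and `sqlContext` provided by spark-shell (or created as in the DataFrames sketch above). The toy records and the commented-out JSON path are hypothetical stand-ins for the 20 Newsgroups set:

```scala
import sqlContext.implicits._

// Toy stand-in for the 20 Newsgroups data; labels stored as Double so the
// spark.ml estimators used later can consume them directly.
val training = sc.parallelize(Seq(
  (1.0, "lexan polish will not remove deep scratches"),
  (1.0, "the orbital mechanics of a comet"),
  (1.0, "a question about plastic polymers"),
  (0.0, "looking to buy a used motorcycle"),
  (0.0, "garage sale this saturday"),
  (0.0, "opinions on the new hockey season")
)).toDF("label", "text")

// Any DataFrame data source with the same schema would also work, e.g.:
// val training = sqlContext.read.json("/path/to/newsgroups.json")

training.printSchema()   // label: double, text: string
```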
Extract Features
[Diagram: the workflow, highlighting Extract features]
Current data schema: label: Int, text: String
Extract Features
[Diagram: Extract features expands into two Transformers, Tokenizer and Hashed Term Freq.]
A Transformer maps a DataFrame to a new DataFrame:
•  Tokenizer: text: String → words: Seq[String]
•  Hashed Term Freq.: words: Seq[String] → features: Vector
Current data schema: label: Int, text: String, words: Seq[String], features: Vector
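A sketch of these two feature Transformers, continuing from the `training` DataFrame above; the column names follow the slide, and the numFeatures value is illustrative:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Tokenizer: text -> words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Hashed term frequency: words -> features
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1000)

// Each Transformer maps a DataFrame to a new DataFrame, appending a column.
val withWords    = tokenizer.transform(training)   // adds words: Seq[String]
val withFeatures = hashingTF.transform(withWords)  // adds features: Vector
```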
Train a Model
[Diagram: the Train model step is Logistic Regression]
Logistic Regression is an Estimator: fitting it on a DataFrame produces a Model.
Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
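A sketch of the Estimator step, continuing the example; the regParam and maxIter values are illustrative and get tuned later:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Estimator: fit() on a DataFrame produces a Model, which is itself a Transformer.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val lrModel = lr.fit(withFeatures)                 // uses the "features" and "label" columns
val predictions = lrModel.transform(withFeatures)  // appends "prediction" (plus rawPrediction and probability)
```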
Evaluate the Model
[Diagram: the Evaluate step]
The Evaluator takes the DataFrame of predictions and computes a metric.
Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
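A sketch of the Evaluator step; areaUnderROC is the default metric for BinaryClassificationEvaluator, shown explicitly here:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Evaluator: DataFrame of predictions -> a single metric
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")

val auc = evaluator.evaluate(predictions)
println(s"Area under ROC = $auc")
```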
Data Flow
[Diagram: the full text-classification flow, Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate]
Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
By default, stages always append new columns:
→ Can go back & inspect intermediate results
→ Made efficient by DataFrames
ML Pipelines
[Diagram: the same stages chained into a single Pipeline, Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate]
The Pipeline can then be re-run exactly the same way on test data.
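A sketch of wiring the same stages into a Pipeline and re-running it unchanged on test data, continuing the example; the `test` DataFrame here is a hypothetical held-out set with the same schema:

```scala
import org.apache.spark.ml.Pipeline

// Chain the feature Transformers and the Estimator into one Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline on training data yields a PipelineModel.
val pipelineModel = pipeline.fit(training)

// Re-run exactly the same way on (toy) test data.
val test = sc.parallelize(Seq(
  (1.0, "removing deep scratches from plastic"),
  (0.0, "garage sale this weekend")
)).toDF("label", "text")

val testPredictions = pipelineModel.transform(test)
println("Test AUC = " + evaluator.evaluate(testPredictions))
```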
Parameter Tuning
[Diagram: the Pipeline, with tunable parameters on two of its stages]
Parameter grid:
•  lr.regParam ∈ {0.01, 0.1, 0.5}
•  hashingTF.numFeatures ∈ {100, 1000, 10000}
Given:
•  Estimator
•  Parameter grid
•  Evaluator
CrossValidator finds the best parameters.
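A sketch of the tuning step with the grid from the slide, continuing the example; numFolds is an illustrative choice not given on the slide, and with the toy data above the folds are far too small for meaningful results:

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Parameter grid over two stages of the Pipeline.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
  .build()

// CrossValidator: Estimator + parameter grid + Evaluator -> best parameters.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)
// cvModel.transform(test) applies the best Pipeline found.
```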
Demo: ML Pipelines in Databricks Cloud
Recap
• ML Pipelines provide:
•  Integration with DataFrames
•  Familiar API based on scikit-learn
•  Easy workflow inspection
•  Simple parameter tuning
•  Complex Pipelines: compositions & DAGs
•  Schema validation
•  User-defined Pipeline components: Estimators, Transformers, & Evaluators
•  Support for complex types (via DataFrames & UDTs)
Looking Ahead
In Spark 1.4:
•  Pipelines graduating from alpha!
•  Many feature transformers
•  More complete Python API
•  SparkR
•  PMML model export
Looking ahead:
•  More algorithms (OneVsRest, OnlineLDA, …)
•  MLlib API for SparkR
•  Improved model inspection
•  Pipeline optimizations
Databricks, Inc.: Founded by the creators of Spark & remains largest contributor
Thank you!
•  Spark documentation: spark.apache.org
•  Pipelines blog post: databricks.com/blog/2015/01/07
•  DataFrames blog post: databricks.com/blog/2015/02/17
•  Databricks: databricks.com/product
•  Spark ML edX MOOC: ML with Spark
•  Spark Packages: spark-packages.org
•  databricks.com/company/careers
