Building, Debugging, and Tuning Spark Machine Learning Pipelines (Joseph Bradley, Databricks)

Presentation at Spark Summit 2015


  1. Building, Debugging, & Tuning Spark ML Pipelines. Joseph K. Bradley, June 2015, Spark Summit
  2. Who am I? • Joseph K. Bradley • Ph.D. in Machine Learning from CMU, postdoc at Berkeley • Apache Spark committer • Software Engineer @ Databricks Inc.
  3. About Spark MLlib • Started at Berkeley (Spark 0.8) • Now (Spark 1.4): contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms • Project components: Spark, Spark SQL, Streaming, MLlib, GraphX
  4. Algorithm Coverage (list based on Spark 1.4)
     • Classification: logistic regression, naive Bayes, streaming logistic regression, linear SVMs, decision trees, random forests, gradient-boosted trees
     • Regression: ordinary least squares, ridge regression, lasso, isotonic regression, decision trees, random forests, gradient-boosted trees, streaming linear methods
     • Clustering: Gaussian mixture models, K-Means, streaming K-Means, latent Dirichlet allocation, power iteration clustering
     • Frequent itemsets: FP-growth
     • Recommendation: alternating least squares
     • Statistics: Pearson correlation, Spearman correlation, online summarization, chi-squared test, kernel density estimation
     • Linear algebra: local dense & sparse vectors & matrices; distributed matrices (block-partitioned, row, indexed row, coordinate); matrix decompositions
     • Feature extraction & selection: Word2Vec, chi-squared selection, hashing term frequency, inverse document frequency, Normalizer, StandardScaler, Tokenizer, OneHotEncoder, StringIndexer, VectorIndexer, VectorAssembler, Binarizer, Bucketizer, ElementwiseProduct, PolynomialExpansion
     • Model import/export
     • Pipelines
  5. ML Workflows are complex. Diagram: Load data → Extract features → Train model → Evaluate
  6. ML Workflows are complex. Diagram: multiple data sources (Datasource 1, Datasource 2, Datasource 2) → Extract features → Train model → Evaluate
  7. ML Workflows are complex. Diagram: Datasource 1, Datasource 2, Datasource 2 → two Extract features steps → Feature transform 1, Feature transform 2, Feature transform 3 → Train model → Evaluate
  8. ML Workflows are complex. Diagram: Datasource 1, Datasource 2, Datasource 2 → two Extract features steps → Feature transforms 1, 2, 3 → Train model 1, Train model 2 → Ensemble → Evaluate
  9. ML Workflows are complex: → Specify pipeline → Inspect & debug → Re-run on new data → Tune parameters
  10. Example: Text Classification • Goal: given a text document, predict its topic. • Example document: "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something..." • Label: 1 = about science, 0 = not about science • Dataset: "20 Newsgroups" from the UCI KDD Archive
  11. DataFrames • Data grouped into named columns (example table with columns dept, age, name: Bio/48/H Smith, CS/54/A Turing, Bio/43/B Jones, Chem/61/M Kennedy) • DSL for common tasks: project, filter, aggregate, join, … • Metadata • UDFs
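
A small sketch of the DataFrame DSL mentioned on this slide (not from the talk). It assumes a spark-shell session, where `spark` and the implicits are already in scope, and mirrors the slide's dept/age/name table; the values are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell this builder line can be skipped: `spark` is predefined.
val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Data grouped into named columns, mirroring the slide's (dept, age, name) table.
val people = Seq(
  ("Bio", 48, "H Smith"),
  ("CS", 54, "A Turing"),
  ("Bio", 43, "B Jones"),
  ("Chem", 61, "M Kennedy")
).toDF("dept", "age", "name")

// DSL for common tasks: project, filter, aggregate, ...
people.select("name", "dept").show()
people.filter($"age" > 50).show()
people.groupBy("dept").agg(avg("age").alias("avgAge")).show()
```
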
  12. ML Workflow. Diagram: Load data → Extract features → Train model → Evaluate
  13. Load Data • Data sources for DataFrames: built-in and external ({ JSON }, JDBC, and more …)
  14. Load Data • Current data schema: label: Int, text: String
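
A hedged sketch of this step, continuing the spark-shell snippets above. The JSON path and file layout are assumptions, chosen only to produce the schema shown on the slide (label: Int, text: String).

```scala
// Assumed input: a JSON-lines file where each record looks like
//   {"label": 1, "text": "Subject: Re: Lexan Polish? ..."}
// The path is hypothetical.
val training = spark.read.json("data/20news_train.json")
  .selectExpr("cast(label as int) as label", "text")

training.printSchema()   // label: integer, text: string
```
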
  15. Extract Features • Current data schema: label: Int, text: String
  16. Extract Features • Transformers: Tokenizer, Hashed Term Freq. • A Transformer maps a DataFrame to a DataFrame • Current data schema: label: Int, text: String, words: Seq[String], features: Vector
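
A minimal sketch of the two Transformers named on the slide, continuing from the `training` DataFrame above; column names follow the slide's schema, and the numFeatures value is an arbitrary placeholder.

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Tokenizer: text -> words (Seq[String]); HashingTF: words -> features (Vector).
// Each Transformer maps a DataFrame to a DataFrame, appending a new column.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1000)   // example value; tuned later with CrossValidator

val tokenized = tokenizer.transform(training)
val featurized = hashingTF.transform(tokenized)
featurized.printSchema()  // label, text, words, features
```
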
  17. Train a Model • Estimator: Logistic Regression • An Estimator maps a DataFrame to a Model • Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
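
A sketch of the Estimator step, continuing from `featurized` above; the regParam value is a placeholder (one of the grid values tuned later).

```scala
import org.apache.spark.ml.classification.LogisticRegression

// An Estimator: fit() consumes a DataFrame and returns a Model,
// which is itself a Transformer.
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setRegParam(0.01)

val lrModel = lr.fit(featurized)

// The fitted model appends prediction columns when applied to a DataFrame.
val predictions = lrModel.transform(featurized)
predictions.select("label", "prediction").show(5)
```
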
  18. Evaluate the Model • An Evaluator maps a DataFrame of predictions to a metric • Current data schema: label: Int, text: String, words: Seq[String], features: Vector, prediction: Int
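
A sketch of the Evaluator step on the `predictions` DataFrame above. `BinaryClassificationEvaluator` is one reasonable choice for the two-class labels in this example (the talk does not name a specific evaluator); it scores area under the ROC curve by default.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// An Evaluator maps a DataFrame of predictions to a single metric.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")  // column produced by logistic regression

val auc = evaluator.evaluate(predictions)
println(s"Area under ROC = $auc")
```
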
  19. Data Flow • By default, always append new columns → can go back & inspect intermediate results → made efficient by DataFrames
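
Because every stage appends columns instead of replacing the DataFrame, intermediate results stay available; a one-line illustration on the `predictions` DataFrame above:

```scala
// All intermediate columns remain available for inspection and debugging.
predictions.select("text", "words", "features", "prediction").show(3)
```
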
  20. ML Pipelines • Pipeline: Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate • Re-run exactly the same way on test data
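
A sketch of chaining the same stages into a Pipeline; `test` is an assumed DataFrame loaded from a hypothetical path with the same (label, text) schema as `training`.

```scala
import org.apache.spark.ml.Pipeline

// A Pipeline chains the stages so the whole workflow is re-run
// exactly the same way on new data.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

val pipelineModel = pipeline.fit(training)

// `test` is assumed to be loaded like `training` (label: Int, text: String).
val test = spark.read.json("data/20news_test.json")
  .selectExpr("cast(label as int) as label", "text")

val testMetric = evaluator.evaluate(pipelineModel.transform(test))
println(s"Test AUC = $testMetric")
```
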
  21. Parameter Tuning • CrossValidator: given an Estimator, a parameter grid, and an Evaluator, find the best parameters • Example grid: lr.regParam ∈ {0.01, 0.1, 0.5}, hashingTF.numFeatures ∈ {100, 1000, 10000}
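
A sketch of CrossValidator over the Pipeline above, using the parameter grid shown on the slide; the number of folds is an assumption.

```scala
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Given an Estimator (here the whole Pipeline), a parameter grid, and an
// Evaluator, CrossValidator searches for the best parameter combination.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)   // assumption; not specified on the slide

val cvModel = cv.fit(training)
println(evaluator.evaluate(cvModel.transform(test)))
```
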
  22. Demo: ML Pipelines in Databricks Cloud
  23. Recap • ML Pipelines provide: integration with DataFrames, familiar API based on scikit-learn, easy workflow inspection, simple parameter tuning • Complex Pipelines: compositions & DAGs • Schema validation • User-defined Pipeline components: Estimators, Transformers, & Evaluators • Support for complex types (via DataFrames & UDTs)
  24. Looking Ahead • In Spark 1.4: Pipelines graduating from alpha, many feature transformers, more complete Python API, SparkR, PMML model export, more algorithms (OneVsRest, OnlineLDA, …) • Coming next: MLlib API for SparkR, improved model inspection, Pipeline optimizations
  25. Databricks, Inc.: founded by the creators of Spark & remains the largest contributor
  26. Thank you! • Spark documentation: spark.apache.org • Pipelines blog post: databricks.com/blog/2015/01/07 • DataFrames blog post: databricks.com/blog/2015/02/17 • Databricks: databricks.com/product • Spark ML edX MOOC: ML with Spark • Spark Packages: spark-packages.org • Careers: databricks.com/company/careers
