
Building, Debugging, and Tuning Spark Machine Learning Pipelines (Joseph Bradley, Databricks)

Presentation at Spark Summit 2015


  1. Building, Debugging, & Tuning Spark ML Pipelines. Joseph K. Bradley, June 2015, Spark Summit
  2. Who am I? Joseph K. Bradley • Ph.D. in Machine Learning from CMU, postdoc at Berkeley • Apache Spark committer • Software Engineer @ Databricks Inc.
  3. About Spark MLlib • Started at Berkeley (Spark 0.8) • Now (Spark 1.4): contributions from 50+ orgs, 100+ individuals • Growing coverage of distributed algorithms • Part of the Spark stack (Spark core, Spark SQL, Streaming, MLlib, GraphX)
  4. Algorithm Coverage (list based on Spark 1.4) • Classification: logistic regression, naive Bayes, streaming logistic regression, linear SVMs, decision trees, random forests, gradient-boosted trees • Regression: ordinary least squares, ridge regression, lasso, isotonic regression, decision trees, random forests, gradient-boosted trees, streaming linear methods • Frequent itemsets: FP-growth • Clustering: Gaussian mixture models, K-Means, streaming K-Means, latent Dirichlet allocation, power iteration clustering • Statistics: Pearson correlation, Spearman correlation, online summarization, chi-squared test, kernel density estimation • Linear algebra: local dense & sparse vectors & matrices; distributed matrices (block-partitioned, row, indexed row, coordinate); matrix decompositions • Recommendation: alternating least squares (ALS) • Feature extraction & selection: Word2Vec, chi-squared selection, hashing term frequency, inverse document frequency, Normalizer, StandardScaler, Tokenizer, One-Hot Encoder, StringIndexer, VectorIndexer, VectorAssembler, Binarizer, Bucketizer, ElementwiseProduct, PolynomialExpansion • Model import/export • Pipelines
  5. ML Workflows are complex: Load data → Extract features → Train model → Evaluate
  6. ML Workflows are complex: multiple data sources (Datasource 1, Datasource 2, …) feed feature extraction → Train model → Evaluate
  7. ML Workflows are complex: multiple data sources, separate feature-extraction steps, and chained feature transforms (Feature transform 1 → 2 → 3) → Train model → Evaluate
  8. ML Workflows are complex: the same features can feed several models (Train model 1, Train model 2) combined in an Ensemble → Evaluate
  9. ML Workflows are complex, so we want to: specify the pipeline, inspect & debug it, re-run it on new data, and tune parameters
  10. Example: Text Classification. Goal: given a text document, predict its topic (label 1: about science, 0: not about science). Example document: "Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is." Dataset: "20 Newsgroups" from the UCI KDD Archive
  11. DataFrames: data grouped into named columns (example table with columns dept | age | name: Bio, 48, H Smith; CS, 54, A Turing; Bio, 43, B Jones; Chem, 61, M Kennedy). A DSL for common tasks: project, filter, aggregate, join, …; metadata; UDFs (see the sketch below)
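     Not on the slide, but a minimal sketch of that DSL, assuming an existing SQLContext (Spark 1.4-era API) and a hypothetical registered table "people" with the columns shown above:

        import org.apache.spark.sql.functions.{avg, col}

        // Assumes the Spark shell's predefined sqlContext and a table
        // with the slide's columns: dept, age, name.
        val people = sqlContext.table("people")   // hypothetical table name

        // Project, filter, aggregate -- the common tasks listed above.
        people.filter(col("age") > 45)
          .select("dept", "age")
          .groupBy("dept")
          .agg(avg("age"))
          .show()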
  12. ML Workflow: Load data → Extract features → Train model → Evaluate
  13. Load Data. Data sources for DataFrames: built-in and external (JSON, JDBC, and more …)
  14. Load Data. Current data schema: label: Int, text: String (see the sketch below)
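     A sketch of loading the labeled text data into a DataFrame with that schema; the file path and JSON format are assumptions, not from the talk:

        // Spark 1.4 DataFrameReader; path and format are hypothetical.
        val training = sqlContext.read.json("/data/20news-train.json")

        // Expected schema, matching the slide:
        //  |-- label: integer
        //  |-- text: string
        training.printSchema()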
  15. Extract Features. Current data schema: label: Int, text: String
  16. Extract Features. Tokenizer and Hashed Term Freq. are Transformers: each maps a DataFrame to a DataFrame, extending the schema with words: Seq[String] and features: Vector (see the sketch below)
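     A sketch of the two feature Transformers, continuing the running example with the `training` DataFrame from the load step (the numFeatures value is illustrative):

        import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

        // Tokenizer: text -> words (Seq[String]).
        val tokenizer = new Tokenizer()
          .setInputCol("text")
          .setOutputCol("words")

        // HashingTF: words -> hashed term-frequency features (Vector).
        val hashingTF = new HashingTF()
          .setInputCol("words")
          .setOutputCol("features")
          .setNumFeatures(1000)

        // Each Transformer maps DataFrame -> DataFrame, appending its output column.
        val withWords    = tokenizer.transform(training)
        val withFeatures = hashingTF.transform(withWords)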
  17. Train a Model. Logistic Regression is an Estimator: it maps a DataFrame to a Model, and the fitted model adds prediction: Int to the schema (see the sketch below)
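     A sketch of the Estimator step, continuing from `withFeatures` above (parameter values are illustrative):

        import org.apache.spark.ml.classification.LogisticRegression

        // An Estimator: fit(DataFrame) returns a Model, which is itself a
        // Transformer that appends prediction (and related) columns.
        // Note: spark.ml may require the label column as Double; cast if needed.
        val lr = new LogisticRegression()
          .setMaxIter(10)
          .setRegParam(0.01)

        val model       = lr.fit(withFeatures)           // DataFrame -> Model
        val predictions = model.transform(withFeatures)  // appends prediction column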
  18. Evaluate the Model. An Evaluator maps a DataFrame (with the prediction columns) to a metric (see the sketch below)
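     A sketch of the Evaluator step, continuing from the predictions above; the slide does not name a specific evaluator, so BinaryClassificationEvaluator is an assumption that fits the two-class example:

        import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

        // An Evaluator maps a DataFrame of predictions to a single metric.
        val evaluator = new BinaryClassificationEvaluator()
          .setLabelCol("label")

        val metric = evaluator.evaluate(predictions)  // e.g. area under ROC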
  19. Data Flow. By default, each stage always appends new columns → you can go back & inspect intermediate results → made efficient by DataFrames
  20. ML Pipelines. The stages (Load data → Tokenizer → Hashed Term Freq. → Logistic Regression → Evaluate) are wrapped in a Pipeline, which can be re-run on test data exactly the same way (see the sketch below)
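     A sketch of wiring the stages above into a single Pipeline; `testData` is a hypothetical held-out DataFrame with the same schema as `training`:

        import org.apache.spark.ml.{Pipeline, PipelineStage}

        // Chain the stages so the whole workflow is one Estimator.
        val pipeline = new Pipeline()
          .setStages(Array[PipelineStage](tokenizer, hashingTF, lr))

        val pipelineModel   = pipeline.fit(training)              // fits all stages
        val testPredictions = pipelineModel.transform(testData)   // re-runs the exact same steps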
  21. Parameter Tuning with CrossValidator. Given an Estimator, a parameter grid (lr.regParam in {0.01, 0.1, 0.5}, hashingTF.numFeatures in {100, 1000, 10000}), and an Evaluator, find the best parameters (see the sketch below)
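     A sketch of CrossValidator over the grid shown on the slide, reusing the pipeline and evaluator above (the numFolds value is an illustrative choice):

        import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

        // The parameter grid from the slide.
        val paramGrid = new ParamGridBuilder()
          .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
          .addGrid(hashingTF.numFeatures, Array(100, 1000, 10000))
          .build()

        // Given an Estimator, a parameter grid, and an Evaluator,
        // CrossValidator finds the best parameter combination.
        val cv = new CrossValidator()
          .setEstimator(pipeline)
          .setEstimatorParamMaps(paramGrid)
          .setEvaluator(evaluator)
          .setNumFolds(3)                // illustrative

        val cvModel = cv.fit(training)   // model fit with the best parameters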
  22. Demo: ML Pipelines in Databricks Cloud
  23. Recap. ML Pipelines provide: integration with DataFrames, a familiar API based on scikit-learn, easy workflow inspection, and simple parameter tuning. Also: complex pipelines (compositions & DAGs), schema validation, user-defined Pipeline components (Estimators, Transformers, & Evaluators), and support for complex types (via DataFrames & UDTs)
  24. Looking Ahead (relative to Spark 1.4): Pipelines graduating from alpha! Many feature transformers, a more complete Python API, SparkR, PMML model export, more algorithms (OneVsRest, OnlineLDA, …), an MLlib API for SparkR, improved model inspection, and Pipeline optimizations
  25. Databricks, Inc.: founded by the creators of Spark & remains the largest contributor
  26. Thank you! Spark documentation: spark.apache.org • Pipelines blog post: databricks.com/blog/2015/01/07 • DataFrames blog post: databricks.com/blog/2015/02/17 • Databricks: databricks.com/product • Spark ML edX MOOC: ML with Spark • Spark Packages: spark-packages.org • Careers: databricks.com/company/careers
