Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark DataFrames and ML Pipelines

10,242 views

Published on

Presented at the MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science

Published in: Software
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Spark DataFrames and ML Pipelines

  1. 1. Spark DataFrames and ML Pipelines Joseph K. Bradley May 1, 2015 MLconf Seattle
  2. 2. Who am I? Joseph K. Bradley Ph.D. in ML from CMU, postdoc at Berkeley Apache Spark committer Software Engineer @ Databricks Inc. 2
  3. 3. Databricks Inc. 3 Founded by the creators of Spark & driving its development Databricks Cloud: the best place to run Spark Guess what…we’re hiring! databricks.com/company/careers
  4. 4. 4 Concise  APIs  in  Python,  Java,  Scala          …  and  R  in  Spark  1.4!   500+  enterprises  using  or  planning   to  use  Spark  in  producCon  (blog)   Spark   SparkSQL   Streaming   MLlib   GraphX   Distributed  compuCng  engine   •  Built  for  speed,  ease  of  use,   and  sophisCcated  analyCcs   •  Apache  open  source  
  5. 5. Beyond Hadoop 5 Early  adopters   (Data)  Engineers   MapReduce  &   funcConal  API   Data  ScienCsts   &  StaCsCcians  
  6. 6. Spark for Data Science DataFrames Intuitive manipulation of distributed structured data 6 Machine Learning Pipelines Simple construction and tuning of ML workflows
  7. 7. Google Trends for “dataframe” 7
  8. 8. DataFrames 8 dept   age   name   Bio   48   H  Smith   CS   54   A  Turing   Bio   43   B  Jones   Chem   61   M  Kennedy   RDD  API   DataFrame  API   Data  grouped  into   named  columns  
  9. 9. DataFrames 9 dept   age   name   Bio   48   H  Smith   CS   54   A  Turing   Bio   43   B  Jones   Chem   61   M  Kennedy   Data  grouped  into   named  columns   DSL  for  common  tasks   •  Project,  filter,  aggregate,  join,  …   •  Metadata   •  UDFs  
  10. 10. Spark DataFrames 10 API inspired by R and Python Pandas •  Python, Scala, Java (+ R in dev) •  Pandas integration Distributed DataFrame Highly optimized
  11. 11. 11 0 2 4 6 8 10 RDD Scala RDD Python Spark Scala DF Spark Python DF Runtime of aggregating 10 million int pairs (secs) Spark DataFrames are fast be.er   Uses  SparkSQL   Catalyst  op;mizer  
  12. 12. 12 Demo:  DataFrames  
  13. 13. Spark for Data Science DataFrames •  Structured data •  Familiar API based on R & Python Pandas •  Distributed, optimized implementation 13 Machine Learning Pipelines Simple construction and tuning of ML workflows
  14. 14. About Spark MLlib Started @ Berkeley •  Spark 0.8 Now (Spark 1.3) •  Contributions from 50+ orgs, 100+ individuals •  Growing coverage of distributed algorithms Spark   SparkSQL   Streaming   MLlib   GraphX   14
  15. 15. About Spark MLlib Classification •  Logistic regression •  Naive Bayes •  Streaming logistic regression •  Linear SVMs •  Decision trees •  Random forests •  Gradient-boosted trees Regression •  Ordinary least squares •  Ridge regression •  Lasso •  Isotonic regression •  Decision trees •  Random forests •  Gradient-boosted trees •  Streaming linear methods 15 Statistics •  Pearson correlation •  Spearman correlation •  Online summarization •  Chi-squared test •  Kernel density estimation Linear algebra •  Local dense & sparse vectors & matrices •  Distributed matrices •  Block-partitioned matrix •  Row matrix •  Indexed row matrix •  Coordinate matrix •  Matrix decompositions Frequent itemsets •  FP-growth Model import/export Clustering •  Gaussian mixture models •  K-Means •  Streaming K-Means •  Latent Dirichlet Allocation •  Power Iteration Clustering Recommendation •  Alternating Least Squares Feature extraction & selection •  Word2Vec •  Chi-Squared selection •  Hashing term frequency •  Inverse document frequency •  Normalizer •  Standard scaler •  Tokenizer
  16. 16. ML Workflows are complex 16 Image  classificaCon   pipeline*   *  Evan  Sparks.  “ML  Pipelines.”   amplab.cs.berkeley.edu/ml-­‐pipelines   à Specify  pipeline   à Inspect  &  debug   à Re-­‐run  on  new  data   à Tune  parameters  
  17. 17. Example: Text Classification 17 Goal: Given a text document, predict its topic. Subject: Re: Lexan Polish? Suggest McQuires #1 plastic polish. It will help somewhat but nothing will remove deep scratches without making it worse than it already is. McQuires will do something... 1:  about  science   0:  not  about  science   Label  Features   Dataset:  “20  Newsgroups”   From  UCI  KDD  Archive  
  18. 18. ML Workflow 18 Train  model   Evaluate   Load  data   Extract  features  
  19. 19. Load Data 19 Train  model   Evaluate   Load  data   Extract  features   built-in external { JSON } JDBC and more … Data sources for DataFrames
  20. 20. Load Data 20 Train  model   Evaluate   Load  data   Extract  features   label: Int text: String Current  data  schema  
  21. 21. Extract Features 21 Train  model   Evaluate   Load  data   Extract  features   label: Int text: String Current  data  schema  
  22. 22. Extract Features 22 Train  model   Evaluate   Load  data   label: Int text: String Current  data  schema   Tokenizer   Hashed  Term  Freq.   features: Vector words: Seq[String] Transformer
  23. 23. Train a Model 23 LogisAc  Regression   Evaluate   label: Int text: String Current  data  schema   Tokenizer   Hashed  Term  Freq.   features: Vector words: Seq[String] prediction: Int Estimator Load  data   Transformer
  24. 24. Evaluate the Model 24 LogisCc  Regression   Evaluate   label: Int text: String Current  data  schema   Tokenizer   Hashed  Term  Freq.   features: Vector words: Seq[String] prediction: Int Load  data   Transformer Evaluator Estimator By  default,  always   append  new  columns     à Can  go  back  &  inspect   intermediate  results     à Made  efficient  by   DataFrame   opCmizaCons  
  25. 25. ML Pipelines 25 LogisCc  Regression   Evaluate   Tokenizer   Hashed  Term  Freq.   Load  data   Pipeline Test  data   LogisCc  Regression   Tokenizer   Hashed  Term  Freq.   Evaluate   Re-­‐run  exactly   the  same  way  
  26. 26. Parameter Tuning 26 LogisCc  Regression   Evaluate   Tokenizer   Hashed  Term  Freq.   lr.regParam {0.01, 0.1, 0.5} hashingTF.numFeatures {100, 1000, 10000} Given: •  Estimator •  Parameter grid •  Evaluator Find best parameters CrossValidator
  27. 27. 27 Demo:  ML  Pipelines  
  28. 28. Recap DataFrames •  Structured data •  Familiar API based on R & Python Pandas •  Distributed, optimized implementation Machine Learning Pipelines •  Integration with DataFrames •  Familiar API based on scikit-learn •  Simple parameter tuning 28 Composable  &  DAG  Pipelines   Schema  validaCon   User-­‐defined  Transformers   &  EsCmators  
  29. 29. Looking Ahead Collaborations with UC Berkeley & others •  Auto-tuning models 29 DataFrames •  Further optimization •  API for R ML Pipelines •  More algorithms & pluggability •  API for R
  30. 30. Thank you! Spark  documentaCon          spark.apache.org     Pipelines  blog  post          databricks.com/blog/2015/01/07     DataFrames  blog  post          databricks.com/blog/2015/02/17     Databricks  Cloud  Plalorm          databricks.com/product     Spark  MOOCs  on  edX          Intro  to  Spark  &  ML  with  Spark     Spark  Packages          spark-­‐packages.org    

×