
Integrating Spark with PyData and R

by Joseph Bradley

  1. Integrating Spark with PyData and R. Joseph K. Bradley, AMP Camp 6, November 2015
  2. About me: Apache Spark committer; Software Engineer @ Databricks; Ph.D. in Machine Learning @ Carnegie Mellon University
  3. About MLlib: a general machine learning library for big data. • Scalable & robust • Coverage of common algorithms • Tools for practical workflows • Integration with existing data science tools
  4. Algorithm coverage (list based on Spark 1.5)
     • Classification: logistic regression w/ elastic net, naive Bayes, streaming logistic regression, linear SVMs, decision trees, random forests, gradient-boosted trees, multilayer perceptron, one-vs-rest
     • Regression: least squares w/ elastic net, isotonic regression, decision trees, random forests, gradient-boosted trees, streaming linear methods
     • Recommendation: alternating least squares
     • Frequent itemsets: FP-growth, prefix span
     • Clustering: Gaussian mixture models, K-Means, streaming K-Means, latent Dirichlet allocation, power iteration clustering
     • Statistics: Pearson correlation, Spearman correlation, online summarization, chi-squared test, kernel density estimation
     • Linear algebra: local dense & sparse vectors & matrices; distributed matrices (block-partitioned, row, indexed row, coordinate); matrix decompositions
     • Feature extraction & selection: Binarizer, Bucketizer, chi-squared selection, CountVectorizer, discrete cosine transform, ElementwiseProduct, hashing term frequency, inverse document frequency, MinMaxScaler, NGram, Normalizer, one-hot encoder, PCA, PolynomialExpansion, RFormula, SQLTransformer, standard scaler, StopWordsRemover, StringIndexer, Tokenizer, VectorAssembler, VectorIndexer, VectorSlicer, Word2Vec
     • Model import/export; Pipelines
  5. High-level functionality. Learning tasks: classification, regression, recommendation, clustering, frequent itemsets. Workflow utilities: model import/export, pipelines, DataFrames, cross validation. Data utilities: feature extraction & selection, statistics, linear algebra
  6. How does Spark MLlib relate to existing ML libraries?
  7. scikit-learn & R
  8. (no transcript text for this slide)
  9. scikit-learn & R: great libraries (detailed documentation & how-to guides; many packages & extensions), with investment in education, tooling & workflows
  10. Big Data. Examples: a topic model on 4.5 million Wikipedia articles; recommendation with 50 million users, 5 million songs, 50 billion ratings; a scaling chart for trees
  11. Big Data & MLlib: more data → higher accuracy; scale with the business (# users, available data); integrate with production systems
  12. Bridging the gap. At school: machine learning on a laptop. But now: machine learning on a huge computing cluster. How do you get from a single-machine workload to a fully distributed one?
  13. Wish list: run original code on a production environment (use distributed data sources); distribute the ML workload piece by piece (only distribute as needed; easily switch between local & distributed settings); use familiar APIs
  14. Our task: sentiment analysis. Given a review (text), predict the user's rating. Data from https://snap.stanford.edu/data/web-Amazon.html
  15. Our ML workflow: Text ("This scarf I bought is very strange. When I ...") with Label (Rating = 3.0) → Tokenizer → Words ([This, scarf, I, bought, ...]) → Hashing Term-Freq → Features ([2.0, 0.0, 3.0, ...]) → Linear Regression → Prediction (Rating = 2.0)
  16. Our ML workflow: Tokenizer and Hashing Term-Freq form the feature extraction step; Linear Regression is the model training step (a minimal PySpark sketch of this pipeline follows below)
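
To make slides 15-16 concrete, here is a rough PySpark sketch of the same pipeline; this is not code from the talk. It assumes Spark 1.5-era APIs, a SparkContext `sc` (as in the pyspark shell or a notebook), and a tiny made-up stand-in for the Amazon review data with columns `text` and `label`.

```python
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression

sqlContext = SQLContext(sc)
# Tiny stand-in for the Amazon review data: (review text, rating)
reviews = sqlContext.createDataFrame(
    [("This scarf I bought is very strange. When I ...", 3.0),
     ("Great scarf, exactly as described.", 5.0)],
    ["text", "label"])

# Feature extraction: text -> words -> hashed term-frequency vector
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
# Model training: predict the rating from the features
lr = LinearRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(reviews)
predictions = model.transform(reviews)  # adds a "prediction" column
```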
  17. Our ML workflow: Cross Validation wraps feature extraction and model training, tuning the regularization parameter over {0.0, 0.1, ...} (see the CrossValidator sketch below)
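
A minimal sketch of that tuning step using Spark ML's built-in CrossValidator, reusing the hypothetical `pipeline`, `lr`, and `reviews` objects from the previous sketch; the grid values are only illustrative.

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Candidate values for the regularization parameter, as on the slide
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.0, 0.1, 1.0])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(metricName="rmse"),
                    numFolds=3)

# A real dataset is needed for 3-fold cross validation, not the 2-row toy one above.
cvModel = cv.fit(reviews)      # fits one model per (fold, parameter setting)
bestModel = cvModel.bestModel  # the pipeline refit with the best regParam
```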
  18. Cross validation: Feature Extraction → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
  19. Cross validation: Feature Extraction → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
  20. Distribute cross validation: Feature Extraction → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
  21. Distribute feature extraction: Feature Extraction #1, #2, #3, ... → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
  22. Distribute feature extraction: Feature Extraction #1, #2, #3, ... → Model #1 Training, Model #2 Training, Model #3 Training, ... → Best Model
  23. Distribute learning: Feature Extraction #1, #2, #3, ... → Model #1 Training, Model #2 Training, ... → Best Model (a sketch of running scikit-learn training tasks on Spark workers follows below)
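
pdspark itself was only announced in this talk, so the sketch below is not its API; it only illustrates the general idea behind "run scikit-learn tasks on workers": broadcast a small, already-featurized dataset and fit one scikit-learn model per candidate parameter inside a Spark task. `sc` is assumed to be an existing SparkContext, and the data and parameter grid are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical small, featurized dataset that fits in memory on each worker
X_local = np.random.rand(1000, 100)
y_local = np.random.rand(1000)
bX, by = sc.broadcast(X_local), sc.broadcast(y_local)

def evaluate(alpha):
    # Runs on a worker: plain scikit-learn, no Spark calls inside
    score = cross_val_score(Ridge(alpha=alpha), bX.value, by.value, cv=3).mean()
    return alpha, score

# One Spark task per candidate parameter; results come back to the driver
results = sc.parallelize([0.01, 0.1, 1.0, 10.0]).map(evaluate).collect()
best_alpha, best_score = max(results, key=lambda kv: kv[1])
```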
  24. Wish list, revisited. The wishes: run original code on a production environment (use distributed data sources); distribute the ML workload piece by piece (only distribute as needed; easily switch between local & distributed settings); use familiar APIs. How the demo met them (see the conversion sketch below):
     • Run Python code on the Spark driver; convert DataFrames Spark ↔ pandas
     • Cross Validation: pdspark runs scikit-learn tasks on workers
     • Feature Extraction: convert vectors/matrices Spark ↔ scipy
     • Model Training: pdspark with minimal modifications to code; MLlib with similar functionality & workflow
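
For the "convert DataFrames Spark ↔ pandas" piece, the conversion is built into PySpark. A minimal sketch, assuming a Spark 1.5-style `sqlContext` and a Spark DataFrame `df` (for example the `reviews` DataFrame from the earlier sketch) that is small enough to collect to the driver:

```python
pdf = df.toPandas()                    # Spark -> pandas: collects rows to the driver
print(pdf.describe())                  # any local pandas / scikit-learn work happens here

df2 = sqlContext.createDataFrame(pdf)  # pandas -> Spark: redistribute across the cluster
```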
  25. Improvements we observed: 1) faster model selection for smaller data; 2) faster training for larger data; 3) better predictions with more data. In practice, problems are even larger: use more folds of cross validation, tune more parameters, and increase model size as dataset size increases
  26. Repeating this at home. This demo used Spark 1.5 and the pdspark Spark Package (to be released soon!). Also see the sparkit-learn package: http://spark-packages.org/package/lensacom/sparkit-learn. Demo notebooks:
     • Stage 1 (single-machine): https://databricks.com/wp-content/uploads/2015/10/spark-summit-eu-joseph-demo_-_1.html
     • Stage 2 (distributed cross-validation): https://databricks.com/wp-content/uploads/2015/10/spark-summit-eu-joseph-demo_-_2.html
     • Stage 3 (distributed feature extraction): https://databricks.com/wp-content/uploads/2015/10/spark-summit-eu-joseph-demo_-_3.html
     • Stage 4 (distributed everything): https://databricks.com/wp-content/uploads/2015/10/spark-summit-eu-joseph-demo_-_4.html
  27. Integrations with R. DataFrames: conversions between R (local) & Spark (distributed); SQL queries from R; an R-like MLlib API for generalized linear models:
     model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
     head(filter(df, df$waiting < 50))
     ##   eruptions waiting
     ## 1     1.750      47
     ## 2     1.750      47
     ## 3     1.867      48
  28. Coming in Spark 1.6 for MLlib. Algorithms: survival analysis, online hypothesis testing, more feature transformers, bisecting (faster) K-Means clustering. ML Pipelines API: pipeline persistence (save/load), R-like stats for linear models, expanded API coverage. DataFrames: more statistics functions. R: feature interactions in formula. Documentation: testable code examples
  29. MLlib roadmap, selected items: complete the ML Pipelines API (model persistence); model summaries & R-like statistics; ease-of-use (more implicit feature transformation, R formula, documentation)
  30. Thank you! spark.apache.org, spark-packages.org, databricks.com
