
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R


This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.



  1. Combining the Strengths of MLlib, scikit-learn, & R (Joseph K. Bradley, Spark Summit Europe, October 2015)
  2. About me: Apache Spark committer; Software Engineer @ Databricks; Ph.D. in Machine Learning @ Carnegie Mellon University
  3. scikit-learn & R
  4. (image-only slide)
  5. scikit-learn & R
     Great libraries:
     • Detailed documentation & how-to guides
     • Many packages & extensions
     Business investment:
     • Education
     • Tooling & workflows
  6. Big Data (chart: "Scaling (trees)")
     • Topic model on 4.5 million Wikipedia articles
     • Recommendation with 50 million users, 5 million songs, 50 billion ratings
  7. Big Data & MLlib
     • More data → higher accuracy
     • Scale with business (# users, available data)
     • Integrate with production systems
  8. Bridging the gap: how do you get from a single-machine workload to a fully distributed one?
     At school: machine learning with R on my laptop.
     The goal: machine learning on a huge computing cluster.
  9. Wish list
     • Run original code on a production environment
     • Use distributed data sources
     • Distribute ML workload piece by piece
     • Only distribute as needed
     • Easily switch between local & distributed settings (see the sketch below)
     • Use familiar APIs
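A minimal sketch of the "switch easily between local & distributed" item, using pandas-to-Spark DataFrame conversions. It assumes an existing SparkSession named `spark` (the modern API; the talk itself ran on Spark 1.5, where this went through sqlContext), and the toy review row is invented:

    import pandas as pd

    # Local pandas DataFrame with the same schema as the demo's review data.
    pdf = pd.DataFrame({"text": ["This scarf I bought is very strange."],
                        "rating": [3.0]})

    # Promote to a distributed Spark DataFrame, then pull it back locally.
    # `spark` is assumed to be an existing SparkSession.
    df = spark.createDataFrame(pdf)
    back = df.toPandas()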
  10. Our task: sentiment analysis. Given a review (text), predict the user's rating. Data from https://snap.stanford.edu/data/web-Amazon.html
  11. Our ML workflow (diagram):
     Text ("This scarf I bought is very strange. When I ...") with label Rating = 3.0
     → Tokenizer → Words ([This, scarf, I, bought, ...])
     → Hashing Term-Freq → Features ([2.0, 0.0, 3.0, ...])
     → Linear Regression → Prediction (Rating = 2.0)
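A hedged sketch of this workflow in its single-machine, scikit-learn form. HashingVectorizer plays the role of the Tokenizer + Hashing Term-Freq stages, Ridge stands in for a regularized linear regression, and the two toy reviews are invented stand-ins for the Amazon data:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import Ridge

    texts = ["This scarf I bought is very strange. When I ...",
             "Great scarf, exactly as described."]       # invented toy reviews
    ratings = [3.0, 5.0]

    # HashingVectorizer tokenizes and hashes term frequencies in one step,
    # mirroring the Tokenizer -> Hashing Term-Freq stages on the slide.
    pipe = Pipeline([
        ("features", HashingVectorizer(n_features=1 << 10, norm=None,
                                       alternate_sign=False)),
        ("lr", Ridge(alpha=0.1)),        # regularized linear regression
    ])
    pipe.fit(texts, ratings)
    print(pipe.predict(["This scarf is very strange ..."]))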
  12. Our ML workflow (diagram): Feature Extraction → Linear Regression, wrapped in Cross Validation over the regularization parameter {0.0, 0.1, ...}
  13. Cross validation (diagram): Feature Extraction feeds Linear Regression #1, #2, #3, ...; Cross Validation picks the best Linear Regression.
  14. Cross validation (same diagram; animation build of slide 13)
  15. Distribute cross validation (diagram): the Linear Regression candidates #1, #2, #3, ... are now trained in parallel on the cluster, while Feature Extraction stays as-is; Cross Validation picks the best model.
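One way to distribute only the model selection, sketched with the spark-sklearn package (the released form of the pdspark package mentioned on slide 21); its GridSearchCV mirrors scikit-learn's but evaluates the parameter grid across the cluster. `sc` is assumed to be an existing SparkContext, and the data is an invented toy set:

    import numpy as np
    from sklearn.linear_model import Ridge
    from spark_sklearn import GridSearchCV   # pip install spark-sklearn

    X = np.random.rand(100, 5)               # toy features
    y = np.random.rand(100)                  # toy targets

    # Each (parameter, fold) combination runs as a separate Spark task;
    # the estimator itself is unchanged scikit-learn.
    grid = {"alpha": [0.0, 0.1, 1.0]}        # regularization grid, as on slide 12
    gs = GridSearchCV(sc, Ridge(), grid, cv=3)
    gs.fit(X, y)
    print(gs.best_params_)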
  16. Distribute feature extraction (diagram): Feature Extraction #1, #2, #3, ... also run in parallel, each feeding its Linear Regression candidate; Cross Validation picks the best model.
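A hedged sketch of pushing just the feature extraction onto the cluster with pyspark.ml's Tokenizer and HashingTF (again assuming an existing SparkSession `spark` and an invented toy row); the hashed features can still be collected back to pandas while the learner remains local:

    from pyspark.ml.feature import Tokenizer, HashingTF

    df = spark.createDataFrame(
        [("This scarf I bought is very strange.", 3.0)],   # toy review
        ["text", "rating"])

    words = Tokenizer(inputCol="text", outputCol="words").transform(df)
    feats = HashingTF(inputCol="words", outputCol="features",
                      numFeatures=1 << 10).transform(words)

    # If the learning step is still single-machine, bring the features back locally.
    local = feats.select("features", "rating").toPandas()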
  17. Distribute learning (diagram): Feature Extraction #1, #2, #3, ... and Linear Regression #1, #2, ... all run distributed under Cross Validation, which picks the best model.
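And a hedged sketch of the fully distributed endpoint: feature extraction, MLlib's LinearRegression, and cross-validated model selection expressed as one pyspark.ml Pipeline. Here `train_df` is a hypothetical stand-in for a distributed DataFrame of (text, rating) rows:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    lr = LinearRegression(featuresCol="features", labelCol="rating")
    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 10),
        lr,
    ])

    # Same regularization grid as before, now searched and trained on the cluster.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1, 1.0]).build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=RegressionEvaluator(labelCol="rating"),
                        numFolds=3)
    model = cv.fit(train_df)   # train_df: distributed DataFrame of (text, rating)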
  18. Improvements we observed:
     1) Faster model selection for small data
     2) Faster training for large data
     3) Better predictions (R^2) with more data
     Also, in practice:
     • More folds of Cross Validation
     • Tune more parameters
     • Increase model size as dataset size increases
  19. Integrations
     • Distributed data sources
     • Conversions between pandas & Spark
     • Conversions between scipy & MLlib types (see the sketch below)
     • Distributed model selection
     • Distributed feature extraction
     • Distributed learning
     • Conversions between scikit-learn & MLlib models (see the sketch below)
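A hedged sketch of two of the conversions above: MLlib's local vector types interoperate with NumPy and SciPy, and the weights of a fitted MLlib linear model can be copied into a scikit-learn estimator so existing local code keeps working (the weight values below are invented for illustration):

    import numpy as np
    from scipy.sparse import csc_matrix
    from pyspark.mllib.linalg import Vectors

    # MLlib's Python API treats NumPy arrays as dense vectors and
    # single-column scipy.sparse matrices as sparse vectors.
    dense = Vectors.dense(np.array([2.0, 0.0, 3.0]))
    sparse = csc_matrix(np.array([[2.0], [0.0], [3.0]]))

    # Illustrative model hand-off: plug a fitted MLlib linear model's weights
    # into scikit-learn so local code can keep scoring with it.
    from sklearn.linear_model import LinearRegression
    skl = LinearRegression()
    skl.coef_ = np.array([0.5, -0.2, 1.0])  # e.g., mllib_model.weights (invented)
    skl.intercept_ = 0.1                    # e.g., mllib_model.intercept (invented)
    print(skl.predict(np.array([[2.0, 0.0, 3.0]])))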
  20. Integrations with R DataFrames
     • Conversions between R (local) & Spark (distributed)
     • SQL queries from R
     An R-like MLlib API for generalized linear models:
       model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
       head(filter(df, df$waiting < 50))
       ##   eruptions waiting
       ## 1     1.750      47
       ## 2     1.750      47
       ## 3     1.867      48
  21. Repeating this at home. This demo used:
     • Spark 1.5
     • The pdspark Spark Package (to be released soon!)
     The code will be posted online. Also see the sparkit-learn package.
     Try it on Databricks with a free trial @ databricks.com
  22. What's next? Further work on integrations:
     • Python: support more models & data types
     • R: expand GLM formulas (feature interactions) & other models
     • Match features & behavior
     Get involved!
     • Contribute to Spark & Spark packages
     • Provide feedback
  23. Thank you! spark.apache.org | spark-packages.org | databricks.com
