
Distributed R: The Next Generation Platform for Predictive Analytics



In this talk we share some ideas around big data and we present Distributed R, a distributed machine learning framework for R built on those ideas.

Published in: Technology


  1. Distributed R: The Next Generation Platform for Predictive Analytics. Jorge Martinez, Vishrut Gupta, Ed Ma. April 10th, 2015
  2. About me: FPGAs (Barcelona, 2009); embedded software and GPUs (Barcelona, 2011); distributed systems and ML (SF, 2013). @jorgemarsal
  3. The data explosion
  4. Horizontal scaling. The shift from BI to data science happens!
  5. Predictive analytics workflow: Build Models, Evaluate Models, Deploy Models (in-database scoring), BI Integration. (1) Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB). (2) Build and evaluate predictive models on large datasets using Distributed R. (3) Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.
  6. Data scientists' preferred languages: R and SQL. Adoption of R has increased across industries.
  7. R is … “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” - Bo Cowgill, Google
  8. R is … Strengths: popular, open source, flexible, extensible. Weaknesses: not scalable, no parallel algorithms, limited pre/post processing.
  9. Horizontal scaling. Functional programming and big data: scale-out.
  10. Horizontal scaling. “The future has arrived, it’s just not evenly distributed yet” - William Gibson. Ship code to data; functional programming.
  11. Distributed R: The Next Generation Platform for Predictive Analytics
  12. Distributed R: a new enterprise-class predictive analytics platform. A scalable, high-performance platform for the R language. • Implemented as an R package • Open source. Use familiar GUIs and packages. Analyze data too large for vanilla R. Leverage multiple nodes for distributed processing. Vastly improved performance.
  13. Distributed R: architecture. Master: • Schedules tasks across the cluster • Sends commands/code to workers. Workers: • Do the actual work • Own the data • Work on independent data partitions in parallel. Diagram: DistR Master; Worker 1, Worker 2, Worker 3, Worker 4.
  14. Distributed R: Distributed data structures. darray: • Relies on user-defined partitioning • Also supports distributed data frames and lists.
  15. Distributed R: Distributed code. foreach: • Express computations over partitions • Execute across the cluster. Diagram: f(x).
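
The partition-and-apply model on the architecture, darray, and foreach slides can be sketched roughly as follows. This is a minimal Python illustration, not the Distributed R API: `partition_rows`, `foreach_partition`, and `partial_sum` are hypothetical names, and a local thread pool stands in for the cluster workers. The "master" splits the data by a user-chosen partition count, ships a function to each partition, and combines the small per-partition results.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, n_parts):
    """Split rows into n_parts contiguous blocks of near-equal size."""
    base, extra = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts


def foreach_partition(parts, fn, n_workers=4):
    """Apply fn to every partition in parallel; results keep partition order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fn, parts))


def partial_sum(part):
    """Per-partition work: return (sum, count) so the master can combine them."""
    return sum(part), len(part)


# Toy data (made up for illustration): the integers 1..100 in 4 partitions.
data = list(range(1, 101))
parts = partition_rows(data, 4)

# Each worker only sees its own partition; the master merges the partials.
partials = foreach_partition(parts, partial_sum)
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
```

Only the small `(sum, count)` pairs travel back to the master, which is the point of shipping code to the data rather than the reverse.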
  16. Distributed R basic demo
  17. Distributed R: Built-in distributed algorithms. • Similar signature and accuracy as R packages • Scalable and high performance • E.g., regression on billions of rows in a couple of minutes. Algorithms and use cases: Linear Regression (GLM): risk analysis, trend analysis, etc.; Logistic Regression (GLM): customer response modeling, healthcare analytics (disease analysis); Random Forest: customer churn, market campaign analysis; K-Means Clustering: customer segmentation, fraud detection, anomaly detection; Page Rank: identify influencers.
  18. Distributed R March Madness demo
  19. Parallel Random Forest Example. Random Forest: building an ensemble of deep decision trees. Need to build 100 decision trees on 4 machines. Each machine builds 25 decision trees. Can use random forest to predict a March Madness bracket. Diagram: example decision tree with threshold splits on features (e.g., X7 > 5, X3 > 3) and 0/1 leaves.
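
The "100 trees on 4 machines, 25 each" split can be sketched as below. This is an illustrative Python sketch, not the Distributed R random forest: bagged one-split decision stumps stand in for deep trees, a thread pool stands in for the four machines, and all function names and the toy data are assumptions made for the example. Each worker builds its share of the ensemble independently, and the shares are concatenated and combined by majority vote.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_tree_budget(n_trees, n_workers):
    """Divide n_trees as evenly as possible across n_workers."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def train_stump(X, y, rng):
    """Fit a one-split stump on a bootstrap sample of (X, y)."""
    n = len(X)
    sample = [rng.randrange(n) for _ in range(n)]
    best = None  # (errors, feature, threshold, label_above)
    for feature in range(len(X[0])):
        for i in sample:
            threshold = X[i][feature]
            # Errors when predicting 1 above the threshold and 0 otherwise.
            errors = sum((X[j][feature] > threshold) != (y[j] == 1) for j in sample)
            # Also consider the flipped rule (predict 0 above the threshold).
            for label_above, err in ((1, errors), (0, n - errors)):
                if best is None or err < best[0]:
                    best = (err, feature, threshold, label_above)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, label_above = stump
    return label_above if x[feature] > threshold else 1 - label_above


def build_trees(task):
    """Worker task: build this worker's share of the ensemble."""
    X, y, n_trees, seed = task
    rng = random.Random(seed)  # independent random stream per worker
    return [train_stump(X, y, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote over every stump in the merged ensemble."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]


# Toy data (made up): the label is 1 when the first feature exceeds 0.5.
data_rng = random.Random(0)
X = [[data_rng.random() for _ in range(3)] for _ in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]

budget = split_tree_budget(100, 4)  # [25, 25, 25, 25]
tasks = [(X, y, n, seed) for seed, n in enumerate(budget)]
with ThreadPoolExecutor(max_workers=4) as pool:
    forest = [s for part in pool.map(build_trees, tasks) for s in part]

accuracy = sum(predict_forest(forest, r) == t for r, t in zip(X, y)) / len(X)
```

Because the trees never depend on one another, the only coordination needed is handing out the budget and collecting the results.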
  20. March Madness Bracket. Train a model to predict individual games: use team and opponent features to train the model • blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy. Calculate the summary statistics of each team: group by team and take the mean of each team’s features. Predict the result of a game: concatenate the summary statistics of the two teams and feed them to the model that predicts individual games. Fill out the bracket by predicting one game at a time.
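
The bracket pipeline described on this slide could look roughly like the sketch below. It is a hedged Python illustration with made-up data: `team_means`, `matchup_features`, `fill_bracket`, and the `predict_win` rule are hypothetical, and `predict_win` is only a placeholder for the trained game-level model. It also assumes the number of teams is a power of two and uses just four of the listed features.

```python
from collections import defaultdict

FEATURES = ["blocks", "steals", "assists", "rebounds"]


def team_means(game_rows):
    """Group per-game rows by team and average each feature."""
    sums = defaultdict(lambda: [0.0] * len(FEATURES))
    counts = defaultdict(int)
    for row in game_rows:
        counts[row["team"]] += 1
        for k, name in enumerate(FEATURES):
            sums[row["team"]][k] += row[name]
    return {team: [s / counts[team] for s in sums[team]] for team in sums}


def matchup_features(means, team, opponent):
    """Concatenate the team's summary statistics with the opponent's."""
    return means[team] + means[opponent]


def predict_win(features):
    """Placeholder for the trained model: True if the first team wins."""
    half = len(features) // 2
    return sum(features[:half]) > sum(features[half:])


def fill_bracket(teams, means, model):
    """Single elimination: predict one game at a time until one team is left."""
    current = list(teams)
    while len(current) > 1:
        winners = []
        for a, b in zip(current[0::2], current[1::2]):
            winners.append(a if model(matchup_features(means, a, b)) else b)
        current = winners
    return current[0]


# Made-up per-game statistics for four hypothetical teams.
games = [
    {"team": "A", "blocks": 4, "steals": 6, "assists": 14, "rebounds": 30},
    {"team": "A", "blocks": 6, "steals": 8, "assists": 16, "rebounds": 34},
    {"team": "B", "blocks": 2, "steals": 4, "assists": 10, "rebounds": 28},
    {"team": "B", "blocks": 4, "steals": 6, "assists": 12, "rebounds": 30},
    {"team": "C", "blocks": 5, "steals": 7, "assists": 15, "rebounds": 35},
    {"team": "D", "blocks": 1, "steals": 3, "assists": 9, "rebounds": 25},
]

means = team_means(games)
champion = fill_bracket(["A", "B", "C", "D"], means, predict_win)
```

Swapping `predict_win` for a real classifier leaves the rest of the pipeline unchanged, since the model only ever sees the concatenated feature vector.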
  21. [Slide with no transcribed text]
  22. Distributed R Census demo using Shiny
  23. Distributed R rocks! • Regression on billions of rows in minutes • Graph algorithms on 10B edges • Load 400GB+ data from database to R in < 10 minutes • Open source!
  24. That’s cool… what can I do with it? • Collaborate • GitHub (report issues, send PRs) • Standardization with R-core • Get the SW + docs: distributed-r/ • Buy commercial support
  25. “The future has already arrived, it’s just not evenly distributed yet” - William Gibson
  26. Thank you