

- 1. Distributed R: The Next Generation Platform for Predictive Analytics. Jorge Martinez, Vishrut Gupta, Ed Ma. April 10th, 2015
- 2. About me: FPGAs (Barcelona, 2009); embedded software, GPUs (Barcelona, 2011); distributed systems and ML (SF, 2013). @jorgemarsal http://github.com/jorgemarsal
- 3. The data explosion
- 4. Horizontal scaling. The shift from BI to data science happens! https://www.youtube.com/watch?v=vbb-AjiXyh0
- 5. Predictive analytics workflow: (1) Ingest and prepare data by leveraging the HP Vertica Analytics Platform (SQL DB). (2) Build and evaluate predictive models on large datasets using Distributed R. (3) Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications. Diagram stages: Build Models, Evaluate Models, Deploy Models (in-database scoring), BI Integration.
- 6. Data scientists' preferred languages: R & SQL. Adoption of R has increased across industries. Sources: 1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html 2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
- 7. R is… “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” - Bo Cowgill, Google
- 8. R is… Strengths: popular, open source, flexible, extensible. Limitations: not scalable, no parallel algorithms, limited pre/post processing.
- 9. Horizontal scaling: functional programming and big data. Scale-out.
- 10. Horizontal scaling. “The future has arrived, it’s just not evenly distributed yet” - William Gibson. Ship code to data; functional programming.
- 11. Distributed R: The Next Generation Platform for Predictive Analytics
- 12. Distributed R: a new enterprise-class predictive analytics platform. A scalable, high-performance platform for the R language • Implemented as an R package • Open source. Use familiar GUIs and packages. Analyze data too large for vanilla R. Leverage multiple nodes for distributed processing. Vastly improved performance.
- 13. Distributed R: architecture. Master • Schedules tasks across the cluster • Sends commands/code to workers. Workers • Do the actual work • Own the data • Work on independent data partitions in parallel. (Diagram: a DistR master connected to Workers 1 to 4.)
- 14. Distributed R: Distributed data structures. darray • Relies on user-defined partitioning • Distributed data frames and lists are also supported.
- 15. Distributed R: Distributed code. foreach • Express computations over partitions • Execute across the cluster.
- 16. Distributed R basic demo
- 17. Distributed R: Built-in distributed algorithms • Similar signature and accuracy to R packages • Scalable and high performance • E.g., regression on billions of rows in a couple of minutes. Algorithms and use cases: Linear Regression (GLM): risk analysis, trend analysis, etc. Logistic Regression (GLM): customer response modeling, healthcare analytics (disease analysis). Random Forest: customer churn, market campaign analysis. K-Means Clustering: customer segmentation, fraud detection, anomaly detection. Page Rank: identifying influencers.
- 18. Distributed R March Madness demo
- 19. Parallel Random Forest example. Random Forest: building an ensemble of deep decision trees. To build 100 decision trees on 4 machines, each machine builds 25 decision trees. Random forest can be used to predict the March Madness bracket. (Diagram: an example decision tree with splits X7 > 5, X12 > 3.4, and X3 > 3, and leaf labels 0 and 1.)
- 20. March Madness bracket. Train a model to predict individual games: use team and opponent features (blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy). Calculate the summary statistics of each team: group by team and take the mean of each team’s features. Predict the result of a game: concatenate the summary statistics of the team and its opponent and feed them to the model that predicts individual games. Fill out the bracket by predicting one game at a time.
- 21. (No transcribed text for this slide.)
- 22. Distributed R Census demo using Shiny: http://15.126.194.41/public/index.html
- 23. Distributed R rocks! • Regression on billions of rows in minutes • Graph algorithms on 10B edges • Load 400GB+ of data from the database to R in < 10 minutes • Open source!
- 24. That’s cool… what can I do with it? • Collaborate • GitHub (report issues, send PRs) https://github.com/vertica/DistributedR • Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/ • Get the software and docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/ • Buy commercial support
- 25. “The future has already arrived, it’s just not evenly distributed yet” - William Gibson
- 26. Thank you http://github.com/vertica/distributedr
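
The architecture, darray, and foreach slides (13 to 15) describe data split into user-defined partitions, with the same function run on every partition in parallel. The following is a minimal Python sketch of that idea only. It is not the Distributed R API: `make_partitions`, `foreach_partition`, and `partial_sum` are hypothetical names, and a local thread pool stands in for the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(rows, n_parts):
    """Split a list into n_parts contiguous, near-equal partitions."""
    size, extra = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` partitions take one additional row.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def foreach_partition(parts, fn, max_workers=4):
    """Apply fn to every partition in parallel; results keep partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, parts))


def partial_sum(part):
    """Work done independently on one partition."""
    return sum(part)


# Hypothetical data: the integers 1..100 split across 4 "workers".
data = list(range(1, 101))
parts = make_partitions(data, 4)
partials = foreach_partition(parts, partial_sum)

# The "master" only combines the small per-partition results.
total = sum(partials)
```

Each partition is processed without reference to the others, which is what lets the work spread across nodes; only the per-partition results travel back to be combined.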

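The parallel random forest slide (19) splits 100 trees across 4 machines, 25 per machine. A rough Python sketch of that division of labor follows, under loud assumptions: a one-split stump stands in for a deep decision tree, a thread pool stands in for the machines, the toy data is invented, and none of the names come from Distributed R.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_tree_counts(n_trees, n_workers):
    """Divide n_trees as evenly as possible among n_workers."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def train_stump(rows, rng):
    """Fit a one-split 'tree' on a bootstrap sample (stand-in for a deep tree)."""
    sample = [rng.choice(rows) for _ in rows]
    feature = rng.randrange(len(rows[0][0]))
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[feature] <= threshold]
    right = [y for x, y in sample if x[feature] > threshold]
    fallback = Counter(y for _, y in sample).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def train_on_worker(job):
    """One worker builds its share of trees with its own random stream."""
    rows, n_trees, seed = job
    rng = random.Random(seed)
    return [train_stump(rows, rng) for _ in range(n_trees)]


def train_forest(rows, n_trees=100, n_workers=4, seed=0):
    """Train each worker's share in parallel, then merge into one ensemble."""
    counts = split_tree_counts(n_trees, n_workers)
    jobs = [(rows, count, seed + i) for i, count in enumerate(counts)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_worker = list(pool.map(train_on_worker, jobs))
    return [tree for trees in per_worker for tree in trees]


def predict(forest, x):
    """Majority vote over all trees in the merged ensemble."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (features, label) pairs.
rows = [((float(i), float(i % 3)), 1 if i >= 10 else 0) for i in range(20)]
forest = train_forest(rows)
```

Trees in a forest are trained independently of one another, so the only coordination needed is deciding how many trees each worker builds and concatenating the results.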