4. 4
Horizontal scaling
The shift from BI to data science happens!
https://www.youtube.com/watch?v=vbb-AjiXyh0
5. 5
Predictive analytics workflow
Stages: Build Models, Evaluate Models, Deploy Models (In-database scoring), BI Integration
1 Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB)
2 Build and evaluate predictive models on large datasets using Distributed R
3 Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.
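The three workflow steps can be sketched end to end. This is only an illustration in Python: SQLite stands in for HP Vertica, a one-feature least-squares fit stands in for a Distributed R model, and the table and column names (sales, model, new_customers, ad_spend, revenue) are made up for the example.

```python
import sqlite3

# Stand-in for the SQL database; the deck uses HP Vertica, not SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # toy data: revenue = 2 * ad_spend + 1
)

# Step 1: ingest and prepare data with SQL.
rows = conn.execute(
    "SELECT ad_spend, revenue FROM sales WHERE ad_spend IS NOT NULL"
).fetchall()

# Step 2: build a model outside the database (ordinary least squares, one feature).
n = len(rows)
mean_x = sum(x for x, _ in rows) / n
mean_y = sum(y for _, y in rows) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
    (x - mean_x) ** 2 for x, _ in rows
)
intercept = mean_y - slope * mean_x

# Step 3: deploy the model as a table and score new rows inside the database.
conn.execute("CREATE TABLE model (slope REAL, intercept REAL)")
conn.execute("INSERT INTO model VALUES (?, ?)", (slope, intercept))
conn.execute("CREATE TABLE new_customers (ad_spend REAL)")
conn.executemany("INSERT INTO new_customers VALUES (?)", [(5.0,), (10.0,)])
predictions = conn.execute(
    "SELECT n.ad_spend, m.slope * n.ad_spend + m.intercept "
    "FROM new_customers AS n CROSS JOIN model AS m ORDER BY n.ad_spend"
).fetchall()
```

Scoring in SQL means BI tools and applications can read predictions with an ordinary query, without moving the new rows out of the database.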
6. 6
Data Scientists’ Preferred Languages: R & SQL
Adoption of R increased across industries
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
7. 7
R is …
“The best thing about R is that it was developed by statisticians. The worst
thing about R is that… it was developed by statisticians.”
- Bo Cowgill, Google
10. 10
Horizontal scaling
“The future has arrived, it’s just
not evenly distributed yet”
- William Gibson
Ship code to data,
Functional programming
12. 12
Distributed R
A new enterprise-class predictive analytics platform
A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source
• Use familiar GUIs and packages
• Analyze data too large for vanilla R
• Leverage multiple nodes for distributed processing
• Vastly improved performance
13. 13
Distributed R: architecture
Master
• Schedules tasks across the cluster.
• Sends commands/code to workers
Workers
• Do the actual work
• Own the data
• Work on independent data partitions in parallel
Diagram: DistR Master and Workers 1 to 4
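The division of roles can be mimicked in a few lines of Python. This is a sketch of the idea only, not Distributed R's implementation: the Master and Worker classes are hypothetical, and threads in one process stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Owns one data partition and runs whatever function the master sends."""

    def __init__(self, name, partition):
        self.name = name
        self.partition = partition

    def run(self, task):
        return task(self.partition)


class Master:
    """Schedules a task on every worker; it never touches the data itself."""

    def __init__(self, workers):
        self.workers = workers

    def dispatch(self, task):
        # Send the same code to all workers and collect results in worker order.
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            return list(pool.map(lambda worker: worker.run(task), self.workers))


data = list(range(1, 101))
# Four workers, each owning 25 consecutive values.
workers = [Worker(f"worker-{i + 1}", data[i * 25:(i + 1) * 25]) for i in range(4)]
master = Master(workers)

partial_sums = master.dispatch(sum)  # code goes to the data, not the reverse
total = sum(partial_sums)
```

Only the small per-partition results travel back to the master, which is the point of keeping the data with the workers.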
14. 14
Distributed R: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data frames and lists
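To make user-defined partitioning concrete, here is a small Python sketch of an array stored as row blocks whose size the caller chooses. The PartitionedArray class and its parameters are hypothetical and are not the darray API; in a real cluster each block would live on a different worker.

```python
class PartitionedArray:
    """A 2-D array stored as a list of row blocks of a user-chosen size."""

    def __init__(self, nrow, ncol, block_rows, fill=0.0):
        if block_rows <= 0:
            raise ValueError("block_rows must be positive")
        self.nrow, self.ncol = nrow, ncol
        self.partitions = []
        for start in range(0, nrow, block_rows):
            rows_in_block = min(block_rows, nrow - start)  # last block may be short
            self.partitions.append(
                [[fill] * ncol for _ in range(rows_in_block)]
            )

    def npartitions(self):
        return len(self.partitions)

    def gather(self):
        """Reassemble the full array from its blocks, in order."""
        return [row for block in self.partitions for row in block]


# The caller decides the partitioning: 10 rows split into blocks of 4 rows.
arr = PartitionedArray(nrow=10, ncol=3, block_rows=4)
shapes = [(len(block), len(block[0])) for block in arr.partitions]
```

Here shapes lists the (rows, columns) of each block, so a different block_rows gives a different layout over the same data.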
15. 15
Distributed R: Distributed code
foreach
• Express computations over partitions
• Execute across the cluster
Diagram label: f(x)
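The f(x) pattern, one function applied to every partition, looks roughly like this in Python. The helper foreach_partition is a hypothetical name for illustration and does not reproduce Distributed R's foreach syntax; a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach_partition(partitions, func, max_workers=4):
    """Apply func to each partition in parallel; results keep partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, partitions))


def center(partition):
    """f(x): subtract the partition's own mean from each of its values."""
    mean = sum(partition) / len(partition)
    return [value - mean for value in partition]


partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
centered = foreach_partition(partitions, center)
```

Because center reads and writes only its own partition, the calls are independent and can run in any order.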
19. 19
Parallel Random Forest Example
Random Forest: building an ensemble of deep decision trees
• Need to build 100 decision trees on 4 machines
• Each machine builds 25 decision trees
• Can use random forest to predict a March Madness bracket
Figure: example decision tree with threshold splits on features and 0/1 leaf labels
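The 100-trees-on-4-machines split can be sketched as follows. Several simplifications are assumed: threads stand in for machines, one-split decision stumps on bootstrap samples stand in for deep trees, and the toy dataset is invented. It is a bagging illustration, not the Distributed R random forest.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample (stand-in for a deep tree)."""
    sample = [rng.choice(rows) for _ in rows]
    feature = rng.randrange(len(rows[0][0]))  # random feature, as in a random forest
    best = None
    for features, _ in sample:
        threshold = features[feature]
        # Count sample rows misclassified by the rule "value > threshold".
        errors = sum((f[feature] > threshold) != label for f, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return feature, best[1]


def build_trees(rows, n_trees, seed):
    """Work done by one machine: build its share of the ensemble."""
    rng = random.Random(seed)
    return [train_stump(rows, rng) for _ in range(n_trees)]


def predict(forest, features):
    """Majority vote over all trees."""
    votes = sum(features[feature] > threshold for feature, threshold in forest)
    return votes * 2 > len(forest)


# Toy data: two correlated features, label is True when the first exceeds 5.
rows = [((float(i), i + 0.5), i > 5) for i in range(11)]

N_TREES, N_MACHINES = 100, 4
per_machine = N_TREES // N_MACHINES  # 25 trees each
with ThreadPoolExecutor(max_workers=N_MACHINES) as pool:
    chunks = list(
        pool.map(lambda seed: build_trees(rows, per_machine, seed), range(N_MACHINES))
    )
forest = [tree for chunk in chunks for tree in chunk]
```

Each tree is trained independently of the others, so the only coordination needed is concatenating the four lists at the end.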
20. 21
March Madness Bracket
Train a model to predict individual games
Use team and opponent features to train a model
• blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy
Calculate the summary statistics of each team
Group by team and take the mean of each team’s features
Predict the result of the game
Concatenate the summary statistics of the team and its opponent and feed them to the model that predicts individual games
Fill out the bracket by predicting one game at a time
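The bracket procedure above can be sketched in Python. Everything concrete here is invented for illustration: the team names, the box scores, and toy_model, which is a hypothetical stand-in for the trained classifier (it simply compares feature sums).

```python
from collections import defaultdict

# Hypothetical per-game box scores: (team, assists, rebounds, field_goal_pct)
games = [
    ("Aces", 18, 40, 0.50), ("Aces", 22, 44, 0.54),
    ("Bears", 12, 30, 0.40), ("Bears", 14, 34, 0.44),
    ("Colts", 16, 36, 0.47), ("Colts", 16, 38, 0.49),
    ("Ducks", 10, 28, 0.38), ("Ducks", 12, 30, 0.42),
]


def team_means(rows):
    """Group by team and take the mean of each feature."""
    grouped = defaultdict(list)
    for team, *stats in rows:
        grouped[team].append(stats)
    return {
        team: [sum(column) / len(column) for column in zip(*stats)]
        for team, stats in grouped.items()
    }


def toy_model(features):
    """Hypothetical stand-in for a trained model: True if the first team wins."""
    half = len(features) // 2
    return sum(features[:half]) > sum(features[half:])


def predict_game(summary, team, opponent, model):
    # Concatenate both teams' summary statistics into one feature vector.
    return team if model(summary[team] + summary[opponent]) else opponent


def fill_bracket(teams, summary, model):
    """Predict one game at a time, advancing winners until one team remains."""
    rounds = []
    while len(teams) > 1:
        teams = [
            predict_game(summary, teams[i], teams[i + 1], model)
            for i in range(0, len(teams), 2)
        ]
        rounds.append(teams)
    return rounds


summary = team_means(games)
rounds = fill_bracket(["Aces", "Ducks", "Bears", "Colts"], summary, toy_model)
```

The sketch assumes the number of teams is a power of two, so every round pairs up cleanly; rounds holds the winners of each round, ending with the predicted champion.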
23. 24
Distributed R rocks!
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!
24. 25
That’s cool… what can I do with it?
• Collaborate
• GitHub (report issues, send PRs): https://github.com/vertica/DistributedR
• Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
• Buy commercial support
25. 26
“The future has already arrived,
it’s just not evenly distributed yet”
- William Gibson