"R is the most popular language in the data-science community with 2+ million users and 6000+ R packages. R’s adoption evolved along with its easy-to-use statistical language, graphics, packages, tools and active community. In this session we will introduce Distributed R, a new open-source technology that solves the scalability and performance limitations of vanilla R. Since R is single-threaded and does not scale to accommodate large datasets, Distributed R addresses many of R’s limitations. Distributed R efficiently shares sparse structured data, leverages multi-cores, and dynamically partitions data to mitigate load imbalance.
In this talk, we will show the promise of this approach by demonstrating how important machine learning and graph algorithms can be expressed in a single framework and are substantially faster under Distributed R. Additionally, we will show how Distributed R complements Vertica, a state-of-the-art columnar analytics database, to deliver a full-cycle, fully integrated, data “prep-analyze-deploy” solution."
Slide 3: HP Haven Big Data Platform
Turn 100% of your data into action: human data, business data, and machine data. Powering big-data analytics to applications and insight.
Haven OnDemand
• Vertica OnDemand
• IDOL OnDemand
Haven Enterprise
• Vertica Enterprise
• IDOL Enterprise
• Vertica for SQL on Hadoop
• Vertica Distributed R
• KeyView
Slide 4: Predictive analytics workflow
[Diagram: build models → evaluate models → deploy models (in-DB or web) → BI integration]
1) Ingest and prepare data by leveraging the HP Vertica Analytics Platform (SQL database).
2) Build and evaluate predictive models on large datasets using Distributed R.
3) Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications. Alternatively, deploy the model as a web service.
Slide 5: Outline
• Distributed R overview and examples
• Full-cycle, "end-to-end" predictive analytics demo with a real dataset, showcasing:
  • Distributed R, using the HPdglm package (Distributed R's parallel, high-performance generalized linear model algorithm by HP)
  • Vertica, with an in-database prediction function
  • In-database data preparation with Vertica.dplyr
Slide 7: R is …
• Popular
• Open source
• Flexible
• Extensible
But also:
• Not scalable
• No parallel algorithms
• Limited pre/post processing

"The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians."
– Bo Cowgill, Google
Slide 8: Data Scientists' Preferred Languages: R & SQL
Adoption of R has increased across industries.
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
Slide 9: Distributed R
A new enterprise-class predictive analytics platform: a scalable, high-performance platform for the R language.
• Implemented as an R package
• Open source
Benefits:
• Use familiar GUIs and packages
• Analyze data too large for vanilla R
• Leverage multiple nodes for distributed processing
• Vastly improved performance
Slide 10: Distributed R: architecture
Master
• Schedules tasks across the cluster
• Sends commands/code to workers
Workers
• Hold data partitions
• Apply functions to data partitions in parallel
Slide 11: Distributed R: distributed data structures
darray
• Relies on user-defined partitioning
• Distributed data frames and lists are also supported (see the sketch below)
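As a rough sketch of declaring the three distributed containers, based on the darray() signature shown later in this deck; the dframe() and dlist() argument names are our assumptions and should be checked against the distributedR package documentation:

library(distributedR)
distributedR_start()

# A dense 8x8 distributed array split into 4x4 partitions
A <- darray(dim=c(8,8), blocks=c(4,4), sparse=FALSE)

# A distributed data frame with the same partitioning scheme (assumed signature)
DF <- dframe(dim=c(8,8), blocks=c(4,4))

# A distributed list with 4 partitions (assumed signature)
L <- dlist(npartitions=4)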
Slide 12: Distributed R: distributed code
foreach
• Express computations over partitions
• Execute across the cluster
Slide 13: Distributed R: basic concepts

# Loads the package into R
library(distributedR)
# Starts up your cluster (as defined in XML)
distributedR_start()
# Declares a distributed array of dimensions 4x4, each partition 2x2
B <- darray(dim=c(4,4), blocks=c(2,2), sparse=FALSE)
# Sets each partition to a matrix containing integers equal to its partition id
foreach(i, 1:npartitions(B),
        init <- function(b = splits(B,i), index = i) {
          b <- matrix(index, nrow=nrow(b), ncol=ncol(b))
          update(b)
        })
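To inspect the result back on the master, the open-source package provides getpartition(), as far as we recall; treat the call below as a sketch to verify against the package reference:

# Fetch the assembled darray back to the master as an ordinary R matrix
M <- getpartition(B)
print(M)  # a 4x4 matrix whose 2x2 blocks hold their partition ids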
Slide 14: Distributed R: built-in distributed algorithms
• Similar signatures and accuracy to the corresponding R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes

Algorithm                 | Use cases
Linear Regression (GLM)   | Risk analysis, trend analysis, etc.
Logistic Regression (GLM) | Customer response modeling, healthcare analytics (disease analysis)
Random Forest             | Customer churn, marketing campaign analysis
K-Means Clustering        | Customer segmentation, fraud detection, anomaly detection
PageRank                  | Identifying influencers
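To give a feel for what invoking one of these algorithms looks like, here is an illustrative sketch of a distributed logistic regression fit with the HPdglm package named elsewhere in this deck; the argument names are our assumptions, not a confirmed signature:

library(HPdglm)  # assumption: the package exports hpdglm()

# Y: darray of 0/1 responses, X: darray of predictor columns
# family=binomial requests logistic regression (argument names assumed)
model <- hpdglm(responses=Y, predictors=X, family=binomial)
summary(model)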
Slide 15: Distributed R: summary
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ of data from database to R in < 10 minutes
• Open source!
Slide 16: That's cool… what can I do with it?
• Collaborate
  • GitHub (report issues, send PRs): https://github.com/vertica/DistributedR
  • Standardization with R-core: http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the software + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
• Buy commercial support
Slide 18: In our demo…
[Diagram: build models → evaluate models → deploy models (in-DB or web) → BI integration]
1) Retrieve the "bank-additional" dataset from the UCI ML repository.
2) Create a table in HP Vertica, and import the CSV of data into Vertica using the data loader.
3) Prepare the data using Vertica.dplyr: clean it up, create new columns, and separate it into training and testing sets.
4) Load the data into Distributed R and apply the GLM algorithm to produce a model.
5) Deploy the model back to Vertica.
6) Apply the model to data in the testing set; check for accuracy.
Slide 19: From Storage to Training to Prediction
[Diagram: data and models flow between the Vertica database and Distributed R partitions (P1-P4) via Vertica.dplyr]
Slide 20: What's Vertica?
• Developed from MIT's C-Store
• Fast, column-oriented analytics database
• Organizes data into projections
• Provides k-safety fault tolerance and redundancy
• Used in several industry applications for big-data storage and analysis (see our public customer list for examples)
Slide 22: The Bank Marketing Dataset
• Background
  – A Portuguese banking institution ran a marketing campaign selling long-term deposits and collected data from the clients it contacted, covering various socioeconomic indicators.
  – The data are from the years 2008-2013.
  – The predicted variable is whether or not the client subscribed to the service.
• 45,211 observations
• 17 input features of mixed numerical and categorical data, including:
  • Contact communication type ('cellular' vs. 'telephone')
  • Client education level
  • Age
  • Contact day of week
  • Employment
  • Has loans
  • Number of days since last contact
  • Contact month
  • Previous contact outcome
  • Consumer Price Index
  • Duration of contact
  • Others
Source: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
[Moro et al., 2014] S. Moro, P. Cortez, and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
Slide 24: Vertica.dplyr: a Vertica adapter for dplyr
• dplyr is an R package that is quickly rising in popularity due to its convenient syntax and mechanisms for data manipulation in R:
  • filter() (and slice()), arrange(), select() (and rename()), distinct(), mutate() (and transmute()), summarise(), sample_n(), and sample_frac()
• dplyr supports SQL translation for many of these operations on databases, but requires specific drivers for different databases.
• These operations can be leveraged effectively for data preparation.
• Vertica.dplyr is not only a driver that lets dplyr work with Vertica, but also a means for us to integrate R more fully with Vertica and keep R users at ease with in-database data preparation (no SQL knowledge required!). A connection sketch follows below.
More info: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
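As a minimal connection sketch: the src_vertica() constructor and its dsn argument are our assumptions about the vertica.dplyr API, and "VerticaDSN" is a pre-configured ODBC DSN name reused from a later slide; verify both against the package docs:

library(vertica.dplyr)

# Connect to Vertica through a pre-configured ODBC DSN (assumed constructor)
vertica <- src_vertica(dsn="VerticaDSN")

# Reference a database table lazily; no data is pulled into R yet
orig <- tbl(vertica, "bank_orig")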
Slide 25: Creating the table and loading the data

Vertica.dplyr:
db_create_table(vertica$con, "bank_orig", columns)
orig <- db_load_from_file(vertica,
  file.name="/home/dbadmin/bank-additional/bank-additional-full.csv",
  "bank_orig", sep=";")

Equivalent SQL:
CREATE TABLE IF NOT EXISTS bank_original (
  age int, job varchar, marital varchar, education varchar,
  "default" varchar, housing varchar, loan varchar, contact varchar,
  month varchar, day_of_week char(5), duration int, campaign int,
  pdays int, "previous" int, poutcome varchar, "emp.var.rate" float,
  "cons.price.idx" float, "cons.conf.idx" float, euribor3m float,
  "nr.employed" float, y varchar(5));
COPY bank_original FROM '/home/dbadmin/bank-additional/bank-additional-full.csv';
Slide 27: Data Preparation Steps
1) Normalize some columns
2) Relabel low-frequency categorical data
3) Change categorical data to numerical data
4) Separate data into training and testing sets
Slide 29: Computing the Z-Score
1) To normalize the data, we compute the z-score for all quantitative variables:
   z = (x − μ) / σ
2) For these features, we first have to compute the mean and standard deviation:
   m_sd <- summarise(orig, m_age=mean(age), std_age=sd(age) ….
   summarise in dplyr collapses columns into aggregates.
3) Then we convert the quantitative values into z-scores for every observation:
   normalized <- mutate(orig, age_z=(age-z[["m_age"]])/z[["std_age"]] …..
   mutate creates new columns: in this case, new columns to store the z-scores.
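Pieced together for a single column, the pattern looks like the sketch below. The collect() step and the variable name z are our assumptions about the elided parts of the demo (the slide computes m_sd but indexes z):

# Compute the aggregates in-database, then pull the one-row result into R
m_sd <- summarise(orig, m_age=mean(age), std_age=sd(age))
z <- collect(m_sd)  # assumption: the demo collects the aggregates as 'z'

# Create a z-scored column for 'age'; the SQL is generated and run in Vertica
normalized <- mutate(orig, age_z=(age - z[["m_age"]]) / z[["std_age"]])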
Slide 31: Let's take a look at the categorical variables
1) Group and sort:
   job_group <- group_by(norm_table, job)
   arrange(summarise(job_group, freq=n()), desc(freq))
   SQL equivalent:
   SELECT job, COUNT(*) AS freq FROM bank_normalized GROUP BY job ORDER BY freq DESC;
2) Some categories have much higher frequencies than others. Let's use Vertica's DECODE function to relabel the low-frequency occurrences (a usage sketch follows below):
   decode(job,'"admin"',"admin",'"blue-collar"',"blue-collar",'"technician"',"technician",'"services"',"services",'"management"',"management","other")
32. 32
A B C D E F
Before
A B C Other
After
Reclassifying Low-Frequency Categorical
Occurrences
Slide 34: cat2num
Why? HPdglm() requires categories to be relabeled as numbers; i.e., classes must be changed to numbers.

cat2num("bank_top_n", dsn="VerticaDSN", dstTable="bank_top_n_num")

Example mapping:
Pet Type | Number
'Cat'    | 0
'Dog'    | 1
'Fish'   | 2
'Snake'  | 3
Slide 36: Separating data into testing and training sets

top_tbl <- tbl(vertica, "bank_top_n_num")
testing_set <- filter(top_tbl, random() < 0.2)
testing_set <- compute(testing_set, name="testing_set")
training_set <- compute(setdiff(top_tbl, testing_set), "training_set")

(compute() forces the lazily built query to run and materializes the result as a table in the database.)
Slide 40: Approach 1: Too many connections
Parallel loading from Distributed R = multiple concurrent ODBC connections, each requesting part of the same database table.
• 10 servers × 32 HT cores = 320 concurrent SQL queries
• Costly intra-node transfers in the DB
This overwhelms the database!
[Diagram: many Distributed R workers each opening their own connection to the database]
Slide 41: Solution: Reduce connections, DB pushes data
The master R process requests the table with a single SQL request:
• Provides a hint about the number of partitions
Vertica starts multiple UDFs:
• Read the table from the DB and divide the data into partitions
• Send the data in parallel to the Distributed R nodes (over the network)
Distributed R workers receive the data:
• Convert the data to in-memory R objects
[Diagram: the master issues one SQL request; the database pushes partitioned data directly to the workers]

Prasad, Shreya; Fard, Arash; Gupta, Vishrut; Martinez, Jorge; LeFevre, Jeff; Xu, Vincent; Hsu, Meichun; Roy, Indrajit (2015). "Enabling predictive analytics in Vertica: Fast data transfer, distributed model creation and in-database prediction". ACM SIGMOD International Conference on Management of Data (SIGMOD).
Slide 42: Package HPdata
Includes many functions for loading data into Distributed R, from Vertica as well as from the file system, such as:
1) db2darrays
2) db2dframe
3) db2matrix
4) file2dgraph
5) etc.
The DB functions take full advantage of the Vertica fast-transfer feature; a usage sketch follows below.
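As an illustrative sketch of loading the prepared training table into a distributed data frame: the db2dframe() argument names below are our assumptions, so consult the HPdata reference for the real signature.

library(HPdata)

# Pull the training table from Vertica into a Distributed R dframe over the
# fast-transfer path; "VerticaDSN" is a pre-configured ODBC DSN
training <- db2dframe(tableName="training_set", dsn="VerticaDSN")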
Slide 44: OK, I've got the model, but now how do I do my in-database prediction?

Step 1: Deploy the model

deploy.model(model=theModel,
             dsn='SF',
             modelName='demoModel',
             modelComments='A logistic regression model for bank data')

This converts the R model into a table in Vertica, where the parameters can be used to predict on new data.
[Diagram: the model moves from Distributed R (partitions P1-P4) back into the Vertica database]
Slide 45: OK, I've got the model, but now how do I do my in-database prediction?

Step 2: Run the prediction function, GLMpredict()

The Distributed R Extensions for HP Vertica pack contains a set of Vertica functions that increase the synergy between R and Vertica, with prediction functions for R models generated by:
• hpdglm/glm
• hpdkmeans/kmeans
• hpdrandomForest/randomForest
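To give a feel for the shape of the call only: the column list and parameter syntax below are purely our assumptions, since the actual GLMpredict() signature is defined by the extensions pack documentation. A scoring query might look roughly like:

-- Hypothetical sketch: score held-out rows with the deployed model
SELECT GLMpredict(age_z, duration_z, job, education
                  USING PARAMETERS model='demoModel') AS predicted_y
FROM testing_set;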
Slide 47: Summary
• Distributed R
  • Scalable, high-performance analytics for big data
  • Compatibility with R's massive package base at the executor level
  • Open source
  • Predictive analytics tool complementing Vertica in the HP Haven platform
• Vertica.dplyr
  • Leverages the power of dplyr for Vertica
  • Helps keep data sandboxing in R
  • Integration with Distributed R