Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
2. Innovate with R
2
Most widely used data analysis software
Used by 2M+ data scientists, statisticians and analysts
Most powerful statistical programming language
Flexible, extensible and comprehensive for productivity
Create beautiful and unique data visualizations
As seen in New York Times, Twitter and Flowing Data
Thriving open-source community
Leading edge of analytics research
Fills the talent gap
New graduates prefer R
Download the White Paper
R is Hot
bit.ly/r-is-hot
3. R is open source and drives analytic innovation but has
some limitations
Bigger
data sizes
Speed of
analysis
Production
support
Memory Bound Big Data
Single Threaded Scale out, parallel
processing, high speed
Community Support Commercial
production support
Innovation
and scale
Innovative –
5000+
packages, exponential
growth
Combines with open
source R packages
where needed
4. 4
Revolution R Enterprise
High Performance, Multi-Platform Analytics Platform
Revolution R Enterprise
DeployR
Web Services
Software Development Kit
DevelopR
Integrated Development
Environment
ConnectR
High Speed & Direct Connectors
Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, OBDC
ScaleR
High Performance Big Data Analytics
Platform LSF, MS HPC Server,
MS Azure Burst, SMP Servers
DistributedR
Distributed Computing Framework
Platform LSF, MS HPC Server, MS
Azure Burst
RevoR
Performance Enhanced Open Source R + CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure
Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
Open Source R
Plus
Revolution Analytics
performance
enhancements
Revolution
Analytics
Value-Add
Components
Providing Power
and Scale to Open
Source R
5. Our objectives with respect to Hadoop
Allow our customers to do predictive
analytics as easily in Hadoop as they can
using R on their laptops today
Scalable and High Performance
Provide the first enterprise-
ready, commercially supported, full-
featured, out-of-the-box Predictive Analytics
suite running in Hadoop
5
6. Overview of Revolution and Hadoop
Revolution currently provides capabilities for
using R and Hadoop together
RHadoop packages: rmr2, rhdfs, rhbase
RevoScaleR package: Rapidly read and process
data from HDFS
We are in the process of porting
RevoScaleR to run fully “in Hadoop.” It
currently runs on
laptops, workstations, servers, and MPI-
based clusters.
6
7. RHadoop: Map-Reduce with R
Unlock data in Hadoop using only the R language
No need to learn Java, Pig, Python or any other language
7
8. RevoScaleR Overview
8
An R package that adds capabilities to R:
Data Import/Clean/Explore/Transform
Analytics – Descriptive and Predictive
Parallel and distributed computing
Visualization
Scales from small local data to huge distributed
data
Scales from laptop to server to cluster to cloud
Portable – the same code works on small and
big data, and on laptop, server, cluster, Hadoop
9. Key ways RevoScaleR enhances R
High Performance Computing (HPC)
functions: Parallel/Distributed computing
High Performance Analytics (HPA) functions:
Big Data + Parallel/Distributed computing
XDF file format; rapidly store and extract
data
Use results from HPA and HPC functions in
other R packages
Revolution R Enterprise 9
10. High Performance Big Data Analytics with
ScaleR
10
Statistical
Tests
Machine
Learning
Simulation
Descriptive
Statistics
Data
Visualization
R Data Step
Predictive
Models
Sampling
11. ScaleR: High Performance Scalable
Parallel External Memory Algorithms
Company Confidential – Do not distribute 11
Data import –
Delimited, Fixed, SAS, SPSS,
OBDC
Variable creation &
transformation
Recode variables
Factor variables
Missing value handling
Sort
Merge
Split
Aggregate by category
(means, sums)
Min / Max
Mean
Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product
matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data
(standard tables & long form)
Marginal Summaries of Cross
Tabulations
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Data Prep, Distillation & Descriptive Analytics
Subsample (observations &
variables)
Random Sampling
R Data Step Statistical Tests
Sampling
Descriptive Statistics
12. ScaleR: High Performance Scalable
Parallel External Memory Algorithms
Company Confidential – Do not distribute 12
Sum of Squares (cross product
matrix for set variables)
Multiple Linear Regression
Generalized Linear Models (GLM)
- All exponential family
distributions:
binomial, Gaussian, inverse
Gaussian, Poisson, Tweedie.
Standard link functions including:
cauchit, identity, log, logit, probit.
User defined distributions & link
functions.
Covariance & Correlation
Matrices
Logistic Regression
Classification & Regression Trees
Predictions/scoring for models
Residuals for all models
Histogram
Line Plot
Scatter Plot
Lorenz Curve
ROC Curves (actual data and
predicted values)
K-Means
Statistical Modeling
Decision Trees
Predictive Models Cluster AnalysisData Visualization
Classification
Machine Learning
Simulation
Monte Carlo
13. HPC Capabilities in RevoScaleR
Execute (essentially) any R function in
parallel on nodes and cores
Results from all runs are returned in a list to
the user’s laptop
Extensive control over parameters
Extensive control over nodes, cores, and
times to run
Ideal for simulations and for running R
functions on small amounts of data
Revolution R Enterprise 13
14. Key ScaleR HPA features
Handles an arbitrarily large number of rows
in a fixed amount of memory
Scales linearly with the number of rows
Scales linearly with the number of nodes
Scales well with the number of cores per
node
Scales well with the number of parameters
Extremely high performance
Revolution R Enterprise 14
15.
16.
17. Regression comparison using in-memory
data: lm() vs rxLinMod()
Revolution R Enterprise 17
lm() in R
lm() in Revolution R
rxLinMod() in Revolution R
18. GLM comparison using in-memory data:
glm() vs rxGlm()
Revolution R Enterprise 18
19. SAS HPA Benchmarking comparison*
Logistic Regression
Rows of data 1 billion 1 billion
Parameters “just a few” 7
Time 80 seconds 44 seconds
Data location In memory On disk
Nodes 32 5
Cores 384 20
RAM 1,536 GB 80 GB
19
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a
20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Double
45%
1/6th
5%
5%
Revolution R Enterprise Delivers Performance at 2% of the Cost
20. Allstate compares SAS, Hadoop and R
for Big-Data Insurance Models
Approach Platform Time to fit
SAS 16-core Sun Server 5 hours
rmr/MapReduce 10-node 80-core
Hadoop Cluster
> 10 hours
R 250 GB Server Impossible (> 3 days)
Revolution R Enterprise 5-node 20-core
LSF cluster
5.7 minutes
Revolution R Enterprise 20
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
21. What makes ScaleR so fast?
Lot’s of things!
This kind of performance requires
Careful architecting that supports performance
Constant, intense focus on all of the details
A review of every line of code with an eye to
performance (in addition to giving the correct
answers, of course)
Extensive profiling and continuous
benchmarking to detect problems and improve
code
Revolution R Enterprise 21
22. Specific speed-related factors -- 1
Efficient computational algorithms
Efficient memory management – minimize
data copying and data conversion
Heavy use of C++ templates; optimal code
Efficient data file format; fast access by row
and column
Revolution R Enterprise 22
23. Specific speed-related factors -- 2
Models are pre-analyzed to detect and
remove duplicate computations and points of
failure (singularities)
Handle categorical variables efficiently
Revolution R Enterprise 23
24. Parallel External Memory Algorithms
(PEMA’s)
The HPA analytics algorithms are all built on
a platform that efficiently parallelizes a broad
class of statistical, data mining and machine
learning algorithms
These Parallel External Memory Algorithms
(PEMA’s) process data a chunk at a time in
parallel across cores and nodes
Revolution R Enterprise 24
25. Scalability and portability of Revo’s
implementation of PEMA’s
These PEMA algorithms can process an unlimited
number of rows of data in a fixed amount of RAM.
They process a chunk of data at a time, giving
linear scalability
They are independent of the “compute context”
(number of cores, computers, distributed
computing platform), giving portability across these
dimensions
They are independent of where the data is coming
from, giving portability with respect to data sources
Revolution R Enterprise 25
26. Simplified ScaleR Internal Architecture
Revolution R Enterprise 26
Analytics Engine
PEMA’s are implemented here
(Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication
MPI, RPC, Sockets, Files
Data Sources
HDFS, Teradata, ODBC, SAS, SPSS, CSV
, Fixed, XDF
27. ScaleR on Hadoop – Base Implementation
Each iteration (pass through the data) is a separate
MapReduce job
The mapper produces “intermediate result objects”
for each task that are combined using a combiner
and then a reducer
A master process decides if another pass through
the data is required
Allow data to be stored temporarily (or longer) in
XDF binary format for increased speed especially
on iterative algorithms
Revolution R Enterprise 27
28. Performance enhancements
A single Map job that handles all iterations,
to reduce overhead of starting multiple jobs
Tree structured communication and
reduction to reduce time for those
operations, especially on large clusters
Alternative communication schemes (e.g.
sockets) instead of files
Revolution R Enterprise 28
31. Functionality in common across HPA
algorithms
Ability to do on-the-fly data transformations
in the R language
In-formula: ArrDelay>15 ~ DayOfWeek + …
As transform: Late = ArrDelay>15
As transform function: Specify an R function to
process an entire chunk of data at a time
Row selection
Both standard weights and frequency
weights
Revolution R Enterprise 31
32. Portability of user code
The same analytics code can be used on a
laptop, server, MPI cluster, or Hadoop
The same analytics code can be used on a
small data.frame in memory and on a huge
distributed data file
Revolution R Enterprise 32
33. Sample code for logit on laptop
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 33
34. Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 34
Editor's Notes
More information:KDNuggets Software Poll: http://blog.revolutionanalytics.com/2013/06/kdnuggets-2013-poll-results.htmlCompanies using R: http://blog.revolutionanalytics.com/2013/05/companies-using-open-source-r-in-2013.htmlR is Hot white paper: bit.ly/r-is-hot
SAS HPA benchmarks published in:http://www.hpcwire.com/hpcwire/2011-04-19/sas_brings_high_performance_analytics_to_database_appliances.htmlMore information: “SAS Benchmarking FAQ” in Sales Tools on GDrive.