High Performance Predictive Analytics in R and Hadoop

Presented by:
Mario E. Inchiosa, Ph.D.
US Chief Scientist
Hadoop Summit 2013
June 27, 2013

Innovate with R
• Most widely used data analysis software
  - Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
  - Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
  - As seen in the New York Times, Twitter and Flowing Data
• Thriving open-source community
  - Leading edge of analytics research
• Fills the talent gap
  - New graduates prefer R
Download the White Paper "R is Hot": bit.ly/r-is-hot

R is open source and drives analytic innovation but has some limitations
Area | Open source R | Revolution R Enterprise
Bigger data sizes | Memory bound | Big data
Speed of analysis | Single threaded | Scale out, parallel processing, high speed
Production support | Community support | Commercial production support
Innovation and scale | Innovative: 5000+ packages, exponential growth | Combines with open source R packages where needed

Revolution R Enterprise
High Performance, Multi-Platform Analytics Platform
• DeployR: Web Services Software Development Kit
• DevelopR: Integrated Development Environment
• ConnectR: High Speed & Direct Connectors
  - Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC
• ScaleR: High Performance Big Data Analytics
  - Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers
• DistributedR: Distributed Computing Framework
  - Platform LSF, MS HPC Server, MS Azure Burst
• RevoR: Performance Enhanced Open Source R + CRAN packages
  - IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
Diagram legend:
• Open Source R plus Revolution Analytics performance enhancements
• Revolution Analytics value-add components providing power and scale to Open Source R

Our objectives with respect to Hadoop
• Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today
• Scalable and High Performance
• Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop

Overview of Revolution and Hadoop
• Revolution currently provides capabilities for using R and Hadoop together
  - RHadoop packages: rmr2, rhdfs, rhbase
  - RevoScaleR package: Rapidly read and process data from HDFS
• We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.

RHadoop: Map-Reduce with R
• Unlock data in Hadoop using only the R language
• No need to learn Java, Pig, Python or any other language

RevoScaleR Overview
• An R package that adds capabilities to R:
  - Data Import/Clean/Explore/Transform
  - Analytics: Descriptive and Predictive
  - Parallel and distributed computing
  - Visualization
• Scales from small local data to huge distributed data
• Scales from laptop to server to cluster to cloud
• Portable: the same code works on small and big data, and on laptop, server, cluster, Hadoop

Key ways RevoScaleR enhances R
• High Performance Computing (HPC) functions: Parallel/Distributed computing
• High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing
• XDF file format: rapidly store and extract data
• Use results from HPA and HPC functions in other R packages

High Performance Big Data Analytics with ScaleR
• Statistical Tests
• Machine Learning
• Simulation
• Descriptive Statistics
• Data Visualization
• R Data Step
• Predictive Models
• Sampling

ScaleR: High Performance Scalable Parallel External Memory Algorithms
Company Confidential – Do not distribute
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling

ScaleR: High Performance Scalable Parallel External Memory Algorithms
Statistical Modeling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM)
  - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie
  - Standard link functions including: cauchit, identity, log, logit, probit
  - User defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Data Visualization
• Histogram
• Line Plot
• Scatter Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
Simulation
• Monte Carlo

HPC Capabilities in RevoScaleR
• Execute (essentially) any R function in parallel on nodes and cores
• Results from all runs are returned in a list to the user’s laptop
• Extensive control over parameters
• Extensive control over nodes, cores, and times to run
• Ideal for simulations and for running R functions on small amounts of data
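
The run-a-function-many-times pattern above can be sketched with base R's parallel package. This is only a stand-in for illustration, not RevoScaleR's own HPC interface; the function name simulate_once, the seeds, and the core count are all hypothetical.

```r
library(parallel)

# Hypothetical simulation: estimate pi from n random points in the unit square.
simulate_once <- function(seed, n = 10000L) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Run the function 8 times; the results come back as a list, one element per run.
results <- mclapply(1:8, simulate_once, mc.cores = n_cores)

# Post-process the returned list on the calling machine.
pi_estimate <- mean(unlist(results))
print(pi_estimate)
```

Each run is independent and touches only a small amount of data, which is the situation the slide describes as ideal for this style of parallelism.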

Key ScaleR HPA features
• Handles an arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows
• Scales linearly with the number of nodes
• Scales well with the number of cores per node
• Scales well with the number of parameters
• Extremely high performance

Regression comparison using in-memory data: lm() vs rxLinMod()
[Chart comparing lm() in R, lm() in Revolution R, and rxLinMod() in Revolution R]

GLM comparison using in-memory data: glm() vs rxGlm()

SAS HPA Benchmarking comparison*
Logistic Regression
Metric | SAS HPA | Revolution R | Revolution R relative to SAS
Rows of data | 1 billion | 1 billion |
Parameters | “just a few” | 7 | Double
Time | 80 seconds | 44 seconds | 45% less
Data location | In memory | On disk |
Nodes | 32 | 5 | 1/6th
Cores | 384 | 20 | 5%
RAM | 1,536 GB | 80 GB | 5%
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Revolution R Enterprise Delivers Performance at 2% of the Cost

Allstate compares SAS, Hadoop and R for Big-Data Insurance Models
Approach | Platform | Time to fit
SAS | 16-core Sun Server | 5 hours
rmr/MapReduce | 10-node 80-core Hadoop Cluster | > 10 hours
R | 250 GB Server | Impossible (> 3 days)
Revolution R Enterprise | 5-node 20-core LSF cluster | 5.7 minutes
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

What makes ScaleR so fast?
• Lots of things!
• This kind of performance requires
  - Careful architecting that supports performance
  - Constant, intense focus on all of the details
  - A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)
  - Extensive profiling and continuous benchmarking to detect problems and improve code

Specific speed-related factors -- 1
• Efficient computational algorithms
• Efficient memory management: minimize data copying and data conversion
• Heavy use of C++ templates; optimal code
• Efficient data file format; fast access by row and column

Specific speed-related factors -- 2
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently

Parallel External Memory Algorithms (PEMAs)
• The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
• These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time in parallel across cores and nodes
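
The chunk-at-a-time idea can be sketched in base R for one familiar case, linear regression. This is an illustrative sketch under stated assumptions, not ScaleR's implementation: the data are synthetic, the chunk size is arbitrary, and the helper names process_chunk and combine are hypothetical.

```r
# Synthetic data standing in for a large on-disk data set (illustration only).
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# Split row indices into fixed-size chunks; each chunk is processed on its own.
chunk_size <- 1000
chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))

# Per-chunk step: reduce the chunk to small intermediate results X'X and X'y.
process_chunk <- function(rows) {
  X <- cbind(1, dat$x1[rows], dat$x2[rows])
  y <- dat$y[rows]
  list(xtx = crossprod(X), xty = crossprod(X, y))
}

# Combine step: intermediate results simply add up, in any order.
combine <- function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)

total <- Reduce(combine, lapply(chunks, process_chunk))

# Final step: solve the normal equations from the accumulated results.
beta_chunked <- drop(solve(total$xtx, total$xty))

# Reference fit on the full data, for comparison.
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))
print(cbind(beta_chunked, beta_lm))
```

The intermediate results have a fixed size regardless of the number of rows, and because the combine step is a plain sum, the per-chunk work could be spread across cores or nodes instead of the serial lapply() used here.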

Scalability and portability of Revo’s implementation of PEMAs
• PEMAs can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
• They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
• They are independent of where the data is coming from, giving portability with respect to data sources

Simplified ScaleR Internal Architecture
• Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
• Inter-process Communication: MPI, RPC, Sockets, Files
• Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF

ScaleR on Hadoop – Base Implementation
• Each iteration (pass through the data) is a separate MapReduce job
• The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer
• A master process decides if another pass through the data is required
• Allows data to be stored temporarily (or longer) in XDF binary format for increased speed, especially on iterative algorithms
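
The pass-per-iteration structure above can be imitated in plain R, with no Hadoop involved. The sketch below assumes a logistic regression fitted by Newton's method; the data are synthetic, and the names map_chunk and reduce_pair are hypothetical stand-ins for the mapper and the combiner/reducer.

```r
# Synthetic data, split into 5 hypothetical input splits (illustration only).
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
X <- cbind(1, x)
chunks <- split(seq_len(n), rep(1:5, length.out = n))

# "Mapper": for one chunk and the current coefficients, produce a small
# intermediate result object (Hessian and gradient contributions).
map_chunk <- function(rows, beta) {
  Xc <- X[rows, , drop = FALSE]
  p <- plogis(drop(Xc %*% beta))
  w <- p * (1 - p)
  list(hessian  = crossprod(Xc, Xc * w),       # X' W X for this chunk
       gradient = crossprod(Xc, y[rows] - p))  # X' (y - p) for this chunk
}

# "Combiner" / "reducer": intermediate results add up.
reduce_pair <- function(a, b) {
  list(hessian = a$hessian + b$hessian, gradient = a$gradient + b$gradient)
}

# "Master": one pass through the data per iteration, then decide whether
# another pass is required.
beta <- c(0, 0)
for (pass in 1:25) {
  total <- Reduce(reduce_pair, lapply(chunks, map_chunk, beta = beta))
  step <- drop(solve(total$hessian, total$gradient))
  beta <- beta + step
  if (max(abs(step)) < 1e-8) break  # converged: no further pass needed
}

# Reference fit with base R's glm(), for comparison.
beta_glm <- unname(coef(glm(y ~ x, family = binomial())))
print(rbind(beta, beta_glm))
```

In the base implementation described above, each trip around this loop would be a separate MapReduce job, which is why job start-up overhead matters for iterative algorithms.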

Performance enhancements
• A single Map job that handles all iterations, to reduce overhead of starting multiple jobs
• Tree-structured communication and reduction to reduce time for those operations, especially on large clusters
• Alternative communication schemes (e.g. sockets) instead of files
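
Tree-structured reduction can be illustrated with a small base R helper. This is a generic sketch of the technique, not ScaleR's communication code; the function name tree_reduce and the toy inputs are hypothetical.

```r
# Combine a list of partial results pairwise, level by level (a binary tree),
# instead of folding them one at a time into a single accumulator.
tree_reduce <- function(parts, combine) {
  stopifnot(length(parts) >= 1)
  while (length(parts) > 1) {
    n <- length(parts)
    pair_starts <- seq(1, n - 1, by = 2)
    merged <- lapply(pair_starts, function(i) combine(parts[[i]], parts[[i + 1]]))
    # With an odd count, the last element moves up to the next level unchanged.
    if (n %% 2 == 1) merged <- c(merged, parts[n])
    parts <- merged
  }
  parts[[1]]
}

# Toy partial results from 7 hypothetical workers.
parts <- lapply(1:7, function(i) i * 10)
total <- tree_reduce(parts, `+`)
print(total)  # 280
```

With 7 partial results the loop runs 3 levels (7 to 4 to 2 to 1), whereas a sequential fold needs 6 dependent steps. The merges within a level do not depend on one another, so on a cluster they could run at the same time.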

Thank You!
Mario Inchiosa
mario.inchiosa@revolutionanalytics.com

Backup Slides

Functionality in common across HPA algorithms
• Ability to do on-the-fly data transformations in the R language
  - In-formula: ArrDelay>15 ~ DayOfWeek + …
  - As transform: Late = ArrDelay>15
  - As transform function: Specify an R function to process an entire chunk of data at a time
• Row selection
• Both standard weights and frequency weights
Portability of user code
• The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop
• The same analytics code can be used on a small data.frame in memory and on a huge distributed data file

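Sample code for logit on laptop
# Specify local data source
# (myLocalDataSource is a placeholder for a data source defined elsewhere)
airData <- myLocalDataSource
# Specify model formula and parameters
# F() treats CRSDepTime as a categorical variable
rxLogit(ArrDelay > 15 ~ Origin + Year +
        Month + DayOfWeek + UniqueCarrier +
        F(CRSDepTime), data = airData)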

Sample code for logit on Hadoop
# Change the “compute context”
# (myHadoopCluster is a placeholder for a compute context defined elsewhere)
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year +
        Month + DayOfWeek + UniqueCarrier +
        F(CRSDepTime), data = airData)
