Model Building with
RevoScaleR
Using R and Hadoop for Statistical Computation
Strata and Hadoop World 2013

Joseph Rickert, Revolution Analytics
Model Building with RevoScaleR
Agenda:
The three realms of data
What is RevoScaleR?
RevoScaleR working beside Hadoop
RevoScaleR running within Hadoop
Run some code
2
The 3 Realms of Data

Bridging the gaps between architectures
The 3 Realms of Data
[Chart: y-axis "Number of rows" vs. x-axis "Architectural complexity". Data In Memory (up to ~10^6 rows); Data in a File and Data in Multiple Files, the realm of "chunking" (~10^11 rows); the realm of massive data (>10^12 rows).]
4
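The "realm of chunking" above can be made concrete with a short sketch. This is illustrative Python, not RevoScaleR code: the point is that only fixed-size running statistics stay in memory, so the number of rows can grow without bound.

```python
# Illustrative sketch of "chunking" (not RevoScaleR code): keep only
# fixed-size running statistics in memory, never the whole data set.

def chunked_mean(chunks):
    """Mean over an arbitrarily long stream of numeric chunks."""
    total, count = 0.0, 0        # fixed-size state, independent of data size
    for chunk in chunks:         # each chunk fits comfortably in memory
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Three small chunks standing in for blocks read from a large file
stream = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(chunked_mean(stream))  # 3.0
```

The same pattern generalizes from a mean to any statistic whose per-chunk contributions can be combined, which is what the PEMA slides later in this deck rely on.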
RevoScaleR
• An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
– Import, clean, explore, and transform data
– Perform statistical analysis and predictive analytics
– Enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster, and Hadoop
[Stack diagram: DeployR | ConnectR | RevoScaleR | DistributedR]
6
Parallel External Memory Algorithms (PEMAs)
• Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms
• Process data a chunk at a time, in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
[Stack diagram: DeployR | ConnectR | RevoScaleR | DistributedR]
7
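The four PEMA phases can be sketched in a few lines. This is a hypothetical Python illustration of the pattern (initialize, process chunk, aggregate, finalize), not the RevoScaleR implementation; the function names are invented for the example.

```python
# Hypothetical sketch of the four PEMA phases for a variance computation;
# not the RevoScaleR API. Each worker processes its chunks independently,
# and partial results combine associatively.

def initialize():
    return {"n": 0, "s": 0.0, "ss": 0.0}

def process_chunk(state, chunk):
    # Runs independently per core/node on one chunk of rows.
    state["n"] += len(chunk)
    state["s"] += sum(chunk)
    state["ss"] += sum(x * x for x in chunk)
    return state

def aggregate(a, b):
    # Partial results merge in any order, so workers need no coordination.
    return {k: a[k] + b[k] for k in a}

def finalize(state):
    mean = state["s"] / state["n"]
    return state["ss"] / state["n"] - mean * mean  # population variance

# Two "workers", each handling its own chunk
w1 = process_chunk(initialize(), [1.0, 2.0])
w2 = process_chunk(initialize(), [3.0, 4.0])
print(finalize(aggregate(w1, w2)))  # 1.25
```

Because `aggregate` is associative, the same code scales from two workers on one machine to many nodes in a cluster.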
RevoScaleR PEMAs

Statistical Modeling / Predictive Models
• Covariance, Correlation, Sum of Squares
• Multiple Linear Regression
• Generalized Linear Models: all exponential-family distributions, Tweedie distribution; standard link functions; user-defined distributions and link functions
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves

Variable Selection
• Stepwise Regression
• PCA

Cluster Analysis
• K-Means

Machine Learning / Classification
• Decision Trees
• Decision Forests

Simulation
• Parallel random number generators for Monte Carlo
8
GLM comparison using in-memory data: glm() and ScaleR's rxGlm()
[Benchmark chart omitted]
9
PEMAs: Optimized for Performance
• Arbitrarily large number of rows in a fixed amount of memory
• Scales linearly
– with the number of rows
– with the number of nodes
• Scales well
– with the number of cores per node
– with the number of parameters
• Efficient
– Computational algorithms
– Memory management: minimizes copying
– File format: fast access by row and column
• Heavy use of C++
• Models
– Pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
– Categorical variables handled efficiently
10
Write Once. Deploy Anywhere.
• Hadoop: Hortonworks, Cloudera
• EDW: IBM, Teradata
• Clustered Systems: Platform LSF, Microsoft HPC
• Workstations & Servers: Desktop, Server, Linux
• In the Cloud: Microsoft Azure Burst, Amazon AWS
[Stack diagram: DeployR | ConnectR | RevoScaleR | DistributedR]
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
11
RRE in Hadoop: beside or inside
12
Revolution R Enterprise Architecture
• Use Hadoop for data storage and data preparation
• Use RevoScaleR on a connected server for predictive modeling
• Use Hadoop for model deployment
A Simple Goal: Hadoop as an R Engine
• Run Revolution R Enterprise code in Hadoop without change
• Provide RevoScaleR pre-parallelized algorithms
• Eliminate:
– The need to "think in MapReduce"
– Data movement
14
Revolution R Enterprise Architecture
[Diagram: HDFS Name Node with Data Nodes; MapReduce Job Tracker with Task Trackers]
Use RevoScaleR inside Hadoop for:
• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: big-data predictive models with ScaleR
RRE in Hadoop
[Diagram (slides 16-17): HDFS Name Node with Data Nodes; MapReduce Job Tracker with Task Trackers]
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation:
– Map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization:
– Map tasks produce "intermediate result objects" that are aggregated by a Reduce task
– A master process decides whether another pass through the data is required
• Data can be cached or stored in the XDF binary format for increased speed, especially for iterative algorithms
18
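The one-pass-per-MapReduce-job flow above can be sketched as follows. This is an illustrative Python toy, not RevoScaleR internals: map tasks emit intermediate result objects for their data splits, a reduce step aggregates them, and the master schedules a second pass because the statistic (sum of squared deviations) needs the mean from pass one.

```python
# Illustrative toy (not RevoScaleR internals): each map task produces an
# intermediate result object per data split, a reduce step aggregates them,
# and a master loop decides whether another pass over the data is required.

def map_task(split, mean=None):
    if mean is None:  # pass 1: partial sums for the mean
        return {"n": len(split), "s": sum(split)}
    # pass 2: partial sums of squared deviations, using pass-1's mean
    return {"sse": sum((x - mean) ** 2 for x in split)}

def reduce_task(parts):
    # Aggregate intermediate result objects key by key.
    out = {}
    for p in parts:
        for k, v in p.items():
            out[k] = out.get(k, 0) + v
    return out

splits = [[1.0, 2.0], [3.0, 4.0, 5.0]]  # HDFS blocks, one per map task

r1 = reduce_task([map_task(s) for s in splits])        # MapReduce job 1
mean = r1["s"] / r1["n"]
r2 = reduce_task([map_task(s, mean) for s in splits])  # master runs job 2
print(mean, r2["sse"])  # 3.0 10.0
```

An iterative algorithm such as logistic regression repeats this master loop until convergence, which is why caching or XDF storage between passes pays off.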
Let’s run some code.
Backup slides
Sample code: logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit( ArrDelay>15 ~ Origin + Year + Month +
DayOfWeek + UniqueCarrier + F(CRSDepTime),
data=airData )

21
Sample code for logit on Hadoop
# Change the "compute context"
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
        UniqueCarrier + F(CRSDepTime), data = airData)

22
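The two samples above differ only in the compute context and data source; the model call is identical. The dispatch idea behind that can be sketched in Python. The names `set_compute_context` and `fit_model` are invented for this illustration; only `rxSetComputeContext` in the R code above is the real API.

```python
# Illustrative sketch (hypothetical names, not the RevoScaleR API) of the
# "compute context" idea: the same analysis call runs locally or as a
# cluster job depending on a globally set execution backend.

_context = "local"

def set_compute_context(ctx):
    global _context
    _context = ctx

def fit_model(formula, data):
    # Dispatch on the current context; the caller's code never changes.
    if _context == "local":
        return f"fit '{formula}' on {data} in-process"
    return f"fit '{formula}' on {data} via {_context} job"

print(fit_model("y ~ x", "airData"))  # runs locally
set_compute_context("hadoop")
print(fit_model("y ~ x", "airData"))  # same call, now a Hadoop job
```

Keeping the execution target out of the analysis code is what lets one script move from workstation to cluster unchanged.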
Demo rxLinMod in Hadoop - Launching


23
Demo rxLinMod in Hadoop - In Progress


24
Demo rxLinMod in Hadoop - Completed


25

Editor's Notes

  • #14 Coming soon: a "Beside vs. Inside" architecture slide to precede this one.