Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation


Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632

  • Slide note: Coming soon, a "Beside vs. Inside" architecture slide to precede this one.
  • Transcript of "Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation"

    1. Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation. Strata and Hadoop World 2013. Joseph Rickert, Revolution Analytics.
    2. Model Building with RevoScaleR. Agenda:
       • The three realms of data
       • What is RevoScaleR?
       • RevoScaleR working beside Hadoop
       • RevoScaleR running within Hadoop
       • Run some code
    3. The 3 Realms of Data: Bridging the gaps between architectures.
    4. The 3 Realms of Data. [Chart: number of rows (axis ticks at 10^6, 10^11, and >10^12) versus architectural complexity; the three realms are data in memory, data in a file (the realm of "chunking"), and data in multiple files (the realm of massive data).]
    5. RevoScaleR (Revolution R Enterprise).
    6. RevoScaleR
       • An R package that ships exclusively with Revolution R Enterprise
       • Implements Parallel External Memory Algorithms (PEMAs)
       • Provides functions to:
         – Import, clean, explore and transform data
         – Perform statistical analysis and predictive analytics
         – Enable distributed computing
       • Scales from small local data to huge distributed data
       • The same code works on small and big data, and on workstation, server, cluster and Hadoop
       [Diagram: the Revolution R Enterprise stack: DeployR, ConnectR, RevoScaleR, DistributedR.]
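
    For reference, a minimal sketch of the import-and-explore workflow, with hypothetical file names: rxImport() converts a delimited text file to the XDF format, and rxSummary() then computes summary statistics a chunk at a time.

        library(RevoScaleR)

        csvFile <- "airline.csv"   # hypothetical input file
        xdfFile <- "airline.xdf"   # XDF file created by the import

        # Import the delimited text file into RevoScaleR's binary XDF format
        rxImport(inData = csvFile, outFile = xdfFile, overwrite = TRUE)

        # Summary statistics are computed a chunk at a time, so the data
        # never has to fit entirely in memory
        rxSummary(~ ArrDelay + DayOfWeek, data = xdfFile)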
    7. Parallel External Memory Algorithms (PEMAs)
       • Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
       • Process data a chunk at a time, in parallel across cores and nodes:
         1. Initialize
         2. Process chunk
         3. Aggregate
         4. Finalize
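
    To make the four steps concrete, here is a toy illustration in plain R (not the DistributedR implementation) of computing a mean over chunked data with the initialize / process-chunk / aggregate / finalize pattern.

        # Stand-in for data arriving a chunk at a time
        chunks <- split(rnorm(1e6), rep(1:100, each = 1e4))

        # 1. Initialize an empty intermediate result
        result <- list(sum = 0, n = 0)

        # 2. Process each chunk into a small intermediate result; these calls are
        #    independent, so they could run in parallel across cores and nodes
        intermediates <- lapply(chunks, function(x) list(sum = sum(x), n = length(x)))

        # 3. Aggregate the intermediate results
        for (ir in intermediates) {
          result$sum <- result$sum + ir$sum
          result$n   <- result$n + ir$n
        }

        # 4. Finalize: turn the aggregated result into the statistic of interest
        result$sum / result$n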
    8. RevoScaleR PEMAs
       • Statistical modeling, machine learning and predictive models: covariance, correlation, sum of squares; multiple linear regression; generalized linear models (all exponential-family distributions, the Tweedie distribution, standard link functions, and user-defined distributions and link functions); classification and regression trees; decision forests; predictions/scoring for models; residuals for all models
       • Data visualization: histogram, line plot, Lorenz curve, ROC curves
       • Variable selection: stepwise regression, PCA
       • Cluster analysis: k-means
       • Classification: decision trees, decision forests
       • Simulation: parallel random number generators for Monte Carlo
    9. GLM comparison using in-memory data: glm() and ScaleR's rxGlm().
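
    A minimal sketch of such a comparison (the slide's actual benchmark data and timings are not reproduced here), fitting the same logistic model with base R's glm() and with RevoScaleR's rxGlm() on an in-memory data frame.

        library(RevoScaleR)

        # Base R
        fit.glm <- glm(am ~ wt + hp, data = mtcars, family = binomial())

        # RevoScaleR; rxGlm() also accepts ordinary in-memory data frames
        fit.rx  <- rxGlm(am ~ wt + hp, data = mtcars, family = binomial())

        summary(fit.glm)
        summary(fit.rx)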
    10. PEMAs: Optimized for Performance
        • Arbitrarily large number of rows in a fixed amount of memory
        • Scales linearly with the number of rows and with the number of nodes
        • Scales well with the number of cores per node and with the number of parameters
        • Efficient:
          – computational algorithms
          – memory management that minimizes copying
          – a file format with fast access by row and column
          – heavy use of C++
        • Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
        • Handles categorical variables efficiently
    11. Write Once. Deploy Anywhere.
        • Hadoop: Hortonworks, Cloudera
        • EDW: IBM, Teradata
        • Clustered systems: Platform LSF, Microsoft HPC
        • Workstations and servers: desktop, server, Linux
        • In the cloud: Microsoft Azure Burst, Amazon AWS
        Designed for scale, portability and performance.
    12. RRE in Hadoop: beside or inside?
    13. Revolution R Enterprise Architecture (the "beside" option)
        • Use Hadoop for data storage and data preparation
        • Use RevoScaleR on a connected server for predictive modeling
        • Use Hadoop for model deployment
    14. A Simple Goal: Hadoop as an R Engine
        • Run Revolution R Enterprise code in Hadoop without change
        • Provide RevoScaleR pre-parallelized algorithms
        • Eliminate:
          – the need to "think in MapReduce"
          – data movement
    15. Revolution R Enterprise Architecture
        Use RevoScaleR inside Hadoop for:
        • Data preparation
        • Model building
        • Custom small-data parallel programming
        • Model deployment
        • Late 2013: big-data predictive models with ScaleR
        [Diagram: HDFS (name node, data nodes) and MapReduce (job tracker, task trackers).]
    16. RRE in Hadoop. [Diagram: HDFS name node with five data nodes, and MapReduce job tracker with five task trackers.]
    17. RRE in Hadoop. [Diagram: HDFS name node with five data nodes, and MapReduce job tracker with five task trackers.]
    18. RevoScaleR on Hadoop
        • Each pass through the data is one MapReduce job
        • Prediction (scoring), transformation, simulation:
          – Map tasks store results in HDFS or return them to the client
        • Statistics, model building, visualization:
          – Map tasks produce "intermediate result objects" that are aggregated by a Reduce task
          – A master process decides whether another pass through the data is required
        • Data can be cached or stored in the XDF binary format for increased speed, especially for iterative algorithms
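
    A hedged sketch of the XDF caching idea (HDFS paths are hypothetical, and exact arguments can differ by RevoScaleR version): copy a CSV already in HDFS into an XDF file in HDFS so that iterative algorithms can re-read it quickly.

        library(RevoScaleR)

        hdfsFS  <- RxHdfsFileSystem()
        csvHdfs <- RxTextData("/user/demo/airline.csv", fileSystem = hdfsFS)
        xdfHdfs <- RxXdfData("/user/demo/airlineXdf",   fileSystem = hdfsFS)

        # With a Hadoop compute context set (see slide 22), the import runs in the cluster
        rxImport(inData = csvHdfs, outFile = xdfHdfs, overwrite = TRUE)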
    19. Let's run some code.
    20. Backup slides.
    21. Sample code: logit on workstation

        # Specify local data source
        airData <- myLocalDataSource

        # Specify model formula and parameters
        rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime),
                data = airData)
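
    Here myLocalDataSource is a placeholder; one plausible definition (hypothetical path) is an XDF file of the airline data on the workstation. F(CRSDepTime) in the formula treats the numeric departure time as a factor on the fly.

        library(RevoScaleR)
        myLocalDataSource <- RxXdfData("C:/data/AirlineData.xdf")   # hypothetical path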
    22. Sample code for logit on Hadoop

        # Change the "compute context"
        rxSetComputeContext(myHadoopCluster)

        # Change the data source if necessary
        airData <- myHadoopDataSource

        # Otherwise, the code is the same
        rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime),
                data = airData)
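
    Here myHadoopCluster and myHadoopDataSource are placeholders; a hedged sketch of how they might be created (hostnames, user names and paths are hypothetical, and RxHadoopMR() arguments depend on the cluster setup):

        library(RevoScaleR)

        # Compute context that sends RevoScaleR jobs to a Hadoop cluster via MapReduce
        myHadoopCluster <- RxHadoopMR(
          sshUsername   = "analyst",             # hypothetical login on the cluster
          sshHostname   = "cluster-edge-node",   # hypothetical edge node
          consoleOutput = TRUE)

        # Data source pointing at a CSV file stored in HDFS
        myHadoopDataSource <- RxTextData(
          "/user/analyst/airline.csv",           # hypothetical HDFS path
          fileSystem = RxHdfsFileSystem())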
    23. Demo rxLinMod in Hadoop - Launching
    24. Demo rxLinMod in Hadoop - In Progress
    25. Demo rxLinMod in Hadoop - Completed