Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 

Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632



Presentation Transcript

  • Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation. Strata and Hadoop World 2013. Joseph Rickert, Revolution Analytics
  • Model Building with RevoScaleR. Agenda:
    – The three realms of data
    – What is RevoScaleR?
    – RevoScaleR working beside Hadoop
    – RevoScaleR running within Hadoop
    – Run some code
  • The 3 Realms of Data: bridging the gaps between architectures
  • The 3 Realms of Data [chart: number of rows vs. architectural complexity; data in memory (~10^6 rows), data in a file in the realm of "chunking" (~10^11 rows), and data in multiple files in the realm of massive data (>10^12 rows)]
  • RevoScaleR (Revolution R Enterprise)
  • RevoScaleR: an R package that ships exclusively with Revolution R Enterprise
    – Implements Parallel External Memory Algorithms (PEMAs)
    – Provides functions to: import, clean, explore, and transform data; run statistical analysis and predictive analytics; enable distributed computing
    – Scales from small local data to huge distributed data
    – The same code works on small and big data, and on workstation, server, cluster, or Hadoop
    [Diagram: Revolution R Enterprise stack: DeployR, ConnectR, RevoScaleR, DistributedR]
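A minimal local workflow with these functions might look like the sketch below. The file names and variable names are assumptions for illustration, not files from the talk; rxImport, rxSummary, and rxHistogram are standard RevoScaleR calls.

      # Hedged sketch of a small local ScaleR workflow
      library(RevoScaleR)
      # Import CSV into the XDF binary format (fast row/column access);
      # "airline.csv" is a hypothetical input file
      airXdf <- rxImport(inData = "airline.csv",
                         outFile = "airline.xdf", overwrite = TRUE)
      # Summary statistics and a histogram, both run as PEMAs
      rxSummary(~ ArrDelay + DayOfWeek, data = airXdf)
      rxHistogram(~ ArrDelay, data = airXdf)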
  • Parallel External Memory Algorithms (PEMAs)
    – Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms
    – Process data a chunk at a time, in parallel across cores and nodes: 1. Initialize, 2. Process Chunk, 3. Aggregate, 4. Finalize
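To make the four-step pattern concrete, here is a minimal base-R sketch of an external-memory mean. It is not RevoScaleR internals, and it runs sequentially rather than in parallel, but it shows the same shape: initialize accumulators, process the file one chunk at a time, aggregate each chunk's partial result, and finalize.

      # Base-R illustration of the PEMA pattern: mean of a file of numbers,
      # one chunk at a time; RevoScaleR additionally runs the chunk step in
      # parallel across cores and nodes.
      pemaMean <- function(path, chunkLines = 10000) {
        con <- file(path, open = "r")
        on.exit(close(con))
        total <- 0; n <- 0                      # 1. Initialize accumulators
        repeat {
          chunk <- scan(con, what = numeric(),  # 2. Process one chunk
                        nlines = chunkLines, quiet = TRUE)
          if (length(chunk) == 0) break
          total <- total + sum(chunk)           # 3. Aggregate partial results
          n <- n + length(chunk)
        }
        total / n                               # 4. Finalize the statistic
      }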
  • RevoScaleR PEMAs: statistical modeling, machine learning, and predictive models
    – Statistics and models: covariance, correlation, sum of squares; multiple linear regression; generalized linear models (all exponential-family distributions, the Tweedie distribution, standard link functions, and user-defined distributions and link functions); classification and regression trees; decision forests; predictions/scoring for models; residuals for all models
    – Data visualization: histogram, line plot, Lorenz curve, ROC curves
    – Variable selection: stepwise regression, PCA
    – Cluster analysis: k-means
    – Classification: decision trees, decision forests
    – Simulation: parallel random number generators for Monte Carlo
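Hedged examples of calling a few of the PEMAs listed above; airXdf is a hypothetical XDF data source, and the formulas and argument values are illustrative assumptions rather than code from the talk.

      # Illustrative PEMA calls (data source and arguments are assumptions)
      airXdf <- RxXdfData("airline.xdf")
      # GLM with a Tweedie distribution
      rxGlm(ArrDelay ~ DayOfWeek + Distance,
            family = rxTweedie(var.power = 1.5), data = airXdf)
      # Regression tree and k-means clustering
      rxDTree(ArrDelay ~ DayOfWeek + CRSDepTime, data = airXdf)
      rxKmeans(~ ArrDelay + CRSDepTime, data = airXdf, numClusters = 5)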
  • GLM comparison using in-memory data: glm() and ScaleR's rxGlm() [benchmark chart]
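The comparison on this slide can be reproduced in spirit with a sketch like the following, which fits the same logistic model with base glm() and with rxGlm() on simulated in-memory data. The data and model are assumptions, and timings will vary with hardware and core count.

      # Hedged benchmark sketch: same model, glm() vs rxGlm(), in memory
      set.seed(1)
      n  <- 1e6
      df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
      df$y <- rbinom(n, 1, plogis(0.5 * df$x1 - 0.25 * df$x2))
      system.time(glm(y ~ x1 + x2, family = binomial(), data = df))
      system.time(rxGlm(y ~ x1 + x2, family = binomial(), data = df))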
  • PEMAs: Optimized for Performance
    – Handle an arbitrarily large number of rows in a fixed amount of memory
    – Scale linearly with the number of rows and with the number of nodes
    – Scale well with the number of cores per node and with the number of parameters
    – Efficient computational algorithms, memory management (minimize copying), and file format (fast access by row and column); heavy use of C++
    – Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities); categorical variables are handled efficiently
  • Write Once. Deploy Anywhere. Designed for scale, portability, and performance
    – Hadoop: Hortonworks, Cloudera
    – EDW: IBM, Teradata
    – Clustered systems: Platform LSF, Microsoft HPC
    – Workstations & servers: desktop, server, Linux
    – In the cloud: Microsoft Azure Burst, Amazon AWS
  • RRE beside or inside Hadoop?
  • Revolution R Enterprise Architecture (beside Hadoop)
    – Use Hadoop for data storage and data preparation
    – Use RevoScaleR on a connected server for predictive modeling
    – Use Hadoop for model deployment
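A hedged sketch of this "beside" pattern: data prepared in Hadoop is pulled from HDFS to the connected modeling server and fitted locally. The HDFS path, file names, and model are assumptions.

      # "Beside" pattern sketch: read prepared data out of HDFS, model locally
      hdfsFS  <- RxHdfsFileSystem()
      hdfsCsv <- RxTextData("/share/airline/airOT.csv", fileSystem = hdfsFS)
      # Copy to a local XDF file on the modeling server
      localXdf <- rxImport(inData = hdfsCsv, outFile = "airOT.xdf",
                           overwrite = TRUE)
      rxSetComputeContext("local")
      rxLinMod(ArrDelay ~ DayOfWeek, data = localXdf)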
  • A Simple Goal: Hadoop as an R Engine
    – Run Revolution R Enterprise code in Hadoop without change
    – Provide RevoScaleR pre-parallelized algorithms
    – Eliminate the need to "think in MapReduce" and eliminate data movement
  • Revolution R Enterprise Architecture (inside Hadoop) [diagram: HDFS Name Node and Data Nodes; MapReduce Job Tracker and Task Trackers]. Use RevoScaleR inside Hadoop for:
    – Data preparation
    – Model building
    – Custom small-data parallel programming
    – Model deployment
    – Late 2013: big-data predictive models with ScaleR
  • RRE in Hadoop [two architecture diagram slides: HDFS Name Node and Data Nodes; MapReduce Job Tracker and Task Trackers]
  • RevoScaleR on Hadoop: each pass through the data is one MapReduce job
    – Prediction (scoring), transformation, simulation: map tasks store results in HDFS or return them to the client
    – Statistics, model building, visualization: map tasks produce "intermediate result objects" that a reduce task aggregates; a master process decides whether another pass through the data is required
    – Data can be cached or stored in the XDF binary format for increased speed, especially with iterative algorithms
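Because iterative algorithms make one MapReduce pass per iteration, converting the HDFS text data to XDF once up front avoids re-parsing CSV on every pass. A hedged sketch, with the paths and the createCompositeSet argument as assumptions:

      # Cache HDFS CSV as a composite XDF set for faster repeated passes
      hdfsFS <- RxHdfsFileSystem()
      csvIn  <- RxTextData("/share/airline/airOT.csv", fileSystem = hdfsFS)
      xdfOut <- RxXdfData("/share/airline/airOT", fileSystem = hdfsFS,
                          createCompositeSet = TRUE)   # assumed argument
      rxImport(inData = csvIn, outFile = xdfOut, overwrite = TRUE)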
  • Let’s run some code.
  • Backup slides
  • Sample code: logit on workstation

        # Specify local data source
        airData <- myLocalDataSource
        # Specify model formula and parameters
        rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                  UniqueCarrier + F(CRSDepTime),
                data = airData)
  • Sample code for logit on Hadoop

        # Change the "compute context"
        rxSetComputeContext(myHadoopCluster)
        # Change the data source if necessary
        airData <- myHadoopDataSource
        # Otherwise, the code is the same
        rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                  UniqueCarrier + F(CRSDepTime),
                data = airData)
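The slides use myHadoopCluster and myHadoopDataSource without defining them; a hedged sketch of what those objects might look like follows (host names, paths, and parameter values are all assumptions):

      # Hypothetical definitions of the objects referenced above
      myHadoopCluster <- RxHadoopMR(
        sshUsername  = "analyst",                # assumed account
        sshHostname  = "namenode.example.com",   # assumed edge node
        hdfsShareDir = "/user/analyst/share",    # assumed HDFS directory
        shareDir     = "/var/RevoShare/analyst"  # assumed local directory
      )
      myHadoopDataSource <- RxTextData("/share/airline/airOT.csv",
                                       fileSystem = RxHdfsFileSystem())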
  • Demo: rxLinMod in Hadoop - launching [screenshot]
  • Demo: rxLinMod in Hadoop - in progress [screenshot]
  • Demo: rxLinMod in Hadoop - completed [screenshot]