Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632

Notes:
  • Coming soon: A "Beside vs Inside" architecture slide to precede this one.
  • Transcript

    • 1. Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation. Strata and Hadoop World 2013. Joseph Rickert, Revolution Analytics.
    • 2. Model Building with RevoScaleR. Agenda: the three realms of data; what is RevoScaleR?; RevoScaleR working beside Hadoop; RevoScaleR running within Hadoop; run some code.
    • 3. The 3 Realms of Data: Bridging the gaps between architectures.
    • 4. The 3 Realms of Data. [Chart: architectural complexity vs. number of rows. Data in memory (up to ~10^6 rows), then data in a file, then data in multiple files; above ~10^11 rows, the realm of "chunking"; above 10^12, the realm of massive data.]
    • 5. RevoScaleR (Revolution R Enterprise).
    • 6. RevoScaleR: an R package that ships exclusively with Revolution R Enterprise. Implements Parallel External Memory Algorithms (PEMAs). Provides functions to import, clean, explore, and transform data; to run statistical analyses and predictive analytics; and to enable distributed computing. Scales from small local data to huge distributed data: the same code works on small and big data, and on workstation, server, cluster, or Hadoop. [Diagram of the Revolution R Enterprise stack: DeployR, ConnectR, RevoScaleR, DistributedR.]
    • 7. Parallel External Memory Algorithms (PEMAs). Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms. PEMAs process data a chunk at a time, in parallel across cores and nodes: 1. Initialize; 2. Process chunk; 3. Aggregate; 4. Finalize.
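The four PEMA steps on the slide can be sketched in a few lines of base R. This is not RevoScaleR's implementation (which is proprietary C++); it is a minimal illustration, using chunked linear regression, of how each chunk yields a small intermediate result (X'X and X'y) that aggregates exactly, so the full data never has to fit in memory at once. The function name `pema_lm` is made up for this sketch.

```r
# A minimal base-R sketch of the four-step PEMA pattern, using chunked
# linear regression as the example. Each chunk contributes only its
# cross-products; the raw rows are never held together in memory.
pema_lm <- function(chunks, formula) {
  xtx <- NULL; xty <- NULL                   # 1. Initialize accumulators
  for (chunk in chunks) {                    # 2. Process one chunk at a time
    X <- model.matrix(formula, chunk)
    y <- model.response(model.frame(formula, chunk))
    if (is.null(xtx)) {
      xtx <- crossprod(X); xty <- crossprod(X, y)
    } else {                                 # 3. Aggregate intermediate results
      xtx <- xtx + crossprod(X); xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))                      # 4. Finalize: solve normal equations
}

# The chunked result matches lm() on the full data:
full <- mtcars
chunks <- split(full, rep(1:4, length.out = nrow(full)))
coef_chunked <- pema_lm(chunks, mpg ~ wt + hp)
coef_full   <- coef(lm(mpg ~ wt + hp, data = full))
```

In RevoScaleR the "process chunk" step additionally runs in parallel across cores and nodes; the key property shown here is that the aggregation is exact, not an approximation.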
    • 8. RevoScaleR PEMAs.
      Statistical modeling and predictive models: covariance, correlation, sum of squares; multiple linear regression; generalized linear models (all exponential-family distributions plus the Tweedie distribution; standard link functions; user-defined distributions and link functions); classification and regression trees; decision forests; predictions/scoring for models; residuals for all models.
      Data visualization: histogram, line plot, Lorenz curve, ROC curves.
      Variable selection: stepwise regression, PCA.
      Cluster analysis: k-means.
      Classification: decision trees, decision forests.
      Simulation: parallel random number generators for Monte Carlo.
    • 9. GLM comparison using in-memory data: glm() and ScaleR's rxGlm(). [Benchmark chart.]
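Only the base-R side of this comparison can be run without Revolution R Enterprise. As a hedged sketch, here is a glm() fit of the kind being benchmarked, on simulated Poisson data with known coefficients; the corresponding rxGlm() call takes the same formula with a ScaleR data source. The data and coefficient values are invented for illustration.

```r
# A small, runnable glm() example of the kind benchmarked on this slide.
# rxGlm() requires Revolution R Enterprise, so only the base-R side is shown.
set.seed(42)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = runif(n))
# Simulate a Poisson response with a known log-linear mean
d$y <- rpois(n, lambda = exp(0.5 + 0.8 * d$x1 - 0.3 * d$x2))

fit <- glm(y ~ x1 + x2, data = d, family = poisson())
est <- coef(fit)  # estimates should land near (0.5, 0.8, -0.3)
```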
    • 10. PEMAs: optimized for performance. Handle an arbitrarily large number of rows in a fixed amount of memory. Scale linearly with the number of rows and with the number of nodes; scale well with the number of cores per node and with the number of parameters. Efficient computational algorithms, memory management (copying is minimized), and file format (fast access by row and column), with heavy use of C++. Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities); categorical variables are handled efficiently.
    • 11. Write Once. Deploy Anywhere. Hadoop (Hortonworks, Cloudera); EDW (IBM, Teradata); clustered systems (Platform LSF, Microsoft HPC); workstations and servers (desktop, server, Linux); in the cloud (Microsoft Azure Burst, Amazon AWS). Designed for scale, portability, and performance.
    • 12. RRE beside or inside Hadoop.
    • 13. Revolution R Enterprise architecture (beside Hadoop): use Hadoop for data storage and data preparation; use RevoScaleR on a connected server for predictive modeling; use Hadoop for model deployment.
    • 14. A simple goal: Hadoop as an R engine. Run Revolution R Enterprise code in Hadoop without change. Provide RevoScaleR's pre-parallelized algorithms. Eliminate the need to "think in MapReduce", and eliminate data movement.
    • 15. Revolution R Enterprise architecture (inside Hadoop). Use RevoScaleR inside Hadoop for: data preparation, model building, custom small-data parallel programming, and model deployment; late 2013: big-data predictive models with ScaleR. [Diagram: HDFS name node with data nodes; MapReduce job tracker with task trackers.]
    • 16. RRE in Hadoop. [Diagram: HDFS name node and data nodes; MapReduce job tracker and task trackers.]
    • 17. RRE in Hadoop. [Diagram repeated.]
    • 18. RevoScaleR on Hadoop. Each pass through the data is one MapReduce job. Prediction (scoring), transformation, simulation: map tasks store results in HDFS or return them to the client. Statistics, model building, visualization: map tasks produce "intermediate result objects" that are aggregated by a reduce task, and a master process decides whether another pass through the data is required. Data can be cached or stored in the XDF binary format for increased speed, especially for iterative algorithms.
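A base-R sketch of one such pass: each "map task" turns a chunk into a small intermediate result object, and a single "reduce" step combines them. Computing a mean and variance this way mirrors how a statistics pass can avoid moving the raw data; this is illustrative only, since the actual MapReduce jobs are generated by RevoScaleR, not written by hand.

```r
# Map: each chunk becomes a tiny intermediate result (n, sum, sum of squares).
map_chunk <- function(x) {
  list(n = length(x), sum = sum(x), sumsq = sum(x^2))
}
# Reduce: intermediate results combine by simple addition.
reduce_results <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

set.seed(7)
x <- rnorm(1e5, mean = 3, sd = 2)
chunks <- split(x, rep(1:10, length.out = length(x)))

intermediate <- lapply(chunks, map_chunk)    # the "map tasks"
agg <- Reduce(reduce_results, intermediate)  # the "reduce task"

chunked_mean <- agg$sum / agg$n
chunked_var  <- (agg$sumsq - agg$n * chunked_mean^2) / (agg$n - 1)
```

An iterative algorithm (e.g. logistic regression via IRLS) repeats such a pass, with the master inspecting the aggregated result to decide whether another MapReduce job is needed.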
    • 19. Let’s run some code.
    • 20. Backup slides
    • 21. Sample code: logit on workstation.

      # Specify local data source
      airData <- myLocalDataSource
      # Specify model formula and parameters
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
    • 22. Sample code for logit on Hadoop.

      # Change the "compute context"
      rxSetComputeContext(myHadoopCluster)
      # Change the data source if necessary
      airData <- myHadoopDataSource
      # Otherwise, the code is the same
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
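The rxLogit() calls above require Revolution R Enterprise and the airline data set, so as a hedged, runnable analogue here is the same kind of model in base R on simulated airline-style data. factor() plays the role of ScaleR's on-the-fly F() conversion, and glm(..., family = binomial()) stands in for rxLogit(); the column names follow the slides but the data itself is made up.

```r
# Base-R analogue of the rxLogit() model above, on simulated data.
set.seed(1)
n <- 5000
air <- data.frame(
  ArrDelay   = rpois(n, 12),                              # delay in minutes
  DayOfWeek  = factor(sample(1:7, n, replace = TRUE)),
  CRSDepTime = sample(0:23, n, replace = TRUE)            # departure hour
)
# factor(CRSDepTime) mimics ScaleR's F(CRSDepTime) on-the-fly conversion
fit <- glm((ArrDelay > 15) ~ DayOfWeek + factor(CRSDepTime),
           data = air, family = binomial())
```

The point of the two slides stands independently of this sketch: with RevoScaleR, only the compute context and data source change between workstation and Hadoop; the modeling call itself is identical.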
    • 23. Demo: rxLinMod in Hadoop (launching). [Screenshot.]
    • 24. Demo: rxLinMod in Hadoop (in progress). [Screenshot.]
    • 25. Demo: rxLinMod in Hadoop (completed). [Screenshot.]
