Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

High Performance Predictive Analytics in R and Hadoop


Published on

Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.

Published in: Technology

High Performance Predictive Analytics in R and Hadoop

  1. 1. High Performance Predictive Analytics in R and Hadoop Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist Hadoop Summit 2013 June 27, 2013 1
  2. 2. Innovate with R 2  Most widely used data analysis software  Used by 2M+ data scientists, statisticians and analysts  Most powerful statistical programming language  Flexible, extensible and comprehensive for productivity  Create beautiful and unique data visualizations  As seen in New York Times, Twitter and Flowing Data  Thriving open-source community  Leading edge of analytics research  Fills the talent gap  New graduates prefer R Download the White Paper R is Hot
  3. 3. R is open source and drives analytic innovation but has some limitations Bigger data sizes Speed of analysis Production support Memory Bound Big Data Single Threaded Scale out, parallel processing, high speed Community Support Commercial production support Innovation and scale Innovative – 5000+ packages, exponential growth Combines with open source R packages where needed
  4. 4. 4 Revolution R Enterprise High Performance, Multi-Platform Analytics Platform Revolution R Enterprise DeployR Web Services Software Development Kit DevelopR Integrated Development Environment ConnectR High Speed & Direct Connectors Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, OBDC ScaleR High Performance Big Data Analytics Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers DistributedR Distributed Computing Framework Platform LSF, MS HPC Server, MS Azure Burst RevoR Performance Enhanced Open Source R + CRAN packages IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers Open Source R Plus Revolution Analytics performance enhancements Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R
  5. 5. Our objectives with respect to Hadoop  Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today  Scalable and High Performance  Provide the first enterprise- ready, commercially supported, full- featured, out-of-the-box Predictive Analytics suite running in Hadoop 5
  6. 6. Overview of Revolution and Hadoop  Revolution currently provides capabilities for using R and Hadoop together  RHadoop packages: rmr2, rhdfs, rhbase  RevoScaleR package: Rapidly read and process data from HDFS  We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI- based clusters. 6
  7. 7. RHadoop: Map-Reduce with R  Unlock data in Hadoop using only the R language  No need to learn Java, Pig, Python or any other language 7
  8. 8. RevoScaleR Overview 8  An R package that adds capabilities to R:  Data Import/Clean/Explore/Transform  Analytics – Descriptive and Predictive  Parallel and distributed computing  Visualization  Scales from small local data to huge distributed data  Scales from laptop to server to cluster to cloud  Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
  9. 9. Key ways RevoScaleR enhances R  High Performance Computing (HPC) functions: Parallel/Distributed computing  High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing  XDF file format; rapidly store and extract data  Use results from HPA and HPC functions in other R packages Revolution R Enterprise 9
  10. 10. High Performance Big Data Analytics with ScaleR 10 Statistical Tests Machine Learning Simulation Descriptive Statistics Data Visualization R Data Step Predictive Models Sampling
  11. 11. ScaleR: High Performance Scalable Parallel External Memory Algorithms Company Confidential – Do not distribute 11  Data import – Delimited, Fixed, SAS, SPSS, OBDC  Variable creation & transformation  Recode variables  Factor variables  Missing value handling  Sort  Merge  Split  Aggregate by category (means, sums)  Min / Max  Mean  Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test Data Prep, Distillation & Descriptive Analytics  Subsample (observations & variables)  Random Sampling R Data Step Statistical Tests Sampling Descriptive Statistics
  12. 12. ScaleR: High Performance Scalable Parallel External Memory Algorithms Company Confidential – Do not distribute 12  Sum of Squares (cross product matrix for set variables)  Multiple Linear Regression  Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.  Covariance & Correlation Matrices  Logistic Regression  Classification & Regression Trees  Predictions/scoring for models  Residuals for all models  Histogram  Line Plot  Scatter Plot  Lorenz Curve  ROC Curves (actual data and predicted values)  K-Means Statistical Modeling  Decision Trees Predictive Models Cluster AnalysisData Visualization Classification Machine Learning Simulation  Monte Carlo
  13. 13. HPC Capabilities in RevoScaleR  Execute (essentially) any R function in parallel on nodes and cores  Results from all runs are returned in a list to the user’s laptop  Extensive control over parameters  Extensive control over nodes, cores, and times to run  Ideal for simulations and for running R functions on small amounts of data Revolution R Enterprise 13
  14. 14. Key ScaleR HPA features  Handles an arbitrarily large number of rows in a fixed amount of memory  Scales linearly with the number of rows  Scales linearly with the number of nodes  Scales well with the number of cores per node  Scales well with the number of parameters  Extremely high performance Revolution R Enterprise 14
  15. 15. Regression comparison using in-memory data: lm() vs rxLinMod() Revolution R Enterprise 17 lm() in R lm() in Revolution R rxLinMod() in Revolution R
  16. 16. GLM comparison using in-memory data: glm() vs rxGlm() Revolution R Enterprise 18
  17. 17. SAS HPA Benchmarking comparison* Logistic Regression Rows of data 1 billion 1 billion Parameters “just a few” 7 Time 80 seconds 44 seconds Data location In memory On disk Nodes 32 5 Cores 384 20 RAM 1,536 GB 80 GB 19 Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. *As published by SAS in HPC Wire, April 21, 2011 Double 45% 1/6th 5% 5% Revolution R Enterprise Delivers Performance at 2% of the Cost
  18. 18. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models Approach Platform Time to fit SAS 16-core Sun Server 5 hours rmr/MapReduce 10-node 80-core Hadoop Cluster > 10 hours R 250 GB Server Impossible (> 3 days) Revolution R Enterprise 5-node 20-core LSF cluster 5.7 minutes Revolution R Enterprise 20 Generalized linear model, 150 million observations, 70 degrees of freedom
  19. 19. What makes ScaleR so fast?  Lot’s of things!  This kind of performance requires  Careful architecting that supports performance  Constant, intense focus on all of the details  A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)  Extensive profiling and continuous benchmarking to detect problems and improve code Revolution R Enterprise 21
  20. 20. Specific speed-related factors -- 1  Efficient computational algorithms  Efficient memory management – minimize data copying and data conversion  Heavy use of C++ templates; optimal code  Efficient data file format; fast access by row and column Revolution R Enterprise 22
  21. 21. Specific speed-related factors -- 2  Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)  Handle categorical variables efficiently Revolution R Enterprise 23
  22. 22. Parallel External Memory Algorithms (PEMA’s)  The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms  These Parallel External Memory Algorithms (PEMA’s) process data a chunk at a time in parallel across cores and nodes Revolution R Enterprise 24
  23. 23. Scalability and portability of Revo’s implementation of PEMA’s  These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability  They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions  They are independent of where the data is coming from, giving portability with respect to data sources Revolution R Enterprise 25
  24. 24. Simplified ScaleR Internal Architecture Revolution R Enterprise 26 Analytics Engine PEMA’s are implemented here (Scalable, Parallelized, Threaded, Distributable) Inter-process Communication MPI, RPC, Sockets, Files Data Sources HDFS, Teradata, ODBC, SAS, SPSS, CSV , Fixed, XDF
  25. 25. ScaleR on Hadoop – Base Implementation  Each iteration (pass through the data) is a separate MapReduce job  The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer  A master process decides if another pass through the data is required  Allow data to be stored temporarily (or longer) in XDF binary format for increased speed especially on iterative algorithms Revolution R Enterprise 27
  26. 26. Performance enhancements  A single Map job that handles all iterations, to reduce overhead of starting multiple jobs  Tree structured communication and reduction to reduce time for those operations, especially on large clusters  Alternative communication schemes (e.g. sockets) instead of files Revolution R Enterprise 28
  27. 27. Thank You! Mario Inchiosa Revolution R Enterprise 29
  28. 28. Backup Slides Revolution R Enterprise 30
  29. 29. Functionality in common across HPA algorithms  Ability to do on-the-fly data transformations in the R language  In-formula: ArrDelay>15 ~ DayOfWeek + …  As transform: Late = ArrDelay>15  As transform function: Specify an R function to process an entire chunk of data at a time  Row selection  Both standard weights and frequency weights Revolution R Enterprise 31
  30. 30. Portability of user code  The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop  The same analytics code can be used on a small data.frame in memory and on a huge distributed data file Revolution R Enterprise 32
  31. 31. Sample code for logit on laptop # Specify local data source airData <- myLocalDataSource # Specify model formula and parameters rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 33
  32. 32. Sample code for logit on Hadoop # Change the “compute context” rxSetComputeContext(myHadoopCluster) # Change the data source if necessary airData <- myHadoopDataSource # Otherwise, the code is the same rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 34