ABSTRACT: Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high-performance, complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMAs). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.

- 1. High Performance Predictive Analytics in R and Hadoop. Presented by: Mario E. Inchiosa, Ph.D., US Chief Scientist. Hadoop Summit 2013, June 27, 2013.
- 2. Innovate with R
  - Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
  - Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
  - Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data
  - Thriving open-source community: leading edge of analytics research
  - Fills the talent gap: new graduates prefer R
  - Download the white paper "R is Hot": bit.ly/r-is-hot
- 3. R is open source and drives analytic innovation but has some limitations
  - Bigger data sizes: Memory Bound (open source R); Big Data (Revolution R Enterprise)
  - Speed of analysis: Single Threaded (open source R); Scale out, parallel processing, high speed (Revolution R Enterprise)
  - Production support: Community Support (open source R); Commercial production support (Revolution R Enterprise)
  - Innovation and scale: Innovative – 5000+ packages, exponential growth (open source R); Combines with open source R packages where needed (Revolution R Enterprise)
- 4. Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform
  - DeployR: Web Services Software Development Kit
  - DevelopR: Integrated Development Environment
  - ConnectR: High Speed & Direct Connectors. Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC
  - ScaleR: High Performance Big Data Analytics. Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers
  - DistributedR: Distributed Computing Framework. Platform LSF, MS HPC Server, MS Azure Burst
  - RevoR: Performance Enhanced Open Source R + CRAN packages. IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
  - Legend: Open Source R Plus Revolution Analytics performance enhancements; Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R
- 5. Our objectives with respect to Hadoop
  - Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today
  - Scalable and High Performance
  - Provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box Predictive Analytics suite running in Hadoop
- 6. Overview of Revolution and Hadoop
  - Revolution currently provides capabilities for using R and Hadoop together
    - RHadoop packages: rmr, rhdfs, rhbase
    - RevoScaleR package: Rapidly read and process data from HDFS
  - We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.
- 7. RHadoop: Map-Reduce with R
  - Unlock data in Hadoop using only the R language
  - No need to learn Java, Pig, Python or any other language
- 8. RevoScaleR Overview
  - An R package that adds capabilities to R:
    - Data Import/Clean/Explore/Transform
    - Analytics – Descriptive and Predictive
    - Parallel and distributed computing
    - Visualization
  - Scales from small local data to huge distributed data
  - Scales from laptop to server to cluster to cloud
  - Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
- 9. Key ways RevoScaleR enhances R
  - High Performance Computing (HPC) functions: Parallel/Distributed computing
  - High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing
  - XDF file format; rapidly store and extract data
  - Use results from HPA and HPC functions in other R packages
  - Visualization functions
- 10. High Performance Big Data Analytics with ScaleR: Statistical Tests, Machine Learning, Simulation, Descriptive Statistics, Data Visualization, R Data Step, Predictive Models, Sampling
- 11. ScaleR: High Performance Scalable Parallel External Memory Algorithms. Data Prep, Distillation & Descriptive Analytics
  - R Data Step: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort; Merge; Split; Aggregate by category (means, sums)
  - Descriptive Statistics: Min / Max; Mean; Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
  - Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
  - Sampling: Subsample (observations & variables); Random Sampling
- 12. ScaleR: High Performance Scalable Parallel External Memory Algorithms
  - Statistical Modeling
    - Predictive Models: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM), covering all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions including cauchit, identity, log, logit, probit, and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Classification & Regression Trees; Predictions/scoring for models; Residuals for all models
  - Data Visualization: Histogram; Line Plot; Scatter Plot; Lorenz Curve; ROC Curves (actual data and predicted values)
  - Machine Learning
    - Cluster Analysis: K-Means
    - Classification: Decision Trees
  - Simulation: Monte Carlo
- 13. HPC Capabilities in RevoScaleR
  - Execute (essentially) any R function in parallel on nodes and cores
  - Results from all runs are returned in a list to the user’s laptop
  - Extensive control over parameters
  - Extensive control over nodes, cores, and times to run
  - Ideal for simulations and for running R functions on small amounts of data
- 14. Key ScaleR HPA features
  - Handles an arbitrarily large number of rows in a fixed amount of memory
  - Scales linearly with the number of rows
  - Scales linearly with the number of nodes
  - Scales well with the number of cores per node
  - Scales well with the number of parameters
  - Extremely high performance
- 15. Regression comparison using in-memory data: lm() vs rxLinMod() [Chart; series: lm() in R, lm() in Revolution R, rxLinMod() in Revolution R]
- 16. GLM comparison using in-memory data: glm() vs rxGlm()
- 17. SAS HPA Benchmarking comparison*: Logistic Regression
  - Rows of data: 1 billion (SAS HPA); 1 billion (Revolution R)
  - Parameters: “just a few” (SAS HPA); 7 (Revolution R)
  - Time: 80 seconds (SAS HPA); 44 seconds (Revolution R)
  - Data location: In memory (SAS HPA); On disk (Revolution R)
  - Nodes: 32 (SAS HPA); 5 (Revolution R)
  - Cores: 384 (SAS HPA); 20 (Revolution R)
  - RAM: 1,536 GB (SAS HPA); 80 GB (Revolution R)
  - Slide callouts: Double, 45%, 1/6th, 5%, 5%
  - Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
  - Revolution R Enterprise Delivers Performance at 2% of the Cost
  - *As published by SAS in HPC Wire, April 21, 2011
- 18. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models (Approach: Platform; Time to fit)
  - SAS: 16-core Sun Server; 5 hours
  - rmr/MapReduce: 10-node (8 cores per node) Hadoop cluster; > 10 hours
  - Open source R: 250 GB Server; Impossible (> 3 days)
  - RevoScaleR: 5-node (4 cores / node) LSF cluster; 5.7 minutes
  - Model: Generalized linear model, 150 million observations, 70 degrees of freedom
  - http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
- 19. What makes ScaleR so fast? Lots of things! This kind of performance requires:
  - Careful architecting that supports performance
  - Constant, intense focus on all of the details
  - A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)
  - Extensive profiling and continuous benchmarking to detect problems and improve code
- 20. Specific speed-related factors -- 1
  - Efficient computational algorithms
  - Efficient memory management – minimize data copying and data conversion
  - Heavy use of C++ templates; optimal code
  - Efficient data file format; fast access by row and column
- 21. Specific speed-related issues -- 2
  - Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
  - Handle categorical variables efficiently
- 22. Parallel External Memory Algorithms (PEMAs)
  - The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
  - These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time in parallel across cores and nodes
- 23. Scalability and portability of Revo’s implementation of PEMAs
  - These PEMAs can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability
  - They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
  - They are independent of where the data is coming from, giving portability with respect to data sources
- 24. Simplified ScaleR Internal Architecture
  - Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
  - Inter-process Communication: MPI, RPC, Sockets, Files
  - Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
- 25. ScaleR on Hadoop – Base Implementation
  - Each iteration (pass through the data) is a separate MapReduce job
  - The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer
  - A master process decides if another pass through the data is required
  - Allow data to be stored temporarily (or longer) in XDF binary format for increased speed, especially on iterative algorithms
- 26. Performance enhancements
  - A single Map job that handles all iterations, to reduce overhead of starting multiple jobs
  - Tree structured communication and reduction to reduce time for those operations, especially on large clusters
  - Alternative communication schemes (e.g. sockets) instead of files
- 27. Functionality in common across HPA algorithms
  - Ability to do on-the-fly data transformations in the R language
    - In-formula: ArrDelay>15 ~ DayOfWeek + …
    - As transform: Late = ArrDelay>15
    - As transform function: Specify an R function to process an entire chunk of data at a time
  - Row selection
  - Both standard weights and frequency weights
- 28. Portability of user code
  - The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop
  - The same analytics code can be used on a small data.frame in memory and on a huge distributed data file
- 29. Sample code for logit on laptop

      # Specify local data source
      airData <- myLocalDataSource

      # Specify model formula and parameters
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
- 30. Sample code for logit on Hadoop

      # Change the "compute context"
      rxSetComputeContext(myHadoopCluster)

      # Change the data source if necessary
      airData <- myHadoopDataSource

      # Otherwise, the code is the same
      rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
                UniqueCarrier + F(CRSDepTime),
              data = airData)
- 31. Thank You! Mario Inchiosa
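
Slides 22 and 23 describe algorithms that process data a chunk at a time in a fixed amount of memory. The following is a minimal base R sketch of that idea for linear regression, not RevoScaleR's implementation: each chunk contributes only the small matrices X'X and X'y, which are added together before solving. The simulated data and the names `process_chunk` and `combine` are assumptions made for illustration.

```r
# Illustrative sketch in base R; hypothetical names, not RevoScaleR internals.
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# Split row indices into fixed-size chunks, standing in for blocks read from disk.
chunk_size <- 1000
chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))

# Per-chunk step: compute the small sufficient statistics X'X and X'y.
process_chunk <- function(rows) {
  X <- cbind(1, dat$x1[rows], dat$x2[rows])
  y <- dat$y[rows]
  list(xtx = crossprod(X), xty = crossprod(X, y))
}

# Combine step: intermediate results from two chunks are simply added.
combine <- function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)

total <- Reduce(combine, lapply(chunks, process_chunk))
beta_chunked <- drop(solve(total$xtx, total$xty))

# Reference fit that sees all rows at once.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))
```

Because the per-chunk results have a fixed size regardless of the number of rows, memory use in this sketch depends on the chunk size and the number of parameters, not on the total row count.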
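
Slide 25 outlines a base implementation in which each pass through the data is one MapReduce job and a master process decides whether another pass is needed. The sketch below imitates that control flow in base R for logistic regression using Newton steps; it runs in a single process and is only an assumption about how such a loop could look, not the ScaleR code. The names `map_chunk` and `reduce_pair`, the simulated data, and the stopping rule are all hypothetical.

```r
# Illustrative sketch in base R; hypothetical names, single process, no Hadoop.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))
X <- cbind(1, x)
chunks <- split(seq_len(n), ceiling(seq_len(n) / 500))

# "Mapper": for one chunk and the current coefficients, return that chunk's
# contribution to the gradient and Hessian of the log-likelihood.
map_chunk <- function(rows, beta) {
  Xc <- X[rows, , drop = FALSE]
  p <- plogis(drop(Xc %*% beta))
  list(grad = crossprod(Xc, y[rows] - p),
       hess = crossprod(Xc, Xc * (p * (1 - p))))
}

# "Combiner"/"reducer": intermediate result objects are summed.
reduce_pair <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

# "Master": one pass over all chunks per iteration, then decide whether
# another pass is required.
beta <- c(0, 0)
passes <- 0
repeat {
  passes <- passes + 1
  total <- Reduce(reduce_pair, lapply(chunks, map_chunk, beta = beta))
  step <- drop(solve(total$hess, total$grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-8 || passes >= 25) break
}
```

In this sketch the only state carried between passes is the coefficient vector, which is why each pass can be expressed as an independent job over the chunks.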
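
Slide 26 mentions tree structured communication and reduction. As a rough illustration of the general technique (not of how ScaleR implements it), the base R sketch below combines partial results pairwise, level by level. The function name `tree_reduce` and the use of integers as stand-ins for per-node intermediate results are assumptions.

```r
# Illustrative sketch in base R; hypothetical helper, runs sequentially.
# Partial results are combined in pairs, level by level. The pairs within a
# level are independent of each other, so on a cluster they could be combined
# concurrently, giving about log2(k) rounds for k partial results.
tree_reduce <- function(parts, combine) {
  rounds <- 0
  while (length(parts) > 1) {
    idx <- seq(1, length(parts), by = 2)
    parts <- lapply(idx, function(i) {
      # An unpaired last element is carried up to the next level unchanged.
      if (i < length(parts)) combine(parts[[i]], parts[[i + 1]]) else parts[[i]]
    })
    rounds <- rounds + 1
  }
  list(value = parts[[1]], rounds = rounds)
}

partials <- as.list(1:16)  # stand-ins for 16 per-node intermediate results
res <- tree_reduce(partials, `+`)
```

The total number of combine calls is the same as in a left-to-right reduction; the potential saving is in the number of sequential rounds, which matters most when there are many nodes.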
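
Slide 27 lists on-the-fly transformations and row selection applied while the data is being processed. The base R sketch below shows the general pattern of a transform function that receives one chunk at a time; it does not use or reproduce the RevoScaleR API. The data frame `flights`, the function `add_late`, and the choice to exclude Wednesday rows are made up for illustration.

```r
# Illustrative sketch in base R; hypothetical data and function names.
flights <- data.frame(
  ArrDelay  = c(-5, 20, 16, 0, 45, 15),
  DayOfWeek = factor(c("Mon", "Mon", "Tue", "Tue", "Wed", "Wed"))
)

# Transform function: takes one chunk (a data.frame) and returns it with a
# derived column, mirroring the slide's "Late = ArrDelay>15".
add_late <- function(chunk) {
  chunk$Late <- chunk$ArrDelay > 15
  chunk
}

# Three chunks of two rows each, standing in for blocks read from a large file.
chunks <- split(flights, rep(1:3, each = 2))

# Apply the transform and a row selection to each chunk as it is processed,
# then add up the per-chunk summaries.
late_counts <- Reduce(`+`, lapply(chunks, function(ch) {
  ch <- add_late(ch)
  ch <- ch[ch$DayOfWeek != "Wed", , drop = FALSE]  # row selection
  c(late = sum(ch$Late), rows = nrow(ch))
}))
```

The transformed column exists only while a chunk is in memory, so nothing in this sketch requires writing a transformed copy of the full data set.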
