
- 1. High Performance Predictive Analytics in R and Hadoop Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist Hadoop Summit 2013 June 27, 2013 1
- 2. Innovate with R
  Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
  Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
  Create beautiful and unique data visualizations: as seen in the New York Times, Twitter and Flowing Data
  Thriving open-source community: leading edge of analytics research
  Fills the talent gap: new graduates prefer R
  Download the white paper "R is Hot": bit.ly/r-is-hot
- 3. R is open source and drives analytic innovation, but has some limitations: bigger data sizes, speed of analysis, and production support.
  Memory bound → Big data
  Single threaded → Scale out, parallel processing, high speed
  Community support → Commercial production support
  Innovation and scale: innovative (5000+ packages, exponential growth); combines with open source R packages where needed
- 4. Revolution R Enterprise: High Performance, Multi-Platform Analytics Platform. Revolution Analytics value-add components provide power and scale to open source R:
  DeployR: Web Services Software Development Kit
  DevelopR: Integrated Development Environment
  ConnectR: High Speed & Direct Connectors (Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, ODBC)
  ScaleR: High Performance Big Data Analytics (Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers)
  DistributedR: Distributed Computing Framework (Platform LSF, MS HPC Server, MS Azure Burst)
  RevoR: Performance Enhanced Open Source R + CRAN packages (IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers)
- 5. Our objectives with respect to Hadoop: allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today, scalable and high performance; and provide the first enterprise-ready, commercially supported, full-featured, out-of-the-box predictive analytics suite running in Hadoop.
- 6. Overview of Revolution and Hadoop: Revolution currently provides capabilities for using R and Hadoop together. RHadoop packages: rmr2, rhdfs, rhbase. RevoScaleR package: rapidly read and process data from HDFS. We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters.
- 7. RHadoop: Map-Reduce with R Unlock data in Hadoop using only the R language No need to learn Java, Pig, Python or any other language 7
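The rmr2 style mentioned above can be sketched with a minimal word count. This is illustrative only: it assumes rmr2 is installed and a Hadoop cluster (or rmr2's local backend) is configured, and the input text is made up.

```r
library(rmr2)

# Write a small sample of text to HDFS (hypothetical input; any character vector works)
lines <- to.dfs(c("big data in R", "R on Hadoop", "big R"))

# Map: emit (word, 1) for each word; Reduce: sum the counts per word.
# All logic is expressed in R -- no Java, Pig, or Python required.
wordcount <- mapreduce(
  input  = lines,
  map    = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
  reduce = function(word, counts) keyval(word, sum(counts))
)

# Pull the (word, count) pairs back into the R session
from.dfs(wordcount)
```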
- 8. RevoScaleR Overview 8 An R package that adds capabilities to R: Data Import/Clean/Explore/Transform Analytics – Descriptive and Predictive Parallel and distributed computing Visualization Scales from small local data to huge distributed data Scales from laptop to server to cluster to cloud Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
- 9. Key ways RevoScaleR enhances R High Performance Computing (HPC) functions: Parallel/Distributed computing High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing XDF file format; rapidly store and extract data Use results from HPA and HPC functions in other R packages Revolution R Enterprise 9
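The XDF workflow above can be sketched as follows. File names and variable names here are hypothetical; the pattern assumes an airline-style CSV with `ArrDelay` and `DayOfWeek` columns.

```r
library(RevoScaleR)

# Convert a CSV to the binary XDF format for fast access by row and column
rxImport(inData = "flights.csv", outFile = "flights.xdf", overwrite = TRUE)

# Descriptive statistics computed chunk-wise over the XDF file,
# without loading the whole dataset into memory
rxSummary(~ ArrDelay + DayOfWeek, data = "flights.xdf")
```

Results returned by HPA functions like rxSummary are ordinary R objects, so they can be passed on to other R packages as the slide notes.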
- 10. High Performance Big Data Analytics with ScaleR: R Data Step, Descriptive Statistics, Statistical Tests, Sampling, Predictive Models, Machine Learning, Data Visualization, Simulation
- 11. ScaleR: Data Prep, Distillation & Descriptive Analytics (High Performance Scalable Parallel External Memory Algorithms)
  R Data Step: data import (delimited, fixed, SAS, SPSS, ODBC); variable creation & transformation; recode variables; factor variables; missing value handling; sort; merge; split; aggregate by category (means, sums)
  Descriptive Statistics: min/max; mean; median (approx.); quantiles (approx.); standard deviation; variance; correlation; covariance; sum of squares (cross-product matrix for set variables); pairwise cross tabs; risk ratio & odds ratio; cross-tabulation of data (standard tables & long form); marginal summaries of cross tabulations
  Statistical Tests: chi-square test; Kendall rank correlation; Fisher's exact test; Student's t-test
  Sampling: subsample (observations & variables); random sampling
  Company Confidential – Do not distribute
- 12. ScaleR: Statistical Modeling & Machine Learning (High Performance Scalable Parallel External Memory Algorithms)
  Predictive Models: sum of squares (cross-product matrix for set variables); multiple linear regression; generalized linear models (GLM) with all exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user-defined distributions & link functions; covariance & correlation matrices; logistic regression; predictions/scoring for models; residuals for all models
  Decision Trees: classification & regression trees
  Cluster Analysis: k-means
  Data Visualization: histogram; line plot; scatter plot; Lorenz curve; ROC curves (actual data and predicted values)
  Simulation: Monte Carlo
- 13. HPC Capabilities in RevoScaleR Execute (essentially) any R function in parallel on nodes and cores Results from all runs are returned in a list to the user’s laptop Extensive control over parameters Extensive control over nodes, cores, and times to run Ideal for simulations and for running R functions on small amounts of data Revolution R Enterprise 13
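The rxExec pattern described above can be sketched with a toy simulation. This is a minimal example, assuming a compute context has already been set; `playDie` is a made-up function name.

```r
library(RevoScaleR)

# A toy simulation: did a single die roll come up 6?
playDie <- function() sample(1:6, 1) == 6

# Run the function many times, spread across available cores/nodes;
# the results from all runs come back as a list on the user's machine
results <- rxExec(playDie, timesToRun = 10000)

# Estimate the probability of rolling a 6 (should be close to 1/6)
mean(unlist(results))
```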
- 14. Key ScaleR HPA features Handles an arbitrarily large number of rows in a fixed amount of memory Scales linearly with the number of rows Scales linearly with the number of nodes Scales well with the number of cores per node Scales well with the number of parameters Extremely high performance Revolution R Enterprise 14
- 15. Regression comparison using in-memory data: lm() vs rxLinMod() (benchmark chart comparing lm() in R, lm() in Revolution R, and rxLinMod() in Revolution R)
- 16. GLM comparison using in-memory data: glm() vs rxGlm() Revolution R Enterprise 18
- 17. SAS HPA Benchmarking comparison*: Logistic Regression

  |               | SAS HPA      | Revolution R |
  |---------------|--------------|--------------|
  | Rows of data  | 1 billion    | 1 billion    |
  | Parameters    | "just a few" | 7            |
  | Time          | 80 seconds   | 44 seconds   |
  | Data location | In memory    | On disk      |
  | Nodes         | 32           | 5            |
  | Cores         | 384          | 20           |
  | RAM           | 1,536 GB     | 80 GB        |

  Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. Revolution R Enterprise delivers performance at 2% of the cost.
  *As published by SAS in HPC Wire, April 21, 2011
- 18. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models (generalized linear model, 150 million observations, 70 degrees of freedom)

  | Approach                | Platform                       | Time to fit           |
  |-------------------------|--------------------------------|-----------------------|
  | SAS                     | 16-core Sun Server             | 5 hours               |
  | rmr/MapReduce           | 10-node 80-core Hadoop Cluster | > 10 hours            |
  | R                       | 250 GB Server                  | Impossible (> 3 days) |
  | Revolution R Enterprise | 5-node 20-core LSF cluster     | 5.7 minutes           |

  http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
- 19. What makes ScaleR so fast? Lots of things! This kind of performance requires: careful architecting that supports performance; constant, intense focus on all of the details; a review of every line of code with an eye to performance (in addition to giving the correct answers, of course); and extensive profiling and continuous benchmarking to detect problems and improve code.
- 20. Specific speed-related factors -- 1 Efficient computational algorithms Efficient memory management – minimize data copying and data conversion Heavy use of C++ templates; optimal code Efficient data file format; fast access by row and column Revolution R Enterprise 22
- 21. Specific speed-related factors -- 2 Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities) Handle categorical variables efficiently Revolution R Enterprise 23
- 22. Parallel External Memory Algorithms (PEMAs): The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms. These Parallel External Memory Algorithms (PEMAs) process data a chunk at a time, in parallel across cores and nodes.
- 23. Scalability and portability of Revo's implementation of PEMAs: These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM; they process a chunk of data at a time, giving linear scalability. They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions. They are independent of where the data is coming from, giving portability with respect to data sources.
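The chunk-at-a-time idea can be illustrated in plain base R. This is not ScaleR's actual implementation, only a sketch of the pattern: each chunk yields a small intermediate result (here, a sum and a count) that combines associatively, so chunks can be processed in any order, in parallel, and in fixed memory.

```r
# Illustrative external-memory mean: process chunks independently,
# then combine the per-chunk intermediate results.
chunked_mean <- function(chunks) {
  # One intermediate result object per chunk: (sum, count)
  partials <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))
  # Combine step: intermediate results simply add together
  total <- Reduce(`+`, partials)
  total[["sum"]] / total[["n"]]
}

# Four chunks of 25 values standing in for chunks read off disk
chunked_mean(split(1:100, rep(1:4, each = 25)))  # 50.5
```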
- 24. Simplified ScaleR Internal Architecture:
  Analytics Engine: PEMAs are implemented here (scalable, parallelized, threaded, distributable)
  Inter-process Communication: MPI, RPC, Sockets, Files
  Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
- 25. ScaleR on Hadoop – Base Implementation: Each iteration (pass through the data) is a separate MapReduce job. The mapper produces “intermediate result objects” for each task, which are combined using a combiner and then a reducer. A master process decides whether another pass through the data is required. Data can be stored temporarily (or longer) in XDF binary format for increased speed, especially on iterative algorithms.
- 26. Performance enhancements A single Map job that handles all iterations, to reduce overhead of starting multiple jobs Tree structured communication and reduction to reduce time for those operations, especially on large clusters Alternative communication schemes (e.g. sockets) instead of files Revolution R Enterprise 28
- 27. Thank You! Mario Inchiosa mario.inchiosa@revolutionanalytics.com Revolution R Enterprise 29
- 28. Backup Slides Revolution R Enterprise 30
- 29. Functionality in common across HPA algorithms Ability to do on-the-fly data transformations in the R language In-formula: ArrDelay>15 ~ DayOfWeek + … As transform: Late = ArrDelay>15 As transform function: Specify an R function to process an entire chunk of data at a time Row selection Both standard weights and frequency weights Revolution R Enterprise 31
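The three transformation styles listed above can be sketched as follows. The data file and variable names are hypothetical (an airline-style XDF file); `addLate` is a made-up function name, and the `transforms`/`transformFunc` arguments are the RevoScaleR mechanisms the slide refers to.

```r
library(RevoScaleR)

# 1. In-formula transformation
rxLogit(ArrDelay > 15 ~ DayOfWeek, data = "flights.xdf")

# 2. As a transform: define Late on the fly
rxLogit(Late ~ DayOfWeek, data = "flights.xdf",
        transforms = list(Late = ArrDelay > 15))

# 3. As a transform function: an R function that processes
#    an entire chunk of data (a list of columns) at a time
addLate <- function(dataList) {
  dataList$Late <- dataList$ArrDelay > 15
  dataList
}
rxLogit(Late ~ DayOfWeek, data = "flights.xdf", transformFunc = addLate)
```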
- 30. Portability of user code The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop The same analytics code can be used on a small data.frame in memory and on a huge distributed data file Revolution R Enterprise 32
- 31. Sample code for logit on laptop:

  ```r
  # Specify local data source
  airData <- myLocalDataSource

  # Specify model formula and parameters
  rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
            UniqueCarrier + F(CRSDepTime), data = airData)
  ```
- 32. Sample code for logit on Hadoop:

  ```r
  # Change the "compute context"
  rxSetComputeContext(myHadoopCluster)

  # Change the data source if necessary
  airData <- myHadoopDataSource

  # Otherwise, the code is the same
  rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
            UniqueCarrier + F(CRSDepTime), data = airData)
  ```
