
- 1. Revolution R Enterprise 5.0: Scalable Data Management and Analysis for the Enterprise. Sue Ranney, VP Product Development. November 2011. Revolution Confidential.
- 2. November 17, 2011: Welcome! Thanks for coming! Text questions for Q&A after the presentation. Revolution R Enterprise 5.0 Webinar.
- 3. In Today’s Webinar… About Revolution R Enterprise 5.0. “I don’t have big data”: why use Revolution R Enterprise 5.0 to get started? “I don’t have big hardware”: big data on your desktop. “I have big data, and need to be ready for tomorrow’s even bigger data”: scaling data analysis to a cluster. “I need to write my own scalable analyses”: creating your own scalable R extensions. Wrap-up, Q&A.
- 4. About Revolution R Enterprise.
- 5. Revolution R Enterprise: performance enhancements; greater productivity and ease of use; tackle “big data”; IT-friendly enterprise deployment. Built on open-source R and backed by technical support, training and consulting, and on-call experts.
- 6. Revolution R Enterprise 5.0: What’s New? Distributed/parallel computing: distribute analytics and R functions to a Windows HPC Server cluster. Scalable data management: new data import and cleaning/manipulation tools. Expanded scalable analytics: principal components analysis, factor analysis, and more. Enhanced R Productivity Environment: create and build R packages. Integration with Hadoop: Cloudera-certified MapReduce programming in R. Enhanced RevoDeployR server: supports multiple compute nodes, batch scheduling, and LDAP security. Upgraded open-source R: R 2.13.2 with the new byte compiler.
- 7. Revolution R Enterprise: What Gets Installed? Data management and statistical analysis for the enterprise: the latest stable version of open-source R (2.13.2); high-performance math libraries; the RevoScaleR package for scalable data management and analysis plus distributed data analysis/parallel computing; and an integrated development environment based on Visual Studio technology (for Windows), the R Productivity Environment (RPE), including a visual debugger for R.
- 8. “I don’t have big data.” Why get started with Revolution R Enterprise?
- 9. Why Revolution R Enterprise 5.0 with “Small” Data. Easy to get started: a consistent interface for start-to-finish data analysis with just a few functions, covering data import (text, SAS, SPSS, ODBC), data transformations and manipulation, and basic data analysis. Performance: fast analysis such as summary statistics, cross tabs, linear models, and logistic regression, even for data that can fit in memory. Scalability: replicate the data analysis you do today on big data down the road.
- 10. Scalable Data Management: Import. Import data from a variety of sources with rxImport: SPSS, SAS, delimited text (e.g., comma-separated), fixed-format text, and databases with an ODBC connection. Read small data sets into a data frame; store larger data sets in an efficient .xdf file format. Use arguments such as colClasses and colInfo to provide guidance on how to import data (e.g., as integer, factor, etc.).
- 11. Example: Import Mortgage Default Data. Import a data file (10,000 obs) into a data frame, specifying the input file location. Create a place-holder object for an output file that you’ll use with bigger data. Use the same code to import a file with 10 million observations. rxImport is new in 5.0, simplifying and scaling the data import process. In both cases, the data object returned can be used as input data in other RevoScaleR functions.
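The import steps above can be sketched roughly as follows. The file names are hypothetical, and rxImport is part of the RevoScaleR package shipped with Revolution R Enterprise, not base R:

```r
library(RevoScaleR)

# Small data: read straight into a data frame
mortData <- rxImport(inData = "mortDefaultSmall.csv")

# Bigger data: write to an .xdf file instead of holding it in memory
mortDS <- rxImport(inData = "mortDefault.csv",
                   outFile = "mortDefault.xdf",
                   overwrite = TRUE)
```

With no outFile, rxImport returns a data frame; with an outFile, it returns an object representing the .xdf file. Either object can then be passed as the data argument to other RevoScaleR functions.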
- 12. Scalable Data Management: Data Step. Basic steps for data manipulation and cleaning: variable selection, data transformations, and row selection. One function does it all: rxDataStep. You can use the same approach (function arguments) at various stages of your analysis: import, the data step, and “on the fly” in data analysis.
- 13. Example: Data Step with Mortgage Data. Specify the input data, which can be a data frame or an object representing an .xdf file. Specify an output file, if desired. Select variables and rows to include in the new data set. List variable transformations using the usual R expressions. rxDataStep is new in 5.0, simplifying and scaling the data step.
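A minimal sketch of such a data step, reusing the hypothetical mortgage .xdf file from the import example; the variable names (creditScore, yearsEmploy, default) and the transformation are illustrative, not from the deck:

```r
library(RevoScaleR)

# Select variables and rows, and add a transformed variable,
# writing the result to a new .xdf file
rxDataStep(inData = "mortDefault.xdf",
           outFile = "mortDefaultClean.xdf",
           varsToKeep = c("creditScore", "yearsEmploy", "default"),
           rowSelection = creditScore > 0,
           transforms = list(logCredit = log(creditScore)),
           overwrite = TRUE)
```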
- 14. Examples of R Operators and Functions You Can Use in ‘transforms’:
    +, -, *, /, ^, %%, … : row-by-row addition, subtraction, multiplication, division, exponentiation, modulus
    <, <=, >, >=, ==, != : logical operators
    abs, ceiling, floor, round, log, log10, cos, sin, sqrt, … : basic mathematical functions
    as.Date, weekdays, months, quarters, … : convert character data to Date data, then use functions like weekdays(), months(), quarters()
    rnorm, runif, rgamma, rexp, … : distribution functions
    cut : create a factor from a numeric variable
    substr, toupper, tolower : basic string handling
    ifelse : ifelse(test, yes, no) sets the value of a variable conditional on a test
    Or, write your own R function.
- 15. Additional Functions for Processing Data Sets:
    rxSort : sort a data set by one or more key variables
    rxMerge : merge two data sets by one or more key variables
    rxFactors : create or modify factors (categorical variables) based on existing variables
    rxSetVarInfo : change variable information, such as the name or description of a variable
    rxSetInfo : add or change a data set description
    rxSplitXdf : split a single .xdf file into multiple .xdf files
- 16. “I don’t have big hardware.” Big data analysis on your desktop.
- 17. Getting Started with Big Data. When I talk with people about their “big data,” almost always the first issue they raise is hardware: “What kind of hardware do I need to analyze big data?” My answer: “Get started today with the hardware you have. With Revolution R Enterprise 5.0, you can quickly begin doing scalable data analysis on your desktop while you are determining your longer-term hardware requirements.”
- 18. Big Data on Your Desktop. Data sets with many variables and 100 million observations can be easily processed on a desktop using RevoScaleR functions. Using Revolution R Enterprise 5.0, you can avoid getting locked into memory-bound analyses: because data is processed a chunk at a time, increasing the number of observations in your data set doesn’t increase the memory requirements for a given analysis.
- 19. Example: Analyzing data on all the births in the United States from 1985-2008. From R in a Nutshell (on dealing with the 2006 birth data): “The natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process, so I used Perl to translate these files into a form easily readable by R.” Almost 100 million observations, originally stored in annual fixed-format text files, were imported and appended into one .xdf file for fast access using the RevoScaleR import function (no need for Perl).
- 20. Examples: Interacting with Your Data. Quickly compute summary statistics for variables in the data set using rxSummary: birth weight in grams and a time-trend variable, months since Jan. 1985. rxSummary(~DBIRWT + MNTHS_SINCE_1985, data = birthAll, blocksPerRead = 10). With blocksPerRead set to 10, each read pulls in 10 blocks of the desired variables from the .xdf file, or a little under 5,000,000 observations per read.
- 21. Example: Summary Statistics on Two Birth Data Variables. Time the processing of all chunks and the final results. It looks like 9999 must be the missing-value code for DBIRWT.
- 22. Examples: Group Averages. Use rxCube to compute the proportion of babies that were boys for each year and each race category of the mother: momRaceYear <- rxCube(ItsaBoy ~ F(DOB_YY):MRACEREC, data = birthAll, blocksPerRead = 10). The F() function creates an “on-the-fly” categorical variable with a level for each unit interval. The average of the dependent variable, ItsaBoy, will be computed for each category determined by the interaction term.
- 23. Example: Use rxCube to Compute the Proportion of Boys by Year and Mother’s Race. Put the results into a data frame, which makes plotting easy. rxLinePlot is particularly well suited for plotting rxCube results.
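The rxCube-to-plot workflow above can be sketched as follows; birthAll and the variable names come from the birth-data example, and the use of rxResultsDF to obtain a plottable data frame is my assumption about the intended step:

```r
library(RevoScaleR)

# Group averages of ItsaBoy by year and mother's race category
momRaceYear <- rxCube(ItsaBoy ~ F(DOB_YY):MRACEREC,
                      data = birthAll, blocksPerRead = 10)

# Convert the cube results to a data frame for plotting
cubeDF <- rxResultsDF(momRaceYear)

# One line per race category, proportion of boys over time
rxLinePlot(ItsaBoy ~ DOB_YY, groups = MRACEREC, data = cubeDF)
```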
- 24. Example: Plot the Results.
- 25. Estimating a Linear Model. Suppose we want to estimate a linear model of birth weight (in pounds) on plurality and a time trend: BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985, where BIRWTLBS = DBIRWT/453.59237 and rowSelection = DBIRWT < 9000.
- 26. Using the .xdf data file as a data source for the biglm package from CRAN. The biglm package also processes data in chunks. We can create an .xdf data source and use it with biglm functions. A linear model on almost 100 million rows in about 6 minutes on a desktop in R seems impressive. I’ve written a small function to use an .xdf data source with biglm.
- 27. Using rxLinMod: Optimized for Speed. Adding another 5,000,000 observations would add less than 1 second.
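As a sketch, the model defined on slide 25 would be fit with rxLinMod roughly like this; the transforms and rowSelection arguments mirror the definitions given there, and birthAll is the imported birth-data .xdf object:

```r
library(RevoScaleR)

lmFit <- rxLinMod(BIRWTLBS ~ DPLURAL_REC + MNTHS_SINCE_1985,
                  data = birthAll,
                  # convert grams to pounds on the fly
                  transforms = list(BIRWTLBS = DBIRWT / 453.59237),
                  # drop rows with the 9999 missing-value code
                  rowSelection = DBIRWT < 9000,
                  blocksPerRead = 10)
summary(lmFit)
```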
- 28. Linear Model Results. For those who have gotten interested in the actual analysis, here are the results: at the beginning of 1985, the average singleton baby weighed 7.46 pounds; twins were a little over two pounds lighter, and triplets or higher lighter still; and there is a downward trend in birth weight over time, but it is very small.
- 29. Estimating a Big Logistic Model. Let’s try a more challenging model: a logistic regression with over 50 parameters (categorical data for Dad’s and Mom’s ages, race, Hispanic ethnicity, live birth order, plurality, gestation, and year): ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + LBO4 + DPLURAL_REC + Gestation + F(DOB_YY).
- 30. Big Logistic Model on the Desktop. Even a large logistic regression (over 50 parameters) with almost 100 million rows of data can be estimated on a desktop in about the time it takes to get a cup of coffee (about 6 minutes). But what if that’s not fast enough?
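A sketch of fitting the model from the previous slide with RevoScaleR’s rxLogit; the formula and birthAll come from the deck:

```r
library(RevoScaleR)

logitFit <- rxLogit(ItsaBoy ~ DadAgeR8 + MomAgeR7 + FRACEREC +
                      FHISP_REC + MRACEREC + MHISP_REC + LBO4 +
                      DPLURAL_REC + Gestation + F(DOB_YY),
                    data = birthAll, blocksPerRead = 10)
summary(logitFit)
```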
- 31. Audience Poll. Before we answer that question, let’s do a quick poll of the audience.
- 32. “I need to be ready for tomorrow’s data.” Scaling data analysis to a cluster.
- 33. Scaling Data Analysis to a Cluster. With Revolution R Enterprise 5.0, you can use the same functions that you used on your desktop to scale to a cluster of computers. Windows HPC Server is currently supported. (See http://technet.microsoft.com/en-us/hpc/cc453771 for information on a 180-day evaluation copy.)
- 34. The Birth Data Logistic Regression on a Cluster. In our office we have a 5-node cluster of commodity hardware (about $5,000) running Windows HPC Server. I just set my compute context to use the cluster, set the location of the data on the nodes, run the same code, and wait for the results: 42 seconds instead of 6 minutes.
- 35. How Does It Work? When I run rxLogit from my desktop with an HPC Server compute context, a job is submitted to the cluster. The master node allocates tasks to worker nodes, which compute intermediate results on their part of the data. The master node aggregates the intermediate results from the nodes and processes them. If needed, more tasks are assigned (e.g., computing the next iteration). When complete, the master node sends the results back to my desktop. Best of all, I don’t need to know how it works: I just set my compute context, run my code, and get my results back.
- 36. The HPC Job Scheduler. If I’m interested, I can see the activity on the cluster using the HPC job scheduler, which can be launched from a menu item in the R Productivity Environment. I can see that my computations were processed on 4 cores on each of 5 nodes.
- 37. HPA and HPC Both Supported. I think of the logistic regression we just ran as High Performance Analytics: the computations are automatically distributed for the analysis of huge data sets. A key component is simultaneous rapid access to data; a cluster where each node has a separate disk drive is usually ideal. With traditional High Performance Computing, the focus is not on the data. For example, a user might specify a function to be run in parallel across computing resources. Typically these are “embarrassingly” or “pleasingly” parallel computing problems.
- 38. HPC Example: the Birthday Problem. In a group of a given size, what is the probability that two people will have the same birthday? We can perform a brute-force computation, repeatedly creating random samples and counting, and do it in parallel across the nodes of the cluster. We’ll use a function, pbirthday, that takes 2 arguments: n, the group size, and ntests, the number of times to sample.
- 39. HPC Example: the Birthday Problem. Set the compute context to do computations on our cluster. Use the rxExec function to ask each node of the 5-node cluster to do up to 20 runs of the pbirthday function, each using a different value for the ‘n’ argument. rxExec allows users to run arbitrary functions in parallel. The results come back in a list, which we can manipulate and plot.
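This can be sketched as follows. The pbirthday body is my reconstruction of the two-argument simulator described on the previous slide (note it shadows the unrelated stats::pbirthday), and the rxElemArg call is how RevoScaleR pairs each run with its own value of n:

```r
# Fraction of random groups of size n containing a shared birthday
pbirthday <- function(n, ntests = 5000) {
  hits <- sum(replicate(ntests,
                        any(duplicated(sample(365, n, replace = TRUE)))))
  hits / ntests
}

library(RevoScaleR)

# One run per group size, distributed across the cluster;
# results come back as a list
probs <- rxExec(pbirthday, n = rxElemArg(2:21))
```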
- 40. HPC Example: the Birthday Problem.
- 41. Using RevoScaleR with foreach: doRSR. Another alternative for parallel computing with RevoScaleR is the foreach package, which provides a for-loop-like approach to parallel computing. Parallel backends have been written for a variety of parallel computing packages, now including RevoScaleR. Let’s look at a simple example: computing square roots in parallel.
- 42. Simple example of foreach with doRSR. To get started with doRSR, load the library and register it as the backend for foreach. To run jobs on the cluster, set your compute context. We’ll estimate the square root for each of the numbers from 1 to 20; in this case, 20 cores will be requested from the cluster for the computations.
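A sketch of the square-root example just described; doRSR ships with Revolution R Enterprise rather than CRAN, and the compute context is assumed to have been set to the cluster already:

```r
library(foreach)
library(doRSR)

# Make RevoScaleR the parallel backend for foreach
registerDoRSR()

# Twenty iterations run in parallel; .combine = c collects a vector
x <- foreach(i = 1:20, .combine = c) %dopar% sqrt(i)
```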
- 43. The Job Scheduler Using doRSR. You can see the 20 cores requested for the job, and that all 5 nodes were used.
- 44. Setting Up Your Compute Context. I’ve mentioned the “compute context” a lot. To set up your compute context, you just need basic information about your cluster. It’s easy to create a new compute context based on an existing one: just specify the properties you’d like to change.
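A hedged sketch of creating a compute context for a Windows HPC Server cluster; the head-node name, share, and install paths are all hypothetical, and the exact RxHpcServer arguments should be checked against the RevoScaleR documentation:

```r
library(RevoScaleR)

myCluster <- RxHpcServer(
  headNode = "cluster-head",                                  # hypothetical
  shareDir = "\\\\cluster-head\\AllShare\\myname",            # hypothetical
  revoPath = "C:\\Revolution\\R-Enterprise-5.0\\R-2.13.2\\bin\\x64",
  dataPath = "C:\\data")

# All subsequent RevoScaleR analyses now run on the cluster
rxSetComputeContext(myCluster)
```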
- 45. Non-Waiting Jobs on a Cluster. It is common to use non-waiting jobs when working with a cluster: send off your job and return to work. Check the status of a non-waiting job in the object browser, or have an email sent. Then retrieve the results on your local machine. When my job is done, I can retrieve the results using rxGetJobResults; the job status is visible in the object browser.
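The non-waiting round trip might look like the sketch below. It assumes myCluster is an existing HPC Server compute context, and that passing it back to RxHpcServer with wait = FALSE clones it with that one property changed (an assumption worth verifying):

```r
library(RevoScaleR)

# Clone the cluster context as non-waiting (assumed clone-and-modify call)
rxSetComputeContext(RxHpcServer(myCluster, wait = FALSE))

# The call returns a job object immediately instead of results
job <- rxLogit(ItsaBoy ~ DPLURAL_REC + F(DOB_YY), data = birthAll)

rxGetJobStatus(job)               # poll until the job has finished
logitFit <- rxGetJobResults(job)  # then pull the results back locally
```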
- 46. “I need to write my own scalable analyses.” Creating your own scalable R extensions.
- 47. Creating Your Own Scalable Extensions. Use doRSR and rxExec to distribute user-defined computations across processes or nodes of a cluster. Use output from RevoScaleR functions as input into other functions (e.g., output from rxCor into princomp for principal components). Write your own chunking algorithms, e.g., using rxDataStep to automatically chunk through the data (I’ll show you an example). When you’re done, create a package to distribute your new functions using the RPE.
- 48. Transformation Functions. Transformation functions are user-defined functions that operate on a chunk of data. They can be used to perform arbitrary computations and update results. You can use transformation functions in RevoScaleR analysis functions to perform specialized data transformations. This example is for use in rxDataStep.
- 49. Using rxDataStep for User Computations. rxDataStep will automatically “chunk” through the data and run the transformation function on each chunk. Just initialize your computed values in the transformObjects argument; your final results can be returned in a list. The updated tableSum will contain the cumulated results of calling the table function on the DayOfWeek variable.
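The chunking pattern just described might be sketched as below: accumulate a frequency table of DayOfWeek across chunks. The file name is hypothetical, and the .rxGet/.rxSet accessors and the returnTransformObjects argument follow my understanding of the RevoScaleR transform-function conventions; treat the details as assumptions:

```r
library(RevoScaleR)

# Runs once per chunk; dataList holds the chunk's variables
sumTable <- function(dataList) {
  tableSum <- .rxGet("tableSum")                      # running total so far
  .rxSet("tableSum", tableSum + table(dataList$DayOfWeek))
  return(NULL)                                        # no new variables created
}

result <- rxDataStep(inData = "airline.xdf",
                     transformFunc = sumTable,
                     transformObjects = list(tableSum = 0),
                     returnTransformObjects = TRUE)
result$tableSum   # cumulated table over all chunks
```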
- 50. Package Support in the RPE. To create a new R package project, choose File/New/Project/R Package Project. A solution with all the required R package components is automatically created. Right-click the ‘man’ folder to add a help file for a new function. Build a package by right-clicking the project and choosing Build R Package.
- 51. Wrap Up. It’s time to get started with Revolution R Enterprise 5.0: start out analyzing a small data frame; use the same code to analyze a large data set locally; get high computing performance using the same code on a cluster; and extend your analyses using the power and flexibility of the R language.
- 52. Revolution R Enterprise: Free to Academia. For personal use, research, teaching, and package development. Free academic download: www.revolutionanalytics.com/downloads/free-academic.php. Discounted technical support subscriptions available.
- 53. Revolution R Enterprise 5.0: Now Available! Thank You!
