New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

Revolution R Enterprise 6.1 includes two important advances in high performance predictive analytics with R: (1) big data decision trees, and (2) the ability to easily extract and perform predictive analytics on data stored in the Hadoop Distributed File System (HDFS).

Classification and regression trees are among the most frequently used algorithms for data analysis and data mining. The implementation provided in Revolution Analytics’ RevoScaleR package is parallelized, scalable, distributable, and designed with big data in mind.

Decision trees and all of the other high performance predictive analytics functions provided with RevoScaleR (such as linear and logistic regression, generalized linear models, and k-means clustering) can now also be used to analyze data stored in HDFS. After specifying the connection parameters for HDFS, some or all of the data can be explored and analyzed directly, or quickly and efficiently extracted into a native file system.

  1. New Advances in High Performance Analytics with R: Big Data Decision Trees and Analysis of Hadoop Data
     Presented by: Sue Ranney, VP Product Development
     Revolution Confidential
  2. In today’s webcast:
     • High Performance Analytics (HPA) with Revolution R Enterprise
     • ‘Big Data’ Decision Trees
     • Revolution’s HPA with Hadoop Data
     • Resources, Q&A
  3. Revolution R Enterprise: What Gets Installed?
     • Latest stable version of open-source R
     • High performance math libraries
     • RevoScaleR package that adds:
        • High performance ‘big data’ capabilities to R
        • Access to a variety of ‘data sources’ (e.g., SAS, SPSS, text files, ODBC)
        • Ability to compute in a variety of ‘compute contexts’ (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
        • High performance computing capabilities
     • Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)
  4. High Performance Analytics (HPA) in RevoScaleR
     • High Performance Computing + Data
     • Full-featured, fast, and scalable analysis functions
     • Same code works on small and big data, and on a variety of data sources
     • Same code works on a variety of compute contexts: a laptop, server, cluster, or the cloud
     • Scales approximately linearly with the number of observations, without increasing memory requirements
  5. RevoScaleR: HPA Algorithms
     • Descriptive statistics (rxSummary)
     • Tables and cubes (rxCube, rxCrossTabs)
     • Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
     • K-means clustering (rxKmeans)
     • Linear regressions (rxLinMod)
     • Logistic regressions (rxLogit)
     • Generalized linear models (rxGlm)
     • Predictions (scoring) (rxPredict)
     • Decision trees (rxDTree) NEW!
  6. Decision Trees
     • Relatively easy-to-interpret models
     • Widely used in a variety of disciplines. For example:
        • Predicting which patient characteristics are associated with high risk of, for example, heart attack
        • Deciding whether or not to offer a loan to an individual based on individual characteristics
        • Predicting the rate of return of various investment strategies
        • Retail target marketing
     • Can handle multi-factor response easily
     • Useful in identifying important interactions
  7. Decision Tree Types
     • Classification tree: predict what ‘class’ or ‘group’ an observation belongs in (dependent variable is a factor) for each terminal node or leaf
     • Regression tree: predict average value of dependent variable for each terminal node or leaf
  8. Simple Example: Marketing Response
     Data set containing the following information:
     • Response: Was response to a phone call, email, or mailing?
     • Age
     • Income
     • Marital status
     • Attended college?
  9. Simple Example: Specifying the Model
     treeOut <- rxDTree(response ~ age + income + college + marital, data = rdata)
     where rdata is the name of the data set
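The model specification on this slide can be tried without RevoScaleR. The following is a minimal sketch that assumes the open-source rpart package as a stand-in for rxDTree (the two accept the same style of formula, but rpart is not the big data implementation described in the webcast). The data frame is simulated; the response rule, sample size, and noise level are all invented for illustration and are not the webcast's data.

```r
library(rpart)  # open-source stand-in for rxDTree (an assumption, not the webcast's function)

set.seed(42)
n <- 2000

# Hypothetical simulated data; column names follow the slide's formula
rdata <- data.frame(
  age     = sample(18:80, n, replace = TRUE),
  income  = round(runif(n, 10, 100)),
  college = factor(sample(c("No College", "College"), n, replace = TRUE)),
  marital = factor(sample(c("Single", "Married"), n, replace = TRUE))
)

# Invented response rule: college -> Email, older non-college -> Mail, else Phone
rdata$response <- factor(
  ifelse(rdata$college == "College", "Email",
         ifelse(rdata$age >= 65, "Mail", "Phone")),
  levels = c("Phone", "Email", "Mail")
)

# Replace about 10% of the responses with random labels so the tree is not perfect
noisy <- sample(n, n %/% 10)
rdata$response[noisy] <- sample(levels(rdata$response), length(noisy), replace = TRUE)

# Same formula as the slide, fitted as a classification tree
treeOut <- rpart(response ~ age + income + college + marital,
                 data = rdata, method = "class")
print(treeOut)

# In-sample accuracy of the fitted tree
predicted <- predict(treeOut, rdata, type = "class")
accuracy  <- mean(predicted == rdata$response)
```

Printing the fitted object gives a node listing in the same general layout as the basic output shown on the next slide: node number, split, observation count, misclassified count, predicted class, and class probabilities.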
  10. Simple Example: Basic Output
     Information on the split, the number of observations in the node, the number that do not match the y value, and the y probabilities:
      1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
        2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
          4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
            8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
            9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
          5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
           10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
             20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
             21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
           11) marital=Married 1721 87 Email (0.02556653 0.9494480 0.02498547) *
        3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
        …
  11. Simple Example: Visual Representation
     [Figure: diagram of the fitted tree. The root splits on college (No College / College); later splits are on age (at 40 and 65), marital status (Single / Married), and income (at 30); leaves are labeled Phone, Email, or Mail.]
  12. Scaling HPA with RevoScaleR
     • RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
     • RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
        • Multiple cores on a desktop/server
        • Clusters/grids have the added advantage of more hard drives for storing and accessing data
           • Windows HPC Server cluster
           • “Burst” computations to Azure in the cloud
           • IBM Platform LSF grid
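The chunked-processing idea on this slide can be illustrated in base R. This is only a sketch of the general external-memory pattern, not RevoScaleR's implementation: the function `chunked_mean_var` and the `read_chunk` callback are hypothetical names, and the "file" is simulated by an in-memory vector. Each chunk contributes a count, a sum, and a sum of squares, so only one chunk needs to be held at a time.

```r
# Accumulate sufficient statistics one chunk at a time.
# read_chunk(i) is a hypothetical callback returning the i-th chunk as a numeric vector.
chunked_mean_var <- function(read_chunk, n_chunks) {
  n <- 0; s <- 0; ss <- 0
  for (i in seq_len(n_chunks)) {
    x  <- read_chunk(i)        # only this chunk is needed at this point
    n  <- n + length(x)
    s  <- s + sum(x)
    ss <- ss + sum(x^2)
  }
  m <- s / n
  # Textbook sum-of-squares formula; simple, but not the most numerically robust choice
  list(n = n, mean = m, var = (ss - n * m^2) / (n - 1))
}

# Simulated stand-in for a data set on disk, read as 20 chunks of 5,000 values
set.seed(1)
full_data  <- rnorm(100000, mean = 50, sd = 10)
chunk_id   <- rep(1:20, each = 5000)
read_chunk <- function(i) full_data[chunk_id == i]

stats <- chunked_mean_var(read_chunk, 20)
```

Because the per-chunk statistics simply add up, the chunks could equally be processed by different cores or nodes and combined at the end, which is the property that makes this style of algorithm parallelizable.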
  13. The ‘Big Data’ Decision Tree Algorithm
     • Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data. This sorting step becomes time and memory prohibitive when dealing with large data.
     • rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data
     • rxDTree partitions the data horizontally, processing different sets of observations in parallel
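The binning idea can be sketched in base R for a single numeric predictor and a numeric response. This is a simplified illustration under stated assumptions, not rxDTree's actual algorithm: the function name `best_binned_split`, the fixed bin edges, and the simulated data are all hypothetical. Each chunk of rows adds to a per-bin count and per-bin sum of y, and candidate splits are then evaluated only at bin boundaries, so no sort of the raw data is needed.

```r
# Find the best regression split on x using per-bin histograms built chunk by chunk.
# chunks: list of data frames with columns x and y; breaks: ascending bin edges.
best_binned_split <- function(chunks, breaks) {
  n_bins <- length(breaks) - 1
  cnt    <- numeric(n_bins)   # observations per bin
  sum_y  <- numeric(n_bins)   # sum of y per bin
  for (ch in chunks) {
    b     <- findInterval(ch$x, breaks, all.inside = TRUE)   # bin index, 1..n_bins
    cnt   <- cnt + tabulate(b, nbins = n_bins)
    sum_y <- sum_y + vapply(seq_len(n_bins),
                            function(k) sum(ch$y[b == k]), numeric(1))
  }
  # Candidate split k sends bins 1..k left and the remaining bins right
  nL <- cumsum(cnt)[-n_bins];  sL <- cumsum(sum_y)[-n_bins]
  nR <- sum(cnt) - nL;         sR <- sum(sum_y) - sL
  # Maximizing sL^2/nL + sR^2/nR is equivalent to minimizing within-node squared error
  score <- ifelse(nL > 0 & nR > 0, sL^2 / nL + sR^2 / nR, -Inf)
  k <- which.max(score)
  list(split = breaks[k + 1], score = score[k])
}

# Simulated data with a true change point at x = 40
set.seed(7)
d   <- data.frame(x = runif(10000, 0, 100))
d$y <- ifelse(d$x < 40, 1, 5) + rnorm(nrow(d), sd = 0.5)

# Horizontal partition into 10 chunks of rows, and 100 equal-width bins
chunks <- split(d, rep(1:10, length.out = nrow(d)))
res    <- best_binned_split(chunks, breaks = seq(0, 100, length.out = 101))
```

The trade-off is that a split can only fall on a bin edge, so fewer bins mean less work and memory but coarser split points.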
  14. Useful rxDTree Arguments for Big Data
     • cp: complexity parameter. Increasing cp will decrease the number of splits attempted.
     • maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
     • maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.
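The effect of the first two arguments can be seen on a small scale with the open-source rpart package, whose `rpart.control()` has analogous `cp` and `maxdepth` settings. This is a sketch using rpart as a stand-in, on R's built-in iris data; rpart has no counterpart to maxNumBins because it does not bin, and the helper functions below are hypothetical names introduced for illustration.

```r
library(rpart)  # stand-in for rxDTree; cp and maxdepth are rpart.control() arguments

# Hypothetical helpers: number of splits, and depth of the deepest node (root = 0)
count_splits   <- function(fit) sum(fit$frame$var != "<leaf>")
max_node_depth <- function(fit) max(floor(log2(as.numeric(rownames(fit$frame)))))

# Small cp: splits with little improvement are still kept
loose <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(cp = 0.001))

# Large cp: only splits with a big improvement are kept
strict <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(cp = 0.3))

# Depth cap: no node may be deeper than one level below the root
shallow <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0.001, maxdepth = 1))

cat("splits with cp = 0.001:", count_splits(loose), "\n")
cat("splits with cp = 0.3:  ", count_splits(strict), "\n")
cat("max depth with maxdepth = 1:", max_node_depth(shallow), "\n")
```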
  15. ‘Big Data’ Example: CDC Report in Jan. 2012
  16. The U.S. Birth Data: 1985 - 2009
     • Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
     • “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” (Joseph Adler, R in a Nutshell)
     • I’ve imported key variables from each year into a single .xdf file with over 100 million observations.
  17. Regression Tree: Multiple Births
     Call:
     rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + DOB_YY, data = birthAllC, maxDepth = 6, cp = 1e-05, blocksPerRead = 10, verbose = 1)
     File: C:RevolutionDataCDCBirthUS.xdf
     Number of valid observations: 100672041
     Number of missing observations: 0
  18. Leaves with Lowest Percent of Multiple Births
     • 1.3%: Mom is not black and under the age of 20
     • 1.6%: Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997
     • 1.7%: Mom is black and under the age of 18
  19. Leaves with Highest Percent of Multiple Births
     • 38.6%: Mom is over 47 years old and the birth is after 1996
     • 28.1%: Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996
     • 15.5%: Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996
  20. Poll Question: Are you using Hadoop?
  21. RevoScaleR with Hadoop Data Files (NEW)
     • The Hadoop Distributed File System (HDFS)
        • is highly fault-tolerant and
        • is designed to be deployed on low-cost hardware.
     • RevoScaleR supports accessing data in the HDFS file system for import or for direct analysis
  22. RevoScaleR Data Sources
     • Data sources can be used for import or directly for analysis
        • External: delimited text, fixed format text, SAS, SPSS, ODBC connections
        • Provided with RevoScaleR: efficient .xdf file format
     • Data sources contain information about their file system
        • Delimited text and .xdf data sources can both be used with the HDFS file system
     • Data sources are used as input to HPA functions
  23. An Example Using Hadoop Data
     • Hadoop cluster in our office
        • Five nodes of commodity hardware
        • Red Hat Enterprise Linux (RHEL) operating system
        • Cloudera’s Hadoop (CDH3)
        • Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
     • My colleague, Dawn Kinsey, recorded a data analysis session
        • 22 comma-delimited files stored in HDFS
        • Contain information on U.S. flight arrivals, 1997 to 2008
  24. Steps in Analysis
     • Set up a ‘file system’ object and a ‘data source’ object
     • Explore the HDFS airline data for the year 2000 directly
     • Extract variables of interest from all the files into an .xdf file in the native file system
     • Use R’s great plotting capabilities on summary information
     • Perform a big logistic regression on an .xdf file stored in HDFS
  25. Poll Question: What features of Revolution R Enterprise 6.1 are most interesting to you?
  26. Thank You!
     • Download slides, replay from today’s webinar: http://bit.ly/QJfR4A
     • Learn more about Revolution R Enterprise
        • Overview: revolutionanalytics.com/products
        • New feature videos: http://www.revolutionanalytics.com/products/new-features.php
     • Contact Revolution Analytics: http://bit.ly/hey-revo
     • November 29: Real-Time Big Data Analytics: from Deployment to Production. David Smith, VP Marketing and Community, Revolution Analytics. www.revolutionanalytics.com/news-events/free-webinars
  27. Revolution Analytics: the leading commercial provider of software and support for the popular open source R statistics language.
     www.revolutionanalytics.com, +1 (650) 646 9545, Twitter: @RevolutionR
