Introduction to R for Data Mining

We at Revolution Analytics are often asked “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

- Find a way of orienting yourself in the open source R world
- Have a definite application area in mind
- Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

- Provide an orientation to R’s data mining resources
- Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
- Show the simple R commands to accomplish these same tasks without the GUI
- Demonstrate how to build on these fundamental skills to gain further competence in R
- Move beyond small test data sets and show how, with the same level of skill, one can analyze fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software, as well as students who are new to data mining, should come away with a plan for getting started with R.
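
The three tasks named above (exploring and visualizing data, building a classification model on a training set, and classifying new data) take only a few lines of plain R. The following is a minimal sketch, not the webinar's own script: the built-in iris data set, the 70/30 split, and the variable names are all assumptions chosen for illustration.

```r
# Sketch only: iris stands in for the webinar's data, which is not reproduced here.
library(rpart)  # recursive partitioning; a "recommended" package bundled with standard R installs

set.seed(42)    # make the random train/test split repeatable
data(iris)

# 1. Explore and visualize
summary(iris)
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species,
     xlab = "Petal length", ylab = "Petal width")

# 2. Build a classification tree on a training set (70% of the rows)
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]
fit   <- rpart(Species ~ ., data = train, method = "class")

# 3. Use the model to classify data it has not seen
pred      <- predict(fit, newdata = test, type = "class")
confusion <- table(actual = test$Species, predicted = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

Swapping `rpart` for another modeling function changes only step 2. The explore, fit, predict pattern stays the same.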

Published in: Technology

Introduction to R for Data Mining

  1. Revolution Confidential. Introduction to R for Data Mining. 2012 Spring Webinar Series. Joseph B. Rickert, Revolution Analytics. June 5, 2012
  2. Goals for Today’s Webinar. To convince you that:
     - R is a serious platform for data mining
     - Seriously, it is not difficult to learn enough R to do some serious data mining
     - Revolution R Enterprise is the platform for serious data mining
  3. Data Mining
     - Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
     - Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
     - Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
  4. Recent KDnuggets poll suggests so are a lot of other serious data miners. Poll question: What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]. Software (votes): % users in 2012, % users in 2011
     - R (245): 30.7%, 23.3%
     - Excel (238): 29.8%, 21.8%
     - Rapid-I RapidMiner (213): 26.7%, 27.7%
     - KNIME (174): 21.8%, 12.1%
     - Weka / Pentaho (118): 14.8%, 11.8%
     - StatSoft Statistica (112): 14.0%, 8.5%
     - SAS (101): 12.7%, 13.6%
     - Rapid-I RapidAnalytics (83): 10.4%, not asked in 2011
     - MATLAB (80): 10.0%, 7.2%
     - IBM SPSS Statistics (62): 7.8%, 7.2%
     - IBM SPSS Modeler (54): 6.8%, 8.3%
     - SAS Enterprise Miner (46): 5.8%, 7.1%
  5. Learning R: WHAT DOES IT MEAN TO LEARN R?
  6. What does it mean to learn French?
     - To get around Paris on the Metro
     - To read a menu
     - To carry on a conversation
  7. Learning R. Levels of R Skill, on the Malcolm Gladwell “Outlier” Scale (10 to 10,000 hours of use):
     - Write production level code: R developer
     - Write an R package: R contributor
     - Write functions: R programmer
     - Use R functions: R user
     - Use a GUI: R aware
  8. Productive from the get-go! THE STRUCTURE OF R FACILITATES LEARNING
  9. R is set up to compute functions on data. Diagram: a function, lm <- function(x,y) { . . . }, produces an object, lm.model, with components lm.model$assign, lm.model$coefficients, lm.model$df.residual, lm.model$effects, lm.model$fitted.values, . . .
  10. A little knowledge goes a long way in R
      - R’s functional design facilitates performing small tasks
      - For the most part, the output of a function depends only on the values of its arguments
      - Calling a function multiple times with the same values of its arguments will produce the same result each time
      - Minimal side effects means it is much easier to understand and predict the behavior of a program
      - The trick is knowing which functions to call
  11. Basic Machine Learning Functions. Function (library): description
      - Cluster: hclust (stats): Hierarchical cluster analysis; kmeans (stats): Kmeans clustering
      - Classifiers: glm (stats): Logistic Regression; rpart (rpart): Recursive partitioning and regression trees; ksvm (kernlab): Support Vector Machine
      - Ensemble: ada (ada): Stochastic boosting; randomForest (randomForest): Random Forests classification and regression
  12. Noteworthy Data Mining Packages
      - rattle: A very intuitive GUI for data mining that produces useful R code
      - caret: Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
  13. Doing a lot with a little R: TIME TO RUN SOME CODE
  14. Scripts to run. Script: some key functions
      - 0 Setup: Load libraries
      - 1 Explore weather data: read.csv, plot
      - 2 Run clustering algorithms: kmeans, hclust
      - 3 Basic decision tree: rpart
      - 4 Boosted Tree: ada
      - 5 Random Forest: randomForest
      - 6 Support Vector Machine: randomForest, varImpPlot
      - 7 Big Data Mortgage Default model: rxLogit, rxKmeans
  15. Big Data and R. There are some challenges:
      - All of your data and model code must fit into memory
      - Big data sets as well as big models (lots of variables) can run out of memory
      - Parallel computation might be necessary for models to run in a reasonable time
  16. RevoScaleR in Revolution R Enterprise. Can help in a number of ways:
      - Manipulate large data sets, perhaps aggregating data so that it will fit in memory (for example, boiling down time-stamped data like a web log to form a time series that will fit in memory)
      - Run RevoScaleR functions directly on big data sets
      - Run R functions in parallel
  17. Top RevoScaleR Functions for Data Mining: parallel external memory algorithms. Task: RevoScaleR function
      - Data processing: rxDataStep
      - Descriptive statistics: rxSummary
      - Tables and cubes: rxCube, rxCrossTabs
      - Correlations / covariance: rxCovCor, rxCor, rxCov, rxSSCP
      - Linear models: rxLinMod
      - Logistic regressions: rxLogit
      - Generalized linear models: rxGlm
      - K means clustering: rxKmeans
      - Predictions (scoring): rxPredict
  18. More than code, R is a community: WHERE TO GO FROM HERE?
  19. Finding your way around the R world
      - Machine Learning, Data Mining, Visualization
      - Finding Packages: Task Views, crantastic.org
      - Blogs: Revolutions, R-Bloggers, Quick-R
      - Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org
      - Finding R People: User Groups worldwide, #rstats
      - Figure: Word Cloud for @inside_R
  20. Look at some more sophisticated examples
      - Thomson Nguyen on the Heritage Health Prize
      - Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
      - Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
      - Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
  21. Revolution Analytics Training: http://www.revolutionanalytics.com/products/training/
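
Slides 9 and 10 describe R as computing functions on data: a function returns an object whose components can be inspected, and the same arguments give the same result. A minimal sketch of that idea follows, assuming the built-in cars data set in place of the webinar's data. The object name lm.model follows the slide; everything else is illustrative.

```r
# Sketch only: the built-in cars data set is an assumption, not the webinar's data.
lm.model <- lm(dist ~ speed, data = cars)   # compute a function on data

names(lm.model)               # the components stored in the fitted object
lm.model$coefficients         # intercept and slope
lm.model$df.residual          # residual degrees of freedom
head(lm.model$fitted.values)  # first few fitted values

# Same arguments, same result: refitting gives identical coefficients
again <- lm(dist ~ speed, data = cars)
identical(coef(lm.model), coef(again))
```

Because the fitted model is an ordinary list-like object, the `$` accessor used here works the same way on the results of most modeling functions.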

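
Slide 16 mentions boiling down time-stamped data such as a web log into a time series that fits in memory. The sketch below shows that aggregation step in base R on a small simulated log. It does not use RevoScaleR, and the data, column names, and hourly bucket size are hypothetical.

```r
# Sketch only: a simulated web log; the column names are hypothetical.
set.seed(1)
start  <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
weblog <- data.frame(
  timestamp = start + sort(runif(10000, min = 0, max = 24 * 3600 - 1)),
  bytes     = rpois(10000, lambda = 2000)
)

# Bucket each request into the hour in which it occurred
weblog$hour <- format(weblog$timestamp, "%Y-%m-%d %H:00", tz = "UTC")

# Aggregate: requests per hour and bytes served per hour
requests_per_hour <- table(weblog$hour)
bytes_per_hour    <- tapply(weblog$bytes, weblog$hour, sum)

# 10,000 log rows reduce to a 24-point time series
hits_ts <- ts(as.vector(requests_per_hour), frequency = 24)
print(hits_ts)
```

The aggregated series is a small fraction of the size of the raw log, which is the point of the slide: summarize first, then model what fits in memory.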