Introduction to R for Data Mining (Feb 2013)


Presented: Thursday, February 14, 2013
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

We at Revolution Analytics are often asked “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

Find a way of orienting yourself in the open source R world
Have a definite application area in mind
Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

Provide an orientation to R’s data mining resources
Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
Show the simple R commands to accomplish these same tasks without the GUI
Demonstrate how to build on these fundamental skills to gain further competence in R
Move away from small test data sets and show how, with the same level of skill, one can analyze fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software, as well as students who are new to data mining, should come away with a plan for getting started with R.
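
The non-GUI workflow described above (explore the data, build a classification model on a training set, then classify new data) can be sketched in a few lines of R. This is a minimal illustration and not one of the webinar's scripts: it assumes the rpart package is installed (it is a recommended package shipped with standard R distributions) and uses the built-in iris data set. The 100/50 train/test split and the seed are arbitrary choices made for this example.

```r
library(rpart)  # assumption: rpart is available (recommended package)

data(iris)
set.seed(2013)  # arbitrary seed so the split is repeatable

# Hold out 50 of the 150 rows to play the role of "new" data.
train_idx <- sample(nrow(iris), size = 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Explore the training data.
print(summary(train))

# Build a classification tree on the training set.
tree <- rpart(Species ~ ., data = train, method = "class")

# Classify the held-out rows and compare with the true species.
pred <- predict(tree, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)

print(confusion)
print(accuracy)
```

Swapping rpart for another classifier with the same formula interface mostly means changing the model-fitting line; the split, predict, and tabulate steps stay the same.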

Published in: Technology

  1. Revolution Confidential. Introduction to R for Data Mining. 2013 Webinar Series. Joseph B. Rickert, February 14, 2013
  2. First Polling Question: What is your favorite data mining software tool? 1. R 2. SAS 3. MapReduce 4. Weka 5. Other
  3. My goal for today’s webinar is to convince you that: Seriously, it is not difficult to learn enough R to do some serious data mining. R is a serious platform for data mining. Revolution R Enterprise is the platform for serious data mining.
  4. A word about Data Mining: We assume that you know a little bit about data mining and this is your context for learning R.
  5. Data Mining. Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks. Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret. Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques.
  6. Getting Orientated: What Is R?
  7. R is: The way to do statistical computing. A full-blown programming language. The home of nearly every data mining algorithm known to data science. A vibrant worldwide community. R was written in the early 1990s by Robert Gentleman and Ross Ihaka. Since 1997, a core group of about 20 developers has guided the evolution of the language.
  8. R is organized into libraries of functions called packages. R Package Growth: 4,332 packages as of 2/13/13. CRAN R download: Base; Recommended packages. User contributed packages.
  9. Finding Your Way Around the World of R: Machine Learning; Data Mining; Visualization; Finding Packages; Task Views; Blogs: Revolutions, R-Bloggers, Quick-R, Inside-R; Getting Help; Finding R People; User Groups worldwide; Twitter: #rstats
  10. Learning R: The Structure of R Facilitates Learning
  11. Learning R? Levels of R Skill: Write production grade code (R developer); Write an R package (R contributor); Write code and algorithms (R programmer); Use R functions (R user); Use a GUI (R aware). Hours of use: 10, 10,000. The Malcolm Gladwell “Outlier” Scale.
  12. Basic Machine Learning Functions (function, library, description). Cluster: hclust (stats): Hierarchical cluster analysis; kmeans (stats): Kmeans clustering. Classifiers: glm (stats): Logistic Regression; rpart (rpart): Recursive partitioning and regression trees; ksvm (kernlab): Support Vector Machine; apriori (arules): Rule based classification. Ensemble: ada (ada): Stochastic boosting; randomForest (randomForest): Random Forests classification and regression.
  13. Noteworthy Data Mining Packages (package, comment). caret: Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems. rattle: A very intuitive GUI for data mining that produces useful R code.
  14. Doing a lot with a little R: Time to Run Some Code. Scripts: 1 GETTING STARTED.R; 2 ROLL with RATTLE.R; 3 IN THE TREES.R; 4 INTRO to CARET.R; 5 BIG DATA with RevoScaleR.R; 6 WORDCLOUD.R. The R scripts are available at:
  15. Second Polling Question: What are your favorite data mining techniques? 1. Clustering techniques such as K-means 2. Single model classifiers such as decision trees, or SVMs 3. Ensemble classifiers such as Random Forests or boosting models 4. Text mining techniques 5. Other
  16. Third Polling Question (insert after running script IN THE TREES): What kind of data do you analyze? 1. Financial data 2. Customer data (e.g. for recommendations) 3. Website data (e.g. for ads) 4. Health Care data 5. Other
  17. Working with Big Data: RevoScaleR and Revolution R Enterprise
  18. Too Big for Open Source R: mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols=300000000); model <- glm(default ~ ., data=mortDF, family="binomial")
  19. RevoScaleR brings the power of Big Data to R. Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform. Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database). Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks. R Language Interface: Familiar, high-productivity programming paradigm for R users.
  20. RevoScaleR PEMAs: Parallel External Memory Algorithms. R based algorithms. Work on blocks of data. Inherently parallel and distributed. Do not require all data to be in memory at one time. Can deal with distributed and streaming data. [Diagram: blocks of an XDF file are read and intermediate results are computed in parallel, iterating as necessary over a 1st, 2nd, and 3rd pass.]
  21. More than code, R is a community: Where to Go from Here?
  22. Continuing to Learn R. Resources: RevoJoe: How to Learn R; More R Documentation (The R Journal, Books, Reference Card and more); Classes (Coursera, Revolution Analytics). Examples: Thomson Nguyen on the Heritage Health Prize; Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System; Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment; Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
  23. Some Books
  24. The R scripts are available at:
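
Returning to the parallel external memory algorithms on slide 20, the block-wise idea can be illustrated in plain R. The sketch below is not RevoScaleR code and does not read an XDF file. As an assumption for illustration, it simulates blocks by splitting an in-memory data frame, computes one small intermediate result per block (the cross-products X'X and X'y), and combines them into least-squares coefficients. All names (dat, blocks, block_stats) are hypothetical.

```r
# Simulated data; in a real external-memory setting each block
# would be read from disk instead of held in memory.
set.seed(1)
n <- 10000
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)

# Assumption for illustration: 10 equal-sized blocks of rows.
blocks <- split(dat, rep(1:10, each = n / 10))

# Per-block intermediate results: X'X and X'y for the model y ~ x.
block_stats <- lapply(blocks, function(b) {
  X <- cbind(1, b$x)                      # intercept column plus x
  list(xtx = crossprod(X), xty = crossprod(X, b$y))
})

# Combine the intermediate results by summation.
xtx <- Reduce(`+`, lapply(block_stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(block_stats, `[[`, "xty"))

# Solve the normal equations using only the combined summaries.
beta_blocks <- drop(solve(xtx, xty))

# Reference fit on the full data, for comparison.
beta_full <- unname(coef(lm(y ~ x, data = dat)))
print(all.equal(beta_blocks, beta_full))
```

Because each block's cross-products are computed independently, the lapply step could be run in parallel or fed one block at a time; only the small summed matrices need to stay in memory.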