Introduction to R for Data Mining (Feb 2013)

Presented: Thursday, February 14, 2013
Presenter: Joseph Rickert, Technical Marketing Manager, Revolution Analytics

We at Revolution Analytics are often asked, “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:

Find a way of orienting yourself in the open source R world
Have a definite application area in mind
Set an initial goal of doing something useful and then build on it

In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:

Provide an orientation to R’s data mining resources
Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
Show the simple R commands to accomplish these same tasks without the GUI
Demonstrate how to build on these fundamental skills to gain further competence in R
Move away from using small test data sets and show how, with the same level of skill, one can analyze some fairly large data sets with RevoScaleR

Data scientists and analysts using other statistical software, as well as students who are new to data mining, should come away with a plan for getting started with R.
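
As a rough illustration of the "simple R commands" the abstract refers to, the sketch below walks through exploring data, building a classifier on a training set, and classifying new data. It is not taken from the webinar scripts: the built-in iris data, the two-class simplification (virginica versus everything else), the 70/30 split, and the variable names are all assumptions made for the example.

```r
# Minimal sketch of the explore / train / classify workflow in base R.
# Assumptions: iris stands in for a real data set, and the task is
# simplified to a two-class problem (virginica vs. not virginica).
set.seed(42)

# Explore the data
data(iris)
str(iris)
summary(iris)

# Create a binary target and split into training and test sets (70/30)
iris$is_virginica <- as.integer(iris$Species == "virginica")
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Build a logistic regression classifier on the training data
fit <- glm(is_virginica ~ Petal.Length + Petal.Width,
           data = train, family = binomial)

# Classify the held-out data and tabulate predictions against the truth
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
confusion <- table(predicted = pred, actual = test$is_virginica)
print(confusion)
```

The same three steps (explore, fit on training data, predict on new data) carry over to other model functions; only the fitting call changes.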

Published in: Technology

Transcript

  • 1. Revolution Confidential. Introduction to R for Data Mining. 2013 Webinar Series. Joseph B. Rickert. February 14, 2013
  • 2. First Polling Question: What is your favorite data mining software tool? 1. R 2. SAS 3. MapReduce 4. Weka 5. Other
  • 3. My goal for today’s webinar is to convince you that: Seriously, it is not difficult to learn enough R to do some serious data mining. R is a serious platform for data mining. Revolution R Enterprise is the platform for serious data mining.
  • 4. A word about Data Mining: We assume that you know a little bit about data mining and this is your context for learning R
  • 5. Data Mining
      Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
      Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
      Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
  • 6. Getting Orientated: WHAT IS R?
  • 7. R is: The way to do statistical computing. A full-blown programming language. The home of nearly every data mining algorithm known to data science. A vibrant worldwide community. R was written in the early 1990s by Robert Gentleman and Ross Ihaka. Since 1997 a core group of ~20 developers guides the evolution of the language.
  • 8. R is organized into libraries of functions called packages. R Package Growth: 4,332 packages as of 2/13/13. CRAN R download: Base, Recommended packages. User contributed packages.
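
The split between base, recommended, and user-contributed packages on this slide can be checked on a local installation. The sketch below is not from the webinar scripts; it relies on the "Priority" column that installed.packages() reports, and it assumes that any installed package without a priority is a user-contributed one.

```r
# Sketch: group locally installed packages by the tiers named on the slide.
pkgs <- installed.packages()
priority <- pkgs[, "Priority"]

base_pkgs        <- pkgs[!is.na(priority) & priority == "base", "Package"]
recommended_pkgs <- pkgs[!is.na(priority) & priority == "recommended", "Package"]
# Assumption: packages with no priority are treated as user contributed
contributed_pkgs <- pkgs[is.na(priority), "Package"]

cat("Base packages:            ", length(base_pkgs), "\n")
cat("Recommended packages:     ", length(recommended_pkgs), "\n")
cat("User contributed packages:", length(contributed_pkgs), "\n")

# A package's functions become available once it is attached with library().
# User-contributed packages are first installed from CRAN with
# install.packages("name"), which needs network access and is not run here.
library(stats)
```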
  • 9. Finding Your Way Around the World of R: Machine Learning; Data Mining; Visualization; Finding Packages; Task Views; crantastic.org; Blogs; Revolutions; R-Bloggers; Quick-R; Inside-R; Getting Help; Finding R People; User Groups worldwide; Twitter: #rstats
  • 10. Learning R: THE STRUCTURE OF R FACILITATES LEARNING
  • 11. Learning R? Levels of R Skill: Write production grade code (R developer); Write an R package (R contributor); Write code and algorithms (R programmer); Use R functions (R user); Use a GUI (R aware). Hours of use: 10 to 10,000. The Malcolm Gladwell “Outlier” Scale
  • 12. Basic Machine Learning Functions
        Function       Library        Description
      Cluster
        hclust         stats          Hierarchical cluster analysis
        kmeans         stats          Kmeans clustering
      Classifiers
        glm            stats          Logistic Regression
        rpart          rpart          Recursive partitioning and regression trees
        ksvm           kernlab        Support Vector Machine
        apriori        arules         Rule based classification
      Ensemble
        ada            ada            Stochastic boosting
        randomForest   randomForest   Random Forests classification and regression
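
To make the clustering functions on slide 12 concrete, here is a small sketch using hclust and kmeans, both from the stats package that ships with R. It is an illustration rather than one of the webinar scripts: the iris data, the choice of three clusters, and the "complete" linkage method are assumptions.

```r
# Sketch: the two clustering functions from slide 12 (stats package).
set.seed(1)
x <- scale(iris[, 1:4])   # standardize the four numeric columns

# K-means with 3 clusters; nstart reruns from several random starts
km <- kmeans(x, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Compare each clustering with the known species labels
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```

The classifier and ensemble functions in the table follow the same pattern of a formula or data argument plus options, but most live in add-on packages that must be installed first.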
  • 13. Noteworthy Data Mining Packages
        Package   Comment
        caret     Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
        rattle    A very intuitive GUI for data mining that produces useful R code
  • 14. Doing a lot with a little R: TIME TO RUN SOME CODE. Scripts: 1 GETTING STARTED.R; 2 ROLL with RATTLE.R; 3 IN THE TREES.R; 4 INTRO to CARET.R; 5 BIG DATA with RevoScaleR.R; 6 WORDCLOUD.R. The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
  • 15. Second Polling Question: What are your favorite data mining techniques? 1. Clustering techniques such as K-means 2. Single model classifiers such as decision trees or SVMs 3. Ensemble classifiers such as Random Forests or boosting models 4. Text mining techniques 5. Other
  • 16. Third Polling Question (insert after running script IN THE TREES): What kind of data do you analyze? 1. Financial data 2. Customer data (e.g. for recommendations) 3. Website data (e.g. for ads) 4. Health Care data 5. Other
  • 17. Working with Big Data: RevoScaleR and Revolution R Enterprise
  • 18. Too Big for Open Source R
        mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
        model <- glm(default ~ ., data = mortDF, family = "binomial")
  • 19. RevoScaleR brings the power of Big Data to R
      Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform
      Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database)
      Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks
      R Language Interface: Familiar, high-productivity programming paradigm for R users
  • 20. RevoScaleR PEMAs: Parallel External Memory Algorithms
      R based algorithms
      Work on blocks of data
      Inherently parallel and distributed
      Do not require all data to be in memory at one time
      Can deal with distributed and streaming data
      [Diagram: an XDF file divided into blocks (Block 1, Block 2, ..., Block i, Block i+1, Block i+2). Read blocks and compute intermediate results in parallel, iterating as necessary (1st pass, 2nd pass, 3rd pass); results from the last block complete each pass.]
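
The block-wise idea on slide 20 can be illustrated in plain R. The sketch below is not RevoScaleR and does not use its API: it slices an in-memory vector into blocks as a stand-in for reading blocks from an XDF file, and the block size and variable names are assumptions. Each block yields a small intermediate result (count, sum, sum of squares), and the intermediate results are combined to give the mean and variance of the whole column.

```r
# Sketch of an external-memory style computation in base R (not RevoScaleR):
# process data block by block, keep only small intermediate results,
# and combine them at the end.
set.seed(123)
x <- rnorm(100000, mean = 5, sd = 2)   # stand-in for a column too big for memory
block_size <- 10000                    # assumed block size
starts <- seq(1, length(x), by = block_size)

# Per-block intermediate results: count, sum, and sum of squares.
# The blocks are independent, so this step could also run in parallel.
partials <- lapply(starts, function(s) {
  block <- x[s:min(s + block_size - 1, length(x))]
  c(n = length(block), sum = sum(block), sumsq = sum(block^2))
})

# Combine the intermediate results into whole-data statistics
totals <- Reduce(`+`, partials)
block_mean <- totals[["sum"]] / totals[["n"]]
block_var  <- (totals[["sumsq"]] - totals[["n"]] * block_mean^2) /
  (totals[["n"]] - 1)

cat("Mean from blocks:    ", block_mean, "\n")
cat("Variance from blocks:", block_var, "\n")
```

Mean and variance need only one pass over the blocks; iterative fits such as logistic regression would repeat the read-and-combine step once per iteration, which corresponds to the multiple passes in the slide's diagram.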
  • 21. More than code, R is a community: WHERE TO GO FROM HERE?
  • 22. Continuing to Learn R
      Resources: RevoJoe: How to Learn R. More R Documentation: The R Journal; Books; Reference Card and more. Classes: Coursera; Revolution Analytics.
      Examples: Thomson Nguyen on the Heritage Health Prize. Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System. Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment. Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R).
  • 23. Some Books
  • 24. The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
