Introduction to R for Data Mining

Revolution Confidential

Introduc tion to R for
Data Mining

2012 S pring Webinar S eries

J os eph B . R ic kert,
R evolution A nalytic s
J une 5, 2012

1

G oals for Today’s Webinar Revolution Confidential

To convince you that:

Seriously, it is
not difficult to
R learn enough R
is a serious to do some
platform for serious data
data mining mining

Revolution R
Enterprise is
is the platform for
serious
data mining

2

Data Mining Applications Actions Revolution Confidential
Algorithms

Credit Scoring Acquire Data CART

Fraud Detection Prepare Random Forests

Ad Optimization Classify SVM

Targeted
Predict KMeans
Marketing

Hierarchical
Gene Detection Visualize
clustering

Recommendation Ensemble
Optimize
systems Techniques

Social Networks Interpret

3

R ec ent K DD Nuggets P oll s ugges ts s o are a lot
of other s erious data miners Revolution Confidential

What Analytics, Data mining, Big Data software you used in the past 12
months for a real project (not just evaluation) [798 voters]

Software % users in 2012 % users in 2011

R (245) 30.7% 23.3%
Excel (238) 29.8% 21.8%

Rapid-I RapidMiner (213) 26.7% 27.7%
KNIME (174) 21.8% 12.1%

Weka / Pentaho (118) 14.8% 11.8%

StatSoft Statistica (112) 14.0% 8.5%

SAS (101) 12.7% 13.6%

Rapid-I RapidAnalytics (83) 10.4% Not asked in 2011

MATLAB (80) 10.0% 7.2%

IBM SPSS Statistics (62) 7.8% 7.2%

IBM SPSS Modeler (54) 6.8% 8.3%

SAS Enterprise Miner (46) 5.8% 7.1%

4


Learning R

WHAT DOE S IT ME A N TO
LE AR N R ?

5

What does it mean to learn F renc h? Revolution Confidential

To get around Paris on the Metro

To read a Menu
To carry on a conversation

6

L earning R Revolution Confidential

Levels of R Skill

Write production level code R developer

Write an R package R contributor

Write functions R programmer

Use R Functions R user

Use a GUI R aware

10 10,000
Hours of use

The Malcolm Gladwell “Outlier” Scale

7


Productive from the Get go!

T HE S T R UC T UR E OF R
FA C IL ITAT E S L E A R NING

8

R is s et up to c ompute func tions on data

lm.model
lm <- function(x,y) lm.model$assign
{ lm.model$coefficients
. . . lm.model$df.residual
} lm.model$effects
lm.model$fitted.values
.
.
.

9

A little knowledge goes a long way in R Revolution Confidential

 R’s functional design facilitates
performing small tasks
 For the most part, the output of a The trick is
knowing which
function depends only on the functions to
values of its arguments call
 calling a function multiple times
with the same values of its
arguments will produce the same
result each time
 Minimal side effects means it is
much easier to understand and
predict the behavior of a program

10

B as ic Mac hine L earning F unc tions Revolution Confidential

Function Library Description
Cluster hclust stats Hierarchical cluster analysis
kmeans stats Kmeans clustering
Classifiers glm stats Logistic Regression
rpart rpart Recursive partitioning and
regression trees
ksvm kernlab Support Vector Machine
Ensemble ada ada Stochastic boosting
randomForest randomForest Random Forests classification and
regression

11

Noteworthy Data Mining P ac kages Revolution Confidential

Package Comment
rattle A very intuitive GUI for data mining that
produces useful R code
caret Well organized and remarkably complete
collection of functions to facilitate model
building for regression and classification
problems

12


Doing a lot with a little R

T IME TO R UN S OME C ODE

13

S c ripts to run Revolution Confidential

Script Some key Functions
0 Setup Load libraries
1 Explore weather data Read.csv, plot
2 Run clustering algorithms kmeans, hclust
3 Basic decision tree rpart
4 Boosted Tree ada
5 Random Forest randomForest
6 Support Vector Machine randomForest, varImpPlot
7 Big Data Mortgage Default rxLogit, rxKmeans
model

14

B ig Data and R Revolution Confidential

There are some challenges:
 All of your data and model code must fit into
memory
 Big data sets as well as big models (lots of
variables) can run out of memory
 Parallel computation might be necessary for
models to run in a reasonable time

15

R evoS c aleR in R evolution R E nterpris e Revolution Confidential

Can help in a number of ways:
 Manipulate large data sets, and perhaps
aggregating data so that it will fit in memory
 For example, boiling down time-stamped data
like a web log to form a time series that will fit in
memory
 Run RevoScaleR Functions directly on big
data sets
 Run R functions in parallel
16

Top R evoS c aleR F unc tions for Data Mining
parallel external memory algorithms Revolution Confidential

Task RevoScaleR function
Data processing rxDataStep
Descriptive Statistics rxSumary
Tables and cubes rxCube, rxCrosstabs
Correlations / covariance rxCovCor, rxCor, rxCov,
rxSSCP
Linear Models rxLinMod
Logistic regressions rxLogit
Generalized linear models rxGlm
K means clustering rxKmeans
Predictions (scoring) rxPredict

17


More than code, R is a community

WHE R E TO G O F R OM HE R E ?

18

F inding your way around the R world Revolution Confidential

 Machine Learning
 Data Mining
 Visualization
 Finding Packages
 Task Views
 crantastic.org
 Blogs
 Revolutions
 R-Bloggers
 Quick-R
 Getting Help
 StackOverflow
 @RLangTip
 Inside-R
 www.rseek.org
 Finding R People
 User Groups worldwide
 #rstats

Word Cloud for @inside_R

19

L ook at s ome more s ophis tic ated examples Revolution Confidential

 Thomson Nguyen on the Heritage Health Prize
 Shannon Terry & Ben Ogorek (Nationwide Insurance):
A Direct Marketing In-Flight Forecasting System
 Jeffrey Breen:
Mining Twitter for Airline Consumer Sentiment
 Joe Rothermich: Alternative Data Sources for Measuring
Market Sentiment and Events (Using R)

20

R evolution A nalytic s Training Revolution Confidential

http://www.revolutionanalytics.com/
products/training/

21

R eferenc es Revolution Confidential

22



23

Introduction to R for Data Mining

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to R for Data Mining

Similar to Introduction to R for Data Mining (20)

More from Revolution Analytics

More from Revolution Analytics (20)

Recently uploaded

Recently uploaded (20)

Introduction to R for Data Mining