1. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Machine Learning with R
Joshua Reich
josh@i2pi.com
April 2, 2009
2. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Why R?
How to find out about stuff?
What is Machine Learning?
Show me the money
Learn More
Questions
3. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
ML Alternatives
• Matlab
• Weka
• Python
• Stand alone (e.g. Vowpal Wabbit)
4. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
”The best thing about R is that it was developed by statisticians.
The worst thing about R is that it was developed by statisticians.”
–Bo Cowgill, Google (at SF R Meetup)
5. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Why R?
• Working with the CLI - iterative discovery
• Integrated graphics
• Community supported packages (CRAN)
• ODBC Integration
• You already use it
6. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
How to find out about stuff?
• ?function
• help.search("search string")
• RSiteSearch("search string")
• http://rseek.org/
• names(object) or attributes(object)
• > kmeans
function (x, centers, iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong",
"Lloyd", "Forgy", "MacQueen"))
...
7. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
What is Machine Learning?
Statistics Machine Learning
Probability Model Learning Model
Observations Observations
Estimation Training
MLE Optimization
8. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Semantics v. Pragmatics
• For most statistical models there are either closed form or
quick numerical approximations for finding model properties -
e.g., confidence intervals. Assuming you believe that your
data generating process is accurately captured by your model,
then you can make direct statements about unseen events.
• Machine learning is a close cousin to non-parametric
techniques and relies on training/testing/validation cycles,
bootstrapping and cross-validation to determine measures of
reliability.
But invariably, simple models and a lot of data trump more
elaborate models based on less data
–Halevy, Norvig & Pereira.
9. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Inductive Bias
In the days when Sussman was a novice, Minsky once
came to him as he sat hacking at the PDP-6.
”What are you doing?”, asked Minsky.
”I am training a randomly wired neural net to play
Tic-tac-toe”, Sussman replied.
”Why is the net wired randomly?”, asked Minsky.
”I do not want it to have any preconceptions of how to
play”, Sussman said.
Minsky then shut his eyes.
10. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Inductive Bias
”Why do you close your eyes?” Sussman asked his
teacher.
”So that the room will be empty.”
11. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
What is Machine Learning?
• Regression vs. Classification
Y ∈ Rp
vs.
Y ∈ {Y1 , Y2 , . . . , YN }
• Supervised vs. Unsupervised Learning
Y = f (X )
vs.
X1 , X2 , . . . , XN
12. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
General Supervised Learning Framework
(Ch 7 of ElemStatLearn)
• Training / Validation / Test
• Variance - Bias Decomposition: Overfitting
• Feature Selection / Regularization
• Bootstrapping / Cross-Validation
13. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
What we will walk through
• K-Means clustering: kmeans()
• K-Nearest Neighbours: knn()
• Regression Trees: rpart()
• Improving trees with PCA: princomp()
• Linear Discriminant Analysis: lda()
• Support Vector Machines: svm()
14. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
The Problem
A
15
A
A
A A A C
AA AA C CC C C
CCC CCC C C
A A CC C C CCC CC C C C
CC C CC C
C CCCC C CCC C C C
C
A C A C CC CCCCCC CCCCCCCC
AA AAAA CCCCC ACCCCCC CCCCCCCCC C
C
C C C
10
C CCC CC CC C
C C C
CC C
C
AA A AA CCCCCCCCCCCCCCCCCCCCCC C CC
A A C CCCCCCCCCCCCCCCC CCC C
C C CCCCCCCCCCC C CC C
A AC C CCC CCCCCCCCCCCC CCCC C
CC C CCCC C C C
CCCCCCCC CCCC C C CC
CCC CCCC CCCC C
C
A C CA A ACCC C CCCCC
A AAC A A ACC CCCCC CCCCCCC C C C C C
A CC CC CCC C CC C C
C C
C CC
A A A C CCCCACACCCCCCCCCC C
C CCCCC CC C C C
C C
A C A AAACCCACCCCCCCCCCCCCCC C
A A CC CCC CCCCCCCC CC
A A CCCCC CCCCCCCC CC
A CC C C CC
A A AAA ACCACCCCCCCCCCCCCCCCCCCCCC C C
C CC C
A A C A A AACCCCCCCCCCCCC CCCCCC
C A A A AACCCC CACCCCCCCC CC C CC
A C C C
ACC C CCCCCCCC CC C
CA CA CC C C CCCC
C CC
C AAACAACCCCCCC CCCCCC C
C CC A C C C
A A A AAAACAACAACCCACCCCCCCCCCCCC C
A A A ACA A CACC C C CC C
A A ACACCCCA CCACCCC CCCCC C C
A
C
A AAAACAACCCCCCCC CCCC
AAA CCCC C CC CC C
AC CC C C C C
CA A C
C A AAAAC A A C AC
AA ACA AA C
A A AA CAAAA A AACCA AAC
AAAAA C C AC A
A AA A A A CC
5
A AA C A AAAAAAAA AA C
A AA AA
C
A AAAAA AAAA A A
C B
A A AA
A AAAA AAA
A A AAAAAAAAAAAAAAAAAA A
AAAA AAA AAAA A B BBBB
BBBBBBB
BBBBBBBBBBB
BB
AAAAAAA AAAAA A A
AA A
AAAAAAA A
AAAAAAA A A
AA B BBBBBBB BB B
B BB
A B B BBBBBBBBBBBB B
BBBB B B B
BBBBB B
B
AAAAAAAAAA A A BBB BBBBB BBB
A A AAAAAAAAAAAAAAA A A B BBBBBBBBBBBBBBBB
AAAAA AAAAAAAA B BB B
y
A AAA A A A
A AAA A A
A A AAA AAAA A A AA A
AAAAAAAAAA AA
A
A AA AAAAAAAA AAA
AAA AAAA A
A AA A BBBBBB B
B BBBBBBBBBBBBB B
BBBBBBBBBB
B BBBBBBBBBBBBBB
BBBBBBB B
B B BBBBBBBBBBBBB
BB B
B
BBBBBBBBBBBBB
BBBBBBBBBB
BBBBBBB B
A AAAAAAAAAAAAAA A BBBBBBBBBBBBBBBB B B
BBBBBBBBBB
BB BBB B
BBBBBB B B
BB B
A A AA A A
A AA A
A A A AAAAAAAAA A BBBBB BBBBBBBB
B BBB
BB B B BB
BBBBBB BBB
BBB B
B B BBBB BBBB
B BBBBBBBBBBB BB B
A A A AAAAAAAA A A A B BBBBBBBBBBBBB B B
A A AA AA A B B BBBBBBBBB
B B BB BBB B
BB BBBBBBBB
0
AAA AAAAAAAAAAAAA BBB
A A A AAAAA AA AA A A ABBBB BBBBBB BB B
A AAA A AA A
A
A AA AA A A
A AAAAAA AAAAAAA A AAAA
AAA A
A BB B
BB BB BB B
B
B
A AAAA AA AAAA AA
AA A A A
A A
AA AA AAA A AAAAA A A BB B B B
AAAAAA AAAA A
AA A
B
A A AAAA AA A AA A
A A A A
AAAAAA AAA AAA
A
AA AA A A
A A A AA A A
A A AAAAAAA AA A A A
A
AAAAA A AAA A A
AA
AA AAAAA AAAAA A A A
A AA AA
AA A A
A A
−5
A A AAA
AA A
AAAAA AAAA AA A A
AAA AAAA A
AAA A
A A A AAA A A
A
A AA A A AA A AA
A A A A
AA AA AAAAA
A AA
−10
A A A
AA
A A
A
−5 0 5 10
x
15. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Machine Learning More
• MachineLearning CRAN view
http://cran.r-project.org/web/views/MachineLearning.html
• The caret package is a good one.
• Elements of Statistical Learning
http://www-stat.stanford.edu/ tibs/ElemStatLearn/
• Machine Learning (Mitchell)
http://www.cs.cmu.edu/ tom/mlbook.html
• Video Lectures
http://videolectures.net/
16. Outline Why R? How to find out about stuff? What is Machine Learning? Show me the money Learn More Questions
Questions?