The document discusses random forests, an ensemble machine learning method. Random forests create many decision trees from random subsets of the training data and variables. This helps prevent overfitting and makes the model more robust. The document provides an example of building a random forest model to predict Titanic passenger survival using R code. It also shares online resources for learning more about random forests.
1. Class 8
Ensemble Models including Random Forests
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
8. One well-known problem with
standard classification trees is
their tendency toward overfitting
access more at legalanalyticscourse.com
9. This is because standard decision
trees are weak learners
10. A random forest aggregates weak learners
into a collective strong learner
(think of it as statistical crowdsourcing)
11. Random Forest:
Group of Decision Trees
Outperforms and is more Robust
(less likely to overfit) than a
Single Decision Tree
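The "statistical crowdsourcing" intuition can be sketched with a quick simulation (an illustrative toy example, not from the slides): many independent classifiers that are each only slightly better than chance become highly accurate under majority vote.

```python
import random

random.seed(0)
n_points = 2000
n_learners = 101
p_correct = 0.6  # each weak learner is right 60% of the time, independently

individual_hits = 0
ensemble_hits = 0
for _ in range(n_points):
    # Simulate each weak learner's vote on a point whose true label is 1.
    votes = [1 if random.random() < p_correct else 0 for _ in range(n_learners)]
    individual_hits += votes[0]  # track one weak learner's accuracy
    # Majority vote across the ensemble.
    ensemble_hits += 1 if sum(votes) > n_learners // 2 else 0

print(individual_hits / n_points)  # hovers around 0.6
print(ensemble_hits / n_points)    # far higher -- the "crowd" wins
```

The independence assumption is doing the work here, which is exactly why random forests inject randomness: it decorrelates the trees so their votes behave more like independent opinions.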
12. Breiman, L. (2001). Random forests.
Machine Learning, 45(1), 5-32.
13. Random Forest:
Ensemble method that leverages
bagging (bootstrap aggregation)
Breiman (1996)
with random substrates
Breiman (2001)
14. bagging
is applied to the training data
random substrates
is applied to / about the variables
Two Layers of Randomness
25. bagging (row data)
is applied to the training data
random substrates (column data)
is applied to / about the variables
Two Layers of Randomness
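The two layers of randomness can be made concrete with a short sketch (illustrative only; the row indices and the predictor names, borrowed from the deck's later Titanic example, are stand-ins):

```python
import random

random.seed(1)
rows = list(range(10))                     # indices of training cases
cols = ["pclass", "sex", "age", "sibsp"]   # predictor variables

# Layer 1 (bagging, row data): sample N cases WITH replacement,
# giving each tree its own bootstrap sample of the training data.
bootstrap = [random.choice(rows) for _ in rows]

# Layer 2 (random substrates, column data): at each split node,
# only a random subset of m predictors is considered.
m = 2
features_at_node = random.sample(cols, m)

print(bootstrap)         # some rows repeat; the missing ones are "out of bag"
print(features_at_node)  # a random pair of predictors for this node
```

The rows left out of a tree's bootstrap sample are its "out-of-bag" cases, which is where the OOB error estimate in the later R output comes from.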
26. “if the outlook is sunny and the humidity is less
than or equal to 70, then it’s probably OK to play.”
http://bit.ly/1icRlmE
Single Decision Tree
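The quoted rule is one root-to-leaf path of a single decision tree, and it can be written directly as code (a hypothetical `ok_to_play` helper, just to make the rule concrete):

```python
def ok_to_play(outlook, humidity):
    # Encodes the one quoted branch: sunny outlook AND humidity <= 70 -> play.
    # A full decision tree would encode every root-to-leaf path this way.
    return outlook == "sunny" and humidity <= 70

print(ok_to_play("sunny", 65))   # True
print(ok_to_play("sunny", 80))   # False
```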
29. STEP 1:
Sample N cases at random with
replacement to create a subset of
the data
(Blackwell 2012)
30. STEP 2: “At each node:
M predictor variables are selected at random from
all the predictor variables.
The predictor variable that provides the
best split, according to some objective function,
is used to do a binary split on that node.
At the next node, choose another M variables at
random from all predictor variables and do the
same.”
31. the key idea -
how do we define
best split,
according to some
objective function
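One common choice of objective function is Gini impurity, used by CART-style trees (including R's randomForest for classification): a split is "best" when it most reduces the chance that two randomly drawn labels in a node disagree. A minimal sketch (illustrative, not the slides' code):

```python
def gini(labels):
    """Gini impurity: probability that two random draws from the node disagree."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(split_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5: a perfect split removes all impurity
print(split_gain(parent, [0, 1], [0, 0, 1, 1]))  # 0.0: children mirror the parent, no gain
```

At each node, the tree evaluates candidate splits over its random subset of M predictors and keeps the one with the largest gain.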
35. Additional Notes
For Random Forest
Trees are not pruned
As potentially overfit
individual trees combine
to yield well-fit ensembles
38. 10 Different Binary Classification Methods
on
11 Different Datasets (w/ 5000 training cases each)
Random Forests were surprisingly effective
51. Example from http://bit.ly/1h1hGV4
> titanic.survival.train.rf = randomForest(as.factor(survived)
~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000,
importance=TRUE)
> titanic.survival.train.rf
Call:
randomForest(formula = as.factor(survived) ~ pclass + sex +
age + sibsp, data = titanic.train, ntree = 5000,
importance = TRUE)
Type of random forest: classification
Number of trees: 5000
No. of variables tried at each split: 2
OOB estimate of error rate: 22.62%
Confusion matrix:
0 1 class.error
0 370 38 0.09313725
1 109 133 0.45041322
52. Example from http://bit.ly/1h1hGV4
> importance(titanic.survival.train.rf)
0 1 MeanDecreaseAccuracy MeanDecreaseGini
pclass 67.26795 125.166721 126.40379 34.69266
sex 160.52060 221.803515 224.89038 62.82490
age 70.35831 50.568619 92.67281 53.41834
sibsp 60.84056 3.343251 52.82503 14.01936
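For readers working in Python, a rough scikit-learn analogue of the R call above looks like this. The Titanic CSV from the linked example is not bundled here, so a synthetic binary-classification dataset stands in, the predictor names are purely illustrative, and the tree count is reduced from 5000 for speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic training data (4 predictors, binary target).
X, y = make_classification(n_samples=650, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["pclass", "sex", "age", "sibsp"]  # illustrative labels only

# ntree -> n_estimators; mtry (2 variables tried at each split) -> max_features;
# oob_score=True gives the same out-of-bag error estimate shown in the R output.
rf = RandomForestClassifier(n_estimators=500, max_features=2,
                            oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB error rate: {1 - rf.oob_score_:.4f}")
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

scikit-learn reports a single normalized importance per variable (mean decrease in impurity) rather than R's per-class accuracy and Gini columns, but the interpretation is the same: larger values mark predictors the forest leaned on more.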
53. Legal Analytics
Class 8 - Ensemble Models including Random Forests
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com
michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com
more content available at legalanalyticscourse.com