The document discusses random forests, an ensemble machine learning method. Random forests create many decision trees from random subsets of the training data and variables. This helps prevent overfitting and makes the model more robust. The document provides an example of building a random forest model to predict Titanic passenger survival using R code. It also shares online resources for learning more about random forests.
1. Class 8
Ensemble Models including Random Forests
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
8. One well-known problem with
standard classification trees is
their tendency toward overfitting
access more at legalanalyticscourse.com
9. This is because standard decision
trees are weak learners
10. A random forest aggregates weak learners
into a collective strong learner
(think of it as statistical crowdsourcing)
11. Random Forest:
Group of Decision Trees
Outperforms and is more Robust
(less likely to overfit) than a
Single Decision Tree
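The "statistical crowdsourcing" intuition can be sketched with a quick simulation (an illustrative toy example, not from the slides): many independent classifiers that are each only slightly better than chance become highly accurate under majority vote.

```python
import random

random.seed(0)
n_points = 2000
n_learners = 101
p_correct = 0.6  # each weak learner is right 60% of the time, independently

individual_hits = 0
ensemble_hits = 0
for _ in range(n_points):
    # Simulate each weak learner's vote on a point whose true label is 1.
    votes = [1 if random.random() < p_correct else 0 for _ in range(n_learners)]
    individual_hits += votes[0]  # track one weak learner's accuracy
    # Majority vote across the ensemble.
    ensemble_hits += 1 if sum(votes) > n_learners // 2 else 0

print(individual_hits / n_points)  # hovers around 0.6
print(ensemble_hits / n_points)    # far higher -- the "crowd" wins
```

The independence assumption is doing the work here, which is exactly why random forests inject randomness: it decorrelates the trees so their votes behave more like independent opinions.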
12. Breiman, L. (2001). Random forests.
Machine Learning, 45(1), 5-32.
13. Random Forest:
Ensemble method that leverages
bagging (bootstrap aggregation)
Breiman (1996)
with random substrates
Breiman (2001)
14. bagging
is applied to the training data
random substrates
is applied to / about the variables
Two Layers of Randomness
25. bagging (row data)
is applied to the training data
random substrates (column data)
is applied to / about the variables
Two Layers of Randomness
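The two layers of randomness can be made concrete with a short sketch (illustrative only; the row indices and the predictor names, borrowed from the deck's later Titanic example, are stand-ins):

```python
import random

random.seed(1)
rows = list(range(10))                     # indices of training cases
cols = ["pclass", "sex", "age", "sibsp"]   # predictor variables

# Layer 1 (bagging, row data): sample N cases WITH replacement,
# giving each tree its own bootstrap sample of the training data.
bootstrap = [random.choice(rows) for _ in rows]

# Layer 2 (random substrates, column data): at each split node,
# only a random subset of m predictors is considered.
m = 2
features_at_node = random.sample(cols, m)

print(bootstrap)         # some rows repeat; the missing ones are "out of bag"
print(features_at_node)  # a random pair of predictors for this node
```

The rows left out of a tree's bootstrap sample are its "out-of-bag" cases, which is where the OOB error estimate in the later R output comes from.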
26. “if the outlook is sunny and the humidity is less
than or equal to 70, then it’s probably OK to play.”
http://bit.ly/1icRlmE
Single Decision Tree
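The quoted rule is one root-to-leaf path of a single decision tree, and it can be written directly as code (a hypothetical `ok_to_play` helper, just to make the rule concrete):

```python
def ok_to_play(outlook, humidity):
    # Encodes the one quoted branch: sunny outlook AND humidity <= 70 -> play.
    # A full decision tree would encode every root-to-leaf path this way.
    return outlook == "sunny" and humidity <= 70

print(ok_to_play("sunny", 65))   # True
print(ok_to_play("sunny", 80))   # False
```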
29. STEP 1:
Sample N cases at random with
replacement to create a subset of
the data
(Blackwell 2012)
30. STEP 2: “At each node:
M predictor variables are selected at random from
all the predictor variables.
The predictor variable that provides the
best split, according to some objective function,
is used to do a binary split on that node.
At the next node, choose another M variables at
random from all predictor variables and do the
same.”
31. the key idea -
how do we define
best split,
according to some
objective function
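One common choice of objective function is Gini impurity, used by CART-style trees (including R's randomForest for classification): a split is "best" when it most reduces the chance that two randomly drawn labels in a node disagree. A minimal sketch (illustrative, not the slides' code):

```python
def gini(labels):
    """Gini impurity: probability that two random draws from the node disagree."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(split_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5: a perfect split removes all impurity
print(split_gain(parent, [0, 1], [0, 0, 1, 1]))  # 0.0: children mirror the parent, no gain
```

At each node, the tree evaluates candidate splits over its random subset of M predictors and keeps the one with the largest gain.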
35. Additional Notes
For Random Forest
Trees are not pruned
As potentially overfit
individual trees combine
to yield well-fit ensembles
38. 10 Different Binary Classification Methods
on
11 Different Datasets (w/ 5000 training cases each)
Random Forests were surprisingly effective
51. Example from http://bit.ly/1h1hGV4
> titanic.survival.train.rf = randomForest(as.factor(survived)
~ pclass + sex + age + sibsp, data=titanic.train,ntree=5000,
importance=TRUE)
> titanic.survival.train.rf
Call:
randomForest(formula = as.factor(survived) ~ pclass + sex +
age + sibsp, data = titanic.train, ntree = 5000,
importance = TRUE)
Type of random forest: classification
Number of trees: 5000
No. of variables tried at each split: 2
OOB estimate of error rate: 22.62%
Confusion matrix:
0 1 class.error
0 370 38 0.09313725
1 109 133 0.45041322
52. Example from http://bit.ly/1h1hGV4
> importance(titanic.survival.train.rf)
0 1 MeanDecreaseAccuracy MeanDecreaseGini
pclass 67.26795 125.166721 126.40379 34.69266
sex 160.52060 221.803515 224.89038 62.82490
age 70.35831 50.568619 92.67281 53.41834
sibsp 60.84056 3.343251 52.82503 14.01936
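For readers working in Python, a rough scikit-learn analogue of the R call above looks like this. The Titanic CSV from the linked example is not bundled here, so a synthetic binary-classification dataset stands in, the predictor names are purely illustrative, and the tree count is reduced from 5000 for speed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic training data (4 predictors, binary target).
X, y = make_classification(n_samples=650, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["pclass", "sex", "age", "sibsp"]  # illustrative labels only

# ntree -> n_estimators; mtry (2 variables tried at each split) -> max_features;
# oob_score=True gives the same out-of-bag error estimate shown in the R output.
rf = RandomForestClassifier(n_estimators=500, max_features=2,
                            oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB error rate: {1 - rf.oob_score_:.4f}")
for name, imp in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

scikit-learn reports a single normalized importance per variable (mean decrease in impurity) rather than R's per-class accuracy and Gini columns, but the interpretation is the same: larger values mark predictors the forest leaned on more.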
53. Legal Analytics
Class 8 - Ensemble Models including Random Forests
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com
michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com
more content available at legalanalyticscourse.com