Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensemble Methods - Professor Daniel Martin Katz + Professor Michael J Bommarito

Published in: Law

  1. Class 8: Ensemble Models including Random Forests. Legal Analytics. Professor Daniel Martin Katz, Professor Michael J Bommarito II. legalanalyticscourse.com
  2. CART Approach to Decision Trees. Access more at legalanalyticscourse.com
  3. Get the Data Here: http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat
  4. Load the DataSet: x <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat")
  5. Load the DataSet: x <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat", header=TRUE) Follow Example on Pages 4-7 (example 2.1): http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
  6. Replicate This for Additional Practice: http://www3.nd.edu/~mclark19/learn/ML.pdf
  7. Random Forest
  8. One well-known problem with standard classification trees is their tendency toward overfitting
  9. This is because standard decision trees are weak learners
  10. Random forest is an approach to aggregate weak learners into collective strong learners (think of it as statistical crowdsourcing)
  11. Random Forest: a group of Decision Trees that outperforms and is more robust (less likely to overfit) than a single Decision Tree
  12. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
  13. Random Forest: ensemble method that leverages bagging (bootstrap aggregation), Breiman (1996), with random subspaces, Breiman (2001)
  14. Two Layers of Randomness: bagging is applied to the training data; random subspaces are applied to the variables
  15. What is Bagging?
  16. bagging = bootstrap aggregation: we are going to randomly sample rows (with replacement)
  17. Bagging (in general): https://www.youtube.com/watch?v=5Lu1eTiX7qM
  18. Bagging for classification: https://www.youtube.com/watch?v=JM4Y0B6Ho90
  19. Bootstrap aggregation: https://www.youtube.com/watch?v=Rm6s6gmLTdg
  20. What are Random Subspaces?
  21. random subspaces (column data) are applied to the variables
  23. Two Layers of Randomness: bagging (row data) is applied to the training data; random subspaces (column data) are applied to the variables
  24. Single Decision Tree: “if the outlook is sunny and the humidity is less than or equal to 70, then it’s probably OK to play.” http://bit.ly/1icRlmE
  25. pseudocode from
  26. Single Decision Tree (http://bit.ly/1icRlmE) and Random Forest (Blackwell 2012)
  27. STEP 1: Sample N cases at random with replacement to create a subset of the data (Blackwell 2012)
  28. STEP 2: “At each node: m predictor variables are selected at random from all the predictor variables. The predictor variable that provides the best split, according to some objective function, is used to do a binary split on that node. At the next node, choose another m variables at random from all predictor variables and do the same.”
  29. the key idea: how do we define best split, according to some objective function
  33. Additional Notes for Random Forest: trees are not pruned, as potentially overfit individual trees combine to yield well-fit ensembles
  34. Random Forests are ‘unreasonably effective’
  35. Random Forests (particularly with optimization) have proven to be unreasonably effective: http://machinelearning202.pbworks.com/w/file/fetch/37597425/performanceCompSupervisedLearning-caruana.pdf
  36. 10 different binary classification methods on 11 different datasets (w/ 5000 training cases each): Random Forests were surprisingly effective
  37. http://videolectures.net/solomon_caruana_wslmw/
  38. Some Helpful Online Resources
  39. https://www.youtube.com/watch?v=loNcrMjYh64
  40. https://www.youtube.com/watch?v=o7iDkcpOr_g
  41. http://www.stat.berkeley.edu/~breiman/RandomForests/
  42. https://www.youtube.com/watch?v=ngaQrYqxtoM#t=18
  43. Random Forest Examples Using
  44. http://www.r-bloggers.com/part-3-random-forests-and-model-selection-considerations/ http://www.r-bloggers.com/ensemble-part2-bootstrap-aggregation/ http://www.r-bloggers.com/ensemble-methods-part-1/
  45. http://www.r-bloggers.com/my-intro-to-multiple-classification-with-random-forests-conditional-inference-trees-and-linear-discriminant-analysis/
  46. Practice Using This Example
  47. Get the Dataset from Vandy: http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets (Probably Should Use this One). Read in the dataset using read.csv: x <- read.csv("/Users/katzd/Desktop/titanic3.csv") (example from http://bit.ly/1h1hGV4)
  48. Example from http://bit.ly/1h1hGV4

          > library(randomForest)
          > titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age + sibsp,
                data=titanic.train, ntree=5000, importance=TRUE)
          > titanic.survival.train.rf

          Call:
           randomForest(formula = as.factor(survived) ~ pclass + sex + age + sibsp, data = titanic.train, ntree = 5000, importance = TRUE)
                         Type of random forest: classification
                               Number of trees: 5000
          No. of variables tried at each split: 2

                  OOB estimate of error rate: 22.62%
          Confusion matrix:
              0   1 class.error
          0 370  38  0.09313725
          1 109 133  0.45041322

  49. Example from http://bit.ly/1h1hGV4

          > importance(titanic.survival.train.rf)
                         0          1 MeanDecreaseAccuracy MeanDecreaseGini
          pclass  67.26795 125.166721            126.40379         34.69266
          sex    160.52060 221.803515            224.89038         62.82490
          age     70.35831  50.568619             92.67281         53.41834
          sibsp   60.84056   3.343251             52.82503         14.01936
  50. Legal Analytics, Class 8: Ensemble Models including Random Forests. daniel martin katz: blog | ComputationalLegalStudies, corp | LexPredict, twitter | @computational, site | danielmartinkatz.com. michael j bommarito: blog | ComputationalLegalStudies, corp | LexPredict, twitter | @mjbommar, site | bommaritollc.com. more content available at legalanalyticscourse.com
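
The two steps quoted on slides 27 and 28 (sample rows with replacement, then draw m candidate predictors at each node) can be sketched in base R. This is only an illustration of the two layers of randomness, not the randomForest package's implementation: the toy data frame and every variable name below are assumptions made for the example, and no tree is actually grown.

```r
# Minimal sketch of the two layers of randomness (base R only).
# The toy data and all names here are illustrative assumptions.
set.seed(42)

toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100),
  x4 = rnorm(100)
)
toy$y <- factor(ifelse(toy$x1 + toy$x2 > 0, "yes", "no"))

predictors <- setdiff(names(toy), "y")
n <- nrow(toy)
m <- floor(sqrt(length(predictors)))  # a common default for classification

# Layer 1 (bagging): sample N row indices at random with replacement.
boot_rows <- sample(n, size = n, replace = TRUE)
boot_data <- toy[boot_rows, ]

# Rows never drawn are "out of bag" for this tree.
oob_rows <- setdiff(seq_len(n), boot_rows)

# Layer 2 (random subspaces): at each node, draw m candidate predictors
# without replacement; only these compete for the best split.
candidates_node1 <- sample(predictors, size = m)
candidates_node2 <- sample(predictors, size = m)

cat("bootstrap rows:", nrow(boot_data), "\n")
cat("unique rows drawn:", length(unique(boot_rows)), "\n")
cat("out-of-bag rows:", length(oob_rows), "\n")
cat("candidates at node 1:", candidates_node1, "\n")
cat("candidates at node 2:", candidates_node2, "\n")
```

With four predictors, m works out to 2, which is consistent with the "No. of variables tried at each split: 2" line in the Titanic output on slide 48. A full forest would repeat both layers once per tree and combine the trees' votes.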

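Slide 29 asks how to define the "best split, according to some objective function" without naming one. A common choice for classification trees is Gini impurity, so the sketch below uses it as an assumption. The helper names (`gini`, `split_impurity`, `best_split`) and the humidity values are hypothetical, chosen only to echo the "humidity less than or equal to 70" rule on slide 24.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children after splitting on x <= threshold.
split_impurity <- function(x, labels, threshold) {
  left <- labels[x <= threshold]
  right <- labels[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Try the midpoints between sorted unique values; keep the lowest impurity.
best_split <- function(x, labels) {
  v <- sort(unique(x))
  thresholds <- (head(v, -1) + tail(v, -1)) / 2
  scores <- sapply(thresholds, function(t) split_impurity(x, labels, t))
  list(threshold = thresholds[which.min(scores)], impurity = min(scores))
}

# Made-up data: play is "yes" only when humidity is at most 70.
humidity <- c(65, 68, 70, 72, 80, 85, 90, 95)
play     <- c("yes", "yes", "yes", "no", "no", "no", "no", "no")

result <- best_split(humidity, play)
print(gini(play))   # impurity before splitting
print(result)       # chosen threshold and impurity after splitting
```

On this toy data the impurity before splitting is 0.46875, and the search picks the midpoint 71, which separates the classes completely (impurity 0). In a random forest, the same search would run only over the m predictors drawn at that node.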
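The error rates printed on slide 48 can be recomputed from the confusion matrix alone, which helps when reading randomForest output. The counts below are copied from that slide; the variable names are assumptions for the example, and rows are taken to be the actual classes with columns the predicted classes.

```r
# Confusion matrix copied from the slide 48 output
# (rows = actual class, columns = predicted class).
conf <- matrix(c(370,  38,
                 109, 133),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: misclassified cases in a row divided by the row total.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: all off-diagonal cases divided by all cases.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The results match the slide: class errors of 0.09313725 and 0.45041322, and an overall out-of-bag error of 22.62% (147 misclassified out of 650). The gap between the two class errors shows that the model misses survivors (class 1) far more often than non-survivors.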