ICPSR - Complex Systems Models in the Social Sciences - Lecture 7 - Professor Daniel Martin Katz



  1. 1. Complex Systems Models in the Social Sciences (Lecture 7) daniel martin katz illinois institute of technology chicago kent college of law @computational | danielmartinkatz.com | computationallegalstudies.com
  2. 2. consider the applied case of judicial prediction
  3. 3. Every year, law reviews, magazine and newspaper articles, television and radio time, conference panels, blog posts, and tweets are devoted to questions such as: How will the Court rule in particular cases?
  4. 4. Experts, Crowds, Algorithms
  5. 5. There are 3 Known Ways to Predict Something
  6. 6. Experts, Crowds, Algorithms
  7. 7. We could apply this to a wide range of problems
  8. 8. For today we will apply these approaches to the decisions of the Supreme Court of the United States
  9. 9. this is an example of what is possible with other data
  10. 10. Experts
  11. 11. Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin & Kevin M. Quinn, The Supreme Court Forecasting Project: Legal and Political Science Approaches to Predicting Supreme Court Decision Making, Columbia Law Review (October 2004)
  12. 12. experts
  13. 13. From the 68 included cases for the 2002-2003 Supreme Court term: experts predicted 58% at the case level and 67.4% at the justice level
  14. 14. these experts probably performed badly because they overfit
  15. 15. they fit to the noise and not the signal
  16. 16. we need to evaluate experts and somehow benchmark their expertise
  17. 17. from a pure forecasting standpoint
  18. 18. the best known SCOTUS predictor is
  19. 19. Crowds
  20. 20. crowds
  21. 21. Algorithms
  22. 22. algorithms [Figure: for each justice from Black through Kagan, panels spanning 1953-2013 plot the share of Reverse votes (0.00-1.00), separately for 9-0 decisions and 8-1/7-2/6-3 decisions]
  23. 23. we have developed an algorithm that we call {Marshall}+: extremely randomized trees (ERT)
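The slides do not show the model code, so as a rough illustration only: extremely randomized trees can be fit in R via the ranger package, which implements the ERT split rule of Geurts et al. (2006). Everything below is an assumption for demonstration, not the actual {Marshall}+ pipeline; iris stands in for the SCOTUS feature data.

```r
# A sketch only: ERT in R via ranger. The dataset and every
# parameter value here are placeholders, not the authors' setup.
library(ranger)

set.seed(1)
ert <- ranger(Species ~ ., data = iris,
              num.trees         = 500,
              splitrule         = "extratrees",  # draw split points at random
              num.random.splits = 1)             # one random cut per candidate feature
ert$prediction.error  # out-of-bag misclassification rate
```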
  24. 24. Benchmarking since 1953 + Using only data available prior to the decision [Figure: importance scores for the model's features, each tagged [S] (Supreme Court Database variable), [FE] (engineered feature), or [SC] (Segal-Cover score), organized into groups: Case Information; Justice and Court Background Information; Overall Historic Supreme Court Trends; Lower Court Trends; Current Supreme Court Trends; Individual Supreme Court Justice Trends; and Differences in Trends. Group importance totals shown: 0.04403, 0.07946, 0.12663, 0.14323, 0.14456, 0.22814, and 0.23391]
  25. 25. Total Cases Predicted: 7,700 | Total Votes Predicted: 68,964
  26. 26. From 1953-2014: Justice Prediction 70.9% accuracy | Case Prediction 69.6% accuracy
  27. 27. Relies upon Random Forests, but first let's look at CART
  28. 28. Classification and Regression Trees (CART)
  29. 29. Given Some Data: (X1, Y1), ..., (Xn, Yn). Now We Have a New Set of X's and We Want to Predict the Y
  30. 30. CART (Classification & Regression Trees): Form a Binary Tree that Minimizes the Error in Each Leaf of the Tree
  31. 31. Observe the Correspondence Between the Data and Trees
  32. 32. [Figure: training points labeled 0 and 1 scattered in the (Xi1, Xi2) plane] Adapted from an example by Mathematical Monk
  33. 33. We want to build an approach which can lead to the proper classification (labeling) of new data points ( ) that are dropped into this space
  34. 34. [The same scatter of labeled points in the (Xi1, Xi2) plane]
  35. 35. Let's begin to partition the space
  36. 36. Split 1: a vertical cut at Xi1 = 1 carves off zone (a)
  37. 37. This split will be memorialized in the tree
  38. 38. We ask the question: is Xi1 > 1? with a binary (yes or no) response
  39. 39. If no, then we are in zone (a) ... we tally the number of zeros and ones and use majority rule to assign a classification to this leaf
  40. 40. Here we classify zone (a) as a 1 because its tally is (0,5): zero 0's and five 1's
  41. 41. Using a similar approach, let's begin to fill in the rest of the tree
  42. 42. Split 2: is Xi2 > 1.45?
  43. 43. Split 3: is Xi1 < 2? The tree now has three leaves: zone (a), tally (0,5), classified as 1; zone (b), tally (2,3), classified as 1; and zone (c), tally (4,1), classified as 0
  44. 44. Split 4: is Xi1 > 2.2? This adds zone (d), tally (5,0), classified as 0, and zone (e), tally (1,4), classified as 1
  45. 45. Okay, let's add back the ( ) which are new items to be classified
  46. 46. For simplicity's sake, there is one in each zone
  47. 47. We Will Use the Tree Because the Tree Is Our Prediction Machine
  48. 48. [The completed four-split tree and partitioned space, repeated from above]
  49. 49. [Dropping the six new points through the tree classifies them as: 1, 1, 1, 0, 1, 0]
  50. 50. In this simple example, we eyeballed the 2D space, partitioned it and stopped after 4 Splits
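As a hedged sketch (not from the lecture), the same toy exercise can be run in R with rpart. The point coordinates below are fabricated to mimic the partitioned space, so the learned cut points will only approximate the eyeballed ones.

```r
# Toy recreation of the 2D example; labels follow the zones above,
# and all coordinates are invented for illustration.
library(rpart)

set.seed(42)
n  <- 200
x1 <- runif(n, 0, 3)   # Xi1
x2 <- runif(n, 0, 2)   # Xi2
y  <- ifelse(x1 <= 1, 1,                               # zone (a) -> 1
      ifelse(x2 > 1.45, ifelse(x1 < 2, 1, 0),          # zones (b)/(c)
                        ifelse(x1 > 2.2, 0, 1)))       # zones (d)/(e)
d  <- data.frame(x1, x2, y = factor(y))

fit <- rpart(y ~ x1 + x2, data = d, method = "class")

# Drop new points through the fitted tree to classify them
new <- data.frame(x1 = c(0.5, 1.5, 2.5), x2 = c(1.0, 1.8, 0.5))
predict(fit, new, type = "class")
```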
  51. 51. Most Real Problems are Not So Simple ...
  52. 52. Real problems are n-dimensional (not 2D) (1)
  53. 53. For real problems, you need to select criteria (or a criterion) for deciding where to partition (split) the data (2)
  54. 54. For real problems you must develop a stopping condition or pursue recursive partitioning of the space (3)
  55. 55. Solutions to these 3 Problems are among the core questions in algorithm selection / development
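To make these three questions concrete, they surface directly as tuning knobs in off-the-shelf implementations. A sketch with R's rpart; the parameter values below are illustrative, not recommendations:

```r
library(rpart)

ctrl <- rpart.control(
  minsplit = 20,    # (3) stopping: fewest observations needed to attempt a split
  cp       = 0.01,  # (2) split criterion: required complexity improvement per split
  maxdepth = 10     # (3) cap on the depth of recursive partitioning
)
# (1) n-dimensional data is handled by the formula interface, e.g.:
# fit <- rpart(y ~ ., data = d, control = ctrl)
```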
  56. 56. From an Algorithmic Perspective - The Task is to Develop a Method to Partition the Trees
  57. 57. Must Do So Without Knowing the Specific Contours of the Data / Problem in Question
  58. 58. So How Do We Traverse Through The Data?
  59. 59. Optimal Partitioning of Trees is NP-Complete
  60. 60. “Although any given solution to an NP-complete problem can be verified quickly (in polynomial time), there is no known efficient way to locate a solution in the first place; indeed, the most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows”
  61. 61. key implication is that one cannot in advance determine the “optimal tree”
  62. 62. Breiman et al. (1984) use a Greedy Optimization Method
  63. 63. The Greedy Optimization Method is used to calculate the MLE (maximum likelihood estimate)
  64. 64. Greedy is a Heuristic that “makes the locally optimal choice at each stage with the hope of finding a global optimum. In many problems, a greedy strategy does not in general produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a global optimal solution in a reasonable time.”
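A minimal sketch of the greedy idea in R (assumed for illustration, not Breiman's code): at a single node, scan every candidate cut point on a feature and keep the one that minimizes the weighted Gini impurity of the two children.

```r
# Gini impurity of a label vector
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Greedy, locally optimal cut point for one feature
best_split <- function(x, y) {
  cuts  <- sort(unique(x))
  score <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x >  cut]
    # Weighted impurity of the two children
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(score)]
}

# Example: perfectly separable toy data splits at 0.9
best_split(x = c(0.5, 0.9, 1.2, 2.1, 2.6), y = c(1, 1, 0, 0, 0))
```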
  65. 65. CART Approach to Decision Trees
  66. 66. Get the Data Here: http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat
  67. 67. Get the Data Here: http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat Load the DataSet: x <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat")
  68. 68. Get the Data Here: http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat Load the DataSet: x <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat", header=TRUE) Follow the Example on Pages 4-7 (Example 2.1): http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
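Continuing that exercise, a sketch (not the lecture's exact code) of growing a regression tree on the loaded data with rpart. The column name MedianHouseValue comes from the file's header and is the response in that dataset.

```r
# A sketch building on the slide above: a regression tree on the
# California housing data.
library(rpart)

x <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat",
                header = TRUE)
fit <- rpart(MedianHouseValue ~ ., data = x)  # regression tree ("anova" method)
printcp(fit)          # cross-validated error at each candidate pruning level
plot(fit); text(fit)  # draw the tree with its split labels
```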
  69. 69. Replicate this On Your Own: http://www3.nd.edu/~mclark19/learn/ML.pdf
  70. 70. Applications of Classification Trees in Law
  71. 71. http://wusct.wustl.edu/media/man2.pdf
  72. 72. Random Forest
  73. 73. One well-known problem with standard classification trees is their tendency toward overfitting
  74. 74. This is because standard decision trees are weak learners
  75. 75. Random forest is an approach to aggregate weak learners into collective strong learners (think of it as statistical crowd sourcing)
  76. 76. Random Forest: a Group of Decision Trees Outperforms and is More Robust (i.e. less likely to overfit) than a Single Decision Tree
  77. 77. Random Forest: an ensemble method that leverages bagging (bootstrap aggregation), Breiman (1996), with random substrates, Breiman (2001)
  78. 78. Two Layers of Randomness: bootstrap aggregation is applied to the training data; random substrates are applied to / about the variables
  79. 79. Two Layers of Randomness: bootstrap aggregation (rows) is applied to the training data; random substrates (columns) are applied to / about the variables
  80. 80. What is Bagging?
  81. 81. bagging = bootstrap aggregation
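A from-scratch sketch of bagging in R, under assumptions: a classification problem whose training frame has a factor column y, with rpart as the base learner. The function name is invented for illustration.

```r
library(rpart)

bagged_predict <- function(train, test, B = 100) {
  # Fit B trees, each on a bootstrap resample of the training rows
  votes <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    fit  <- rpart(y ~ ., data = boot, method = "class")
    as.character(predict(fit, test, type = "class"))
  })
  # Aggregate: majority vote across the B trees for each test row
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```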
  82. 82. https://www.youtube.com/watch?v=Rm6s6gmLTdg
  83. 83. “if the outlook is sunny and the humidity is less than or equal to 70, then it’s probably OK to play.” http://bit.ly/1icRlmE Single Decision Tree
  84. 84. Single Decision Tree vs. Random Forest (Blackwell 2012) http://bit.ly/1icRlmE
  85. 85. STEP 1: Sample N cases at random with replacement to create a subset of the data (Blackwell 2012)
  86. 86. STEP 2: “At each node: m predictor variables are selected at random from all the predictor variables. The predictor variable that provides the best split, according to some objective function, is used to do a binary split on that node. At the next node, choose another m variables at random from all predictor variables and do the same.”
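Both layers of randomness come packaged in R's randomForest (Liaw and Wiener's port of Breiman's code). A sketch with iris standing in for real data; the ntree and mtry values are arbitrary choices, not recommendations.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,  # Step 1: bootstrap samples (row randomness)
                   mtry  = 2)    # Step 2: variables tried per split (column randomness)
print(rf)        # includes the out-of-bag (OOB) error estimate
importance(rf)   # per-variable importance scores
```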
  87. 87. http://www.stat.berkeley.edu/~breiman/RandomForests/
  88. 88. https://www.youtube.com/watch?v=ngaQrYqxtoM#t=18
  89. 89. Additional Notes for Random Forest: trees are not pruned, as potentially overfit individual trees combine to yield well-fit ensembles
  90. 90. Trees (particularly with optimization) have proven to be unreasonably effective: http://machinelearning202.pbworks.com/w/file/fetch/37597425/performanceCompSupervisedLearning-caruana.pdf
  91. 91. 10 Different Binary Classification Methods on 11 Different Datasets (w/ 5,000 training cases each): Trees and Forests were surprisingly effective
  92. 92. http://videolectures.net/solomon_caruana_wslmw/
  93. 93. http://www.r-bloggers.com/a-brief-tour-of-the-trees-and-forests/
  94. 94. http://www.r-bloggers.com/classification-tree-models/
  95. 95. Experts, Crowds, Algorithms
  96. 96. For most problems ... ensembles of these streams outperform any single stream
  97. 97. Humans + Machines
  98. 98. Humans + Machines >
  99. 99. Humans + Machines Humans or Machines >
  100. 100. Ensembles come in various forms
  101. 101. Here is a well known example
  102. 102. Poll Aggregation is one form of ensemble where the learning question is to determine how much weight (if any) to assign to each individual poll
  103. 103. poll weighting
  104. 104. A Visual Depiction of How to build an ensemble method in our judicial prediction example
  105. 105. [Diagram: expert, crowd, and algorithm streams feeding an ensemble method] The learning problem is to discover when to use a given stream of intelligence
  106. 106. [Diagram: expert, crowd, and algorithm streams feeding an ensemble method] Via back testing we can learn the weights to apply for particular problems
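A hypothetical sketch of that weighting idea in R (the function names and the accuracy-based weighting scheme are invented for illustration; the real system would backtest per case type): score each stream's historical accuracy, then combine current predictions with those weights.

```r
# Learn stream weights from historical (backtest) performance;
# expert/crowd/algo are 0-1 prediction vectors, truth the outcomes.
learn_weights <- function(expert, crowd, algo, truth) {
  acc <- c(expert = mean(expert == truth),
           crowd  = mean(crowd  == truth),
           algo   = mean(algo   == truth))
  acc / sum(acc)  # normalize backtested accuracies into weights
}

# Weighted vote on a new case (binary outcome, e.g. 1 = reverse)
ensemble_vote <- function(expert, crowd, algo, w) {
  as.integer(w["expert"] * expert + w["crowd"] * crowd + w["algo"] * algo > 0.5)
}
```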
  107. 107. {Marshall}+ algorithm
  108. 108. expert crowd algorithm
  109. 109. {Marshall}+ improvement will likely come from determining the optimal weighting of experts, crowds and algorithms for various types of cases
  110. 110. ERISA cases thus might look like this
  111. 111. Patent cases, perhaps, might look like this
  112. 112. Search/Seizure cases, meanwhile, could look like this
  113. 113. this is one slice of our research effort ...
  114. 114. and we are working on a series of improvements to the model
  115. 115. including structuring previously unstructured datasets
  116. 116. and using natural language processing tools (where appropriate)
