This document summarizes a presentation on predicting the outcomes of legal cases using machine learning models. It discusses extracting data from judgment documents and identifying key features for analysis. Exploratory data analysis was conducted on 202 observations to understand patterns. Logistic regression, KNN, random forest, and support vector machine models were developed. The tuned support vector machine model achieved the highest accuracy of 95% based on 10-fold cross-validation. Overall, support vector machine provided the best performance. The models tended to predict non-guilty outcomes more frequently due to the skewed data. Future work involves developing a mobile application for these predictive capabilities.
2. Ankita Singh Nilutpal Goswami
Agenda
1. Domain
2. Objective
3. Data Extraction
4. EDA
5. Architecture
6. Model Development
7. Summary and Findings
8. Q & A
3. Ankita Singh Nilutpal Goswami
Legal Systems of the World
Source – Citation (http://saint-claire.org/)
4. Ankita Singh Nilutpal Goswami
Hierarchy of Indian Judiciary
Sources of Law
§ Constitution
§ Legislation
• Ordinary
• Delegated
• Ordinance
§ Judicial Precedent
§ Customs
5. Ankita Singh Nilutpal Goswami
Background
Indian Judicial System
o Largest judicial machinery based on the biggest constitution
o Constitution of India has 448 articles in 25 parts, 12 schedules, 5 appendices and 98 amendments
o Indian Penal Code (IPC) defines various crimes/offences and prescribes the punishment
o Criminal Procedure Code (CrPC) defines the mandatory procedures to be carried out in pursuing a case
o 24 High Courts and over 600 district courts
o Nearly 5 lakh cases are filed daily in Indian courts
o Approximately 4.5 lakh cases are put up before the courts daily
o About 2.5 lakh cases are disposed of daily
6. Ankita Singh Nilutpal Goswami
Challenging Facts
CASES PENDING           CIVIL CASES   CRIMINAL CASES   TOTAL CASES   PERCENTAGE
> 10 years                  597,166        1,691,515     2,288,681        8.28%
Between 5 to 10 years     1,244,117        3,212,377     4,456,494       16.13%
Between 2 to 5 years      2,542,925        5,394,015     7,936,940       28.73%
< 2 years                 3,946,341        8,997,935    12,944,276       46.85%
Total Pending Cases       8,330,549       19,295,842    27,626,391
[Figure: pie chart of pending cases by age (> 10 years 8%, 5 to 10 years 16%, 2 to 5 years 29%, < 2 years 47%), with a grouped "Other" slice of 76%]
Source – National Judicial Data Grid (as on September 18th 2018)
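The percentage column can be re-derived from the totals in the table above. The following is a minimal sketch in Python; the variable and function names are illustrative and not part of the original presentation.

```python
# Pending-case totals by age band, copied from the table above.
pending_totals = {
    "> 10 years": 2_288_681,
    "Between 5 to 10 years": 4_456_494,
    "Between 2 to 5 years": 7_936_940,
    "< 2 years": 12_944_276,
}


def percentage_shares(totals):
    """Return each band's share of the grand total, in percent (2 decimals)."""
    grand_total = sum(totals.values())
    return {band: round(100 * count / grand_total, 2)
            for band, count in totals.items()}


shares = percentage_shares(pending_totals)
for band, share in shares.items():
    print(f"{band}: {share}%")
```

The band totals sum to the reported 27,626,391 pending cases, and the computed shares match the PERCENTAGE column.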
• Case Disposal Rates (August 2018)
  § 10 years – 1.5%
  § All cases – 3.8%
• Cases filed daily: ~5-8 lakh
• Cases pending registration: ~7.5 lakh
• 15 judges for every 1 million people
• 22.2 million undertrials; undertrials outnumber the convicts
7. Ankita Singh Nilutpal Goswami
Objective
• Faster processing of legal issues / cases
• “Judgement” data sourcing and understanding of the details
• Evaluating predictions based on various machine learning models
• Develop social value by means of streamlining the judicial case intake
8. Ankita Singh Nilutpal Goswami
Data
[Figure: sample judgement document snapshot]
o Case Documents Analyzed – 120
o Data extraction mechanism – manual
o Unique fields extracted – 58
o Total number of final observations – 202
Original Features
• Nature of Disposal
• Case Type
• Court Number
• Court Name
• Judge
• Judge Gender
• Judgement Date
• Total Number of Sections
• Section 1 thru Section 10
• FIR Number/Year
• Police station
• Investigating officer
• Case Number
• Year
• Complainant
• Total Accused
• Accused #
• Accused Name
• Accused Gender
• Accused Age
• Accused Confessed? (plea)
• Date of first hearing
• Complainant advocate
• Prosecution advocate
• Advocate Defendant
• Number of Prosecution witnesses
• Names of prosecution witnesses
• PWs Examined?
• Number of hostile witnesses
• Defense witnesses
• Charge sheet
• Points for consideration
• Exhibits on behalf of prosecution (P series)
• Number of exhibits considered
• Exhibits on behalf of court (C series)
• Exhibits on behalf of accused (D series)
• Total Number of Material Objects
• Charges proved
• Charges not proved
• Issues Proved
• Issues Not Proved
• Accused released on bail
• Accused committed to prison
• Sentence of Imprisonment granted
• Fine with Imprisonment (Rs)
• Term Served in Prison (days)
• Set off (if any)
• Judgement
• Citations
9. Ankita Singh Nilutpal Goswami
Data
• Source – Publicly available judgement documents
• Case Documents Analyzed – 120
• Data extraction mechanism – manual
• Unique fields extracted – 58
• Consistent features identified – 15 (Judgement decision is the Target Variable)
• Total number of final observations – 202
#   Feature Name  Description                                                      Datatype     Value
1   ipc_420       Binary indicator to confirm if the case is filed under IPC 420   Categorical  Yes=1, No=0
2   ipc_120b      Binary indicator to confirm if the case is filed under IPC 120b  Categorical  Yes=1, No=0
3   ipc_471       Binary indicator to confirm if the case is filed under IPC 471   Categorical  Yes=1, No=0
4   ipc_468       Binary indicator to confirm if the case is filed under IPC 468   Categorical  Yes=1, No=0
5   ipc_34        Binary indicator to confirm if the case is filed under IPC 34    Categorical  Yes=1, No=0
6   jud_gender    Gender of the judge presiding over the case                      Categorical  Male=0, Female=1
7   jud_date      Date when the judgement was delivered                            Date         Date
8   tot_sec       Total number of sections filed for the case                      Numeric      Number
9   case_no       Unique number of the case                                        Categorical  Multiple Factors
10  comp          Complainant name *                                               String       Name
11  tot_accu      Total number of accused presented in the case                    Numeric      Number
12  accu_gender   Gender of the individual accused                                 Categorical  Male=0, Female=1
13  accu_no       Sequence number of the accused                                   Categorical  Multiple Factors
14  accu_age      Age of the accused                                               Numeric      Number
15  judgement     Judgement given in the case                                      Categorical  Guilty=1, Not Guilty=0
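To make the encoding in the feature table concrete, here is a minimal Python sketch that turns one manually extracted record into a feature row. The helper `encode_observation`, its arguments, and the example record are hypothetical; the sketch also assumes one row per accused, which would be consistent with 120 documents yielding 202 observations but is not stated in the slides.

```python
# IPC sections tracked as binary indicator features in the table above.
TRACKED_SECTIONS = ("420", "120b", "471", "468", "34")


def encode_observation(sections, judge_gender, accused_gender,
                       accused_age, total_accused, verdict):
    """Encode one accused person as a feature row (hypothetical helper)."""
    filed = {s.strip().lower() for s in sections}
    row = {f"ipc_{s}": int(s in filed) for s in TRACKED_SECTIONS}
    row["tot_sec"] = len(filed)                            # sections filed
    row["tot_accu"] = total_accused
    row["jud_gender"] = int(judge_gender == "Female")      # Male=0, Female=1
    row["accu_gender"] = int(accused_gender == "Female")   # Male=0, Female=1
    row["accu_age"] = accused_age
    row["judgement"] = int(verdict == "Guilty")            # Guilty=1, Not Guilty=0
    return row


# Made-up example record, not taken from the dataset.
example = encode_observation(
    sections=["420", "120B", "34"],
    judge_gender="Male",
    accused_gender="Female",
    accused_age=41,
    total_accused=2,
    verdict="Not Guilty",
)
print(example)
```

Identifier-like fields such as case_no and comp are left out of the sketch, since they label a case rather than describe it.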
16. Ankita Singh Nilutpal Goswami
Development Methodology
INITIAL MODEL DEVELOPMENT STEPS
1. Data Collection
2. Feed data to model
3. Prediction
POST IMPLEMENTATION STEPS
• Feedback
17. Ankita Singh Nilutpal Goswami
Model Development
• Logistic Regression
• K-Nearest Neighbor
• Random Forest
• Support Vector Machine
19. Ankita Singh Nilutpal Goswami
Logistic Regression
• Pseudo R-square: 45.4%, the improvement of the full model over the intercept-only model
• The log-likelihood ratio test rejects the null hypothesis that all betas are zero, so at least one beta is nonzero
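The two quantities on this slide can be computed from the log-likelihoods of the intercept-only and full models. The sketch below assumes McFadden's definition of pseudo R-square (the slides do not say which variant was used), and the log-likelihood values are hypothetical, chosen only so that the result reproduces the reported 45.4%.

```python
def mcfadden_r2(ll_null, ll_full):
    """McFadden pseudo R-square: 1 - LL(full) / LL(intercept-only)."""
    return 1.0 - ll_full / ll_null


def likelihood_ratio_statistic(ll_null, ll_full):
    """Likelihood ratio test statistic: 2 * (LL(full) - LL(intercept-only))."""
    return 2.0 * (ll_full - ll_null)


# Hypothetical log-likelihoods, for illustration only.
ll_null = -100.0   # intercept-only model
ll_full = -54.6    # full model

r2 = mcfadden_r2(ll_null, ll_full)
lr_stat = likelihood_ratio_statistic(ll_null, ll_full)
print(f"pseudo R-square = {r2:.3f}, LR statistic = {lr_stat:.1f}")
```

Under the null hypothesis that all betas are zero, the LR statistic follows a chi-square distribution with as many degrees of freedom as there are predictors; a large value leads to rejecting the null.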
21. Ankita Singh Nilutpal Goswami
K-Nearest Neighbor
Cross-Validation
• 10-fold cross-validation gave the best value at k=7
• From the results, Accuracy and Kappa decline after k=5
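Choosing k by 10-fold cross-validation can be sketched in plain Python as follows. This is an illustrative re-implementation on synthetic two-cluster data, not the authors' code or dataset; the function names and the candidate values of k are assumptions.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to the query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(data, k, folds=10, seed=0):
    """Mean accuracy of k-NN over `folds` held-out partitions."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    accuracies = []
    for i in range(folds):
        test = shuffled[i::folds]   # every folds-th item is held out
        train = [item for j, item in enumerate(shuffled) if j % folds != i]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds


# Synthetic data: two Gaussian clusters standing in for the two classes.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(50)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(50)]

results = {k: cross_validated_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

Odd values of k avoid tied votes in a two-class problem, which is why only odd candidates are tried here.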
22. Ankita Singh Nilutpal Goswami
K-Nearest Neighbor
• Model was further tuned by setting twoClassSummary and classProbs = TRUE
• Tuned model has a better accuracy of 93.44%
23. Ankita Singh Nilutpal Goswami
Random Forest
Model parameters -
• ntree = 250, as the OOB error hardly changes after 250 trees
• mtry = 3, initially taken as sqrt(total_no_of_features)
• nodesize = 3, 1% of the total observations (202 observations)
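The starting values for mtry and nodesize follow from simple arithmetic. The sketch below assumes 14 predictors (the 15 consistent features minus the target) and assumes that the square root is rounded down and the 1% figure is rounded up; neither rounding rule is stated in the slides.

```python
import math

n_observations = 202   # final observations, from the data slide
n_predictors = 14      # assumption: 15 consistent features minus the target

# Starting point for mtry: square root of the number of predictors.
mtry_start = math.floor(math.sqrt(n_predictors))

# Minimum terminal node size: 1% of the observations, rounded up.
nodesize = math.ceil(0.01 * n_observations)

print(f"mtry = {mtry_start}, nodesize = {nodesize}")
```

Both values come out as 3 under these assumptions, in line with the parameters listed above.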
24. Ankita Singh Nilutpal Goswami
Random Forest
• Cross-validation with parameter tuning over mtry = 2, 3 and 4
• Tuned model has an accuracy of 93.55%
25. Ankita Singh Nilutpal Goswami
Support Vector Machine
• Model found 41 support vectors with a gamma value of 0.017 and cost of 1
• SVM model accuracy: 90.02%
26. Ankita Singh Nilutpal Goswami
Support Vector Machine
• 10-fold cross-validation identified the best values of gamma = 0.1 and cost = 1
• Tuned model has an accuracy of 95.04%
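To see what the change in gamma from 0.017 to 0.1 does, the sketch below evaluates a radial basis function (RBF) kernel at both values. The slides do not name the kernel, so RBF is an assumption here, suggested by the presence of a gamma parameter; the two points are made up.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two illustrative points at Euclidean distance 5 from each other.
x, y = (0.0, 0.0), (3.0, 4.0)

similarity_initial = rbf_kernel(x, y, gamma=0.017)  # initial model
similarity_tuned = rbf_kernel(x, y, gamma=0.1)      # tuned model
print(similarity_initial, similarity_tuned)
```

A larger gamma makes similarity fall off faster with distance, so each support vector influences a smaller neighbourhood and the decision boundary can bend more closely around the training points.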
27. Ankita Singh Nilutpal Goswami
Summary - Model Performance
Model                      Accuracy (%)   Precision (%)   Recall (%)
Decision Trees (Gini)      82             82              97
K-Nearest Neighbor         93             93              100
Logistic Regression        88             96              91
Naïve Bayes                75             76              95
Random Forest              94             93              100
Support Vector Machines    95             94              99
Observation
o From the assessment of all the models, Support Vector Machine provides the best accuracy, with precision and recall among the highest.
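Accuracy, precision and recall all derive from a confusion matrix. The sketch below shows the definitions; the counts are hypothetical (chosen to total 202 observations) and are not the confusion matrix of any model in the table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of all predictions correct
        "precision": tp / (tp + fp),      # correct among predicted positives
        "recall": tp / (tp + fn),         # correct among actual positives
    }


# Hypothetical counts for 202 observations, for illustration only.
counts = {"tp": 178, "fp": 9, "fn": 2, "tn": 13}
metrics = classification_metrics(**counts)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Which class counts as "positive" changes precision and recall but not accuracy, so the choice matters when reading the table on a skewed dataset.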
28. Ankita Singh Nilutpal Goswami
Summary and Findings
• Support Vector Machine provides the best accuracy
• Better Precision and Recall values obtained from SVM and Gradient Boosting
• Random data is skewed towards Non-Guilty cases (89 : 11 in favor of Non-Guilty)
• Model has been developed on IPC 420 cases found across multiple District / High Courts
• Predictions obtained were mostly Non-Guilty
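The skew matters when reading the accuracy figures: a classifier that always predicts Non-Guilty already scores close to the class share. The sketch below works this out, assuming the 89 : 11 split applies to all 202 observations (the slides do not give the exact class counts).

```python
n_observations = 202
non_guilty_share = 0.89   # from the 89 : 11 split reported above

# Approximate class counts implied by the split (an assumption).
n_non_guilty = round(n_observations * non_guilty_share)
n_guilty = n_observations - n_non_guilty

labels = ["Non-Guilty"] * n_non_guilty + ["Guilty"] * n_guilty
predictions = ["Non-Guilty"] * len(labels)   # always predict the majority

pairs = list(zip(predictions, labels))
baseline_accuracy = sum(p == y for p, y in pairs) / len(labels)
guilty_recall = sum(p == y == "Guilty" for p, y in pairs) / n_guilty

print(f"baseline accuracy = {baseline_accuracy:.2%}")
print(f"recall on Guilty  = {guilty_recall:.2%}")
```

Under this assumption the majority-class baseline reaches about 89% accuracy while never identifying a Guilty case, so the reported accuracies are best compared against that baseline rather than against 50%.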