Predicting the outcome of legal cases using machine learning algorithms, by Ankita Singh (Service Delivery Specialist, IBM India) and Nilutpal Goswami (Senior Manager, Capgemini), at CYPHER 2018

PREDICTING OUTCOME OF LEGAL CASES
Ankita Singh, Nilutpal Goswami

Agenda
1 Domain
2 Objective
3 Data Extraction
4 EDA
5 Architecture
6 Model Development
7 Summary and Findings
8 Q & A

Legal Systems of the World
Source – Citation (http://saint-claire.org/)

Hierarchy of Indian Judiciary
Sources of Law
§ Constitution
§ Legislation
• Ordinary
• Delegated
• Ordinance
§ Judicial Precedent
§ Customs

Background
Indian Judicial System
o Largest judicial machinery, based on the biggest constitution
o The Constitution of India has 448 articles in 25 parts, 12 schedules, 5 appendices and 98 amendments
o The Indian Penal Code (IPC) defines various crimes/offences and prescribes the punishment
o The Criminal Procedure Code (CrPC) defines the mandatory procedures to be carried out in pursuing a case
o 24 High Courts and over 600 district courts
o Nearly 5 lakh cases are filed daily in Indian courts
o Approximately 4.5 lakh cases are put up before the courts daily
o About 2.5 lakh cases are disposed of daily

Challenging Facts
Cases pending            Civil cases   Criminal cases   Total cases   Percentage
> 10 years               597,166       1,691,515        2,288,681     8.28%
Between 5 and 10 years   1,244,117     3,212,377        4,456,494     16.13%
Between 2 and 5 years    2,542,925     5,394,015        7,936,940     28.73%
< 2 years                3,946,341     8,997,935        12,944,276    46.85%
Total pending cases      8,330,549     19,295,842       27,626,391
[Pie chart: pending cases by age. > 10 years 8%; between 5 and 10 years 16%; between 2 and 5 years 29%; < 2 years 47%; a segment labelled "Other" 76%]
Source – National Judicial Data Grid (as on September 18th 2018)
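
The Percentage column can be reproduced from the civil and criminal counts. The following is a minimal Python sketch of that arithmetic; the names `pending` and `age_shares` are illustrative and do not come from the slides.

```python
# Pending-case counts copied from the table above: (civil, criminal).
pending = {
    "> 10 years": (597_166, 1_691_515),
    "5 to 10 years": (1_244_117, 3_212_377),
    "2 to 5 years": (2_542_925, 5_394_015),
    "< 2 years": (3_946_341, 8_997_935),
}


def age_shares(table):
    """Return the grand total and each age bucket's share of it, in percent."""
    totals = {bucket: civil + criminal for bucket, (civil, criminal) in table.items()}
    grand_total = sum(totals.values())
    shares = {bucket: round(100 * t / grand_total, 2) for bucket, t in totals.items()}
    return grand_total, shares


grand_total, shares = age_shares(pending)
print(f"Total pending cases: {grand_total:,}")
for bucket, share in shares.items():
    print(f"{bucket}: {share}%")
```

The computed total (27,626,391) and the four shares agree with the figures quoted from the National Judicial Data Grid.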
§ Case disposal rates (August 2018)
• 10 years – 1.5%
• All cases – 3.8%
§ Cases filed daily: ~5 to 8 lakh
§ Cases pending registration: ~7.5 lakh
§ 15 judges for every 1 million people
§ 22.2 million undertrials; undertrials outnumber the convicts

Objective
o Faster processing of legal issues / cases
o “Judgement” data sourcing and understanding of the details
o Evaluating predictions based on various machine learning models
o Develop social value by means of streamlining the judicial case intake

Data
Sample Judgement document snapshot
o Case documents analyzed – 120
o Data extraction mechanism – manual
o Unique fields extracted – 58
o Total number of final observations – 202
Original Features
• Nature of Disposal
• Case Type
• Court Number
• Court Name
• Judge
• Judge Gender
• Judgement Date
• Total Number of Sections
• Section 1 thru Section 10
• FIR Number/Year
• Police station
• Investigating officer
• Case Number
• Year
• Complainant
• Total Accused
• Accused #
• Accused Name
• Accused Gender
• Accused Age
• Accused Confessed? (plea)
• Date of first hearing
• Complainant advocate
• Prosecution advocate
• Advocate Defendant
• Number of Prosecution witnesses
• Names of prosecution witnesses
• PW's Examined?
• Number of hostile witnesses
• Defense witnesses
• Charge sheet
• Points for consideration
• Exhibits on behalf of prosecution (P series)
• Number of exhibits considered
• Exhibits on behalf of court (C series)
• Exhibits on behalf of accused (D series)
• Total Number of Material Objects
• Charges proved
• Charges not proved
• Issues Proved
• Issues Not Proved
• Accused released on bail
• Accused committed to prison
• Sentence of Imprisonment granted
• Fine with Imprisonment (Rs)
• Term Served in Prison (days)
• Set off (if any)
• Judgement
• Citations

Data
• Source – publicly available judgement documents
• Case documents analyzed – 120
• Data extraction mechanism – manual
• Unique fields extracted – 58
• Consistent features identified – 15 (judgement decision is the target variable)
• Total number of final observations – 202
#   Feature Name  Description                                                      Datatype     Value
1   ipc_420       Binary indicator to confirm if the case is filed under IPC 420   Categorical  Yes=1, No=0
2   ipc_120b      Binary indicator to confirm if the case is filed under IPC 120b  Categorical  Yes=1, No=0
3   ipc_471       Binary indicator to confirm if the case is filed under IPC 471   Categorical  Yes=1, No=0
4   ipc_468       Binary indicator to confirm if the case is filed under IPC 468   Categorical  Yes=1, No=0
5   ipc_34        Binary indicator to confirm if the case is filed under IPC 34    Categorical  Yes=1, No=0
6   jud_gender    Gender of the judge presiding over the case                      Categorical  Male=0, Female=1
7   jud_date      Date the judgement was delivered                                 Date         Date
8   tot_sec       Total number of sections filed for the case                      Numeric      Number
9   case_no       Unique number of the case                                        Categorical  Multiple Factors
10  comp          Complainant name *                                               String       Name
11  tot_accu      Total number of accused presented in the case                    Numeric      Number
12  accu_gender   Gender of the individual accused                                 Categorical  Male=0, Female=1
13  accu_no       Sequence number of the accused                                   Categorical  Multiple Factors
14  accu_age      Age of the accused                                               Numeric      Number
15  judgement     Judgement given in the case                                      Categorical  Guilty=1, Not Guilty=0
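
To make the encoding in the feature table concrete, here is a minimal Python sketch that turns one manually extracted record into a numeric row. The `AccusedRecord` container, the `encode` function and the sample values are hypothetical; only the feature names and the 0/1 coding come from the table.

```python
from dataclasses import dataclass

# The five IPC sections that the feature table turns into 0/1 indicator columns.
TRACKED_SECTIONS = ("420", "120b", "471", "468", "34")


@dataclass
class AccusedRecord:
    """One observation: one accused in one case (hypothetical container)."""
    sections: list      # IPC sections charged, e.g. ["420", "34"]
    accu_gender: str    # "Male" or "Female"
    accu_age: int
    found_guilty: bool


def encode(record):
    """Map a raw record to the feature names and codes used in the table."""
    charged = {s.strip().lower() for s in record.sections}
    row = {f"ipc_{s}": int(s in charged) for s in TRACKED_SECTIONS}
    row["tot_sec"] = len(charged)                                  # total sections filed
    row["accu_gender"] = 1 if record.accu_gender == "Female" else 0
    row["accu_age"] = record.accu_age
    row["judgement"] = int(record.found_guilty)                    # Guilty=1, Not Guilty=0
    return row


# Made-up example record, not taken from the case data.
row = encode(AccusedRecord(["420", "120B", "34"], "Female", 41, False))
print(row)
```

Because the 202 observations come from 120 documents, one row per accused (rather than per case) is assumed here.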

Feature Importance

Exploratory Data Analysis
Guilty – 20, Non-Guilty – 182
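
With classes this unbalanced, any accuracy figure needs a reference point. The sketch below (the function name is illustrative) computes the accuracy of a trivial rule that always predicts the most frequent class, using only the two counts above.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    label, count = max(class_counts.items(), key=lambda item: item[1])
    return label, count / total


label, accuracy = majority_baseline({"Guilty": 20, "Non-Guilty": 182})
print(f"Always predicting {label}: {accuracy:.1%}")  # 182 / 202, about 90.1%
```

A model therefore has to score well above roughly 90% on comparable data before its accuracy says much about the Guilty class.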

Density Plot

IPC sections frequency

Correlation Matrix

Architecture

Development Methodology
INITIAL MODEL DEVELOPMENT STEPS
1 Data Collection
2 Feed data to model
3 Prediction
POST IMPLEMENTATION STEPS
FEEDBACK

Model Development
• Logistic Regression
• K-Nearest Neighbor
• Random Forest
• Support Vector Machine

Model Development

Logistic Regression
Pseudo R-square = 45.4%, measuring the improvement of the full model over the intercept-only model.
The log-likelihood ratio test rejects the null hypothesis that all betas are zero, so at least one beta is nonzero.
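
The slide does not say which pseudo R-square was used. As an assumption, the sketch below uses McFadden's definition, 1 - lnL(full) / lnL(intercept-only), together with the usual likelihood-ratio statistic. The log-likelihood values are hypothetical, chosen only so that the ratio comes out at 45.4%.

```python
def mcfadden_r2(ll_full, ll_null):
    """McFadden pseudo R-square: 1 - lnL(full) / lnL(intercept-only)."""
    return 1.0 - ll_full / ll_null


def likelihood_ratio_stat(ll_full, ll_null):
    """G = 2 * (lnL(full) - lnL(null)).

    Under the null hypothesis that all betas are zero, G is compared against
    a chi-square distribution with as many degrees of freedom as predictors.
    """
    return 2.0 * (ll_full - ll_null)


# Hypothetical log-likelihoods, NOT taken from the fitted model.
ll_null, ll_full = -65.0, -35.49
r2 = mcfadden_r2(ll_full, ll_null)
g = likelihood_ratio_stat(ll_full, ll_null)
print(f"pseudo R-square = {r2:.3f}, LR statistic = {g:.2f}")
```

A large G relative to the chi-square critical value is what leads to rejecting the all-betas-zero hypothesis.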

Accuracy
• Training Sample – 92.9 %
• Validation Sample – 88.5 %
Logistic Regression
Variable Importance

K-Nearest Neighbor
Cross-Validation
10-fold cross-validation gave the best value at k=7.
From the results, accuracy and kappa decrease after k=5.
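
Choosing k by 10-fold cross-validation and reporting accuracy alongside kappa can be sketched in plain Python as follows. This is not the authors' code: the data is synthetic, and the function names, fold assignment and seed are assumptions made for illustration.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cohen_kappa(actual, predicted):
    """Agreement between predictions and truth, corrected for chance agreement."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq, predicted_freq = Counter(actual), Counter(predicted)
    expected = sum(actual_freq[c] * predicted_freq[c] for c in actual_freq) / n**2
    return 0.0 if expected == 1 else (observed - expected) / (1 - expected)


def cross_validate(data, k, folds=10, seed=0):
    """Return (accuracy, kappa) of k-NN, pooled over all held-out folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    actual, predicted = [], []
    for i in range(folds):
        held_out = rows[i::folds]  # every folds-th row, starting at i
        train = [row for j, row in enumerate(rows) if j % folds != i]
        for features, label in held_out:
            actual.append(label)
            predicted.append(knn_predict(train, features, k))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    return accuracy, cohen_kappa(actual, predicted)


# Synthetic, unbalanced two-class data (NOT the case data).
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(90)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), 1) for _ in range(30)]

for k in (1, 3, 5, 7, 9):
    accuracy, kappa = cross_validate(data, k)
    print(f"k={k}: accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

Kappa is worth tracking next to accuracy on skewed data: a rule that always predicts the majority class can reach high accuracy while its kappa is 0.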

K-Nearest Neighbor
The model was further tuned using twoClassSummary, with classProbs set to TRUE.
The tuned model has a better accuracy of 93.44%.

Random Forest
Model parameters -
• ntree = 250 (the OOB error hardly changes after 250 trees)
• mtry = 3 (initially taken as sqrt(total_no_of_features))
• nodesize = 3 (about 1% of the total of 202 observations)
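
The two rules of thumb above can be written out as a short sketch. The rounding directions (floor for mtry, ceiling for nodesize) and the count of 14 predictors (15 consistent features minus the target) are assumptions, chosen because they reproduce the values on the slide.

```python
import math


def starting_forest_params(n_features, n_observations):
    """Rule-of-thumb starting values; later tuning may move them."""
    return {
        # Number of predictors tried at each split: floor(sqrt(p)).
        "mtry": math.floor(math.sqrt(n_features)),
        # Minimum size of terminal nodes: about 1% of the observations.
        "nodesize": math.ceil(0.01 * n_observations),
    }


params = starting_forest_params(n_features=14, n_observations=202)
print(params)  # {'mtry': 3, 'nodesize': 3}
```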

Random Forest
Cross-validation with parameter tuning over mtry = 2, 3 and 4.
The tuned model has an accuracy of 93.55%.

Support Vector Machine
• The model found 41 support vectors with a gamma value of 0.017 and a cost of 1
• SVM model accuracy: 90.02%

Support Vector Machine
10-fold cross-validation identified the best values as gamma = 0.1 and cost = 1.
The tuned model has an accuracy of 95.04%.
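
The slides do not name the kernel, but a gamma parameter suggests a radial basis function (RBF) kernel. Assuming that, the sketch below shows what moving gamma from 0.017 to 0.1 does to the similarity between two cases. The feature vectors are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical encoded cases (not taken from the data set).
case_a = (1, 0, 3, 35)
case_b = (1, 1, 2, 41)

for gamma in (0.017, 0.1):
    print(f"gamma={gamma}: K = {rbf_kernel(case_a, case_b, gamma):.4f}")
```

A larger gamma makes similarity fall off faster with distance, so each support vector influences a smaller neighbourhood; cost separately controls how heavily misclassified training points are penalised.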

Summary - Model Performance
Observation
o Of all the models assessed, the Support Vector Machine gives the highest accuracy (95%), with precision and recall that are also high.
Model                     Accuracy (%)   Precision (%)   Recall (%)
Decision Trees (Gini)     82             82              97
K-Nearest Neighbor        93             93              100
Logistic Regression       88             96              91
Naïve Bayes               75             76              95
Random Forest             94             93              100
Support Vector Machines   95             94              99
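
For reference, the three reported metrics follow from a confusion matrix as sketched below. The counts are hypothetical, and treating Non-Guilty as the positive class is an assumption, since the slides do not state which class the precision and recall refer to.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the class treated as positive."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),   # of predicted positives, how many are right
        "recall": tp / (tp + fn),      # of actual positives, how many are found
    }


# Hypothetical validation counts with Non-Guilty as the positive class.
metrics = classification_metrics(tp=54, fp=3, fn=0, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts, recall is 100% even though three cases are misclassified, which shows how high recall on the majority class can coexist with errors on the minority class.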

Summary and Findings
• Support Vector Machine provides the best accuracy
• Better Precision and Recall values were obtained from SVM and Gradient Boosting
• The data is skewed towards Non-Guilty cases (89:11 in favor of Non-Guilty)
• The model has been developed on IPC 420 cases found across multiple District / High Courts
• The predictions obtained were mostly Non-Guilty

Plan Ahead
MOBILE APPLICATION

Q&A
Thanks
