Justice Delayed is Justice Denied: Enabling Legal Artificial Intelligence via Bail Prediction on Hindi Case Documents

IIIT Hyderabad
Arnav Kapoor (20171040)
Advisor - Prof. Ponnurangam Kumaraguru
Examiners
Prof. Srijoni Sen
Prof. Saptarshi Ghosh
https://www.ndtv.com/india-news/artificial-intelligence-help-reduce-backlog-of-pending-cases-law-minister-kiren-rijiju-2627342
https://www.livemint.com/opinion/columns/judicial-delays-and-the-need-for-intervention-at-the-district-level-11653237233189.html
https://www.financialexpress.com/opinion/tech-to-the-aid-of-justice-delivery/2242658/
Enabling Legal Artificial Intelligence via Bail
Prediction on Hindi Case Documents
Legal Artificial Intelligence
Legal Artificial Intelligence (LegalAI) is concerned with the use of artificial
intelligence technologies, particularly NLP, for legal tasks.
Bail Prediction
An automated system to predict if the Bail (judicial interim release of a person
suspected of a crime) would be granted or denied.
Hindi Case Documents
This is the first work to focus on Hindi legal documents.
Motivation
The legal system plays a critical role in the functioning of a nation. It is paramount
that cases are handled judiciously and efficiently.
In recent times, the legal system in many populous countries (e.g., India) has been
inundated with a large number of pending cases.
There is a pressing need to build systems that process legal documents and help
augment the legal procedure.
- How many such pending cases are there, and where are the majority of them
concentrated?
- According to India’s National Judicial Data Grid, as of November 2021 there are
approximately 40 million cases pending in District Courts (National Judicial Data
Grid, 2021), as opposed to 5 million cases (1/8th) pending in High Courts.
- There is a need to develop models that address problems at the grassroots level
of the Indian legal system.
‘Courts inundated with a large number of pending cases’
Judicial Structure in India
[Figure: the three court tiers: Supreme Court (SC), High Courts, District Courts]
- Similar to a few other countries such as the US and the UK, India has adopted a tiered court
system to improve productivity and streamline the judicial process.
District Court Challenges
Language - Case documents in district courts use the local language of the
state, as opposed to the High Courts and the Supreme Court, which consistently use English.
‘Article 348(1) of the Constitution of India provides that all proceedings in the
Supreme Court and in every High Court shall be in English language until
Parliament by law otherwise provides.’
Structure - Since each district has its own conventions and quality control checks,
there is no consistency across districts. Additionally, district case documents are highly
unstructured and noisy (they are typed, and contain spelling and grammar mistakes).
Prominent Legal Corpus
[Chalkidis et al., 2019] have developed an English corpus of European Court of
Justice documents. (Non-Indian context, English)
[Xiao et al., 2018] have developed Chinese Legal Document corpus. (Non-Indian
context, Chinese)
[Malik et al., 2021b] have developed an English corpus of Indian Supreme Court
documents. (Indian context, English)
To the best of our knowledge, there does not exist any legal document corpus for the
Hindi language.
Contribution #1 - HLDC - Hindi Legal Documents Corpus
About 40 million cases are pending in the district courts of India. Of these,
approximately 20 million are from courts where the official language is
Hindi.
No legal corpus exists for Hindi, which motivates HLDC (Hindi Legal Documents Corpus).
HLDC is a corpus of 912,568 Indian legal case
documents in the Hindi language. The corpus is created by crawling data from the
e-Courts website, a GoI initiative to file, track and analyse case progress and documents
online.
HLDC - Corpus Pipeline
OCR
Document Segmentation (Header / Body )
Final Cleaning
Anonymisation Challenge
Anonymisation is a critical step of document processing. It serves a two-fold purpose:
- Unbiased Models
- Privacy Concerns
We used a gazetteer along with regex-based rules for NER to anonymize the data.
Entries from lists of first names, last names, middle names, locations, titles like पंडित (Pandit: title of
Priest) and सरजी (Sir), month names and day names were normalized to नाम (Naam:
<name>). Phone numbers were detected using regex patterns and replaced with a
<फ़ोन नंबर> (<phone-number>) tag; numbers written in both English and Hindi were
considered.
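The gazetteer and regex steps described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline used for HLDC: the toy `GAZETTEER`, the `<नाम>` tag form, the ten-digit phone pattern and the function name `anonymise` are all assumptions made for the example.

```python
import re

# Hypothetical toy gazetteer; the actual name/location/title lists are not given in the slides.
GAZETTEER = {"राम", "शर्मा", "कानपुर", "पंडित"}
NAME_TAG = "<नाम>"          # assumed tag form for normalised names
PHONE_TAG = "<फ़ोन नंबर>"    # phone-number tag from the slide

# Assumption: a phone number is a run of exactly ten digits,
# written with ASCII digits (0-9) or Devanagari digits (०-९).
PHONE_RE = re.compile(r"(?<![0-9०-९])[0-9०-९]{10}(?![0-9०-९])")

def anonymise(text: str) -> str:
    """Replace gazetteer tokens with NAME_TAG, then phone numbers with PHONE_TAG."""
    tokens = text.split()
    replaced = " ".join(NAME_TAG if tok in GAZETTEER else tok for tok in tokens)
    return PHONE_RE.sub(PHONE_TAG, replaced)

print(anonymise("राम शर्मा का मोबाइल ९८७६५४३२१० है"))
```

Whitespace tokenisation is the simplest possible matcher; names attached to punctuation or inflected forms would need additional rules.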
Standardisation Challenge - Case Type Identification
HLDC documents were processed to obtain different case types (e.g., Bail Applications,
Criminal Cases).
However, case type labels show little uniformity across districts, and even
within a district.
Bail applications were categorised as ‘Bail Appl’ or ‘Bail Application’ in Kanpur Nagar
district. Similarly, ‘Civil Revision’ appeared as ‘Civil Reivision’ and ‘Civill
Revision’ in multiple districts.
This was resolved using a combination of manual verification and edit distance metrics.
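As a rough sketch of how edit distance can map noisy labels to canonical case types while leaving harder cases for manual verification: the `CANONICAL` list, the `max_distance` threshold and the function names below are illustrative assumptions, not the values used for HLDC.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical canonical labels, taken from examples on the slides.
CANONICAL = ["Bail Application", "Civil Revision", "Criminal Cases"]

def normalise_case_type(raw: str, max_distance: int = 2):
    """Return the closest canonical label, or None to flag the label for manual review."""
    cleaned = " ".join(raw.split()).lower()
    best = min(CANONICAL, key=lambda c: edit_distance(cleaned, c.lower()))
    if edit_distance(cleaned, best.lower()) <= max_distance:
        return best
    return None

for label in ["Civil Reivision", "Civill Revision", "Bail Appl"]:
    print(label, "->", normalise_case_type(label))
```

With this threshold the two misspellings of ‘Civil Revision’ are corrected automatically, while an abbreviation such as ‘Bail Appl’ is too far from any canonical label and is returned as `None`, which is where manual verification would come in.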
Case Type Statistics
Case Type                                 % in HLDC
Bail Applications                         31.71
Criminal Cases                            20.41
Original Suits                             6.54
Warrants or Summons in Criminal Cases      5.24
Civil Misc                                 3.40
Final Report                               3.32
Civil Cases                                3.23
Contribution #2 - Bail Prediction
We showcase the utility of the HLDC corpus on a relevant real-world task.
The most prevalent case type in the corpus is bail applications.
Additionally, bail cases are very time-sensitive, as the hearing and decision
need to be completed in a short period. Bail cases are resolved within 7-15 days
on average, whereas most other cases can take years.
Bail Prediction Task
Input – Facts and arguments of the case
System – Bail prediction ML model
Output – Bail granted or bail denied
Bail Corpus Creation Pipeline
Outcome Identification
Binary - Identify if bail is granted or denied.
To extract the bail decision, we manually identified keywords in the result section.
Keywords like ख़ारिज (dismissed) and निरस्त (invalidated/cancelled) indicate rejection of the
bail application, and words like स्वीकार (accepted) indicate acceptance of the bail
application.
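A minimal sketch of keyword-based outcome labelling, using only the three keywords quoted above. The function name `bail_outcome`, the Unicode normalisation step and the choice to return `None` for ambiguous text are assumptions for illustration; the label convention (1 = granted, 0 = denied) follows the problem statement.

```python
import unicodedata

def _nfc(text: str) -> str:
    # Normalise so that nukta letters such as ख़ compare equal however they were typed.
    return unicodedata.normalize("NFC", text)

# Keywords quoted on the slide; the real keyword list is presumably longer.
DENIED_KEYWORDS = [_nfc(k) for k in ("ख़ारिज", "निरस्त")]
GRANTED_KEYWORDS = [_nfc(k) for k in ("स्वीकार",)]

def bail_outcome(result_text: str):
    """Return 1 (granted), 0 (denied), or None when the text is ambiguous or has no keyword."""
    text = _nfc(result_text)
    denied = any(k in text for k in DENIED_KEYWORDS)
    granted = any(k in text for k in GRANTED_KEYWORDS)
    if denied == granted:  # neither keyword set matched, or both did
        return None
    return 1 if granted else 0

print(bail_outcome("जमानत प्रार्थना पत्र स्वीकार किया जाता है"))
print(bail_outcome("जमानत प्रार्थना पत्र निरस्त किया जाता है"))
```

Plain substring matching is fragile: for instance, अस्वीकार (rejected) contains स्वीकार, so a practical rule set would need to handle such negated forms explicitly.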
Regex Tokens - Outcome Identification
Judge’s Opinion and Facts Extraction
Judge’s Opinion Extraction - Most of the bail documents have a concluding
paragraph where the judge summarizes their view of the case. To
extract this, we first constructed regex patterns for phrases that often precede the judge’s
summary.
Facts Extraction - The remainder of the case document, after removing the judge’s
summary, the final result paragraph and the initial header information, forms the
core facts of the case.
Formalised Problem Statement
Formally, consider a corpus of bail documents $D = \{b_1, b_2, \dots, b_i\}$, where each bail
document is segmented as $b_i = (h_i, f_i, j_i, y_i)$.
Here, $h_i$, $f_i$, $j_i$ and $y_i$ represent the header, facts, judge’s summary and bail decision of
the document respectively. Additionally, the facts of every document contain $k$
sentences; more formally, $f_i = (s^i_1, s^i_2, \dots, s^i_k)$, where $s^i_k$ represents the $k$-th
sentence of the $i$-th bail document. We formulate the bail prediction task as a binary
classification problem.
We are interested in modelling $p_\theta(y_i \mid f_i)$, which is the probability of the outcome $y_i$
given the facts of a case $f_i$. Here, $y_i \in \{0, 1\}$, i.e., 0 if bail is denied or 1 if bail is
granted.
Models
We explore a range of models for the bail prediction problem. The proposed
Multi-Task Learning model performs the best in the district-wise setting. The categories of models we
progressively explored are:
- Embedding Based Models
- Summarisation Based Models
- Multi-Task Learning Model
Embedding Based Models
Doc2Vec
Classical embedding-based model. The Doc2Vec embeddings, in our case, are trained on the
train set of our corpus. The embeddings are fed as input to SVM and XGBoost classifiers.
IndicBERT
IndicBERT is a transformer language model trained on 12 major Indian languages.
However, IndicBERT, akin to other transformer LMs, has a limit on the input
length (max 512 tokens). We experimented with two settings:
- The first 512 tokens of the document.
- The last 512 tokens of the document.
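The two truncation settings amount to keeping either the head or the tail of the token sequence. The helper below is a hypothetical sketch on a generic list of tokens; a real tokenizer would also reserve positions for special tokens within the 512-token budget.

```python
def truncate(tokens, max_len=512, keep="first"):
    """Keep the first or the last `max_len` tokens of a document."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if keep == "first":
        return tokens[:max_len]
    if keep == "last":
        return tokens[-max_len:]
    raise ValueError("keep must be 'first' or 'last'")

doc_tokens = [f"tok{i}" for i in range(1000)]  # stand-in for a tokenised document
head = truncate(doc_tokens, keep="first")
tail = truncate(doc_tokens, keep="last")
print(head[0], head[-1], tail[0], tail[-1])
```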
Summarisation Based Models
The models described in the previous section can lose important input signals due to
truncation, which can be detrimental to the bail prediction task.
Lengthy Documents -> Summarisation
Two-fold benefits:
a) An extractive summary of the document as input, i.e., reducing the length of the
document without losing important signals.
b) Summarization helps identify the salient sentences in a legal document
and is a step towards developing explainable models.
Summarisation Methods
Unsupervised - No need for ground truth summaries.
a) TF-IDF+IndicBERT: We rely on the statistical measures of term frequency
and inverse document frequency to identify the most salient words and
sentences.
b) TextRank+IndicBERT: We use the graph-based method TextRank to extract
the most important sentences from the documents.
Semi-Supervised - Use the judge’s summary as a silver-standard signal to train the model.
- Saliency Classifier + IndicBERT
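To make the TF-IDF variant concrete, here is a small sketch of extractive sentence selection. It assumes whitespace tokenisation, treats each sentence of a document as the unit for document frequency, and scores a sentence by the sum of its TF-IDF weights; these choices and the name `tfidf_extract` are illustrative and may differ from the scoring actually used.

```python
import math
from collections import Counter

def tfidf_extract(sentences, top_n=2):
    """Return the top_n highest-scoring sentences, kept in their original order."""
    tokenised = [s.split() for s in sentences]
    n = len(tokenised)
    # Document frequency: in how many sentences does each token occur?
    df = Counter(tok for toks in tokenised for tok in set(toks))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum((count / len(toks)) * math.log(n / df[tok])
                   for tok, count in tf.items())

    ranked = sorted(range(n), key=lambda i: score(tokenised[i]), reverse=True)
    keep = sorted(ranked[:top_n])
    return [sentences[i] for i in keep]

toy = ["a a a", "a b c", "a d e f"]
print(tfidf_extract(toy, top_n=2))
```

The selected sentences would then be concatenated and passed to the classifier in place of the truncated document.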
Semi-Supervised Summarisation - Saliency Classifier
- Saliency Classifier - A binary sentence-level classifier that predicts whether a sentence is
important or not.
- We compute embeddings (using a multilingual sentence encoder) for each sentence in
the facts (source text) and for the judge’s summary (silver standard), and assign a
saliency score to each sentence in the facts using cosine similarity.
- Formally: $\mathrm{Salience}(s^i_k) = \mathrm{cos\_sim}(j_i, s^i_k)$
- The cosine similarities give a ranked list of sentences: the top 50% are labelled salient and the bottom half
non-salient. We use this data to train the saliency classifier.
- The salient sentences are used to train (and fine-tune) an IndicBERT-based classifier.
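The silver-label construction can be sketched as below. The toy two-dimensional vectors stand in for the multilingual sentence embeddings, and rounding the "top 50%" down for an odd number of sentences is an assumption; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def salience_labels(fact_vectors, summary_vector):
    """Score each fact sentence against the judge's summary; label the top half as salient (1)."""
    scores = [cosine(summary_vector, v) for v in fact_vectors]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    n_salient = len(scores) // 2  # assumption: round down for odd counts
    labels = [0] * len(scores)
    for k in order[:n_salient]:
        labels[k] = 1
    return scores, labels

facts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # toy sentence embeddings
summary = [1.0, 0.0]                                       # toy judge's-summary embedding
scores, labels = salience_labels(facts, summary)
print(labels)
```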
Multi-Task Learning Model
Summarization-based models show improvement in results. Instead of a two-step pipeline
(summarisation followed by prediction), we use a multi-task framework.
Bail prediction is the main task and sentence salience classification is the
auxiliary task.
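One common way to write such a joint objective is a weighted sum of the two task losses. The formulation below is a sketch under that assumption, not the loss stated in the slides: $z^i_k \in \{0, 1\}$ denotes the silver salience label of sentence $s^i_k$, $q_\theta$ the auxiliary salience head, and $\lambda \ge 0$ a hypothetical weighting hyperparameter.

```latex
% Assumed notation: z^i_k is the silver salience label, q_\theta the salience head,
% and \lambda a hypothetical weight on the auxiliary task.
\begin{align}
\mathcal{L}_{\text{bail}} &= -\sum_{i} \Big[ y_i \log p_\theta(y_i = 1 \mid f_i)
    + (1 - y_i) \log \big(1 - p_\theta(y_i = 1 \mid f_i)\big) \Big] \\
\mathcal{L}_{\text{sal}} &= -\sum_{i} \sum_{k} \Big[ z^i_k \log q_\theta(s^i_k)
    + (1 - z^i_k) \log \big(1 - q_\theta(s^i_k)\big) \Big] \\
\mathcal{L} &= \mathcal{L}_{\text{bail}} + \lambda \, \mathcal{L}_{\text{sal}}
\end{align}
```

Setting $\lambda = 0$ recovers a plain bail classifier, which makes the contribution of the auxiliary task easy to ablate.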
Experiments and Results
We evaluate the models in two settings:
a) All-district performance - 70:10:20 train, validation and test split. Documents from all
districts are used together.
b) District-wise performance (to measure generalizability) - Training is done on documents from 44
districts. Testing is done on a different set of 17 districts. The validation set has another set
of 10 districts. The number of documents in each split corresponds to an approximate
70:10:20 ratio.
All models are evaluated using the standard accuracy and F1-score metrics.
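The district-wise protocol and the two metrics can be sketched as follows. The district names, the random assignment and the seed are placeholders; a random assignment like this does not by itself guarantee the approximate 70:10:20 document ratio, and computing F1 for the positive class (1 = bail granted) is an assumption, since the slides do not say which F1 variant is reported.

```python
import random

def district_split(districts, n_train, n_val, seed=0):
    """Assign whole districts to train/validation/test so that no district is shared."""
    shuffled = sorted(districts)
    random.Random(seed).shuffle(shuffled)
    train = set(shuffled[:n_train])
    val = set(shuffled[n_train:n_train + n_val])
    test = set(shuffled[n_train + n_val:])
    return train, val, test

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for the positive class (1 = bail granted)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

districts = [f"district_{i}" for i in range(71)]  # 44 + 10 + 17 placeholder districts
train, val, test = district_split(districts, n_train=44, n_val=10)
print(len(train), len(val), len(test))
print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 1]))
```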
Results
The performance of models is lower in the district-wise setting.
Possible reason - Lexical variation across districts makes it difficult for the model to
generalize.
In general, summarization-based models perform better than Doc2Vec and
transformer-based models, highlighting the importance of the summarization step.
The multi-task model outperforms all the baselines in the district-wise setting with
78.53% accuracy. The auxiliary task of sentence salience classification helps learn
robust features during training.
However, in the all-district split, the MTL model fails to beat simpler
baselines like TF-IDF+IndicBERT. The sentence salience training data, being based on
the cosine similarity heuristic, may introduce some noise into the auxiliary task.
Error Analysis
Figure: Kernel Density Estimate (KDE) plots for our proposed bail prediction model. The
majority of errors (bail incorrectly dismissed or granted) are borderline cases, with the model
output score around 0.5.
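One simple way to quantify the "borderline" observation is to measure what share of misclassified documents have a score close to the decision threshold. The `margin` of 0.1, the threshold of 0.5 and the function name are assumptions chosen for illustration, and the scores below are made-up toy values.

```python
def borderline_error_share(scores, labels, margin=0.1, threshold=0.5):
    """Fraction of misclassified examples whose score lies within `margin` of the threshold."""
    errors = [s for s, y in zip(scores, labels)
              if (1 if s >= threshold else 0) != y]
    if not errors:
        return 0.0
    borderline = [s for s in errors if abs(s - threshold) <= margin]
    return len(borderline) / len(errors)

toy_scores = [0.9, 0.55, 0.45, 0.1, 0.95]  # made-up model output scores
toy_labels = [1, 0, 1, 0, 0]               # 1 = bail granted, 0 = bail denied
share = borderline_error_share(toy_scores, toy_labels)
print(round(share, 3))
```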
Challenges and Limitations
Domain Specific Knowledge - Court documents use a very specific lexicon. Pre-trained
models built on other corpora do not perform well.
Low Resource Language - Hindi, despite being the third most spoken language in
the world, remains a low-resource language in the context of NLP, with limited high-quality
training data and pre-trained models.
Length of Documents - Court cases can be extremely long; the token
length constraint of transformer-based models makes it challenging to process legal
documents.
Summary
We introduced a large corpus of legal documents for the under-resourced
language Hindi: Hindi Legal Documents Corpus (HLDC).
As a use-case for HLDC, we introduced the task of Bail Prediction, a
time-sensitive and impactful problem to tackle.
We experimented with several models and proposed a multi-task
learning based model that predicts salient sentences as an auxiliary task
and bail prediction as the main task.
Future Work
Multiple Languages - India has 22 official languages, but for the majority of
them, there are no legal corpora. The use of such corpora is not restricted
to the legal domain. They can also act as a source of unsupervised data in low-resource
languages to train transformer models.
Infusing Domain Specific Knowledge - Develop deep models infused with legal
knowledge so that the model is able to capture legal nuances. For Indian legal
documents, we can incorporate information from the sections and acts in the IPC
(Indian Penal Code).
Other LegalNLP tasks - Apart from judgement prediction, a number of other tasks
like legal summarisation and prior case retrieval can help expedite the judicial process.
Publications
Kapoor, A., Dhawan, M., Goel, A., Arjun, T. H., Agrawal, V., Agrawal, A.,
Bhattacharya, A., Kumaraguru, P., and Modi, A.
HLDC: Hindi Legal Documents Corpus. In Findings of the Association for
Computational Linguistics: ACL 2022. Association for Computational
Linguistics.
Acknowledgements
Prof. Ashutosh Modi and Prof. Arnab Bhattacharya - The work was a collaborative
effort between IIIT Hyderabad, IIT Kanpur and IIIT Delhi. Thanks
for the opportunity to work on such an impactful problem.
IHUB-Data, IIIT Hyderabad - For funding my Masters.
The Precog group - For the constant support and encouragement.
Family - For everything.
Questions
How I learned to stop worrying and love the dark silicon apocalypse.pdf
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by AakashShakya12
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
AakashShakya1257 views
2_DVD_ASIC_Design_FLow.pdf by Usha Mehta
2_DVD_ASIC_Design_FLow.pdf2_DVD_ASIC_Design_FLow.pdf
2_DVD_ASIC_Design_FLow.pdf
Usha Mehta19 views
Dynamics of Hard-Magnetic Soft Materials by Shivendra Nandan
Dynamics of Hard-Magnetic Soft MaterialsDynamics of Hard-Magnetic Soft Materials
Dynamics of Hard-Magnetic Soft Materials
Shivendra Nandan13 views
9_DVD_Dynamic_logic_circuits.pdf by Usha Mehta
9_DVD_Dynamic_logic_circuits.pdf9_DVD_Dynamic_logic_circuits.pdf
9_DVD_Dynamic_logic_circuits.pdf
Usha Mehta28 views
CHI-SQUARE ( χ2) TESTS.pptx by ssusera597c5
CHI-SQUARE ( χ2) TESTS.pptxCHI-SQUARE ( χ2) TESTS.pptx
CHI-SQUARE ( χ2) TESTS.pptx
ssusera597c529 views
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th... by ahmedmesaiaoun
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
ahmedmesaiaoun12 views

Justice Delayed is Justice Denied: Enabling Legal Artificial Intelligence via Bail Prediction on Hindi Case Documents

  • 1. Justice Delayed is Justice Denied: Enabling Legal Artificial Intelligence via Bail Prediction on Hindi Case Documents. Arnav Kapoor (20171040). Advisor: Prof. Ponnurangam Kumaraguru. Examiners: Prof. Srijoni Sen, Prof. Saptarshi Ghosh.
  • 3. Enabling Legal Artificial Intelligence via Bail Prediction on Hindi Case Documents. Legal Artificial Intelligence: Legal Artificial Intelligence (LegalAI) is concerned with the application of artificial intelligence technologies, particularly NLP, to legal tasks. Bail Prediction: an automated system to predict whether bail (judicial interim release of a person suspected of a crime) would be granted or denied. Hindi Case Documents: the first work to focus on Hindi legal documents.
  • 4. Motivation. The legal system plays a critical role in the functioning of a nation. It is paramount that cases are handled judiciously and efficiently. In recent times, the legal system in many populous countries (e.g., India) has been inundated with a large number of pending cases. There is an urgent need to build systems that process legal documents and help augment the legal procedure.
  • 5. ‘Courts inundated with a large number of pending cases’. How many such pending cases are there, and where are most of them concentrated? According to India’s National Judicial Data Grid, as of November 2021 there were approximately 40 million cases pending in District Courts (National Judicial Data Grid, 2021), as opposed to 5 million cases (1/8th) pending in High Courts. There is a need to develop models that address the problems at the grass-roots level of the Indian legal system.
  • 6. Judicial Structure in India (figure: District Courts, High Courts, Supreme Court). Like a few other countries such as the US and UK, India has adopted a tiered court system to improve productivity and streamline the judicial process.
  • 7. District Court Challenges. Language: case documents in district courts use the local language of the state, as opposed to the High Courts and the Supreme Court, which consistently use English. ‘Article 348(1) of the Constitution of India provides that all proceedings in the Supreme Court and in every High Court shall be in English language until Parliament by law otherwise provides.’ Structure: since each district has its own conventions and quality control checks, there is no consistency. Additionally, district case documents are highly unstructured and noisy (spelling and grammar mistakes, since these are typed).
  • 8. Prominent Legal Corpora. [Chalkidis et al., 2019] developed an English corpus of European Court of Justice documents (non-Indian context, English). [Xiao et al., 2018] developed a Chinese legal document corpus (non-Indian context, Chinese). [Malik et al., 2021b] developed an English corpus of Indian Supreme Court documents (Indian context, English). To the best of our knowledge, no legal document corpus exists for the Hindi language.
  • 9. Contribution #1 - HLDC - Hindi Legal Documents Corpus. There are about 40 million pending cases in the district courts of India, of which approximately 20 million are from courts where the official language is Hindi. No legal corpus exists for Hindi. The Hindi Legal Documents Corpus (HLDC) is a corpus of 912,568 Indian legal case documents in the Hindi language. The corpus is created by crawling data from the e-Courts website, a Government of India (GoI) initiative to file, track and analyse case progress and documents online.
  • 10. HLDC - Corpus Pipeline: OCR, Document Segmentation (Header / Body), Final Cleaning.
  • 12. Anonymisation Challenge. Anonymisation is a critical step of document processing. It has a two-fold purpose: unbiased models and privacy concerns. We used a gazetteer along with regex-based rules for NER to anonymise the data. Lists of first names, last names, middle names, locations, titles like पंडित (Pandit: title of priest) and सरजी (Sir), month names and day names were normalised to नाम (Naam: <name>). Phone numbers were detected using regex patterns and replaced with a <फ़ोन नंबर> (<phone-number>) tag; numbers written in both English and Hindi numerals were considered.
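The gazetteer-plus-regex step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for HLDC: the gazetteer entries, the 10-digit phone pattern, and the names `GAZETTEER`, `PHONE_RE` and `anonymise` are all assumptions made for the example.

```python
import re

# Hypothetical gazetteer entries; the slides do not list the real ones.
GAZETTEER = {"राम", "शर्मा", "कानपुर", "पंडित"}
NAME_TAG = "नाम"
PHONE_TAG = "<फ़ोन नंबर>"

# Assumed phone format: exactly ten digits, written with ASCII or
# Devanagari numerals, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<![0-9०-९])[0-9०-९]{10}(?![0-9०-९])")

def anonymise(text):
    """Replace phone numbers with a tag, then gazetteer tokens with the name tag."""
    text = PHONE_RE.sub(PHONE_TAG, text)
    tokens = text.split()
    return " ".join(NAME_TAG if tok in GAZETTEER else tok for tok in tokens)

print(anonymise("राम शर्मा 9876543210"))
```

Whitespace tokenisation is a simplification: a token with attached punctuation would not match the gazetteer, so a real system would need proper tokenisation first.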
  • 13. Standardisation Challenge - Case Type Identification. HLDC documents were processed to obtain different case types (e.g., Bail Applications, Criminal Cases). However, case type labels show little uniformity across districts and even within a district. Bail applications were categorised as ‘Bail Appl’ or ‘Bail Application’ in Kanpur Nagar district. Similarly, ‘Civil Revision’ appeared as ‘Civil Reivision’ and ‘Civill Revision’ in multiple districts. This was resolved using a combination of manual verification and edit distance metrics.
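A minimal sketch of how edit distance could merge spelling variants of case types, assuming a small list of canonical labels and a distance threshold of 2 (both are illustrative choices, not values from the slides). Labels that are too far from any canonical form return `None`, standing in for the manual verification step.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical canonical labels for illustration.
CANONICAL = ["Bail Application", "Civil Revision"]

def normalise_case_type(raw, max_dist=2):
    """Map a raw label to its nearest canonical label, or None if too far."""
    key = raw.lower()
    best = min(CANONICAL, key=lambda c: edit_distance(key, c.lower()))
    return best if edit_distance(key, best.lower()) <= max_dist else None

print(normalise_case_type("Civil Reivision"))  # Civil Revision
print(normalise_case_type("Bail Appl"))        # None: left for manual review
```

An abbreviation such as ‘Bail Appl’ is seven edits away from ‘Bail Application’, which is why a pure edit-distance rule is not enough and manual checks remain necessary.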
  • 15. Case Type Statistics (case type: % in HLDC). Bail Applications: 31.71; Criminal Cases: 20.41; Original Suits: 6.54; Warrants or Summons in Criminal Cases: 5.24; Civil Misc: 3.4; Final Report: 3.32; Civil Cases: 3.23.
  • 16. Contribution #2 - Bail Prediction. We showcase the utility of the HLDC corpus on a relevant real-world task. The most prevalent case type in the corpus is bail cases. Additionally, bail cases are very time-sensitive, as the hearing and decision need to be completed in a short period. Bail cases are resolved within 7-15 days on average, whereas most other cases can take years.
  • 17. Bail Prediction Task. Input: facts and arguments of the case. System: bail prediction ML model. Output: bail granted or bail denied.
  • 18. Bail Corpus Creation Pipeline (figure).
  • 19. Outcome Identification. Binary: identify whether bail is granted or denied. To extract the bail decision, we manually identified keywords in the result section. Keywords like ख़ारिज (dismissed) and निरस्त (invalidated / cancelled) indicate rejection of the bail application, and words like स्वीकार (accepted) indicate acceptance of the bail application.
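The keyword-based labelling can be illustrated with a short sketch. It uses only the three keywords quoted on the slide; the function name `bail_outcome` and the rule of returning `None` when the keywords conflict or are absent are assumptions added for the example.

```python
# Keywords quoted on the slide; the real keyword list is longer.
DENIED_KEYWORDS = ("ख़ारिज", "निरस्त")
GRANTED_KEYWORDS = ("स्वीकार",)

def bail_outcome(result_text):
    """Return 1 if bail is granted, 0 if denied, None if unclear."""
    denied = any(k in result_text for k in DENIED_KEYWORDS)
    granted = any(k in result_text for k in GRANTED_KEYWORDS)
    if denied == granted:  # no keyword found, or conflicting keywords
        return None
    return 0 if denied else 1

sample = "जमानत प्रार्थना पत्र " + GRANTED_KEYWORDS[0] + " किया जाता है"
print(bail_outcome(sample))
```

Plain substring matching is crude: a word that merely contains a keyword would also match, so production rules would need word boundaries and a fuller pattern list.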
  • 20. Regex Tokens - Outcome Identification (figure).
  • 21. Judge’s Opinion and Facts Extraction. Judge’s Opinion Extraction: most bail documents have a concluding paragraph where the judge summarises their view of the case. To extract it, we first constructed regular expressions for the phrases that often precede the judge’s summary. Facts Extraction: the case document that remains after removing the judge’s summary, the final result paragraph and the initial header information forms the core facts of the case.
  • 23. Formalised Problem Statement. Formally, consider a corpus of bail documents D = b_1, b_2, ..., b_i, where each bail document is segmented as b_i = (h_i, f_i, j_i, y_i). Here h_i, f_i, j_i and y_i represent the header, facts, judge’s summary and bail decision of the document respectively. Additionally, the facts of every document contain k sentences; more formally, f_i = (s^i_1, s^i_2, ..., s^i_k), where s^i_k represents the k-th sentence of the i-th bail document. We formulate the bail prediction task as a binary classification problem. We are interested in modelling p_θ(y_i | f_i), the probability of the outcome y_i given the facts of a case f_i. Here y_i ∈ {0, 1}, i.e., 0 if bail is denied and 1 if bail is granted.
  • 24. Models. We explore several models for the bail prediction problem. The proposed Multi-Task Learning model performs best. The categories of models we progressively explored were: Embedding Based Models, Summarisation Based Models, and a Multi-Task Learning Model.
  • 25. Embedding Based Models. Doc2Vec: a classical embedding-based model. The Doc2Vec embeddings are trained on the train set of our corpus and fed as input to SVM and XGBoost classifiers. IndicBERT: a transformer language model trained on 12 major Indian languages. However, IndicBERT, like other transformer LMs, limits the input length (max 512 tokens). We experimented with two settings: the first 512 tokens of the document, and the last 512 tokens of the document.
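The two truncation settings amount to keeping either the head or the tail of the token sequence. A minimal sketch over an already tokenised document is shown below; the helper names are hypothetical, and with a real tokenizer the special tokens would also count against the 512-token budget.

```python
MAX_TOKENS = 512  # input length limit mentioned on the slide

def first_tokens(tokens, limit=MAX_TOKENS):
    """Keep the first `limit` tokens (limit must be positive)."""
    return tokens[:limit]

def last_tokens(tokens, limit=MAX_TOKENS):
    """Keep the last `limit` tokens (limit must be positive)."""
    return tokens[-limit:]

doc = list(range(1000))  # stand-in for the token ids of a long document
print(len(first_tokens(doc)), len(last_tokens(doc)))
```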
  • 26. Summarisation Based Models. The models described in the previous section can lose important input signals due to truncation, which can be detrimental to the bail prediction task. Lengthy documents call for summarisation, which has two benefits: a) an extractive summary of the document as input reduces the length of the document without losing important signals; b) summarisation helps identify the salient sentences in a legal document and is a step towards developing explainable models.
  • 27. Summarisation Methods. Unsupervised (no need for ground truth summaries): a) TF-IDF+IndicBERT: we rely on the statistical measures of term frequency and inverse document frequency to recognise the most salient words and sentences; b) TextRank+IndicBERT: we use the graph-based method TextRank to extract the most important sentences from the documents. Semi-supervised (uses the judge’s summary as a truth indicator to train the model): Salience Classifier + IndicBERT.
  • 28. Semi-Supervised Summarisation - Saliency Classifier. The saliency classifier is a binary sentence-level classifier that decides whether a sentence is important or not. We compute embeddings (using a multilingual sentence encoder) for each sentence in the facts (source text) and for the summarised judgement (silver standard), and assign a saliency score to each sentence in the facts using cosine similarity. Formally, Salience(s^i_k) = cos_sim(j_i, s^i_k). The cosine similarities provide a ranked list of sentences: the top 50% are labelled salient and the bottom half non-salient. We use this data to train the saliency classifier. The salient sentences are used to train (and fine-tune) the IndicBERT-based classifier.
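The silver-label construction on this slide can be sketched as below. The embeddings are assumed to be precomputed vectors (the sentence encoder itself is not shown), the function names are hypothetical, and flooring the cut-off for an odd number of sentences is an assumption.

```python
import math

def cos_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def salience_labels(judge_vec, sentence_vecs):
    """Label the top half of fact sentences, ranked by similarity, as salient (1)."""
    scores = [cos_sim(judge_vec, s) for s in sentence_vecs]
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    salient = set(ranked[: len(ranked) // 2])  # top 50%, floored for odd counts
    return [1 if k in salient else 0 for k in range(len(scores))]

judge = [1.0, 0.0]                                        # toy summary embedding
facts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # toy sentence embeddings
print(salience_labels(judge, facts))  # [1, 0, 1, 0]
```

The resulting 0/1 labels are what a sentence-level saliency classifier would then be trained on.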
  • 29. Multi-Task Learning Model. Summarisation-based models show improved results. Instead of a two-step pipeline (summarisation followed by prediction), we use a multi-task framework in which bail prediction is the main task and sentence salience classification is the auxiliary task.
  • 31. Experiments and Results. We evaluate the models in two settings. a) All-district performance: a 70:10:20 train, validation and test split, with documents from all districts used together. b) District-wise performance (to measure generalisability): training uses documents from 44 districts, testing uses a different set of 17 districts, and the validation set uses another 10 districts. The number of documents in each category corresponds to an approximate 70:10:20 ratio. All models are evaluated using the standard accuracy and F1-score metrics.
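A district-wise split keeps every district in exactly one of the three sets. The sketch below shows one way to do that; the `district` field name, the random assignment of districts and the function name are assumptions, and it does not attempt to hit the approximate 70:10:20 document ratio mentioned on the slide.

```python
import random

def district_split(docs, n_train, n_val, seed=0):
    """Assign whole districts to train/val/test so no district is shared."""
    districts = sorted({doc["district"] for doc in docs})
    random.Random(seed).shuffle(districts)
    train_d = set(districts[:n_train])
    val_d = set(districts[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for doc in docs:
        if doc["district"] in train_d:
            split["train"].append(doc)
        elif doc["district"] in val_d:
            split["val"].append(doc)
        else:  # every remaining district goes to the test set
            split["test"].append(doc)
    return split

# Toy corpus: 6 hypothetical districts with 2 documents each.
docs = [{"district": f"d{k % 6}", "id": k} for k in range(12)]
parts = district_split(docs, n_train=3, n_val=1)
print({name: len(items) for name, items in parts.items()})
```

With the slide's setup, the call would use `n_train=44` and `n_val=10`, leaving the remaining 17 districts for testing.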
  • 34. Results. Model performance is lower in the district-wise setting. A possible reason is lexical variation across districts, which makes it difficult for the model to generalise. In general, summarisation-based models perform better than Doc2Vec and transformer-based models, highlighting the importance of the summarisation step. The multi-task model outperforms all the baselines in the district-wise setting with 78.53% accuracy. The auxiliary task of sentence salience classification helps learn robust features during training. However, in the all-district split, the MTL model fails to beat simpler baselines like TF-IDF+IndicBERT. The sentence salience training data, being based on the cosine similarity heuristic, may introduce some noise into the auxiliary task.
  • 35. Error Analysis. Kernel Density Estimate (KDE) plots of the output scores of our proposed bail prediction model. The majority of errors (bail incorrectly dismissed or granted) are borderline cases with a model output score around 0.5.
  • 36. Challenges and Limitations. Domain Specific Knowledge: court documents contain a very specific lexicon, and pre-trained models built on other corpora do not perform well. Low Resource Language: Hindi, despite being the third most spoken language in the world, remains a low-resource language in the context of NLP, with limited high-quality training data and pre-trained models. Length of Documents: court cases can be extremely long, and the token length constraint of transformer-based models makes it challenging to process legal documents.
  • 37. Summary. We introduced a large corpus of legal documents for the under-resourced language Hindi: the Hindi Legal Documents Corpus (HLDC). As a use case for HLDC, we introduced the task of Bail Prediction, a time-sensitive and impactful problem. We experimented with several models and proposed a multi-task learning model that predicts salient sentences as an auxiliary task and bail prediction as the main task.
  • 38. Future Work. Multiple Languages: India has 22 official languages, but for the majority of them there are no legal corpora. The use of such corpora is not restricted to the legal domain; they can also act as a source of unsupervised data for training transformer models in low-resource languages. Infusing Domain Specific Knowledge: develop deep models infused with legal knowledge so that the model can capture legal nuances. For Indian legal documents, we can incorporate information from the sections and acts in the IPC (Indian Penal Code). Other LegalNLP tasks: apart from judgement prediction, a number of other tasks like legal summarisation and prior case retrieval can help expedite the judicial process.
  • 39. Publications. Kapoor, A., Dhawan, M., Goel, A., Arjun, T.H., Agrawal, V., Agrawal, A., Bhattacharya, A., Kumaraguru, P., and Modi, A. HLDC: Hindi Legal Documents Corpus. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2022, Association for Computational Linguistics.
  • 40. Acknowledgements. Prof. Ashutosh Modi and Prof. Arnab Bhattacharya: the work was a collaborative effort between IIIT Hyderabad, IIT Kanpur and IIIT Delhi; thanks for the opportunity to work on such an impactful problem. IHUB-Data, IIIT Hyderabad, for funding my Masters. The Precog group, for the constant support and encouragement. Family, for everything.