The document summarizes research on natural language processing and language technologies at Precog, including work on code-mixing, legal AI for Indian courts, fake news detection, hashtag segmentation, and more. Some key projects involve developing metrics to analyze code-mixing complexity, creating a Hindi legal documents corpus from Indian court cases, and building models for bail prediction and legal document summarization. The research aims to advance NLP for under-resourced Indian languages and apply technologies to problems around misinformation and online harm.
5. Broad categories of our NLP work
Almost all projects involve applying NLP tools
We deal with a lot of text data – social media
Application
Fake News
Twitter, Reddit (Shocks, Drug, Depression)
Multimodal
Elections
Knowledge Graphs
Theoretical
Code-mixing
Information Extraction for Indian Languages (joint work with Prof. Pushpak)
6. What I will try and cover
Code Mix
Legal AI
Hashtag segmentation (if time permits)
8. Code-mix computationally challenging hai. (Code-mixing is computationally challenging.)
Predominantly noticed in social networks and speech data
Social Media text processing poses certain challenges
Platform-specific tokens: @, #, https://
Incorrect spelling & Romanisation
Mixing two or more languages – Hinglish
9. Code-mix computationally challenging hai.
A study by Rijhwani et al. quantifies the amount of code-mixing on Twitter for European languages.
Takeaway: code-mixing is common in multilingual communities/cities/countries.
This makes code-mixing all the more relevant for India, where most of the public is bi/tri-lingual.
12. Our Data
~55k en-hi code-mix utterances, from publicly available code-mix datasets (Sentiment, Hate Speech, etc.)
GLUECoS and LinCE Benchmarks
Trained a SOTA en-hi code-mixed PoS tagger
Combined the PoS tagger and Language ID tool outputs to propose SyMCoM, a metric which captures the syntactic mixing in a code-mixed sentence
14. SyMCoM for OPEN & CLOSED categories
CLOSED-class words (function words – adpositions, pronouns, demonstratives) are used in a single language
OPEN-class words (content words – nouns, adjectives, verbs) are used in both languages
15. SyMCoM Scores vs POS tags
VERB is highly correlated with closed-class tags
16. SyMCoM Scores for benchmark datasets
Across benchmark datasets, there is a significant number of monolingual sentences (peak on the right side of the plot)
A score closer to 1 means almost all syntactic units come from a single language, i.e. the sentence is monolingual
We need to be careful about what kind of samples go into benchmark datasets, since all future work/models will be compared on them
LINCE POS, Kushagra et al.
17. SyMCoM Scores for benchmark datasets for individual PoS tags
Noun + Adj switching is more common than any other switching
SyMCoM here is aggregated at the dataset level, so a score of 1 indicates that only one language is involved for that PoS tag, and a score closer to 0 means that the PoS tag is mixed.
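The score behaviour described here (1 when a PoS tag uses only one language, near 0 when it is evenly mixed) can be illustrated with a small sketch. It assumes the per-tag score is the normalized absolute difference between the two language counts, and that a sentence-level score is the frequency-weighted average of the per-tag scores. Both are assumptions made for illustration and may differ from the exact SyMCoM definition. The function names and the hand-tagged example tokens are hypothetical.

```python
from collections import Counter, defaultdict


def symcom_by_tag(tagged_tokens, lang_a="en", lang_b="hi"):
    """Per-PoS-tag mixing score in [0, 1] (assumed formula, see lead-in).

    tagged_tokens: iterable of (token, pos_tag, language_id) triples.
    Tokens whose language ID is neither lang_a nor lang_b are ignored.
    """
    counts = defaultdict(Counter)
    for _token, pos, lang in tagged_tokens:
        if lang in (lang_a, lang_b):
            counts[pos][lang] += 1

    scores = {}
    for pos, c in counts.items():
        total = c[lang_a] + c[lang_b]  # always > 0 for tags stored above
        scores[pos] = abs(c[lang_a] - c[lang_b]) / total
    return scores


def symcom_sentence(tagged_tokens, lang_a="en", lang_b="hi"):
    """Frequency-weighted average of the per-tag scores (assumed aggregation)."""
    tagged_tokens = list(tagged_tokens)
    scores = symcom_by_tag(tagged_tokens, lang_a, lang_b)
    tags = [pos for _tok, pos, lang in tagged_tokens if lang in (lang_a, lang_b)]
    if not tags:
        return None  # no tokens in either language
    freq = Counter(tags)
    return sum(freq[pos] / len(tags) * scores[pos] for pos in scores)


# Hypothetical hand-tagged Hinglish sentence: one English noun is switched in.
example = [
    ("mujhe", "PRON", "hi"),
    ("yeh", "DET", "hi"),
    ("movie", "NOUN", "en"),
    ("bahut", "ADV", "hi"),
    ("pasand", "NOUN", "hi"),
    ("hai", "VERB", "hi"),
]
per_tag = symcom_by_tag(example)
sentence_score = symcom_sentence(example)
```

In this example the NOUN tag is split evenly between the two languages, so its score drops to 0 while every other tag stays at 1, which is the pattern the slide describes for noun switching.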
18. Take Aways
Our hypothesis is that code-mixing complexity is a function of not just the interspersing of language IDs, but also of the syntactic units that are switched
SyMCoM captures syntactic mixing, and ours is the first study to empirically confirm long-held syntactic notions of code-mixing like "nouns are switched more"
Our intent with this study was to understand the relationship with downstream task performance:
Which kinds of syntactic switches are easy for models (the likes of XLM-R and mBERT) to process
Which kinds of examples should go into datasets
SyMCoM is the first step in this direction
19. What are we currently exploring?
Linking code-mix metrics (LID-based metrics and SyMCoM) to model performance in downstream tasks
Constructing benchmark datasets for various tasks – ensuring reliable, high-quality data
Creating pipelines for code-mix text generation/translation
21. Legal AI for Indian Context
District courts are usually the first point of contact between the people and the judiciary.
Lower courts in India are burdened with a backlog of cases (~40 million as of 2021).
Local languages are used in the documents filed in district courts in India.
Court hierarchy: Supreme Court > High Courts > District Courts
22. Legal AI for Indian Context
Accepted at Findings of ACL 2022
HLDC: Hindi Legal Documents Corpus
23. Legal AI - Data
We collected ~900k district court case documents from Uttar Pradesh
All documents are in Hindi, written in Devanagari
There are legal corpora for the European Court of Justice and Chinese courts, but none for Indian district courts
24. Legal AI - Data
There are around 300 different case types; the table shows the prominent ones
The majority of the case documents correspond to bail applications
Variation in number of case documents per district
Case types in HLDC
25. Legal AI - Data Anonymization
We anonymized the corpus using:
Gazetteer + regex rules for NER – lists of first names, last names, titles, locations, etc. For example:
Months and dates were normalized to a placeholder tag
Phone numbers were replaced with a placeholder tag using regex rules
An RNN-based Hindi NER model to find additional entities and augment the initial gazetteer
Ambiguous names like Prarthna (a name, or the act of prayer) were removed
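The gazetteer-plus-regex step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for HLDC: the placeholder strings `<DATE>`, `<PHONE>` and `<NAME>`, the two regular expressions, the whitespace tokenization and the `anonymize` function name are all assumptions, and the example sentence is romanized for readability.

```python
import re

# Assumed patterns: numeric dates such as 12/03/2020 and 10-digit phone numbers.
# (?<!\d) and (?!\d) stop a match from starting or ending inside a longer number.
DATE_RE = re.compile(r"(?<!\d)\d{1,2}[./-]\d{1,2}[./-]\d{2,4}(?!\d)")
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def anonymize(text, name_gazetteer):
    """Replace dates, phone numbers and gazetteer names with placeholders.

    name_gazetteer: set of name strings (hypothetical; the real gazetteer
    also covers titles, locations, etc.).
    """
    text = DATE_RE.sub("<DATE>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    # Whitespace tokenization is a simplification: names with attached
    # punctuation would be missed.
    tokens = ["<NAME>" if tok in name_gazetteer else tok for tok in text.split()]
    return " ".join(tokens)


sample = "Ramesh ne 12/03/2020 ko 9876543210 par call kiya"
anonymized = anonymize(sample, {"Ramesh"})
```

An ambiguous entry such as Prarthna would simply be left out of `name_gazetteer`, so that ordinary uses of the word are not masked.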
26. Legal AI - Lexical analysis
Even within the legal documents of a single state, we found evidence of dialectal variation across districts
For example, 63.8% of occurrences of the word Saaking (motionless) come from 6 districts of East U.P. (Ballia, Azamgarh, Maharajganj, Deoria, Siddharthnagar and Kushinagar)
Similarly, the word Gauvanshiya (cow and related animals) is mostly used in North-Western U.P.
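A district-wise share like the 63.8% figure can be computed with a simple count. The sketch below assumes the corpus is available as (district, token) pairs; the function name and the toy data are hypothetical and do not reproduce the reported numbers.

```python
from collections import Counter


def district_share(occurrences, word, districts):
    """Fraction of a word's occurrences that fall in the given districts.

    occurrences: iterable of (district, token) pairs.
    Returns 0.0 if the word never occurs.
    """
    per_district = Counter(d for d, tok in occurrences if tok == word)
    total = sum(per_district.values())
    if total == 0:
        return 0.0
    return sum(per_district[d] for d in districts) / total


# Toy data, for illustration only.
toy_occurrences = [
    ("Ballia", "saaking"),
    ("Ballia", "saaking"),
    ("Deoria", "saaking"),
    ("Agra", "saaking"),
    ("Agra", "gauvanshiya"),
]
share = district_share(toy_occurrences, "saaking", {"Ballia", "Deoria"})
```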
27. Legal AI - Bail Documents
District-wise ratio of number of bail applications to total cases
28. Legal AI - Bail Prediction Task
Since many bail documents follow a similar structure, they are segmented into facts, arguments, judge's summary and final decision using a rule-based approach
Bail Prediction Task: given the facts of a case, the goal is to predict whether bail will be granted or not
Formally, consider a corpus of bail documents $D = \{b_1, b_2, \ldots, b_n\}$, where each bail document is segmented as $b_i = (h_i, f_i, j_i, y_i)$ to represent the header, facts, judge's summary and bail decision of the document respectively
The facts of every document contain $k$ sentences, $f_i = (s_i^1, s_i^2, \ldots, s_i^k)$, where $s_i^j$ represents the $j$-th sentence of the $i$-th bail document
We formulate the bail prediction task as a binary classification problem. We are interested in modeling $p_\theta(y_i \mid f_i)$, which is the probability of the outcome $y_i$ given the facts of a case $f_i$
29. Legal AI - Bail Prediction Model
We propose a Multitask Model based on Multilingual Transformers:
Bail Prediction – to aid judges during decision making and help reduce case load
Extractive Summarization – to shorten long legal documents and make them amenable to processing in NLP pipelines
30. Legal AI - Bail Prediction Model
In general, performance is lower in district-wise settings, possibly due to large variation across districts
Overall, summarization models perform better than Doc2Vec and simpler Transformer-based models
31. Legal AI - Bail Prediction Error Analysis
In one of the documents, the facts pointed towards a bail-granted decision, and our model predicted this.
However, the actual outcome was that the judge overturned the decision due to lack of attendance by the accused, and the final verdict was bail denied.
The majority of the errors made by our model (incorrectly dismissed/granted) are borderline cases with output probability around 0.5
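One way to surface such borderline cases during error analysis is to flag predictions whose probability lies within a small margin of the decision threshold. The sketch below is illustrative only: the function name, the 0.5 threshold and the 0.05 margin are assumptions, not values reported in this work.

```python
def triage_predictions(probabilities, threshold=0.5, margin=0.05):
    """Turn bail-granted probabilities into (label, is_borderline) pairs.

    A prediction counts as borderline when its probability lies within
    `margin` of `threshold` (both values are assumptions for illustration).
    """
    results = []
    for p in probabilities:
        label = "granted" if p >= threshold else "denied"
        is_borderline = abs(p - threshold) <= margin
        results.append((label, is_borderline))
    return results


triaged = triage_predictions([0.92, 0.52, 0.47, 0.10])
```

Flagged cases could then be reviewed separately from confident errors, since they reflect uncertainty rather than a clear misreading of the facts.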
32. Legal AI for Indian Context - Takeaways
Indian legal documents are a rich source of domain-specific Indic-language corpora, readily available online.
Multiple tasks still need attention, especially in Indian settings:
Legal Summarization
Case recommendations
Citation predictions
33. Fake News & Multimodality
FactDrill: A Data Repository of Fact-checked Social Media Content to Study Fake News Incidents in India, accepted at ICWSM 2022
Multilingual, multimodal dataset curated using fact checkers from India.
34. HashSet - A Dataset For Hashtag Segmentation
Accepted at LREC 2022
Dataset containing hashtags collected from tweets originating from India.
Many hashtags with 2 or more tokens
Motivation: Hashtags often encode important semantic cues that could be useful in downstream tasks such as opinion mining.
Example: a trending hashtag on Twitter since Twitter's acquisition by Musk
#leavingtwitter = Leaving twitter
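To make the task concrete, here is a minimal dictionary-based baseline that splits a hashtag into known words using dynamic programming. It is not the method evaluated with HashSet. The function name and the tiny vocabulary are hypothetical, and preferring the segmentation with the fewest tokens is an arbitrary tie-breaking choice.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into vocabulary words, or return None if impossible.

    best[i] holds the segmentation of the first i characters that uses the
    fewest tokens, or None if no segmentation exists for that prefix.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


toy_vocabulary = {"leaving", "leave", "twitter", "in", "g"}
tokens = segment_hashtag("#leavingtwitter", toy_vocabulary)
```

A baseline like this fails on misspellings, transliterated words and named entities missing from the vocabulary, which is common in hashtags from Indian tweets and is part of what makes the task hard.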
35. Some more of our work involving application of NLP tools
“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning, WebSci 2021
“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak, HyperText 2021