NLP / Language Research at Precog
@ponguru
/in/ponguru Ponnurangam.kumaraguru
May 2, 2022
IIT Bombay
Ponnurangam Kumaraguru (“PK”)
#ProfGiri CS IIIT Hyderabad
ACM Distinguished Member
TEDx Speaker
pk.profgiri
Who We Are
https://precog.iiit.ac.in/
Broad categories of our NLP work
Almost all projects involve applying NLP tools
We deal with a lot of text data – social media
Application
Fake News
Twitter, Reddit (Shocks, Drug, Depression)
Multimodal
Elections
Knowledge Graphs
Theoretical
Code-mixing
Information Extraction for Indian Languages (joint work with Prof. Pushpak)
What I will try and cover
Code Mix
Legal AI
Hashtag segmentation (if time permits)
Code-mix computationally challenging hai. (Hinglish for "Code-mixing is computationally challenging.")
Predominantly noticed in social networks and speech data
Social Media text processing poses certain challenges
Platform-specific tokens: @mentions, #hashtags, URLs (https://)
Incorrect spelling and romanisation
Mixing two or more languages, e.g. Hinglish
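The first challenge above (mentions, hashtags, URLs) is usually handled with a normalization pass before any tagging. A minimal sketch, assuming placeholder tokens `<user>` and `<url>` that are illustrative choices and not taken from the slides:

```python
import re

# Patterns for the platform-specific tokens named on the slide.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalize_post(text: str) -> str:
    """Replace URLs and mentions with placeholders and unwrap hashtags."""
    text = URL_RE.sub("<url>", text)        # hypothetical placeholder
    text = MENTION_RE.sub("<user>", text)   # hypothetical placeholder
    text = HASHTAG_RE.sub(r"\1", text)      # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()


print(normalize_post("@ponguru yeh talk bahut accha tha #NLP https://precog.iiit.ac.in/"))
```

Spelling variation and romanisation are not addressed by this pass; they need a separate normalization or transliteration step.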
A study by Rijhwani et al. quantified the amount of code-mixing on Twitter for European languages.
Takeaway: code-mixing is common in multilingual communities, cities and countries.
This makes code-mixing all the more relevant for India, where most of the public is bilingual or trilingual.
Specific examples
Complexity of Code-Mixed content
Accepted at ACL Findings 2022
Our Data
~55k en-hi code-mixed utterances from publicly available code-mixed datasets (sentiment, hate speech, etc.)
GLUECoS and LinCE benchmarks
Trained a SOTA en-hi code-mixed PoS tagger
Combined the outputs of the PoS tagger and a language ID tool to propose SyMCoM, a metric that captures the syntactic mixing in a code-mixed sentence
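The transcript does not reproduce the SyMCoM formula, so the following is only a sketch of one plausible formulation: for each PoS tag, take the normalized difference between the token counts in the two languages, then combine the absolute per-tag values into a sentence score weighted by how often each tag occurs. The function names, the `(word, language ID, PoS tag)` token layout and the toy sentence are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, language ID, PoS tag), an assumed layout


def symcom_per_tag(tokens: List[Token], l1: str = "en", l2: str = "hi") -> Dict[str, float]:
    """Per-tag score in [-1, 1]: +1 means only l1, -1 means only l2."""
    counts = Counter((pos, lang) for _, lang, pos in tokens if lang in (l1, l2))
    scores = {}
    for pos in {p for p, _ in counts}:
        c1, c2 = counts[(pos, l1)], counts[(pos, l2)]
        scores[pos] = (c1 - c2) / (c1 + c2)
    return scores


def symcom_sentence(tokens: List[Token], l1: str = "en", l2: str = "hi") -> float:
    """Sentence score in [0, 1]: frequency-weighted sum of |per-tag score|."""
    tagged = [t for t in tokens if t[1] in (l1, l2)]
    if not tagged:
        return 0.0
    per_tag = symcom_per_tag(tagged, l1, l2)
    tag_counts = Counter(pos for _, _, pos in tagged)
    return sum(tag_counts[pos] / len(tagged) * abs(s) for pos, s in per_tag.items())


# Toy sentence: the two nouns come from different languages.
sentence = [
    ("movie", "en", "NOUN"), ("ki", "hi", "ADP"), ("kahani", "hi", "NOUN"),
    ("boring", "en", "ADJ"), ("thi", "hi", "AUX"),
]
print(symcom_per_tag(sentence))
print(symcom_sentence(sentence))
```

Under this formulation a sentence whose every PoS tag is realized in a single language scores 1, which matches the reading of the score given later in the talk.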
Our definitions
SyMCoM for OPEN & CLOSED categories
CLOSED-class words (function words: adpositions, pronouns, demonstratives) are used in a single language
OPEN-class words (content words: nouns, adjectives, verbs) are used in both languages
SyMCoM Scores vs POS tags
VERB is highly correlated with closed tags
SyMCoM Scores for benchmark datasets
Across benchmark datasets, there is a significant number of monolingual sentences (the peak on the right side of the plot)
A score closer to 1 means almost all syntactic units come from a single language, i.e. the sentence is effectively monolingual
We need to be careful about which kinds of samples go into benchmark datasets, since all future work and models will be compared on them
LINCE POS, Kushagra et al.
SyMCoM Scores for benchmark datasets for individual PoS tags
Noun and adjective switches are more common than any other kind of switching
SyMCoM here is aggregated at the dataset level, so a score of 1 indicates that only one language is involved for that PoS tag, while a score closer to 0 means that the PoS tag is mixed across languages
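The dataset-level reading above (1 means one language only, 0 means evenly mixed) can be sketched by pooling counts per PoS tag across all sentences and taking the absolute normalized difference. This is an illustrative assumption about the aggregation, with a hypothetical function name and toy data, not the authors' released code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, language ID, PoS tag), an assumed layout


def dataset_level_scores(sentences: List[List[Token]], l1: str = "en", l2: str = "hi") -> Dict[str, float]:
    """Per-tag score in [0, 1], pooled over the whole dataset."""
    counts = defaultdict(lambda: [0, 0])  # pos -> [count in l1, count in l2]
    for sent in sentences:
        for _, lang, pos in sent:
            if lang == l1:
                counts[pos][0] += 1
            elif lang == l2:
                counts[pos][1] += 1
    return {pos: abs(c1 - c2) / (c1 + c2) for pos, (c1, c2) in counts.items()}


toy = [
    [("movie", "en", "NOUN"), ("thi", "hi", "AUX")],
    [("kahani", "hi", "NOUN"), ("hai", "hi", "AUX")],
]
print(dataset_level_scores(toy))
```

In the toy data NOUN appears once in each language and scores 0, while AUX appears only in Hindi and scores 1.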
Takeaways
Our hypothesis is that code-mixing complexity is a function not just of how language IDs are interspersed, but also of which syntactic units are switched
SyMCoM captures syntactic mixing, and ours is the first study to empirically confirm long-held syntactic notions about code-mixing, such as "nouns are switched more"
Our intent with this study was to understand the relationship with downstream task performance:
Which kinds of syntactic switches are easy for models (the likes of XLM-R and mBERT) to process
Which kinds of examples should go into datasets
SyMCoM is the first step in this direction
What are we currently exploring?
Linking code-mix metrics (LID-based metrics and SyMCoM) to model performance on downstream tasks
Constructing benchmark datasets for various tasks, ensuring reliable, high-quality data
Creating pipelines for code-mixed text generation and translation
Legal AI for Indian Context
District courts are usually the first point of contact between the people and the judiciary.
Lower courts in India are burdened with a backlog of cases (~40 million as of 2021).
Documents filed in district courts in India are written in local languages.
Court hierarchy (figure): Supreme Court, High Courts, District Courts
Legal AI for Indian Context
Accepted at Findings of ACL 2022
HLDC: Hindi Legal Documents Corpus
Legal AI - Data
We collected ~900k district court case documents from Uttar Pradesh
All documents are in Hindi, written in Devanagari
Legal corpora exist for the European Court of Justice and Chinese courts, but none for Indian district courts
Legal AI - Data
There are around 300 different case types; the table shows the most prominent ones
The majority of case documents are bail applications
Variation in number of case documents per district
Case types in HLDC
Legal AI - Data Anonymization
We anonymized the corpus using:
Gazetteer + regex rules for NER: lists of first names, last names, titles, locations, etc.
Months and dates were normalized to a placeholder token
Phone numbers were replaced with a placeholder token using regex rules
An RNN-based Hindi NER model to find additional entities and augment the initial gazetteer
Ambiguous names such as Prarthna (a name, or the act of prayer) were removed
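The gazetteer-plus-regex stage described above can be sketched as follows. The gazetteer entries, the placeholder tokens (`<NAME>`, `<LOCATION>`, `<DATE>`, `<PHONE>`) and the date and phone patterns are all hypothetical, and the example uses romanized text for readability; the actual corpus is in Devanagari and the RNN-based NER step is not shown.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> placeholder token.
GAZETTEER = {"ramesh": "<NAME>", "sharma": "<NAME>", "lucknow": "<LOCATION>"}

# Assumed formats: 10-digit phone numbers and numeric dd/mm/yyyy-style dates.
PHONE_RE = re.compile(r"\b\d{10}\b")
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")


def anonymize(text: str) -> str:
    """Mask dates and phone numbers by regex, then names and places by lookup."""
    text = DATE_RE.sub("<DATE>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    out = []
    for tok in text.split():
        core = tok.strip(".,;:()")  # ignore surrounding punctuation for lookup
        if core.lower() in GAZETTEER:
            tok = tok.replace(core, GAZETTEER[core.lower()])
        out.append(tok)
    return " ".join(out)


print(anonymize("Ramesh Sharma of Lucknow called 9876543210 on 02/05/2022."))
```

Dropping ambiguous entries such as Prarthna from the gazetteer, as the slide describes, keeps common words from being masked by mistake.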
Legal AI - Lexical analysis
Even within the legal documents of a single state, we found evidence of dialectal variation across districts
For example, 63.8% of occurrences of the word Saaking (motionless) come from 6 districts of eastern U.P. (Ballia, Azamgarh, Maharajganj, Deoria, Siddharthnagar and Kushinagar)
Similarly, the word Gauvanshiya (cows and related animals) is mostly used in north-western U.P.
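A figure like the 63.8% above is a share of word occurrences falling inside a set of districts. A minimal sketch of that computation, using made-up counts (the district "Agra" and all numbers are illustrative, not corpus statistics):

```python
from collections import Counter
from typing import Iterable, Set


def district_share(occurrence_districts: Iterable[str], districts: Set[str]) -> float:
    """Fraction of a word's occurrences that fall in the given districts."""
    counts = Counter(occurrence_districts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[d] for d in districts) / total


# One entry per occurrence of the word, labelled with the source district (toy data).
occurrences = ["Ballia"] * 3 + ["Deoria"] * 2 + ["Agra"] * 5
print(district_share(occurrences, {"Ballia", "Deoria"}))
```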
Legal AI - Bail Documents
District-wise ratio of number of bail applications to total cases
Legal AI - Bail Prediction Task
Since many bail documents follow a similar structure, they are segmented into facts, arguments, judge's summary and final decision using a rule-based approach
Bail Prediction Task: given the facts of a case, the goal is to predict whether bail will be granted or not
Formally, consider a corpus of bail documents $D = \{b_1, b_2, \ldots, b_n\}$, where each bail document is segmented as $b_i = (h_i, f_i, j_i, y_i)$ to represent the header, facts, judge's summary and bail decision of the document respectively
The facts of every document contain $k$ sentences, $f_i = (s_i^1, s_i^2, \ldots, s_i^k)$, where $s_i^t$ represents the $t$-th sentence of the $i$-th bail document
We formulate the bail prediction task as a binary classification problem. We are interested in modeling $p_\theta(y_i \mid f_i)$, which is the probability of the outcome $y_i$ given the facts $f_i$ of a case
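The segmented document described above maps naturally onto a small record type. The sketch below assumes hypothetical English cue markers (`FACTS:`, `SUMMARY:`, `DECISION:`) purely for illustration; the real rules operate on Hindi documents and are not given in the slides.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class BailDocument:
    header: str
    facts: List[str]     # sentences of the facts section
    judge_summary: str
    decision: int        # 1 = bail granted, 0 = bail denied


# Hypothetical cue markers standing in for the rule-based section boundaries.
FACTS_CUE, SUMMARY_CUE, DECISION_CUE = "FACTS:", "SUMMARY:", "DECISION:"


def segment(raw: str) -> BailDocument:
    """Split a raw document into header, facts, judge's summary and decision."""
    header, rest = raw.split(FACTS_CUE, 1)
    facts, rest = rest.split(SUMMARY_CUE, 1)
    summary, decision = rest.split(DECISION_CUE, 1)
    # Split facts into sentences on '.' or the Devanagari danda.
    sentences = [s.strip() for s in re.split(r"[.।]", facts) if s.strip()]
    label = int(decision.strip().lower() == "granted")
    return BailDocument(header.strip(), sentences, summary.strip(), label)


doc = segment(
    "Case 12 of 2020 FACTS: The accused was arrested. He has no prior record. "
    "SUMMARY: Heard both sides. DECISION: granted"
)
print(doc)
```

A classifier for the task would take only `doc.facts` as input and `doc.decision` as the target, keeping the judge's summary out of the features.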
Legal AI - Bail Prediction Model
We propose a Multitask Model based on Multilingual Transformers:
Bail Prediction – to aid judges during decision making and help reduce case load
Extractive Summarization – to reduce long legal documents and make them amenable for processing in NLP pipelines
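The slides do not state the training objective, so the following is only an assumed illustration of how two such tasks are commonly combined: a binary cross-entropy term for the bail decision plus a weighted per-sentence cross-entropy term that treats extractive summarization as salient-sentence classification. The weight `lam` and all names are hypothetical.

```python
import math
from typing import List


def bce(p: float, y: int, eps: float = 1e-9) -> float:
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multitask_loss(
    bail_prob: float,
    bail_label: int,
    salience_probs: List[float],
    salience_labels: List[int],
    lam: float = 0.5,
) -> float:
    """Bail-prediction loss plus lam times the mean sentence-salience loss."""
    bail_term = bce(bail_prob, bail_label)
    if not salience_probs:
        return bail_term
    pairs = zip(salience_probs, salience_labels)
    summ_term = sum(bce(p, y) for p, y in pairs) / len(salience_probs)
    return bail_term + lam * summ_term


print(multitask_loss(0.8, 1, [0.9, 0.2, 0.6], [1, 0, 1]))
```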
Legal AI - Bail Prediction Model
In general, performance is lower in the district-wise setting, possibly due to large variation across districts
Overall, summarization models perform better than Doc2Vec and simpler Transformer-based models
Legal AI - Bail Prediction Error Analysis
In one of the documents, the facts pointed towards bail being granted, and our model predicted this.
However, the actual outcome recorded in the document was that the judge overturned the decision because the accused failed to attend, and the final verdict was bail denied.
The majority of the errors made by our model (incorrectly dismissed or granted) are borderline cases with an output probability around 0.5
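The observation about borderline errors can be checked by splitting misclassified documents according to how far the predicted probability lies from 0.5. A small sketch, where the 0.1 margin and the toy predictions are assumptions:

```python
from typing import List, Tuple


def split_errors(probs: List[float], labels: List[int], margin: float = 0.1) -> Tuple[List[int], List[int]]:
    """Return indices of (borderline errors, confident errors)."""
    borderline, confident = [], []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = int(p >= 0.5)  # 1 = bail granted
        if pred != y:
            target = borderline if abs(p - 0.5) <= margin else confident
            target.append(i)
    return borderline, confident


# Toy predicted probabilities of "bail granted" and gold labels.
print(split_errors([0.55, 0.9, 0.45, 0.2], [0, 0, 1, 0]))
```

Borderline cases found this way are natural candidates for deferring to a human rather than trusting the model's output.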
Legal AI for Indian Context - Takeaways
Indian legal documents are a rich source of domain-specific Indic-language corpora, readily available online.
Multiple tasks still need attention, especially in Indian settings:
Legal summarization
Case recommendation
Citation prediction
Fake News & Multimodality
FactDrill: A Data Repository of Fact-checked Social Media Content to Study Fake News Incidents in India, accepted at ICWSM 2022
Multilingual, multimodal dataset curated using fact checkers from India.
HashSet - A Dataset For Hashtag Segmentation
Accepted at LREC 2022
Dataset containing hashtags collected from tweets originating from India.
Many hashtags have 2 or more tokens
Motivation: hashtags often encode important semantic cues that could be useful in downstream tasks such as opinion mining.
A hashtag trending on Twitter since Twitter's acquisition by Musk:
#leavingtwitter = Leaving twitter
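To make the task concrete, here is a simple dictionary-based baseline that splits a hashtag into the fewest known words using dynamic programming. It is a generic illustration, not the method evaluated with HashSet, and the toy vocabulary is an assumption.

```python
from typing import List, Optional, Set


def segment_hashtag(hashtag: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split a hashtag into the fewest vocabulary words, or None if impossible."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the shortest segmentation of text[:i], if one exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


vocab = {"leaving", "twitter", "leav", "ing", "twit"}  # toy vocabulary
print(segment_hashtag("#leavingtwitter", vocab))
```

A baseline like this fails on out-of-vocabulary words, which are common in romanized and code-mixed hashtags from India.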
Some more of our work involving application of NLP tools
“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning, WebSci 2021
“A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak, HyperText 2021
Superstar students & Partner(s) in Crime :-)
Thanks!
Questions? pk.guru@iiit.ac.in
http://precog.iiit.ac.in/
@ponguru
pk.profgiri
linkedin/in/ponguru

 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 

NLP / Language Research at Precog

  • 1. NLP / Language Research at Precog @ponguru /in/ponguru Ponnurangam.kumaraguru May 2, 2022 IIT Bombay Ponnurangam Kumaraguru (“PK”) #ProfGiri CS IIIT Hyderabad ACM Distinguished Member TEDx Speaker pk.profgiri
  • 5. Broad categories of our NLP work. Almost all projects involve the application of NLP tools. We deal with a lot of text data, mostly from social media. Application: Fake News; Twitter, Reddit (Shocks, Drug, Depression); Multimodal; Elections; Knowledge Graphs. Theoretical: Code-mixing; Information Extraction for Indian Languages (joint work with Prof. Pushpak).
  • 6. What I will try and cover: Code Mix; Legal AI; Hashtag segmentation (if time permits).
  • 8. Code-mix computationally challenging hai. Predominantly noticed in social networks and speech data. Social media text processing poses certain challenges: @, #, https://; incorrect spelling and Romanisation; mixing two or more languages, e.g. Hinglish.
  • 9. Code-mix computationally challenging hai. A study by Rijhwani et al. quantifies the amount of code-mixing on Twitter for European languages. Takeaway: code-mixing is common in multilingual communities/cities/countries. This makes code-mixing all the more relevant for India, where most of the public is bi/tri-lingual.
  • 11. Complexity of Code-Mixed content. Accepted at ACL Findings 2022.
  • 12. Our Data: ~55k en-hi code-mixed utterances from publicly available code-mix datasets (Sentiment, Hate Speech, etc.). GLUECoS and LinCE benchmarks. Trained a SOTA en-hi code-mixed PoS tagger. Combined the PoS tagger and Language ID tool outputs to propose SyMCoM, a metric that captures the syntactic mixing in a code-mixed sentence.
  • 14. SyMCoM for OPEN & CLOSED categories. CLOSED-class words (function words: adpositions, pronouns, demonstratives) are used in a single language. OPEN-class words (content words: nouns, adjectives, verbs) are used in both languages.
  • 15. SyMCoM Scores vs POS tags. VERB is highly correlated with closed tags.
  • 16. SyMCoM Scores for benchmark datasets. Across benchmark datasets, there is a significant number of monolingual sentences (peak on the right side of the plot). A score closer to 1 means almost all syntactic units are from a single language, i.e. the sentence is monolingual. We need to be careful about what kind of samples go into benchmark datasets, on which all future work/models will be compared. Plots shown: LINCE POS, Kushagra et al.
  • 17. SyMCoM Scores for benchmark datasets for individual PoS tags. Noun + Adj switching is more common than any other switching. SyMCoM here is aggregated at dataset level, so a score of 1 indicates that only one language is involved for that PoS tag, and a score closer to 0 means that the PoS tag is mixed.
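The slide that defines SyMCoM is not part of this transcript, so the following is only a minimal sketch under an assumed definition that matches the behavior described above: the per-PoS-tag score is taken as the normalized difference between English and Hindi token counts (absolute value 1 when one language is involved, 0 when evenly mixed), and the sentence score as the count-weighted average of the absolute per-tag scores. The function names and the `(pos_tag, lang)` input format are illustrative, not taken from the talk.

```python
from collections import Counter, defaultdict

LANGS = ("en", "hi")  # assumed language labels from a Language ID tool


def symcom_per_tag(tokens):
    """tokens: list of (pos_tag, lang) pairs; returns {pos_tag: score in [-1, 1]}."""
    counts = defaultdict(Counter)
    for pos, lang in tokens:
        if lang in LANGS:  # ignore language-neutral tokens (emoji, URLs, ...)
            counts[pos][lang] += 1
    scores = {}
    for pos, c in counts.items():
        en, hi = c["en"], c["hi"]
        # Assumed form: normalized count difference between the two languages.
        scores[pos] = (en - hi) / (en + hi)
    return scores


def symcom_sentence(tokens):
    """Count-weighted mean of |per-tag score|; 1.0 means syntactically monolingual."""
    kept = [(pos, lang) for pos, lang in tokens if lang in LANGS]
    if not kept:
        return None
    per_tag = symcom_per_tag(kept)
    tag_counts = Counter(pos for pos, _ in kept)
    return sum(tag_counts[pos] / len(kept) * abs(s) for pos, s in per_tag.items())


# Hypothetical tagged sentence: nouns are mixed, verb and adposition are Hindi only.
mixed = [("NOUN", "en"), ("NOUN", "hi"), ("VERB", "hi"), ("ADP", "hi")]
print(symcom_per_tag(mixed))
print(symcom_sentence(mixed))
```

Under this assumed definition, a sentence whose tokens all carry the same language label scores 1.0, which corresponds to the peak on the right side of the benchmark plots.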
  • 18. Takeaways. Our hypothesis is that code-mixing complexity is a function not just of the interspersing of Language IDs, but also of the syntactic units that are switched. SyMCoM captures the syntactic mixing, and ours is the first study to empirically prove long-held syntactic notions of code-mixing such as "nouns are switched more". Our intent with this study was to understand the relationship with downstream task performance: which kinds of syntactic switches are easy for models (the likes of XLM-R and mBERT) to process, and which kinds of examples should go into datasets. SyMCoM is the first step in this direction.
  • 19. What are we currently exploring? Linking code-mix metrics (LID-based metrics and SyMCoM) to model performance in downstream tasks. Constructing benchmark datasets for various tasks, ensuring reliable and high-quality data. Creating pipelines for code-mix text generation/translation.
  • 21. Legal AI for Indian Context. District courts are usually the first point of contact between the people and the judiciary. Lower courts in India are burdened with a backlog of cases (~40 million as of 2021). Local languages are used in the documents filed in district courts in India. Court hierarchy shown: Supreme Court, High Courts, District Courts.
  • 22. Legal AI for Indian Context. HLDC: Hindi Legal Documents Corpus. Accepted at Findings of ACL 2022.
  • 23. Legal AI - Data. We collected ~900k district court case documents from Uttar Pradesh. All documents are in Hindi, written in Devanagari. There are legal corpora for the European Court of Justice and Chinese courts, but none for Indian district courts.
  • 24. Legal AI - Data. There are around 300 different case types; the table shows the prominent ones. The majority of the case documents correspond to bail applications. Figures: variation in number of case documents per district; case types in HLDC.
  • 25. Legal AI - Data Anonymization. We augmented the corpus using: Gazetteer + regex rules for NER, with lists of first names, last names, titles, locations, etc. For example, months and dates were normalized to […], and phone numbers were replaced with […] using regex rules. An RNN-based Hindi NER model was used to find additional entities and augment the initial gazetteer. Ambiguous names like Prarthna (a name, or the act of prayer) were removed.
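The gazetteer-plus-regex idea above can be illustrated with a small sketch. It is not the HLDC pipeline: the placeholder tags `<NAME>` and `<PHONE>`, the 10-digit phone pattern, and the tiny gazetteer are all assumptions made for illustration. Tokens are matched whole after whitespace splitting, since regex word boundaries are unreliable around Devanagari vowel signs.

```python
import re

NAME_TAG = "<NAME>"    # hypothetical placeholder, not the tag used in HLDC
PHONE_TAG = "<PHONE>"  # hypothetical placeholder, not the tag used in HLDC
PUNCT = "।,.;:()"      # punctuation stripped before the gazetteer lookup

# Assumed pattern: a standalone run of exactly 10 digits.
# In Python 3, \d also matches Devanagari digits for str patterns.
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def build_gazetteer(names, ambiguous):
    """Drop entries that are also common words (e.g. a name meaning 'prayer')."""
    return set(names) - set(ambiguous)


def anonymize(text, gazetteer):
    """Replace phone numbers via regex, then whole tokens found in the gazetteer."""
    text = PHONE_RE.sub(PHONE_TAG, text)
    out = []
    for token in text.split():
        core = token.strip(PUNCT)
        if core in gazetteer:
            token = token.replace(core, NAME_TAG)
        out.append(token)
    return " ".join(out)  # note: collapses runs of whitespace


gazetteer = build_gazetteer(names={"राम", "प्रार्थना"}, ambiguous={"प्रार्थना"})
print(anonymize("अभियुक्त राम का फोन 9876543210 है।", gazetteer))
```

A learned NER model, as mentioned on the slide, would complement this by proposing entities that the gazetteer misses.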
  • 26. Legal AI - Lexical analysis. Even within legal documents of a single state, we found evidence of dialectal variations across districts. For example, 63.8% of the occurrences of the word Saaking (motionless) come from 6 districts of East U.P. (Ballia, Azamgarh, Maharajganj, Deoria, Siddharthnagar and Kushinagar). Similarly, the word Gauvanshiya (cow and related animals) is mostly used in North-Western U.P.
  • 27. Legal AI - Bail Documents. Figure: district-wise ratio of number of bail applications to total cases.
  • 28. Legal AI - Bail Prediction Task. Since many bail documents follow a similar structure, they are segmented into facts, arguments, judge's summary and final decision using a rule-based approach. Bail Prediction Task: given the facts of a case, the goal is to predict whether bail will be granted or not. Formally, consider a corpus of bail documents $D = \{b_1, b_2, \ldots, b_n\}$, where each bail document is segmented as $b_i = (h_i, f_i, j_i, y_i)$ to represent the header, facts, judge's summary and bail decision of the document respectively. The facts of every document contain $k$ sentences, $f_i = (s_i^1, s_i^2, \ldots, s_i^k)$, where $s_i^j$ represents the $j$-th sentence of the $i$-th bail document. We formulate the bail prediction task as a binary classification problem. We are interested in modeling $p(y_i \mid f_i)$, which is the probability of the outcome $y_i$ given the facts $f_i$ of a case.
  • 29. Legal AI - Bail Prediction Model. We propose a multitask model based on multilingual Transformers. Bail Prediction: to aid judges during decision making and help reduce case load. Extractive Summarization: to shorten long legal documents and make them amenable to processing in NLP pipelines.
  • 30. Legal AI - Bail Prediction Model. In general, performance is lower in district-wise settings, possibly due to large variation across districts. Overall, summarization models perform better than Doc2Vec and simpler Transformer-based models.
  • 31. Legal AI - Bail Prediction Error Analysis. In one of the documents, the facts pointed towards a bail-granted decision, and our model predicted this. However, the actual result of the document showed that the judge overturned the decision due to lack of attendance by the accused, and the final verdict was bail denied. The majority of the errors made by our model (incorrectly dismissed/granted) are borderline cases with an output probability around 0.5.
  • 32. Legal AI for Indian Context - Takeaways. Indian legal documents are a rich source of domain-specific Indic-language corpora, readily available online. Multiple tasks still need attention, especially in Indian settings: legal summarization, case recommendations, citation predictions.
  • 33. Fake News & Multimodality. FactDrill: A Data Repository of Fact-checked Social Media Content to Study Fake News Incidents in India. Accepted at ICWSM 2022. Multilingual, multimodal dataset curated using fact checkers from India.
  • 34. HashSet - A Dataset For Hashtag Segmentation. Accepted at LREC 2022. Dataset containing hashtags collected from tweets originating from India. Many hashtags have 2 or more tokens. Motivation: hashtags often encode important semantic cues that could be useful in downstream tasks such as opinion mining. Example: a hashtag trending on Twitter since Twitter's acquisition by Musk, #leavingtwitter = Leaving twitter.
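To make the segmentation task concrete, here is a simple dictionary-based baseline, offered only as an illustration of what "#leavingtwitter = Leaving twitter" involves. It is not the method evaluated on HashSet; the vocabulary is a toy assumption, and the function name is hypothetical. It uses dynamic programming to find a split into known words with the fewest pieces.

```python
def segment_hashtag(hashtag, vocab):
    """Split a hashtag into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if no full segmentation exists.
    """
    s = hashtag.lstrip("#").lower()
    n = len(s)
    # best[i] holds the shortest word list that exactly covers s[:i].
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in vocab:
                candidate = best[j] + [s[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


# Toy vocabulary (assumption); real hashtags need far larger, multilingual lexicons.
vocab = {"leaving", "twitter", "leave", "in", "twit"}
print(segment_hashtag("#leavingtwitter", vocab))
```

A fixed word list like this fails on Romanised or code-mixed hashtags whose tokens are out of vocabulary, which is part of what makes a dataset of hashtags from Indian tweets useful.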
  • 35. Some more of our work involving application of NLP tools: “Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning, WebSci 2021. “A Virus Has No Religion”: Analyzing Islamophobia on Twitter During the COVID-19 Outbreak, HyperText 2021.
  • 36. Superstar students & Partner(s) in Crime :-)