Sandya Mannarswamy
sandyasm@gmail.com
Rigorous evaluation of NLP models for real-world deployment
Context
• In 5 years, how many people will interact with an NLP application daily? 7.5 billion
• What is the size of the NLP market in billions after five years? $116 B
• What % of AI/NLP projects fail to make it from idea to production? ≈ 87%
Let us deliver robust and responsible NLP
Photo credits: https://wallpaperaccess.com/full/818001.jpg
Sandya Mannarswamy
sandyasm@gmail.com
• 20 years industry and research experience
working with Microsoft, HP, IBM & Xerox Labs
• PhD in Computer Science from IISc.
• Research interests span Natural language
processing, Machine learning. Earlier work on
Compilers
• Holds 52 papers & patents pending
• Code Sport columnist in Open Source For You
• Currently Independent Researcher &
Consultant
Agenda
• How robust are current State of the Art NLP Models?
• How can we make NLP models robust?
NLP Applications Are Everywhere
• People communicate almost everything in language
  - Web search
  - Advertising
  - Emails
  - Customer service
  - Language translation
  - Virtual agents
  - Medical reports
• AI beats humans in Stanford reading comprehension test (CNET)
• Google Search Now Reads at a Higher Level (WIRED)
Guess What Is Common Between These News Headlines?
• Algorithms grading millions of students’ essays
• AI – Key to recruiting diverse workforce
• Microsoft unveiled Tay, a Twitter bot that…
• The Warren Buffett And Anne Hathaway Trade
All these are NLP models which failed in production
Mummy this is just dummy
Building an NLP Model – Current Recipe
• Take a representative dataset, split into train/validation/test sets
• Use the latest BERT/RoBERTa/ALBERT model
• Or build my own fancy deep architecture
• Prove the model achieves > X% on the test dataset
• Voila! We are ready to go for live deployment
NLP Model Development Cycle
[Diagram: a cycle through Collect labelled dataset, Build model, > x% test data performance, Manual validation, and Failures]
• Two-thirds of models fail after they have gone live
• Iteratively fixing issues: ↑ time, ↑ $$$
Real World Data Can Be Highly Diverse!
• Sentiment Analysis is a well-known NLP task
• State of the Art (SOTA) models exceed 95% on benchmark datasets
• But they perform badly on many real-world utterances!
  - varying tone/formality/code-switched/transliterated text
I ❤️ this movie,
I love this flick,
I love this படம்,
Movie Aacha Hai!,
IMO, Gr8 movie!,
Value for your money!
Luv this movee
Arnie Killed it!
One word can mean 100 different things, 100 words can mean the same thing!
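A minimal sketch of how such variants could be turned into a consistency check. The keyword model `toy_sentiment` and its lexicon are hypothetical stand-ins for a real classifier; they are assumptions made only to show how paraphrases of one positive review can expose brittleness.

```python
import re

# Hypothetical lexicon for a toy model; a real system would be a trained classifier.
POSITIVE_WORDS = {"love", "great", "good"}

def toy_sentiment(text):
    """Return 'positive' if any lexicon word appears, else 'unknown'."""
    tokens = re.findall(r"\w+", text.lower())
    return "positive" if any(t in POSITIVE_WORDS for t in tokens) else "unknown"

def consistency(predict, variants, expected):
    """Fraction of variants labelled as expected, plus the variants that failed."""
    failures = [v for v in variants if predict(v) != expected]
    return 1 - len(failures) / len(variants), failures

# A plain baseline sentence followed by variants taken from the slide.
variants = [
    "I love this movie",
    "I love this flick",
    "Luv this movee",
    "IMO, Gr8 movie!",
    "Movie Aacha Hai!",
]
rate, failures = consistency(toy_sentiment, variants, "positive")
print(f"consistent on {rate:.0%} of variants; failed on: {failures}")
```

The same `consistency` helper could wrap any real model's predict function; the point is to report the failing variants, not only an aggregate score.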
NLP Models Match Human Performance in GLUE, SuperGLUE Benchmarks
But do models really understand the task?
Credits: https://super.gluebenchmark.com/leaderboard
Current NLP Paradigm
• Models are based on associative learning (correlation)
• Models learn statistical cues (superficial patterns) present in the training data that are predictive of the label
• Examples of statistical cues can be
  - presence of specific words in the data instances mapping to a specific label
  - lexical overlap between two sentences (in sentence pair classification)
  - presence of linguistic phenomena such as negation
• Statistical cues need not be reflective of the underlying task.
• This can lead to models with high “test set” performance, but with poor generalization.
How well do NLP models generalize?
Task #1 Sentiment Analysis
• Sentence: “A great white shark bit me on the beach”
• Is this sentence positive or negative?
• Predicted as ‘Highly Positive’!
  - In Google Cloud-NLP
  - In Microsoft’s Text Analytics services
  - Or even in Stanford’s deep-learning-based sentiment model
Models get misled by the surface cue words “great” and “white”
Task #2 Machine Reading Comprehension
• Model: Bidirectional Attention Flow (BiDAF)
• Given Paragraph: The largest portion of
the Huguenots to settle in the Cape
arrived between 1688 and 1689…but
quite a few arrived as late as 1700;
thereafter, the numbers declined…
• Given Question: The number of new
Huguenot colonists declined after what
year?
• Machine Answer: 1700
[Figure: SQuAD 1.0 leaderboard, marking human performance and the logistic regression baseline. Source: https://rajpurkar.github.io/SQuAD-explorer]
Did the Machine Understand the Question?
Original paragraph:
• Given Paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined.
• Given Question: The number of new Huguenot colonists declined after what year?
• Machine Answer: 1700
Let’s add one more line…
• Given Paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.
• Given Question: The number of new Huguenot colonists declined after what year?
• Machine Answer: 1675
Surface cue: the model extracts the answer from the sentence most similar to the question
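The surface cue can be imitated with a few lines of word matching. The sketch below is not BiDAF; it is a hypothetical overlap heuristic (the stopword list, the sentence split, and the function names are assumptions) that picks the sentence sharing the most content words with the question and returns the year it contains. It answers like the model in both cases, which is the worrying part.

```python
import re

# Assumed stopword list, just enough for this example.
STOPWORDS = {"the", "of", "after", "what", "a", "in", "to", "and", "as", "but"}

def content_words(text):
    """Lowercased alphanumeric tokens minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def most_similar_sentence(sentences, question):
    """Sentence with the largest content-word overlap with the question."""
    q = content_words(question)
    return max(sentences, key=lambda s: len(content_words(s) & q))

def answer_year(sentences, question):
    """Return the first four-digit year in the best-matching sentence."""
    match = re.search(r"\b1\d{3}\b", most_similar_sentence(sentences, question))
    return match.group(0) if match else None

# The paragraph from the slide, split by hand into two clauses.
paragraph = [
    "The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689",
    "but quite a few arrived as late as 1700; thereafter, the numbers declined",
]
distractor = "The number of old Acadian colonists declined after the year of 1675"
question = "The number of new Huguenot colonists declined after what year?"

before = answer_year(paragraph, question)
after = answer_year(paragraph + [distractor], question)
print(before, after)
```

A model whose answers track this heuristic on adversarially extended paragraphs is probably matching words rather than reading.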
Task #3 Natural Language Inference (NLI)
• Given a premise (P) and hypothesis (H),
determine the relation as Entailment,
Contradiction or Neutral
• Example #1
• P: A man is standing in front of the statue on
the beach
• H: A man is sleeping on the beach
• Label: Contradiction
• Example #2
• P: A man is standing on roof
• H: There is a man on roof
• Label: Entailment
• Example #3
• P: A man is standing on the roof
• H: the man has a hammer in hand
• Label: Neutral
[Figure: SOTA numbers for MNLI]
How robust are the NLI models?
Example | Actual | Predicted | Patterns/Cues
P: The judge was paid by the actor / H: The actor was paid by the judge | Contradict | Entail | Lexical overlap
P: Enthusiasm for Disney’s Lion King dwindles / H: Disney’s Lion King is no longer enthusiastically attended | Entail | Contradict | Negation handling
P: Child Services will receive $200,000 grant / H: A grant of $900,000 will go to Child Services | Contradict | Entail | Numerical reasoning
P: It was still at night / H: The sun has not risen yet, and the moon was still shining | Entail | Contradict | World knowledge
P: “have her show the message” – said Paul / H: Paul told her to hide the message | Contradict | Entail | Antonym relation
All the models drop in performance by > 48% on the above examples
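To make the lexical overlap cue concrete, here is a small sketch of an overlap-only heuristic. The threshold of 0.9 and the function names are assumptions chosen for illustration; no real NLI model is involved. On the judge/actor pair it makes the same mistake listed above, because the two sentences use exactly the same words.

```python
def lexical_overlap(premise, hypothesis):
    """Fraction of hypothesis words that also appear in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)

def overlap_heuristic(premise, hypothesis, threshold=0.9):
    """Predict 'entailment' purely from word overlap (assumed threshold)."""
    return "entailment" if lexical_overlap(premise, hypothesis) >= threshold else "non-entailment"

premise = "The judge was paid by the actor"
hypothesis = "The actor was paid by the judge"
overlap = lexical_overlap(premise, hypothesis)
prediction = overlap_heuristic(premise, hypothesis)
print(overlap, prediction)  # the gold label for this pair is contradiction
```

Comparing a trained model's errors with this heuristic's predictions is one cheap way to check whether the model leans on the same cue.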
Task #4 Argument Reasoning Comprehension
• Given a claim and a reason, the task is to predict the warrant which makes the claim valid
Claim: Google is not a harmful monopoly
Reason: People can choose not to use Google
Warrant: Other search engines don’t redirect to Google
Alternative: All other search engines redirect to Google
• BERT achieves 78%!
• BERT picked up surface cues such as the words “NOT”, “do”, “is” in warrant sentences
• After eliminating ‘label correlated’ shallow statistical cues, BERT performance drops to 50%
BERT is a strong learner, but depends on shallow surface cues in solving this task!
NLP’s Clever Hans Effect
• NLP Models often end up learning
from shallow surface cues present in
training data
• Such cues may not correlate with the
task being solved
• Models can show strong test set
performance!
• Real world data may not have those
shallow cues and performance can
really drop
Be skeptical when a model achieves near-human performance on complex NLP tasks!
Photo credits: https://en.wikipedia.org/wiki/Clever_Hans
Improving Model Robustness
• Spend time on exploring your data
• Understand why your model is making the decision
  - Use interpretability tools to visualize the reason for the model decision
  - See whether your model is depending on shallow surface cues unrelated to the actual task
  - A number of model interpretability tools are available
    o LIME
    o AllenNLP Interpret
Interpreting “Sentiment Analysis” with AllenNLP
Model depends on the word cue "great"
Interpreting “Textual Entailment” with AllenNLP
Model depends on lexical overlap between premise and hypothesis
It Is Not Just Me, Even Andrej Says It!
Andrej Karpathy
Famous AI researcher
Now director of AI at Tesla
• Models pick up spurious correlations or cues present in the dataset
• Cues aligned with the task improve performance
• Cues are detrimental to classifier performance
  - if they are not representative of the actual task
  - if they occur in the training data but not in real-world data
Identifying Misleading Cues in Dataset Using PMI
Example from the SQuAD dataset: word cues ranked by PMI
Type of Question | Cue word | PMI score
Why | because | 0.92
When | did | 0.91
When | Since | 0.89
When | year | 0.76
Which | people | 0.69
Which | Into | 0.68
• We can use Pointwise Mutual Information (PMI) to identify such cue words
$$\mathrm{PMI}(\mathit{word}, \mathit{class}) = \log \frac{p(\mathit{word}, \mathit{class})}{p(\mathit{word})\, p(\mathit{class})}$$
• Examine the top-k PMI words to check whether they are task representative or not
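A short sketch of how the PMI ranking could be computed from a labelled dataset. It assumes probabilities are estimated from example counts (a word counts once per example) and adds a `min_count` filter, which is an assumption of this sketch since rare words otherwise get inflated scores. The four toy examples are invented for illustration.

```python
import math
from collections import Counter

def pmi_scores(examples, min_count=1):
    """PMI(word, class) for every (word, class) pair seen together.

    examples: iterable of (tokens, label); each word counts once per example.
    """
    n = 0
    word_counts, class_counts, joint_counts = Counter(), Counter(), Counter()
    for tokens, label in examples:
        n += 1
        class_counts[label] += 1
        for word in set(tokens):
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    return {
        (word, label): math.log(
            (joint / n) / ((word_counts[word] / n) * (class_counts[label] / n))
        )
        for (word, label), joint in joint_counts.items()
        if word_counts[word] >= min_count
    }

def top_k_cues(scores, label, k):
    """The k words with the highest PMI for one class (ties broken alphabetically)."""
    ranked = sorted(
        ((score, word) for (word, cls), score in scores.items() if cls == label),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [word for _score, word in ranked[:k]]

# Invented toy data: "great" only ever appears with the positive label.
toy = [
    (["great", "movie"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "movie"], "neg"),
    (["boring", "plot"], "neg"),
]
scores = pmi_scores(toy, min_count=2)
print(top_k_cues(scores, "pos", 1))
```

Words that rank high for a class but have nothing to do with the task are candidates for the spurious cues discussed above.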
Do Named Entities in Data Carry any Unintended Biases?
• Large NLP models are often trained on
public data (Web/Wikipedia/News)
• A person’s name can often be mentioned
in negative contexts
• Models learn a spurious negative
association between that named entity
and sentiment
• Model should ideally be independent of
entities mentioned in the text!
• But it ends up being sensitive to named
entities in the text
Example from the FB-pub dataset using a toxicity model
Sentence | Sentiment
I hate Justin Timberlake | -0.3
I hate Kate Perry | -0.1
I hate Taylor Swift | -0.4
I hate Rihana | -0.6
Perturbation Sensitivity Analysis (PSA)
• Use PSA to measure unintended biases: whether the model has learnt any unwanted associations with named entities
For each sentence containing a named entity:
1. PSA perturbs the sentence
   • by replacing the entity with other equivalent entities
2. Measure the sensitivity of the model
   • by running it on the perturbed sentences generated
3. Identify any unintended biases associated with named entities
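The three PSA steps can be sketched as follows. Everything named here is an assumption for illustration: `biased_score` is a hypothetical toy scorer with a deliberately planted name bias, the placeholder names are invented, and the score range and standard deviation are simple sensitivity summaries rather than the exact metrics of any published PSA implementation.

```python
from statistics import pstdev

def perturbation_sensitivity(score_fn, sentence, entity, replacements):
    """Score the sentence with the entity swapped for each replacement name."""
    names = [entity] + [r for r in replacements if r != entity]
    scores = {name: score_fn(sentence.replace(entity, name)) for name in names}
    values = list(scores.values())
    return {
        "scores": scores,
        "range": max(values) - min(values),  # 0 means the entity has no effect
        "std": pstdev(values),
    }

# Hypothetical toy scorer: the base score depends on "hate", plus a spurious
# offset for one name, mimicking an unwanted learnt association.
NAME_BIAS = {"Bob": -0.3}

def biased_score(text):
    base = -0.5 if "hate" in text.lower() else 0.0
    return base + sum(offset for name, offset in NAME_BIAS.items() if name in text)

def unbiased_score(text):
    return -0.5 if "hate" in text.lower() else 0.0

biased = perturbation_sensitivity(biased_score, "I hate Alice", "Alice", ["Bob", "Carol"])
clean = perturbation_sensitivity(unbiased_score, "I hate Alice", "Alice", ["Bob", "Carol"])
print(biased["scores"], biased["range"], clean["range"])
```

A non-zero range for sentences that differ only in the name is the signal to investigate; in practice `score_fn` would wrap the deployed model.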
Dataset Ablation Analysis
• Do dataset ablation (in addition to model ablation)
  - Test your model with partial input
  - Test using random labels
  - Augment your training data with counter examples
Counter example pair:
Every film student should see this thing just so they'll know the very definition of a perfect movie.
Every film student should see this thing just so they'll know the very definition of a bad movie.
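One way to run the partial-input test is to train a baseline that never sees the full input. The sketch below assumes an NLI-style dataset and a deliberately crude word-vote classifier that reads only the hypothesis; the data and function names are invented for illustration. If such a baseline scores well above chance, the labels are leaking through cues in the hypothesis alone.

```python
from collections import Counter, defaultdict

def train_hypothesis_only(examples):
    """Count, per hypothesis word, how often it co-occurs with each label."""
    word_votes = defaultdict(Counter)
    labels = Counter()
    for _premise, hypothesis, label in examples:
        labels[label] += 1
        for word in set(hypothesis.lower().split()):
            word_votes[word][label] += 1
    return word_votes, labels.most_common(1)[0][0]

def predict(model, hypothesis):
    """Sum the label votes of the hypothesis words; the premise is ignored."""
    word_votes, fallback = model
    votes = Counter()
    for word in set(hypothesis.lower().split()):
        votes.update(word_votes.get(word, Counter()))
    return votes.most_common(1)[0][0] if votes else fallback

def accuracy(model, examples):
    return sum(predict(model, h) == y for _p, h, y in examples) / len(examples)

# Invented toy data where negation words correlate with "contradiction".
train = [
    ("A man is standing on the roof", "A man is not on the roof", "contradiction"),
    ("A dog runs in the park", "The dog is not moving", "contradiction"),
    ("A woman reads a book", "Nobody is reading", "contradiction"),
    ("A man is standing on the roof", "There is a man on the roof", "entailment"),
    ("A dog runs in the park", "An animal is outside", "entailment"),
    ("A woman reads a book", "A person is reading", "entailment"),
]
held_out = [
    ("Kids play football", "The kids are not playing", "contradiction"),
    ("Kids play football", "There are kids outside", "entailment"),
]
model = train_hypothesis_only(train)
print(accuracy(model, held_out))
```

The random-label test is the mirror image: retrain on shuffled labels and confirm that held-out accuracy falls to chance.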
Use Data Augmentation to Improve Model
• Generate synthetic data to stress test your model
• Augment data to break the spurious (pattern, label) correlation
• Word/Phrase level methods
  - Synonym Replacement (using WordNet/Embeddings)
  - Random Word Swap/Insertion/Deletion
  - Phrase level paraphrase replacement (using PPDB)
• Readily available Python libraries
  - NLP Augment
  - Easy Data Augmentation for NLP
• Using back translation for paraphrasing
  - Translate from Language L1 to Language L2 and back to L1
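The word-level methods are simple enough to sketch without any library. The functions below are a minimal, assumed implementation of random swap, random deletion, and synonym replacement; the tiny `SYNONYMS` dictionary is a hypothetical stand-in for WordNet or embedding neighbours, and this is not the API of any of the libraries named above.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def synonym_replacement(words, synonyms, n, rng):
    """Replace up to n words that have an entry in the synonym dictionary."""
    words = list(words)
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(synonyms[words[i].lower()])
    return words

# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {"love": ["adore"], "movie": ["film", "flick"]}

rng = random.Random(0)  # seeded so runs are repeatable
sentence = "I love this movie".split()
swapped = random_swap(sentence, 1, rng)
shortened = random_deletion(sentence, 0.3, rng)
replaced = synonym_replacement(sentence, SYNONYMS, 2, rng)
print(swapped, shortened, replaced)
```

Augmented sentences keep the original label, so they are most useful when the perturbation cannot change the meaning; swaps and deletions on short texts should be spot-checked by hand.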
Preventing Models from Learning Surface Cues
• How can we design models which do not learn spurious surface cues in the dataset?
• Train using an ensemble of classifiers
  - Train a naïve model which predicts based on surface cues only
  - Train a stronger model which focuses on other patterns in the data, excluding the surface cues
  - Use only the stronger model for inference
• The stronger model generalizes well to out-of-domain examples
• Different types of ensembles can be used
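As one possible instance of such an ensemble, the sketch below uses a product-of-experts combination; this particular choice is an assumption, since the slide does not say which ensemble is meant. The naïve model's class probabilities are treated as fixed, the training loss is computed on the combined distribution, and at inference only the stronger model's logits would be used.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def product_of_experts_loss(main_logits, bias_probs, gold):
    """Negative log-likelihood of the gold class under main model x naive model.

    bias_probs: class probabilities from the frozen cue-only model.
    """
    main_log_probs = log_softmax(main_logits)
    combined = log_softmax(
        [lp + math.log(bp) for lp, bp in zip(main_log_probs, bias_probs)]
    )
    return -combined[gold]

# The stronger model is undecided; only the naive model's confidence differs.
undecided = [0.0, 0.0, 0.0]
loss_when_cue_works = product_of_experts_loss(undecided, [0.9, 0.05, 0.05], gold=0)
loss_when_cue_fails = product_of_experts_loss(undecided, [0.05, 0.9, 0.05], gold=0)
print(loss_when_cue_works, loss_when_cue_fails)
```

Examples that the cue-only model already gets right contribute little loss, so training pressure on the stronger model concentrates on examples where the surface cue fails.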
Takeaways
• SOTA performance on a test dataset does not imply production readiness
• Understand your model using interpretability tools
• Test your model for diverse inputs
• Do dataset ablation
• Do targeted data augmentation to improve your model
  - Do not add data indiscriminately
  - Add needed data only!
• Continue monitoring after deployment
Thank you!

Pigging Solutions Piggable Sweeping Elbows
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Rigorous evaluation of NLP models in real-world deployment

  • 2. Context • In 5 years, how many people will interact with an NLP application daily? 7.5 billion • What is the size of the NLP market in billions after five years? $116 B • What % of AI/NLP projects fail to make it from idea to production? ≈ 87% • Let us deliver robust and responsible NLP • Photo credits: https://wallpaperaccess.com/full/818001.jpg
  • 3. Sandya Mannarswamy sandyasm@gmail.com • 20 years industry and research experience working with Microsoft, HP, IBM & Xerox Labs • PhD in Computer Science from IISc. • Research interests span Natural language processing, Machine learning. Earlier work on Compilers • Holds 52 papers & patents pending • Code Sport columnist in Open Source For You • Currently Independent Researcher & Consultant
  • 4. Agenda • How robust are current State of the Art NLP Models? • How can we make NLP models robust?
  • 5. NLP Applications Are Everywhere • People communicate almost everything in language: web search, advertising, emails, customer service, language translation, virtual agents, medical reports • AI beats humans in Stanford reading comprehension test (CNET) • Google Search Now Reads at a Higher Level (WIRED)
  • 6. All these are NLP models which failed in production Guess What Is Common Between These News Headlines? Algorithms grading millions of students’ essays AI – Key to recruiting diverse workforce Microsoft unveiled Tay — a Twitter bot that… The Warren Buffett And Anne Hathaway Trade Mummy this is just dummy
  • 7. Building an NLP Model – Current Recipe • Take a representative dataset, split into train/validation/test set • Use the latest BERT/RoBERTa/ALBERT model • Or build my own fancy deep architecture • Prove model achieves > X% with the test dataset • Voila! We are ready to go for live deployment
  • 8. NLP Model Development Cycle Build Model > x% Test data performance Manual validation Collect Labelled dataset Failures 2/3rd of models fail after they have gone live Iteratively fixing issues ↑ time ↑ $$$
  • 9. Real World Data Can Be Highly Diverse! • Sentiment Analysis is a well-known NLP task • State of the Art (SOTA) models exceed 95% on benchmark datasets • But they perform badly on many real-world utterances with varying tone, formality, code-switched or transliterated text • Examples: I ❤️ this movie; I love this flick; I love this படம்; Movie Aacha Hai!; IMO, Gr8 movie!; Value for your money!; Luv this movee; Arnie Killed it! • One word can mean 100 different things, 100 words can mean the same thing!
  • 10. NLP Models Match Human Performance in GLUE, SuperGLUE Benchmarks • But do models really understand the task? • Credits: https://super.gluebenchmark.com/leaderboard
  • 11. Current NLP Paradigm • Models are based on associative learning (correlation) • Models learn statistical cues (superficial patterns) present in the training data that are predictive of the label • Examples of statistical cues: the presence of specific words in data instances mapping to a specific label; lexical overlap between two sentences (in sentence pair classification); the presence of linguistic phenomena such as negation • Statistical cues need not be reflective of the underlying task • This can lead to models with high "test set" performance, but with poor generalization • How well do NLP models generalize?
  • 12. Task #1 Sentiment Analysis • Sentence: "A great white shark bit me on the beach" • Is this sentence positive or negative? • Predicted as 'Highly Positive' in Google Cloud-NLP, in Microsoft's Text Analytics services, and even in the Stanford Deep Learning based Sentiment Model • Models get misled by the surface cue words "great" and "white"
  • 13. Task #2 Machine Reading Comprehension • Model – Bidirectional Attention (BIDAF)2 • Given Paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined… • Given Question: The number of new Huguenot colonists declined after what year? • Machine Answer: 1700 • [Figure: SQuAD 1.0 leaderboard, marking human performance and the logistic regression baseline; https://rajpurkar.github.io/SQuAD-explorer]
  • 14. Did the Machine Understand the Question? • Given Paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. • Given Question: The number of new Huguenot colonists declined after what year? • Machine Answer: 1700 • Let's add one more line to the paragraph: The number of old Acadian colonists declined after the year of 1675. • Given Paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675. • Given Question: The number of new Huguenot colonists declined after what year? • Machine Answer: 1675 • Surface cue: the model extracts the answer from the sentence most similar to the question
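The surface cue on this slide can be illustrated with a toy heuristic. The sketch below is not BIDAF; it is a hypothetical word-overlap scorer (the function names are made up for illustration) that picks the sentence sharing the most words with the question, which is enough to reproduce the failure on the slide's example.

```python
import re

def tokens(text):
    """Lowercase word types in the text (a crude stand-in for real tokenization)."""
    return set(re.findall(r"\w+", text.lower()))

def most_similar_sentence(question, sentences):
    """Return the sentence that shares the most word types with the question."""
    q = tokens(question)
    return max(sentences, key=lambda s: len(q & tokens(s)))

question = "The number of new Huguenot colonists declined after what year?"
original = ("The largest portion of the Huguenots to settle in the Cape arrived "
            "between 1688 and 1689…but quite a few arrived as late as 1700; "
            "thereafter, the numbers declined.")
distractor = "The number of old Acadian colonists declined after the year of 1675."

# The distractor overlaps far more with the question than the true evidence does.
picked = most_similar_sentence(question, [original, distractor])
print(picked)
```

A model that leans on this kind of overlap will read its answer out of the distractor sentence, even though the sentence is about a different group of colonists.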
  • 15. Task #3 Natural Language Inference (NLI) • Given a premise (P) and hypothesis (H), determine the relation as Entailment, Contradiction or Neutral • Example #1 • P: A man is standing in front of the statue on the beach • H: A man is sleeping on the beach • Label: Contradiction • Example #2 • P: A man is standing on roof • H: There is a man on roof • Label: Entailment • Example #3 • P: A man is standing on the roof • H: the man has a hammer in hand • Label: Neutral SoTA numbers for MNLI
  • 16. How robust are the NLI models?1.2
      Example | Actual | Predicted | Patterns/Cues
      P: The judge was paid by the actor / H: The actor was paid by the judge | Contradict | Entail | Lexical overlap
      P: Enthusiasm for Disney’s Lion King dwindles / H: Disney’s Lion King is no longer enthusiastically attended | Entail | Contradict | Negation Handling
      P: Child Services will receive 200000$ grant / H: A grant of 900000$ will go to Child services | Contradict | Entail | Numerical Reasoning
      P: It was still at night / H: The sun has not risen yet, and the moon was still shining | Entail | Contradict | World Knowledge
      P: “have her show the message” – said Paul / H: Paul told her to hide the message | Contradict | Entail | Antonym relation
      All the models drop in performance by > 48% on the above examples
  • 17. Task #4 Argument Reasoning Comprehension • Given a claim and a reason, the task is to predict the warrant which makes the claim valid • Example: Claim: Google is not a harmful monopoly; Reason: People can choose not to use Google; Warrant: Other search engines don’t redirect to Google; Alternative: All other search engines redirect to Google • BERT achieves 78%! • BERT picked up surface cues such as the words “NOT”, “do”, “is” in warrant sentences • After eliminating ‘label correlated’ shallow statistical cues, BERT performance drops to 50% • BERT is a strong learner, but depends on shallow surface cues in solving this task!
  • 18. NLP’s Clever Hans Effect • NLP Models often end up learning from shallow surface cues present in training data • Such cues may not correlate with the task being solved • Models can show strong test set performance! • Real world data may not have those shallow cues and performance can really drop Be skeptical when model achieves near-human performance on complex NLP tasks! Photo credits: https://en.wikipedia.org/wiki/Clever_Hans
  • 19. Improving Model Robustness • Spend time exploring your data • Understand why your model makes a given decision: use interpretability tools to visualize the reason for the model's decision, and check whether your model depends on shallow surface cues unrelated to the actual task • A number of model interpretability tools are available, e.g. LIME and AllenNLP Interpret
  • 20. Interpreting “Sentiment Analysis” with AllenNLP model depends on the word cue "great"
  • 22. Interpreting “Textual Entailment” with AllenNLP Model depends on lexical overlap between premise and hypothesis
  • 24. It Is Not Just Me, Even Andrej Says It! Andrej Karpathy, famous AI researcher, now director of AI at Tesla
  • 25. Identifying Misleading Cues in Dataset Using PMI • Models pick up spurious correlations or cues present in the dataset • Cues aligned to the task improve performance • Cues are detrimental to classifier performance if they are not representative of the actual task, or if they occur in the training data but not in real-world data • We can use Pointwise Mutual Information (PMI) to identify such cue words:
      \[ \mathrm{PMI}(\mathit{word}, \mathit{class}) = \log \frac{p(\mathit{word}, \mathit{class})}{p(\mathit{word}) \, p(\mathit{class})} \]
      • Examine the top-k PMI words to check whether they are task representative or not
      • Example from SQUAD dataset on PMI ranked word cues:
      Type of Question | Cue word | PMI score
      Why | because | 0.92
      When | did | 0.91
      When | Since | 0.89
      When | year | 0.76
      Which | people | 0.69
      Which | Into | 0.68
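The PMI formula above can be computed directly from co-occurrence counts. The following is a minimal sketch, assuming probabilities are estimated as fractions of examples (a word counts once per example); the function names and the toy data are hypothetical and are not taken from the SQUAD analysis on the slide.

```python
import math
from collections import Counter

def pmi_table(examples):
    """examples: iterable of (tokens, label). Returns {(word, label): PMI}.

    p(word) is the fraction of examples containing the word, p(label) the
    fraction of examples with that label, p(word, label) the fraction with both.
    """
    n = 0
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for toks, label in examples:
        n += 1
        label_counts[label] += 1
        for w in set(toks):  # count each word once per example
            word_counts[w] += 1
            joint_counts[(w, label)] += 1
    return {
        (w, c): math.log((joint / n) / ((word_counts[w] / n) * (label_counts[c] / n)))
        for (w, c), joint in joint_counts.items()
    }

def top_k_cues(examples, label, k=5):
    """Words with the highest PMI for the given label, best first."""
    scores = pmi_table(examples)
    ranked = sorted(((s, w) for (w, c), s in scores.items() if c == label), reverse=True)
    return [(w, s) for s, w in ranked[:k]]

# Toy data, invented for illustration only.
toy = [
    (["because", "the", "rain"], "why"),
    (["because", "of", "fog"], "why"),
    (["in", "the", "year"], "when"),
    (["since", "the", "year"], "when"),
]
print(top_k_cues(toy, "why", k=4))
```

Note that raw PMI favours rare words, so on real data a minimum frequency threshold is usually worth applying before inspecting the top-k list.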
  • 26. Do Named Entities in Data Carry any Unintended Biases? • Large NLP models are often trained on public data (Web/Wikipedia/News) • A person’s name can often be mentioned in negative contexts • Models learn a spurious negative association between that named entity and sentiment • A model should ideally be independent of the entities mentioned in the text, but it ends up being sensitive to them • Example from FB-pub dataset using toxicity model:
      Sentence | Sentiment
      I hate Justin Timberlake | -0.3
      I hate Kate Perry | -0.1
      I hate Taylor Swift | -0.4
      I hate Rihana | -0.6
  • 27. Perturbation Sensitivity Analysis (PSA) • Use PSA to measure unintended biases, i.e. whether the model has learnt any unwanted associations with named entities • For each sentence containing a named entity:
      1. Perturb the sentence by replacing the entity with other equivalent entities
      2. Measure the sensitivity of the model by running it on the perturbed sentences generated
      3. Identify any unintended biases associated with named entities
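The three PSA steps can be sketched as follows. This is an illustrative outline, not the original PSA implementation: `toy_score` is a made-up stand-in for a real sentiment or toxicity model, the `{entity}` template slot is an assumed convention, and the names "Alice" and "Bob" are placeholders.

```python
from statistics import pstdev

def perturbation_sensitivity(score_fn, templates, names):
    """For each template with an '{entity}' slot, score every name substitution
    and report how much the model's score moves across names."""
    report = []
    for template in templates:
        scores = {name: score_fn(template.format(entity=name)) for name in names}
        values = list(scores.values())
        report.append({
            "template": template,
            "scores": scores,
            "range": max(values) - min(values),  # 0 means the model ignores the name
            "std": pstdev(values),
        })
    return report

# Hypothetical biased scorer: a word-weight lookup that has "learnt"
# a negative association with one name.
WEIGHTS = {"hate": -0.5, "Alice": -0.2}

def toy_score(text):
    return sum(WEIGHTS.get(tok, 0.0) for tok in text.split())

report = perturbation_sensitivity(toy_score, ["I hate {entity}"], ["Alice", "Bob"])
print(report[0]["scores"], report[0]["range"])
```

A non-zero range for a template whose meaning should not depend on the name is a signal that the model has picked up an entity-specific association worth investigating.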
  • 28. Dataset Ablation Analysis • Do dataset ablation (in addition to model ablation)  Test your model with partial input  Test using random labels  Augment your training data with Counter Examples Every film student should see this thing just so they'll know the very definition of a perfect movie. Every film student should see this thing just so they'll know the very definition of a bad movie.
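Two of the ablations listed above, testing with partial input and testing with random labels, amount to simple dataset transforms. The sketch below assumes NLI-style examples stored as dictionaries with `premise`, `hypothesis` and `label` keys; that layout and the function names are assumptions made for illustration.

```python
import random

def hypothesis_only(examples):
    """Partial-input view: drop the premise, keep only hypothesis and label."""
    return [{"hypothesis": ex["hypothesis"], "label": ex["label"]} for ex in examples]

def shuffle_labels(examples, seed=0):
    """Random-label control: same inputs, labels permuted across examples."""
    labels = [ex["label"] for ex in examples]
    random.Random(seed).shuffle(labels)
    return [{**ex, "label": lab} for ex, lab in zip(examples, labels)]

nli = [
    {"premise": "A man is standing in front of the statue on the beach",
     "hypothesis": "A man is sleeping on the beach", "label": "Contradiction"},
    {"premise": "A man is standing on roof",
     "hypothesis": "There is a man on roof", "label": "Entailment"},
    {"premise": "A man is standing on the roof",
     "hypothesis": "the man has a hammer in hand", "label": "Neutral"},
]

partial = hypothesis_only(nli)
randomized = shuffle_labels(nli, seed=13)
```

If a model trained on `partial` still scores well above the majority-class baseline, the hypotheses alone are leaking the label; if a model trained on `randomized` scores above chance on held-out data, the evaluation setup itself deserves a second look.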
  • 29. Use Data Augmentation to Improve Model • Generate synthetic data to stress test your model • Augment data to break the spurious (pattern, label) correlation • Word/Phrase level methods: Synonym Replacement (using Wordnet/Embeddings); Random Word Swap/Insertion/Deletion; Phrase level paraphrase replacement (using PPDB) • Readily available python libraries: NLP Augment; Easy Data Augmentation for NLP • Using back translation for paraphrasing: translate from Language L1 to Language L2 and back to L1
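The word-level methods above are simple enough to sketch without any library. The code below is a hand-rolled illustration rather than the API of the libraries named on the slide; the synonym table is a hypothetical stand-in for a Wordnet or embedding lookup.

```python
import random

def random_swap(words, rng):
    """Swap two randomly chosen positions (no-op for fewer than two words)."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, rng, p=0.1):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

def synonym_replace(words, synonyms):
    """Replace every word that has an entry in the synonym table."""
    return [synonyms.get(w, w) for w in words]

rng = random.Random(0)  # fixed seed so augmented data is reproducible
sentence = "i love this movie".split()
synonyms = {"movie": "flick"}  # hypothetical table; a real one would come from Wordnet or embeddings

augmented = [
    synonym_replace(sentence, synonyms),
    random_swap(sentence, rng),
    random_deletion(sentence, rng, p=0.2),
]
print([" ".join(a) for a in augmented])
```

Random swaps and deletions can change the label of short sentences (deleting "not", for example), so augmented samples are worth spot-checking before they go into training data.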
  • 30. Preventing Models from Learning Surface Cues • How can we design models which do not learn spurious surface cues in dataset? • Train using an ensemble of classifiers • Train a naïve model which predicts based on surface cues only • Train a stronger model which focuses on other patterns in the data excluding the surface cues • Use only the stronger model for inference • The stronger model generalizes well to out of domain examples • Different types of ensembles can be used
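One way to realise the naive-plus-stronger ensemble is a product-of-experts style combination, in which the two models' log-probabilities are summed during training and only the stronger model is used at inference. The slide does not name a specific method, so this choice is an assumption; the sketch below uses plain Python lists of logits instead of a deep learning framework.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ensemble_loss(strong_logits, naive_logits, gold):
    """Training loss: cross-entropy of the combined distribution.

    The naive (surface-cue) model's log-probabilities are added to the
    stronger model's, so examples the naive model already solves yield a
    small loss and push the stronger model less.
    """
    combined = [s + n for s, n in zip(log_softmax(strong_logits), log_softmax(naive_logits))]
    return -log_softmax(combined)[gold]

def predict(strong_logits):
    """Inference uses the stronger model alone."""
    return max(range(len(strong_logits)), key=strong_logits.__getitem__)

# The naive model is confident and correct on this example (gold class 0),
# so the ensemble loss is far below the stronger model's own cross-entropy.
strong, naive, gold = [0.0, 0.0], [5.0, 0.0], 0
print(ensemble_loss(strong, naive, gold), -log_softmax(strong)[gold])
```

In a real setup the naive model is typically trained first and frozen, and gradients flow only through the stronger model's logits.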
  • 31. Takeaways • SOTA performance on a test dataset does not imply production readiness • Understand your model using interpretability tools • Test your model for diverse inputs • Do dataset ablation • Do targeted data augmentation to improve your model  Do not add data indiscriminately  Add needed data only! • Continue monitoring after deployment

Editor's Notes

  1. Adversarial Examples for Evaluating Reading Comprehension Systems - https://arxiv.org/abs/1707.07328. Model performance drops from 75% F1 to 35% F1. The model used is BIDAF (https://arxiv.org/abs/1611.01603), which was SOTA on SQUAD 1.0.
  2. Right for the wrong reasons - https://arxiv.org/abs/1902.01007; Stress test evaluation for natural language inference - https://arxiv.org/abs/1806.00692
  3. Clever Hans was a horse that was claimed to have performed arithmetic and other intellectual tasks, but was in fact picking up signals from its trainer for the correct answer.
  4. Perturbation sensitivity analysis: Let X be the set of sentences containing the entity type we want to perturb. Let N be the set of target entity names. E is the anchor entity in each sentence, which we replace with every entity in N. Measure the difference in classifier score and take the average. What about “He is like Gandhi” vs “He is like Hitler”? Partial input baselines: Hypothesis Only Baselines in Natural Language Inference - https://www.aclweb.org/anthology/S18-2023.pdf; How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks - Divyansh Kaushik, Zachary C. Lipton