This document discusses evaluating deep learning models, specifically for natural language processing tasks. It outlines the status quo of evaluation, which focuses on aggregate performance and can therefore miss failures on specific examples or distributions. The document then introduces Robustness Gym and SummVis as tools for more rigorous evaluation. Robustness Gym enables consolidated, fine-grained evaluation that better exposes model vulnerabilities and informs next steps, such as further analysis or patching. SummVis helps evaluate text summarization models while addressing issues like input contamination between pre-training and evaluation data. The goal of more robust evaluation is to gain a fuller picture of model performance and drive iterative improvement.
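To make the contrast between aggregate and fine-grained evaluation concrete, the sketch below scores a model on named slices (subpopulations) alongside the overall metric. It is a minimal, hypothetical illustration of slice-based evaluation: the example data, slice names, and `evaluate_slices` helper are assumptions for this sketch and do not reflect the actual Robustness Gym API.

```python
from collections import defaultdict

# Hypothetical examples: (text, gold_label, predicted_label, slice_name).
# Slice names are illustrative subpopulations, not real slice builders.
examples = [
    ("short review, clear sentiment", 1, 1, "short_inputs"),
    ("review with a negation ... not good", 0, 1, "negation"),
    ("review mentioning rare entity names", 1, 1, "rare_entities"),
    ("another negation case ... hardly great", 0, 1, "negation"),
    ("plain positive review", 1, 1, "short_inputs"),
]

def evaluate_slices(examples):
    """Report aggregate accuracy and per-slice accuracy."""
    per_slice = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    correct = 0
    for _, gold, pred, slice_name in examples:
        hit = int(gold == pred)
        correct += hit
        per_slice[slice_name][0] += hit
        per_slice[slice_name][1] += 1

    print(f"aggregate accuracy: {correct / len(examples):.2f}")
    for name, (c, n) in sorted(per_slice.items()):
        print(f"  slice {name!r}: {c / n:.2f} ({n} examples)")

evaluate_slices(examples)
# Aggregate accuracy looks passable (0.60), but the 'negation' slice
# scores 0.00 -- exactly the kind of vulnerability that aggregate
# metrics hide and fine-grained evaluation surfaces.
```

Reporting results per slice rather than only in aggregate is what lets a vulnerability like the negation failure above inform a concrete next step, such as targeted analysis or patching the model on that subpopulation.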