Evaluating Deep Learning Models
Applications to NLP
Nazneen Rajani
Outline
Part 1:
Evaluation status quo
Goal of evaluation is to inform next action:
● further analysis or
● model patching
Robustness Gym (Goel et al., 2021)
Part 2:
Caveats with evaluating PLMs
SummVis (Vig et al., 2021)
ML Pipeline
Collect data → Train model → Evaluate → Deploy
Status Quo
ML models seemingly perform well:
● when data is i.i.d.
● when evaluation measures aggregate performance
● but performance deteriorates on tail data
● and models fail to generalize to out-of-distribution (OOD) data
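To make the gap between aggregate and tail performance concrete, here is a minimal sketch in plain Python. The group names ("head", "tail") and all counts are invented for illustration; they are not results from the talk.

```python
from collections import defaultdict

# Toy predictions as (group, gold_label, predicted_label).
# All values are invented for illustration only.
examples = (
    [("head", 1, 1)] * 90    # frequent, in-distribution cases: all correct
    + [("tail", 1, 1)] * 4   # rare cases: 4 correct ...
    + [("tail", 1, 0)] * 6   # ... and 6 wrong
)

def accuracy(rows):
    """Fraction of rows whose prediction matches the gold label."""
    return sum(gold == pred for _, gold, pred in rows) / len(rows)

def accuracy_by_group(rows):
    """Accuracy computed separately for each group name."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {name: accuracy(members) for name, members in groups.items()}

aggregate = accuracy(examples)
per_group = accuracy_by_group(examples)

print(f"aggregate: {aggregate:.2f}")
for name, acc in per_group.items():
    print(f"{name}: {acc:.2f}")
```

In this toy setup the aggregate number looks healthy because the head group dominates the average, while the tail group is wrong more often than it is right.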
Evaluation Landscape in NLP
Aggregate evaluation
BERT-base (Devlin et al.) model card
Adversarial evaluation
Round 2 of ANLI (Nie et al.)
Premise: Toolbox Murders is a 2004 horror film directed by Tobe Hooper,
and written by Jace Anderson and Adam Gierasch. It is a remake of the 1978
film of the same name and was produced by the same people behind the
original. The film centralizes on the occupants of an apartment who are
stalked and murdered by a masked killer.
Hypothesis: Toolbox Murders is both 41 years old and 15 years old.
Gold label: Entailment
Predicted label: Contradiction
Existing Evaluation Landscape
A slew of work on evaluation in NLP: tools and research papers
Goals of Evaluation
Next action for user:
1. further evaluation/analysis, or
2. model patching for robustness
Robustness Gym
Toolkit for unified evaluation and reporting
Consolidated Reporting
Fine-grained evaluations
● Subpopulations
● Transformations
● Evaluation Sets
● Attacks
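The first two idioms in this list can be sketched in a few lines of plain Python: a subpopulation is a filter over the evaluation data, and a transformation is a rewrite of each input that keeps the gold label. This is not the Robustness Gym API. `toy_model`, the helper names, and the data are hypothetical stand-ins, and evaluation sets and attacks are left out.

```python
def toy_model(text):
    """Hypothetical stand-in classifier: 'negative' iff the token 'not' appears."""
    return "negative" if "not" in text.split() else "positive"

def accuracy(model, data):
    """Fraction of (text, gold) pairs the model labels correctly."""
    return sum(model(text) == gold for text, gold in data) / len(data)

def subpopulation(data, keep):
    """Filter: keep only examples whose text satisfies the predicate."""
    return [(text, gold) for text, gold in data if keep(text)]

def transformation(data, edit):
    """Map: rewrite each text and leave the gold label unchanged."""
    return [(edit(text), gold) for text, gold in data]

# Invented evaluation data.
data = [
    ("the food was good", "positive"),
    ("the food was not good", "negative"),
    ("service was not slow and staff were kind", "positive"),
    ("I would not return", "negative"),
]

slices = {
    "all": data,
    "subpopulation: contains 'not'": subpopulation(data, lambda t: "not" in t.split()),
    "transformation: upper-cased": transformation(data, str.upper),
}

# One consolidated report: a single accuracy per named slice.
report = {name: accuracy(toy_model, rows) for name, rows in slices.items()}
for name, acc in report.items():
    print(f"{name:32s} {acc:.2f}")
```

Keeping every slice in one dictionary is the design point: the same model and metric run over every slice, so the numbers sit side by side in one report.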
RG iterative evaluation: the 3Cs
1. Contemplate
2. Create
3. Consolidate
How does RG support evaluation goals?
Example 1: Natural Language Inference
Classify a pair of sentences as being in a relation of entailment, neutral, or contradiction
Premise: If it were not for COVID, we would all be at the conference.
Hypothesis: We are not at the conference.
Label: Entailment
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Next action: further analysis
Hypothesis: Possible spurious correlation between negation and contradiction class
Action: Evaluate on counterfactually augmented eval sets
Observation: large performance drops on class-balanced dataset
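A data-side check related to this hypothesis is to compare label distributions for hypotheses with and without negation words. The sketch below is a simpler diagnostic than the counterfactually augmented evaluation named in the action above. The negation word list and the toy pairs are assumptions made for illustration, not SNLI data.

```python
from collections import Counter

# Assumed, non-exhaustive list of negation markers.
NEGATION_WORDS = {"no", "not", "never", "nobody", "nothing", "n't"}

def has_negation(sentence):
    """True if any whitespace token (with "n't" split off) is a negation marker."""
    tokens = sentence.lower().replace("n't", " n't").split()
    return any(token in NEGATION_WORDS for token in tokens)

def label_distribution(pairs):
    """Relative frequency of each label among (hypothesis, label) pairs."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Invented (hypothesis, gold label) pairs.
toy = [
    ("A man is not sleeping", "contradiction"),
    ("Nobody is outside", "contradiction"),
    ("The dog is not indoors", "entailment"),
    ("The woman never smiles", "contradiction"),
    ("A child is playing", "entailment"),
    ("Two people are talking", "neutral"),
    ("The man is asleep", "contradiction"),
    ("A band performs on stage", "entailment"),
]

with_negation = [pair for pair in toy if has_negation(pair[0])]
without_negation = [pair for pair in toy if not has_negation(pair[0])]

dist_with = label_distribution(with_negation)
dist_without = label_distribution(without_negation)
print("with negation:   ", dist_with)
print("without negation:", dist_without)
```

If the contradiction share is much higher in the negation slice, a model can score well by keying on negation alone. That shortcut is what a class-balanced or counterfactual set is meant to expose.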
Example 2: Named Entity Linking
Map mentions of entities to entries in a knowledge base (KB) such as Wikipedia
Example: When did England last win the football world cup?
● "England" → England National Football Team
● "football world cup" → FIFA World Cup
Evaluate NEL models using RG
Create subpopulations of interest:
● Popular entities
● Tail entities
● Topics such as soccer, winter sports, etc.
Evaluate models:
● Academic: Bootleg (Orr et al., '20), REL (Hulst et al., '20)
● Commercial: Microsoft, Google, Amazon
● Heuristic: Popularity
Dataset: AIDA (Hoffart et al., 2011)
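A popularity heuristic can be sketched as follows, assuming it means "link each mention to its most popular candidate entity, ignoring context". The candidate table, popularity counts, and function name are hypothetical and are not taken from AIDA or from any of the systems above.

```python
# Hypothetical candidate table: mention string -> {entity title: popularity count}.
CANDIDATES = {
    "england": {"England": 900, "England national football team": 300},
    "world cup": {"FIFA World Cup": 800, "Cricket World Cup": 250},
}

def link_by_popularity(mention):
    """Return the most popular candidate for a mention, or None if it is unknown.

    The surrounding sentence is deliberately ignored.
    """
    candidates = CANDIDATES.get(mention.lower())
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Mentions from "When did England last win the football world cup?"
mentions = ["England", "world cup"]
gold = ["England national football team", "FIFA World Cup"]

predictions = [link_by_popularity(m) for m in mentions]
correct = sum(p == g for p, g in zip(predictions, gold))
print(predictions, f"{correct}/{len(gold)} correct")
```

With these toy counts the heuristic gets the popular sense right and misses the context-dependent one ("England" as the football team), which is why slicing by popular versus tail entities is informative.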
Results on the AIDA-b dataset (F1)
● Popularity heuristic outperforms all commercial systems
● Commercial systems are capitalization sensitive
● Bootleg is robust across sports
Next action: model patching
Assuming sports application downstream
Action: Patch model using weak labeling (Goel et al., NAACL '21 industry track)
Observation: 25% improvement in sports-related errors
Figure labels: best off-the-shelf system; fix poor performance
ML iterative pipeline
Collect data → Train model → Evaluate → Deploy
Model patching
Analysis
Goldilocks spectrum for Evaluation
Between the two extremes of aggregate evaluations and adversarial attacks:
● Subpopulations
● Distribution shift
● Transformations
● Diagnostic sets
Caveats with evaluating PLMs
Input contamination
● Overlap between pre-training and evaluation data
● Reasons:
○ Some task datasets are crawled from the web, e.g., news summarization
○ Datasets (with or without labels) are uploaded to the web
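One common way to estimate this kind of overlap is to count how many word n-grams of an evaluation document also occur in the pre-training corpus. The sketch below assumes whitespace tokenization and a small n chosen for the toy data. The corpus, documents, and function names are invented, and this is not the procedure used by SummVis or by the works cited in this talk.

```python
def ngrams(text, n):
    """Set of lower-cased word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_text, pretrain_ngrams, n):
    """Share of the evaluation text's n-grams that also occur in pre-training data."""
    grams = ngrams(eval_text, n)
    if not grams:
        return 0.0
    return len(grams & pretrain_ngrams) / len(grams)

# Invented stand-in for a pre-training corpus.
pretraining_corpus = [
    "the city council approved the new budget on monday",
    "heavy rain is expected across the region this week",
]

N = 4  # toy value; real checks typically use longer n-grams
pretrain_ngrams = set().union(*(ngrams(doc, N) for doc in pretraining_corpus))

# Invented evaluation documents.
eval_docs = {
    "seen": "the city council approved the new budget on monday",
    "unseen": "researchers released a benchmark for entity linking today",
}

scores = {name: overlap_fraction(text, pretrain_ngrams, N) for name, text in eval_docs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A score near 1.0 flags an evaluation document that was probably seen verbatim during pre-training. Reporting metrics separately for high-overlap and low-overlap documents is one way to keep contamination from inflating the headline number.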
SummVis
Toolkit for interactive visual analysis for text summarization
Consolidated View
Multi-dimensional fine-grained analysis
Caveats with evaluating PLMs
Input contamination is a problem
Other works have also identified the problem of contamination (Brown et al., '20; Dodge et al., '21)
How do we evaluate models with known or possible input contamination?
Takeaways
Goal of evaluation is to inform next action
Evaluation is an iterative process
Disaggregation helps expose model vulnerabilities
Challenges associated with evaluating PLMs can obscure model vulnerabilities
Other aspects not discussed:
● Evaluation metrics (GEM by Gehrmann et al., '21; GENIE by Khashabi et al., '21)
● Evaluation datasets / task design (Rogers, '21; Bowman and Dahl, '21)
Thank you for listening
Papers:
1. Robustness Gym: Unifying the NLP Evaluation Landscape (NAACL '21 demo)
2. Goodwill Hunting: Analyzing and Repurposing Off-the-Shelf Named Entity Linking Systems (NAACL '21 industry track)
3. SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization (ACL '21 demo)
Collaborators:
● Jesse Vig (Salesforce)
● Karan Goel (Stanford)
● Chris Ré (Stanford)
● Mohit Bansal (UNC)
● Wojciech Kryscinski (Salesforce)
● Silvio Savarese (Salesforce)

ICML UDL Evaluating Deep Learning Models Applications to NLP Nazneen Rajani.pdf
