Beyond Accuracy Behavioral Testing of NLP Models with CheckList

ACL 2020 Online Reading Group
Beyond Accuracy: Behavioral Testing of NLP Models with CHECKLIST
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Paper introduction

2
Outline
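■ What is it?
・Proposes a method, based on ideas from software engineering, for evaluating NLP models systematically and from multiple angles.
Best Paper at ACL 2020.
■ What is better than prior work?
・Instead of relying on a single accuracy metric, it checks a broad range of linguistic capabilities,
and found about three times as many bugs as the existing approach in the same amount of time.
■ What is the key technique?
・Model behavior is measured on a matrix of linguistic capabilities × three test types.
■ How was it validated?
・Validation experiments were run on commercial and research models.
The authors also released a test-case creation tool on GitHub.
■ Is there any discussion?
・The commercial and research models turned out to perform poorly on some negation tasks and on discrimination (bias) issues.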

3
How far can we trust state-of-the-art models?
Let's apply this paper's method to our products.
SOTA models post high scores on some tasks, but...

4
How far can we trust state-of-the-art models?
There is no guarantee that they carry over to other tasks (shortcuts, etc.)

5
What is CheckList?
Columns: test types. Rows: test targets (capabilities). Just fill in the cells with failure rates.

                  MFT      INV      DIR
Vocab/POS         15.0%    16.2%    34.6%
Named Entities    ...      ...      ...
Negation          ...      ...      ...
...               ...      ...      ...
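
The matrix above can be held as a small table keyed by capability and test type. The following is a minimal sketch, not the CheckList tool's own API: the names `TEST_TYPES`, `CAPABILITIES`, `record`, `cell_text`, and `render` are hypothetical, and only the Vocab/POS failure rates come from the slide.

```python
# Hypothetical sketch of the CheckList matrix: rows are capabilities,
# columns are test types, and each cell holds a failure rate in percent.
TEST_TYPES = ["MFT", "INV", "DIR"]
CAPABILITIES = ["Vocab/POS", "Named Entities", "Negation"]

# None marks a cell that has not been tested yet.
matrix = {cap: {t: None for t in TEST_TYPES} for cap in CAPABILITIES}


def record(capability: str, test_type: str, fail_rate: float) -> None:
    """Fill in one cell of the matrix."""
    matrix[capability][test_type] = fail_rate


def cell_text(value) -> str:
    """Format one cell; untested cells are shown as '...'."""
    return "..." if value is None else f"{value:.1f}%"


def render() -> str:
    """Return the matrix as aligned plain text."""
    header = f"{'':16}" + "".join(f"{t:>8}" for t in TEST_TYPES)
    rows = [
        f"{cap:16}" + "".join(f"{cell_text(matrix[cap][t]):>8}" for t in TEST_TYPES)
        for cap in CAPABILITIES
    ]
    return "\n".join([header] + rows)


# Failure rates for the Vocab/POS row, as shown on the slide.
record("Vocab/POS", "MFT", 15.0)
record("Vocab/POS", "INV", 16.2)
record("Vocab/POS", "DIR", 34.6)
print(render())
```

Keeping untested cells as `None` makes the gaps visible, which fits the "just fill in the cells" idea on the slide.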

6
Test targets
・The model's linguistic capabilities
・Vocabulary and part of speech (Vocab/POS)
・Named entities (Named Entities)
・Negation (Negation)
...

7
Test types
・Minimum Functionality Test (MFT)
・Invariance Test (INV)
・Directional Expectation Test (DIR)

8
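Minimum Functionality Test (MFT)
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, ...]
Target and example: predict the sentiment of sentences that contain negation.

Test sentence (n=500)               Expected   Predicted  Verdict
I didn’t love the flight.           Negative   Positive   ×
I can’t say I recommend the food    Negative   Neutral    ×
...
Fail. Rate = 76.4%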
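
The MFT idea above (labeled inputs, with the failure rate being the share of wrong predictions) can be sketched as follows. This is an illustration under stated assumptions: `toy_sentiment` is a hypothetical keyword-counting stand-in that ignores negation, not any of the models from the paper, and `mft_failure_rate` is a hypothetical helper name. The two test sentences are the ones on the slide.

```python
from typing import Callable, List, Tuple

# Hypothetical word lists for the stand-in model.
POSITIVE_WORDS = {"love", "recommend", "great", "like"}
NEGATIVE_WORDS = {"hate", "awful", "lousy", "lame"}


def toy_sentiment(text: str) -> str:
    """Stand-in model: counts sentiment words and ignores negation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def mft_failure_rate(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Run labeled cases; a case fails when the prediction differs from the label."""
    failures = []
    for text, expected in cases:
        predicted = model(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return len(failures) / len(cases), failures


# The two negation examples from the slide, with their expected labels.
NEGATION_CASES = [
    ("I didn't love the flight.", "Negative"),
    ("I can't say I recommend the food", "Negative"),
]

rate, failures = mft_failure_rate(toy_sentiment, NEGATION_CASES)
print(f"Fail. Rate = {rate:.1%}")
for text, expected, predicted in failures:
    print(f"{text!r}: expected {expected}, got {predicted}")
```

Because the stand-in model never looks at "didn't" or "can't", both negated sentences are misclassified, which is the kind of failure this test is meant to expose.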

9
Invariance Test (INV)
[Figure: test matrix, INV column; capability rows Vocab/POS, Named Entities, Negation, ...]
Target and example: change a named entity and check whether the prediction changes.

Test sentence (n=500)                                                  Before     After      Verdict
Thank you we got on a different flight to Chicago → Dallas.            Positive   Neutral    ×
I can’t lose my luggage, moving to Brazil → Turkey soon                Neutral    Negative   ×
...
Fail. Rate = 20.8%
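
An invariance test needs no gold labels: it perturbs the input in a way that should not matter and fails if the prediction changes. Below is a minimal sketch under stated assumptions. `toy_sentiment` is a hypothetical stand-in whose `WEIGHTS` deliberately include spurious weights on two place names, `inv_failure_rate` is a hypothetical helper, and the third test case is invented for illustration. The first two cases are from the slide.

```python
from typing import Callable, List, Tuple

# Hypothetical weights; "dallas" and "turkey" are spurious on purpose.
WEIGHTS = {"thank": 1, "great": 1, "dallas": -1, "turkey": -1}


def toy_sentiment(text: str) -> str:
    """Stand-in model: sums word weights, including the spurious ones."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(WEIGHTS.get(w, 0) for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def inv_failure_rate(
    model: Callable[[str], str], cases: List[Tuple[str, str, str]]
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Swap one entity per case; a case fails when the predicted label changes."""
    failures = []
    for text, old, new in cases:
        before = model(text)
        after = model(text.replace(old, new))
        if before != after:
            failures.append((text, before, after))
    return len(failures) / len(cases), failures


INV_CASES = [
    # The two examples from the slide.
    ("Thank you we got on a different flight to Chicago.", "Chicago", "Dallas"),
    ("I can't lose my luggage, moving to Brazil soon", "Brazil", "Turkey"),
    # Hypothetical extra case that the stand-in model passes.
    ("The flight to Chicago was great.", "Chicago", "Boston"),
]

rate, failures = inv_failure_rate(toy_sentiment, INV_CASES)
print(f"Fail. Rate = {rate:.1%}")
```

The check compares the model only against itself, so it can run on unlabeled text.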

10
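Directional Expectation Test (DIR)
[Figure: test matrix, DIR column; capability rows Vocab/POS, Named Entities, Negation, ...]
Target and example: append a negative phrase to the end of the text and check how the prediction changes.

Test sentence (n=500), appended phrase in brackets      Before     After     Verdict
The service wasn’t great. [+ You are lame.]             Negative   Neutral   ×
Why won’t you help them?! [+ I dread you.]              Negative   Neutral   ×
...
Fail. Rate = 34.6%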
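
A directional expectation test checks that the prediction moves in an allowed direction: after appending a negative phrase, the sentiment should not become more positive. The sketch below applies that rule to recorded before/after labels rather than to a live model. It is a simplification under stated assumptions: comparing ordinal labels via `RANK` is my own choice for illustration, `violates_expectation` is a hypothetical helper, and the third row is invented. The first two rows are from the slide.

```python
# Ordinal ranking of labels, assumed for this sketch.
RANK = {"Negative": 0, "Neutral": 1, "Positive": 2}


def violates_expectation(before: str, after: str) -> bool:
    """Fail if the label became more positive after adding a negative phrase."""
    return RANK[after] > RANK[before]


# (original text, appended phrase, label before, label after)
DIR_RESULTS = [
    # The two rows from the slide.
    ("The service wasn't great.", "You are lame.", "Negative", "Neutral"),
    ("Why won't you help them?!", "I dread you.", "Negative", "Neutral"),
    # Hypothetical row that satisfies the expectation.
    ("The seat was fine.", "I hate it.", "Neutral", "Negative"),
]

failed = [row for row in DIR_RESULTS if violates_expectation(row[2], row[3])]
rate = len(failed) / len(DIR_RESULTS)
print(f"Fail. Rate = {rate:.1%}")
```

A label that stays the same or becomes more negative passes; only a move toward Positive counts as a failure.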

11
Case Study – Sentiment Analysis
■ Commercial models (each vendor's sentiment analysis API)
- Google
- Microsoft
- Amazon
■ Research models
- BERT (trained on SST)
- RoBERTa (trained on SST)

12
Commercial vs. Research Models ① - Temporal Change
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, Temporal, ...]
Test sentence (n=500): I used to hate this airline, although now I like it
Expected: Positive

Test results         Model       Fail. Rate (%)
Commercial models    Microsoft   41.0
                     Google      36.6
                     Amazon      42.2
Research models      BERT        18.8
                     RoBERTa     11.0

13
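Commercial vs. Research Models ② - Double Negation
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, Temporal, ...]
Test sentence: It wasn’t a lousy customer service.
Expected: Positive / Neutral

Test results         Model       Fail. Rate (%)
Commercial models    Microsoft   18.8
                     Google      54.2
                     Amazon      29.4
Research models      BERT        13.2
                     RoBERTa     2.6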

14
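Commercial vs. Research Models ③ - Negation at the End of the Sentence
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, Temporal, ...]
Test sentence: I thought the plane would be awful, but it wasn’t.
Expected: Positive / Neutral

Test results         Model       Fail. Rate (%)
Commercial models    Microsoft   100.0
                     Google      90.4
                     Amazon      100.0
Research models      BERT        84.8
                     RoBERTa     7.2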

15
Case Study – Question Answering
■ SOTA model
- BERT-large (93.1 F1, surpasses human performance)

16
Humans vs. SOTA Model ①
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, Taxonomy, ...]

Test (n=500)                              Expected   BERT   Fail. Rate (%)
C: There is a large pink bed
Q: What size is the bed                   large      pink   82.4
C: John is more optimistic than Mark
Q: Who is more pessimistic?               Mark       John   100

17
Humans vs. SOTA Model ②
[Figure: test matrix, MFT column; capability rows Vocab/POS, Named Entities, Negation, Taxonomy, ...]

Test (n=500)                                 Expected   BERT    Fail. Rate (%)
C: Aaron is not a writer, Rebecca is.
Q: Who is a writer?                          Rebecca    Aaron   67.5
C: Aaron is an editor, Mark is an actor.
Q: Who is not an actor?                      Aaron      Mark    100
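
Test cases like the first negation example above follow a fixed pattern, so many of them can be generated from one template. The sketch below is an illustration under stated assumptions and does not use the CheckList tool's API: `with_article`, `negation_in_context_cases`, and `first_name_baseline` are hypothetical names, and the baseline is a deliberately naive stand-in, not BERT. The names and professions are taken from the slides.

```python
from itertools import permutations

NAMES = ["Aaron", "Rebecca", "Mark"]
PROFESSIONS = ["writer", "editor", "actor"]


def with_article(noun: str) -> str:
    """Prefix 'a' or 'an' based on the first letter (enough for these nouns)."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun


def negation_in_context_cases():
    """Expand the template into (context, question, expected answer) triples."""
    cases = []
    for first, second in permutations(NAMES, 2):
        for job in PROFESSIONS:
            role = with_article(job)
            context = f"{first} is not {role}, {second} is."
            question = f"Who is {role}?"
            cases.append((context, question, second))
    return cases


def first_name_baseline(context: str, question: str) -> str:
    """Hypothetical stand-in model: always answers with the first word of the context."""
    return context.split()[0]


cases = negation_in_context_cases()
wrong = [c for c in cases if first_name_baseline(c[0], c[1]) != c[2]]
print(f"{len(cases)} cases, Fail. Rate = {len(wrong) / len(cases):.1%}")
```

Three names and three professions already give 18 cases, which shows how a single template scales the number of cases per test.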

18
Creating Test Cases with the Tool
Tool: https://github.com/marcotcr/checklist

19
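Creating Test Cases with the Tool
Example test session: Testing BERT on QQP in 2h

                    Without CheckList   With CheckList   With CheckList + Tool
Number of tests     5.8                 10.2             13.5
Cases per test      7.3                 5.0              198.0
Bugs found          2.2                 5.5              6.2

With CheckList + Tool, about three times as many bugs can be found (6.2 vs. 2.2).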
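
The "about three times" figure can be checked directly from the table. This is a quick arithmetic sketch; the dictionary keys are hypothetical labels for the three table columns.

```python
# Average bugs found and cases per test, copied from the table above.
bugs_found = {"without": 2.2, "with_checklist": 5.5, "with_checklist_tool": 6.2}
cases_per_test = {"without": 7.3, "with_checklist": 5.0, "with_checklist_tool": 198.0}

bug_ratio_tool = bugs_found["with_checklist_tool"] / bugs_found["without"]
bug_ratio_checklist = bugs_found["with_checklist"] / bugs_found["without"]
case_ratio_tool = cases_per_test["with_checklist_tool"] / cases_per_test["without"]

print(f"Bugs, CheckList + Tool vs. none: {bug_ratio_tool:.1f}x")
print(f"Bugs, CheckList only vs. none:   {bug_ratio_checklist:.1f}x")
print(f"Cases per test, Tool vs. none:   {case_ratio_tool:.1f}x")
```

The bug ratio comes to roughly 2.8, which the slide rounds to three.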

20
Why did it win Best Paper?
- We need to solve tasks, not datasets.
- There are inherent limitations in current models and data.
- We need to move away from classification tasks.
- We need to learn to handle ambiguity and uncertainty.
Source: Highlights of ACL 2020
https://medium.com/analytics-vidhya/highlights-of-acl-2020-4ef9f27a4f0c

21
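Summary
・NLP models also need testing, just as in software development
・What to test
- The model's linguistic capabilities
・How to test
- Minimum Functionality Test (MFT)
- Invariance Test (INV)
- Directional Expectation Test (DIR)
・A tool for creating test cases has also been released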
