XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Microsoft Research AI
SUBHABRATA MUKHERJEE
AHMED H. AWADALLAH
ACL 2020

Language Model Pre-Training
• Representation transfer: learned via self-supervision (language modeling objectives)
• Fine-tune on the target task with labeled data (e.g., cross-entropy loss objective)
• Target tasks: Paraphrase, Q&A, Sentiment

Pre-trained Model Complexity: Increase in model parameters
• 2017: CoVe (30M params)
• 2018: ELMo (~90M params), OpenAI GPT (~100M params), BERT (340M params)
• 2019: OpenAI GPT-2 (1.5B params), Nvidia Megatron (8B params), Google T5 (11B params)
• 2020: Turing-NLG (17B params), OpenAI GPT-3 (175B params)

Knowledge Distillation to Mimic Teacher’s Output
(Ba and Caruana, 2014; Romero et al., 2014; Hinton, Vinyals and Dean, 2015)
Teacher: Huge pre-trained model
Student: Shallow model
Transfer Set: Unlabeled data
• Logits
  • Unnormalized log-probability scores over classes (before softmax)
  • Capture uncertainty better than hard labels
• Internal Representations
  • Hint-based guidance from Teacher
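To illustrate why logits carry more information than hard labels, here is a minimal sketch using NumPy. The class order and the logit values are hypothetical, chosen only for illustration; they do not come from the talk.

```python
import numpy as np

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical teacher logits for one token over the classes [PER, ORG, LOC, O].
teacher_logits = np.array([2.0, 1.5, -1.0, -3.0])

# Hard label: only the winning class survives.
hard_label = np.eye(len(teacher_logits))[teacher_logits.argmax()]

# Soft target: the teacher's near-tie between PER and ORG is preserved.
soft_target = softmax(teacher_logits)
```

The hard label says only "PER", while the soft target also tells the student that ORG was a close second, which is the uncertainty the slide refers to.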

Knowledge Distillation
Components: Teacher, Student, Labeled Data, Unlabeled Data, Augmented Data
(1) LM pre-training (optional)
(2) Task-specific training w. CE loss on labeled data
(3) Teacher generates logits and representations on unlabeled data (augmented data)
(4) Task-specific soft training of the student

TASK: Multi-lingual Named Entity Recognition
Identify PER, ORG and LOC from multiple languages jointly
Examples (tokens, followed by their tags):
Jakers ! Aventurile lui Piggley Winks
B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG
(Romanian title: "Jakers! The Adventures of Piggley Winks")
R.H. Saunders ( St. Lawrence River ) ( 968 MW )
B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O
Широко распространён в Австралии , где его выращивают на срезку .
O O O B-LOC O O O O O O O
(Russian: "Widely distributed in Australia, where it is grown for cutting.")
We adopt Multi-lingual BERT pre-trained on 104 languages in Wikipedia as the teacher
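The tag sequences in the examples can be decoded into entity spans. A minimal sketch follows, assuming whitespace-separated tokens and IOB2 tags; the function name `iob2_to_spans` is hypothetical and is not part of the XtremeDistil code.

```python
def iob2_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and IOB2 tag lists."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = "R.H. Saunders ( St. Lawrence River ) ( 968 MW )".split()
tags = "B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O".split()
spans = iob2_to_spans(tokens, tags)
# spans == [("ORG", "R.H. Saunders"), ("ORG", "St. Lawrence River")]
```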

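Knowledge Distillation (1/2)
Student layers (input to output): Shared Word Embedding Layer, Shared RNN Layer, Shared Feedforward Layer
Input: $\{x, y\} \in D_l$
Student representation: $z^{s}(x_k)$
Student prediction: $p^{s}(x_k) = \mathrm{softmax}(x_k)$
Cross-entropy loss (labeled):
$$\text{multitask loss} = -\sum_{x \in D_l} \sum_{x_k \in x} y_k \log\left(p^{s}(x_k)\right)$$
T: teacher
s: student
$D_l$: labeled data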
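The cross-entropy loss on labeled data can be sketched for a single sequence as follows. This is an illustration with NumPy under stated assumptions, not the authors' implementation: the scores, the number of classes and the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax over the last axis (numerically stable)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy_loss(student_scores, gold_tags):
    """Sum over tokens x_k of -log p^s(x_k)[y_k] for one labeled sequence.

    student_scores: (num_tokens, num_classes) pre-softmax student scores.
    gold_tags: (num_tokens,) integer class ids y_k.
    """
    probs = softmax(student_scores)
    picked = probs[np.arange(len(gold_tags)), gold_tags]
    return float(-np.log(picked).sum())

# Hypothetical scores for a 3-token sequence over 4 tag classes.
scores = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
gold = np.array([0, 1, 3])
loss = cross_entropy_loss(scores, gold)
```

The third token has uniform scores, so it alone contributes log(4) to the loss; the two confident, correct predictions contribute much less.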

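Knowledge Distillation (2/2)
Student layers: Shared BiLSTM Layer, Shared feed-forward layer
Inputs: $\{x_k, y_k\} \in D_l$ (labeled) and $\{x_k, \mathrm{logit}^{T}(y_k), z^{T}(x_k)\} \in D_u$ (unlabeled)
Student representation: $z^{s}(x_k)$
Student outputs: $p^{s}(x) = \mathrm{softmax}(\cdot)$ and $p^{s}(x_k) = \mathrm{linear}(\cdot)$
$$\text{multitask loss} = \text{CE\_loss} + \sum_{x \in D_u} \sum_{x_k \in x} \left[ \left(\mathrm{logit}^{T}(y_k) - p^{s}(x_k)\right)^{2} + \mathrm{KLD}\left(z^{T}(x_k) \,\|\, z^{s}(x_k)\right) \right]$$
• First term: cross-entropy loss (labeled)
• Second term: mean-squared logit loss (unlabeled)
• Third term: representation loss (unlabeled)
T: teacher
s: student
$D_u$: unlabeled data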
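The three loss terms can be combined as in the following sketch for one unlabeled sequence. It rests on assumptions not stated on the slide: the representations are softmax-normalized so that the KL divergence is defined, teacher and student representations share the same dimension, and all names and values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (numerically stable)."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def logit_loss(teacher_logits, student_outputs):
    """Squared difference between teacher logits and student linear outputs, summed over tokens."""
    return float(((teacher_logits - student_outputs) ** 2).sum())

def representation_loss(teacher_repr, student_repr, eps=1e-12):
    """KLD(z^T || z^s) summed over tokens.

    Assumption of this sketch: representations are softmax-normalized first
    so that each row is a probability vector.
    """
    p = np.clip(softmax(teacher_repr), eps, 1.0)
    q = np.clip(softmax(student_repr), eps, 1.0)
    return float((p * np.log(p / q)).sum())

def multitask_loss(ce_loss, teacher_logits, student_outputs, teacher_repr, student_repr):
    """CE loss (labeled) + logit loss (unlabeled) + representation loss (unlabeled)."""
    return (ce_loss
            + logit_loss(teacher_logits, student_outputs)
            + representation_loss(teacher_repr, student_repr))

# Hypothetical teacher outputs for a 2-token unlabeled sequence.
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]])
teacher_repr = np.array([[0.2, -0.4, 1.0, 0.3], [1.5, 0.0, -0.7, 0.1]])
```

When the student reproduces the teacher's logits and representations exactly, both unlabeled terms vanish and only the cross-entropy term remains.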

Multi-Stage Distillation with Labeled (L) and Unlabeled (U) Data
Joint
• α CEloss(L) + β Logitloss(U) + γ Reploss(U)
2-Stage
• Stage 1: Reploss(U)
• Stage 2: α CEloss(L) + β Logitloss(U)
3-Stage
• Stage 1: Reploss(U)
• Stage 2: Logitloss(U)
• Stage 3: CEloss(L)

Hyper-parameter free Loss Combination
3-Stage
• Stage 1: Reploss(U)
• Stage 2: Logitloss(U)
• Stage 3: CEloss(L)
Stages move from general representation to task-specific tuning
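The difference between the joint, 2-stage and 3-stage schemes can be sketched as a choice of objective per stage. The function `stage_loss` and its arguments are hypothetical names for illustration; only the 3-stage scheme avoids the mixing weights α, β and γ.

```python
def stage_loss(scheme, stage, ce, logit, rep, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the objective optimized at a 1-based stage of the given scheme.

    ce, logit, rep are the current values of CEloss(L), Logitloss(U), Reploss(U).
    """
    if scheme == "joint":
        objectives = [alpha * ce + beta * logit + gamma * rep]
    elif scheme == "2-stage":
        objectives = [rep, alpha * ce + beta * logit]
    elif scheme == "3-stage":
        objectives = [rep, logit, ce]  # one loss per stage, no weights to tune
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return objectives[stage - 1]

# 3-stage distillation: each stage optimizes a single loss.
for stage in (1, 2, 3):
    print(stage, stage_loss("3-stage", stage, ce=1.0, logit=2.0, rep=3.0))
```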

Multi-Stage Distillation: Caveats
• Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999)
  • Model forgets information from earlier stages / tasks
• General solution: Update model parameters progressively*
  1. Progressively in time (freezing)
  2. Progressively in intensity (lower learning rates)
  3. Progressively vs. pre-trained model (regularization)
• Our approach: Multi-stage distillation of student with gradual unfreezing and cosine learning rate schedule [1 + 2 + 3]
*https://ruder.io/state-of-transfer-learning-in-nlp/
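Gradual unfreezing and a cosine learning rate schedule can be sketched in plain Python as below. This is an illustration under assumptions: the layer names are hypothetical, unfreezing is assumed to proceed from the top layer down (the slide does not specify the order), and the schedule is the standard cosine annealing formula rather than the authors' exact configuration.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def trainable_layers(layers, num_unfrozen):
    """Gradual unfreezing: only the top `num_unfrozen` layers receive updates."""
    num_unfrozen = min(max(num_unfrozen, 0), len(layers))
    return layers[len(layers) - num_unfrozen:]

# Hypothetical student layers, ordered bottom (input) to top (output).
student_layers = ["word_embedding", "bilstm", "feed_forward", "softmax"]

# Within one distillation stage, unfreeze one more layer per round
# and restart the cosine schedule for that round.
for round_index in range(1, len(student_layers) + 1):
    active = trainable_layers(student_layers, round_index)
    first_lr = cosine_lr(0, total_steps=100, base_lr=1e-3)
    print(round_index, active, first_lr)
```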

Massive Multi-lingual NER F1 over 41 Languages
Dataset: WikiANN [1, 2] with 705K train, 329K dev and 329K test sequences in IOB2 format with PER, ORG, LOC tags
1. Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. ACL 2017. Crosslingual name tagging and linking for 282 languages.
2. Afshin Rahimi, Yuan Li, and Trevor Cohn. ACL 2019. Massively multilingual transfer for NER.
Refer to paper for text classification datasets and experiments

XtremeDistil Summary: Distilled mBERT for NER
• Language-agnostic: single model for all languages
• > 35x parameter compression
• > 51x latency speedup for batch inference
• 95% of mBERT F1 for NER over 41 languages

Parameter Compression (x) vs. F1 against mBERT
[Figure: Parameter Compression (0 to 40) plotted against F1 Measure (84 to 89) for student configurations, with a reference line at 95% of mBERT F1. (E,H) denotes word embedding dim. and BiLSTM hidden states.]

Inference Speedup (x) vs. F1 against mBERT
[Figure: Inference Speedup (0 to 80) plotted against F1 Measure (84 to 89) for student configurations, with a reference line at 95% of mBERT F1. (E,H) denotes word embedding dim. and BiLSTM hidden states.]
Config: batch_size=32 on single P100 GPU

XtremeDistil NER F1 (41 Lang.): Model Features vs. Transfer Data
• Model features: Performance increases with internal representations and learning schedule
• Unlabeled transfer data: Performance increases with more unlabeled transfer data till saturation

Which teacher layer to distil from?
Higher layers provide more task-specific knowledge but are harder for a shallow student to distil

Which (multi-lingual) word embeddings to initialize with?
- Random initialization works well
- Singular Value Decomposition (SVD) for dimensionality reduction over fine-tuned mBERT word embeddings yields compression + better performance
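Dimensionality reduction of word embeddings with SVD can be sketched with NumPy as follows. The embedding matrix here is random stand-in data, and the vocabulary size and target dimension of 300 are hypothetical; 768 is the hidden size of mBERT. The function name `svd_reduce` is also an assumption of this sketch.

```python
import numpy as np

def svd_reduce(embeddings, target_dim):
    """Project (vocab_size, dim) embeddings onto their top `target_dim` singular directions."""
    u, s, _ = np.linalg.svd(embeddings, full_matrices=False)
    # Scale the left singular vectors by the singular values to keep magnitudes.
    return u[:, :target_dim] * s[:target_dim]

# Random stand-in for fine-tuned teacher word embeddings (1000 words, 768 dims).
rng = np.random.default_rng(0)
teacher_embeddings = rng.normal(size=(1000, 768))

# Smaller embeddings that could initialize the student's embedding layer.
student_embeddings = svd_reduce(teacher_embeddings, target_dim=300)
print(student_embeddings.shape)  # (1000, 300)
```

Keeping the top singular directions preserves as much of the pairwise dot-product structure between words as any projection of that rank can, which is one reason SVD is a common choice for this kind of compression.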

Takeaways
• Useful distillation aspects
  • Hidden representations from intermediate teacher layers
  • Stagewise optimization + gradual unfreezing + learning rate scheduler
• Key trade-off
  • Student architecture for low-latency configurations vs. F1 score
  • Parameter compression vs. latency speedup vs. F1 score
• 35x parameter compression with 51x latency speedup retaining 95% mBERT F1 score for NER over 41 languages
• XtremeDistil can be easily extended to other tasks and languages

Making SOTA Affordable in Practice
• Inference cost saving potential for a single hypothetical scenario
• 25x increase in inference capacity
• Distilled model achieves parity with full model
• Opportunity for application in Enterprise Search, Communication and Productivity scenarios
• Consider a hypothetical scenario: 100 MM users, 10 queries / user, 100 ms latency / query for BERT
• Inference cost saving potential with XtremeDistil
  - 35x compression with 51x latency speedup
  - 95% performance match with $15 MM savings per year for a single scenario
Scenario parameters:
Active users: 100 Million
Queries by an average user per productive day: 10
Productive days in a year: 260
Cost of 1 X P100 GPU per hour: $2.07
Query processing time for a BERT-based model: 100 ms
Query processing time after distillation: 2 ms
Savings: $15 MM per year
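The savings figure can be reproduced from the scenario parameters with a short calculation. The sketch assumes that each query occupies one GPU for its full processing time and that GPUs are billed only for that time; the variable names are hypothetical.

```python
users = 100_000_000              # active users
queries_per_user_per_day = 10    # queries by an average user per productive day
productive_days = 260            # productive days in a year
gpu_cost_per_hour = 2.07         # cost of one P100 GPU per hour, in dollars
latency_before_s = 0.100         # BERT-based model: 100 ms per query
latency_after_s = 0.002          # after distillation: 2 ms per query

def yearly_cost(latency_s):
    """Yearly GPU cost in dollars if every query takes `latency_s` seconds of GPU time."""
    queries = users * queries_per_user_per_day * productive_days
    gpu_hours = queries * latency_s / 3600
    return gpu_hours * gpu_cost_per_hour

cost_before = yearly_cost(latency_before_s)
cost_after = yearly_cost(latency_after_s)
savings = cost_before - cost_after
print(f"before: ${cost_before:,.0f}  after: ${cost_after:,.0f}  savings: ${savings:,.0f}")
```

Under these assumptions the BERT-based model costs roughly $14.95 MM per year and the distilled model roughly $0.3 MM, which is consistent with the rounded figure of $15 MM in savings on the slide.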

Thank You
Code and resources available at:
https://aka.ms/XtremeDistil

Low-resource NER for 41 Languages
100 labeled samples per language and unlabeled transfer samples
[Figure: results plotted against the amount of transfer data]

XtremeDistil matches Transformer teacher with > 26x compression
[Figure: chart with a vertical axis from 50 to 100 over the datasets AG News, IMDB, Elec and DBPedia; legend: RNN, TinyBERT, BERT Large (Teacher); chart label: 500]
Parameters: Distilled Student 13M, BERT Large 340M
Distillation most effective in low-resource settings

XtremeDistil for Sentiment Classification on SST-2
12 million sentences sampled from IMDB as unlabeled transfer set for distillation

Which student architecture to use?

Multi-Stage Distillation: Progressive Improvement
