MICROSOFT RESEARCH AI
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
SUBHABRATA MUKHERJEE
AHMED H. AWADALLAH
ACL 2020
MICROSOFT RESEARCH, ACL 2020
Language Model Pre-Training
Representation Transfer
• Pre-training: representations learned via self-supervision (language modeling objectives)
• Fine-tuning: fine-tune on the target task (e.g., Paraphrase, Q&A, Sentiment) on labeled data (e.g., cross-entropy loss objective)
Pre-trained Model Complexity
Increase in model parameters, 2017 to 2020 (in timeline order):
• CoVe: 30M params
• ELMo: ~90M params
• OpenAI GPT: ~100M params
• BERT: 340M params
• OpenAI GPT-2: 1.5B params
• Nvidia Megatron: 8B params
• Google T5: 11B params
• Turing-NLG: 17B params
• OpenAI GPT-3: 175B params
Knowledge Distillation to Mimic Teacher’s Output
Teacher: Huge pre-trained model
Student: Shallow model
Transfer Set: Unlabeled Data
(Ba and Caruana, 2014; Romero et al., 2014; Hinton, Vinyals and Dean, 2015)
• Logits
  • Unnormalized log-probability values over classes (before softmax)
  • Capture uncertainty better than hard labels
• Internal Representations
  • Hint-based guidance from Teacher
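To illustrate why logits carry more information than hard labels, here is a minimal sketch in plain Python. The logit values and the class order (PER, ORG, LOC, O) are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label(logits):
    """One-hot vector for the arg-max class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]

# Hypothetical teacher logits for one token over (PER, ORG, LOC, O).
teacher_logits = [2.0, 1.8, -1.0, 0.1]
soft = softmax(teacher_logits)     # keeps the near-tie between PER and ORG
hard = hard_label(teacher_logits)  # collapses it to PER only
```

In this toy case the hard label says only "PER", while the soft distribution shows that the teacher considers ORG almost as likely. That is the uncertainty the student can learn from.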
Knowledge Distillation
(1) LM pre-training (optional)
(2) Teacher: task-specific training on labeled data w. CE loss
(3) Teacher: generate logits and representations on unlabeled data, producing the augmented data
(4) Student: task-specific soft training on the augmented data
TASK: Multi-lingual Named Entity Recognition
Example sequences with IOB2 tags:
R.H. Saunders ( St. Lawrence River ) ( 968 MW )
B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O
Широко распространён в Австралии , где его выращивают на срезку .
O O O B-LOC O O O O O O O
(Russian: "Widely distributed in Australia, where it is grown for cutting.")
Jakers ! Aventurile lui Piggley Winks
B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG
Identify PER, ORG and LOC entities jointly across multiple languages
We adopt Multilingual BERT (mBERT), pre-trained on Wikipedia in 104 languages, as the teacher
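The tag sequences above follow the B-/I-/O convention. As a rough sketch of how entity spans can be read off such tags, the helper below (`iob2_to_spans`, a hypothetical name, not part of the XtremeDistil code) groups tokens into typed spans, using one of the example sentences from the slide.

```python
def iob2_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from a B-/I-/O tag sequence."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = "R.H. Saunders ( St. Lawrence River ) ( 968 MW )".split()
tags = "B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O".split()
spans = iob2_to_spans(tokens, tags)
```

For this sentence the helper yields two ORG spans, "R.H. Saunders" and "St. Lawrence River". Dropping a stray I- tag that does not continue an open span is a design choice of this sketch, not something the slides specify.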
Knowledge Distillation (1/2)
Student architecture: Shared Word Embedding Layer → Shared RNN Layer → Shared Feedforward Layer
Input: labeled sequences $\{x_l, y_l\} \in D_l$
Student representation for token $x_k$: $z_s(x_k)$
Student prediction: $p_s(x_k) = \mathrm{softmax}(\cdot)$
Cross-entropy loss (labeled):
$\text{multitask loss} = -\sum_{x \in D_l} \sum_{x_k \in x} y_k \log p_s(x_k)$
T: teacher
s: student
$D_l$: labeled data
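The cross-entropy loss on this slide sums, over every labeled sequence and every token in it, the negative log-probability the student assigns to the gold tag. A minimal sketch in plain Python follows; the probabilities and class indices are hypothetical, and `y_k` is assumed to be one-hot so that only the gold-class term survives.

```python
import math

def token_cross_entropy(gold_index, probs):
    """-log p_s(x_k) for the gold class of one token (one-hot y_k)."""
    return -math.log(probs[gold_index])

def sequence_ce_loss(sequences):
    """Sum of token-level cross-entropy over all labeled sequences.

    `sequences` is a list of sequences; each sequence is a list of
    (gold_index, probs) pairs, one pair per token x_k.
    """
    return sum(
        token_cross_entropy(gold, probs)
        for seq in sequences
        for gold, probs in seq
    )

# Hypothetical student outputs for one 2-token sequence over 3 classes.
labeled = [[(0, [0.7, 0.2, 0.1]), (2, [0.1, 0.1, 0.8])]]
ce = sequence_ce_loss(labeled)
```

The double loop mirrors the two sums in the formula: the outer one over sequences in the labeled set, the inner one over tokens of each sequence.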
Knowledge Distillation (2/2)
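Student architecture: Shared Word Embedding Layer → Shared BiLSTM Layer → Shared feed-forward layer
Labeled input: $\{x_l, y_l\} \in D_l$
Unlabeled input with teacher outputs: $\{x_k, \mathrm{logit}_T(y_k), z_T(x_k)\} \in D_u$
Student representation for token $x_k$: $z_s(x_k)$
Student outputs: $p_s(x) = \mathrm{softmax}(\cdot)$ for the cross-entropy loss and $p_s(x_k) = \mathrm{linear}(\cdot)$ for the logit loss
$\text{multitask loss} = \mathrm{CE\_loss} + \sum_{x \in D_u} \sum_{x_k \in x} \left[ \left\| \mathrm{logit}_T(y_k) - p_s(x_k) \right\|^2 + \mathrm{KLD}\left( z_T(x_k) \,\|\, z_s(x_k) \right) \right]$
• CE_loss: cross-entropy loss (labeled)
• First summed term: mean-squared logit loss (unlabeled)
• Second summed term: representation loss (unlabeled)
T: teacher
s: student
$D_u$: unlabeled data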
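The two unlabeled-data terms can be sketched for a single token as follows. This is an illustration under stated assumptions, not the authors' implementation: the numeric values are hypothetical, teacher and student vectors are assumed to have the same dimension, and the representations are passed through a softmax before the KL divergence because the slide does not say how they are normalized.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def logit_loss(teacher_logits, student_logits):
    """Squared L2 distance between teacher and student logits."""
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def representation_loss(teacher_repr, student_repr):
    """KLD between representations.

    Assumption: both vectors are softmax-normalized first so that the
    KL divergence is well defined.
    """
    return kl_divergence(softmax(teacher_repr), softmax(student_repr))

# Hypothetical values for one unlabeled token x_k.
logit_T, logit_s = [2.0, 0.5, -1.0], [1.5, 0.5, -0.5]
z_T, z_s = [0.3, -0.2, 0.9, 0.1], [0.1, 0.0, 0.7, 0.2]

unlabeled_term = logit_loss(logit_T, logit_s) + representation_loss(z_T, z_s)
```

Summing `unlabeled_term` over all tokens of all unlabeled sequences and adding the cross-entropy loss on labeled data gives the multitask loss on this slide.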
Multi-Stage Distillation with Labeled (L) and Unlabeled (U) Data
Joint
• α CEloss(L) + β Logitloss(U) + γ Reploss(U)
2-Stage
• Stage 1: Reploss(U)
• Stage 2: α CEloss(L) + β Logitloss(U)
3-Stage
• Stage 1: Reploss(U)
• Stage 2: Logitloss(U)
• Stage 3: CEloss(L)
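One way to picture the three schedules is as lists of stages, where each stage names the losses it optimizes and their weights. The sketch below is only an illustration: the weight values for α, β, γ are placeholders, the loss values are made up, and the helper names are hypothetical.

```python
# Placeholder weights; the slides do not give values for alpha, beta, gamma.
ALPHA, BETA, GAMMA = 1.0, 1.0, 1.0

# Each schedule is a list of stages; each stage maps a loss name to its weight.
SCHEDULES = {
    "joint": [{"CEloss": ALPHA, "Logitloss": BETA, "Reploss": GAMMA}],
    "2-stage": [{"Reploss": 1.0}, {"CEloss": ALPHA, "Logitloss": BETA}],
    "3-stage": [{"Reploss": 1.0}, {"Logitloss": 1.0}, {"CEloss": 1.0}],
}

def stage_objective(stage, loss_values):
    """Weighted sum of the losses that are active in one stage."""
    return sum(weight * loss_values[name] for name, weight in stage.items())

def needs_weight_tuning(schedule):
    """True if any stage combines more than one loss."""
    return any(len(stage) > 1 for stage in schedule)

# Hypothetical loss values for one training step.
loss_values = {"CEloss": 0.6, "Logitloss": 0.5, "Reploss": 0.2}
joint_objective = stage_objective(SCHEDULES["joint"][0], loss_values)
```

Only the 3-stage schedule optimizes a single loss per stage, so it has no combination weights to tune.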
Hyper-parameter free Loss Combination
3-Stage
• Stage 1: Reploss(U)
• Stage 2: Logitloss(U)
• Stage 3: CEloss(L)
Stages move from general representation (Stage 1) to task-specific tuning (Stage 3)
Multi-Stage Distillation: Caveats
• Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999)
• Model forgets information from earlier stages / tasks
• General solution: Update model parameters progressively*
1. Progressively in time (freezing)
2. Progressively in intensity (lower learning rates)
3. Progressively vs. pre-trained model (regularization)
*https://ruder.io/state-of-transfer-learning-in-nlp/
• Our approach: Multi-stage distillation of student with gradual unfreezing and
cosine learning rate schedule [1 + 2 + 3]
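The two mechanisms named above can be sketched in a few lines of plain Python. This is a generic illustration, not the authors' training code: the cosine formula is the standard annealing form, the top-to-bottom unfreezing order and one-layer-per-epoch pace are assumptions, and the layer names and learning-rate values are placeholders.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def trainable_layers(layers, epoch):
    """Gradual unfreezing: train only the top layer at epoch 0, then
    unfreeze one more layer (top to bottom) at each later epoch."""
    count = min(epoch + 1, len(layers))
    return layers[-count:]

# Student layers as named on the slides, listed bottom to top.
STUDENT_LAYERS = ["word_embedding", "bilstm", "feed_forward"]

# (epoch, layers being updated, learning rate) for a 3-epoch stage.
schedule = [
    (epoch, trainable_layers(STUDENT_LAYERS, epoch), cosine_lr(epoch, 3))
    for epoch in range(4)
]
```

Freezing lower layers early protects what was learned in previous stages, and the decaying learning rate limits how far later updates can move the parameters.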
Massive Multi-lingual NER F1 over 41 Languages
Dataset: WikiAnn [1, 2] with 705K train, 329K dev and 329K test sequences in IOB2 format with PER, ORG, LOC tags
1. Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. ACL 2017. Crosslingual name tagging and linking for 282 languages.
2. Afshin Rahimi, Yuan Li, and Trevor Cohn. ACL 2019. Massively multilingual transfer for NER.
Refer to paper for text classification datasets and experiments
XtremeDistil Summary: Distilled mBERT for NER
• Language-agnostic: single model for all languages
• > 35x parameter compression
• > 51x latency speedup for batch inference
• 95% of mBERT F1 for NER over 41 languages
Parameter Compression (x) vs. F1 against mBERT
[Figure: parameter compression (y-axis, 0 to 40x) plotted against F1 measure (x-axis, 84 to 89) for student configurations labeled (E,H), with a marker at 95% of mBERT F1]
(E,H) denotes word embedding dim. and BiLSTM hidden states
Inference Speedup (x) vs. F1 against mBERT
[Figure: inference speedup (y-axis, 0 to 80x) plotted against F1 measure (x-axis, 84 to 89) for student configurations labeled (E,H), with a marker at 95% of mBERT F1]
Config: batch_size=32 on single P100 GPU
(E,H) denotes word embedding dim. and BiLSTM hidden states
XtremeDistil NER F1 (41 Lang.): Model Features vs. Transfer Data
[Figure panels: Model Features; Unlabeled Transfer Data]
• Performance increases with internal representations and learning schedule
• Performance increases with more unlabeled transfer data till saturation
Which teacher layer to distil from?
Higher layers provide more task-specific knowledge but are harder for a shallow student to distil
Which (multi-lingual) word embeddings to initialize with?
- Random initialization works well
- Singular Value Decomposition (SVD) for dimensionality reduction over fine-tuned mBERT word embeddings obtains compression + better performance
Takeaways
• Useful distillation aspects
• Hidden representations from intermediate teacher layers
• Stagewise optimization + gradual unfreezing + learning rate scheduler
• Key trade-off
• Student architecture for low-latency configurations vs. F1 score
• Parameter compression vs. latency speedup vs. F1 score
• 35x parameter compression with 51x latency speedup retaining 95% mBERT F1 score for
NER over 41 languages
• XtremeDistil can be easily extended to other tasks and languages
Making SOTA Affordable in Practice
• Consider a hypothetical scenario: 100 MM users, 10 queries / user, 100 ms latency / query for BERT
Scenario figures:
- Active users: 100 Million
- Queries by an average user per productive day: 10
- Productive days in a year: 260
- Cost of 1 x P100 GPU per hour: $2.07
- Query processing time for a BERT-based model: 100 ms
- Query processing time after distillation: 2 ms
- Savings: $15 MM per year
• Inference cost saving potential with XtremeDistil
- 35x compression with 51x latency speedup
- 95% performance match with $15 MM savings per year for a single scenario
• 25x increase in inference capacity
• Distilled model achieves parity with full model
• Opportunity for application in Enterprise Search, Communication and Productivity scenarios
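The savings figure can be roughly reproduced from the scenario numbers on the slide. The sketch below assumes that queries are processed one after another with no batching, and that cost is simply GPU time multiplied by the hourly P100 price; both are simplifying assumptions, not statements from the talk.

```python
# Scenario figures from the slide.
ACTIVE_USERS = 100_000_000
QUERIES_PER_USER_PER_DAY = 10
PRODUCTIVE_DAYS_PER_YEAR = 260
GPU_COST_PER_HOUR = 2.07             # 1 x P100 GPU, in dollars
BERT_SECONDS_PER_QUERY = 0.100       # 100 ms
DISTILLED_SECONDS_PER_QUERY = 0.002  # 2 ms

def yearly_gpu_cost(seconds_per_query):
    """Yearly cost, assuming queries run sequentially on P100 GPUs."""
    queries = ACTIVE_USERS * QUERIES_PER_USER_PER_DAY * PRODUCTIVE_DAYS_PER_YEAR
    gpu_hours = queries * seconds_per_query / 3600
    return gpu_hours * GPU_COST_PER_HOUR

bert_cost = yearly_gpu_cost(BERT_SECONDS_PER_QUERY)
distilled_cost = yearly_gpu_cost(DISTILLED_SECONDS_PER_QUERY)
savings = bert_cost - distilled_cost

print(f"BERT-based model:   ${bert_cost:,.0f} per year")
print(f"After distillation: ${distilled_cost:,.0f} per year")
print(f"Savings:            ${savings:,.0f} per year")
```

Under these assumptions the BERT-based model costs about $14.95 MM per year and the distilled model about $0.3 MM, which is in line with the roughly $15 MM savings quoted on the slide.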
Thank You
Code and resources available at:
https://aka.ms/XtremeDistil
Low-resource NER for 41 Languages
100 labeled samples per language and unlabeled transfer samples
[Figure: results plotted against the amount of transfer data]
XtremeDistil matches Transformer teacher with > 26x compression
[Figure: bar chart with axis ticks from 50 to 100 over AG News, IMDB, Elec and DBPedia; legend: RNN, TinyBERT, BERT Large (Teacher); chart label: 500]
• Distillation most effective in low-resource settings
• Parameters: Distilled Student 13M vs. BERT Large 340M
XtremeDistil for Sentiment Classification on SST-2
12 million sentences sampled from IMDB
as unlabeled transfer set for distillation
Which student architecture to use?
Multi-Stage Distillation: Progressive Improvement
27

More Related Content

What's hot

assembly language programming and organization of IBM PC" by YTHA YU
assembly language programming and organization of IBM PC" by YTHA YUassembly language programming and organization of IBM PC" by YTHA YU
assembly language programming and organization of IBM PC" by YTHA YU
Education
 
LDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming EncodingLDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming Encoding
Bhagwat Singh Rathore
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Deep Learning Italia
 
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
VLSICS Design
 
Lifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-CLifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-C
Federico Tomassetti
 
Regular expression to NFA (Nondeterministic Finite Automata)
Regular expression to NFA (Nondeterministic Finite Automata)Regular expression to NFA (Nondeterministic Finite Automata)
Regular expression to NFA (Nondeterministic Finite Automata)
Niloy Biswas
 
JetBrains MPS: Projectional Editing in Domain-Specific Languages
JetBrains MPS: Projectional Editing in Domain-Specific LanguagesJetBrains MPS: Projectional Editing in Domain-Specific Languages
JetBrains MPS: Projectional Editing in Domain-Specific Languages
Oscar Rodriguez
 
Turbo codes
Turbo codesTurbo codes
Turbo codes
RAVINDRA GAIKWAD
 
Reed solomon codes
Reed solomon codesReed solomon codes
Reed solomon codes
Samreen Reyaz Ansari
 

What's hot (9)

assembly language programming and organization of IBM PC" by YTHA YU
assembly language programming and organization of IBM PC" by YTHA YUassembly language programming and organization of IBM PC" by YTHA YU
assembly language programming and organization of IBM PC" by YTHA YU
 
LDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming EncodingLDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming Encoding
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
Evolution of Structure of Some Binary Group-Based N-Bit Compartor, N-To-2N De...
 
Lifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-CLifting variability from C to mbeddr-C
Lifting variability from C to mbeddr-C
 
Regular expression to NFA (Nondeterministic Finite Automata)
Regular expression to NFA (Nondeterministic Finite Automata)Regular expression to NFA (Nondeterministic Finite Automata)
Regular expression to NFA (Nondeterministic Finite Automata)
 
JetBrains MPS: Projectional Editing in Domain-Specific Languages
JetBrains MPS: Projectional Editing in Domain-Specific LanguagesJetBrains MPS: Projectional Editing in Domain-Specific Languages
JetBrains MPS: Projectional Editing in Domain-Specific Languages
 
Turbo codes
Turbo codesTurbo codes
Turbo codes
 
Reed solomon codes
Reed solomon codesReed solomon codes
Reed solomon codes
 

Similar to XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

defense
defensedefense
defense
Qing Dou
 
N20181217
N20181217N20181217
N20181217
TMU, Japan
 
Introduction to Prolog
Introduction to PrologIntroduction to Prolog
Introduction to Prolog
Chamath Sajeewa
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
Databricks
 
Deep-learning based Language Understanding and Emotion extractions
Deep-learning based Language Understanding and Emotion extractionsDeep-learning based Language Understanding and Emotion extractions
Deep-learning based Language Understanding and Emotion extractions
Jeongkyu Shin
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
Apache MXNet
 
Academy Software Foundation on MaterialX | SIGGRAPH 2021
Academy Software Foundation on MaterialX | SIGGRAPH 2021 Academy Software Foundation on MaterialX | SIGGRAPH 2021
Academy Software Foundation on MaterialX | SIGGRAPH 2021
Alejandro Franceschi
 
Generating super resolution images using transformers
Generating super resolution images using transformersGenerating super resolution images using transformers
Generating super resolution images using transformers
NEERAJ BAGHEL
 
Merghani-SACNAS Poster
Merghani-SACNAS PosterMerghani-SACNAS Poster
Merghani-SACNAS Poster
Taha Merghani
 
Reginf pldi3
Reginf pldi3Reginf pldi3
Reginf pldi3
daniel_yokomizo
 
A Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description LogicsA Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description Logics
Jie Bao
 
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str..."Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
Anıl Osman Tur
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
Bhaskar Mitra
 
Classification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF MetricClassification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
zukun
 
Reverse-Engineering Reusable Language Modules from Legacy DSLs
Reverse-Engineering Reusable Language Modules from Legacy DSLsReverse-Engineering Reusable Language Modules from Legacy DSLs
Reverse-Engineering Reusable Language Modules from Legacy DSLs
David Méndez-Acuña
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
WooSung Choi
 
A machine-learning view on heterogeneous catalyst design and discovery
A machine-learning view on heterogeneous catalyst design and discoveryA machine-learning view on heterogeneous catalyst design and discovery
A machine-learning view on heterogeneous catalyst design and discovery
Ichigaku Takigawa
 
FPGA - Programmable Logic Design
FPGA - Programmable Logic DesignFPGA - Programmable Logic Design
FPGA - Programmable Logic Design
Dr. Shivananda Koteshwar
 

Similar to XtremeDistil: Multi-stage Distillation for Massive Multilingual Models (20)

defense
defensedefense
defense
 
N20181217
N20181217N20181217
N20181217
 
Introduction to Prolog
Introduction to PrologIntroduction to Prolog
Introduction to Prolog
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Deep-learning based Language Understanding and Emotion extractions
Deep-learning based Language Understanding and Emotion extractionsDeep-learning based Language Understanding and Emotion extractions
Deep-learning based Language Understanding and Emotion extractions
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Academy Software Foundation on MaterialX | SIGGRAPH 2021
Academy Software Foundation on MaterialX | SIGGRAPH 2021 Academy Software Foundation on MaterialX | SIGGRAPH 2021
Academy Software Foundation on MaterialX | SIGGRAPH 2021
 
Generating super resolution images using transformers
Generating super resolution images using transformersGenerating super resolution images using transformers
Generating super resolution images using transformers
 
Merghani-SACNAS Poster
Merghani-SACNAS PosterMerghani-SACNAS Poster
Merghani-SACNAS Poster
 
Reginf pldi3
Reginf pldi3Reginf pldi3
Reginf pldi3
 
A Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description LogicsA Distributed Tableau Algorithm for Package-based Description Logics
A Distributed Tableau Algorithm for Package-based Description Logics
 
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str..."Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
"Isolated Sign Recognition with a Siamese Neural Network of RGB and Depth Str...
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Classification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF MetricClassification of CNN.com Articles using a TF*IDF Metric
Classification of CNN.com Articles using a TF*IDF Metric
 
NIPS2007: structured prediction
NIPS2007: structured predictionNIPS2007: structured prediction
NIPS2007: structured prediction
 
Reverse-Engineering Reusable Language Modules from Legacy DSLs
Reverse-Engineering Reusable Language Modules from Legacy DSLsReverse-Engineering Reusable Language Modules from Legacy DSLs
Reverse-Engineering Reusable Language Modules from Legacy DSLs
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
A machine-learning view on heterogeneous catalyst design and discovery
A machine-learning view on heterogeneous catalyst design and discoveryA machine-learning view on heterogeneous catalyst design and discovery
A machine-learning view on heterogeneous catalyst design and discovery
 
FPGA - Programmable Logic Design
FPGA - Programmable Logic DesignFPGA - Programmable Logic Design
FPGA - Programmable Logic Design
 

More from Subhabrata Mukherjee

Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Subhabrata Mukherjee
 
Fact Checking from Text
Fact Checking from TextFact Checking from Text
Fact Checking from Text
Subhabrata Mukherjee
 
OpenTag: Open Attribute Value Extraction From Product Profiles
OpenTag: Open Attribute Value Extraction From Product ProfilesOpenTag: Open Attribute Value Extraction From Product Profiles
OpenTag: Open Attribute Value Extraction From Product Profiles
Subhabrata Mukherjee
 
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Subhabrata Mukherjee
 
Continuous Experience-aware Language Model
Continuous Experience-aware Language ModelContinuous Experience-aware Language Model
Continuous Experience-aware Language Model
Subhabrata Mukherjee
 
Experience aware Item Recommendation in Evolving Review Communities
Experience aware Item Recommendation in Evolving Review CommunitiesExperience aware Item Recommendation in Evolving Review Communities
Experience aware Item Recommendation in Evolving Review Communities
Subhabrata Mukherjee
 
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Subhabrata Mukherjee
 
Leveraging Joint Interactions for Credibility Analysis in News Communities
Leveraging Joint Interactions for Credibility Analysis in News CommunitiesLeveraging Joint Interactions for Credibility Analysis in News Communities
Leveraging Joint Interactions for Credibility Analysis in News Communities
Subhabrata Mukherjee
 
People on Drugs: Credibility of User Statements in Health Forums
People on Drugs: Credibility of User Statements in Health ForumsPeople on Drugs: Credibility of User Statements in Health Forums
People on Drugs: Credibility of User Statements in Health Forums
Subhabrata Mukherjee
 
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
Subhabrata Mukherjee
 
Joint Author Sentiment Topic Model
Joint Author Sentiment Topic ModelJoint Author Sentiment Topic Model
Joint Author Sentiment Topic Model
Subhabrata Mukherjee
 
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterTwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
Subhabrata Mukherjee
 
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
Subhabrata Mukherjee
 
Leveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word SimilarityLeveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word Similarity
Subhabrata Mukherjee
 
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
Subhabrata Mukherjee
 
Feature specific analysis of reviews
Feature specific analysis of reviewsFeature specific analysis of reviews
Feature specific analysis of reviews
Subhabrata Mukherjee
 
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
Subhabrata Mukherjee
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Subhabrata Mukherjee
 

More from Subhabrata Mukherjee (18)

Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
 
Fact Checking from Text
Fact Checking from TextFact Checking from Text
Fact Checking from Text
 
OpenTag: Open Attribute Value Extraction From Product Profiles
OpenTag: Open Attribute Value Extraction From Product ProfilesOpenTag: Open Attribute Value Extraction From Product Profiles
OpenTag: Open Attribute Value Extraction From Product Profiles
 
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...
 
Continuous Experience-aware Language Model
Continuous Experience-aware Language ModelContinuous Experience-aware Language Model
Continuous Experience-aware Language Model
 
Experience aware Item Recommendation in Evolving Review Communities
Experience aware Item Recommendation in Evolving Review CommunitiesExperience aware Item Recommendation in Evolving Review Communities
Experience aware Item Recommendation in Evolving Review Communities
 
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
 
Leveraging Joint Interactions for Credibility Analysis in News Communities
Leveraging Joint Interactions for Credibility Analysis in News CommunitiesLeveraging Joint Interactions for Credibility Analysis in News Communities
Leveraging Joint Interactions for Credibility Analysis in News Communities
 
People on Drugs: Credibility of User Statements in Health Forums
People on Drugs: Credibility of User Statements in Health ForumsPeople on Drugs: Credibility of User Statements in Health Forums
People on Drugs: Credibility of User Statements in Health Forums
 
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...
 
Joint Author Sentiment Topic Model
Joint Author Sentiment Topic ModelJoint Author Sentiment Topic Model
Joint Author Sentiment Topic Model
 
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterTwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter
 
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...
 
Leveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word SimilarityLeveraging Sentiment to Compute Word Similarity
Leveraging Sentiment to Compute Word Similarity
 
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...
 
Feature specific analysis of reviews
Feature specific analysis of reviewsFeature specific analysis of reviews
Feature specific analysis of reviews
 
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...
 
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSentiment Analysis in Twitter with Lightweight Discourse Analysis
Sentiment Analysis in Twitter with Lightweight Discourse Analysis
 

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

  • 1. MICROSOFT RESEARCH AI. XtremeDistil: Multi-stage Distillation for Massive Multilingual Models. SUBHABRATA MUKHERJEE, AHMED H. AWADALLAH. ACL 2020
  • 2. Language Model Pre-Training. Representation transfer to target tasks: Paraphrase, Q&A, Sentiment. Learned via self-supervision (language modeling objectives). Fine-tune on target task on labeled data (e.g., cross-entropy loss objective).
  • 3. Pre-trained Model Complexity. Timeline from 2017 to 2020: CoVe, Elmo, OpenAI GPT, BERT, OpenAI GPT-2, Nvidia Megatron, Google T5, Turing-NLG, OpenAI GPT-3.
  • 4. Pre-trained Model Complexity: increase in model parameters from 2017 to 2020. CoVe (30M params), Elmo (~90M params), OpenAI GPT (~100M params), BERT (340M params), OpenAI GPT-2 (1.5B params), Nvidia Megatron (8B params), Google T5 (11B params), Turing-NLG (17B params), OpenAI GPT-3 (175B params).
  • 5. Knowledge Distillation to Mimic Teacher's Output (Ba and Caruana, 2014; Romero et al., 2014; Hinton, Vinyals and Dean, 2015). Teacher: huge pre-trained model. Student: shallow model. Transfer set: unlabeled data. • Logits: log probability values over classes (before softmax); capture uncertainty better than hard labels. • Internal representations: hint-based guidance from the teacher.
  • 6. Knowledge Distillation. Teacher: (1) LM pre-training (optional); (2) task-specific training w. CE loss on labeled data; (3) generate logits and representations on unlabeled data, giving the augmented data. Student: (4) task-specific soft training.
  • 7. TASK: Multi-lingual Named Entity Recognition. Identify PER, ORG and LOC from multiple languages jointly. Examples (tokens → tags): "Jakers ! Aventurile lui Piggley Winks" → B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG; "R.H. Saunders ( St. Lawrence River ) ( 968 MW )" → B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O; "Широко распространён в Австралии , где его выращивают на срезку ." → O O O B-LOC O O O O O O O. We adopt Multi-lingual BERT pre-trained on 104 languages in Wikipedia as the teacher.
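To make the tagging scheme on this slide concrete, here is a minimal sketch that groups IOB2 tags back into entity spans. The function name `iob2_spans` is hypothetical and is not taken from the XtremeDistil code; it only illustrates how the B-/I-/O tags in the examples are read.

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the open span.
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        # B- opens a new span; a stray I- is treated as a span start too.
        if tag.startswith("B-") or tag.startswith("I-"):
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "R.H. Saunders ( St. Lawrence River ) ( 968 MW )".split()
tags = "B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O".split()
print(iob2_spans(tokens, tags))
```

For the second example on the slide this yields two ORG spans, "R.H. Saunders" and "St. Lawrence River".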
  • 8. Knowledge Distillation (1/2). T: teacher, s: student, $D_l$: labeled data. Input: $\{x_l, y_l\} \in D_l$. Student: Shared Word Embedding Layer → Shared RNN Layer → Shared Feedforward Layer → $z_s(x_k)$, with $p_s(x_k) = \mathrm{softmax}(x_k)$. Cross-entropy loss (labeled): $\text{multitask loss} = -\sum_{x \in D_l} \sum_{x_k \in x} y_k \log p_s(x_k)$.
  • 9. Knowledge Distillation (2/2). T: teacher, s: student, $D_l$: labeled data, $D_u$: unlabeled data. Inputs: $\{x_k, y_k\} \in D_l$ and $\{x_k, \mathrm{logit}_T(y_k), z_T(x_k)\} \in D_u$. Student: Shared Word Embedding Layer → Shared BiLSTM Layer → Shared feed-forward layer → $z_s(x_k)$, with outputs $p_s(x) = \mathrm{softmax}(\cdot)$ and $p_s(x_k) = \mathrm{linear}(\cdot)$. $\text{multitask loss} = \text{CE\_loss} + \sum_{x \in D_u} \sum_{x_k \in x} \Big[ \big(\mathrm{logit}_T(y_k) - p_s(x_k)\big)^2 + \mathrm{KLD}\big(z_T(x_k) \,\|\, z_s(x_k)\big) \Big]$. The three terms are the cross-entropy loss (labeled), the mean-squared logit loss (unlabeled) and the representation loss (unlabeled).
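The three loss terms on this slide can be sketched for a single token in plain Python. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the logit loss is written as a sum of squared differences over classes (divide by the number of classes for a mean), and the representations are normalised with a softmax before the KL divergence, which the slide does not specify.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, gold_index):
    """CE loss on a labeled token: -log p_s(gold class)."""
    return -math.log(softmax(student_logits)[gold_index])


def logit_loss(teacher_logits, student_logits):
    """Squared difference between teacher logits and student linear outputs."""
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits))


def kld(p, q):
    """KL divergence KLD(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def representation_loss(z_teacher, z_student):
    """Assumption: turn both representations into distributions via softmax."""
    return kld(softmax(z_teacher), softmax(z_student))


# Toy values for one labeled token and one unlabeled token.
ce = cross_entropy([2.0, 0.5, -1.0], gold_index=0)
mse = logit_loss([1.5, 0.2, -0.7], [1.0, 0.0, -1.0])
rep = representation_loss([0.3, -0.1, 0.8, 0.0], [0.2, 0.0, 0.5, 0.1])
print(ce + mse + rep)
```

In practice the student and teacher representations need matching dimensions before they can be compared; how that projection is done is outside this sketch.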
  • 10. Multi-Stage Distillation with Labeled (L) and Unlabeled (U) Data. Joint: α CEloss(L) + β Logitloss(U) + γ Reploss(U). 2-Stage: Stage 1: Reploss(U); Stage 2: α CEloss(L) + β Logitloss(U). 3-Stage: Stage 1: Reploss(U); Stage 2: Logitloss(U); Stage 3: CEloss(L).
  • 11. Hyper-parameter free Loss Combination. 3-Stage: Stage 1: Reploss(U); Stage 2: Logitloss(U); Stage 3: CEloss(L). The stages move from general representation to task-specific tuning.
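The difference between the joint, 2-stage and 3-stage strategies can be written down as a small table of per-stage loss weights. The sketch below is illustrative only: `stage_objectives`, `objective` and the short loss keys `ce`, `logit` and `rep` are hypothetical names standing in for CEloss(L), Logitloss(U) and Reploss(U).

```python
def stage_objectives(strategy, alpha=1.0, beta=1.0, gamma=1.0):
    """Return one {loss_name: weight} dict per training stage."""
    if strategy == "joint":
        # One stage, three weights to tune.
        return [{"ce": alpha, "logit": beta, "rep": gamma}]
    if strategy == "2-stage":
        # Representation loss first, then a weighted CE + logit objective.
        return [{"rep": 1.0}, {"ce": alpha, "logit": beta}]
    if strategy == "3-stage":
        # One loss per stage, so no mixing weights are needed.
        return [{"rep": 1.0}, {"logit": 1.0}, {"ce": 1.0}]
    raise ValueError(f"unknown strategy: {strategy}")


def objective(weights, losses):
    """Weighted sum of the losses that are active in a stage."""
    return sum(weight * losses[name] for name, weight in weights.items())


losses = {"ce": 2.0, "logit": 1.0, "rep": 5.0}  # toy loss values
for stage, weights in enumerate(stage_objectives("3-stage"), start=1):
    print(stage, weights, objective(weights, losses))
```

Because each stage of the 3-stage variant optimises a single loss, α, β and γ drop out, which is what the slide means by a hyper-parameter free combination.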
  • 12. Multi-Stage Distillation: Caveats. • Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): the model forgets information from earlier stages / tasks. • General solution: update model parameters progressively*: 1. progressively in time (freezing); 2. progressively in intensity (lower learning rates); 3. progressively vs. pre-trained model (regularization). • Our approach: multi-stage distillation of student with gradual unfreezing and cosine learning rate schedule [1 + 2 + 3]. *https://ruder.io/state-of-transfer-learning-in-nlp/
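A rough sketch of the two mechanisms named on this slide, gradual unfreezing and a cosine learning rate schedule. The exact schedule used in XtremeDistil is not given here, so this assumes the common cosine annealing form and a top-down unfreezing order with one more layer unfrozen per epoch; `cosine_lr` and `trainable_layers` are hypothetical helper names.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def trainable_layers(layers, epoch):
    """Layers (ordered bottom to top) that are unfrozen at a 0-based epoch.

    Assumption: the top layer trains first and one more layer is
    unfrozen per epoch until the whole student is trainable.
    """
    count = min(epoch + 1, len(layers))
    return layers[-count:]


student = ["word_embedding", "bilstm", "feed_forward"]
for epoch in range(4):
    print(epoch, trainable_layers(student, epoch), cosine_lr(epoch, 4, 1e-3))
```

Freezing lower layers early and lowering the learning rate over time both limit how far later stages can overwrite what earlier stages learned.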
  • 13. Massive Multi-lingual NER F1 over 41 Languages. Dataset: Wikiann [1, 2] with 705K train, 329K dev and 329K test sequences in IOB2 format with PER, ORG, LOC tags. Refer to paper for text classification datasets and experiments. [1] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. ACL 2017. Crosslingual name tagging and linking for 282 languages. [2] Afshin Rahimi, Yuan Li, and Trevor Cohn. ACL 2019. Massively multilingual transfer for NER.
  • 14. XtremeDistil Summary: Distilled mBERT for NER. Language-agnostic: single model for all languages. > 35x parameter compression. > 51x latency speedup for batch inference. 95% of mBERT F1 for NER over 41 languages.
  • 15. Parameter Compression (x) vs. F1 against mBERT. [Figure: Parameter Compression (0 to 40) against F1 Measure (84 to 89) for student configurations, with a marker at 95% of mBERT F1. (E,H) denotes word embedding dim. and BiLSTM hidden states.]
  • 16. Inference Speedup (x) vs. F1 against mBERT. [Figure: Inference Speedup (0 to 80) against F1 Measure (84 to 89) for student configurations, with a marker at 95% of mBERT F1. (E,H) denotes word embedding dim. and BiLSTM hidden states. Config: batch_size=32 on single P100 GPU.]
  • 17. XtremeDistil NER F1 (41 Lang.): Model Features vs. Transfer Data. Model features: performance increases with internal representations and learning schedule. Unlabeled transfer data: performance increases with more unlabeled transfer data till saturation.
  • 18. Which teacher layer to distil from? Higher layers provide more task-specific knowledge but are harder for a shallow student to distil.
  • 19. Which (multi-lingual) word embeddings to initialize with? - Random initialization works well. - Singular Value Decomposition for dimensionality reduction over fine-tuned mBERT word embeddings obtains compression + better performance.
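As a hedged illustration of the SVD step mentioned on this slide, the sketch below reduces an embedding matrix to `k` dimensions with a truncated SVD in NumPy. The matrix here is random stand-in data, not mBERT embeddings, and `svd_reduce` is a hypothetical helper; the sizes 1000, 64 and 16 are arbitrary choices for the example.

```python
import numpy as np


def svd_reduce(embeddings, k):
    """Project a (vocab_size, dim) embedding matrix onto its top-k singular directions."""
    u, s, _vt = np.linalg.svd(embeddings, full_matrices=False)
    # Rows of U_k * S_k are the k-dimensional coordinates of each word.
    return u[:, :k] * s[:k]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # stand-in for teacher word embeddings
reduced = svd_reduce(embeddings, 16)
print(embeddings.shape, "->", reduced.shape)
```

The reduced matrix can then initialise a student embedding layer of width `k`, which shrinks that layer's parameter count by a factor of `dim / k`.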
  • 20. Takeaways. • Useful distillation aspects: hidden representations from intermediate teacher layers; stagewise optimization + gradual unfreezing + learning rate scheduler. • Key trade-off: student architecture for low-latency configurations vs. F1 score; parameter compression vs. latency speedup vs. F1 score. • 35x parameter compression with 51x latency speedup retaining 95% mBERT F1 score for NER over 41 languages. • XtremeDistil can be easily extended to other tasks and languages.
  • 21. Making SOTA Affordable in Practice. • Consider a hypothetical scenario: 100 MM users, 10 queries / user, 100 ms latency / query for BERT. Active users: 100 Million. Queries by an average user per productive day: 10. Productive days in a year: 260. Cost of 1 x P100 GPU per hour: $2.07. Query processing time for a BERT-based model: 100 ms. Query processing time after distillation: 2 ms. • Inference cost saving potential with XtremeDistil for a single hypothetical scenario: 35x compression with 51x latency speedup; 95% performance match with $15 MM savings per year. • 25x increase in inference capacity. • Distilled model achieves parity with full model. • Opportunity for application in Enterprise Search, Communication and Productivity scenarios.
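The savings figure on this slide can be reproduced approximately from the listed inputs. The sketch assumes that every query occupies one P100 GPU for its full processing time and that GPUs are billed only for the time used; neither assumption is stated on the slide, and `annual_gpu_cost` is a hypothetical helper.

```python
def annual_gpu_cost(users, queries_per_day, latency_ms, days_per_year, cost_per_gpu_hour):
    """Yearly GPU bill if each query occupies one GPU for latency_ms."""
    gpu_seconds_per_day = users * queries_per_day * latency_ms / 1000.0
    gpu_hours_per_year = gpu_seconds_per_day / 3600.0 * days_per_year
    return gpu_hours_per_year * cost_per_gpu_hour


# Inputs as listed on the slide.
before = annual_gpu_cost(100_000_000, 10, 100, 260, 2.07)  # BERT-based model
after = annual_gpu_cost(100_000_000, 10, 2, 260, 2.07)     # after distillation
savings = before - after
print(f"before: ${before:,.0f}  after: ${after:,.0f}  savings: ${savings:,.0f}")
```

Under these assumptions the yearly cost drops from about $14.95 MM to about $0.3 MM, which is in line with the roughly $15 MM per year savings quoted on the slide.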
  • 22. Thank You. Code and resources available at: https://aka.ms/XtremeDistil
  • 23. Low-resource NER for 41 Languages. 100 labeled samples per language and unlabeled transfer samples. [Figure: results against the amount of transfer data.]
  • 24. XtremeDistil matches Transformer teacher with > 26x compression. [Figure: scores (axis from 50 to 100) on AG News, IMDB, Elec and DBPedia for RNN, TinyBERT and BERT Large (Teacher).] Distillation most effective in low-resource settings. Parameters: Distilled Student 13M, BERT Large 340M.
  • 25. XtremeDistil for Sentiment Classification on SST-2. 12 million sentences sampled from IMDB as unlabeled transfer set for distillation.
  • 26. Which student architecture to use?
  • 27. Multi-Stage Distillation: Progressive Improvement