The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
Bonggun Shin, Hao Yang and Jinho D. Choi

IJCAI 2019
Typical NLP Models
• Input: a sequence of tokens

• Each token is transformed into a vector

• Once vectorized, any neural network can be attached (see the sketch below)

• Output: a class or a sequence (depends on the task)

[Figure: Input -> Embeddings -> CNN/LSTM -> Output]
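For concreteness, here is a minimal PyTorch sketch of this pipeline (the CNN variant; all layer sizes are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Token ids -> embedding lookup -> CNN -> class logits."""
    def __init__(self, vocab_size, emb_dim=400, num_filters=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len) integer ids
        x = self.embedding(token_ids)                 # (batch, seq_len, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, num_filters, seq_len)
        x = x.max(dim=2).values                       # global max pooling over time
        return self.fc(x)                             # class logits; softmax lives in the loss
```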
Why is Compressing an Embedding Beneficial?
• The majority of model space is occupied by word embeddings

• NLP model compression -> embedding compression

• Neural model compression methods

  • Weight pruning

  • Weight quantization

  • Lossless compression

  • Distillation

[Figure: the Embeddings block dominates the Input -> Embeddings -> CNN/LSTM -> Output stack]
Embedding Compression
• V: the vocabulary size

• D: the size of an original embedding vector

• D’: the size of a compressed embedding vector

[Figure: a V x D embedding matrix compressed to a V x D’ matrix]
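A back-of-the-envelope count shows why the embedding dominates. Using the 2.67M-word sentiment vocabulary mentioned later, and assuming an illustrative D = 400 and D' = 50 (the 8-times setting; these exact dimensions are an assumption, not stated on this slide):

```python
V, D, D_small = 2_670_000, 400, 50      # vocabulary size, original dim, compressed dim

original   = V * D                       # parameters in the large embedding table
compressed = V * D_small                 # parameters after compression
print(f"large embedding: {original:,} params")
print(f"small embedding: {compressed:,} params ({original // compressed}x smaller)")
```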
Embedding Encoding
• Proposed by [Mou et al., 2016]: 8-times compression, with performance loss

• Not a teacher-student model: trains a single model with an encoding (projection) layer

• Encoding (projection) layer: pre-trained large embedding -> smaller embedding

• Discards the large embedding at model deployment
[Figure: embedding encoding. Training: WORD INDEX -> LARGE EMBEDDING -> Projection -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX, supervised by GOLD labels. Model Deployment: WORD INDEX -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX, with the large embedding discarded.]
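A minimal sketch of the encoding idea, assuming PyTorch and illustrative names: during training the frozen large embedding feeds a trainable projection; at deployment the projection is folded into a small per-word table and the large table is dropped.

```python
import torch
import torch.nn as nn

class EncodedEmbedding(nn.Module):
    """Frozen large embedding + trainable projection; names are illustrative."""
    def __init__(self, pretrained: torch.Tensor, small_dim: int):
        super().__init__()                               # pretrained: (V, D) matrix
        self.large = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.proj = nn.Linear(pretrained.size(1), small_dim)

    def forward(self, token_ids):
        return self.proj(self.large(token_ids))          # (batch, seq, small_dim)

    def deploy(self):
        """Fold the projection into a (V, D') table and drop the large embedding."""
        with torch.no_grad():
            small_table = self.proj(self.large.weight)
        return nn.Embedding.from_pretrained(small_table, freeze=False)
```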
We Propose
• Embedding Distillation

  • Single teacher

  • Better performance with an 8-times smaller model size

• Distillation Ensemble

  • Ten teachers

  • Better performance with an 80-times smaller model size
Background
Teacher-Student Model
• Teacher trains on (Dataset.X, Dataset.Y); student trains on (Dataset.X, Teacher.Output)

• Logit matching [Ba and Caruana, 2014]

• Noisy logit matching [Sau and Balasubramanian, 2016]

• Softmax tau matching [Hinton et al., 2014]

(loss sketches follow the figure below)
[Figure: three teacher-student setups, each with WORD INDEX inputs to both models. Logit matching: the student LOGIT matches the teacher LOGIT, with GOLD supervising the teacher. Noisy logit matching: NOISE is added to the teacher LOGIT. Softmax tau matching: the student SOFTMAX matches the teacher SOFTMAX.]
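The three objectives differ only in the target the student matches. A hedged sketch (the noise scale sigma and temperature tau are illustrative hyperparameters, not values from these papers):

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    # [Ba and Caruana, 2014]: regress the teacher's raw logits.
    return F.mse_loss(student_logits, teacher_logits)

def noisy_logit_matching_loss(student_logits, teacher_logits, sigma=0.1):
    # [Sau and Balasubramanian, 2016]: perturb the teacher logits with Gaussian noise.
    noise = sigma * torch.randn_like(teacher_logits)
    return F.mse_loss(student_logits, teacher_logits + noise)

def softmax_tau_loss(student_logits, teacher_logits, tau=2.0):
    # [Hinton et al., 2014]: match temperature-softened class distributions.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```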
Dataset
• Seven document classification tasks

• Pre-trained word embeddings (trained by Word2Vec [Mikolov et al., 2013])

• Sentiment analysis (SST-*, MR and CR): Amazon review data (2.6M)

• Others: Wikipedia + New York Times
5 Experiments
5.1 Datasets
All models are evaluated on seven document classification datasets in Table 1. MR, SST-*, and CR are targeted at the task of sentiment analysis while Subj, TREC, and MPQA are targeted at the classifications of subjectivity, question types, and opinion polarity, respectively. About 10% of the training sets are split into development sets for SST-* and TREC, and about 10/20% of the provided resources are divided into development/evaluation sets for the other datasets, respectively.

Dataset                     | C | TRN   | DEV   | TST
MR [Pang and Lee, 2005]     | 2 | 7,684 | 990   | 1,988
SST-1 [Socher et al., 2013] | 5 | 8,544 | 1,101 | 2,210
SST-2 [Socher et al., 2013] | 2 | 6,920 | 872   | 1,821
Subj [Pang and Lee, 2004]   | 2 | 7,199 | 907   | 1,894
TREC [Li and Roth, 2002]    | 6 | 4,952 | 500   | 500
CR [Hu and Liu, 2004]       | 2 | 2,718 | 340   | 717
MPQA [Wiebe et al., 2005]   | 2 | 7,636 | 955   | 2,015

Table 1: Seven datasets used for our experiments. C: number of classes, TRN/DEV/TST: number of instances in the training/development/evaluation set.
5.2 Word Embeddings
For sentiment analysis, raw text from the Amazon Review dataset is used to train word embeddings, resulting in 2.67M word vectors. For the other tasks, combined text from Wikipedia and the New York Times Annotated corpus is used, resulting in 1.96M word vectors. For each group, two sets [...]

[The right column of this paper page is truncated in the source; the visible fragments continue through §5.4 (pre-training the two-layered projection with an autoencoder) and §5.5 (six embedding distillation models compared).]
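As a rough illustration of how such Word2Vec embeddings are produced (the gensim API call is a common way to do this, but the training script, corpus preprocessing, and the 400-dimensional size are assumptions):

```python
from gensim.models import Word2Vec

# Illustrative stand-in corpus; the paper's embeddings come from Amazon Review
# text (2.67M vectors) and Wikipedia + NYT (1.96M vectors).
corpus = [["the", "movie", "was", "great"],
          ["battery", "life", "is", "poor"]]

# vector_size=400 is an assumption based on the truncated excerpt above.
model = Word2Vec(sentences=corpus, vector_size=400, window=5, min_count=1, workers=4)
vec = model.wv["movie"]                  # a 400-dimensional word vector
```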
Embedding Distillation Model
• Train a teacher using (X, Y)

• Distill to a student with a projection to smaller embeddings using (X, Teacher.Logit); see the sketch after the figure below

• Discard the large embedding at model deployment
[Figure: embedding distillation. The trained TEACHER MODEL's LOGIT supervises the student: WORD INDEX -> LARGE EMBEDDING -> Projection -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX. Model deployment keeps only WORD INDEX -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX.]
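Putting the pieces together, one distillation pass might look like the following sketch; `student` and `teacher` are models in the style of the earlier sketches (hypothetical names, not the authors' code), and logit matching is chosen among the three objectives for simplicity:

```python
import torch
import torch.nn.functional as F

def distill_epoch(student, teacher, loader, optimizer):
    """One pass of embedding distillation with logit matching (illustrative)."""
    teacher.eval()
    for token_ids, _gold in loader:          # gold labels are not used by the student
        with torch.no_grad():
            target = teacher(token_ids)      # teacher logits as the soft target
        optimizer.zero_grad()
        loss = F.mse_loss(student(token_ids), target)
        loss.backward()
        optimizer.step()
```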
Embedding Distillation Result
• Proposed models outperform ENC

• Pre-training (PT) and a 2-layered projection (2L) boost the performance

• Differences among LM, NLM, and STM are marginal
Distillation Ensemble Model
• Train multiple (for example, 10) teachers

• Calculate a representing logit for each input in dataset X (see the routing procedure below)

• Distill to a student with a projection to smaller embeddings using (X, Representing.Logit)

• Discard the large embedding at model deployment
Representing Logit
[Figure: example class distributions over C1-C3 from Teacher1, Teacher2, and Teacher3, and the resulting representing logits under RAE and RDE]
RDE
Routing by 

Agreement

Ensemble: 

Put more weights

on majority

opinions

Routing by 

Disagreement

Ensemble: 

Put more weights

on minority

opinions
Distillation Ensemble Result
• Teachers: 5 CNN-based and 5 LSTM-based models

• RDE significantly outperforms the teachers if the dataset is sufficiently large

• Enough data samples -> more chance to explore different insights from minority opinions
[Bar charts: accuracy of Teacher vs. RAE vs. RDE on MR (axis 0.780-0.800), SST-1 (0.470-0.500), SST-2 (0.850-0.870), Subj (0.920-0.935), MPQA (0.885-0.905), TREC (0.900-0.940), and CR (0.810-0.850).]
Routing procedure (lines 1-4, which appear to initialize the routing weights w and the per-teacher terms x_t, are cut off in the source). With T teachers producing logits z_1, ..., z_T, and k = +1 for RAE or k = -1 for RDE:

    while n iterations do
        c <- softmax(w)
        z_rep <- sum_{t=1}^{T} c_t * z_t
        s <- squash(z_rep)
        if not last iteration then
            for t in {1, ..., T} do
                w_t <- w_t + k * (x_t · s)
    return z_rep
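A hedged Python sketch of this routing procedure. The squash function follows the capsule-network convention, and initializing x_t as the squashed teacher logits is an assumption, since those lines are truncated above:

```python
import torch

def squash(v, dim=-1):
    # Capsule-style squashing: keeps direction, bounds the norm to [0, 1).
    norm_sq = (v * v).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + 1e-9)

def routing_ensemble(teacher_logits, k=1, n_iters=3):
    """Representing logit via routing; k=+1 -> RAE (majority), k=-1 -> RDE (minority).

    teacher_logits: (T, C) tensor of logits from T teachers.
    Reconstructed from the truncated algorithm above; the x_t initialization
    and the exact weight update are assumptions.
    """
    z = teacher_logits
    x = squash(z)                                  # assumed: squashed teacher logits
    w = torch.zeros(z.size(0))                     # routing weights, one per teacher
    for it in range(n_iters):
        c = torch.softmax(w, dim=0)                # agreement coefficients
        z_rep = (c.unsqueeze(1) * z).sum(dim=0)    # weighted representing logit
        s = squash(z_rep)
        if it < n_iters - 1:
            w = w + k * (x @ s)                    # +/- agreement with the consensus
    return z_rep
```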
Discussion
• Outperforms the previous distillation model

• Gives accuracy comparable to the teacher models with 8x and 80x smaller embeddings

• Our distillation ensemble approach consistently shows more robust results when the size of the training data is sufficiently large