The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
Bonggun Shin, Hao Yang and Jinho D. Choi

IJCAI 2019
Typical NLP Models
• Input: a sequence of tokens

• Each token is transformed into a vector

• Once vectorized, any neural network can be attached (see the sketch below)

• Output: a class or a sequence (depends on the task)

[Figure: Input -> Embeddings -> CNN/LSTM -> Output]
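For concreteness, here is a minimal PyTorch sketch of this pipeline (the CNN variant; all layer sizes are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Token ids -> embedding lookup -> CNN -> class logits."""
    def __init__(self, vocab_size, emb_dim=400, num_filters=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len) integer ids
        x = self.embedding(token_ids)                 # (batch, seq_len, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, num_filters, seq_len)
        x = x.max(dim=2).values                       # global max pooling over time
        return self.fc(x)                             # class logits; softmax lives in the loss
```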
Why is Compressing an Embedding Beneficial?
• The majority of model space is occupied by word embeddings

• NLP model compression -> embedding compression

• Neural model compression methods

  • Weight pruning

  • Weight quantization

  • Lossless compression

  • Distillation

[Figure: the Embeddings block dominates the Input -> Embeddings -> CNN/LSTM -> Output stack]
Embedding Compression
• V: the vocabulary size

• D: the size of an original embedding vector

• D’: the size of a compressed embedding vector

[Figure: a V x D embedding matrix compressed to a V x D’ matrix]
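A back-of-the-envelope count shows why the embedding dominates. Using the 2.67M-word sentiment vocabulary mentioned later, and assuming an illustrative D = 400 and D' = 50 (the 8-times setting; these exact dimensions are an assumption, not stated on this slide):

```python
V, D, D_small = 2_670_000, 400, 50      # vocabulary size, original dim, compressed dim

original   = V * D                       # parameters in the large embedding table
compressed = V * D_small                 # parameters after compression
print(f"large embedding: {original:,} params")
print(f"small embedding: {compressed:,} params ({original // compressed}x smaller)")
```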
Embedding Encoding
• Proposed by [Mou et al., 2016]: 8-times compression, with performance loss

• Not a teacher-student model: trains a single model with an encoding (projection) layer

• Encoding (projection) layer: pre-trained large embedding -> smaller embedding

• Discards the large embedding at model deployment
[Figure: embedding encoding. Training: WORD INDEX -> LARGE EMBEDDING -> Projection -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX, supervised by GOLD labels. Model Deployment: WORD INDEX -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX, with the large embedding discarded.]
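A minimal sketch of the encoding idea, assuming PyTorch and illustrative names: during training the frozen large embedding feeds a trainable projection; at deployment the projection is folded into a small per-word table and the large table is dropped.

```python
import torch
import torch.nn as nn

class EncodedEmbedding(nn.Module):
    """Frozen large embedding + trainable projection; names are illustrative."""
    def __init__(self, pretrained: torch.Tensor, small_dim: int):
        super().__init__()                               # pretrained: (V, D) matrix
        self.large = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.proj = nn.Linear(pretrained.size(1), small_dim)

    def forward(self, token_ids):
        return self.proj(self.large(token_ids))          # (batch, seq, small_dim)

    def deploy(self):
        """Fold the projection into a (V, D') table and drop the large embedding."""
        with torch.no_grad():
            small_table = self.proj(self.large.weight)
        return nn.Embedding.from_pretrained(small_table, freeze=False)
```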
We Propose
• Embedding Distillation

  • Single teacher

  • Better performance with an 8-times smaller model size

• Distillation Ensemble

  • Ten teachers

  • Better performance with an 80-times smaller model size
Background
Teacher-Student Model
• Teacher trains on (Dataset.X, Dataset.Y); student trains on (Dataset.X, Teacher.Output)

• Logit matching [Ba and Caruana, 2014]

• Noisy logit matching [Sau and Balasubramanian, 2016]

• Softmax tau matching [Hinton et al., 2014]

(loss sketches follow the figure below)
[Figure: three teacher-student setups, each with WORD INDEX inputs to both models. Logit matching: the student LOGIT matches the teacher LOGIT, with GOLD supervising the teacher. Noisy logit matching: NOISE is added to the teacher LOGIT. Softmax tau matching: the student SOFTMAX matches the teacher SOFTMAX.]
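The three objectives differ only in the target the student matches. A hedged sketch (the noise scale sigma and temperature tau are illustrative hyperparameters, not values from these papers):

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    # [Ba and Caruana, 2014]: regress the teacher's raw logits.
    return F.mse_loss(student_logits, teacher_logits)

def noisy_logit_matching_loss(student_logits, teacher_logits, sigma=0.1):
    # [Sau and Balasubramanian, 2016]: perturb the teacher logits with Gaussian noise.
    noise = sigma * torch.randn_like(teacher_logits)
    return F.mse_loss(student_logits, teacher_logits + noise)

def softmax_tau_loss(student_logits, teacher_logits, tau=2.0):
    # [Hinton et al., 2014]: match temperature-softened class distributions.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```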
Dataset
• Seven document classification tasks

• Pre-trained word embeddings (trained by Word2Vec [Mikolov et al., 2013])

• Sentiment analysis (SST-*, MR and CR): Amazon review data (2.6M)

• Others: Wikipedia + New York Times
5 Experiments
5.1 Datasets
All models are evaluated on seven document classification datasets in Table 1. MR, SST-*, and CR are targeted at the task of sentiment analysis while Subj, TREC, and MPQA are targeted at the classifications of subjectivity, question types, and opinion polarity, respectively. About 10% of the training sets are split into development sets for SST-* and TREC, and about 10/20% of the provided resources are divided into development/evaluation sets for the other datasets, respectively.

Dataset                     | C | TRN   | DEV   | TST
MR [Pang and Lee, 2005]     | 2 | 7,684 | 990   | 1,988
SST-1 [Socher et al., 2013] | 5 | 8,544 | 1,101 | 2,210
SST-2 [Socher et al., 2013] | 2 | 6,920 | 872   | 1,821
Subj [Pang and Lee, 2004]   | 2 | 7,199 | 907   | 1,894
TREC [Li and Roth, 2002]    | 6 | 4,952 | 500   | 500
CR [Hu and Liu, 2004]       | 2 | 2,718 | 340   | 717
MPQA [Wiebe et al., 2005]   | 2 | 7,636 | 955   | 2,015

Table 1: Seven datasets used for our experiments. C: number of classes, TRN/DEV/TST: number of instances in the training/development/evaluation set.
5.2 Word Embeddings
For sentiment analysis, raw text from the Amazon Review dataset is used to train word embeddings, resulting in 2.67M word vectors. For the other tasks, combined text from Wikipedia and the New York Times Annotated corpus is used, resulting in 1.96M word vectors. For each group, two sets [...]

[The right column of this paper page is truncated in the source; the visible fragments continue through §5.4 (pre-training the two-layered projection with an autoencoder) and §5.5 (six embedding distillation models compared).]
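As a rough illustration of how such Word2Vec embeddings are produced (the gensim API call is a common way to do this, but the training script, corpus preprocessing, and the 400-dimensional size are assumptions):

```python
from gensim.models import Word2Vec

# Illustrative stand-in corpus; the paper's embeddings come from Amazon Review
# text (2.67M vectors) and Wikipedia + NYT (1.96M vectors).
corpus = [["the", "movie", "was", "great"],
          ["battery", "life", "is", "poor"]]

# vector_size=400 is an assumption based on the truncated excerpt above.
model = Word2Vec(sentences=corpus, vector_size=400, window=5, min_count=1, workers=4)
vec = model.wv["movie"]                  # a 400-dimensional word vector
```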
Embedding Distillation Model
• Train a teacher using (X, Y)

• Distill to a student with a projection to smaller embeddings using (X, Teacher.Logit); see the sketch after the figure below

• Discard the large embedding at model deployment
[Figure: embedding distillation. The trained TEACHER MODEL's LOGIT supervises the student: WORD INDEX -> LARGE EMBEDDING -> Projection -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX. Model deployment keeps only WORD INDEX -> SMALL EMB -> NETWORK -> LOGIT -> SOFTMAX.]
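Putting the pieces together, one distillation pass might look like the following sketch; `student` and `teacher` are models in the style of the earlier sketches (hypothetical names, not the authors' code), and logit matching is chosen among the three objectives for simplicity:

```python
import torch
import torch.nn.functional as F

def distill_epoch(student, teacher, loader, optimizer):
    """One pass of embedding distillation with logit matching (illustrative)."""
    teacher.eval()
    for token_ids, _gold in loader:          # gold labels are not used by the student
        with torch.no_grad():
            target = teacher(token_ids)      # teacher logits as the soft target
        optimizer.zero_grad()
        loss = F.mse_loss(student(token_ids), target)
        loss.backward()
        optimizer.step()
```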
Embedding Distillation Result
• Proposed models outperform ENC

• Pre-training (PT) and a 2-layered projection (2L) boost the performance

• Differences among LM, NLM, and STM are marginal
Distillation Ensemble Model
• Train multiple (for example, 10) teachers

• Calculate a representing logit for each input in dataset X (see the routing procedure below)

• Distill to a student with a projection to smaller embeddings using (X, Representing.Logit)

• Discard the large embedding at model deployment
Representing Logit
[Figure: example class distributions over C1-C3 from Teacher1, Teacher2, and Teacher3, and the resulting representing logits under RAE and RDE]
RDE
Routing by 

Agreement

Ensemble: 

Put more weights

on majority

opinions

Routing by 

Disagreement

Ensemble: 

Put more weights

on minority

opinions
Distillation Ensemble Result
• Teachers: 5 CNN-based and 5 LSTM-based models

• RDE significantly outperforms the teachers if the dataset is sufficiently large

• Enough data samples -> more chance to explore different insights from minority opinions
[Bar charts: accuracy of Teacher vs. RAE vs. RDE on MR (axis 0.780-0.800), SST-1 (0.470-0.500), SST-2 (0.850-0.870), Subj (0.920-0.935), MPQA (0.885-0.905), TREC (0.900-0.940), and CR (0.810-0.850).]
Routing procedure (lines 1-4, which appear to initialize the routing weights w and the per-teacher terms x_t, are cut off in the source). With T teachers producing logits z_1, ..., z_T, and k = +1 for RAE or k = -1 for RDE:

    while n iterations do
        c <- softmax(w)
        z_rep <- sum_{t=1}^{T} c_t * z_t
        s <- squash(z_rep)
        if not last iteration then
            for t in {1, ..., T} do
                w_t <- w_t + k * (x_t · s)
    return z_rep
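A hedged Python sketch of this routing procedure. The squash function follows the capsule-network convention, and initializing x_t as the squashed teacher logits is an assumption, since those lines are truncated above:

```python
import torch

def squash(v, dim=-1):
    # Capsule-style squashing: keeps direction, bounds the norm to [0, 1).
    norm_sq = (v * v).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * v / torch.sqrt(norm_sq + 1e-9)

def routing_ensemble(teacher_logits, k=1, n_iters=3):
    """Representing logit via routing; k=+1 -> RAE (majority), k=-1 -> RDE (minority).

    teacher_logits: (T, C) tensor of logits from T teachers.
    Reconstructed from the truncated algorithm above; the x_t initialization
    and the exact weight update are assumptions.
    """
    z = teacher_logits
    x = squash(z)                                  # assumed: squashed teacher logits
    w = torch.zeros(z.size(0))                     # routing weights, one per teacher
    for it in range(n_iters):
        c = torch.softmax(w, dim=0)                # agreement coefficients
        z_rep = (c.unsqueeze(1) * z).sum(dim=0)    # weighted representing logit
        s = squash(z_rep)
        if it < n_iters - 1:
            w = w + k * (x @ s)                    # +/- agreement with the consensus
    return z_rep
```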
Discussion
• Outperforms the previous distillation model

• Gives accuracy comparable to the teacher models with 8x and 80x smaller embeddings

• Our distillation ensemble approach consistently shows more robust results when the size of the training data is sufficiently large