Meta Pseudo Label
Figure. Meta Pseudo Label
Introduction
Have you heard the term meta-learning? Do you remember the time
when you used pseudo labeling for a Kaggle competition? What if we
combine the two techniques? Today, we will discuss a research paper that aims to combine meta-learning and pseudo labeling for semi-supervised learning.
Before we dive into the paper, let’s take a step back and try to
understand why we need this kind of technique in the first place. When
we build models, we typically minimize a cross-entropy loss; equivalently, we minimize the KL divergence from a target distribution over all possible classes to the distribution predicted by the network.
This target distribution, most of the time, is based on some heuristics.
Specifically, in supervised learning, the target distribution is often a one-
hot vector, or a smoothed version of the one-hot vector, i.e., label
smoothing. In semi-supervised learning, the target distributions, also
known as pseudo labels, are often generated on unlabeled data by a
sharpened or dampened teacher model trained on labeled data.
Even if the target distribution is based on heuristics, our algorithms work really well. I don’t see why that is a problem. No matter what you say, whether you are working with supervised or semi-supervised learning, you have to encode the target distribution in some way, right?
Of course, the target distribution has to be constructed somehow. But just because something works well doesn’t mean it is the best way to do it.
One of the drawbacks of these heuristics is that they cannot adapt to the
learning state of the neural networks being trained.
To this end, the authors propose to meta-learn the target distribution.
They design a teacher model that assigns distributions to input
examples to train the main model, the student model. Throughout the
course of the student’s training, the teacher observes the student’s
performance on a held-out validation set and learns to generate target
distributions so that if the student learns from such distributions, the
student will achieve good validation performance. Since the meta-
learned target distributions play a similar role to pseudo labels, the
authors named this method Meta Pseudo Label (MPL).
Fine. I agree that these heuristics may not be the best ones. But the authors can’t just dismiss them like that. There must be some motivation behind this proposal.
Motivation
For the sake of simplicity, the authors focus on training a C-way classification model parameterized by Θ, such as a neural network, where we minimize the cross-entropy between a target distribution q∗(Y|X) and the model distribution pΘ(Y|X), i.e. the objective written out below.
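Written out explicitly (a standard form, reconstructed here since the original post rendered it as an image), the training objective is

min_Θ E_x [ CE(q∗(Y|x), pΘ(Y|x)) ] = min_Θ E_x [ − Σ_y q∗(y|x) · log pΘ(y|x) ]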
Let’s see how the target distribution is defined in different settings:
• Supervised training: The target distribution is the one-hot vector representing the ground-truth class, i.e., q∗(Y|x) = one-hot(y).
• Knowledge distillation: For each data point x, the distribution predicted by the large model, q_large(Y|x), is taken directly as the target distribution, i.e., q∗(Y|x) = q_large(Y|x).
• Semi-supervised learning: In SSL, a typical solution first employs an existing model qξ (trained on the limited labeled data) to predict the class of each data point in an unlabeled set, and uses that prediction to construct the target distribution in one of two ways (a tiny example follows this list):
Hard label: q∗(Y|x) = one-hot(argmax_y qξ(y|x))
Soft label: q∗(Y|x) = qξ(Y|x)
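As a tiny, illustrative example of the two choices (not from the paper; the values and the name q_xi are made up for a 3-class problem):

import numpy as np

q_xi = np.array([0.7, 0.2, 0.1])               # q_xi(Y|x): the existing model's predicted distribution, C = 3
hard_label = np.eye(len(q_xi))[q_xi.argmax()]  # one-hot(argmax_y q_xi(y|x)) -> [1., 0., 0.]
soft_label = q_xi                              # q*(Y|x) = q_xi(Y|x), used as-is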
MPL is also not the first work to point out that these heuristics are suboptimal. Two popular methods that try to mitigate the issue are listed below (their usual formulas follow the list):
• Label smoothing: It has been found that using one-hot vectors as the target distribution in fully supervised machine translation and in large-scale image classification such as ImageNet can lead to overfitting. To combat this, label smoothing smooths the one-hot distribution by allocating a small uniform weight to all classes.
• Temperature tuning: For both knowledge distillation (KD) and soft-label SSL, it has been found that explicitly introducing a temperature hyper-parameter to modulate the target distribution can be very helpful. The logits l(x) are scaled by a temperature τ: a value of τ < 1 sharpens the distribution and helps speed up training, while τ > 1 smooths the distribution and helps prevent overfitting.
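For concreteness, these two tricks are commonly written as follows (standard forms, reconstructed here since the original post showed them as images; ε is the smoothing factor and C the number of classes):

Label smoothing: q∗(y|x) = (1 − ε) · one-hot(y) + ε / C

Temperature tuning: q∗(Y|x) = softmax(l(x) / τ)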
In short, these heuristic constructions of the target distribution share an intrinsic limitation: they are fixed rules that cannot adapt to the learning state of the network being trained.
However, instead of hand-crafting yet better heuristics for the target distribution, we should look for a generic and systematic method that can modify the target distribution used by an existing algorithm and thereby improve its performance. This is what MPL tries to do.
Meta Pseudo Labels Algorithm
MPL tries to learn the target distribution q∗(x) throughout the course of training pΘ. The target distribution q∗(x) is parameterized as qΨ(x), where Ψ is trained using gradient descent. For simplicity, qΨ is taken to be a classification model that assigns conditional probabilities to the classes of each input example x (a compact sketch of the resulting objective follows).
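In other words, MPL can be read as a bilevel optimization problem. The following is a compact sketch in the article's notation, a reconstruction consistent with the description above rather than a verbatim equation from the paper:

min_Ψ E_(x_val, y_val) [ CE(y_val, pΘ∗(Ψ)(x_val)) ]   where   Θ∗(Ψ) = argmin_Θ E_x [ CE(qΨ(x), pΘ(x)) ]

That is, the teacher parameters Ψ are chosen so that a student trained on the teacher's targets performs well on the held-out validation set.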
Here qΨ(x) serves the same role as q_large in knowledge distillation or
qξ in SSL, as qΨ(x) provides the pseudo labels for pΘ to learn. Due to
this similarity, the authors call qΨ the teacher model and pΘ the student
model.
As can be seen in the figure above, an MPL update consists of two phases (written out as Equations 1 and 2 right after this list):
• Phase 1 – The student learns from the teacher: Given a single input example x, the teacher qΨ produces the conditional class distribution qΨ(x) to train the student. The pair (x, qΨ(x)) is then shown to the student, which updates its parameters by backpropagating the cross-entropy loss. For instance, if Θ is trained with SGD with a learning rate of η, we obtain Equation 1 below.
• Phase 2 – The teacher learns from the student’s validation loss: After the student updates its parameters as in Equation 1, its new parameters Θ(t+1) are evaluated on an example (x_val, y_val) from the held-out validation set using the cross-entropy loss L(y_val, pΘ(t+1)(x_val)). Since Θ(t+1) depends on Ψ via Equation 1, this validation cross-entropy loss is a function of Ψ. Dropping (x_val, y_val) from the notation for readability, we obtain Equation 2 below.
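Reconstructed in the article's notation (the original post rendered these as images), the two updates read:

Equation 1 (student update): Θ(t+1) = Θ(t) − η · ∇Θ CE(qΨ(x), pΘ(t)(x))

Equation 2 (teacher objective and update): R(Ψ) = CE(y_val, pΘ(t+1)(x_val)),   Ψ(t+1) = Ψ(t) − η_Ψ · ∇Ψ R(Ψ)

Here η_Ψ denotes the teacher's learning rate, a symbol introduced for this sketch.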
Hold on a sec! The algorithm looks fine, but I have a doubt. We are not using the original labels at all. It looks like the pseudo labels generated by the teacher are enough to drive the updates of both models, right?
Although the student’s performance on the validation set allows the teacher to adjust its parameters accordingly, the authors found that this signal alone is not sufficient to train the teacher. If we rely solely on the student signal, by the time the teacher has seen enough evidence to generate meaningful target distributions, the student may already have started overfitting. A similar shortcoming has been observed when training neural machine translation models with reinforcement learning (RL).
To overcome this, the authors also train the teacher with supervised learning. How? At each training step, apart from the MPL update in Equation 2, the teacher computes a gradient ∇Ψ L(y, qΨ(x)) on a pair of labelled data (x, y). This gradient is then added to the MPL gradient ∇Ψ R from Equation 2 to update the teacher’s parameters Ψ. (A compact code sketch of one full MPL step follows.)
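To make the two phases concrete, here is a minimal, illustrative sketch of one MPL training step in PyTorch. It is not the authors' implementation: it assumes PyTorch ≥ 2.0 (for torch.func.functional_call), a plain one-step SGD update for the student as in Equation 1, and hypothetical names such as mpl_step, x_unlab, x_lab, and x_val.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def mpl_step(teacher, student, teacher_opt, x_unlab, x_lab, y_lab, x_val, y_val, eta=0.1):
    # Phase 1 (Equation 1): the student learns from the teacher's soft pseudo labels.
    q = F.softmax(teacher(x_unlab), dim=-1)               # q_Psi(x); keeps the graph back to Psi
    student_loss = F.cross_entropy(student(x_unlab), q)   # CE(q_Psi(x), p_Theta(x)) with soft targets

    params = dict(student.named_parameters())
    grads = torch.autograd.grad(student_loss, tuple(params.values()), create_graph=True)
    theta_new = {n: p - eta * g for (n, p), g in zip(params.items(), grads)}  # Theta(t+1), still a function of Psi

    # Phase 2 (Equation 2): the teacher learns from the student's validation loss...
    val_logits = functional_call(student, theta_new, (x_val,))
    val_loss = F.cross_entropy(val_logits, y_val)          # R(Psi)

    # ...plus the auxiliary supervised gradient on a labelled pair (x, y).
    sup_loss = F.cross_entropy(teacher(x_lab), y_lab)

    teacher_opt.zero_grad()
    (val_loss + sup_loss).backward()                       # grad_Psi R + grad_Psi L(y, q_Psi(x))
    teacher_opt.step()

    # Commit the student's SGD step from Equation 1 and clear its incidental gradients.
    with torch.no_grad():
        for n, p in student.named_parameters():
            p.copy_(theta_new[n])
    student.zero_grad(set_to_none=True)

    return float(student_loss), float(val_loss)

In practice the updates run over mini-batches and alternate at every step; the sketch only illustrates how the teacher receives a gradient through the student's one-step update.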
Though this provides a sufficient signal to train the teacher, it poses another problem: we need to keep two classification models, the teacher and the student, in memory. For smaller models like ResNets this is not a big issue, but with larger models like EfficientNets it limits the batch size and slows down training. To mitigate this, the authors designed a lighter alternative, ReducedMPL.
In ReducedMPL, a large model q_large is first trained to convergence. This model is then used to pre-compute all target distributions for the student’s training data. Importantly, up to this step the student model has not been loaded into memory, which avoids the large memory footprint of MPL. A reduced teacher qΨ is then parameterized as a small and efficient network, such as a multi-layer perceptron, and trained along with the student. This reduced teacher takes as input the distribution predicted by the large teacher q_large and outputs a calibrated distribution for the student to learn from (a small sketch follows).
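A minimal sketch of what such a reduced teacher could look like, assuming PyTorch; the class name ReducedTeacher and the layer sizes are illustrative choices, not taken from the paper:

import torch
import torch.nn as nn

class ReducedTeacher(nn.Module):
    """Maps the large teacher's pre-computed distribution q_large(x) to a calibrated target distribution."""
    def __init__(self, num_classes: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, q_large: torch.Tensor) -> torch.Tensor:
        # q_large: (batch, num_classes) probabilities from the converged large model.
        return torch.softmax(self.net(q_large), dim=-1)

# Usage sketch: targets = ReducedTeacher(num_classes=10)(q_large_batch)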
Experimental Results
MPL is tested in two scenarios:
• where only limited labeled data is available, and
• where the full labeled dataset is used.
In both scenarios, experiments were performed on CIFAR-10, SVHN, and ImageNet. For the full-dataset experiments, ReducedMPL was used due to the large memory footprint of MPL. (I won’t reproduce all the experimental details here; please check the paper.)
Analysis
• MPL is not label correction: Since the teacher in MPL provides the target distribution for the student to learn from, and observes the student’s performance to improve itself, it is intuitive to think that the teacher is simply trying to guess the correct labels for the student. To test this, the authors plotted the training accuracy of three models: a purely supervised model, the teacher model, and the student model.
As the plot shows, the training accuracy of both the teacher and the student stays relatively low, while the training accuracy of the supervised model reaches 100% much earlier. If MPL were simply performing label correction, these accuracies should have been high. Instead, the teacher in MPL appears to regularize the student to prevent overfitting, which is crucial when only a limited labeled dataset is available.
• MPL is not only a regularization strategy: Since, as noted above, the teacher in MPL seems to regularize the student, it is easy to think that the teacher might merely be injecting noise so that the student doesn’t overfit. It turns out that isn’t the case. If the teacher were injecting noise into the student’s training, it could do so in two ways: 1) by flipping the target class, e.g., telling the student that an image of a car is an image of a horse, or 2) by dampening the target distribution. To check this, the authors visualized a few target distributions predicted by the teacher model in ReducedMPL for images from the TinyImages dataset.
In that visualization, the label with the highest confidence does not change across the quarters of the student’s training process, which means the teacher has not flipped the target labels. Moreover, the target distributions that the teacher predicts become sharper between 50% and 75% of training; if the teacher simply wanted to regularize the student, it should dampen the distributions at this stage, since the student is still learning.
This observation is sufficient to show that MPL is not merely a regularization method.
Conclusion
Overall, the paper is well written, and the ideas are clear and thought-provoking. MPL makes use of both labeled and unlabeled data, which makes it a typical SSL algorithm; the significant difference between MPL and other SSL methods is that the teacher model receives learning signals from the student’s performance, and hence can adapt to the student’s learning state throughout the student’s training.
To learn more about data science algorithms, visit the InsideAIML page.