PR-343: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision (Sungchul Kim)
This document proposes a new semi-supervised learning method called Cross Pseudo Supervision (CPS) for semantic segmentation. CPS trains two segmentation networks simultaneously, where each network generates pseudo labels for the other using its own predictions. It combines this cross pseudo supervision loss with the CutMix data augmentation technique. The document evaluates CPS on the PASCAL VOC 2012 and Cityscapes datasets with limited labeled data, finding that it outperforms other semi-supervised segmentation methods; ablations confirm the contribution of each component. Qualitative results show that CPS with CutMix generates more accurate segmentations than the alternatives.
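As a sketch of the cross pseudo supervision idea, here is a minimal NumPy version for a batch of class logits (the hard argmax pseudo labels and the symmetric two-way supervision follow the description above; the equal weighting of the two directions and the flattened per-sample view are simplifications of this toy version):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cps_loss(logits_a, logits_b):
    """Cross pseudo supervision on one unlabeled batch.

    Each network's hard pseudo label (argmax) supervises the other
    network's soft prediction via cross-entropy; in a real implementation
    gradients flow only through the supervised (soft) side.
    """
    probs_a = softmax(logits_a)          # (N, C) predictions of network A
    probs_b = softmax(logits_b)          # (N, C) predictions of network B
    pseudo_a = probs_a.argmax(axis=-1)   # hard labels produced by A
    pseudo_b = probs_b.argmax(axis=-1)   # hard labels produced by B
    n = np.arange(len(pseudo_a))
    # A is supervised by B's pseudo labels, and vice versa.
    loss_a = -np.log(probs_a[n, pseudo_b] + 1e-12).mean()
    loss_b = -np.log(probs_b[n, pseudo_a] + 1e-12).mean()
    return loss_a + loss_b
```

When the two networks agree confidently the loss is near zero; disagreement is penalized heavily, which is what pushes the two networks toward consistent predictions on unlabeled data.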
Revisiting the Calibration of Modern Neural Networks (Sungchul Kim)
The document discusses calibration in modern neural networks. It finds that some recent model families, like MLP-Mixers and Vision Transformers (ViTs), are both highly accurate and well-calibrated on in-distribution and out-of-distribution image datasets. Temperature scaling can improve calibration in older models but the newest architectures remain better calibrated. Model size and pretraining amount do not fully explain differences in calibration between families. Architecture appears to be a major factor, with non-convolutional models tending to have better calibration properties.
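A minimal sketch of the two tools such calibration comparisons rely on, temperature scaling and expected calibration error (the equal-width binning shown is one common choice, not necessarily the paper's exact protocol):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Post-hoc calibration: soften (T > 1) or sharpen (T < 1) predictions."""
    return softmax(logits / T)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between confidence and accuracy over equal-width bins,
    weighted by the fraction of samples falling into each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

A perfectly calibrated model has ECE 0: among samples predicted with 80% confidence, exactly 80% are correct.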
Emerging Properties in Self-Supervised Vision Transformers (Sungchul Kim)
The document summarizes the DINO self-supervised learning approach for vision transformers. DINO uses a teacher-student framework where the teacher's predictions supervise the student through knowledge distillation. Two global views and several local views of an image are passed through the student, while only the global views are passed through the teacher. The student is trained to match the teacher's predictions on the global views from every view, encouraging local-to-global correspondence. DINO achieves state-of-the-art results on ImageNet under linear evaluation and transfers well to downstream tasks. It also enables vision transformers to discover object boundaries and semantic layouts.
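A toy version of the distillation objective might look like the following (the temperature values and the centering vector are illustrative placeholders for DINO's hyperparameters and its running mean of teacher outputs):

```python
import numpy as np

def softmax(z, tau):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / tau)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and
    the student distribution for one (student view, teacher view) pair.
    Centering and the low teacher temperature are the collapse-avoidance
    mechanisms described for DINO; the teacher side is treated as constant.
    """
    p_t = softmax(teacher_logits - center, tau_t)  # teacher: centered + sharp
    p_s = softmax(student_logits, tau_s)
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()
```

The full objective averages this loss over all student-view / teacher-global-view pairs, and the teacher weights are updated as a moving average of the student's.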
PR-305: Exploring Simple Siamese Representation Learning (Sungchul Kim)
SimSiam is a self-supervised learning method that uses a Siamese network with stop-gradient to learn representations from unlabeled data. The paper finds that stop-gradient plays an essential role in preventing the model from collapsing to a degenerate solution. Additionally, it is hypothesized that SimSiam implicitly optimizes an Expectation-Maximization-like algorithm that alternates between updating the network parameters and assigning representations to samples in a manner analogous to k-means clustering.
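The symmetrized negative-cosine objective can be sketched as follows; stop-gradient cannot be expressed in plain NumPy, so a comment marks where the targets would be detached:

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z) = -cos(p, z). In SimSiam, z comes from the other branch and is
    treated as a constant (stop-gradient), so only the predictor output p
    would receive gradient."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return -(p * z).sum(axis=-1).mean()

def simsiam_loss(p1, p2, z1, z2):
    # Symmetrized loss over the two augmented views; z1 and z2 are assumed
    # to be already detached (this is the essential stop-gradient).
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

Without the stop-gradient on z1/z2, both branches can minimize the loss by collapsing to a constant output, which is exactly the degenerate solution the paper analyzes.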
Score-based Generative Modeling through Stochastic Differential Equations (Sungchul Kim)
This document discusses score-based generative modeling using stochastic differential equations (SDEs). It introduces modeling data diffusion as an SDE from the data distribution to a simple prior and generating samples by reversing this diffusion process. It also describes estimating the score (gradient of the log probability density) needed for the reverse process using score matching. Finally, it notes that noise perturbation models like NCSN and DDPM can be viewed as discretizations of specific SDEs called variance exploding and variance preserving SDEs.
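A generic Euler-Maruyama sampler for the reverse-time SDE can be sketched as follows (the zero drift corresponds to the variance-exploding case; `score_fn` and the diffusion coefficient `g` are supplied by the caller, and the toy score used in the test below is a stand-in for a trained score network):

```python
import numpy as np

def reverse_sde_sample(score_fn, g, x_T, T=1.0, n_steps=500, seed=0):
    """Euler-Maruyama discretization of the reverse-time SDE with drift
    f(x, t) = 0 (variance-exploding case):
        x <- x + g(t)^2 * score(x, t) * dt + g(t) * sqrt(dt) * z,
    stepping t from T down to 0, where score(x, t) approximates
    grad_x log p_t(x)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt
        x = (x + (g(t) ** 2) * score_fn(x, t) * dt
               + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape))
    return x
```

The score term drags samples toward high-density regions of the data distribution while the noise term keeps the trajectory stochastic; NCSN's annealed Langevin dynamics and DDPM's ancestral sampling are specific discretizations of this reverse process.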
Exploring Simple Siamese Representation Learning (Sungchul Kim)
This document discusses an unsupervised representation learning method called SimSiam. It proposes that SimSiam can be interpreted as an expectation-maximization algorithm that alternates between updating the encoder parameters and assigning representations to images. Key aspects discussed include how the stop-gradient operation prevents collapsed representations, the role of the predictor network, effects of batch size and batch normalization, and alternatives to the cosine similarity measure. Empirical results show that SimSiam learns meaningful representations without collapsing, and the various design choices affect performance but not the ability to prevent collapsed representations.
Revisiting the Sibling Head in Object Detector (Sungchul Kim)
This document describes a method called Task-aware Spatial Disentanglement (TSD) for object detection. TSD uses separate branches to process features for the classification and localization tasks in a spatially disentangled manner. For classification, TSD applies pointwise deformations to the feature map. For localization, it applies proposal-wise translations. This allows each task to process features with spatial sensitivities suitable for their goals, improving performance over methods that do not separate spatial processing for different tasks. TSD achieves state-of-the-art object detection accuracy on standard benchmarks.
This document summarizes the DeepLab models for semantic image segmentation: DeepLab v1 used atrous convolution with VGG-16 as the backbone network. DeepLab v2 improved on this with atrous spatial pyramid pooling and added ResNet-101 as an option. DeepLab v3 removed dense CRFs and introduced multi-grid atrous convolution and bootstrapping. DeepLab v3+ uses an encoder-decoder architecture with Xception or ResNet-101 as the backbone and atrous separable convolutions.
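The atrous (dilated) convolution shared by all DeepLab versions can be illustrated in 1-D (a minimal sketch, not a library implementation):

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution: kernel taps are spaced `rate`
    samples apart, enlarging the receptive field without adding parameters
    or reducing resolution. rate=1 reduces to an ordinary 'valid'
    convolution (cross-correlation form)."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # effective kernel extent
    out_len = len(x) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out
```

Atrous spatial pyramid pooling (DeepLab v2 onward) simply applies several such convolutions with different rates in parallel and merges the results, capturing context at multiple scales.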
This document summarizes a paper on GoogLeNet, a convolutional neural network that won the 2014 ILSVRC competition. It discusses how GoogLeNet uses inception layers and network-in-network dimensionality reduction to achieve higher accuracy than AlexNet with 12x fewer parameters. The document outlines GoogLeNet's architectural details, training methodology using stochastic gradient descent, and results in classification and detection tracks at ILSVRC 2014. It also references related work on R-CNN, Network-in-Network, and AlexNet.
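The parameter savings from network-in-network-style 1x1 reduction can be seen with a quick count (the channel sizes below are illustrative, not GoogLeNet's exact configuration):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution layer (biases omitted)."""
    return c_in * c_out * k * k

# A direct 5x5 convolution from 192 channels to 32 maps, versus first
# reducing to 16 channels with a 1x1 convolution before the 5x5.
direct = conv_params(192, 32, 5)                            # 153,600 weights
reduced = conv_params(192, 16, 1) + conv_params(16, 32, 5)  # 15,872 weights
```

The bottlenecked path uses roughly a tenth of the weights, which is how inception modules stack many parallel filter sizes while staying far smaller than AlexNet.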
Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression (Sungchul Kim)
1. The document proposes a Generalized Intersection over Union (GIoU) as a new loss function for bounding box regression that addresses limitations of the standard Intersection over Union (IoU) metric.
2. GIoU extends IoU by also considering the minimum enclosing box for two bounding boxes, not just their intersection. This allows GIoU to better measure the distance between two boxes even if they do not intersect.
3. Experimental results on object detection models like YOLO v3, Faster R-CNN and Mask R-CNN show that using GIoU as the regression loss provides stronger correlation with IoU and improves model performance compared to using IoU directly as the loss.
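The GIoU computation described above can be written directly (a sketch for axis-aligned boxes in (x1, y1, x2, y2) form; degenerate zero-area boxes are not handled):

```python
def giou(box_a, box_b):
    """GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest box
    enclosing both A and B. Unlike IoU, GIoU stays informative (becomes
    negative) when the boxes do not overlap at all."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch
    return iou - (c_area - union) / c_area
```

The loss used for regression is then 1 - GIoU, which, unlike 1 - IoU, still provides a gradient signal for disjoint boxes.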
1. This document discusses panoptic segmentation, which aims to perform semantic segmentation and instance segmentation simultaneously.
2. It introduces the panoptic segmentation task format, which assigns each pixel a semantic label and instance ID. It also presents a new evaluation metric called Panoptic Quality (PQ) to evaluate panoptic segmentation.
3. Baseline experiments are discussed that combine existing semantic and instance segmentation models. The results show current machine performance is still below human consistency levels based on a human annotation study. Future work to develop end-to-end panoptic segmentation models is suggested.
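The PQ metric can be computed from the matched-segment IoUs and the error counts (matching predicted to ground-truth segments at IoU > 0.5 is assumed to have been done beforehand):

```python
def panoptic_quality(tp_ious, n_fp, n_fn):
    """PQ = (sum of IoUs over matched pairs) / (|TP| + 0.5|FP| + 0.5|FN|).
    A predicted and a ground-truth segment count as a true-positive match
    when their IoU exceeds 0.5. PQ factors into segmentation quality
    (mean TP IoU) times recognition quality (an F1-style detection score)."""
    tp = len(tp_ious)
    denom = tp + 0.5 * n_fp + 0.5 * n_fn
    if denom == 0:
        return 0.0
    return sum(tp_ious) / denom
```

Because the IoU > 0.5 threshold makes matches unique, PQ is well defined per class and is averaged over classes in the benchmark.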
On the Variance of the Adaptive Learning Rate and Beyond (Sungchul Kim)
This document discusses the variance of the adaptive learning rate in optimizers such as Adam. It is hypothesized that the large variance of the adaptive learning rate early in training can cause the optimizer to get stuck in bad local optima. The document proposes rectifying the adaptive learning rate to reduce this initial variance, achieving an effect similar to learning-rate warmup but without extra hyperparameters to tune. Experiments show that the rectified method outperforms vanilla adaptive methods and warmup in convergence and in robustness to learning-rate changes.
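A sketch of the variance rectification term (the expression and the threshold of 4 are reproduced from the RAdam paper's description and should be checked against it before use):

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Returns (use_adaptive, r_t) for step t >= 1. When the approximated
    simple-moving-average length rho_t is small (<= 4), the adaptive term
    is considered too high-variance and the update falls back to
    un-adapted momentum; otherwise the adaptive step is scaled by
    r_t < 1, which warms up toward 1 as t grows."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return False, 0.0
    r_t = math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return True, r_t
```

With beta2 = 0.999 the first few steps are non-adaptive, mimicking a built-in warmup that requires no tuned schedule.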
A Benchmark for Interpretability Methods in Deep Neural Networks (Sungchul Kim)
This document proposes two interpretability evaluation frameworks: ROAR (RemOve And Retrain) and KAR (Keep And Retrain). ROAR evaluates feature importance by removing salient features from inputs based on a trained model's saliency map, regenerating training and test data, and retraining the model. KAR keeps salient features and retrains instead of removing. The document introduces common interpretability methods like gradients, integrated gradients, and ensembling methods. It then explains the ROAR and KAR mechanisms and provides example results, concluding the frameworks can validate feature importance estimates and model reliability.
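The feature-removal step at the core of ROAR can be sketched as follows (the constant fill value is a simplification; in practice removed pixels are commonly replaced with a per-channel mean so the input statistics stay plausible):

```python
import numpy as np

def remove_top_k(images, saliency, k_frac, fill_value=0.0):
    """ROAR-style degradation: replace the k_frac most-salient pixels of
    each image (according to the estimator's saliency map) with a constant,
    producing the modified dataset on which the model is then retrained.
    KAR is the complement: keep the top pixels and remove the rest."""
    out = images.copy()
    n, h, w = images.shape
    k = int(k_frac * h * w)
    flat_sal = saliency.reshape(n, -1)
    # indices of the k most salient pixels per image
    top = np.argsort(flat_sal, axis=1)[:, -k:]
    for i in range(n):
        out.reshape(n, -1)[i, top[i]] = fill_value
    return out
```

If accuracy after retraining on the degraded data drops sharply, the saliency method really did identify informative features; if it barely drops, the "important" pixels were not actually needed.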
KDGAN: Knowledge Distillation with Generative Adversarial Networks (Sungchul Kim)
This document discusses a method called KDGAN that combines knowledge distillation and generative adversarial networks. KDGAN trains a lightweight classifier (student) using knowledge distilled from a larger teacher model and a discriminator that learns the true data distribution. The classifier, teacher, and discriminator are trained together in an adversarial framework where the classifier learns to mimic the teacher's outputs and fool the discriminator, while the discriminator learns to distinguish the classifier's outputs from real data. This equilibrium allows the classifier to learn the true data distribution effectively from limited data with privileged information only available during training. The document outlines the KDGAN training method and framework and provides examples of compressing deep models and performing image tag recommendation.
This document discusses the design of neural network architectures. It introduces RegNet, a design space that uses a simple linear parameterization to generate "regular" networks. The document analyzes the RegNet design space and compares it to existing networks. Key findings include that optimal depth is around 20 blocks, a bottleneck ratio of 1 is best, and a width multiplier of around 2.5 works well. RegNet models achieve state-of-the-art or competitive results across mobile, standard, and full network regimes.
Search to Distill: Pearls are Everywhere but not the Eyes (Sungchul Kim)
This document discusses architecture-aware knowledge distillation (AKD), which uses reinforcement learning to find the optimal student network architecture for distilling knowledge from a given teacher model. It introduces AKD and how it guides the search for best architectures under latency constraints. Preliminary experiments show AKD networks achieve state-of-the-art results on ImageNet classification and generalize well to other tasks like face recognition and ensemble learning.
Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Sungchul Kim)
This document summarizes a paper on Bootstrap Your Own Latent (BYOL), a self-supervised learning method that does not use negative pairs. BYOL trains an online network to predict a target network's representation of the same image under a different data augmentation. The loss is the mean squared error between the normalized prediction and target projection. BYOL achieves state-of-the-art performance on several image classification benchmarks without negative pairs by bootstrapping representations from its own augmented views. Ablation studies show BYOL is robust to different augmentations and batch sizes but requires careful tuning of the target network update rate.
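The target network is not trained by gradients; it tracks the online network as an exponential moving average. A minimal sketch:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """BYOL target update: theta_target <- tau * theta_target
    + (1 - tau) * theta_online, applied per parameter tensor. The update
    rate tau is the hyperparameter the ablations found needs careful
    tuning (tau = 1 freezes the target; tau = 0 copies the online net)."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

Because the target lags the online network, the prediction task is a slowly moving one, which is what lets BYOL avoid collapse without contrasting against negatives.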
Regularizing Class-wise Predictions via Self-knowledge Distillation (Sungchul Kim)
- The document proposes a method called Class-wise Self-Knowledge Distillation (CS-KD) to improve the generalization of deep neural networks.
- CS-KD matches the predictive distributions of different samples from the same class, encouraging consistent predictions for samples of the same class.
- Experiments on image classification datasets show CS-KD reduces error rates compared to other regularization methods and self-distillation baselines. It also improves calibration and works well with other techniques.
- By minimizing the KL divergence between predictive distributions of same-class samples, CS-KD enhances generalization and calibration of neural networks.
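The class-wise regularizer can be sketched as a temperature-softened KL term between two samples of the same class (the temperature value is illustrative; the "teacher" distribution would be detached from the gradient in a real implementation):

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / tau)
    return e / e.sum(axis=-1, keepdims=True)

def cskd_loss(logits, logits_same_class, tau=4.0):
    """Class-wise self-KD: KL divergence from the predictive distribution
    of another sample of the same class (acting as the teacher, held
    constant) to the distribution of the current sample."""
    p_teacher = softmax(logits_same_class, tau)
    p_student = softmax(logits, tau)
    kl = (p_teacher * (np.log(p_teacher + 1e-12)
                       - np.log(p_student + 1e-12))).sum(axis=-1)
    return kl.mean()
```

Pushing same-class predictive distributions together both smooths overconfident outputs (helping calibration) and acts as a regularizer, without needing a separate teacher model.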
Unsupervised Data Augmentation for Consistency Training (Sungchul Kim)
This document discusses semi-supervised learning and unsupervised data augmentation (UDA). It begins by explaining techniques in semi-supervised learning like entropy minimization and consistency regularization. It then introduces UDA, which trains models to be less sensitive to noise by minimizing the divergence between predictions on original and augmented data. The document reports on experiments applying UDA and comparing it to other methods on image datasets, finding it achieves better performance. It also explores techniques like training signal annealing and discusses ablation studies.
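A minimal sketch of the unsupervised consistency term with confidence-based masking (the threshold value is illustrative, and the original prediction is treated as fixed, i.e. no gradient flows through it):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def uda_consistency(logits_orig, logits_aug, conf_threshold=0.8):
    """KL divergence from the (fixed) prediction on the original example to
    the prediction on its strong augmentation, masked to examples the
    model is already confident about (a simple stand-in for UDA's
    confidence-based masking)."""
    p = softmax(logits_orig)          # treated as constant (no gradient)
    q = softmax(logits_aug)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    mask = p.max(axis=-1) >= conf_threshold
    return (kl * mask).mean()
```

This unsupervised term is added to the usual supervised cross-entropy on the labeled subset; training signal annealing then gradually releases labeled examples to avoid overfitting the small labeled set.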
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
2. Contents
1. Class Activation Map (B. Zhou et al., 2015.)
2. Guided Backpropagation(J. T. Springenberg et al., 2015.)
3. Grad-CAM
3. Class Activation Map (B. Zhou et al., 2015)
• Global Average Pooling (M. Lin et al., 2014)
• Weighted sum over the last conv layer's feature maps, using the weights corresponding to the class
• Highlights the regions where that class is activated!
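In code, CAM is just this weighted sum (a minimal sketch):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM: weighted sum of the last conv layer's feature maps, using the
    fully-connected weights of the target class (the weights learned on
    top of global average pooling). High values mark the spatial regions
    that drive the class score."""
    # feature_maps: (K, H, W), class_weights: (K,) -> result: (H, W)
    return np.tensordot(class_weights, feature_maps, axes=1)
```

Grad-CAM generalizes this by replacing the fully-connected class weights with gradient-derived channel weights, so it applies to architectures without a global-average-pooling head.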
4. Guided Backpropagation (J. T. Springenberg et al., 2015)
• Standard backpropagation gates gradients by the forward pass's activations (gradients flow only where the forward ReLU was active)
• deconvnet (M. D. Zeiler et al., 2014) instead gates by the gradient values themselves (only positive gradients pass)
• Guided Backpropagation combines both: backpropagation + deconvnet
13. Reference
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization", in ICCV, 2017.
- R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh and D. Batra, "Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization", in CoRR, abs/1610.02391, 2016.
- B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba, "Learning Deep Features for Discriminative Localization", in CVPR, 2016.
- M. Lin, Q. Chen and S. Yan, "Network In Network", in ICLR, 2014.
- J. T. Springenberg, A. Dosovitskiy, T. Brox and M. Riedmiller, "Striving for Simplicity: The All Convolutional Net", in ICLR, 2015.
- M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks", in ECCV, 2014.