[Ph.D. Defense]
Learning Visual Representations from Uncurated Data
2023.06.01.
Presenter: Sangwoo Mo (KAIST)
Committee:
Profs. Jinwoo Shin, In So Kweon, Junmo Kim, Eunho Yang, Jaegul Choo
1
• I’m a 5th year Ph.D. student at KAIST, focusing on machine learning
• Education
• Ph.D. in Electrical Engineering, KAIST (advisor: Jinwoo Shin)
• B.S. in Mathematics, POSTECH (summa cum laude)
• Work experience
• Research Intern, NVIDIA AI (Oct 2022 - Jan 2023)
• Research Intern, Meta AI (Jun 2022 - Sep 2022)
• External Collaborator, Naver AI (Jun 2021 - Sep 2021)
• External Collaborator, Google Cloud AI (Nov 2020 - May 2021)
• Research Intern, Kakao Brain (Feb 2019 - May 2019)
Who am I?
2
• My research goal is to build a representation of the world. To this end, I worked on:
What have I worked on?
3
Lead-authored publications
[1] Mo et al. InstaGAN: Instance-aware Image-to-Image Translation. ICLR’19.
[2] Mo et al. Mining GOLD Samples for Conditional GANs. NeurIPS’19.
[3] Mo et al. Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs. CVPRW’20.
[4] Tack*, Mo* et al. CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. NeurIPS’20.
[5] Moon*, Mo* et al. MASKER: Masked Keyword Regularization for Reliable Text Classification. AAAI’21.
[6] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21.
[7] Yu*, Tack*, Mo* et al. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. ICLR’22.
[8] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW'22.
[9] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23.
[10] Kim*, Mo* et al. Bias-to-Text: Debiasing Unknown Visual Biases through Language Interpretation. Under Review.
[11] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review.
Co-authored publications
[12] Lookahead: A Far-sighted Alternative of Magnitude-based Pruning. ICLR’20.
[13] Layer-adaptive Sparsity for the Magnitude-based Pruning. ICLR’21.
[14] Abstract Reasoning via Logic-guided Generation. ICML’21.
[15] Deep Neural Network Approach in Electrical Impedance Tomography-Based Real-Time Soft Tactile Sensor. IROS’19.
[16] Deep Neural Network Based Electrical Impedance Tomographic Sensing Methodology for Large-Area Robotic Tactile Sensing. TRO’21.
[17] Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling. TMLR’23.
[18] Diffusion Probabilistic Models for Structured Node Classification. Under Review.
Generative models
“What does the world look like?”
Representation learning
“What is the essence of the world?”
Robust and fair models
“How to deploy models in the real world?”
InstaGAN (ICLR’19) [1]
GOLD (NeurIPS’19) [2]
FreezeD (CVPRW’20) [3]
DIGAN (ICLR’22) [7]
OACon (NeurIPS’21) [6]
OAMixer (CVPRW’22) [8]
CSI (NeurIPS’20) [4]
RoPAWS (ICLR’23) [9]
S-CLIP (Under Review) [11]
MASKER (AAAI’21) [5]
B2T (Under Review) [10]
Scope of the thesis
4
• Visual representation learning has made remarkable progress in recent years
Learning from uncurated data
5
Image from https://paperswithcode.com/sota/image-classification-on-imagenet
• Still, there are many challenges in deploying models on real-world datasets!
• Real images contain multiple objects, are drawn from multiple distributions, have imbalanced classes, etc.
Learning from uncurated data
6
Image from https://liuziwei7.github.io/projects/LongTail.html
• Representation learning from multi-object images, disentangling objects and background
• Semi-supervised learning when labeled and unlabeled data are drawn from different distributions
Outline of the thesis
7
• Representation learning from multi-object images, disentangling objects and background
• OACon (NeurIPS’21) [1]
• Fix contrastive learning to disentangle objects and background
• OAMixer (CVPRW’22) [2]
• Fix ViT (and Mixer) to disentangle objects and background
• Semi-supervised learning when labeled and unlabeled data are drawn from different distributions
• RoPAWS (ICLR’23) [3]
• Semi-supervised image classification when unlabeled images contain unseen classes
• S-CLIP (Under Review) [4]
• Semi-supervised vision-language pre-training when unlabeled images contain unseen captions
Outline of the thesis
8
[1] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21.
[2] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW’22.
[3] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23.
[4] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review.
Object-aware Contrastive Learning for
Debiased Scene Representation
9
Joint work with Hyunwoo Kang* (equal), Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin
(collaborated w/ Google Cloud AI, NeurIPS’21)
• Self-supervised learning – joint embedding approach
• The joint embedding approach makes different views of the same image invariant in the embedding space
• It is successful in visual recognition, particularly for image classification
Motivation
10
Image from https://wikidocs.net/164357
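To make the joint-embedding idea concrete, below is a minimal InfoNCE-style contrastive loss over two augmented views of the same batch. It is a generic sketch rather than the exact loss of any specific paper; the embedding dimension, temperature, and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, tau=0.1):
    """Generic InfoNCE loss: the two views of the same image are positives,
    all other images in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau            # (B, B) cosine similarities
    labels = torch.arange(z1.size(0))     # positive pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage: B=8 images, 128-dim embeddings from two augmented views
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```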
• Self-supervised learning – joint embedding approach
• In the image domain, random crop augmentation plays a crucial role in representation learning
Motivation
11
Image from Chen et al., “A Simple Framework for Contrastive Learning of Visual Representations,” ICML’20
Random Crop
• Issues of random crop
• However, entangling randomly cropped views may cause scene bias
• Contextual bias: Entangle the features of different objects (for multi-object images)
• Background bias: Entangle the features of object and background
Motivation
12
• Issues of random crop
• Contextual bias: Entangle the features of different objects
• It harms the discriminative power of the learned representation (e.g., cannot classify giraffe vs. zebra)
Motivation
13
• Issues of random crop
• Background bias: Entangle the features of object and background
• It harms the generalization of the representation on background shift
• We found that reducing background bias also improves general distribution shifts
(the model learns more object-centric representation)
Motivation
14
• How to solve this?
• The idea is very simple: if we know the object locations, just constrain the random crop so that it does not entangle different objects
• Object-aware random crop (OA-Crop)
• Restrict the random crop to be applied inside the object bbox (very simple, but effective 😅; see the sketch below)
Method
15
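A minimal sketch of the OA-Crop idea: sample the random crop entirely inside a given object bounding box. The bbox format, crop-size range, and function name are illustrative assumptions, not the paper's exact implementation.

```python
import random
from PIL import Image

def object_aware_crop(img: Image.Image, bbox, min_scale=0.5):
    """Sample a random crop that stays inside the object bbox.
    bbox = (x1, y1, x2, y2) in pixel coordinates (assumed format)."""
    x1, y1, x2, y2 = bbox
    bw, bh = x2 - x1, y2 - y1
    s = random.uniform(min_scale, 1.0)      # crop size as a fraction of the bbox (illustrative)
    cw, ch = int(bw * s), int(bh * s)
    cx = random.randint(x1, x2 - cw)        # crop stays within the bbox
    cy = random.randint(y1, y2 - ch)
    return img.crop((cx, cy, cx + cw, cy + ch))

# toy usage on a blank image with a hypothetical bbox
img = Image.new("RGB", (256, 256))
crop = object_aware_crop(img, bbox=(40, 60, 200, 220))
```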
• How to solve this?
• The idea is very simple: if we know the object locations, just constrain the random crop so that it does not entangle different objects
• Background mixup (BG-Mixup)
• Mix the object of image 1 with the background of image 2 to get a background-shifted image of object 1 (sketched below)
• Cf. We treat the mixed image as object 1, i.e., labels are not mixed (a background-only image carries no object information)
Method
16
Repeat the outside of
object bbox regions
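The BG-Mixup operation itself is a simple masked blend. A hedged sketch, assuming a soft or binary object mask for image 1 is available (e.g., from ContraCAM); the mixed image keeps image 1's identity.

```python
import torch

def background_mixup(img1, img2, obj_mask1):
    """Paste the object of img1 onto the background of img2.
    img1, img2: (C, H, W) tensors; obj_mask1: (1, H, W) soft or binary object
    mask for img1 (assumed given). Labels are NOT mixed: the result is treated
    as an image of object 1."""
    return obj_mask1 * img1 + (1.0 - obj_mask1) * img2

# toy usage with a random placeholder mask
img1, img2 = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
mask = (torch.rand(1, 224, 224) > 0.5).float()
mixed = background_mixup(img1, img2, mask)
```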
• But how can we get the object locations?
• We found that contrastively learned representations can localize object regions
…concurrent with DINO, but applying CAM to ResNet instead of attention to ViT
Method
17
• But how can we get the object locations?
• Contrastive CAM (ContraCAM)
• We apply iterative masking to identify multiple and entire (not just partial) objects
(find the next most salient region after masking out the previous ones; see the sketch below)
Method
18
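A sketch of the iterative-masking loop, with compute_cam standing in as a hypothetical callable for the contrastive CAM itself; only the masking-and-union logic is shown.

```python
import torch

def iterative_masking(img, compute_cam, n_iters=3, thresh=0.5):
    """Find multiple/entire object regions by repeatedly masking the most
    salient region and re-running the saliency method. compute_cam is a
    hypothetical callable returning an (H, W) saliency map in [0, 1]."""
    masked = img.clone()
    total = torch.zeros(img.shape[-2:])
    for _ in range(n_iters):
        cam = compute_cam(masked)                 # saliency of the remaining regions
        region = (cam > thresh).float()
        total = torch.maximum(total, region)      # union of regions found so far
        masked = masked * (1.0 - region)          # mask out what was already found
    return total

# toy usage with a placeholder saliency function (channel mean)
cam_fn = lambda x: x.mean(dim=0)
regions = iterative_masking(torch.rand(3, 64, 64), cam_fn)
```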
• Outline
• We conduct three experiments to verify the performance of our proposed methods:
• ContraCAM – unsupervised object localization
• Compare the mean IoU of saliency map vs. GT object
• OACrop – self-supervised learning from multi-object images (COCO)
• Linear evaluation and fine-tuning for object detection
• BG-Mixup – robustness on distribution shifts of self-supervised representations (ImageNet-variants)
• Train a classifier with original data, and evaluate it on distribution-shifted data
Experiments
19
• Unsupervised object localization
• ContraCAM outperforms the SOTA unsupervised localization method
• ContraCAM is comparable to (and often better than) classifier CAM, although it does not use class information
Experiments
20
• Self-supervised learning from multi-object images (COCO)
• OACrop consistently improves the performance on downstream tasks
• ContraCAM gives reasonable performance, but better object location (GT) may give further improvements
• GT boxes harmed MoCo v2 – they contain too-small objects, making it hard to find positive pairs
Experiments
21
• Robustness on distribution shifts (ImageNet-variants)
• BG-Mixup improves robustness on background shift (evaluated on BG Challenge benchmark)
• Similar to OACrop, better object location (GT) may give further improvements
Experiments
22
• Robustness on distribution shifts (ImageNet-variants)
• BG-Mixup also improves robustness on general distribution shifts
• Better than applying Mixup or CutMix for self-supervised learning (suggested in the i-Mix paper)
Experiments
23
24
OAMixer: Object-aware Mixing Layer
for Vision Transformers
Joint work with Hyunwoo Kang* (equal), Jinwoo Shin
(CVPRW’22)
• Patch-based models (ViT, Mixers) have shown remarkable success in visual recognition
• Divide an image into a set of patches
• Apply intra-patch and inter-patch operations (e.g., self-attention, MLP, Conv) to process features
Background
25
Image from the MLP-Mixer paper
[ViT] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
[MLP-Mixer] Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS 2021.
[ConvMixer] Trockman & Kolter. Patches Are All You Need? arXiv 2022.
Inter-patch operation
• ViT – self-attention
• MLP-Mixer – MLP
• ConvMixer – Conv
Intra-patch operation
• Inter-patch operations entangle the features of different objects and backgrounds
• This causes scene bias (i.e., spurious correlations between objects and background)
• Contextual bias: different objects (e.g., giraffe and zebra) co-occur in the training data
• Background bias: object and background (e.g., zebra and safari) co-occur in the training data
Motivation
26
Image from the OACon paper
• Idea. Make features of different object/background patches interact less
• In our early experiments, we used the ground-truth object masks
• We hard-restricted the attention of ViT to interact only with patches of the same object
• Assume single-object classification, and divide patches into “contains object” and “background only” groups
• We observed that this “hard disentanglement” improves background robustness, but harms accuracy
• BG-robustness: evaluate accuracy on background-shifted data
• We then tried a “soft version” of this idea, and surprisingly, it improved both accuracy and BG-robustness
• Intuitively, this object-aware inductive bias guides the attention to focus on more related patches
Method
27
Interact Do not interact
Image from the MLP-Mixer paper
• Overall framework of the object-aware mixing layer
• We assume that the object labels are known, then provide them as an input for the model
• We tried to jointly predict the object labels during the forward pass, but the 2-stage approach performed better
• The quality of the object labels becomes better in higher layers
• However, the patch disentanglement should be applied in lower layers
Method
28
• Overall framework of the object-aware mixing layer
• We assume that the object labels are known, then provide them as an input for the model
• Step 1. Compute the reweighting mask M = [M_ij] from the object labels {y_i}
• Jointly optimize the scale parameter κ^(l) (for each layer l)
• Step 2. Reweight the patch mixing layer f_mix to get the object-aware mixing layer f_oamix
• The definition of f_oamix depends on the shape of f_mix: ℝ^N → ℝ^N for N patches
• If the patch mixing layer is linear, i.e., f_mix ≔ L_mix, the OAMixer is f_oamix ≔ M ⊙ L_mix
• We consider three cases of nonlinear mixing layers: self-attention, MLP, and Conv
Method
29
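A minimal sketch of the two steps for the linear case. The exact functional form of M is not spelled out on the slide, so an exponential of patch-label similarity scaled by κ is an illustrative assumption.

```python
import torch

def reweighting_mask(patch_labels, kappa):
    """patch_labels: (N, K) soft object labels per patch; kappa: learnable
    per-layer scale. Patches with similar object labels get larger weights
    (the exact form of M is an illustrative assumption)."""
    sim = patch_labels @ patch_labels.t()          # (N, N) label similarity
    return torch.exp(kappa * sim)

def oa_linear_mixing(x, L_mix, M):
    """Linear patch-mixing layer reweighted by M: f_oamix(x) = (M ⊙ L_mix) x.
    x: (N, D) patch features, L_mix: (N, N) mixing weights, M: (N, N) mask."""
    return (M * L_mix) @ x

# toy usage: 16 patches, 4 object labels, 32-dim features
y = torch.rand(16, 4); y = y / y.sum(dim=1, keepdim=True)
M = reweighting_mask(y, kappa=torch.tensor(1.0))
out = oa_linear_mixing(torch.randn(16, 32), torch.randn(16, 16), M)
```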
• OAMixer for ViT, MLP-Mixer, and ConvMixer
• Self-attention: Reweight the attention matrix
• MLP: Decompose into linear and residual parts, and only reweight the linear part
• Conv: Linearize the convolution and multiply the reweighting mask
Method
30
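For the self-attention case, reweighting the attention matrix might look like the following sketch; the row renormalization after masking is an assumed detail.

```python
import torch
import torch.nn.functional as F

def oa_attention(q, k, v, M):
    """Object-aware self-attention: multiply the attention matrix by the
    reweighting mask M and renormalize rows (renormalization is an assumption).
    q, k, v: (N, D) patch queries/keys/values; M: (N, N) reweighting mask."""
    attn = F.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)   # (N, N)
    attn = attn * M
    attn = attn / attn.sum(dim=-1, keepdim=True)              # renormalize each row
    return attn @ v
```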
• How to obtain the object labels?
• We use ReLabel for the supervised setting and DINO attention for the self-supervised setting
• We do not need extra supervision beyond the training data
• ReLabel is a weakly-supervised method, i.e., only need class labels
• DINO is a self-supervised method, i.e., no labels are needed
Method
31
[ReLabel] Yun et al. Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels. CVPR 2021.
[DINO] Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021.
• OAMixer improves accuracy (↑ is better) and BG-robustness (↓ is better)
• Evaluated on the Background Challenge benchmark (BG-Gap = Mixed-Same – Mixed-Rand)
Experiments
32
[BG Challenge] Xiao et al. Noise or Signal: The Role of Image Backgrounds in Object Recognition. ICLR 2021.
• OAMixer gives orthogonal gain over the spatial inductive bias methods
• Some works aimed to add spatial (i.e., conv-like) inductive bias to patch-based models
• They (ConViT, CoAtNet) improve accuracy, but do not help BG-robustness
Experiments
33
[ConViT] d’Ascoli et al. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. ICML 2021.
[CoAtNet] Dai et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes. NeurIPS 2021.
• The gain of OAMixer does not come only from the object labels
• TokenLabeling (TL) uses patch-level labels as additional supervision during training
• TL improves accuracy, but does not help BG-robustness
Experiments
34
[TL] Jiang et al. All Tokens Matter: Token Labeling for Training Better Vision Transformers. NeurIPS 2021.
• OAMixer also improves ImageNet supervised and self-supervised learning
• Improves ImageNet top-1 classification accuracy
• Improves linear probing of DINO trained on ImageNet
Experiments
35
• Analysis
• OAMixer learns higher mask scales κ^(l) (i.e., focuses within objects) in early layers
• It resembles the local-to-global structure of CNNs, but in an object-wise (instead of 2D spatial) sense
• ConvMixer has lower mask scales (i.e., sees more globally) in early layers due to its limited receptive field
• OAMixer makes ViT focus more on the object (on background-shifted images)
Experiments
36
[Transformer Saliency] Chefer et al. Transformer Interpretability Beyond Attention Visualization. CVPR 2021.
RoPAWS: Robust Semi-supervised
Representation Learning from Uncurated Data
37
Joint work with Jong-Chyi Su, Chih-Yao Ma, Mahmoud Assran, Ishan Misra, Licheng Yu, Sean Bell
(collaborated w/ Meta AI, ICLR’23)
• Semi-supervised learning (Semi-SL)
• Semi-SL aims to learn a model using small labeled data and large unlabeled data
Motivation
38
Image from Yalniz et al., “Billion-scale semi-supervised learning for image classification,” arXiv’19
• Semi-supervised learning (Semi-SL)
• However, Semi-SL often fails severely when the unlabeled data has a distribution shift, i.e., is drawn from different data sources
Motivation
39
Image from Yalniz et al., “Billion-scale semi-supervised learning for image classification,” arXiv’19
• PAWS – the SOTA method for semi-supervised image classification
• PAWS creates a pseudo-label from the labeled data {x, y} with a soft nearest-neighbor (Soft-NN) classifier
• Formally, the pseudo-label of PAWS is σ_τ(z · z_lᵀ) · p_l, a similarity-weighted average of the support labels (sketched below)
• z is the embedding, z_l are the embeddings of labeled data with labels p_l, and σ_τ is a softmax with temperature τ
Method
40
Assran et al., “Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples,” ICCV’21
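A minimal sketch of the Soft-NN pseudo-label: a softmax-weighted average of the labeled support labels. Variable names follow the slide (z, z_l, p_l, τ).

```python
import torch
import torch.nn.functional as F

def paws_pseudo_label(z, z_l, p_l, tau=0.1):
    """Soft nearest-neighbor pseudo-label: weight the labels of the labeled
    support set by the softmax similarity to each unlabeled embedding.
    z: (B, D) unlabeled embeddings, z_l: (M, D) labeled embeddings,
    p_l: (M, K) one-hot (or soft) labels."""
    z, z_l = F.normalize(z, dim=1), F.normalize(z_l, dim=1)
    w = F.softmax(z @ z_l.t() / tau, dim=1)      # (B, M) Soft-NN weights
    return w @ p_l                               # (B, K) pseudo-labels
```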
• Why does PAWS fail on uncurated data?
• PAWS infers the pseudo-label of unlabeled data from the nearby labeled data
• The pseudo-label becomes overconfident in some labeled class, even though the unlabeled sample is out-of-class
• RoPAWS calibrates the pseudo-label considering the densities of both labeled and unlabeled data
• Since out-of-class data is farther from labeled data than other unlabeled data, it receives an uncertain prediction
Method
41
• Why does PAWS fail on uncurated data?
• PAWS pushes out-of-class data (black points) to some in-class clusters (colored points)
• In contrast, RoPAWS pushes out-of-class data far from all the in-class clusters
Method
42
• How to tackle this?
• Probabilistic modeling (or generative classifier) is a principled way to calibrate the prediction
• Our key insight is that PAWS can be secretly interpreted as a generative classifier (KDE on embedding space)
Method
43
Image from https://learnopencv.com/generative-and-discriminative-models/
• PAWS is secretly a generative classifier
• Define the densities on the embedding space using KDE with labeled data ℬ_l (and ℬ_l^y, the subset of class y)
• Applying the Bayes’ rule, we can recover the original prediction formula of PAWS
Method
44
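The generative-classifier view can be sketched directly: model p(x|y) with a KDE over the labeled embeddings of class y (using the same exponential kernel as the Soft-NN similarity) and apply Bayes' rule. With a class-balanced support set and a uniform class prior this recovers the PAWS prediction; the kernel choice and the balance assumption are noted in the comments.

```python
import torch
import torch.nn.functional as F

def kde_classifier(z, z_l, y_l, num_classes, tau=0.1):
    """Generative-classifier view of PAWS: p(x|y) is a KDE over labeled
    embeddings of class y with kernel exp(z·z_i / tau). Applying Bayes' rule
    with a uniform prior (and a class-balanced support set) recovers the
    Soft-NN prediction."""
    z, z_l = F.normalize(z, dim=1), F.normalize(z_l, dim=1)
    k = torch.exp(z @ z_l.t() / tau)                      # (B, M) kernel values
    p_xy = torch.stack([k[:, y_l == c].mean(dim=1)        # p(x|y=c) via per-class KDE
                        for c in range(num_classes)], dim=1)
    return p_xy / p_xy.sum(dim=1, keepdim=True)           # Bayes' rule: p(y|x)
```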
• Semi-supervised density modeling
• RoPAWS extends the KDE modeling to consider both labeled data ℬ_l and unlabeled data ℬ_u
• Applying Bayes’ rule, the pseudo-label is given by a closed-form solution of the KDE
• The pseudo-labels of unlabeled data are a function of the pseudo-labels of both labeled and unlabeled data
• Rearranging the formula gives the pseudo-label of unlabeled data in closed form
Method
45
• In-domain prior
• Semi-supervised KDE calibrates the prediction, but does not explicitly consider out-of-class data
• We propose a (data-dependent) in-domain prior that explicitly regularizes the prediction of uncertain data toward uniform
• We set the prior p(y|x) of unlabeled data as 𝒰 · p(in|x), where p(in|x) is the in-domain prior
• Then, the closed-form solution gives a posterior q(y|x) whose sum over classes is < 1
• Convert it to the final K-dim probability by adding the posterior of the OOD class, 𝒰 · (1 − q(in|x)) (sketched below)
Method
46
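The final conversion described above is a one-liner; a small sketch, assuming q(in|x) is simply the sum of the unnormalized posterior over the K classes.

```python
import torch

def add_ood_mass(q, num_classes):
    """Convert an unnormalized posterior q(y|x) (rows sum to q(in|x) < 1) into
    a K-dim probability by spreading the leftover OOD mass uniformly:
    p(y|x) = q(y|x) + (1/K) * (1 - q(in|x))."""
    q_in = q.sum(dim=1, keepdim=True)                 # q(in|x)
    return q + (1.0 - q_in) / num_classes

q = torch.tensor([[0.6, 0.2, 0.1]])                   # sums to 0.9 -> 0.1 is OOD mass
print(add_ood_mass(q, num_classes=3))                 # rows now sum to 1
```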
• Training of RoPAWS
• Use this calibrated pseudo-label as the target of PAWS (keep the output as the original Soft-NN)
• Inference of RoPAWS
• Use the same Soft-NN classifier of PAWS (but the outputs are calibrated by targets)
• One can also train a (linear) classifier on top of the learned representation
Method
47
• Semi-iNat results
• RoPAWS achieves SOTA on a realistic large-scale robust Semi-SL benchmark, Semi-iNat
Experiments
48
• CIFAR-10 results
• RoPAWS outperforms previous SOTA robust Semi-SL methods under various scenarios
Experiments
49
• Ablation studies
• RoPAWS gives reasonable performance under a range of hyperparameters
• All the proposed components contribute to the final accuracy
Experiments
50
S-CLIP: Semi-supervised Vision-Language Pre-training
using Few Specialist Captions
51
Joint work with Minkyu Kim, Kyungmin Lee, and Jinwoo Shin
(under review)
• CLIP (contrastive language-image pre-training) has achieved remarkable success
• It succeeds on downstream tasks including zero-shot classification, image-text retrieval, etc.
Motivation
52
• However, CLIP fails on specialist domains such as remote sensing or fashion
• CLIP is trained on web-crawled data, which may not contain information about specialist domains
• Adapting CLIP to such domains is challenging since the image-text pairs are hard to obtain
Motivation
53
(a) Remote sensing (b) Fashion (c) Scientific figure (d) Comics
• We aim to apply Semi-SL to CLIP by computing pseudo-labels (captions) of unlabeled images
• However, naive pseudo-labeling fails since the captions of unlabeled images differ from those of labeled images
• Recap. This is similar to robust Semi-SL for image classification (e.g., RoPAWS)
Motivation
54
• We propose two pseudo-labeling techniques for semi-supervised CLIP
• Caption-level pseudo-label:
• Assume the semantics of an unlabeled image is a combination of the given captions
• Keyword-level pseudo-label:
• Assume the unlabeled image shares a keyword with the nearest labeled image
(the candidate keyword set K_u is the subset of the pre-defined set K whose keywords appear in the nearest labeled caption)
Method
55
• How to utilize the pseudo-labels?
• CLIP solves two classification tasks: predict caption 𝑦 given image 𝑥, and image 𝑥 given caption 𝑦
Method
56
• How to utilize the pseudo-labels?
• The pseudo-labels extend the CLIP loss to unlabeled images, relating them to captions or keywords (sketched below)
Method
57
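One way to plug soft pseudo-labels into the image-to-text direction of the CLIP loss is a soft-target cross-entropy over the similarity logits. This is a generic sketch, not necessarily the paper's exact objective; the symmetric text-to-image term is omitted.

```python
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, targets, tau=0.07):
    """Image->text contrastive loss with soft targets. For labeled pairs the
    target row is one-hot (the matched caption); for unlabeled images it is
    the pseudo-label distribution over the labeled captions."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / tau              # (B_img, B_txt)
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()   # soft cross-entropy
```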
• Caption-level pseudo-labels
• Find the optimal transport (OT) mapping between unlabeled images {u} and labeled images {x}
• The cost function C is given by their visual similarities (negative cosine similarity of embeddings)
• We also tested a direct mapping between unlabeled images {u} and labeled texts {y}, but it was worse
• The pseudo-label q_i for unlabeled image u_i is given by normalizing the mapping Γ (sketched below)
• These soft labels guide the relation between an unlabeled image u_i and the captions {y}
Method
58
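A hedged sketch of the caption-level pseudo-labels: build the cost matrix from negative cosine similarities, compute an entropic OT plan with a few Sinkhorn iterations, and row-normalize the plan. The regularization strength, iteration count, and uniform marginals are illustrative choices.

```python
import torch
import torch.nn.functional as F

def caption_pseudo_labels(unl_emb, lab_emb, eps=0.1, n_iters=50):
    """Optimal-transport mapping between unlabeled and labeled images.
    Cost = negative cosine similarity of image embeddings; the OT plan Gamma
    is row-normalized into soft pseudo-labels over the labeled captions."""
    u_emb = F.normalize(unl_emb, dim=1)
    l_emb = F.normalize(lab_emb, dim=1)
    cost = -u_emb @ l_emb.t()                              # (N_u, N_l)
    K = torch.exp(-cost / eps)                             # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))    # uniform marginals
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u = torch.ones_like(a)
    for _ in range(n_iters):                               # Sinkhorn iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]                    # OT plan
    return gamma / gamma.sum(dim=1, keepdim=True)          # (N_u, N_l) pseudo-labels
```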
• Keyword-level pseudo-labels
• The nearest labeled image x* provides a candidate set of keywords K_u contained in the unlabeled image u
• The candidate set is given by checking whether a keyword k is contained in the caption
• Note that we assume x* and u may share a keyword, but which exact keyword is shared is unknown
• This forms a partial label learning (PLL) problem, using a candidate set (not an exact label) as the supervision
• To solve this, we define the pseudo-label q_u as a sparse probability over the elements of K_u (sketched below)
• These soft labels guide the relation between an unlabeled image u and the keywords {k}
Method
59
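A sketch of the keyword-level pseudo-labels. The uniform weighting over the candidate set here is an illustrative simplification of the "sparse probability over K_u" mentioned above, and the data structures (caption strings, keyword list) are assumed.

```python
import torch
import torch.nn.functional as F

def keyword_pseudo_label(unl_emb, lab_emb, lab_captions, keywords):
    """For each unlabeled image, find the nearest labeled image, collect the
    pre-defined keywords appearing in its caption (the candidate set K_u), and
    return a sparse distribution over keywords supported on that set."""
    sim = F.normalize(unl_emb, dim=1) @ F.normalize(lab_emb, dim=1).t()
    nearest = sim.argmax(dim=1)                            # index of the nearest labeled image
    q = torch.zeros(unl_emb.size(0), len(keywords))
    for i, j in enumerate(nearest.tolist()):
        cand = [k for k, kw in enumerate(keywords) if kw in lab_captions[j]]
        if cand:                                           # uniform over candidates (simplification)
            q[i, cand] = 1.0 / len(cand)
    return q
```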
• Remote sensing results
• S-CLIP significantly improves top-1 zero-shot accuracy
• It helps in both distribution-aligned (L=U) and distribution-shifted (L≠U) scenarios
Experiments
60
• Remote sensing results
• S-CLIP also improves image-text retrieval - recall at K (R@K)
Experiments
61
• More results
• S-CLIP works in various domains, such as fashion, scientific figures, and comics
Experiments
62
• Ablation studies
• Justification of each design choice (e.g., caption-level and keyword-level pseudo-labels are complementary)
Experiments
63
• In this talk, I covered two topics on learning visual representations from uncurated data
• Object-aware representation for contrastive learning and patch-based models
• OACon (NeurIPS’21): fixing contrastive learning to disentangle objects and background
• OAMixer (CVPRW’22): fixing ViT (and Mixers) to disentangle objects and background
• Robust semi-supervised learning for image classification and vision-language pre-training
• RoPAWS (ICLR’23): semi-supervised classification when unlabeled images contain unseen classes
• S-CLIP (Under Review): semi-supervised vision-language pre-training when unlabeled images contain unseen captions
Final remarks
64

More Related Content

Similar to Learning Visual Representations from Uncurated Data

Object recognition with cortex like mechanisms pami-07
Object recognition with cortex like mechanisms pami-07Object recognition with cortex like mechanisms pami-07
Object recognition with cortex like mechanisms pami-07
dingggthu
 
auto-assistance system for visually impaired person
auto-assistance system for visually impaired personauto-assistance system for visually impaired person
auto-assistance system for visually impaired person
shahsamkit73
 

Similar to Learning Visual Representations from Uncurated Data (20)

Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
 
Introduction to 3D Computer Vision and Differentiable Rendering
Introduction to 3D Computer Vision and Differentiable RenderingIntroduction to 3D Computer Vision and Differentiable Rendering
Introduction to 3D Computer Vision and Differentiable Rendering
 
IRJET - Direct Me-Nevigation for Blind People
IRJET -  	  Direct Me-Nevigation for Blind PeopleIRJET -  	  Direct Me-Nevigation for Blind People
IRJET - Direct Me-Nevigation for Blind People
 
Visual Transformers
Visual TransformersVisual Transformers
Visual Transformers
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
 
Object recognition with cortex like mechanisms pami-07
Object recognition with cortex like mechanisms pami-07Object recognition with cortex like mechanisms pami-07
Object recognition with cortex like mechanisms pami-07
 
Introduction talk to Computer Vision
Introduction talk to Computer Vision Introduction talk to Computer Vision
Introduction talk to Computer Vision
 
auto-assistance system for visually impaired person
auto-assistance system for visually impaired personauto-assistance system for visually impaired person
auto-assistance system for visually impaired person
 
Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
 
Object detection with deep learning
Object detection with deep learningObject detection with deep learning
Object detection with deep learning
 
A Comprehensive Analysis on Co-Saliency Detection on Learning Approaches in 3...
A Comprehensive Analysis on Co-Saliency Detection on Learning Approaches in 3...A Comprehensive Analysis on Co-Saliency Detection on Learning Approaches in 3...
A Comprehensive Analysis on Co-Saliency Detection on Learning Approaches in 3...
 
brief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANsbrief Introduction to Different Kinds of GANs
brief Introduction to Different Kinds of GANs
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
 
Eren_Golge_MS_Thesis_2014
Eren_Golge_MS_Thesis_2014Eren_Golge_MS_Thesis_2014
Eren_Golge_MS_Thesis_2014
 
Object Detection An Overview
Object Detection An OverviewObject Detection An Overview
Object Detection An Overview
 
Visual Search and Question Answering II
Visual Search and Question Answering IIVisual Search and Question Answering II
Visual Search and Question Answering II
 
Image classification with Deep Neural Networks
Image classification with Deep Neural NetworksImage classification with Deep Neural Networks
Image classification with Deep Neural Networks
 
AI in Industrial Robotics Applications
AI in Industrial Robotics ApplicationsAI in Industrial Robotics Applications
AI in Industrial Robotics Applications
 
IRJET- Real-Time Object Detection using Deep Learning: A Survey
IRJET- Real-Time Object Detection using Deep Learning: A SurveyIRJET- Real-Time Object Detection using Deep Learning: A Survey
IRJET- Real-Time Object Detection using Deep Learning: A Survey
 
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikDeep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
 

More from Sangwoo Mo

More from Sangwoo Mo (20)

Brief History of Visual Representation Learning
Brief History of Visual Representation LearningBrief History of Visual Representation Learning
Brief History of Visual Representation Learning
 
Hyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement LearningHyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement Learning
 
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)Deep Learning Theory Seminar (Chap 3, part 2)
Deep Learning Theory Seminar (Chap 3, part 2)
 
Deep Learning Theory Seminar (Chap 1-2, part 1)
Deep Learning Theory Seminar (Chap 1-2, part 1)Deep Learning Theory Seminar (Chap 1-2, part 1)
Deep Learning Theory Seminar (Chap 1-2, part 1)
 
Introduction to Diffusion Models
Introduction to Diffusion ModelsIntroduction to Diffusion Models
Introduction to Diffusion Models
 
Object-Region Video Transformers
Object-Region Video TransformersObject-Region Video Transformers
Object-Region Video Transformers
 
Deep Implicit Layers: Learning Structured Problems with Neural Networks
Deep Implicit Layers: Learning Structured Problems with Neural NetworksDeep Implicit Layers: Learning Structured Problems with Neural Networks
Deep Implicit Layers: Learning Structured Problems with Neural Networks
 
Learning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat MinimaLearning Theory 101 ...and Towards Learning the Flat Minima
Learning Theory 101 ...and Towards Learning the Flat Minima
 
Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)Sharpness-aware minimization (SAM)
Sharpness-aware minimization (SAM)
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density Models
 
Score-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential EquationsScore-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential Equations
 
Self-Attention with Linear Complexity
Self-Attention with Linear ComplexitySelf-Attention with Linear Complexity
Self-Attention with Linear Complexity
 
Meta-Learning with Implicit Gradients
Meta-Learning with Implicit GradientsMeta-Learning with Implicit Gradients
Meta-Learning with Implicit Gradients
 
Challenging Common Assumptions in the Unsupervised Learning of Disentangled R...
Challenging Common Assumptions in the Unsupervised Learning of Disentangled R...Challenging Common Assumptions in the Unsupervised Learning of Disentangled R...
Challenging Common Assumptions in the Unsupervised Learning of Disentangled R...
 
Generative Models for General Audiences
Generative Models for General AudiencesGenerative Models for General Audiences
Generative Models for General Audiences
 
Bayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-LearningBayesian Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-Learning
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Domain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveyDomain Transfer and Adaptation Survey
Domain Transfer and Adaptation Survey
 

Recently uploaded

Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
FIDO Alliance
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptxCyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 

Learning Visual Representations from Uncurated Data

  • 1. [Ph.D. Defense] Learning Visual Representations from Uncurated Data 2023.06.01. Presenter: Sangwoo Mo (KAIST) Committee: Profs. Jinwoo Shin, Inso Kweon, Junmo Kim, Eunho Yang, Jaegul Choo 1
  • 2. • I’m a 5th year Ph.D. student at KAIST, focusing on machine learning • Education • Ph.D. in Electrical Engineering, KAIST (advisor: Jinwoo Shin) • B.S. in Mathematics, POSTECH (summa cum laude) • Work experience • Research Intern, NVIDIA AI (Oct 2022 - Jan 2023) • Research Intern, Meta AI (Jun 2022 - Sep 2022) • External Collaborator, Naver AI (Jun 2021 - Sep 2021) • External Collaborator, Google Cloud AI (Nov 2020 - May 2021) • Research Intern, Kakao Brain (Feb 2019 - May 2019) Who am I? 2
  • 3. • My research goal is to build a representation of the world. To this end, I worked on: What I worked on? 3 Lead-authored publications [1] Mo et al. InstaGAN: Instance-aware Image-to-Image Translation. ICLR’19. [2] Mo et al. Mining GOLD Samples for Conditional GANs. NeurIPS’19. [3] Mo et al. Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs. CVPRW’20. [4] Tack*, Mo* et al. CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. NeurIPS’20. [5] Moon*, Mo* et al. MASKER: Masked Keyword Regularization for Reliable Text Classification. AAAI’21. [6] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21. [7] Yu*, Tack*, Mo* et al. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. ICLR’22. [8] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW'22. [9] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23. [10] Kim*, Mo* et al. Bias-to-Text: Debiasing Unknown Visual Biases through Language Interpretation. Under Review. [11] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review. Co-authored publications [12] Lookahead: A Far-sighted Alternative of Magnitude-based Pruning. ICLR’20. [13] Layer-adaptive Sparsity for the Magnitude-based Pruning. ICLR’21. [14] Abstract Reasoning via Logic-guided Generation. ICML’21. [15] Deep Neural Network Approach in Electrical Impedance Tomography-Based Real-Time Soft Tactile Sensor. IROS’19. [16] Deep Neural Network Based Electrical Impedance Tomographic Sensing Methodology for Large-Area Robotic Tactile Sensing. TRO’21. [17] Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling. TMLR’23. [18] Diffusion Probabilistic Models for Structured Node Classification. Under Review. Generative models “How does the world look like?” Representation learning “What is the essence of the world?” Robust and fair models “How to deploy the model in real-world?” InstaGAN (ICLR’19) [1] GOLD (NeurIPS’19) [2] FreezeD (CVPRW’20) [3] DIGAN (ICLR’22) [7] OACon (NeurIPS’21) [6] OAMixer (CVPRW’22) [8] CSI (NeruIPS’20) [4] RoPAWS (ICLR’23) [9] S-CLIP (Under Review) [11] MASKER (AAAI’21) [5] B2T (Under Review) [10]
  • 4. • My research goal is to build a representation of the world. To this end, I worked on: What I worked on? 4 Lead-authored publications [1] Mo et al. InstaGAN: Instance-aware Image-to-Image Translation. ICLR’19. [2] Mo et al. Mining GOLD Samples for Conditional GANs. NeurIPS’19. [3] Mo et al. Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs. CVPRW’20. [4] Tack*, Mo* et al. CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. NeurIPS’20. [5] Moon*, Mo* et al. MASKER: Masked Keyword Regularization for Reliable Text Classification. AAAI’21. [6] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21. [7] Yu*, Tack*, Mo* et al. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. ICLR’22. [8] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW'22. [9] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23. [10] Kim*, Mo* et al. Bias-to-Text: Debiasing Unknown Visual Biases through Language Interpretation. Under Review. [11] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review. Co-authored publications [11] Lookahead: A Far-sighted Alternative of Magnitude-based Pruning. ICLR’20. [12] Layer-adaptive Sparsity for the Magnitude-based Pruning. ICLR’21. [13] Abstract Reasoning via Logic-guided Generation. ICML’21. [14] Deep Neural Network Approach in Electrical Impedance Tomography-Based Real-Time Soft Tactile Sensor. IROS’19. [15] Deep Neural Network Based Electrical Impedance Tomographic Sensing Methodology for Large-Area Robotic Tactile Sensing. TRO’21. [16] Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling. TMLR’23. [17] Diffusion Probabilistic Models for Structured Node Classification. Under Review. Generative models “How does the world look like?” Representation learning “What is the essence of the world?” Robust and fair models “How to deploy the model in real-world?” InstaGAN (ICLR’19) [1] GOLD (NeurIPS’19) [2] FreezeD (CVPRW’20) [3] DIGAN (ICLR’22) [7] OACon (NeurIPS’21) [6] OAMixer (CVPRW’22) [8] CSI (NeruIPS’20) [4] RoPAWS (ICLR’23) [9] S-CLIP (Under Review) [11] MASKER (AAAI’21) [5] B2T (Under Review) [10] Scope of the thesis
  • 5. • Visual representation has a remarkable progress in recent years Learning from uncurated data 5 Image from https://paperswithcode.com/sota/image-classification-on-imagenet
  • 6. • Still many challenges to deploy the model in real-world datasets! • Real images are multi-object, drawn from multiple distributions, classes are imbalanced, etc. Learning from uncurated data 6 Image from https://liuziwei7.github.io/projects/LongTail.html
  • 7. • Representation learning from multi-object images, disentangling objects and background • CSI (NeurIPS’20) [1] • Can we use self-supervised learning for OOD detection? • RoPAWS (under review) [2] • Can we do semi-supervised learning when unlabeled data contains OOD data? • Semi-supervised learning when labeled and unlabeled data are drawn from different distributions Outline of the thesis 7
  • 8. • Representation learning from multi-object images, disentangling objects and background • OACon (NeurIPS’21) [1] • Fix contrastive learning to disentangle objects and background • OAMixer (CVPRW’22) [2] • Fix ViT (and Mixer) to disentangle objects and background • Semi-supervised learning when labeled and unlabeled data are drawn from different distributions • RoPAWS (ICLR’23) [3] • Semi-supervised image classification when unlabeled images contain unseen classes • S-CLIP (Under Review) [4] • Semi-supervised vision-language pre-training when unlabeled images contain unseen captions Outline of the thesis 8 [1] Mo*, Kang* et al. Object-aware Contrastive Learning for Debiased Scene Representation. NeurIPS’21. [2] Kang*, Mo* et al. OAMixer: Object-aware Mixing Layer for Vision Transformers. CVPRW’22. [3] Mo et al. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data. ICLR’23. [4] Mo et al. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions. Under Review.
  • 9. Object-aware Contrastive Learning for Debiased Scene Representation 9 Joint work with Hyunwoo Kang* (equal), Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin (collaborated w/ Google Cloud AI, NeurIPS’21)
  • 10. • Self-supervised learning – joint embedding approach • Joint embedding approach makes different views of the same image be invariant • It is successful visual recognition, particularly for image classification Motivation 10 Image from https://wikidocs.net/164357
  • 11. • Self-supervised learning – joint embedding approach • In image domain, random crop augmentation plays a crucial for for representation learning Motivation 11 Image from Chen et al., “A Simple Framework for Contrastive Learning of Visual Representations,” ICML’20 Random Crop
  • 12. • Issues of random crop • However, entangling randomly cropped images may occur scene bias • Contextual bias: Entangle the features of different objects (for multi-object images) • Background bias: Entangle the features of object and background Motivation 12
  • 13. • Issues of random crop • Contextual bias: Entangle the features of different objects • It harms the discriminative power of the learned representation (e.g., cannot classify giraffe vs. zebra) Motivation 13
  • 14. • Issues of random crop • Background bias: Entangle the features of object and background • It harms the generalization of the representation on background shift • We found that reducing background bias also improves general distribution shifts (the model learns more object-centric representation) Motivation 14
  • 15. • How to solve this? • The idea is very simple: if we know the object location, just enforce random crop to not entangle different objects • Object-aware random crop (OA-Crop) • Restrict the random crop to be applied inside of the object bbox (very simple, but was effective 😅) Method 15
  • 16. • How to solve this? • The idea is very simple: if we know the object location, just enforce random crop to not entangle different objects • Background mixup (BG-Mixup) • Mix object of image 1 and background of image 2 to get the bg-shifted object 1 image • Cf. We treat the mixed image as object 1, i.e., not mixup-ing labels (bg-only image has no object info.) Method 16 Repeat the outside of object bbox regions
  • 17. • But how can we get the object locations? • We found that contrastively learned representations have an ability to find object regions …concurrent with DINO, but applying CAM for ResNet instead of attention for ViT Method 17
  • 18. • But how can we get the object locations? • Contrastive CAM (ContraCAM) • We apply iterative masking to identify multiple and entire (not just part) objects (find the next most salient region after masking previous ones) Method 18
  • 19. • Outline • We conduct three experiments to verify the performance of our proposed methods: • ContraCAM – unsupervised object localization • Compare the mean IoU of saliency map vs. GT object • OACrop – self-supervised learning from multi-object images (COCO) • Linear evaluation and fine-tuning for object detection • BG-Mixup – robustness on distribution shifts of self-supervised representations (ImageNet-variants) • Train a classifier with original data, and evaluate it on distribution-shifted data Experiments 19
  • 20. • Unsupervised object localization • ContraCAM outperforms the SOTA unsupervised localization method • ContraCAM is comparable (often wins) with classifier CAM, although not using class information Experiments 20
  • 21. • Self-supervised learning from multi-object images (COCO) • OACrop consistently improves the performance on downstream tasks • ContraCAM gives reasonable performance, but better object location (GT) may give further improvements • GT boxes harmed MoCov2 – contain too-small objects, and hard to find the positive pair Experiments 21
  • 22. • Robustness on distribution shifts (ImageNet-variants) • BG-Mixup improves robustness on background shift (evaluated on BG Challenge benchmark) • Similar to OACrop, better object location (GT) may give further improvements Experiments 22
  • 23. • Robustness on distribution shifts (ImageNet-variants) • BG-Mixup also improves robustness on general distribution shifts • Better than applying Mixup or CutMix for self-supervised learning (suggested in the i-Mix paper) Experiments 23
  • 24. 24 OAMixer: Object-aware Mixing Layer for Vision Transformers Joint work with Hyunwoo Kang* (equal), Jinwoo Shin (CVPRW’22)
  • 25. • Patch-based models (ViT, Mixers) has shown remarkable success in visual recognition • Divide an image as a set of patches • Apply intra-patch and inter-patch (e.g., self-attention, MLP, Conv) operations to process features Background 25 Image from the MLP-Mixer paper [ViT] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. [MLP-Mixer] Tolstikhin et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS 2021. [ConvMixer] Trockman & Kolter. Patches Are All You Need? arXiv 2022. Inter-patch operation • ViT – self-attention • MLP-Mixer – MLP • ConvMixer – Conv Intra-patch operation
  • 26. • Inter-patch operations entangle the features of different objects and backgrounds • It occurs the scene bias (i.e., spurious correlation between objects and background) • Contextual bias: Different objects (e.g., giraffe and zebra) co-occurs in the training data • Background bias: Object and background (e.g., zebra and safari) co-occurs in the training data Motivation 26 Image from the OACon paper
  • 27. • Idea. Less interact features of different object/background patches • In our early experiments, we used the ground-truth object masks • We hardly restricted attention of ViT to only interact with the patches of the same object • Assume single object classification, and divide patches into “contains object” and “background only” groups • We observed that this “hard disentangle” improves background robustness, but harms accuracy • BG-robustness: Evaluate accuracy on background-shifted data • We then tried a “soft version” of this idea, and surprisingly, it improved both accuracy and BG-robustness • Intuitively, this object-aware inductive bias guides the attention to focus on more related patches Method 27 Interact Do not interact Image from the MLP-Mixer paper
  • 28. • Overall framework of the object-aware mixing layer • We assume that the object labels are known, then provide them as an input for the model • We tried to jointly predict the object labels during forward, but the 2-stage approach performed better • The quality of object labels become better in the higher layer • However, the patch disentanglement should be applied in the lower layer Method 28
  • 29. • Overall framework of the object-aware mixing layer • We assume that the object labels are known, then provide them as an input for the model • Step 1. Compute the reweighting mask 𝐌 = [𝐌!"] from the object labels {𝑦!} • Jointly optimize the scale parameter 𝜅($) (for each layer 𝑙) • Step 2. Reweight the patch mixing layer 𝑓&'( to get the object-aware mixing layer 𝑓)*&'( • The definition of 𝑓)*&'( depends on the shape of 𝑓&'(: ℝ+ → ℝ+ for 𝑁 patches • If the patch mixing layer is linear, i.e., 𝑓&'( ≔ 𝐋&'(, the OAMixer is 𝑓)*&'( ≔ 𝐌 ⊙ 𝐋&'( • We consider three cases of nonlinear mixing layer: self-attention, MLP, and Conv Method 29
  • 30. • OAMixer for ViT, MLP-Mixer, and ConvMixer • Self-attention: Reweight the attention matrix • MLP: Decompose into linear and residual parts, and only reweight the linear part • Conv: Linearize the convolution and multiply the reweighting mask Method 30
  • 31. • How to obtain the object labels? • We use ReLabel for supervised setting and DINO attention for self-supervised setting • We don’t need extra supervision more than the training data • ReLabel is a weakly-supervised method, i.e., only need class labels • DINO is a self-supervised method, i.e., no labels are needed Method 31 [ReLabel] Yun et al. Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels. CVPR 2021. [DINO] Caron et al. Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021.
  • 32. • OAMixer improves accuracy (↑ is better) and BG-robustness (↓ is better) • Evaluated on the Background Challenge benchmark (BG-Gap = Mixed-Same – Mixed-Rand) Experiments 32 [BG Challenge] Xiao et al. Noise or Signal: The Role of Image Backgrounds in Object Recognition. ICLR 2021.
  • 33. • OAMixer gives orthogonal gain over the spatial inductive bias methods • Some works aimed to combine spatial inductive bias (i.e., conv-like) to patch-based models • They (ConViT, CoAtNet) improve accuracy, but does not help BG-robustness Experiments 33 [ConViT] d’Ascoli et al. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. ICML 2021. [CoAtNet] Dai et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes. NeurIPS 2021.
  • 34. • The gain of OAMixer not only comes from the object labels • TokenLabeling (TL) uses patch-level labels as additional supervision during training • TL improves accuracy, but does not help BG-robustness Experiments 34 [TL] Jian et al. All Tokens Matter: Token Labeling for Training Better Vision Transformers. NeurIPS 2021.
  • 35. • OAMixer also improves ImageNet supervised and self-supervised learning • Improves ImageNet top-1 classification accuracy • Improves linear proving of DINO trained on ImageNet Experiments 35
  • 36. • Analysis • OAMixer learns higher mask scales 𝜅($) (i.e., focus on intra-objects) in early layers • It resembles the local-to-global structure of CNN, but the object (instead of 2D spatial) version • ConvMixer has lower mask scales (i.e., see globally) in early layers due to the limited receptive field • OAMixer makes ViT more focus on the object (on background-shifted images) Experiments 36 [Transformer Saliency] Chefer et al. Transformer Interpretability Beyond Attention Visualization. CVPR 2021.
  • 37. RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data 37 Joint work with Jong-Chyi Su, Chih-Yao Ma, Mahmoud Assran, Ishan Misra, Licheng Yu, Sean Bell (collaborated w/ Meta AI, ICLR’23)
  • 38. • Semi-supervised learning (Semi-SL) • Semi-SL aims to learn a model using small labeled data and large unlabeled data Motivation 38 Image from Yalniz et al., “Billion-scale semi-supervised learning for image classification,” arXiv’19
  • 39. • Semi-supervised learning (Semi-SL) • However, often hardly fail when unlabeled data has a distribution shift, drawn from different data sources Motivation 39 Image from Yalniz et al., “Billion-scale semi-supervised learning for image classification,” arXiv’19
  • 40. • PAWS – the SOTA method for semi-supervised image classification • PAWS creates a pseudo-label from the labeled data {𝑥, 𝑦} with soft nearest neighbor (Soft-NN) classifier • Formally, the pseudo-label of PAWS is: • 𝑧 is embedding, 𝐳$ are embeddings of labeled data with labels 𝐩$, and 𝜎, is softmax with temperature 𝜏 Method 40 Assran et al., “Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples,” ICCV’21
  • 41. • Why does PAWS fail on uncurated data? • PAWS infers the pseudo-label of unlabeled data from the nearby labeled data • The pseudo-label becomes overconfident in some labeled class, even when the unlabeled sample is out-of-class • RoPAWS calibrates the pseudo-label by considering the densities of both labeled and unlabeled data • Since out-of-class data is farther from labeled data than other unlabeled data, it receives an uncertain prediction Method 41
  • 42. • Why does PAWS fail on uncurated data? • PAWS pushes out-of-class data (black points) toward some in-class clusters (colored points) • In contrast, RoPAWS pushes out-of-class data away from all the in-class clusters Method 42
  • 43. • How to tackle this? • Probabilistic modeling (i.e., a generative classifier) is a principled way to calibrate predictions • Our key insight is that PAWS can be secretly interpreted as a generative classifier (KDE on the embedding space) Method 43 Image from https://learnopencv.com/generative-and-discriminative-models/
  • 44. • PAWS is secretly a generative classifier • Define densities on the embedding space using KDE with the labeled data ℬₗ (and ℬₗ^𝑦, its subset of class 𝑦) • Applying Bayes’ rule, we can recover the original prediction formula of PAWS Method 44
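A sketch of the derivation referenced above, assuming a kernel k(·,·) on normalized embeddings and a class prior proportional to the class frequency in the labeled batch; exact constants and kernel choices follow the RoPAWS paper and are omitted here.

```latex
% KDE densities on the embedding space (sketch)
p(z \mid y) = \frac{1}{|\mathcal{B}_l^y|} \sum_{z_i \in \mathcal{B}_l^y} k(z, z_i),
\qquad
p(y) = \frac{|\mathcal{B}_l^y|}{|\mathcal{B}_l|}

% Bayes' rule recovers a Soft-NN-style prediction:
p(y \mid z)
  = \frac{p(z \mid y)\, p(y)}{\sum_{y'} p(z \mid y')\, p(y')}
  = \frac{\sum_{z_i \in \mathcal{B}_l^y} k(z, z_i)}{\sum_{z_j \in \mathcal{B}_l} k(z, z_j)}

% With k(z, z_i) = \exp(z \cdot z_i / \tau), this is exactly the softmax-weighted
% average of the labeled one-hot labels, i.e., the PAWS pseudo-label.
```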
  • 45. • Semi-supervised density modeling • RoPAWS extends the KDE modeling to consider both labeled data ℬₗ and unlabeled data ℬᵤ • Applying Bayes’ rule, the pseudo-label of an unlabeled sample becomes a function of the pseudo-labels of both labeled and unlabeled data • Organizing the formula yields a closed-form solution for the pseudo-labels of unlabeled data (a sketch follows below) Method 45
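A minimal sketch of the semi-supervised KDE pseudo-label described above (my own reading, not the official RoPAWS code). Assumptions: an exponential kernel on normalized embeddings, the self-kernel term excluded, and the fixed point q_u = A_ul p_l + A_uu q_u, where the rows of [A_ul, A_uu] are kernel weights normalized jointly over labeled and unlabeled samples; the fixed point is then solved as a linear system.

```python
import torch
import torch.nn.functional as F

def semi_kde_pseudo_label(z_u, z_l, p_l, tau=0.1):
    """
    z_u: (U, D) unlabeled embeddings, z_l: (L, D) labeled embeddings
    p_l: (L, K) labels of labeled data
    returns q_u: (U, K) calibrated pseudo-labels
    """
    z_u, z_l = F.normalize(z_u, dim=-1), F.normalize(z_l, dim=-1)
    k_ul = torch.exp(z_u @ z_l.t() / tau)            # (U, L) kernel to labeled data
    k_uu = torch.exp(z_u @ z_u.t() / tau)            # (U, U) kernel to unlabeled data
    k_uu.fill_diagonal_(0.0)                         # exclude the self term (assumption)
    denom = k_ul.sum(-1, keepdim=True) + k_uu.sum(-1, keepdim=True)
    a_ul, a_uu = k_ul / denom, k_uu / denom          # jointly normalized weights
    # Solve (I - A_uu) q_u = A_ul p_l for the fixed point q_u
    eye = torch.eye(z_u.size(0), device=z_u.device)
    q_u = torch.linalg.solve(eye - a_uu, a_ul @ p_l)
    return q_u
```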
  • 46. • In-domain prior • Semi-supervised KDE calibrates the prediction but does not explicitly consider out-of-class data • We propose a (data-dependent) in-domain prior that explicitly regularizes the prediction of uncertain data toward uniform • We set the prior 𝑝(𝑦|𝑥) of unlabeled data to 𝒰 ⋅ 𝑝_in(𝑥), where 𝑝_in(𝑥) is the in-domain prior and 𝒰 the uniform distribution • Then the closed-form solution gives a posterior 𝑞(𝑦|𝑥) whose sum is < 1 • We convert it to the final 𝐾-dimensional probability by adding the posterior of the OOD class, 𝒰 ⋅ (1 − 𝑞_in(𝑥)) Method 46
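A tiny sketch of the final adjustment described above, under my reading of the slide: the rows of the raw posterior sum to q_in(x) ≤ 1, and the missing mass is assumed to be redistributed uniformly over the K classes.

```python
import torch

def apply_in_domain_prior(q_raw):
    """q_raw: (U, K) posteriors whose rows sum to q_in(x) <= 1."""
    K = q_raw.size(1)
    q_in = q_raw.sum(dim=1, keepdim=True)        # in-domain mass per sample
    uniform = torch.full_like(q_raw, 1.0 / K)    # U(y)
    return q_raw + uniform * (1.0 - q_in)        # rows now sum to 1
```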
  • 47. • Training of RoPAWS • Use this calibrated pseudo-label as the target of PAWS (keeping the output head as the original Soft-NN) • Inference of RoPAWS • Use the same Soft-NN classifier as PAWS (the outputs are calibrated through the targets) • One can also train a (linear) classifier on top of the learned representation Method 47
  • 48. • Semi-iNat results • RoPAWS achieves SOTA on a realistic large-scale robust Semi-SL benchmark, Semi-iNat Experiments 48
  • 49. • CIFAR-10 results • RoPAWS outperforms previous SOTA robust Semi-SL methods under various scenarios Experiments 49
  • 50. • Ablation studies • RoPAWS gives reasonable performance under a range of hyperparameters • All the proposed components contribute to the final accuracy Experiments 50
  • 51. S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions 51 Joint work with Minkyu Kim, Kyungmin Lee, and Jinwoo Shin (under review)
  • 52. • CLIP (contrastive language-image pre-training) has achieved remarkable success • It transfers to downstream tasks including zero-shot classification, image-text retrieval, etc. Motivation 52
  • 53. • However, CLIP fails on specialist domains such as remote sensing or fashion • CLIP is trained on web-crawled data, which may not contain information about specialist domains • Adapting CLIP to such domains is challenging since image-text pairs are hard to obtain Motivation 53 (a) Remote sensing (b) Fashion (c) Scientific figure (d) Comics
  • 54. • We aim to apply Semi-SL to CLIP by computing pseudo-labels (captions) for unlabeled images • However, naive pseudo-labeling fails since the captions of unlabeled images differ from those of labeled images • Recap: this is similar to robust Semi-SL for image classification (e.g., RoPAWS) Motivation 54
  • 55. • We propose two pseudo-labeling techniques for semi-supervised CLIP • Caption-level pseudo-label: • Assume the semantics of an unlabeled image are a combination of the given captions • Keyword-level pseudo-label: • Assume the unlabeled image shares a keyword with its nearest labeled image (the candidate keyword set 𝒦ᵤ is the subset of the pre-defined set 𝒦 whose keywords are contained in the caption of that nearest labeled image) Method 55
  • 56. • How to utilize the pseudo-labels? • CLIP solves two classification tasks: predict caption 𝑦 given image 𝑥, and image 𝑥 given caption 𝑦 Method 56
  • 57. • How to utilize the pseudo-labels? • The pseudo-labels extend the CLIP loss to unlabeled images (their relation to captions or keywords), as sketched below Method 57
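A minimal sketch of a CLIP-style loss extended with soft pseudo-labels for unlabeled images, as described above (my own sketch, not the S-CLIP code). Here `q` is an assumed (U, L) soft pseudo-label over the labeled captions, produced by the steps on the following slides.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_pseudo_labels(img_l, txt_l, img_u, q, tau=0.07):
    """
    img_l: (L, D) labeled image embeddings, txt_l: (L, D) caption embeddings
    img_u: (U, D) unlabeled image embeddings, q: (U, L) soft pseudo-labels
    """
    img_l, txt_l, img_u = (F.normalize(t, dim=-1) for t in (img_l, txt_l, img_u))
    logits_l = img_l @ txt_l.t() / tau                      # (L, L)
    targets_l = torch.arange(img_l.size(0), device=img_l.device)
    # Standard two-way CLIP loss on labeled pairs (image->text and text->image)
    loss_l = 0.5 * (F.cross_entropy(logits_l, targets_l)
                    + F.cross_entropy(logits_l.t(), targets_l))
    # Unlabeled images: cross-entropy against the soft pseudo-label over captions
    logits_u = img_u @ txt_l.t() / tau                      # (U, L)
    loss_u = -(q * logits_u.log_softmax(dim=-1)).sum(dim=-1).mean()
    return loss_l + loss_u
```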
  • 58. • Caption-level pseudo-labels • Find the optimal transport (OT) mapping between unlabeled images {𝑢} and labeled images {𝑥} • The cost function 𝐶 is given by their visual similarity (negative cosine similarity of embeddings) • We also tested a direct mapping between unlabeled images {𝑢} and labeled texts {𝑦}, but it performed worse • The pseudo-label 𝑞ᵢ for an unlabeled image 𝑢ᵢ is given by normalizing the transport plan Γ • This soft label guides the relation between an unlabeled image 𝑢ᵢ and the captions {𝑦} (see the sketch below) Method 58
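A hedged sketch of the caption-level pseudo-label: a plain Sinkhorn iteration with uniform marginals and a cost equal to the negative cosine similarity between unlabeled and labeled image embeddings. The entropic regularization `eps` and the number of iterations are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def caption_level_pseudo_label(z_u, z_l, eps=0.05, n_iters=100):
    """z_u: (U, D) unlabeled image embeddings, z_l: (L, D) labeled image embeddings."""
    z_u, z_l = F.normalize(z_u, dim=-1), F.normalize(z_l, dim=-1)
    cost = -z_u @ z_l.t()                               # (U, L) negative cosine similarity
    K = torch.exp(-cost / eps)                          # Gibbs kernel
    a = torch.full((z_u.size(0),), 1.0 / z_u.size(0), device=z_u.device)  # uniform marginal
    b = torch.full((z_l.size(0),), 1.0 / z_l.size(0), device=z_l.device)  # uniform marginal
    u = torch.ones_like(a)
    for _ in range(n_iters):                            # Sinkhorn updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    gamma = u[:, None] * K * v[None, :]                 # transport plan Γ of shape (U, L)
    q = gamma / gamma.sum(dim=1, keepdim=True)          # row-normalize -> pseudo-labels
    return q
```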
  • 59. • Keyword-level pseudo-labels • The nearest labeled image 𝑥∗ provides a candidate set of keywords 𝒦ᵤ that may appear in the unlabeled image 𝑢 • The candidate set is given by checking whether each keyword 𝑘 is contained in the caption of 𝑥∗ • Note that we assume 𝑥∗ and 𝑢 share a keyword, but which one is unknown • This forms a partial label learning (PLL) problem, where a candidate set (rather than the exact label) serves as supervision • To solve this, we define the pseudo-label 𝑞ᵢ as a sparse probability over the elements of 𝒦ᵤ • This soft label guides the relation between an unlabeled image 𝑢ᵢ and the keywords {𝑘} (a sketch follows) Method 59
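A hedged sketch of the keyword-level pseudo-label: the candidate set comes from the caption of the nearest labeled image, and the probability mass is restricted to that set. Distributing the mass by image-keyword similarity (rather than, say, uniformly) is an assumption for illustration, as are all the argument names.

```python
import torch
import torch.nn.functional as F

def keyword_level_pseudo_label(z_u, z_l, captions_l, keywords, kw_emb, tau=0.07):
    """
    z_u: (U, D) unlabeled image embeddings, z_l: (L, D) labeled image embeddings
    captions_l: list of L caption strings, keywords: list of K keyword strings
    kw_emb: (K, D) text embeddings of the keywords
    """
    z_u, z_l, kw_emb = (F.normalize(t, dim=-1) for t in (z_u, z_l, kw_emb))
    nearest = (z_u @ z_l.t()).argmax(dim=1)              # index of the nearest labeled image
    logits = z_u @ kw_emb.t() / tau                      # (U, K) image-keyword similarity
    q = torch.zeros_like(logits)
    for i, j in enumerate(nearest.tolist()):
        # candidate set: keywords contained in the caption of the nearest labeled image
        mask = torch.tensor([kw in captions_l[j] for kw in keywords],
                            dtype=torch.bool, device=logits.device)
        if mask.any():
            q[i, mask] = logits[i, mask].softmax(dim=-1)  # sparse probability over candidates
    return q
```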
  • 60. • Remote sensing results • S-CLIP significantly improves top-1 zero-shot accuracy • Both distribution-aligned (L=U) and distribution-shifted (L≠U) scenarios Experiments 60
  • 61. • Remote sensing results • S-CLIP also improves image-text retrieval - recall at K (R@K) Experiments 61
  • 62. • More results • S-CLIP works in various domains, such as fashion, scientific figures, and comics Experiments 62
  • 63. • Ablation studies • Justification of each design choice (e.g., caption-level and keyword-level pseudo-labels are complementary) Experiments 63
  • 64. • In this talk, I covered two topics on learning visual representations from uncurated data • Object-aware representations for contrastive learning and patch-based models • Robust semi-supervised learning for image classification and vision-language pre-training • CSI (NeurIPS’20) [1]: Can we use self-supervised learning for OOD detection? • RoPAWS (ICLR’23) [2]: Can we do semi-supervised learning when the unlabeled data contains OOD data? Final remarks 64