Brief History of Visual Representation Learning
2023.08.16.
KAIST / University of Michigan
Sangwoo Mo
1
• [2012-2015] Evolution of deep learning architectures
• [2016-2019] Learning paradigms for diverse tasks
• [2020-current] Scaling laws and foundation models
Outline of the talk
2
Caveat: These eras are not mutually exclusive
• What constitutes artificial intelligence (AI)?
• AI should replicate human actions like seeing, writing, listening, and manipulating
What is visual representation learning?
3
Computer vision Natural language Speech Robotics
Image from https://unsplash.com
• Learning general visual representations
• Solve various downstream tasks such as classification and segmentation
What is visual representation learning?
4
Image from https://www.youtube.com/watch?v=taC5pMCm70U
• There has been remarkable progress in the last 10 years
What is visual representation learning?
5
Image from https://paperswithcode.com
6
[2012-2015] Evolution of deep learning architectures
• 2012: AlexNet opened the era of deep learning for computer vision
• Significantly outperformed its shallow competitors
[2012-2015] Evolution of deep learning architectures
7
Image from https://www.pinecone.io/learn/series/image-search/imagenet/
• 2013-: Golden era for architecture design
• 2013 ZFNet, 2014 VGG-Net and GoogLeNet
[2012-2015] Evolution of deep learning architectures
8
Image from https://devopedia.org/imagenet
• 2015: ResNet exceeded human-level performance on ImageNet classification – skip connections are the key to its success!
• ResNet is still actively used in 2023 (though SOTA is ViT families)
[2012-2015] Evolution of deep learning architectures
9
Image from https://devopedia.org/imagenet
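The skip connection mentioned for ResNet above can be illustrated with a minimal sketch. It is not the ResNet implementation: the residual branch is passed in as a plain function, and `zero_branch` and `stack` are hypothetical helpers introduced only for this example. The point is that a block computes `x + f(x)`, so when the branch outputs zero the block reduces to the identity and a deep stack passes its input through unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Branch = Callable[[Vector], Vector]


def residual_block(x: Vector, f: Branch) -> Vector:
    """Return x + f(x): the skip connection adds the input back to the branch output."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input shape")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x: Vector) -> Vector:
    """Hypothetical branch with all-zero weights, so it always outputs zeros."""
    return [0.0 for _ in x]


def stack(x: Vector, branches: Sequence[Branch]) -> Vector:
    """Apply a sequence of residual blocks one after another."""
    for f in branches:
        x = residual_block(x, f)
    return x


# Fifty blocks with zero branches behave as one identity map.
deep_output = stack([1.0, -2.0], [zero_branch] * 50)
```

In a real network the branch would be a small stack of convolution, normalization, and activation layers; only the additive structure is shown here.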
• 2015: BatchNorm opened the exploration of normalization layers
• Reparameterization eases optimization¹
• (Side note) BatchNorm was essential in early ConvNets, but has some side effects (e.g., adversarial vulnerability); recent architectures like ViT and ConvNeXt use LayerNorm instead
[2012-2015] Evolution of deep learning architectures
10
1. See “How Does Batch Normalization Help Optimization?” (2018)
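As a rough illustration of how the two normalization layers named above differ, the sketch below normalizes a (batch, features) matrix along each axis. It is a simplification under stated assumptions: the learnable scale and shift and BatchNorm's running statistics are omitted, and `_normalize` is a helper introduced for this example. BatchNorm computes statistics per feature across the batch, so one sample's output depends on the other samples; LayerNorm computes statistics per sample across its features.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # shape: (batch, features)


def _normalize(values: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(x: Matrix) -> Matrix:
    """Normalize each feature column using statistics over the batch."""
    columns = [_normalize(col) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]


def layer_norm(x: Matrix) -> Matrix:
    """Normalize each sample (row) using statistics over its own features."""
    return [_normalize(row) for row in x]


# Two batches that share the first two samples and differ only in the third.
batch_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
batch_b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [70.0, 80.0, 100.0]]
```

With these inputs, `layer_norm` gives the same output for the first sample in both batches, while `batch_norm` does not, because the third sample changes the batch statistics.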
• What happened after 2015?
• New architectures are still being proposed (2016 WideResNet, 2017 ResNeXt and DenseNet, etc.)
• However, the interests of the community have moved beyond ImageNet (challenge ended in 2017)
• Instead, researchers explored diverse tasks and learning paradigms in 2016-
• Few-shot learning, continual learning, domain adaptation, etc.
[2012-2015] Evolution of deep learning architectures
11
Images from https://www.youtube.com/watch?v=hE7eGew4eeg and https://mila.quebec/en/article/la-maml-look-ahead-meta-learning-for-continual-learning/
• (Side note) Architectures after 2015
• Convolutional: 2016 WideResNet, 2017 ResNeXt and DenseNet
• Convolutional + attention: 2018 SENet and CBAM – Inspired by 2017 Transformer (self-attention)
• Some automated designs using neural architecture search (NAS) – 2017 NAS, 2019 EfficientNet, etc.
• EfficientNet found an architecture that scales well (will be revisited in the “scaling laws” part)
[2012-2015] Evolution of deep learning architectures
12
• (Side note) Architectures after 2015
• In 2021, Vision Transformer (ViT) changed the landscape, scaling better than ConvNets
• Followed by hybrid models like Swin-T, or patch-based models like MLP-Mixer
• Some folks (e.g., Yann LeCun) still believe convolution is essential for image recognition
• Currently, there are two philosophies
• Pure ViT: Use vanilla Transformer, same as other modalities like language
• Hybrid model: Combine ConvNet and ViT, specialized to the image modality
• Scaling pure ViT is more popular¹ now, but time will tell
[2012-2015] Evolution of deep learning architectures
13
1. See “Scaling Vision Transformers to 22 Billion Parameters” (2023)
• (Side note) Architectures beyond classification
• Need specialized modules for image recognition beyond classification (e.g., object detection/segmentation)
• These modules also evolved rapidly in this era – 2013 R-CNN, 2015 Faster R-CNN, 2017 Mask R-CNN, etc.
• Recent efforts aim to simplify such modules (e.g., 2020 DETR, 2021 MaskFormer), or even to create a single model that solves all tasks universally (e.g., 2022 pix2seq, 2023 pix2seq-D)
[2012-2015] Evolution of deep learning architectures
14
15
[2016-2019] Learning paradigms for diverse tasks
• 2016-2019: Explosion of various tasks and task-wise learning paradigms
• Few-shot learning: 2016 MatchingNet, 2017 ProtoNet, etc.
• Meta learning: 2016 Learn2Learn, 2017 MAML, 2018 NP, etc.
• Continual learning: 2016 ProgressiveNet and LwF, 2017 EWC, etc.
• Self-supervised learning: 2016 Jigsaw and Colorization, 2018 RotNet, etc.
• Semi-supervised learning: 2015 VAT, 2017 Temporal Ensemble, 2018 Mean Teacher, etc.
• Domain adaptation: 2015 DAN and DANN, 2016 RTN, 2017 ADDA, etc.
• Knowledge distillation: 2014 KD and FitNet, 2017 Attention Transfer, etc.
• Data imbalance: 2017 Focal Loss, 2018 Learn2Reweight, etc.
• Noisy labels: 2017 Decouple, 2018 MentorNet and Co-teaching, etc.
• …but not limited to (e.g., {zero-shot, multi-task, active} learning, domain generalization)
[2016-2019] Learning paradigms for diverse tasks
16
• 2019: Big Transfer (BiT), or “Is a good backbone all you need?”
• Large model + big data = universally good performance on diverse tasks
• Performs well for transfer and few-shot settings (no specialized method is needed)
[2016-2019] Learning paradigms for diverse tasks
17
• How to train the backbone?
• Supervised training (e.g., use JFT-300M) – 2017
• Weakly-supervised training (e.g., use Instagram-1B) – 2018
• + Self-training (noisy student) to boost performance – 2020
• …however, collecting (weakly-)supervised data is expensive!
→ 2020-: Use large-scale unlabeled data by self-supervised learning
• (+1) The backbone trained by self-sup is robust to data imbalance and noisy labels
[2016-2019] Learning paradigms for diverse tasks
18
Cherry (RL) – e.g., RLHF
Icing (sup) – e.g., instruction fine-tuning
Cake (self-sup)¹ – e.g., language modeling
1. Yann LeCun’s cake analogy
19
[2020-current] Scaling laws and foundation models
• 2019: EfficientNet and BiT suggest scaling laws for computer vision (similar to NLP¹)
• Larger models, bigger data, and more compute give consistent performance gains
• NLP scales model and data using Transformer models and language-modeling self-supervision
→ How to design computer vision models and data to be scaled effectively?
[2020-current] Scaling laws and foundation models
20
1. See “Scaling Laws for Neural Language Models” (2020)
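Scaling laws of the kind cited above are commonly expressed as power laws, for example loss ≈ a · N^(−α) for model size N, which appear as straight lines on a log-log plot. The sketch below fits such a line by ordinary least squares. The function name `fit_power_law` and all numbers are assumptions made for illustration; the data points are synthetic, not measurements from any paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Fit log(loss) = log(a) - alpha * log(size) by least squares; return (a, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from loss = 5.0 * N^(-0.1); illustrative only.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
```

On real measurements the fit is only approximate, and how far the fitted line can be extrapolated is an empirical question.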
• 2020: MoCo and SimCLR opened the era of self-supervised learning (joint embedding methods)
• Create two views (e.g., by data augmentation) of an unlabeled image and make their features similar
• Originally called contrastive learning, but followed up by non-contrastive methods like SwAV and BYOL
• Limitation: It enables data scaling, but performance saturates since ConvNet does not scale well
[2020-current] Scaling laws and foundation models
21
Image from http://aidev.co.kr/deeplearning/8968
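The joint embedding idea above (make the features of two views of the same image similar) is often trained with a contrastive loss. The sketch below is a simplified, one-directional InfoNCE-style loss over precomputed feature vectors. It is not the exact SimCLR or MoCo objective, which add details such as symmetric terms, extra negatives, or a momentum queue. The function names and the temperature value are assumptions, and feature vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a: List[Vector], views_b: List[Vector], temperature: float = 0.1) -> float:
    """Mean cross-entropy where views_a[i] must pick views_b[i] among all of views_b."""
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)


# Toy features: row i of each list stands for one view of image i.
features_a = [[1.0, 0.0], [0.0, 1.0]]
features_b_aligned = [[0.9, 0.1], [0.1, 0.9]]
features_b_swapped = [[0.1, 0.9], [0.9, 0.1]]
```

The loss is small when matching views have similar features (`features_b_aligned`) and large when they do not (`features_b_swapped`). Non-contrastive methods such as BYOL drop the negative pairs in the denominator.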
• 2021: Vision Transformer (ViT) scales better than ConvNets in a supervised setup (and has been scaled up further¹,²)
• Divide an image into patches and treat them as tokens (or words) of a Transformer
• Hope: ViT shows new potential for model scaling in computer vision; could it be combined with self-supervision?
[2020-current] Scaling laws and foundation models
22
1. “Scaling Vision Transformers” (2022)
2. “Scaling Vision Transformers to 22 Billion Parameters” (2023)
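The patch-to-token step described above can be sketched as follows. This is a minimal illustration on nested lists, not the ViT implementation: the learned linear projection, class token, and position embeddings that follow are omitted, and `patchify` is a name chosen for this example. For a 224×224 RGB image and 16×16 patches, it yields 196 tokens of 768 values each.

```python
from typing import List

Image = List[List[List[float]]]  # shape: (height, width, channels)


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened to a token."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                value
                for row in range(top, top + patch)
                for col in range(left, left + patch)
                for value in image[row][col]
            ]
            tokens.append(token)
    return tokens


# A 4x4 single-channel image whose pixel values are 0..15 in row-major order.
tiny_image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tiny_tokens = patchify(tiny_image, patch=2)
```

Tokens are emitted in row-major order over the patch grid, which is the order position embeddings would later index.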
• 2021: DINO and MoCo-v3 combine joint embedding methods with ViT
• It works okay, and provides nice properties, such as unsupervised object discovery from attention
• Limitation: Joint embedding for ViT does not scale well, unlike supervised learning
[2020-current] Scaling laws and foundation models
23
• 2022: Masked Autoencoder (MAE)¹ offers a new form of self-supervision for scaling ViT (and other networks²,³)
• Masked language modeling (e.g., BERT) scales well in NLP → Apply a similar idea to ViT
• Current SOTA: MAE scales well with model size, but not with data size (also reported for MAE-CLIP⁴)
[2020-current] Scaling laws and foundation models
24
1. Also called masked image modeling (MIM)
2. “ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders” (2023)
3. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles” (2023)
4. “Masked Autoencoding Does Not Help Natural Language Supervision at Scale” (2023)
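The masking step behind masked image modeling can be sketched as below: hide a random subset of patch tokens, encode only the visible ones, and train the model to reconstruct the hidden ones. The sketch covers only the index selection. The function name `random_masking` and the seed argument are assumptions for illustration; the default ratio of 0.75 follows the high masking ratio used in the MAE paper.

```python
import random
from typing import List, Optional, Tuple


def random_masking(
    num_patches: int, mask_ratio: float = 0.75, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Return (visible, masked) patch indices after hiding a random fraction of patches."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_visible = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_visible])  # tokens the encoder would process
    masked = sorted(order[num_visible:])   # tokens the decoder would reconstruct
    return visible, masked


# 196 patches corresponds to a 224x224 image with 16x16 patches.
visible, masked = random_masking(196, mask_ratio=0.75, seed=0)
```

Because the encoder processes only the visible quarter of the tokens, pre-training is cheaper than running it on every patch.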
MAE is not effective when combined with CLIP
… by the way, what is CLIP? (next page)
• 2023-: (Learning paradigm) Extension to multimodal models
• In 2021, CLIP suggested an alternative way to train vision encoder using (web-collected) image-text pairs
• The CLIP backbone is generally superior to the self-supervised backbone for natural images
• Next direction: How to scale up CLIP? Can we further harness unlabeled data through self-supervision?
• LAION-2B (image-text dataset) is big but still limited compared to unlabeled data
[2020-current] Scaling laws and foundation models
25
• 2023-: (Learning paradigm) Combination with generative models
• Generative modeling is one natural way for self-supervision – 2014 Semi+VAE, 2019 BigBiGAN, etc.
• Recently, (text-to-image) diffusion models have shown their effectiveness¹ in low-level vision understanding
• Next direction: Can we combine diffusion models and representation learning in a single framework?
• Stable Diffusion can be an alternative to (dense) CLIP for learning low-level representations
[2020-current] Scaling laws and foundation models
26
1. See “Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models” for example
• 2023-: (Architecture design) Towards better scaling laws
• Should we use Pure ViT, or hybrid models (Swin-T, ConvNeXt) that incorporate visual inductive biases?
• In NLP, pure(-like) Transformers have succeeded in scaling up¹, while efficient variants have failed²
• Next direction: Check the scaling laws of vision architectures at scales matching large language models (LLMs)
• ViT (22B params) is still much smaller than GPT-4 (reportedly about 1.7T params; not officially disclosed)
[2020-current] Scaling laws and foundation models
27
1. “Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?” (2022)
2. Recent works claim they can, such as “Retentive Network: A Successor to Transformer for Large Language Models” (2023)
• Roughly speaking, there are two major topics in visual representation learning
• Architecture design:
• [2012-2015] Evolution of convolutional architectures
• [2016-2020] Attention, NAS, EfficientNet, and scaling laws
• [2021-current] Scaling up ViT and hybrid models (Swin-T, ConvNeXt)
• [Current-] Can ViT (or hybrid models) match the scale of LLM?
• Learning paradigm:
• [2016-2019] Various tasks and task-wise learning paradigms
• [2020-current] Self-supervised and multimodal learning for foundation models
• [Current-] Scale up CLIP (or Diffusion) and combine it with self-supervision?
Summary
28

 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 

Recently uploaded (20)

Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

Brief History of Visual Representation Learning

  • 1. Brief History of Visual Representation Learning 2023.08.16. KAIST / University of Michigan Sangwoo Mo 1
  • 2. • [2012-2015] Evolution of deep learning architectures • [2016-2019] Learning paradigms for diverse tasks • [2020-current] Scaling laws and foundation models Outline of the talk 2 Caveat: These eras are not mutually exclusive
  • 3. • What constitutes artificial intelligence (AI)? • AI should replicate human actions like seeing, writing, listening, and manipulating What is visual representation learning? 3 Computer vision Natural language Speech Robotics Image from https://unsplash.com
  • 4. • Learning general visual representations • Solve various downstream tasks such as classification and segmentation What is visual representation learning? 4 Image from https://www.youtube.com/watch?v=taC5pMCm70U
  • 5. • There has been remarkable progress in the last 10 years What is visual representation learning? 5 Image from https://paperswithcode.com
  • 6. 6 [2012-2015] Evolution of deep learning architectures
  • 7. • 2012: AlexNet opened the era of deep learning for computer vision • Significantly outperforms the shallow competitors [2012-2015] Evolution of deep learning architectures 7 Image from https://www.pinecone.io/learn/series/image-search/imagenet/
  • 8. • 2013-: Golden era for architecture design • 2013 ZFNet, 2014 VGG-Net and GoogLeNet [2012-2015] Evolution of deep learning architectures 8 Image from https://devopedia.org/imagenet
  • 9. • 2015: ResNet exceeded human performance on ImageNet – skip connection is the key to success! • ResNet is still actively used in 2023 (though the SOTA is the ViT family) [2012-2015] Evolution of deep learning architectures 9 Image from https://devopedia.org/imagenet
  • 10. • 2015: BatchNorm opened the exploration of normalization layers • Reparameterization eases optimization [1] • (Side note) BatchNorm was essential in early ConvNets, but has some side effects (e.g., adversarially vulnerable); recent architectures like ViT and ConvNeXt use LayerNorm instead [2012-2015] Evolution of deep learning architectures 10 1. See “How Does Batch Normalization Help Optimization?” (2018)
  • 11. • What happened after 2015? • New architectures are still being proposed (2016 WideResNet, 2017 ResNeXt and DenseNet, etc.) • However, the interests of the community have moved beyond ImageNet (challenge ended in 2017) • Instead, researchers explored diverse tasks and learning paradigms in 2016- • Few-shot learning, continual learning, domain adaptation, etc. [2012-2015] Evolution of deep learning architectures 11 Images from https://www.youtube.com/watch?v=hE7eGew4eeg and https://mila.quebec/en/article/la-maml-look-ahead-meta-learning-for-continual-learning/
  • 12. • (Side note) Architectures after 2015 • Convolutional: 2016 WideResNet, 2017 ResNeXt and DenseNet • Convolutional + attention: 2018 SENet and CBAM – Inspired by 2017 Transformer (self-attention) • Some automated designs using neural architecture search (NAS) – 2017 NAS, 2019 EfficientNet, etc. • EfficientNet found an architecture that scales well (will be revisited in the “scaling laws” part) [2012-2015] Evolution of deep learning architectures 12
  • 13. • (Side note) Architectures after 2015 • In 2021, Vision Transformer (ViT) changed the landscape, scaling better than ConvNets • Followed by hybrid models like Swin-T, or patch-based models like MLP-Mixer • Some folks (e.g., Yann LeCun) still believe convolution is essential for image recognition • Currently, there are two philosophies • Pure ViT: Use vanilla Transformer, same as other modalities like language • Hybrid model: Combine ConvNet and ViT, specialized to the image modality • Scaling pure ViT is more popular [1] now, but time will tell [2012-2015] Evolution of deep learning architectures 13 1. See “Scaling Vision Transformers to 22 Billion Parameters” (2023) vs.
  • 14. • (Side note) Architectures beyond classification • Image recognition beyond classification (e.g., object detection/segmentation) needs specialized modules • These modules also evolved rapidly in this era – 2013 R-CNN, 2015 Faster R-CNN, 2017 Mask R-CNN, etc. • Recent efforts are aimed at simplifying such modules (e.g., 2020 DETR, 2021 MaskFormer), or even creating a single model that solves all tasks universally (e.g., 2022 pix2seq, 2023 pix2seq-D) [2012-2015] Evolution of deep learning architectures 14
  • 16. • 2016-2019: Explosion of various tasks and task-wise learning paradigms • Few-shot learning: 2016 MatchingNet, 2017 ProtoNet, etc. • Meta learning: 2016 Learn2Learn, 2017 MAML, 2018 NP, etc. • Continual learning: 2016 ProgressiveNet and LwF, 2017 EWC, etc. • Self-supervised learning: 2016 Jigsaw and Colorization, 2018 RotNet, etc. • Semi-supervised learning: 2015 VAT, 2017 Temporal Ensemble, 2018 Mean Teacher, etc. • Domain adaptation: 2015 DAN and DANN, 2016 RTN, 2017 ADDA, etc. • Knowledge distillation: 2014 KD and FitNet, 2017 Attention Transfer, etc. • Data imbalance: 2017 Focal Loss, 2018 Learn2Reweight, etc. • Noisy labels: 2017 Decouple, 2018 MentorNet and Co-teaching, etc. • …but not limited to these (e.g., {zero-shot, multi-task, active} learning, domain generalization) [2016-2019] Learning paradigms for diverse tasks 16
  • 17. • 2019: Big Transfer (BiT), or “Is a good backbone all you need?” • Large model + big data = universally good performance on diverse tasks • Performs well for transfer and few-shot settings (no specialized method is needed) [2016-2019] Learning paradigms for diverse tasks 17
  • 18. • How to train the backbone? • Supervised training (e.g., use JFT-300M) – 2017 • Weakly-supervised training (e.g., use Instagram-1B) – 2018 • + Self-training (noisy student) to boost performance – 2020 • …however, collecting (weakly-)supervised data is expensive! → 2020-: Use large-scale unlabeled data by self-supervised learning • (+1) The backbone trained by self-sup is robust to data imbalance and noisy labels [2016-2019] Learning paradigms for diverse tasks 18 Cherry (RL) – e.g., RLHF Icing (sup) – e.g., instruction fine-tuning Cake (self-sup) [1] – e.g., language modeling 1. Yann LeCun’s cake analogy
  • 19. 19 [2020-current] Scaling laws and foundation models
  • 20. • 2019: EfficientNet and BiT suggest scaling laws for computer vision (similar to NLP [1]) • Larger model, bigger data, and more compute give consistent performance gains • In NLP, the Transformer architecture and language-modeling self-supervision are used to scale model and data → How to design computer vision models and data to be scaled effectively? [2020-current] Scaling laws and foundation models 20 1. See “Scaling Laws for Neural Language Models” (2020)
  • 21. • 2020: MoCo and SimCLR opened the era of self-supervised learning (joint embedding method) • Create two views (e.g., by data augmentation) of an unlabeled image and make their features similar • Originally called contrastive learning, but followed by non-contrastive methods like SwAV and BYOL • Limitation: It enables data scaling, but performance saturates since ConvNets do not scale well [2020-current] Scaling laws and foundation models 21 Image from http://aidev.co.kr/deeplearning/8968
  • 22. • 2021: Vision Transformer (ViT) scales better than ConvNet in a supervised setup (and was further scaled up [1, 2]) • Divide an image into patches and treat them as tokens (or words) of a Transformer • Hope: ViT shows a new potential of model scaling in computer vision, and could be combined with self-sup? [2020-current] Scaling laws and foundation models 22 1. “Scaling Vision Transformers” (2022) 2. “Scaling Vision Transformers to 22 Billion Parameters” (2023)
  • 23. • 2021: DINO and MoCo-v3 combine the joint embedding method with ViT • It works okay, and provides nice properties, such as unsupervised object discovery from attention • Limitation: However, joint embedding for ViT does not scale well, unlike supervised learning [2020-current] Scaling laws and foundation models 23
  • 24. • 2022: Masked Autoencoder (MAE) [1] offers a new self-supervision for scaling ViT (and other nets [2, 3]) • Masked language modeling (e.g., BERT) scales well in NLP → Apply similar idea to ViT • Current SOTA: MAE scales well with model size, but not with data size (also reported in MAE-CLIP [4]) [2020-current] Scaling laws and foundation models 24 1. Also called masked image modeling (MIM) 2. “ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders” (2023) 3. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles” (2023) 4. “Masked Autoencoding Does Not Help Natural Language Supervision at Scale” (2023) MAE is not effective when combined with CLIP … by the way, what is CLIP? (next page)
  • 25. • 2023-: (Learning paradigm) Extension to multimodal models • In 2021, CLIP suggested an alternative way to train a vision encoder using (web-collected) image-text pairs • The CLIP backbone is generally superior to the self-supervised backbone for natural images • Next direction: How to scale up CLIP? Can we further harness unlabeled data through self-supervision? • LAION-2B (image-text dataset) is big but still limited compared to unlabeled data [2020-current] Scaling laws and foundation models 25
  • 26. • 2023-: (Learning paradigm) Combination with generative models • Generative modeling is one natural way for self-supervision – 2014 Semi+VAE, 2019 BigBiGAN, etc. • Recently, (text-to-image) diffusion models have shown their effectiveness [1] in low-level vision understanding • Next direction: Can we combine diffusion models and representation learning in a single framework? • Stable Diffusion can be an alternative to (dense) CLIP for learning low-level representations [2020-current] Scaling laws and foundation models 26 1. See “Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models” for example
  • 27. • 2023-: (Architecture design) Towards better scaling laws • Should we use pure ViT, or hybrid models (Swin-T, ConvNeXt) that incorporate visual inductive biases? • In NLP, pure(-like) Transformers have succeeded in scaling up [1], while efficient variants have failed [2] • Next direction: Check scaling law of vision architectures, matching the scale with large language models (LLMs) • ViT (22B params) is still much smaller than GPT-4 (rumored 1.7T params; not officially disclosed) [2020-current] Scaling laws and foundation models 27 1. “Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?” (2022) 2. Recent works claim they can, such as “Retentive Network: A Successor to Transformer for Large Language Models” (2023)
  • 28. • Roughly speaking, there are two major topics in visual representation learning • Architecture design: • [2012-2015] Evolution of convolutional architectures • [2016-2020] Attention, NAS, EfficientNet, and scaling laws • [2021-current] Scaling up ViT and hybrid models (Swin-T, ConvNeXt) • [Current-] Can ViT (or hybrid models) match the scale of LLM? • Learning paradigm: • [2016-2019] Various tasks and task-wise learning paradigms • [2020-current] Self-supervised and multimodal learning for foundation models • [Current-] Scale up CLIP (or Diffusion) and combine it with self-supervision? Summary 28
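
As a rough illustration of the joint embedding idea on slide 21 (create two views of an unlabeled image and make their features similar), here is a minimal NumPy sketch of an InfoNCE-style contrastive loss. It is a simplified, one-directional variant and not the exact objective of MoCo or SimCLR. The function name `info_nce_loss`, the temperature value, and the random arrays standing in for encoder features are all assumptions made for this example.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Simplified InfoNCE loss between two batches of view features.

    z1, z2: arrays of shape (N, D); row i of z1 and row i of z2 are
    assumed to be features of two views of the same image.
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = z1 @ z2.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching index as the target class.
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical stand-ins for encoder outputs (no real encoder is used here).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
aligned = info_nce_loss(features, features + 0.01 * rng.normal(size=(8, 16)))
shuffled = info_nce_loss(features, rng.normal(size=(8, 16)))
```

The loss is low when matching rows are similar and other rows are not, so `aligned` should come out well below `shuffled`. Non-contrastive methods such as BYOL drop the negative pairs and need other mechanisms to avoid collapse, which this sketch does not cover.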
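
Slide 22 describes ViT as dividing an image into patches and treating them as tokens. The sketch below shows only that tokenization step in NumPy, under stated assumptions: the helper name `patchify`, the 224x224 input, and the 16x16 patch size are illustrative choices, and the learned linear projection and position embeddings that follow in a real ViT are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (raster order).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split both spatial axes into (grid index, offset within patch).
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes together: (gh, gw, patch, patch, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flat vector ("token") per patch.
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Illustrative input: a synthetic 224x224 RGB image.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)
```

With these sizes the image becomes a sequence of 196 tokens of dimension 768, which a Transformer can then process much like a sentence of words.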