A Unified Framework for Computer Vision Tasks:
(Conditional) Generative Model is All You Need
2022.10.17.
Sangwoo Mo
1
• Prior works designed a specific algorithm for each computer vision task
Motivation
2
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of semantic segmentation algorithm
Motivation
3
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
4
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
5
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of instance segmentation algorithm
Motivation
6
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• However, such task-specific approaches are not desirable
• Humans probably do not use a different technique for each of these vision tasks
• Designing a new algorithm for every new task (e.g., keypoint detection) is inefficient and impractical
• Goal. Build a single unified framework that can solve all (or most) computer vision tasks
• Prediction is just an X (input) to Y (output) mapping
• One can generally use a conditional generative model to predict arbitrary Y
Motivation
7
• This talk follows the recent journey of Ting Chen (1st author of SimCLR)
1. Tasks with sparse outputs (e.g., detection = object-wise bboxes)
• Idea: Use an autoregressive model to predict discrete tokens (e.g., sequence of bboxes)
• Pix2seq: A Language Modeling Framework for Object Detection (ICLR’22)
• A Unified Sequence Interface for Vision Tasks (NeurIPS’22)
2. Tasks with dense outputs (e.g., segmentation = pixel-wise labels)
• Idea: Use a diffusion model to predict continuous outputs (e.g., segmentation maps)
• A Generalist Framework for Panoptic Segmentation of Images and Videos (submitted to ICLR’23)
Outline
8
• This talk follows the recent journey of Ting Chen (1st author of SimCLR)
Outline
9
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Training and inference work as in language modeling (maximum-likelihood training, stochastic decoding)
• Each object = {4 bbox coordinates + 1 class label}
• Each coordinate is quantized to n_bins values, hence the vocab size = n_bins + n_class + 1 (the +1 is for the [EOS] token)
Tasks with sparse outputs
10
CNN encoder + Transformer decoder
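The tokenization on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the values of N_BINS and N_CLASS, the (ymin, xmin, ymax, xmax) coordinate order, the token id layout (bins first, then classes, then [EOS]), and all function names are assumptions made for the example.

```python
import numpy as np

N_BINS = 1000                      # assumed number of coordinate bins
N_CLASS = 80                       # assumed number of object classes
EOS = N_BINS + N_CLASS             # assumed id of the [EOS] token
VOCAB_SIZE = N_BINS + N_CLASS + 1  # bins + classes + [EOS]


def quantize(coords, size):
    """Map continuous coordinates in [0, size] to integer bins in [0, N_BINS - 1]."""
    bins = np.floor(np.asarray(coords, dtype=np.float64) / size * N_BINS)
    return np.clip(bins, 0, N_BINS - 1).astype(np.int64)


def dequantize(tokens, size):
    """Map bin indices back to the coordinate at the center of each bin."""
    return (np.asarray(tokens, dtype=np.float64) + 0.5) / N_BINS * size


def objects_to_sequence(boxes, labels, img_size):
    """boxes: (ymin, xmin, ymax, xmax) tuples; labels: class ids in [0, N_CLASS)."""
    seq = []
    for box, label in zip(boxes, labels):
        seq.extend(quantize(box, img_size).tolist())  # 4 coordinate tokens
        seq.append(N_BINS + label)                    # 1 class token, offset past the bins
    seq.append(EOS)                                   # end of the object list
    return seq
```

Each object costs exactly 5 tokens, and the quantization error is at most half a bin, which is why a bin count close to the image size is enough for small objects.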
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Setting n_bins ≈ # of pixels is sufficient to detect small objects
Tasks with sparse outputs
11
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• Pix2Seq misses some objects because decoding stops early (the [EOS] token comes too soon)
• To avoid this, Pix2Seq pads each sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Specifically, sample the 4 coordinates of a random rectangle and assign it the “noise” class
Tasks with sparse outputs
12
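A rough sketch of the training-side augmentation described on this slide, under stated assumptions: the token id chosen for the "noise" class, MAX_OBJECTS, and the function name are hypothetical. Giving the synthetic coordinates zero loss weight is also an assumption of this sketch (so the model is not trained to reproduce random rectangles), not something stated on the slide.

```python
import random

N_BINS = 1000                 # assumed number of coordinate bins
N_CLASS = 80                  # assumed number of real classes
NOISE = N_BINS + N_CLASS + 1  # hypothetical id for the "noise" class (placed after [EOS])
MAX_OBJECTS = 100             # assumed fixed number of objects per training sequence


def augment_sequence(objects, rng=random):
    """objects: list of [y0, x0, y1, x1, class_token] for the real objects.

    Returns (tokens, weights): the flat target sequence padded with synthetic
    "noise" objects up to MAX_OBJECTS, and a per-token loss weight.
    """
    objects = [list(obj) for obj in objects[:MAX_OBJECTS]]
    n_real = len(objects)
    while len(objects) < MAX_OBJECTS:
        # Random rectangle: two sorted draws per axis give y0 <= y1 and x0 <= x1.
        y0, y1 = sorted(rng.randrange(N_BINS) for _ in range(2))
        x0, x1 = sorted(rng.randrange(N_BINS) for _ in range(2))
        objects.append([y0, x0, y1, x1, NOISE])
    tokens = [tok for obj in objects for tok in obj]
    n_noise = MAX_OBJECTS - n_real
    # Assumption: no loss on synthetic coordinates, full loss on the "noise" label.
    weights = [1.0] * (5 * n_real) + [0.0, 0.0, 0.0, 0.0, 1.0] * n_noise
    return tokens, weights
```

With every training sequence padded to the same number of objects, the model no longer learns to emit [EOS] early; it learns to label implausible boxes as "noise" instead.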
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• To avoid early stopping, Pix2Seq pads each sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Then, the model decodes a fixed # of objects, replacing each “noise” prediction with the most likely real class
• Sequence augmentation significantly improves the detection performance
• IMO, this trick could also be used in open-set scenarios (getting bboxes of unknown objects)
Tasks with sparse outputs
13
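The inference-side step (replacing "noise" with the most likely class) might look like the sketch below. The vocabulary layout and the function name are assumptions for illustration. Using the selected class probability as the detection score is a natural choice for ranking the fixed number of decoded boxes, but treat it as an assumption here.

```python
import numpy as np

N_BINS, N_CLASS = 1000, 80  # assumed vocabulary layout
CLASS_START = N_BINS        # real class tokens occupy [N_BINS, N_BINS + N_CLASS)


def pick_real_class(logits):
    """logits: (vocab_size,) logits at a class-token position.

    Returns (class_id, score), considering only real class tokens, so that
    "noise" and [EOS] can never be selected.
    """
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                       # subtract the max for a stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    class_probs = probs[CLASS_START:CLASS_START + N_CLASS]
    class_id = int(class_probs.argmax())
    return class_id, float(class_probs[class_id])
```

A box whose true label would have been "noise" still gets a real class, but with a low score, so it can be filtered by thresholding afterwards.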
• Pix2Seq
• Experimental results
• Pix2Seq performs comparably to Faster R-CNN and DETR
• Pix2Seq scales with model size and input resolution
Tasks with sparse outputs
14
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• A single model solves detection, segmentation, and captioning by controlling the input prompt
Tasks with sparse outputs
15
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• Object detection → same as before
• Captioning → obvious
• Instance segmentation & keypoint detection → condition on each object bbox
• Seg mask → predict polygon
• Keypoint → predict seq. of points {4 coordinates + keypoint label}
• The paper lacks explanation, but I guess one needs a two-stage approach for instance segmentation (get bboxes first, then predict the mask by conditioning)
Tasks with sparse outputs
16
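To make the "prompt + output sequence" idea concrete, here is a minimal sketch for instance segmentation, in line with the two-stage guess above. Everything specific is hypothetical: the task token ids, the (y, x) point order, and the function names are invented for illustration and are not taken from the paper.

```python
N_BINS = 1000  # assumed number of coordinate bins

# Hypothetical task-prompt tokens, placed after the coordinate bins.
TASK = {"detect": N_BINS, "segment": N_BINS + 1, "keypoint": N_BINS + 2, "caption": N_BINS + 3}


def quantize(value, size):
    """Map a coordinate in [0, size] to a bin index in [0, N_BINS - 1]."""
    return min(N_BINS - 1, int(value / size * N_BINS))


def segmentation_example(bbox, polygon, img_size):
    """bbox: (y0, x0, y1, x1); polygon: list of (y, x) vertices of the mask outline.

    Returns (prompt, target): the prompt selects the task and the object,
    and the target is the polygon flattened into coordinate tokens.
    """
    prompt = [TASK["segment"]] + [quantize(c, img_size) for c in bbox]
    target = [quantize(c, img_size) for point in polygon for c in point]
    return prompt, target
```

The same decoder can then serve all tasks: only the prompt changes, while the output is always a sequence of tokens from one shared vocabulary.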
• Pix2Seq – Multi-task
• Experimental results
• This unified framework works for various problems
Tasks with sparse outputs
17
• Pix2Seq-𝒟 (dense)
• Transformers can predict sparse outputs but are not suitable for dense outputs (e.g., pixel-wise segmentation)
• Instead, one can use a diffusion model to generate the mask from the image
Tasks with dense outputs
18
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask from the image
• Condition on the image and the previous mask to predict the next mask
Tasks with dense outputs
19
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask from the image
• However, segmentation masks take discrete values (pixel-wise classification), so how should the diffusion be defined?
• The authors use Bit Diffusion, which converts the discrete values into binary bits and applies continuous diffusion
Tasks with dense outputs
20
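The bit trick can be sketched as below. This only illustrates the encoding and decoding around the diffusion model, not the model itself. N_BITS, the scaling of bits to {-1, +1}, the function names, and the simple Gaussian forward step are assumptions of this sketch.

```python
import numpy as np

N_BITS = 8  # assumed: enough bits for up to 256 distinct labels


def int_to_analog_bits(labels, n_bits=N_BITS):
    """(H, W) integer labels -> (H, W, n_bits) real-valued bits in {-1, +1}."""
    labels = np.asarray(labels, dtype=np.int64)
    bits = (labels[..., None] >> np.arange(n_bits)) & 1  # least-significant bit first
    return bits.astype(np.float32) * 2.0 - 1.0


def analog_bits_to_int(x):
    """Threshold the real-valued bits at 0 and reassemble the integer labels."""
    bits = (np.asarray(x) > 0).astype(np.int64)
    return (bits << np.arange(bits.shape[-1])).sum(axis=-1)


def add_noise(x0, alpha_bar, rng):
    """Gaussian forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```

Because the bits are treated as real numbers, an ordinary continuous diffusion model can be trained on them, and the discrete mask is recovered by thresholding the final sample.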
• Pix2Seq-𝒟 (dense)
• Experimental results
• Works, but worse than task-specific models such as Mask DINO
Tasks with dense outputs
21
• TL;DR. Simple autoregressive or diffusion models can solve a large class of computer vision problems
• Discussion. General vs. task-specific algorithm design
• Currently, task-specific algorithms usually perform better by leveraging the structure of the task
• However, a general-purpose algorithm may implicitly learn the structure of the task from data
• E.g., ViT learns the spatial structure of images, such as translation equivariance
• I believe the model should reflect the task structure in some way, either explicitly or implicitly
• From this perspective, I think there are three directions for designing algorithms:
1. Keep designing task-specific algorithms (short-term goal before AGI comes)
2. Make the general-purpose model learn the task structure better (e.g., SeqAug)
3. Analyze the structure learned by the general-purpose model (e.g., [1])
Discussion
22
[1] The Lie Derivative for Measuring Learned Equivariance → Analyze the equivariance learned by ViT
Thank you for listening! 😀
23
