Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
The document discusses the application of transformers to computer vision tasks. It first introduces the standard transformer architecture and its use in natural language processing. It then summarizes recent works on applying transformers to object detection (DETR) and image classification (ViT). DETR proposes an end-to-end object detection method using a CNN-Transformer encoder-decoder architecture. Deformable DETR improves on DETR by incorporating deformable attention mechanisms. ViT represents images as sequences of patches and applies a standard Transformer encoder for image recognition, exceeding state-of-the-art models with less pre-training computation. While promising results have been achieved, challenges remain regarding model parameters and expanding transformer applications to other computer vision tasks.
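A minimal sketch of the CNN-backbone-plus-transformer encoder-decoder pattern that DETR uses, assuming PyTorch and torchvision; layer sizes, the query count, and the heads are illustrative, and DETR's positional encodings and bipartite-matching loss are omitted:

```python
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    """Toy DETR-style detector: CNN features -> transformer -> per-query class/box heads."""
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)   # in practice, pretrained weights
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # project 2048-d features
        self.transformer = nn.Transformer(d_model=d_model, num_encoder_layers=6,
                                          num_decoder_layers=6, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.bbox_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)

    def forward(self, images):                                   # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))           # (B, d, H/32, W/32)
        src = feats.flatten(2).transpose(1, 2)                   # (B, HW, d) token sequence
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)                      # (B, num_queries, d)
        # NOTE: real DETR adds positional encodings and trains with Hungarian matching.
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))
print(logits.shape, boxes.shape)   # (1, 100, 92), (1, 100, 4)
```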
[Paper] Multiscale Vision Transformers (MViT) (Susang Kim)
This document summarizes research on multiscale vision transformers (MViT). MViT builds on the transformer architecture by incorporating a multiscale pyramid of features, with early layers operating at high resolution to model low-level visual information and deeper layers focusing on coarse, complex features. MViT introduces multi-head pooling attention to operate at changing resolutions, and uses separate spatial and temporal embeddings. Experiments on Kinetics-400 and ImageNet show MViT achieves better accuracy than ViT baselines with fewer parameters and lower computational cost. Ablation studies validate design choices in MViT like input sampling and stage distribution.
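A rough, single-head illustration of the pooling-attention idea described above (queries, keys, and values are spatially pooled before attention, so the token resolution can shrink between stages); strides and dimensions are illustrative and this is not the released MViT code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pooling_attention(x, h, w, q_stride=2, kv_stride=2):
    """x: (B, N, C) tokens laid out on an h x w grid. Q, K, V are pooled before
    attention, so the output sequence (and the compute) shrinks between stages."""
    B, N, C = x.shape
    q, k, v = nn.Linear(C, 3 * C)(x).chunk(3, dim=-1)   # fresh random projection, illustration only

    def pool(t, stride):
        t = t.transpose(1, 2).reshape(B, C, h, w)        # back to a 2-D grid
        t = F.max_pool2d(t, kernel_size=stride, stride=stride)
        return t.flatten(2).transpose(1, 2)              # (B, N / stride**2, C)

    q, k, v = pool(q, q_stride), pool(k, kv_stride), pool(v, kv_stride)
    attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)        # scaled dot-product attention
    return attn.softmax(dim=-1) @ v                      # (B, N / q_stride**2, C)

out = pooling_attention(torch.randn(2, 14 * 14, 96), h=14, w=14)
print(out.shape)                                         # torch.Size([2, 49, 96])
```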
Vision Transformer (ViT) / An Image is Worth 16x16 Words: Transformers for Ima... (changedaeoh)
The document summarizes a research seminar presentation on using transformers for image recognition without convolutional biases. It discusses how a pure transformer architecture called Vision Transformer (ViT) can achieve state-of-the-art image classification performance when pretrained on large datasets. ViT works by splitting images into patches and processing the sequence of patch embeddings with a standard transformer. Experiments show ViT outperforms convolutional models in performance per unit of computation and can learn spatial representations without explicit inductive biases. While limited to classification, ViT shows potential for other vision tasks if self-supervised pre-training and model extensions are improved.
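A minimal sketch of the patchify-and-encode recipe (PyTorch assumed; depth, width, and head count are illustrative, not the paper's configurations): split the image into 16x16 patches, linearly embed them, prepend a class token, add position embeddings, and run a standard transformer encoder.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding as a strided conv = linear projection of flattened 16x16 patches
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                       # x: (B, 3, 224, 224)
        tokens = self.patchify(x).flatten(2).transpose(1, 2)    # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])            # classify from the [CLS] token

print(TinyViT()(torch.randn(2, 3, 224, 224)).shape)             # torch.Size([2, 1000])
```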
TLDR (Twin Learning for Dimensionality Reduction) is an unsupervised dimensionality reduction method that combines neighborhood embedding learning with the simplicity and effectiveness of recent self-supervised learning losses.
PR-355: Masked Autoencoders Are Scalable Vision Learners (Jinwon Lee)
- Masked Autoencoders Are Scalable Vision Learners presents a new self-supervised learning method called Masked Autoencoder (MAE) for computer vision.
- MAE works by masking random patches of input images, encoding only the visible patches, and decoding to reconstruct the full image (a sketch of the masking step follows after these bullets). This forces the model to learn visual representations from incomplete views of images.
- Experiments on ImageNet show that MAE achieves superior results compared to supervised pre-training from scratch as well as other self-supervised methods, scaling effectively to larger models. MAE representations also transfer well to downstream tasks like object detection, instance segmentation and semantic segmentation.
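A minimal sketch of the random patch masking that MAE applies before encoding; shapes and the helper name are illustrative, not the released MAE code:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) patch embeddings. Keeps a random (1 - mask_ratio) subset,
    mirroring MAE's trick of feeding only visible patches to the encoder."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # per-patch random scores
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # indices of patches to keep
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, keep_idx, 0.0)                 # 1 = masked (to be reconstructed)
    return visible, mask, keep_idx

visible, mask, idx = random_masking(torch.randn(2, 196, 768))
print(visible.shape, mask.sum(dim=1))               # (2, 49, 768); 147 patches masked per image
```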
How much position information do convolutional neural networks encode? review... (Dongmin Choi)
This document presents research into whether convolutional neural networks (CNNs) encode absolute spatial or position information. The authors hypothesize that CNN models implicitly encode position information through techniques like zero-padding. They propose PosENet, a model that couples a pretrained encoder like VGG or ResNet with a position encoding module to predict gradient-like position maps. PosENet is trained on a saliency detection dataset and evaluated on a semantic segmentation dataset. Results show deeper models and position-dependent tasks encode more position information. The authors conclude that zero-padding plays a key role in delivering position cues to CNNs.
Architecture Design for Deep Neural Networks I (Wanjin Yu)
This document summarizes Gao Huang's presentation on neural architectures for efficient inference. The presentation covered three parts: 1) macro-architecture innovations in convolutional neural networks (CNNs) such as ResNet, DenseNet, and multi-scale networks; 2) micro-architecture innovations including group convolution, depthwise separable convolution, and attention mechanisms; and 3) moving from static networks to dynamic networks that can adaptively select simpler or more complex models based on input complexity. The key idea is to enable faster yet accurate inference by matching computational cost to input difficulty.
Scene classification using Convolutional Neural Networks - Jayani Withanawasam (WithTheBest)
The document discusses scene classification using convolutional neural networks (CNNs). It begins with an outline of the topic, then provides background on computer vision as an AI problem and the importance and challenges of scene classification. It introduces CNNs as a deep learning technique for visual pattern recognition, describing their hierarchical organization and components like convolution and pooling layers. The document also discusses traditional machine learning approaches versus deep learning for scene classification and frameworks like Caffe that can be used to implement CNNs.
Devil in the Details: Analysing the Performance of ConvNet Features (Ken Chatfield)
This document summarizes research comparing different convolutional neural network (CNN) architectures and feature representations on common image classification tasks. It finds that CNN-based methods outperform traditional bag-of-words models. Specifically, it compares different pre-trained CNNs, explores the effects of data augmentation, and shows that fine-tuning networks to target datasets improves performance. The best results are achieved with smaller filters, deeper networks, and ranking loss fine-tuning, outperforming more complex architectures. Code and models are available online for others to replicate the findings.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives (see the sketch after this list), using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
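A minimal sketch of the cosine-similarity classification step described in point 3; names and shapes are illustrative, not the iMTFA code:

```python
import torch
import torch.nn.functional as F

def classify_rois(roi_embeddings, class_representatives):
    """roi_embeddings: (R, D) per-RoI features; class_representatives: (C, D) mean
    embeddings per class. Returns the most similar class per RoI."""
    rois = F.normalize(roi_embeddings, dim=-1)
    reps = F.normalize(class_representatives, dim=-1)
    similarity = rois @ reps.t()                 # cosine similarity matrix, (R, C)
    return similarity.argmax(dim=-1), similarity

labels, sim = classify_rois(torch.randn(5, 256), torch.randn(20, 256))
print(labels)                                    # one predicted class index per RoI
```

Adding a new class then only requires storing one more averaged embedding, which is what makes the approach incremental.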
The document discusses content-based image retrieval and various techniques used for it. It begins by defining content-based image retrieval as taking a query image and ranking images in a large dataset based on how similar they are to the query. It then covers classic pipelines using SIFT features, using off-the-shelf CNN features, and learning representations specifically for retrieval. Methods discussed include spatial pooling of CNN activations, region pooling like R-MAC, and learning embeddings or features through triplet loss or diffusion-based ranking refinement. The goal is to learn representations from data that effectively capture semantic similarity for retrieval tasks.
NUMBER PLATE IMAGE DETECTION FOR FAST MOTION VEHICLES USING BLUR KERNEL ESTIM... (paperpublications3)
This document discusses a proposed method for detecting number plates on images of fast moving vehicles that have been blurred due to motion. It begins with an introduction to image processing and digital images. It then discusses estimating the blur kernel caused by vehicle motion in order to model it as a linear uniform blur with parameters for angle and length. Existing related works on image deblurring are reviewed. The proposed system estimates the blur kernel parameters using sparse representation and Radon transform methods, allows deblurring the image, and then uses artificial neural networks to identify numbers and characters in the deblurred image. The system is evaluated on real blurred images and shown to improve license plate recognition compared to previous methods.
This document summarizes the evolution of convolutional neural networks from LeNet in 1998 to ResNet in 2015. It describes key networks like AlexNet, VGG, GoogleNet, and ResNet and their contributions to improving accuracy on tasks like the ImageNet challenge. The networks progressed from LeNet's basic convolutional layers to deeper networks enabled by techniques like dropout, ReLU activations, and residual connections, leading to substantially improved accuracy over time.
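Since residual connections are called out as one of the enabling techniques, here is a minimal sketch of a residual block; it is illustrative, not any specific released ResNet:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = relu(x + F(x)): the identity shortcut lets gradients bypass the convolutions,
    which is what made very deep networks like ResNet trainable."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity shortcut + learned residual

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)   # shape is preserved
```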
Convolutional neural network (CNN, or ConvNet) (RakeshSaran5)
This presentation provides an overview of Convolutional Neural Networks (CNNs). It begins with an introduction to CNNs and their advantages over fully connected networks for image recognition. It then describes the key components of a CNN, including convolution layers, ReLU layers, pooling layers, and fully connected layers. Examples of each component are provided. The presentation concludes with a discussion of CNN use cases for image recognition.
https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
160205 NeuralArt - Understanding Neural Representation (Junho Cho)
The document summarizes three papers on neural representations presented at a seminar:
1. Texture synthesis using convolutional neural networks (CNNs) to generate new texture samples matching a source texture based on Gram matrices of CNN feature maps (a Gram-matrix sketch follows after this list).
2. Reconstructing images from feature maps of CNNs trained on object recognition to understand neural representations.
3. A neural algorithm of artistic style that combines the content of one image and style of another using CNN representations of content and style.
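A minimal sketch of the Gram-matrix statistic used by the texture-synthesis and style-transfer papers above; shapes are illustrative:

```python
import torch

def gram_matrix(feature_map):
    """feature_map: (B, C, H, W) CNN activations. The Gram matrix G = F F^T over
    channels discards spatial layout and keeps channel co-occurrence statistics,
    which is what 'texture' or 'style' is matched on."""
    B, C, H, W = feature_map.shape
    f = feature_map.reshape(B, C, H * W)
    return (f @ f.transpose(1, 2)) / (C * H * W)   # (B, C, C), normalized

print(gram_matrix(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 64])
```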
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Object Detection Methods using Deep Learning (Sungjoon Choi)
The document discusses object detection techniques including R-CNN, SPPnet, Fast R-CNN, and Faster R-CNN. R-CNN uses region proposals and CNN features to classify each region. SPPnet improves efficiency by computing CNN features once for the whole image. Fast R-CNN further improves efficiency by sharing computation and using a RoI pooling layer. Faster R-CNN introduces a region proposal network to generate proposals, achieving end-to-end training. The techniques showed improved accuracy and processing speed over prior methods.
In this project, we propose methods for semantic segmentation with state-of-the-art deep learning models. Moreover, we want to filter the segmentation down to a specific object for a specific application: instead of concentrating on unnecessary objects, we can focus on the ones of interest and make the pipeline more specialized and efficient for special purposes. Furthermore, in this project we leverage models that are suitable for face segmentation. The models used are Mask R-CNN and DeepLabv3. The experimental results clearly indicate how efficient and robust the illustrated approach is in the segmentation task compared to previous work in the field. These models reach 74.4 and 86.6 mean Intersection over Union, respectively. Visual results of the models are shown in the Appendix.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
The presentation covers convolutional neural network (CNN) design. First, the main building blocks of CNNs are introduced. Then we systematically investigate the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. In the evaluation, the influence of the following architectural choices is tested: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolution, fully-connected, SPP), and image pre-processing, as well as learning parameters: learning rate, batch size, cleanliness of the data, etc.
These slides discuss some milestone results in image classification using deep convolutional neural networks and present our results on obscenity detection in images using deep convolutional neural networks and transfer learning on ImageNet models.
The document discusses image captioning using deep neural networks. It begins by providing examples of how humans can easily describe images but generating image captions with a computer program was previously very difficult. Recent advances in deep learning, specifically using convolutional neural networks (CNNs) to recognize objects in images and recurrent neural networks (RNNs) to generate captions, have enabled automated image captioning. The document discusses CNN and RNN architectures for image captioning and provides examples of pre-trained models that can be used, such as VGG-16.
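A minimal sketch of the CNN-encoder / RNN-decoder pattern described above (PyTorch assumed; vocabulary size and dimensions are placeholders, and ResNet-18 stands in for the VGG-16 mentioned in the document):

```python
import torch
import torch.nn as nn
import torchvision

class CaptioningSketch(nn.Module):
    """A CNN encodes the image into a feature vector; an LSTM decodes it into words."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = torchvision.models.resnet18(weights=None)          # in practice, pretrained
        self.encoder = nn.Sequential(*list(cnn.children())[:-1]) # globally pooled features
        self.img_proj = nn.Linear(512, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):             # captions: (B, T) token ids
        img_feat = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, E)
        words = self.word_embed(captions)                                        # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img_feat, words], dim=1))  # image as the first "word"
        return self.vocab_head(hidden)                # next-word logits at every step

logits = CaptioningSketch()(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)                                   # torch.Size([2, 13, 10000])
```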
This document discusses using fully convolutional neural networks for defect inspection. It begins with an agenda that outlines image segmentation using FCNs and defect inspection. It then provides details on data preparation including labeling guidelines, data augmentation, and model setup using techniques like deconvolution layers and the U-Net architecture. Metrics for evaluating the model like Dice score and IoU are also covered. The document concludes with best practices for successful deep learning projects focusing on aspects like having a large reusable dataset, feasibility of the problem, potential payoff, and fault tolerance.
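A minimal sketch of the two segmentation metrics mentioned (Dice score and IoU), computed on binary masks:

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """pred, target: binary masks of the same shape (1 = defect pixel)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2.0 * intersection / (pred.sum() + target.sum() + eps)
    iou = intersection / (union + eps)
    return dice, iou

pred = np.zeros((64, 64)); pred[10:30, 10:30] = 1
target = np.zeros((64, 64)); target[15:35, 15:35] = 1
print(dice_and_iou(pred, target))   # both in [0, 1]; IoU is never larger than Dice
```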
Image Captioning Generator using Deep Machine Learning (ijtsrd)
Technology's scope has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become some of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used for visually impaired persons, as well as in automobiles for self-identification, and for various applications that need quick and easy verification. A Convolutional Neural Network (CNN) is used to extract features from the image, and a Long Short-Term Memory (LSTM) network is used to organize the right meaningful sentences in this model. The Flickr8k and Flickr30k datasets were used for training. Sreejith S P | Vijayakumar A, "Image Captioning Generator using Deep Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
The goal of this report is the presentation of our biometry and security course’s project: Face recognition for Labeled Faces in the Wild dataset using Convolutional Neural Network technology with Graphlab Framework.
Representational Continuity for Unsupervised Continual Learning (MLAI2)
Continual learning (CL) aims to learn a sequence of tasks without forgetting the previously acquired knowledge. However, recent CL advances are restricted to supervised continual learning (SCL) scenarios. Consequently, they are not scalable to real-world applications where the data distribution is often biased and unannotated. In this work, we focus on unsupervised continual learning (UCL), where we learn the feature representations on an unlabelled sequence of tasks and show that reliance on annotated data is not necessary for continual learning. We conduct a systematic study analyzing the learned feature representations and show that unsupervised visual representations are surprisingly more robust to catastrophic forgetting, consistently achieve better performance, and generalize better to out-of-distribution tasks than SCL. Furthermore, we find that UCL achieves a smoother loss landscape through qualitative analysis of the learned representations and learns meaningful feature representations. Additionally, we propose Lifelong Unsupervised Mixup (Lump), a simple yet effective technique that interpolates between the current task and previous tasks' instances to alleviate catastrophic forgetting for unsupervised representations.
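A minimal sketch of the interpolation step behind Lump (mixing a current-task batch with a replayed previous-task batch before applying the unsupervised objective); the Beta(0.4, 0.4) coefficient and shapes are illustrative:

```python
import numpy as np

def lump_interpolate(current_batch, replay_batch, alpha=0.4):
    """Convex combination of current-task and previous-task instances, which softens
    the distribution shift between tasks and alleviates catastrophic forgetting."""
    lam = np.random.beta(alpha, alpha)                 # mixing coefficient in (0, 1)
    return lam * current_batch + (1.0 - lam) * replay_batch

mixed = lump_interpolate(np.random.rand(32, 3, 32, 32), np.random.rand(32, 3, 32, 32))
print(mixed.shape)                                     # (32, 3, 32, 32)
```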
Learn to Build an App to Find Similar Images using Deep Learning - Piotr Teterwak (PyData)
This document discusses using deep learning and deep features to build an app that finds similar images. It begins with an overview of deep learning and how neural networks can learn complex patterns in data. The document then discusses how pre-trained neural networks can be used as feature extractors for other domains through transfer learning. This reduces data and tuning requirements compared to training new deep learning models. The rest of the document focuses on building an image similarity service using these techniques, including training a model with GraphLab Create and deploying it as a web service with Dato Predictive Services.
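A minimal sketch of the "pretrained network as feature extractor plus nearest-neighbour lookup" recipe described above; torchvision stands in for GraphLab Create / Dato Predictive Services as an illustrative substitution, and random weights stand in for a pretrained model:

```python
import torch
import torch.nn.functional as F
import torchvision

cnn = torchvision.models.resnet18(weights=None)           # in practice, pretrained weights
feature_extractor = torch.nn.Sequential(*list(cnn.children())[:-1])
feature_extractor.eval()

@torch.no_grad()
def embed(images):                                         # images: (N, 3, 224, 224)
    return F.normalize(feature_extractor(images).flatten(1), dim=-1)

gallery = embed(torch.randn(100, 3, 224, 224))             # the indexed image collection
query = embed(torch.randn(1, 3, 224, 224))
scores = query @ gallery.t()                               # cosine similarity to every image
print(scores.topk(5).indices)                              # indices of the 5 most similar images
```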
Minor Project Report on Denoising Diffusion Probabilistic Models (oxigoh238)
Denoising Diffusion Probabilistic Model
Contrastive models like CLIP as a key inspiration.
Demonstrates robust image representations capturing both semantics and style.
Project Objectives:
Two-stage model proposed:
Prior generating a CLIP image embedding from a given text.
Decoder generating an image based on these CLIP image embeddings.
Automatic gender and age classification has become quite relevant with the rise of social media platforms. However, the existing methods have not been completely successful in achieving this. Through this project, an attempt has been made to determine the gender and age based on a frame of the person. This is done using deep learning and OpenCV, which is capable of processing the real-time frames. This frame is given as input and the predicted gender and age are given as output. It is difficult to predict the exact age of a person from one frame due to facial expressions, lighting, makeup and so on, so for this purpose various age ranges are taken, and the predicted age falls into one of them. The Adience dataset is used as it is a benchmark for face photos and includes various real-world imaging conditions like noise, lighting etc.
One-shot learning is an object categorization problem in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images
This document discusses comparing the performance of different convolutional neural networks (CNNs) when trained on large image datasets using Apache Spark. It summarizes the datasets used - CIFAR-10 and ImageNet - and preprocessing done to standardize image sizes. It then provides an overview of CNN architecture, including convolutional layers, pooling layers, and fully connected layers. Finally, it introduces SparkNet, a framework that allows training deep networks using Spark by wrapping Caffe and providing tools for distributed deep learning on Spark. The goal is to see if SparkNet can provide faster training times compared to a single machine by distributing training across a cluster.
Automated Image Captioning – Model Based on CNN – GRU Architecture (IRJET Journal)
This document presents a model for automated image captioning using deep learning techniques. The model uses a CNN-GRU architecture, where a CNN encoder extracts image features and a GRU decoder generates captions. The model is trained on the Flickr30K dataset and achieves a BLEU score of 0.5625. Experimental results show the model can accurately identify objects, animals, and relationships between objects in images and generate descriptive captions. The authors integrate text-to-speech functionality to help describe images to visually impaired people.
Tools using AI will affect and, in many cases, redefine most areas of societal impact such as medical practice and intervention, autonomous transportation and law enforcement. While so far, most of the focus and time is invested into optimizing models’ performance, whenever a single wrong prediction has big implications in terms of value or life, accuracy becomes less important than explainability.
In this talk, we will learn about explainable AI and we will see how to apply some of the available tools to answer the question 'what did my system consider in order to output a specific prediction?'
Transfer learning enables you to use pretrained deep neural networks trained on various large datasets (ImageNet, CIFAR, WikiQA, SQUAD, and more) and adapt them for various deep learning tasks (e.g., image classification, question answering, and more).
Wee Hyong Tok and Danielle Dean share the basics of transfer learning and demonstrate how to use the technique to bootstrap the building of custom image classifiers and custom question-answering (QA) models. You’ll learn how to use the pretrained CNNs available in various model libraries to custom build a convolution neural network for your use case. In addition, you’ll discover how to use transfer learning for question-answering tasks, with models trained on large QA datasets (WikiQA, SQUAD, and more), and adapt them for new question-answering tasks.
Topics include:
An introduction to convolution neural networks and question-answering problems
Using pretrained CNNs and the last fully connected layer as a featurizer (once the features are extracted, any existing classifier can be used for image classification, using the extracted features as inputs; see the sketch after this list)
Fine-tuning the pretrained models and adapting them for the new images
Using pretrained QA models trained on large QA datasets (WikiQA, SQUAD) and applying transfer learning for QA tasks
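A minimal sketch of the two image-side options listed above, assuming PyTorch/torchvision; the class count, inputs, and use of random instead of pretrained weights are placeholders:

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(weights=None)    # in practice, ImageNet-pretrained weights

# Option 1: freeze the network and use the penultimate activations as features
for p in model.parameters():
    p.requires_grad = False
featurizer = nn.Sequential(*list(model.children())[:-1])
features = featurizer(torch.randn(8, 3, 224, 224)).flatten(1)   # (8, 2048) -> any classifier

# Option 2: replace the final layer and fine-tune just that head on the new classes
model.fc = nn.Linear(model.fc.in_features, 10)        # 10 = number of new target classes
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
logits = model(torch.randn(8, 3, 224, 224))
loss = criterion(logits, torch.randint(0, 10, (8,)))
loss.backward()
optimizer.step()                                      # only the new head is updated
```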
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
16 OpenCV Functions to Start your Computer Vision journey (ssuser90e017)
This article discusses 16 OpenCV functions for computer vision tasks with Python code examples. It begins with an introduction to computer vision and why OpenCV is useful. It then covers functions for reading/writing images, changing color spaces, resizing images, rotating images, translating images, thresholding images, adaptive thresholding, image segmentation with watershed algorithm, bitwise operations, edge detection, image filtering, contours, SIFT, SURF, feature matching, and face detection. Code examples are provided for each function to demonstrate its use.
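A few of the listed operations in a minimal sketch, assuming `opencv-python` is installed; `input.jpg` is a placeholder path:

```python
import cv2

img = cv2.imread("input.jpg")                                   # read an image (BGR by default)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                    # change colour space
small = cv2.resize(img, (224, 224))                             # resize
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)    # global thresholding
edges = cv2.Canny(gray, 100, 200)                               # edge detection
blurred = cv2.GaussianBlur(img, (5, 5), 0)                      # image filtering
cv2.imwrite("edges.jpg", edges)                                 # write an image back to disk
```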
Multimodal foundation models are a revolutionary class of AI models that provide impressive abilities to generate multimedia content and do so by interactive prompts in a seemingly creative manner. These foundation models are often self-supervised transformer-based models pre-trained on large volumes of data, typically collected from the web. They already form the basis of all state-of-the-art systems in computer vision and natural language processing across a wide range of tasks and have shown impressive transfer learning abilities. Despite their immense potential, these foundation models face challenges in fundamental perception tasks such as spatial grounding and temporal reasoning, have difficulty to operate on low-resource scenarios, and neglect human-alignment for ethical, legal, and societal acceptance. In this talk I will highlight recent work from my lab that identifies several of these challenges as well as ways to update foundation models to address these challenges and to do so in a sustainable way, without the need to retrain from scratch.
Similar to Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the Wild (20)
The document discusses recent developments in video transformers. It summarizes several recent works that employ spatial backbones like ViT or ResNet combined with temporal transformers for video classification. Examples mentioned include VTN, TimeSformer, STAM, and ViViT. The document also discusses common practices in video transformer inference, like using multiple clips/crops and averaging predictions. Design choices covered include number of frames, spatial dimensions, and multi-view inference techniques.
An Empirical Study of Training Self-Supervised Vision Transformers (Sangmin Woo)
Chen, Xinlei, Saining Xie, and Kaiming He. "An empirical study of training self-supervised vision transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
This document discusses video grounding, which aims to localize moments in video corresponding to natural language descriptions. It defines related tasks like natural language video localization and video moment retrieval. It lists keywords for these tasks and provides examples of approaches, applications, and GitHub resources for temporal language grounding in videos.
This document summarizes several action recognition datasets for human activities. It describes both single-label datasets that classify entire videos, as well as multi-label datasets that temporally localize actions within videos. It also categorizes datasets as generic, instructional, ego-centric, compositional, multi-view, or multi-modal depending on the type of activities and data modalities included. Several prominent multi-modal datasets are highlighted, such as PKU-MMD, NTU RGB+D, MMAct, and HOMAGE, which provide video alongside additional modalities like depth, infrared, audio, and sensor data.
Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15750-15758).
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
Action Genome: Actions as Composition of Spatio-Temporal Scene Graphs (Sangmin Woo)
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as composition of spatio-temporal scene graphs. arXiv preprint arXiv:1912.06992, 2019.
Neural Motifs: Scene Graph Parsing with Global Context (Sangmin Woo)
Zellers, Rowan, et al. "Neural Motifs: Scene Graph Parsing With Global Context." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Attentive Relational Networks for Mapping Images to Scene Graphs (Sangmin Woo)
M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo: Attentive relational networks for mapping images to scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Immersive Learning That Works: Research Grounding and Paths Forward (Leonel Morgado)
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati... (AbdullaAlAsif1)
The pygmy halfbeak, Dermogenys colletei, is known for its viviparous nature and presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the pygmy halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study contributes to a better understanding of viviparous fish in Borneo and to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp) (MAGOTI ERNEST)
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and '70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represent another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Authoring a personal GPT for your research and practice: How we created the Q... (Leonel Morgado)
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
The debris of the 'last major merger' is dynamically young (Sérgio Sacani)
The Milky Way's (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the 'last major merger.' Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the 'last major merger' did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free-text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersive learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the Wild
1. Recent Breakthroughs in AI
- Clubhouse Podcasts
YouTube Link: https://www.youtube.com/watch?v=3OxEpGU1unA
Presenter: Sangmin Woo
2021.03.10
Andrej Karpathy, Justin Johnson, Lex Fridman, Richard Socher, Russell Kaplan
2.
Rise of multimodal learning: CLIP and DALL-E
• CLIP efficiently learns visual concepts from natural language supervision
• DALL-E creates images from text captions for a wide range of concepts expressible in natural language
‘Data’ is the KING: Importance of data and datasets
• Academia: given a dataset, build a more powerful model vs. reality (industry): given a model, collect/generate the dataset
• In fact, much of the innovation comes from the data (not the model…)
• Data curation & MLOps will become more important
Will Transformers overtake CNNs? And towards “generalized neural substrates”
• Image = CNN, sequence = RNN → All = Transformer (consolidation of architectures)
• 2020 was the year of the Transformer; all we need is the Transformer!
Lifelong learning (need to consider catastrophic forgetting & semantic shift, …)
• Benchmarking is difficult, since the tasks & models will all be different from the previous SOTA…
• Then why not fix the model? Model-first benchmark design!
Taking hard data structures and "softening" them to make them differentiable
• The Transformer is a softened version of a hash table (see the sketch below)
• What would be the next-generation data structure?
Talk Summary https://www.youtube.com/watch?v=3OxEpGU1unA
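To make the "softened hash table" remark concrete, here is a minimal illustrative sketch (our own, not from the talk): a hard table lookup returns exactly one value by exact key match and is not differentiable, whereas an attention-style lookup returns a similarity-weighted average over all stored values, which is.

```python
import numpy as np

def hard_lookup(table: dict, key):
    # Hard hash table: exact key match returns exactly one value (not differentiable).
    return table[key]

def soft_lookup(keys: np.ndarray, values: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Attention-style lookup: every stored value contributes, weighted by key similarity."""
    scores = keys @ query / np.sqrt(query.shape[0])  # similarity of the query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over keys
    return weights @ values                          # weighted average of the values

# Toy example: 3 stored key/value pairs with 4-d keys and 2-d values.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(3, 4)), rng.normal(size=(3, 2))
print(soft_lookup(keys, values, query=keys[1]))  # weighted toward values[1] when the query matches key 1
```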
3. Learning Visual-Linguistic
Representation in the Wild
Presenter: Sangmin Woo
2021.03.10
CLIP – OpenAI / DALL-E – OpenAI / ALIGN – Google Research
(+ UniT – Facebook AI Research)
5.
“Scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on
a great variety of image classification datasets”
CLIP is trained on 400M (image, text) pairs found across the internet.
Given an image, CLIP predicts which of a set of 32,768 randomly sampled text snippets was
actually paired with it in the dataset.
CLIP learns to recognize a wide variety of visual concepts in images and associate them with
their names.
CLIP models can then be applied to nearly arbitrary visual classification tasks.
Summary
6.
Current approaches have several major problems:
• datasets are labor-intensive and costly to create
• models are good at one task and one task only
• models that perform well on benchmarks have disappointingly poor performance in real-world settings.
CLIP (Contrastive Language–Image Pre-training) aims to address these problems:
• It is trained on image & natural language supervision that’s abundantly available on the internet.
• It can be instructed in natural language to perform several classification benchmarks, without directly
optimizing for the benchmark’s performance (similar to the “zero-shot” capabilities of GPT-3).
• It matches the accuracy of the original ResNet-50 on ImageNet zero-shot without using any of the
1.28M training examples.
Introduction
7.
Both models show the same accuracy on the
ImageNet test set.
In non-ImageNet settings, CLIP significantly
outperforms the ImageNet model.
ObjectNet checks a model’s ability to recognize
objects in many different poses and with many
different backgrounds inside homes.
ImageNet Rendition and ImageNet Sketch check
a model’s ability to recognize more abstract
depictions of objects.
ImageNet ResNet-101 vs. CLIP ViT-L
8.
CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples.
At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of
the target dataset’s classes.
Approach
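As a rough illustration of the two steps just described, here is a schematic PyTorch-style sketch (our own simplification, not CLIP's actual code); `image_encoder`, `text_encoder`, and the prompt template are placeholders, and the encoders are assumed to handle tokenization.

```python
import torch
import torch.nn.functional as F

def clip_training_step(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Contrastive step: the i-th image in the batch should match the i-th text."""
    img = F.normalize(image_encoder(images), dim=-1)   # (N, D) image embeddings
    txt = F.normalize(text_encoder(texts), dim=-1)     # (N, D) text embeddings
    logits = img @ txt.t() / temperature               # (N, N) pairwise similarities; CLIP learns the temperature, fixed here for simplicity
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over rows (image -> text) and columns (text -> image).
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def zero_shot_classify(image_encoder, text_encoder, image, class_names):
    """Synthesize a linear classifier from class-name embeddings and pick the best match."""
    prompts = [f"a photo of a {c}" for c in class_names]          # prompt template (an assumption)
    txt = F.normalize(text_encoder(prompts), dim=-1)              # (C, D) acts as classifier weights
    img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D)
    return class_names[(img @ txt.t()).argmax(dim=-1).item()]
```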
10.
Random, non-cherry picked,
predictions of zero-shot CLIP
classifiers on examples from
various datasets:
Qualitative Examples
11.
CLIP is highly efficient
• CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used
in a zero-shot manner.
• CLIP (like GPT-2 and GPT-3) can achieve compelling zero-shot performance; however, it requires
significant training compute.
• Two algorithmic choices save compute:
a contrastive objective for connecting text with images, and
the Vision Transformer, which gives a 3x gain in compute efficiency over a standard ResNet.
Key takeaways
12.
The image-to-caption Transformer model struggles at zero-shot transfer: it reaches only 16%
accuracy on ImageNet after training on 400M images. CLIP is much more efficient and achieves
the same accuracy roughly 12x faster.
Key takeaways
13.
CLIP is flexible and general
• CLIP models are more flexible and general than ImageNet models because they learn a wide range of
visual concepts directly from natural language. They are able to perform many different tasks zero-shot.
• CLIP has validated its zero-shot performance on over 30 different datasets including tasks such as
fine-grained object classification, geo-localization, action recognition in videos, and OCR.
• Learning OCR is an exciting behavior that does not occur in standard ImageNet models.
• The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student
EfficientNet-L2, on 20 out of 26 different transfer datasets.
Key takeaways
14.
Across 27 tasks such as fine-grained object classification, OCR, activity recognition in videos,
and geo-localization, CLIP models learn more widely useful image representations.
Key takeaways
15.
While CLIP usually performs well at recognizing common objects, it struggles at counting the
number of objects in an image and at predicting how close the nearest car is in a photo.
Zero-shot CLIP also struggles, compared to task-specific models, on very fine-grained
classification, such as telling the difference between car models, variants of aircraft, or flower
species.
CLIP also still generalizes poorly to images not covered in its pre-training dataset. For
instance, although CLIP learns a capable OCR system, when evaluated on the MNIST dataset,
zero-shot CLIP achieves only 88% accuracy, well below the 99.75% achieved by humans.
Limitations
16. Blog Link: https://openai.com/blog/dall-e/
YouTube: https://www.youtube.com/watch?v=az-OV47oKvA
(for a more detailed and friendly explanation)
17.
“DALL-E is a 12B parameter AR Transformer trained to generate images from text descriptions in
a zero-shot manner, using 250M text–image pairs collected from the internet”
DALL-E achieves high-quality zero-shot image generation on the MS-COCO dataset without using
any of the training labels, and its samples are preferred over prior work trained on the dataset by
human evaluators 90% of the time.
It is also capable of image-to-image translation.
Summary
DALL-E = Salvador Dalí + WALL-E
18.
GPT-3: text generation
Image GPT: image generation
Jukebox: music generation
DALL-E extends these findings, showing that manipulating visual concepts through language is
now within reach.
DALL-E can
• create anthropomorphized versions of animals and objects
• combine unrelated concepts in plausible ways
• render text
• apply transformations to existing images
Qualitative examples: https://openai.com/blog/dall-e/
Introduction
20.
The goal is to train a transformer to autoregressively model the text and image tokens as a
single stream of data.
However, using pixels directly as image tokens would require an inordinate amount of memory
for high-resolution images.
A discrete variational autoencoder (dVAE) is trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element of which can assume 8192 possible values.
The 256 BPE-encoded text tokens are concatenated with the 32×32 = 1024 image tokens, and an
autoregressive transformer is trained to model the joint distribution over the text and image tokens.
Approach
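A rough sketch (our own, under the assumptions noted in the comments) of how a single token stream could be assembled from the numbers on this slide; `dvae_encoder` and the padding scheme are placeholders, not DALL-E's actual implementation.

```python
import torch

TEXT_LEN, GRID, IMG_VOCAB = 256, 32, 8192   # token budget from the slide

def build_token_stream(text_tokens: torch.Tensor, image: torch.Tensor, dvae_encoder) -> torch.Tensor:
    """Concatenate BPE text tokens with dVAE image tokens into one autoregressive sequence."""
    assert image.shape[-2:] == (256, 256), "DALL-E operates on 256x256 RGB images"
    # Fixed 256-token text prefix; the padding scheme here is an assumption for illustration.
    text = torch.zeros(TEXT_LEN, dtype=torch.long)
    text[: min(TEXT_LEN, text_tokens.numel())] = text_tokens[:TEXT_LEN]
    # The dVAE maps the image to a 32x32 grid of discrete codes, each in [0, 8192).
    image_tokens = dvae_encoder(image.unsqueeze(0)).view(-1)
    assert image_tokens.numel() == GRID * GRID and image_tokens.max() < IMG_VOCAB
    return torch.cat([text, image_tokens])   # length 256 + 1024 = 1280
```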
21.
VQ-VAE (Vector Quantized Variational AutoEncoder) for image compression
Approach
van den Oord et al., Neural Discrete Representation Learning
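For reference, a minimal sketch of the vector-quantization step at the heart of VQ-VAE (a simplified illustration, not the paper's implementation): each encoder output vector is snapped to its nearest codebook entry, and a straight-through estimator lets gradients flow through the non-differentiable snap.

```python
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: (B, H, W, D) encoder outputs; codebook: (K, D) learnable code embeddings."""
    flat = z_e.reshape(-1, z_e.shape[-1])            # (B*H*W, D)
    dists = torch.cdist(flat, codebook)              # distance from each vector to every code
    indices = dists.argmin(dim=-1)                   # index of the nearest code per vector
    z_q = codebook[indices].reshape(z_e.shape)       # quantized latents
    # Straight-through estimator: forward pass uses z_q, backward copies gradients to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices.reshape(z_e.shape[:-1])
```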
22.
Gumbel Softmax
Approach
Jang et al., Categorical Reparameterization with Gumbel-Softmax
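A short sketch of the Gumbel-Softmax relaxation that the dVAE training relies on (simplified; PyTorch also ships `torch.nn.functional.gumbel_softmax`): adding Gumbel noise to the logits and applying a temperature-controlled softmax gives a differentiable approximation of sampling one of the discrete codes.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0, hard: bool = False) -> torch.Tensor:
    """logits: (..., K) unnormalized scores over K discrete codes."""
    # Sample Gumbel(0, 1) noise and add it to the logits.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)   # relaxed, differentiable sample
    if not hard:
        return y_soft
    # Straight-through: discretize to a one-hot vector in the forward pass, keep soft gradients.
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return (y_hard - y_soft).detach() + y_soft
```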
23.
A discrete variational autoencoder (dVAE) is
trained to compress each 256×256 RGB image
into a 32×32 grid of image tokens, each element
of which can assume 8192 possible values.
The encoder downsamples the spatial resolution by
a factor of 8.
While details are sometimes lost or distorted, the
main features of the image are still typically
recognizable.
Approach
26.
“ALIGN (A Large-scale ImaGe and Noisy-text embedding) uses a noisy dataset of over 1B image
alt-text pairs, obtained without expensive filtering or post-processing steps, to learn a simple dual-
encoder architecture (image and text) by aligning visual-language representations using a
contrastive loss”
While representation learning in NLP has transitioned to training on raw text without human
annotations, visual and vision-language representations still rely heavily on curated training
datasets that are expensive or require expert knowledge. This costly curation process limits
the size of datasets and hence hinders the scaling of trained models.
The scale of the corpus can make up for its noise and leads to state-of-the-art representations
even with such a simple learning scheme.
Summary
27.
Visual and language representations are jointly learned from noisy image alt-text data and can be used for
vision-only or vision-language task transfer.
Without any fine-tuning, ALIGN powers cross-modal search including image-to-text search, text-to-image
search and even search with joint image+text queries.
Summary
28.
The goal is to align the visual-language representations in a shared latent embedding space
using a simple dual-encoder architecture (image: EfficientNet, text: BERT)
Image and text encoders are learned via a contrastive loss (formulated as normalized softmax)
that pushes the embeddings of matched image-text pairs together while pushing those of non-
matched image-text pairs apart.
Considering paired texts as fine-grained labels of images, the image-to-text contrastive loss is
analogous to the conventional label-based classification objective; the key difference is
that the text encoder generates the “label” weights.
Approach
29.
The image (EfficientNet) and text encoders (BERT) are optimized via a contrastive loss (the sum of
two normalized softmax losses) that pushes the embeddings of matched image-text pairs
(positives) together while pushing those of non-matched image-text pairs (negatives) apart.
• Image-to-text classification loss
• Text-to-image classification loss (both losses are reconstructed after the symbol definitions below)
Approach
𝑥𝑖: image embedding in the 𝑖-th pair
𝑦𝑗: text embedding in the 𝑗-th pair
𝑁: batch size
𝜎: (learnable) temperature
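The formulas behind the two loss bullets did not survive the slide extraction; a reconstruction of the normalized-softmax form, consistent with the symbols defined above (and, to our understanding, with the ALIGN paper), is:

\[
\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)},
\qquad
\mathcal{L}_{\text{txt}\to\text{img}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)}
\]

with the total training loss being the sum of the two.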
32.
“Unified Transformer (UniT) is built upon the transformer encoder-decoder architecture and jointly
learns multiple tasks across different modalities (image & text), ranging from object detection to
language understanding and multimodal reasoning”
The UniT model encodes each input modality with an encoder and makes predictions on each task
with a shared decoder over the encoded input representations, followed by task-specific output
heads.
Compared to previous efforts on multi-task learning with transformers, UniT shares the same
model parameters across all tasks instead of separately fine-tuning task-specific models, and
handles a much wider variety of tasks across different domains.
UniT learns 7 tasks jointly over 8 datasets, achieving comparable performance to well-
established prior work on each domain under the same supervision with a compact set of
model parameters.
Summary
34.
UniT uses an image encoder, a text encoder, and a joint decoder with per-task query embedding
followed by task-specific heads to make the final outputs for each task.
Approach
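A schematic sketch of the data flow this slide describes (our own simplification; layer sizes, the task list, and output dimensions are placeholders, not UniT's actual configuration).

```python
import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    """Per-modality encoders -> shared decoder with per-task queries -> task-specific heads."""
    def __init__(self, d_model=256, n_heads=8, n_queries=100, task_output_dims=None):
        super().__init__()
        # Output sizes per task are placeholders (e.g. detection classes, VQA answer vocabulary).
        task_output_dims = task_output_dims or {"detection": 92, "vqa": 3129}
        enc_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.task_queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(n_queries, d_model)) for t in task_output_dims})
        self.task_heads = nn.ModuleDict(
            {t: nn.Linear(d_model, dim) for t, dim in task_output_dims.items()})

    def forward(self, image_feats, text_feats, task):
        # image_feats: (B, S_img, d_model), text_feats: (B, S_txt, d_model).
        # Encode each modality separately, then let the shared decoder attend to both.
        memory = torch.cat([self.image_encoder(image_feats),
                            self.text_encoder(text_feats)], dim=1)
        queries = self.task_queries[task].unsqueeze(0).repeat(memory.size(0), 1, 1)
        return self.task_heads[task](self.decoder(queries, memory))
```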
35.
Among the existing architectures, the Transformer is the most generic because it has less
inductive bias than the others.
A new formula, “Large Transformer + large-scale dataset”, has begun to emerge
(CLIP: 400M, DALL-E: 250M, ALIGN: 1B).
All we need is data: the recent BIG studies talk mostly about how they collected/curated the data,
not much about the models.
Transformers are replacing CNN-based SOTAs, which were considered the de-facto standard in
the image domain, on several benchmarks.
Also, Transformers are indeed strong at multi-modality.
Wrap up