This paper proposes a framework called NoC for noise-aware learning from web-crawled image-text data for image captioning. NoC assigns an alignment level to each image-text pair based on CLIP similarity scores, then trains a captioning model conditioned on the alignment level. This allows the model to learn from noisy data while generating accurate captions by controlling the alignment level at inference time. Experiments show NoC outperforms baseline methods on zero-shot captioning and self-retrieval tasks, demonstrating it can effectively leverage noisy web data for image captioning.
論文紹介:Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning
1. Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh, ICCV2023
橋口凌大 (Nagoya Institute of Technology)
2023/10/30
2. Overview
■ What is image captioning?
• Generating a description that fits an input image
■ Pre-training is done on large amounts of web-crawled data [Changpinyo+, CVPR2021], [Hu+, CVPR2022]
• Problem: such data contains a lot of noise
■ Training an image captioning model that makes use of the noise
• A framework is proposed that also learns from noisy pairs, so the amount of training data is not reduced
[Figure 2 caption fragment: examples of web-crawled image-text pairs; with a threshold of 0.3, a selected threshold, pairs like "this is about what it looked like" (CLIP similarity 0.171) are noise samples (left). However, according to …]
4. Observations
■ Examples of image-text similarity computed with CLIP [Radford+, ICML2021]
• Low-similarity pairs are mostly low quality
• However, some pairs with low similarity are still informative to a human observer
■ We want to make effective use of data that simple filtering would discard
[Figure examples with CLIP similarities: "this is about what it looked like" (0.171) and "people walk on the sidewalk" (0.193) are filtered noise samples; "a girl scrubs her shoes clean in a dish of soapy water" (0.289) is a filtered sample, but informative; "ambulances line up outside the international airport terminal" (0.353) is a non-filtered sample.]
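As a minimal illustration of the problem (not from the paper), hard threshold filtering at the 0.3 value mentioned in Figure 2 discards the informative 0.289 pair together with the genuinely noisy ones; the score-to-label mapping is taken from the figure above:

```python
import numpy as np

# CLIP similarities of the four example pairs from the figure above
scores = np.array([0.171, 0.193, 0.289, 0.353])

keep = scores >= 0.3   # hard filtering at the threshold from Figure 2
print(scores[keep])    # [0.353] -- the informative 0.289 pair is lost too
```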
5. Framework overview
■ Define an alignment level z for each image-text pair
■ Train with the alignment level embedded as a condition
■ Generate captions at the top alignment level at inference time
[Figure 3: (a) Alignment Level Assignment — a frozen CLIP scores image-text pairs and bucketing assigns z; (b) Alignment-level-conditioned Training — each batch mixes noisy pairs (z=1), less-aligned pairs (z=2), and well-aligned pairs (z=3); (c) Quality-controllable Caption Generation at Inference Time — the same image is captioned as "the view from the dining room" (z=1), "the restaurant is a popular place for locals and tourists alike to eat and drink" (z=2), and "The restaurant is a nod to the late 19th century, with a large glass-walled dining room" (z=3).]
[Clipped excerpt from the paper's related-work discussion: approaches that add a noise transition matrix on top of the model estimate noisy class posteriors from the underlying label transitions, but such a transition matrix is impractical for image captioning due to the numerous words; other noise-handling methods need additional co-trained networks, which is unsuitable at large scale because of computational costs. In addition, BLIP [24] tackled noisy web-crawled data with a data bootstrapping scheme, but relies on complementary clean human annotations in addition to web-crawled data for its captioning and filtering modules. In contrast, the proposed noise-aware framework uses only web-crawled data, without any clean annotations.]
Noise-aware Learning for Image Captioning [excerpt from the paper]: To fully benefit from the rich information of web-crawled data, we aim to make a captioning model robust to noise while using the whole dataset.
As illustrated in Fig. 3, we propose our NoC framework with an alignment-level-controllable captioner. During training, we identify an alignment level l of the given image-text pair and use it as a control signal z to make the captioning model conditioned on the alignment level, which leads to a new objective:

L = log p(c | I, z) = Σ_{t=0}^{T} log p(w_{t+1} | w_t, I, z).  (2)

This objective encourages the model to learn the capability of generating captions of different alignment levels depending on a control signal. Then, at inference time, we can steer the captioning model to generate accurate captions by simply feeding a control signal meaning the top level of alignment. With this model design, as discussed in our experiments, we gain the following two advantages: (1) the model can learn well-aligned data with less distraction by noisy data, and (2) it can still take advantage of data containing …
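A minimal PyTorch sketch of the training objective in Eq. (2); the captioner interface (a model returning per-token logits given image, tokens, and z) and the padding id are assumptions, only the conditioning structure follows the paper:

```python
import torch.nn.functional as F

def conditioned_caption_loss(model, images, captions, z, pad_id=0):
    """Negative log-likelihood of Eq. (2), conditioned on alignment level z.

    model    : hypothetical captioner; model(images, tokens, z) returns
               logits of shape (B, T, vocab_size)
    captions : (B, T+1) token ids including <BOS>/<EOS>
    z        : (B,) discrete alignment levels in {1, ..., K}
    """
    inputs, targets = captions[:, :-1], captions[:, 1:]
    logits = model(images, inputs, z)              # p(w_{t+1} | w_{<=t}, I, z)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,                       # assumed padding id
    )
```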
6. How alignment levels are assigned
■ Alignment levels are determined with CLIP [Radford+, ICML2021]
• Quantifies the image-text correlation
■ The CLIP similarity score s is converted into one of K alignment levels
• Data with high similarity scores gets l = K
■ Bucketing variants
• Uniform bucketing
• Quantile bucketing
CLIP is trained on large-scale web-crawled data; contrastive learning typically drives a model to learn the correlation between images and texts in a data-driven manner, so the similarity scores by CLIP can be used as alignment scores for image-text pairs.
Given the CLIP similarity scores s ∈ ℝ from all training samples, we convert them into K discrete alignment levels l ∈ {1, ..., K} via a bucketing technique:

l = f_bucket(s),   (3)

where f_bucket is a bucketing function with K bins of equally spaced boundaries. This bucketing assigns noisier data with low CLIP similarity to a bucket of a lower alignment level (e.g., l = 1) and well-aligned data with high CLIP similarity to a bucket of a higher alignment level (e.g., l = K).
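The bucketing step itself is simple to sketch. The following NumPy snippet shows both variants named on the slide (uniform and quantile); the exact bin placement (here, boundaries spanning the observed score range) is an assumption.

    import numpy as np

    def bucket_levels(scores, K=8, strategy="uniform"):
        """Map CLIP similarity scores to alignment levels l in {1, ..., K}."""
        if strategy == "uniform":
            # K bins with equally spaced boundaries over the observed score range.
            edges = np.linspace(scores.min(), scores.max(), K + 1)
        else:  # "quantile": K bins holding roughly equal numbers of samples
            edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1))
        return np.digitize(scores, edges[1:-1]) + 1   # levels in {1, ..., K}

    scores = np.random.uniform(-0.04, 0.48, size=1000)  # roughly the CC3M range
    print(np.bincount(bucket_levels(scores, K=8)))      # samples per level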
7. Training with the alignment level embedded
- The model is modified to accept the alignment level (a minimal sketch follows this list)
  • The similarity output by CLIP is converted into 𝑧 by bucketing
  • A learnable embedding layer maps 𝑧 to a vector with the same dimensionality as the Image Encoder output
  • The key and value fed to the Decoder are generated from the Image Encoder output and the embedding vector of 𝑧
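A minimal PyTorch sketch of this conditioning path, assuming the concat fusion selected in Table 6(b); condition_embedder, to_key, and to_value are hypothetical names, not the authors' code.

    import torch
    import torch.nn as nn

    d_model, K = 64, 8
    condition_embedder = nn.Embedding(K + 1, d_model)  # learnable embedding of z
    to_key = nn.Linear(d_model, d_model)
    to_value = nn.Linear(d_model, d_model)

    def decoder_key_value(image_feats, z):
        """image_feats: (B, N, d) from the Image Encoder; z: (B,) alignment levels."""
        z_vec = condition_embedder(z).unsqueeze(1)     # (B, 1, d), same width as image feats
        mem = torch.cat([image_feats, z_vec], dim=1)   # concat fusion (cf. Table 6(b))
        return to_key(mem), to_value(mem)              # key/value for decoder cross-attention

    k, v = decoder_key_value(torch.randn(2, 5, d_model), torch.tensor([3, 7]))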
Appendix
Training Details for All Experiments
We provide detailed explanations for the experimental settings of all our reported results in the main paper. We have conducted our experiments under two different settings: 1) Tunable-Dense, 2) Frozen-CLS. The main difference between the two configurations depends on the way of utilizing the visual encoder. Regarding the text decoder, we use the same decoder model, which consists of a 6-layer transformer network initialized randomly for both settings. For the main experiments, i.e., Tabs. 1 to 3, we use the Tunable-Dense setting, where the visual encoder, i.e., pre-trained CLIP ViT-L/14, is tuned during the training phase.
Figure 10: The network architecture of our alignment-level-controllable captioner. A frozen CLIP computes the image-text similarity, bucketing converts it into the control signal, and a Condition Embedder embeds the signal; the Image Encoder output and this embedding together form the key/value input of the Caption Decoder (e.g., input "<BOS> The people are all in a restaurant some are sitting" → output "The people … sitting <EOS>").
8. Experimental setup
- Tasks
  • Zero-shot captioning
  • Self-retrieval
- Evaluation metrics
  • Zero-shot captioning: BLEU@4 [Papineni+, ACL2002], METEOR [Banerjee & Lavie, ACL Workshop 2005], CIDEr [Vedantam+, CVPR2015], SPICE [Anderson+, ECCV2016], CLIPScore [Hessel+, EMNLP2021]
  • Self-retrieval: CLIPScore, Recall
- Datasets
  • Training: Conceptual Captions 3M (CC3M) [Sharma+, ACL2018]
  • Evaluation: MSCOCO [Lin+, ECCV2014], nocaps [Agrawal+, ICCV2019], Flickr30k [Plummer+, ICCV2015]
- Baselines (transformer)
  • Vanilla
  • Vanilla (Filtering)
  • Loss weighting by CLIP similarity
The expert for the top alignment level (i.e., z = K) is specialized to a bucket of highly aligned image-text pairs (like filtering), while parameter sharing between experts allows sharing of learned knowledge for rich visual concepts across all experts (e.g., our model with z < K).
4. Experiment
4.1. Evaluation Setting
Recall that our goal is to learn a captioning model from web-crawled data, including its inherent noise; therefore, we evaluate the quality of models without fine-tuning on clean data to check how effectively they tackle noisy data. To assess the quality of models, we consider two aspects of the generated captions: descriptiveness (i.e., how well they align with the given images) and distinctiveness (i.e., how well they describe images through their unique aspects). To be specific, we measure the former with captioning metrics and the latter via self-retrieval.

Filtering. This baseline is a vanilla captioning model trained only on filtered data to tackle the noise issue. Following the previous convention [42], we calculate the cosine similarity for each (image, text) pair using a pre-trained CLIP ViT-B/32 model [38], then keep pairs with similarity larger than 0.3 as training data. The threshold of 0.3 is selected following [42]; it gives the best performance among thresholds from 0.1 to 0.35 in steps of 0.05, as presented in Fig. 4.
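Assuming pre-computed, L2-normalized CLIP embeddings, the filtering rule reduces to a cosine-similarity threshold; this NumPy sketch is illustrative, not the authors' pipeline.

    import numpy as np

    def filter_pairs(image_feats, text_feats, threshold=0.3):
        """image_feats, text_feats: (N, d) L2-normalized CLIP embeddings."""
        sims = np.sum(image_feats * text_feats, axis=1)  # row-wise cosine similarity
        return sims > threshold, sims                    # keep mask and raw scores

    img = np.random.randn(4, 512); img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt = np.random.randn(4, 512); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    keep, sims = filter_pairs(img, txt)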
Loss weighting. The final baseline is a vanilla captioning model trained with a loss re-weighting strategy to tackle the noise issue. In this baseline, instead of filtering, we use the CLIP similarity score to re-weight the sample-wise loss as

L_weighting = (1/N) Σ_{i=1}^{N} s_i log p(c_i | I_i),   (4)

where s_i is the cosine similarity computed by CLIP for the i-th image-text pair in a minibatch of N instances. Further details for this baseline are explained in Appendix D.
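In code, the re-weighting is a one-liner over per-sample losses. A minimal PyTorch sketch, where per_sample_nll is a hypothetical stand-in for -log p(c_i | I_i) from any captioning model (Eq. (4) is written with the log-likelihood, so the loss takes its negative):

    import torch

    def weighted_loss(per_sample_nll, s):
        """per_sample_nll: (N,) negative log-likelihoods; s: (N,) similarity weights."""
        return (s * per_sample_nll).mean()

    nll = torch.rand(8)       # toy -log p(c_i | I_i) values
    s = torch.rand(8) * 2     # similarities re-scaled to [0, 2] (see Appendix D)
    loss = weighted_loss(nll, s)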
10. Comparison with prior methods (Zero-shot Captioning)
- The proposed method achieves high performance with few parameters

Method | MSCOCO (B@4 / M / C / CS) | nocaps (C / CS per split)
Vanilla | 10.31 / 15.48 / 47.56 / 62.89 | 41.58 / 60.49 | 38.60 / 58.64 | 39.24 / 59.91 | 51.22 / 62.15
Vanilla (Filtering) | 12.81 / 17.30 / 54.66 / 64.84 | 48.96 / 62.70 | 46.06 / 60.74 | 46.33 / 62.35 | 59.50 / 63.92
Loss weighting | 11.16 / 16.15 / 50.86 / 63.87 | 43.89 / 61.18 | 39.30 / 59.23 | 41.80 / 60.50 | 53.84 / 63.04
NoC (z=7) | 15.96 / 19.50 / 62.04 / 66.70 | 54.94 / 64.21 | 51.74 / 62.54 | 53.09 / 63.92 | 63.15 / 65.19

Table 1: Caption generation performance comparison with baselines on the MSCOCO and nocaps datasets, where all models are trained on CC3M without fine-tuning on the target dataset. B@4, M, C, and CS denote the BLEU@4, METEOR, CIDEr, and CLIPScore metrics, respectively. Numbers in bold indicate the best method.
Method | Visual Encoder | Text Decoder | Data | MSCOCO (BLEU@4 / METEOR / CIDEr / SPICE)
Inference-time optimization or unpaired training:
ZeroCap [44] | CLIP ViT-B/32 | GPT2 (345M) [39] | - | 2.60 / 11.50 / 14.60 / 5.50
Socratic Models [56] | CLIP ViT-L/14 | GPT3 (175B) [4] | - | 10.00 / 16.20 / 50.10 / 10.80
DeCAP [25] | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M-text | 8.80 / 16.00 / 42.10 / 10.90
Supervised training with image-text paired data:
Re-ViLM [51] | CLIP ViT-L/14 | RETRO (410M) [3] | CCS + COYO [5] | 17.90 / - / 53.60 / -
SimVLM 1.4B [49] | - | - | ALIGN 1.8B [20] | 11.20 / 14.70 / 32.20 / 8.50
NoC (z=7)† | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M | 14.10 / 18.12 / 48.66 / 12.57
NoC (z=7)† | CLIP ViT-L/14 | Transformer 6-layer (94.5M) | CC3M | 15.96 / 19.50 / 62.04 / 14.37

Table 2: Comparison with other models reporting zero-shot captioning scores. CCS is a combination of the CC3M, CC12M [6], and SBU [35] datasets. † indicates a method that explicitly handles the problem of noisy samples. Numbers in bold indicate the best method.
which indicates that learning a model using well-aligned captions is important for descriptive caption generation. On the other hand, our model significantly outperforms all baselines on both the MSCOCO and nocaps datasets. In particular, the notable gain on nocaps, which covers more diverse visual concepts, implies that noise-aware learning across the entire dataset is important for acquiring a broader range of visual knowledge compared to learning from filtered data.
Models | MSCOCO (R@1 / R@5 / R@10) | Flickr30k (R@1 / R@5 / R@10)
GT Caption | 34.57 / 59.30 / 69.91 | 63.08 / 86.50 / 92.00
Vanilla | 25.44 / 50.38 / 61.66 | 47.10 / 76.60 / 85.90
Vanilla (Filtering) | 31.64 / 58.90 / 70.36 | 56.50 / 85.50 / 92.50
Loss weighting | 28.78 / 54.44 / 65.44 | 48.00 / 78.90 / 87.50
NoC (z=7) | 40.00 / 66.78 / 77.53 | 65.10 / 92.00 / 96.20

Table 3: Comparison of self-retrieval capability on the MSCOCO and Flickr30k datasets. Numbers in bold indicate the best method.
Figure 8: Examples of generated captions. Compared to the baselines (Vanilla, Vanilla (Filtering)), our model can generate captions of different quality by adjusting the control signal (z); as we feed z meaning higher alignment levels (3 → 5 → 7), captions become more descriptive and distinct, with expressions capturing fine details from images (e.g., from "Cattle graze in a field." to "Cattle graze in a pasture with storm cloud overhead.", and from "A desk full of computers." to "A photo of a cluttered desk with computers, keyboard and mouse.").
11. Comparison with baselines (Self-retrieval)
- Higher performance than all baselines (the retrieval protocol is sketched below)
- The proposed method even outperforms the GT captions
  • Human annotations are not written in a consistent style
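For reference, a minimal sketch of the self-retrieval protocol, assuming pre-computed, L2-normalized CLIP embeddings where caption i was generated for image i; this illustrates R@K, not the authors' evaluation code.

    import numpy as np

    def recall_at_k(caption_feats, image_feats, k):
        """caption_feats, image_feats: (N, d) L2-normalized; row i belongs to image i."""
        sims = caption_feats @ image_feats.T             # (N, N) similarity matrix
        topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k best images
        hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
        return hits.mean()                               # fraction retrieved in top-k

    c = np.random.randn(100, 512); c /= np.linalg.norm(c, axis=1, keepdims=True)
    im = np.random.randn(100, 512); im /= np.linalg.norm(im, axis=1, keepdims=True)
    print(recall_at_k(c, im, k=5))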
Comparison with other concurrent works. While the primary goal of our experiments is to measure the noise-robustness, we provide additional comparisons with other works that report zero-shot performance on the MSCOCO to better contextualize the effectiveness of our work in Tab. 2. The reported works are divided into two groups: 1) captioning with inference-time optimization [44, 56] or leveraging an unpaired training strategy [25], and 2) directly training entire networks [49] or
retrieval-augmented models [51], which may be biased toward the n-grams in retrieved captions by their retrieval augmentation technique at inference time. Meanwhile, the much higher CIDEr score of our model compared to Re-ViLM indicates that NoC generates captions with more diverse expressions, considering the TF-IDF weighting of CIDEr.
4.5. Self-retrieval Task
We compare self-retrieval capability to measure how
well each algorithm distinctively describes given images.
12. Effect of fine-tuning
- Models pre-trained on CC3M are fine-tuned on MSCOCO
- The proposed method performs better
  • The proposed method is highly robust to noise
  • Even human-annotated data is not annotated consistently
  • This confirms the proposed method is effective
Method | MSCOCO test split (CIDEr / SPICE) | nocaps overall (CIDEr / SPICE)
Vanilla (Filtering) | 126.14 / 22.37 | 87.03 / 12.52
NoC | 129.09 / 23.23 | 93.33 / 13.40

Table 4: Performance on MSCOCO and nocaps after fine-tuning on MSCOCO.

Figure 5: Distribution of CLIP scores on the MSCOCO dataset and examples at different alignment scores.
13. Performance vs. dataset size
- COYO dataset
  • 700M image-text pairs
- The dataset is split to create subsets of various sizes
  • 3M, 10M, 23M, 125M
- The Vanilla model's performance plateaus
- Vanilla (Filtering) improves as the data grows
- The proposed method is consistently the best
The baselines often fail to describe images in a way that makes them distinguishable from others. In contrast, our model can generate highly aligned captions with fine-grained details for the given images by adjusting the control signal to a high alignment level. This result implies that our model is effective in generating distinct captions. Qualitative examples of the retrieval results are illustrated in Fig. 9 and Appendix H.
14. Why is it robust to noise?
- Train for a long period and analyze noise through memorization of the training data
- Training data
  • CC3M train
- Validation data
  • CC3M train (CLIP similarity: 0 to 0.15)
  • CC3M train (CLIP similarity: 0.35 and above)
- The 𝑧 = 7 model
  • Learns mainly from high-similarity data
  • Few EM (Exact Matching) counts on the Noisy Group, and many on the Well-aligned Group (see the sketch after Table 5)
  • It is not affected by the noisier data
- The Vanilla model
  • Memorization of noisy data interferes with learning from high-similarity data
Method | Noisy Group (EM / B@4 / CIDEr / CS) | Well-aligned Group (EM / B@4 / CIDEr / CS)
Vanilla | 5350 / 12.63 / 122.24 / 52.49 | 8092 / 43.46 / 407.10 / 82.67
NoC (z=3) | 10527 / 20.41 / 210.83 / 42.87 | 225 / 7.43 / 53.33 / 56.35
NoC (z=5) | 939 / 5.83 / 48.47 / 55.86 | 4882 / 33.32 / 304.18 / 77.69
NoC (z=7) | 47 / 2.10 / 18.28 / 59.57 | 10562 / 52.78 / 504.49 / 86.57

Table 5: Comparison of memorization capability on CC3M training samples of two different noise levels. Note that higher EM (# Exact Matching), B@4 (BLEU@4), and CIDEr scores in the noisy group mean over-memorization of noisy data. Higher CS (CLIPScore) means better alignment with the given images.
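The EM column above could be computed along these lines; the string normalization is an assumption, and the paper's exact matching rule may differ.

    def exact_match_count(generated, references):
        """Count generated captions that literally reproduce their training caption."""
        norm = lambda s: " ".join(s.lower().split())   # case/whitespace-insensitive
        return sum(norm(g) == norm(r) for g, r in zip(generated, references))

    gen = ["a girl scrubs her shoes clean", "people walk on the sidewalk"]
    ref = ["A girl scrubs her shoes clean", "crowds near the station"]
    print(exact_match_count(gen, ref))  # 1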
(a) Bucketing strategy:
Options | CIDEr | CLIPScore
Quantile | 49.22 | 65.72
Uniform | 49.18 | 66.65

(b) Control fusion method:
Options | CIDEr | CLIPScore
Sum | 48.99 | 65.19
MLP | 48.70 | 66.94
Concat. | 49.18 | 66.65

Table 6 (a, b): Ablations on zero-shot MSCOCO captioning after training on the CC3M dataset.
15. Ablation study
- Ablation study of the bucketing and the alignment-level embedding
- Bucketing
  • Uniform with 𝑏𝑖𝑛𝑠 = 8 performs best
- Alignment-level embedding
  • Concat performs best
  • Sum and MLP act like a smoothing operation
  • Concat is the best way to preserve the meaning of the condition as an alignment level
Figure 7: Performance on two groups of different similarities from the CC3M validation set.
(c) The number of bucket bins:
Options | CIDEr | CLIPScore
4 | 49.28 | 65.67
8 | 49.18 | 66.65
16 | 48.58 | 65.67

Table 6: Ablations on zero-shot MSCOCO captioning after training on the CC3M dataset. The options highlighted in bold are the selected ones.
18. Details of the loss-weighting baseline
- The loss is computed with weights given by CLIP similarity
- The highest similarity in the data is 0.482
  • Using the raw similarities slows convergence
- Similarities are min-max normalized to 0-1 and then doubled
  • Mean: 1
  • Minimum: 0
  • Maximum: 2
4.3. Implementation details
We employ a pre-trained CLIP ViT-L/14 for encoding visual features and computing the alignment level z, and use 6 randomly initialized transformer blocks as the caption decoder. Except for the results in Tabs. 1 to 3, we freeze the visual encoder and take a single CLS token as the visual feature, due to its efficiency, in all ablation and analytical experiments. More details regarding training settings are described in the Appendix.
Method | MSCOCO (B@4 / M / C / CS) | nocaps (C / CS per split)
Vanilla (Filtering) | 12.81 / 17.30 / 54.66 / 64.84 | 48.96 / 62.70 | 46.06 / 60.74 | 46.33 / 62.35 | 59.50 / 63.92
Bootstrap | 13.51 / 17.46 / 55.13 / 64.31 | 49.46 / 62.16 | 45.23 / 60.60 | 47.14 / 61.62 | 59.93 / 63.62
NoC (z=7) | 15.96 / 19.50 / 62.04 / 66.70 | 54.94 / 64.21 | 51.74 / 62.54 | 53.09 / 63.92 | 63.15 / 65.19

Table 8: Comparison with the Bootstrap baseline. All models are trained on CC3M and evaluated on the MSCOCO and nocaps datasets. B@4, M, C, and CS mean BLEU@4, METEOR, CIDEr, and CLIPScore metrics, respectively. Numbers in bold indicate the best method.
D. Details for the Loss Weighting Baseline
We have defined the Loss weighting baseline in Sec. 4.2 as follows:

L_weighting = (1/N) Σ_{i=1}^{N} s_i log p(c_i | I_i),   (5)

where s_i is the cosine similarity computed by CLIP for the i-th image-text pair in a minibatch of N instances.
Additionally, since the cosine similarities of the training samples range approximately between 0 and 0.5, as shown in Fig. 11, we apply min-max normalization to s and multiply by 2 to equalize the averaged loss scale. This loss re-weighting strategy makes the model update less on misaligned samples and more on well-aligned ones.
For a better understanding, we visualize the distribution of cosine similarities of the CC3M training split in Fig. 11. The raw cosine similarities range from -0.037 to 0.482, with a mean of 0.239. If we used the cosine similarities directly for the loss re-weighting defined in Eq. (5), the scale of the overall loss value would become low, which could result in slow convergence of the training phase. Thus, to equalize the averaged loss scale, we re-scale the cosine similarities for the loss re-weighting strategy by applying min-max normalization and multiplying by 2.
Figure 11: Distribution of cosine similarities on the CC3M training split. (a) Before re-scaling: min -0.037, max 0.482, mean 0.239. (b) After re-scaling: min 0.0, max 2.0, mean 1.066.
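The re-scaling itself is a two-step transform; a minimal sketch reproducing the statistics above (note that 2·(0.239 + 0.037)/(0.482 + 0.037) ≈ 1.06, matching the reported mean of 1.066).

    import numpy as np

    def rescale_similarities(s):
        """Min-max normalize to [0, 1], then double, so weights lie in [0, 2]."""
        s = np.asarray(s, dtype=np.float64)
        return 2.0 * (s - s.min()) / (s.max() - s.min())

    s = np.array([-0.037, 0.239, 0.482])   # min / mean / max on the CC3M split
    print(rescale_similarities(s))         # [0.0, ~1.06, 2.0]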