Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh, ICCV 2023
Presenter: 橋口凌大 (Nagoya Institute of Technology)
2023/10/30
Overview
■ What is image captioning?
• Generating a description that fits an input image
■ Pre-training on large amounts of web-crawled data [Changpinyo+, CVPR2021], [Hu+, CVPR2022]
• Problem: such data contains a lot of noise
■ Training an image captioning model that exploits the noise
• Proposes a framework that also learns from noisy samples, so that no training data has to be discarded
[Figure 2: examples of web-crawled image-text pairs around the selected CLIP-similarity threshold of 0.3; a filtered noise sample ("this is about what it looked like", similarity 0.171) is shown.]
Related Work
■ Training on human-annotated caption data [Lin+, ECCV2014], [Krishna+, IJCV2017]
• Dataset size cannot be scaled (annotation is expensive)
• Small training data means only limited knowledge is acquired
■ Training on web-crawled datasets [Li+, ECCV2020], [Zhang+, CVPR2021]
• The dataset can be scaled up
• By nature, the data contains a lot of noise
■ No existing method handles such noisy data effectively
Observation
■ Examples of image-text similarity computed with CLIP [Radford+, ICML2021]
• Low-similarity pairs are often low quality
• However, some low-similarity pairs are still useful on human inspection
■ Goal: also exploit data that simple filtering would discard
[Figure: four example image-text pairs with CLIP similarities 0.171, 0.193, 0.289, and 0.353 — "this is about what it looked like" (a filtered noise sample), "a girl scrubs her shoes clean in a dish of soapy water", "ambulances line up outside the international airport terminal" (filtered but informative), and "people walk on the sidewalk" (a non-filtered sample).]
Method Overview
■ Define an alignment level z for each image-text pair
■ Train the captioner conditioned on an embedding of the alignment level
■ At inference time, generate captions with the top alignment level
[Figure 3: (a) Alignment Level Assignment — a frozen CLIP scores image-text pairs and bucketing assigns z; (b) Alignment-level-conditioned Training — each batch mixes noisy (z=1), less-aligned (z=2), and well-aligned (z=3) pairs; (c) Quality-controllable Caption Generation at Inference Time — e.g., z=1 "the view from the dining room", z=2 "the restaurant is a popular place for locals and tourists alike to eat and drink", z=3 "The restaurant is a nod to the late 19th century, with a large glass-walled dining room".]
[Paper excerpt, left column clipped: noise-handling approaches based on noise transition matrices or additional co-trained networks are unsuitable for large-scale image-text data due to the numerous words and the computational cost; BLIP [24] tackles noisy web-crawled data with a data bootstrapping captioner and filter, but relies on complementary clean human annotations, whereas the proposed noise-aware framework uses only web-crawled data.]
[...] for Image Captioning
To fully benefit from the rich information of web-crawled data, we aim to make a captioning model robust to noise while using the whole dataset.
As illustrated in Fig. 3, we propose our NoC framework with an alignment-level-controllable captioner. During training, we identify an alignment level l of the given image-text pair and use it as a control signal z to make the captioning model conditioned on the alignment level, which leads to a new objective:
\mathcal{L} = \log p(c \mid I, z) = \sum_{t=0}^{T} \log p(w_{t+1} \mid w_t, I, z).  (2)
This objective encourages the model to learn the capability of generating captions of different alignment levels depending on a control signal. Then, at inference time, we can steer the captioning model to generate accurate captions by simply feeding a control signal meaning the top level of alignment. With this model design, as discussed in our experiments, we can take the following two advantages: (1) it can learn well-aligned data with less distraction by noisy data, and (2) it can still take advantage of data containing [...]
How the Alignment Level Is Assigned
■ The alignment level is determined with CLIP [Radford+, ICML2021]
• CLIP quantifies the image-text correlation
■ The CLIP similarity score s is converted into one of K alignment levels
• The most similar pairs are assigned l = K
■ Bucketing variants (see the sketch after the excerpt below)
• Uniform bucketing
• Quantile bucketing
web-crawled data; contrastive learning typically drives a model to learn the correlation between images and texts in a data-driven manner, so the similarity scores by CLIP can be used as alignment scores for image-text pairs.
Given the CLIP similarity scores s \in \mathbb{R} from all training samples, we convert them into K discrete alignment levels l \in \{1, \ldots, K\} via a bucketing technique:
l = f_{\text{bucket}}(s),  (3)
where f_{\text{bucket}} is a bucketing function with K bins of equally spaced boundaries. This bucketing makes more noisy data of low CLIP similarity assigned to a bucket of the lower alignment level (e.g., l = 1) and well-aligned data of high CLIP similarity allocated to a bucket of the higher alignment level.
[Paper excerpt, right column clipped: each model is VirTex-like due to its simplicity; the alignment-level embedding is concatenated with the image embeddings and fed into each cross-attention layer of the decoder, with l \in \{1, 2, \ldots, K\}.]
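As a concrete reading of Eq. (3), the sketch below (assumed helper names, not the released code) converts an array of CLIP similarity scores into K levels with either equally spaced ("uniform") or quantile boundaries, the two variants listed above:

```python
# Sketch of converting CLIP similarity scores into K discrete alignment levels
# l in {1, ..., K}; "uniform" uses equally spaced boundaries, "quantile" uses
# bins with (roughly) equal sample counts.
import numpy as np

def bucket_alignment_levels(scores: np.ndarray, K: int = 8,
                            strategy: str = "uniform") -> np.ndarray:
    if strategy == "uniform":
        edges = np.linspace(scores.min(), scores.max(), K + 1)
    elif strategy == "quantile":
        edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1))
    else:
        raise ValueError(strategy)
    # digitize against the K-1 interior edges yields indices 0..K-1; shift to 1..K
    return np.digitize(scores, edges[1:-1]) + 1

# Example: noisier pairs (low similarity) land in low levels, well-aligned pairs in K.
scores = np.array([0.05, 0.17, 0.24, 0.29, 0.35, 0.48])
print(bucket_alignment_levels(scores, K=8))
```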
Training with the Alignment Level Embedded
■ The captioner is modified to take the alignment level as input
• The control signal z is produced by bucketing the CLIP similarity
• A learnable embedding layer maps z to a vector with the same dimension as the image encoder output
• The keys and values fed to the decoder are built from the image encoder output together with the embedding of z
[Appendix excerpt, clipped: Training Details for All Experiments. Experiments are run under two settings, 1) Tunable-Dense and 2) Frozen-CLS, which differ in how the visual encoder is used; the text decoder is the same randomly initialized 6-layer transformer in both. The main experiments (Tabs. 1 to 3) use the Tunable-Dense setting, where the pre-trained CLIP ViT-L/14 visual encoder is tuned during training.]
[Figure 10: network architecture of the alignment-level-controllable captioner — a frozen CLIP computes the image-text similarity, bucketing turns it into the control signal, a Condition Embedder embeds the control signal, and the Image Encoder features together with the condition embedding form the key/value memory of the Caption Decoder, which is trained to reproduce the paired text (e.g., "<BOS> The people are all in a restaurant some are sitting").]
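To illustrate the quality-controllable generation step (feed the top alignment level z = K as the control signal), here is a hedged sketch of greedy decoding; the model call signature (returning next-token logits given image features, z, and the tokens so far) and the special-token ids are assumptions:

```python
# Sketch of quality-controllable caption generation: pass the top alignment
# level z = K as the control signal and decode greedily.  The `model` call
# signature and the BOS/EOS ids are assumptions, not the released API.
import torch

@torch.no_grad()
def generate_caption(model, image_feats, K=8, bos_id=1, eos_id=2, max_len=30):
    batch = image_feats.size(0)
    z = torch.full((batch,), K, dtype=torch.long, device=image_feats.device)
    tokens = torch.full((batch, 1), bos_id, dtype=torch.long, device=image_feats.device)
    for _ in range(max_len):
        logits = model(image_feats, z, tokens)               # (B, T, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():                       # stop once every caption ended
            break
    return tokens
```

Feeding lower z values instead deliberately produces less-aligned, web-style captions, which is how the z = 3/5/7 examples in Fig. 8 differ.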
Experimental Setup
■ Tasks
• Zero-shot captioning
• Self-retrieval
■ Evaluation metrics
• Zero-shot captioning: BLEU@4 [Papineni+, ACL2002], METEOR [Banerjee & Lavie, ACL Workshop 2005], CIDEr [Vedantam+, CVPR2015], SPICE [Anderson+, ECCV2016], CLIPScore [Hessel+, EMNLP2021]
• Self-retrieval: CLIPScore, Recall
■ Datasets
• Training: Conceptual Captions 3M (CC3M) [Sharma+, ACL2018]
• Evaluation: MSCOCO [Lin+, ECCV2014], nocaps [Agrawal+, ICCV2019], Flickr30k [Plummer+, ICCV2015]
■ Baselines (transformer)
• Vanilla
• Vanilla (Filtering)
• Loss re-weighting by CLIP similarity
[...] z = K) is specialized to a bucket of highly aligned image-text pairs (like filtering), while parameter sharing between experts allows sharing of learned knowledge for rich visual concepts across all experts (e.g., our model with z ≤ K).
4. Experiment
4.1. Evaluation Setting
Recall that our goal is to learn a captioning model from web-crawled data, including its inherent noise; therefore, we evaluate the quality of models without fine-tuning on clean data to check how effectively they tackle noisy data. To assess the quality of models, we consider two aspects of generated captions: descriptiveness (i.e., how well aligned with given images) and distinctiveness (i.e., how well they describe images with their unique aspects). To be specific, we [...]
the noise issue. Following the previous convention [42], we calculate the cosine similarity for each (image, text) pair using a pre-trained CLIP ViT-B/32 model [38], then keep pairs with similarity larger than 0.3 as training data. The threshold of 0.3 is selected following [42] and gives the best performance among thresholds from 0.1 to 0.35 with steps of 0.05, as presented in Fig. 4.
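For reference, a minimal sketch of this filtering baseline using the openai CLIP package (the image path and caption are placeholders; the 0.3 threshold follows the text above):

```python
# Sketch of the CLIP-similarity filtering baseline: keep a pair only if the
# ViT-B/32 cosine similarity exceeds 0.3.  File name and caption are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a web-crawled caption"]).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)

img_f = img_f / img_f.norm(dim=-1, keepdim=True)
txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
similarity = (img_f * txt_f).sum(dim=-1).item()   # cosine similarity
keep_for_training = similarity > 0.3
```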
Loss weighting. The final baseline is a vanilla captioning model trained with a loss re-weighting strategy to tackle the noise issue. In this baseline, instead of filtering, we use the CLIP similarity score to re-weight the sample-wise loss as
\mathcal{L}_{\text{weighting}} = \frac{1}{N} \sum_{i=1}^{N} s_i \log p(c_i \mid I_i),  (4)
Comparison with Baselines (Zero-shot Captioning)
■ Vanilla performs worst
• It is affected most strongly by noise
■ Filtering by cosine similarity gives a large gain
• Similarity-based filtering removes noisy pairs
■ The proposed method is the best on every metric
Models | MSCOCO (B@4 / M / C / CS) | nocaps overall (C / CS) | in-domain (C / CS) | near-domain (C / CS) | out-of-domain (C / CS)
Vanilla | 10.31 / 15.48 / 47.56 / 62.89 | 41.58 / 60.49 | 38.60 / 58.64 | 39.24 / 59.91 | 51.22 / 62.15
Vanilla (Filtering) | 12.81 / 17.30 / 54.66 / 64.84 | 48.96 / 62.70 | 46.06 / 60.74 | 46.33 / 62.35 | 59.50 / 63.92
Loss weighting | 11.16 / 16.15 / 50.86 / 63.87 | 43.89 / 61.18 | 39.30 / 59.23 | 41.80 / 60.50 | 53.84 / 63.04
NoC (z=7) | 15.96 / 19.50 / 62.04 / 66.70 | 54.94 / 64.21 | 51.74 / 62.54 | 53.09 / 63.92 | 63.15 / 65.19
Table 1: Caption generation performance comparison with baselines on the MSCOCO and nocaps datasets, where all models are trained with CC3M without fine-tuning on the target dataset. B@4, M, C, and CS mean BLEU@4, METEOR, CIDEr, and CLIPScore, respectively. Numbers in bold indicate the best method.
Comparison with Prior Methods (Zero-shot Captioning)
■ The proposed method achieves high performance with fewer parameters
Method | Visual Encoder | Text Decoder | Data | MSCOCO (BLEU@4 / METEOR / CIDEr / SPICE)
Inference-time optimization or un-paired training:
ZeroCap [44] | CLIP ViT-B/32 | GPT2 (345M) [39] | - | 2.60 / 11.50 / 14.60 / 5.50
Socratic Models [56] | CLIP ViT-L/14 | GPT3 (175B) [4] | - | 10.00 / 16.20 / 50.10 / 10.80
DeCAP [25] | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M-text | 8.80 / 16.00 / 42.10 / 10.90
Supervised training with image-text paired data:
Re-ViLM [51] | CLIP ViT-L/14 | RETRO (410M) [3] | CCS + COYO [5] | 17.90 / - / 53.60 / -
SimVLM 1.4B [49] | - | - | ALIGN 1.8B [20] | 11.20 / 14.70 / 32.20 / 8.50
NoC (z=7)† | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M | 14.10 / 18.12 / 48.66 / 12.57
NoC (z=7)† | CLIP ViT-L/14 | Transformer 6-layer (94.5M) | CC3M | 15.96 / 19.50 / 62.04 / 14.37
Table 2: Comparison with other models reporting zero-shot captioning scores. CCS is the combination of the CC3M, CC12M [6], and SBU [35] datasets. † indicates a method that explicitly handles the problem of noisy samples. Numbers in bold indicate the best method.
[...] which indicates that learning a model on well-aligned captions is important for descriptive caption generation. On the other hand, our model significantly outperforms all baselines on both the MSCOCO and nocaps datasets. In particular, the notable gain on nocaps, which covers more diverse visual concepts, implies the importance of noise-aware learning across the entire dataset for acquiring a broader range of visual knowledge compared to learning from filtered data.
Comparison with Baselines (Self-retrieval)
■ Higher performance than every baseline
■ The proposed method even outperforms the GT captions
• Human annotations are not written consistently
Comparison with other concurrent works. While the primary goal of our experiments is to measure noise-robustness, we provide additional comparisons with other works that report zero-shot performance on MSCOCO to better contextualize the effectiveness of our work in Tab. 2. The reported works are divided into two groups: 1) captioning with inference-time optimization [44, 56] or leveraging an unpaired training strategy [25], and 2) directly training entire networks [49] or [...]
Models | MSCOCO (R@1 / R@5 / R@10) | Flickr30k (R@1 / R@5 / R@10)
GT Caption | 34.57 / 59.30 / 69.91 | 63.08 / 86.50 / 92.00
Vanilla | 25.44 / 50.38 / 61.66 | 47.10 / 76.60 / 85.90
Vanilla (Filtering) | 31.64 / 58.90 / 70.36 | 56.50 / 85.50 / 92.50
Loss weighting | 28.78 / 54.44 / 65.44 | 48.00 / 78.90 / 87.50
NoC (z=7) | 40.00 / 66.78 / 77.53 | 65.10 / 92.00 / 96.20
Table 3: Comparison of self-retrieval capability on the MSCOCO and Flickr30k datasets. Numbers in bold indicate the best method.
[...] the n-grams in retrieved captions by their retrieval-augmentation technique at inference time, while the much higher CIDEr score of our model compared to Re-ViLM indicates that NoC generates captions with more diverse expressions, considering the TF-IDF weighting of CIDEr.
4.5. Self-retrieval Task
We compare self-retrieval capability to measure how well each algorithm distinctively describes given images.
[Figure 8: Examples of generated captions for two images (cattle grazing; a cluttered desk), comparing Vanilla, Vanilla (Filtering), and Ours with z = 3, 5, 7. Compared to the baselines, our model can generate captions of different quality by adjusting the control signal z; as we feed z meaning higher alignment levels (3 → 5 → 7), captions become more descriptive and distinct, with expressions capturing fine details, e.g., "Cattle graze in a pasture with storm cloud overhead." and "A photo of a cluttered desk with computers, keyboard and mouse."]
Effect of Fine-tuning
■ Models pre-trained on CC3M are fine-tuned on MSCOCO
■ The proposed method performs best
• The proposed method is more robust to noise
• Even human-annotated data is not consistently annotated
• This shows the proposed method is effective
Method | MSCOCO test split (CIDEr / SPICE) | nocaps overall (CIDEr / SPICE)
Vanilla (Filtering) | 126.14 / 22.37 | 87.03 / 12.52
NoC | 129.09 / 23.23 | 93.33 / 13.40
Table 4: Performance on MSCOCO and nocaps after fine-tuning on MSCOCO.
[Figure 5: Distribution of CLIP scores on the MSCOCO dataset and examples at different alignment scores.]
Performance vs. Dataset Size
■ COYO dataset
• 700M image-text pairs
■ Subsets of various sizes are created by splitting the dataset
• 3M, 10M, 23M, 125M
■ The Vanilla model's performance plateaus
■ Vanilla (Filtering) keeps improving as data grows
■ The proposed method is consistently the best
[...] in a way that they are distinguishable from others. In contrast, our model can generate highly aligned captions with fine-grained details for the given images by adjusting the control signal with a high alignment level. This result implies our model is effective in generating distinct captions. The qualitative examples of the retrieval results are illustrated in Fig. 9 and Appendix H.
Why Is the Method Robust to Noise?
■ Train for a long time and analyze how the model memorizes (noisy) training data; a sketch of the exact-match analysis follows Table 5 below
■ Training data
• CC3M train
■ Evaluation data
• CC3M train samples with CLIP similarity 0–0.15 (noisy group)
• CC3M train samples with CLIP similarity 0.35 and above (well-aligned group)
■ The z = 7 model
• Learns mainly from high-similarity data
• The number of exact matches (EM) in the noisy group is small, and large in the well-aligned group
• It is barely affected by noisy data
■ The Vanilla model
• Memorizing noisy data hinders learning from high-similarity data
Method | Noisy Group (EM / B@4 / CIDEr / CS) | Well-aligned Group (EM / B@4 / CIDEr / CS)
Vanilla | 5350 / 12.63 / 122.24 / 52.49 | 8092 / 43.46 / 407.10 / 82.67
NoC (z=3) | 10527 / 20.41 / 210.83 / 42.87 | 225 / 7.43 / 53.33 / 56.35
NoC (z=5) | 939 / 5.83 / 48.47 / 55.86 | 4882 / 33.32 / 304.18 / 77.69
NoC (z=7) | 47 / 2.10 / 18.28 / 59.57 | 10562 / 52.78 / 504.49 / 86.57
Table 5: Comparison of memorization capability on CC3M training samples of two different noise levels. Higher EM (# exact matches), B@4 (BLEU@4), and CIDEr scores in the noisy group mean over-memorization of noisy data; higher CS (CLIPScore) means better alignment with the given images.
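As referenced above, here is a short sketch of the exact-match (EM) counting behind this memorization analysis; splitting the samples by CLIP similarity and generating the captions are assumed to have been done already, and the helper names are illustrative:

```python
# Sketch of the exact-match (EM) count from the memorization analysis: how many
# generated captions reproduce their web-crawled training text verbatim.
# Grouping by CLIP similarity (noisy: 0-0.15, well-aligned: >=0.35) is assumed.
def exact_match_count(generated: list[str], references: list[str]) -> int:
    normalize = lambda s: " ".join(s.lower().split())
    return sum(normalize(g) == normalize(r) for g, r in zip(generated, references))

# e.g., exact_match_count(captions_noisy_group, texts_noisy_group) would give the
# "EM" column of Table 5 for one model and one group.
```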
Ablation Study
■ Ablations on bucketing and on how the alignment level is embedded
■ Bucketing
• Uniform bucketing with 8 bins performs best
■ Alignment-level embedding (control fusion)
• Concatenation performs best
• Sum and MLP act like a smoothing operation
• Concatenation best preserves the conditioning role of the alignment level
[Figure 7: Performance on two groups of different similarities from the CC3M validation set.]
(a) Bucketing strategy:
Options | CIDEr | CLIPScore
Quantile | 49.22 | 65.72
Uniform | 49.18 | 66.65
(b) Control fusion method:
Options | CIDEr | CLIPScore
Sum | 48.99 | 65.19
MLP | 48.70 | 66.94
Concat. | 49.18 | 66.65
(c) Number of bucket bins:
Options | CIDEr | CLIPScore
4 | 49.28 | 65.67
8 | 49.18 | 66.65
16 | 48.58 | 65.67
Table 6: Ablations on zero-shot MSCOCO captioning after training on the CC3M dataset. The options highlighted in bold are selected [...]
Summary
■ Proposed a framework that trains a captioning model while making good use of noisy web-crawled data
• Human annotation makes it difficult to scale up datasets
• The model is trained on large-scale data collected from the web
• The framework learns while also using samples with low-quality text
■ Experiments show the proposed method consistently outperforms the baselines
Supplementary Material
Details of the Loss-weighting Baseline
■ The loss is weighted by the CLIP similarity of each pair
■ The raw similarity tops out at 0.482 even for well-aligned data
• Using raw similarities as weights slows convergence
■ Instead, similarities are min-max normalized to [0, 1] and multiplied by 2
• Mean: about 1
• Min: 0
• Max: 2
Loss weighting. The final baseline is a vanilla captioning model trained with a loss re-weighting strategy to tackle the noise issue. In this baseline, instead of filtering, we use the CLIP similarity score to re-weight the sample-wise loss as
\mathcal{L}_{\text{weighting}} = \frac{1}{N} \sum_{i=1}^{N} s_i \log p(c_i \mid I_i),  (4)
where s_i indicates the cosine similarity computed by CLIP for the i-th image-text pair in a minibatch of N instances. Further details for this baseline are explained in Appendix D.
4.3. Implementation details
We employ a pre-trained CLIP ViT-L/14 for encoding visual features and computing the alignment level z, and use 6 randomly initialized transformer blocks as the caption decoder. Except for the results in Tabs. 1 to 3, we freeze the visual encoder and take a single CLS token as the visual feature due to its efficiency in all ablation and analytical experiments. More details regarding training settings are de[...]
Models | MSCOCO (B@4 / M / C / CS) | nocaps overall (C / CS) | in-domain (C / CS) | near-domain (C / CS) | out-of-domain (C / CS)
Vanilla (Filtering) | 12.81 / 17.30 / 54.66 / 64.84 | 48.96 / 62.70 | 46.06 / 60.74 | 46.33 / 62.35 | 59.50 / 63.92
Bootstrap | 13.51 / 17.46 / 55.13 / 64.31 | 49.46 / 62.16 | 45.23 / 60.60 | 47.14 / 61.62 | 59.93 / 63.62
NoC (z=7) | 15.96 / 19.50 / 62.04 / 66.70 | 54.94 / 64.21 | 51.74 / 62.54 | 53.09 / 63.92 | 63.15 / 65.19
Table 8: Comparison with the Bootstrap baseline. All models are trained on CC3M and evaluated on the MSCOCO and nocaps datasets. B@4, M, C, and CS mean BLEU@4, METEOR, CIDEr, and CLIPScore, respectively. Numbers in bold indicate the best method.
D. Details for the Loss Weighting Baseline
We defined the Loss weighting baseline in Sec. 4.2 as follows:
\mathcal{L}_{\text{weighting}} = \frac{1}{N} \sum_{i=1}^{N} s_i \log p(c_i \mid I_i),  (5)
where s_i indicates the cosine similarity computed by CLIP for the i-th image-text pair in a minibatch of N instances.
Additionally, since the cosine similarities of the training samples range approximately between 0 and 0.5, as shown in Fig. 11, we apply min-max normalization to s and multiply by 2 to equalize the averaged loss scale. This loss re-weighting strategy makes the model update less on mis-aligned samples and more on well-aligned ones.
For a better understanding, we visualize the distribution of cosine similarities of the CC3M training split in Fig. 11. The raw cosine similarities are distributed from -0.037 to 0.482, with a mean of 0.239. If we used the cosine similarities directly for the loss re-weighting defined in Eq. (5), the scale of the overall loss value would become low, which could result in slow convergence of training. Thus, to equalize the averaged loss scale, we re-scale the cosine similarities by applying min-max normalization and multiplying by 2.
[Figure 11: distribution of cosine similarities (a) before re-scaling (min -0.037, max 0.482, mean 0.239) and (b) after re-scaling (min 0.0, max 2.0, mean 1.066).]
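A minimal sketch of this re-weighting (tensor names and shapes are assumptions; the similarity weights multiply per-sample negative log-likelihoods, which is the practical reading of Eq. (5)):

```python
# Sketch of the loss re-weighting baseline: per-sample NLLs are scaled by CLIP
# similarities that are min-max normalized and multiplied by 2 (mean weight ~1).
import torch

def reweighted_loss(nll_per_sample: torch.Tensor, clip_sim: torch.Tensor) -> torch.Tensor:
    s = (clip_sim - clip_sim.min()) / (clip_sim.max() - clip_sim.min())  # to [0, 1]
    weights = 2.0 * s                                                    # to [0, 2]
    return (weights * nll_per_sample).mean()
```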
Performance per Alignment Level
■ Zero-shot captioning results
• With z ≥ 6 the model outperforms the baselines
• With z ≤ 5 performance falls below the baselines
• Reasonable, since those levels are trained on poorly aligned data
• Performance goes back up at z = 1
• The very small number of z = 1 samples (n = 21) is the likely cause
Models | MSCOCO (B@4 / M / C / CS) | nocaps overall (C / CS) | in-domain (C / CS) | near-domain (C / CS) | out-of-domain (C / CS)
Vanilla | 10.31 / 15.48 / 47.56 / 62.89 | 41.58 / 60.49 | 38.60 / 58.64 | 39.24 / 59.91 | 51.22 / 62.15
Vanilla (Filtering) | 12.81 / 17.30 / 54.66 / 64.84 | 48.96 / 62.70 | 46.06 / 60.74 | 46.33 / 62.35 | 59.50 / 63.92
Loss weighting | 11.16 / 16.15 / 50.86 / 63.87 | 43.89 / 61.18 | 39.30 / 59.23 | 41.80 / 60.50 | 53.84 / 63.04
NoC (z=1) | 12.65 / 16.44 / 52.78 / 63.23 | 43.95 / 60.55 | 41.39 / 58.50 | 42.11 / 60.30 | 51.70 / 61.62
NoC (z=2) | 3.77 / 8.21 / 12.83 / 45.24 | 10.26 / 43.43 | 11.96 / 46.29 | 10.76 / 44.46 | 7.45 / 40.60
NoC (z=3) | 3.70 / 9.19 / 17.33 / 49.67 | 15.12 / 48.05 | 16.45 / 49.56 | 15.68 / 48.66 | 12.40 / 46.48
NoC (z=4) | 8.19 / 13.52 / 39.61 / 59.86 | 32.25 / 57.23 | 30.04 / 56.01 | 31.76 / 57.23 | 35.39 / 57.62
NoC (z=5) | 11.88 / 16.28 / 52.11 / 64.01 | 45.68 / 61.72 | 41.25 / 59.57 | 43.74 / 61.18 | 55.05 / 63.23
NoC (z=6) | 14.38 / 18.27 / 58.76 / 65.82 | 51.50 / 63.52 | 48.02 / 61.82 | 49.60 / 63.13 | 60.06 / 63.52
NoC (z=7) | 15.96 / 19.50 / 62.04 / 66.70 | 54.94 / 64.21 | 51.74 / 62.54 | 53.09 / 63.92 | 63.15 / 65.19
NoC (z=8) | 12.82 / 17.46 / 53.50 / 64.94 | 48.05 / 62.64 | 42.44 / 60.16 | 45.86 / 62.20 | 59.08 / 64.21
Performance per Alignment Level
■ Self-retrieval results
• The same trend as in the zero-shot captioning results
[Clipped paper excerpt: the tail of the per-level table above, its caption ("... bin indices from models trained on CC3M. Numbers in bold and underlined indicate the best ..."), and a clipped discussion of the Bootstrap baseline (cf. Table 8).]
Models | MSCOCO (R@1 / R@5 / R@10) | Flickr30k (R@1 / R@5 / R@10)
GT Caption | 34.57 / 59.30 / 69.91 | 63.08 / 86.50 / 92.00
Vanilla | 25.44 / 50.38 / 61.66 | 47.10 / 76.60 / 85.90
Vanilla (Filtering) | 31.64 / 58.90 / 70.36 | 56.50 / 85.50 / 92.50
Loss weighting | 28.78 / 54.44 / 65.44 | 48.00 / 78.90 / 87.50
NoC (z=1) | 25.92 / 49.92 / 61.98 | 44.60 / 77.20 / 86.60
NoC (z=2) | 2.44 / 7.86 / 11.46 | 8.30 / 21.50 / 29.20
NoC (z=3) | 5.02 / 13.12 / 18.98 | 12.70 / 30.90 / 40.00
NoC (z=4) | 17.34 / 38.86 / 50.30 | 32.30 / 66.70 / 75.30
NoC (z=5) | 29.02 / 54.32 / 66.26 | 50.30 / 84.10 / 91.10
NoC (z=6) | 35.26 / 62.78 / 73.88 | 60.00 / 89.90 / 95.40
NoC (z=7) | 40.00 / 66.78 / 77.53 | 65.10 / 92.00 / 96.20
NoC (z=8) | 32.30 / 57.60 / 69.72 | 54.90 / 85.80 / 93.50

More Related Content

Similar to 論文紹介:Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Practical Digital Image Processing 5
Practical Digital Image Processing 5Practical Digital Image Processing 5
Practical Digital Image Processing 5
Aly Abdelkareem
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
Devansh16
 
Block coordinate descent__in_computer_vision
Block coordinate descent__in_computer_visionBlock coordinate descent__in_computer_vision
Block coordinate descent__in_computer_vision
YoussefKitane
 
[Paper] learning video representations from correspondence proposals
[Paper]  learning video representations from correspondence proposals[Paper]  learning video representations from correspondence proposals
[Paper] learning video representations from correspondence proposals
Susang Kim
 
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Alex Conway
 
One dimensional vector based pattern
One dimensional vector based patternOne dimensional vector based pattern
One dimensional vector based pattern
ijcsit
 
Eryk_Kulikowski_a2
Eryk_Kulikowski_a2Eryk_Kulikowski_a2
Eryk_Kulikowski_a2
Eryk Kulikowski
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
gerogepatton
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
IJITE
 
stylegan.pdf
stylegan.pdfstylegan.pdf
stylegan.pdf
NimeshPudasaini1
 
Road Segmentation from satellites images
Road Segmentation from satellites imagesRoad Segmentation from satellites images
Road Segmentation from satellites images
YoussefKitane
 
Neural Style Transfer in practice
Neural Style Transfer in practiceNeural Style Transfer in practice
Neural Style Transfer in practice
MohamedAmineHACHICHA1
 
Neural Style Transfer in Practice
Neural Style Transfer in PracticeNeural Style Transfer in Practice
Neural Style Transfer in Practice
KhalilBergaoui
 
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
csandit
 
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
cscpconf
 
Log polar coordinates
Log polar coordinatesLog polar coordinates
Log polar coordinates
Oğul Göçmen
 
00463517b1e90c1e63000000
00463517b1e90c1e6300000000463517b1e90c1e63000000
00463517b1e90c1e63000000
Ivonne Liu
 
CNN and its applications by ketaki
CNN and its applications by ketakiCNN and its applications by ketaki
CNN and its applications by ketaki
Ketaki Patwari
 
DICTA 2017 poster
DICTA 2017 posterDICTA 2017 poster
DICTA 2017 poster
Ashek Ahmmed
 
Paper_KST
Paper_KSTPaper_KST
Paper_KST
Sarun Maksuanpan
 

Similar to 論文紹介:Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning (20)

Practical Digital Image Processing 5
Practical Digital Image Processing 5Practical Digital Image Processing 5
Practical Digital Image Processing 5
 
A simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representationsA simple framework for contrastive learning of visual representations
A simple framework for contrastive learning of visual representations
 
Block coordinate descent__in_computer_vision
Block coordinate descent__in_computer_visionBlock coordinate descent__in_computer_vision
Block coordinate descent__in_computer_vision
 
[Paper] learning video representations from correspondence proposals
[Paper]  learning video representations from correspondence proposals[Paper]  learning video representations from correspondence proposals
[Paper] learning video representations from correspondence proposals
 
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
 
One dimensional vector based pattern
One dimensional vector based patternOne dimensional vector based pattern
One dimensional vector based pattern
 
Eryk_Kulikowski_a2
Eryk_Kulikowski_a2Eryk_Kulikowski_a2
Eryk_Kulikowski_a2
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
 
stylegan.pdf
stylegan.pdfstylegan.pdf
stylegan.pdf
 
Road Segmentation from satellites images
Road Segmentation from satellites imagesRoad Segmentation from satellites images
Road Segmentation from satellites images
 
Neural Style Transfer in practice
Neural Style Transfer in practiceNeural Style Transfer in practice
Neural Style Transfer in practice
 
Neural Style Transfer in Practice
Neural Style Transfer in PracticeNeural Style Transfer in Practice
Neural Style Transfer in Practice
 
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
Corrosion Detection Using A.I : A Comparison of Standard Computer Vision Tech...
 
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
CORROSION DETECTION USING A.I. : A COMPARISON OF STANDARD COMPUTER VISION TEC...
 
Log polar coordinates
Log polar coordinatesLog polar coordinates
Log polar coordinates
 
00463517b1e90c1e63000000
00463517b1e90c1e6300000000463517b1e90c1e63000000
00463517b1e90c1e63000000
 
CNN and its applications by ketaki
CNN and its applications by ketakiCNN and its applications by ketaki
CNN and its applications by ketaki
 
DICTA 2017 poster
DICTA 2017 posterDICTA 2017 poster
DICTA 2017 poster
 
Paper_KST
Paper_KSTPaper_KST
Paper_KST
 

More from Toru Tamaki

論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
論文紹介:Deep Learning-Based Human Pose Estimation: A Survey論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
Toru Tamaki
 
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
Toru Tamaki
 
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
Toru Tamaki
 
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
Toru Tamaki
 
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
Toru Tamaki
 
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Toru Tamaki
 
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Toru Tamaki
 
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
Toru Tamaki
 
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
Toru Tamaki
 
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
Toru Tamaki
 
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
Toru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
Toru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
Toru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
Toru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
Toru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
Toru Tamaki
 

More from Toru Tamaki (20)

論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
論文紹介:Deep Learning-Based Human Pose Estimation: A Survey論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
論文紹介:Deep Learning-Based Human Pose Estimation: A Survey
 
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
論文紹介:A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, a...
 
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
論文紹介:When Visual Prompt Tuning Meets Source-Free Domain Adaptive Semantic Seg...
 
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
論文紹介:Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
 
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
論文紹介:Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ...
 
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
論文紹介:ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
論文紹介:ArcFace: Additive Angular Margin Loss for Deep Face Recognition
 
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
論文紹介:Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers
 
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
論文紹介:Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Groun...
 
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
論文紹介:Selective Structured State-Spaces for Long-Form Video Understanding
 
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
 

Recently uploaded

The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
HarpalGohil4
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
Ortus Solutions, Corp
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
ScyllaDB
 

Recently uploaded (20)

The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)AWS Certified Solutions Architect Associate (SAA-C03)
AWS Certified Solutions Architect Associate (SAA-C03)
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!Introducing BoxLang : A new JVM language for productivity and modularity!
Introducing BoxLang : A new JVM language for productivity and modularity!
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 

論文紹介:Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

  • 1. Noise-aware Learning from Web- crawled Image-Text Data for Image Captioning Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh, ICCV2023 橋口凌大(名工大) 2023/10/30
  • 2. 概要 nImage Captioningとは • 入力画像に適した説明文を生成させる nウェブから収集した多量のデータ [Changpinyo+, CVPR2021], [Hu+, CVPR2022]で事 前学習を行う • ノイズが多く含まれるという問題がある nノイズを活用したImage Captioningモデルの学習 • データ数を落とさないようにノイズも同時に学習させるフレームワークを提案 this is about what it looked like filtered noise sample 0.171 Figure 2: Examples of web-crawled im a threshold of 0.3, a selected threshold noise samples (left). However, accordin
  • 3. 関連研究 n人間がアノテーションしたキャプションデータでの学習 [Lin+, ECCV2014], [Krishna+, IJCV2017] • データセットサイズをスケールできない(アノテーションが高価) • 学習データが小さいので限られた知識の獲得 nウェブクローリングされたデータセットを用いた学習 [Li+, ECCV2020], [Zhang+, CVPR2021] • データセットをスケールアップできる • データの性質上ノイズが多い nノイズの多いデータを効果的に扱う手法は存在しない
  • 4. 観察 nCLIP [Radford+, ICML2021]を用いたImageとTextの類似度の例 • 類似度が低いものは低品質なものが多い • しかし,類似度は低いが人間が観察する限り有用なデータも存在する n単純なフィルタリングだと除去されてしまうデータも有効に活用した い this is about what it looked like people walk on the sidewalk a girl scrubs her shoes clean in a dish of soapy water ambulances line up outside the international airport terminal filtered sample, but informative filtered noise sample non-filtered sample 0.289 0.171 0.353 0.193
  • 5. 概観 nImage-Textペアのアライメントレベル𝑧を定義 nアライメントレベルを埋め込んで学習 n最上位のアライメントレベルでキャプションを生成 Bucketing Frozen CLIP Image-Text Pairs Captioning Model z z=3 The restaurant is a nod to the late 19th century, with a large glass - walled dining room z=2 the restaurant is a popular place for locals and tourists alike to eat and drink z=1 the view from the dining room (c) Quality-controllable Caption Generation at Infererence Time (a) Alignment Level Assign Captioning Model (b) Alignment-level-conditioned Training Batch Bucket z=3 Less-aligned pairs ( =2) Well-aligned pairs ( =3) Noisy pairs ( =1) z=1 z=2 hes, the noise transition n matrix is added on the mates noisy class poste- e underlying label transi- transition matrix for im- to the numerous words ver, other noise-handling dditional co-trained net- able for our large-scale computational costs. tion, BLIP [24] tackled led data with a data boot- addition to web-crawled ntary clean human anno- the captioning and filter- ontrast, without any clean with the proposed noise- nly web-crawled data. r Image Captioning fully benefit from the rich information of web-crawled data, we aim to make a captioning model robust to noise while using the whole dataset. As illustrated in Fig. 3, we propose our NoC frame- work with an alignment-level-controllable captioner. Dur- ing training, we identify an alignment level l of the given image-text pair and use it as a control signal z to make a captioning model be conditioned on the alignment level, which leads to a new objective as follows: L = log p(c|I, z) = T X t=0 log p(wt+1|wt, I, z). (2) This objective encourages the model to learn the capabil- ity of generating captions of different alignment levels de- pending on a control signal. Then, at the inference time, we can steer the captioning model to generate accurate captions by simply feeding a control signal, meaning a top-level of alignment. With this model design, as discussed in our ex- periments, we can take the following two advantages: (1) it can learn well-aligned data with less distraction by noisy data and (2) it can still take advantage of data containing
  • 6. アライメントレベルの決め方 nCLIP [Radford+, ICML2021]を用いてアライメントレベルを決定 • Image-Textの相関を定量的に表す nCLIP類似度スコア𝑠によって𝐾個のアライメントレベルに変換 • 類似度スコアの高いデータは𝑙 = 𝐾 nバケッティングの種類 • 一様バケッティング • 分位バケッティング web-crawled data; contrastive learning typically drives a model to learn the correlation between images and texts in a data-driven manner, so the similarity scores by CLIP can be used as alignment scores for image-text pairs. Given the CLIP similarity scores, s 2 R, from all train- ing samples, we convert them into K discrete alignment lev- els l 2 {1, ..., K} via a bucketing technique: l = fbucket(s), (3) where fbucket is a bucketing function with K bins of equally spaced boundaries. This bucketing makes more noisy data of low CLIP similarity assigned to a bucket of the lower alignment level (e.g., l = 1) and well-aligned data of high CLIP similarity allocated to a bucket of the higher align- Each model is VirTex-like [1 due to its sim the encoder fi age. Next, we ment level (i. scribed in Se ence time—in then concaten image embed into each cros and captions more detailed <latexit sha1_base64="S0VSKpZYjp21A+soHQsQiqqZTDM=">AAACf3icSyrIySwuMTC4ycjEzMLKxs7BycXNw8vHLyAoFFacX1qUnBqanJ+TXxSRlFicmpOZlxpaklmSkxpRUJSamJuUkxqelO0Mkg8vSy0qzszPCympLEiNzU1Mz8tMy0xOLAEKxQtI5SjEZOYpxFQb6igY6SjE5KTklxTrKHjH1HLFCygb6BmAgQImwxDKUGaAgoB8geUMMQwpDPkMyQylDLkMqQx5DCVAdg5DIkMxEEYzGDIYMBQAxWIZqoFiRUBWJlg+laGWgQuotxSoKhWoIhEomg0k04G8aKhoHpAPMrMYrDsZaEsOEBcBdSowqBpcNVhp8NnghMFqg5cGf3CaVQ02A+SWSiCdBNGbWhDP3yUR/J2grlwgXcKQgdCF180lDGkMFmC3ZgLdXgAWAfkiGaK/rGr652CrINVqNYNFBq+B7l9ocNPgMNAHeWVfkpcGpgbNZgBFgCF6cGMywoz0DM30TAJNlB2coFHBwSDNoMSgAQxvcwYHBg+GAIZQoL0NDMsY1jNsYGJkUmfSYzKAKGVihOoRZkABTJYAzuyRYg==</latexit> l 2 {1, 2, . . . , K}
  • 7. Training with an alignment-level embedding
    ■ The captioning model is modified to take the alignment level as an additional input (see the sketch below)
      • The control signal z is obtained by bucketing the CLIP similarity of each pair
      • A learnable embedding layer maps z to a vector with the same dimensionality as the image-encoder features
      • The keys and values fed to the caption decoder's cross-attention are built from the image-encoder features together with the embedding of z
    [Figure 10: network architecture of the alignment-level-controllable captioner; frozen CLIP similarity → bucketing → control signal → condition embedder, and the image-encoder output plus the condition embedding form the key/value memory of the caption decoder]
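    The conditioning described above can be sketched as follows. This is a minimal PyTorch illustration rather than the released implementation; the layer sizes and the choice to prepend the condition as an extra token are assumptions consistent with the "Concat" fusion discussed in the ablation slide.

```python
# Minimal sketch of the condition embedder: a learnable embedding for z is
# concatenated with the visual tokens, and the result serves as key/value
# memory for the caption decoder's cross-attention.
import torch
import torch.nn as nn

class AlignmentConditionedMemory(nn.Module):
    def __init__(self, num_levels=8, dim=768):
        super().__init__()
        # one learnable vector per alignment level, same width as the image features
        self.cond_embed = nn.Embedding(num_levels + 1, dim)  # levels 1..K (index 0 unused)

    def forward(self, image_feats, z):
        """image_feats: (B, N, D) visual tokens; z: (B,) alignment levels."""
        cond = self.cond_embed(z).unsqueeze(1)                # (B, 1, D)
        # "Concat" fusion: prepend the condition token to the visual tokens
        return torch.cat([cond, image_feats], dim=1)          # (B, N+1, D) -> decoder K/V

memory = AlignmentConditionedMemory()(torch.randn(2, 50, 768), torch.tensor([7, 3]))
print(memory.shape)  # torch.Size([2, 51, 768])
```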
  • 8. Experimental setup
    ■ Tasks (all models are trained on web-crawled data and evaluated without fine-tuning on clean data)
      • Zero-shot captioning: measures descriptiveness, i.e. how well captions align with the given images
      • Self-retrieval: measures distinctiveness, i.e. how uniquely captions describe their images
    ■ Metrics
      • Zero-shot captioning: BLEU@4 [Papineni+, ACL2002], METEOR [Banerjee & Lavie, ACL Workshop 2005], CIDEr [Vedantam+, CVPR2015], SPICE [Anderson+, ECCV2016], CLIPScore [Hessel+, EMNLP2021]
      • Self-retrieval: CLIPScore and Recall@K
    ■ Datasets
      • Training: Conceptual Captions 3M (CC3M) [Sharma+, ACL2018]
      • Evaluation: MSCOCO [Lin+, ECCV2014], nocaps [Agrawal+, ICCV2019], Flickr30k [Plummer+, ICCV2015]
    ■ Baselines (same transformer captioner)
      • Vanilla: trained on all pairs as-is
      • Vanilla (Filtering): trained only on pairs whose CLIP ViT-B/32 similarity exceeds 0.3, following [42]; 0.3 gave the best performance among thresholds from 0.1 to 0.35 in steps of 0.05
      • Loss weighting: sample-wise loss re-weighted by the CLIP similarity, $\mathcal{L}_{\mathrm{weighting}} = \frac{1}{N}\sum_{i=1}^{N} s_i \log p(c_i \mid I_i)$ (Eq. 4), see the sketch below
    [Implementation: pre-trained CLIP ViT-L/14 visual encoder and a randomly initialized 6-layer transformer decoder; in the ablation and analysis experiments the encoder is frozen and a single CLS token is used as the visual feature]
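    A minimal sketch of the loss-weighting baseline (Eq. 4), assuming next-token logits and target ids are already available; the masking details and the sign convention are my own choices.

```python
# Minimal sketch of CLIP-similarity loss re-weighting: the per-sample caption
# loss is scaled by the image-text similarity so poorly aligned pairs
# contribute less to the update.
import torch
import torch.nn.functional as F

def clip_weighted_caption_loss(logits, targets, clip_sims, pad_id=0):
    """logits: (B, T, V) next-token logits; targets: (B, T) token ids;
    clip_sims: (B,) CLIP cosine similarities (optionally re-scaled, see slide 18)."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )                                                     # (B, T)
    mask = (targets != pad_id).float()
    per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (clip_sims * per_sample).mean()                # Eq. (4), up to the sign convention
```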
  • 9. Comparison with baselines (zero-shot captioning)
    ■ Vanilla performs worst
      • It is affected most strongly by the noise
    ■ Filtering by CLIP cosine similarity brings a large improvement
      • The similarity filter removes much of the noise
    ■ The proposed method is the best on every metric
    [Table 1: zero-shot captioning on MSCOCO and nocaps; all models trained on CC3M without fine-tuning; B@4/M/C/CS = BLEU@4/METEOR/CIDEr/CLIPScore]
    Model | MSCOCO B@4/M/C/CS | nocaps overall C/CS | in-domain C/CS | near-domain C/CS | out-of-domain C/CS
    Vanilla | 10.31/15.48/47.56/62.89 | 41.58/60.49 | 38.60/58.64 | 39.24/59.91 | 51.22/62.15
    Vanilla (Filtering) | 12.81/17.30/54.66/64.84 | 48.96/62.70 | 46.06/60.74 | 46.33/62.35 | 59.50/63.92
    Loss weighting | 11.16/16.15/50.86/63.87 | 43.89/61.18 | 39.30/59.23 | 41.80/60.50 | 53.84/63.04
    NoC (z=7) | 15.96/19.50/62.04/66.70 | 54.94/64.21 | 51.74/62.54 | 53.09/63.92 | 63.15/65.19
  • 10. Comparison with prior work (zero-shot captioning)
    ■ The proposed method reaches strong performance with far fewer parameters than competing models
    [Table 2: zero-shot MSCOCO captioning; CCS = CC3M + CC12M [6] + SBU [35]; † marks methods that explicitly handle noisy samples]
    Method | Visual Encoder | Text Decoder | Data | B@4/METEOR/CIDEr/SPICE
    (Inference-time optimization or unpaired training)
    ZeroCap [44] | CLIP ViT-B/32 | GPT2 (345M) [39] | - | 2.60/11.50/14.60/5.50
    Socratic Models [56] | CLIP ViT-L/14 | GPT3 (175B) [4] | - | 10.00/16.20/50.10/10.80
    DeCAP [25] | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M-text | 8.80/16.00/42.10/10.90
    (Supervised training with image-text paired data)
    Re-ViLM [51] | CLIP ViT-L/14 | RETRO (410M) [3] | CCS + COYO [5] | 17.90/-/53.60/-
    SimVLM 1.4B [49] | - | - | ALIGN 1.8B [20] | 11.20/14.70/32.20/8.50
    NoC (z=7)† | CLIP ViT-B/32 | Transformer 4-layer (76.5M) | CC3M | 14.10/18.12/48.66/12.57
    NoC (z=7)† | CLIP ViT-L/14 | Transformer 6-layer (94.5M) | CC3M | 15.96/19.50/62.04/14.37
  • 11. Comparison with baselines (self-retrieval)
    ■ Outperforms every baseline; Recall@K is computed by retrieving each image with its own generated caption (see the sketch below)
    ■ The proposed method even surpasses the ground-truth (GT) captions
      • Human annotations are not written consistently enough to be distinctive for each image
    [Table 3: self-retrieval on MSCOCO and Flickr30k]
    Model | MSCOCO R@1/R@5/R@10 | Flickr30k R@1/R@5/R@10
    GT Caption | 34.57/59.30/69.91 | 63.08/86.50/92.00
    Vanilla | 25.44/50.38/61.66 | 47.10/76.60/85.90
    Vanilla (Filtering) | 31.64/58.90/70.36 | 56.50/85.50/92.50
    Loss weighting | 28.78/54.44/65.44 | 48.00/78.90/87.50
    NoC (z=7) | 40.00/66.78/77.53 | 65.10/92.00/96.20
    [Figure 8: examples of generated captions; as the control signal z is raised through higher alignment levels (3 → 5 → 7), the captions become more descriptive and distinct, capturing fine details of the images]
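    The self-retrieval metric above can be sketched as follows, assuming CLIP embeddings of the test images and of their generated captions are precomputed and L2-normalized; this is my own illustration of the protocol, not the authors' evaluation code.

```python
# Minimal sketch of self-retrieval Recall@K: each generated caption queries the
# pool of test images with CLIP similarity; a hit means its own image ranks in the top K.
import torch

def self_retrieval_recall(image_embs, caption_embs, ks=(1, 5, 10)):
    """image_embs, caption_embs: (N, D) L2-normalized CLIP embeddings;
    row i of caption_embs is the caption generated for image i."""
    sims = caption_embs @ image_embs.T                 # (N, N) caption-to-image similarities
    correct = sims.diag().unsqueeze(1)                 # similarity to the caption's own image
    rank = (sims > correct).sum(dim=1)                 # number of images scoring strictly higher
    return {f"R@{k}": (rank < k).float().mean().item() for k in ks}
```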
  • 12. Effect of fine-tuning
    ■ Models pre-trained on CC3M are fine-tuned on MSCOCO
    ■ The proposed method remains the stronger model
      • Its noise robustness carries over after fine-tuning
      • Even human-annotated data is not consistently annotated (see the CLIP-score distribution below)
      • The proposed method is therefore still effective
    [Table 4: performance on MSCOCO and nocaps after fine-tuning on MSCOCO]
    Method | MSCOCO test split CIDEr/SPICE | nocaps overall CIDEr/SPICE
    Vanilla (Filtering) | 126.14/22.37 | 87.03/12.52
    NoC | 129.09/23.23 | 93.33/13.40
    [Figure 5: distribution of CLIP scores on the MSCOCO dataset, with examples at different alignment scores]
  • 13. Performance vs. dataset size
    ■ COYO dataset
      • 700M web-crawled image-text pairs
    ■ Subsets of various sizes are sampled for training
      • 3M, 10M, 23M, 125M
    ■ The Vanilla model's performance saturates as the data grows
    ■ Vanilla (Filtering) keeps improving with more data
    ■ The proposed method is consistently the best across all dataset sizes
  • 14. Why is the method robust to noise?
    ■ Train for many epochs and analyze noise through memorization of the training data (see the EM sketch below)
    ■ Training data
      • CC3M train
    ■ Evaluation data (two groups taken from CC3M train)
      • Noisy group: CLIP similarity 0 to 0.15
      • Well-aligned group: CLIP similarity 0.35 and above
    ■ NoC with z = 7
      • Learns mainly from the high-similarity data
      • Few exact matches (EM, Exact Matching: the generated caption equals the training caption) on the noisy group, many on the well-aligned group
      • Hence it is barely affected by the noisy data
    ■ Vanilla model
      • Memorizing the noisy data hinders learning from the high-similarity data
    [Table 5: memorization on CC3M training samples of two noise levels; higher EM/B@4/CIDEr on the noisy group means over-memorization of noisy data, higher CS (CLIPScore) means better alignment with the images]
    Method | Noisy group EM/B@4/CIDEr/CS | Well-aligned group EM/B@4/CIDEr/CS
    Vanilla | 5350/12.63/122.24/52.49 | 8092/43.46/407.10/82.67
    NoC (z=3) | 10527/20.41/210.83/42.87 | 225/7.43/53.33/56.35
    NoC (z=5) | 939/5.83/48.47/55.86 | 4882/33.32/304.18/77.69
    NoC (z=7) | 47/2.10/18.28/59.57 | 10562/52.78/504.49/86.57
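    The exact-matching (EM) count used in this analysis can be sketched as a simple string comparison between generated and training captions; the light normalization below is my own choice, and the paper may count matches differently.

```python
# Minimal sketch of the EM (Exact Matching) statistic: how many generated
# captions reproduce their training caption verbatim (after whitespace and
# case normalization, which is an assumption).
def exact_match_count(generated, references):
    """generated, references: lists of caption strings for the same samples."""
    norm = lambda c: " ".join(c.lower().strip().split())
    return sum(norm(g) == norm(r) for g, r in zip(generated, references))

# A high EM count on the noisy group signals over-memorization of noisy
# captions; a high count on the well-aligned group signals the model is
# fitting the informative captions instead.
```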
  • 15. Ablation study
    ■ Ablations on the bucketing strategy and on how the alignment-level embedding is fused (see the sketch below)
    ■ Bucketing
      • Uniform bucketing with 8 bins performs well and is the selected option
    ■ Alignment-level (condition) fusion
      • Concat performs best overall and is selected
      • Sum and MLP behave like a smoothing operation over the visual features
      • Concat best preserves the meaning of the alignment level as a condition
    [Table 6: ablations on zero-shot MSCOCO captioning after training on CC3M (CIDEr / CLIPScore); the selected options are Uniform, Concat, and 8 bins]
    (a) Bucketing strategy: Quantile 49.22/65.72, Uniform 49.18/66.65
    (b) Control fusion method: Sum 48.99/65.19, MLP 48.70/66.94, Concat 49.18/66.65
    (c) Number of bucket bins: 4 bins 49.28/65.67, 8 bins 49.18/66.65, 16 bins 48.58/65.67
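    For reference, the three fusion options compared in Table 6(b) can be sketched as follows; shapes and layer sizes are illustrative only and do not reproduce the authors' exact modules.

```python
# Minimal sketch contrasting the three fusion options: Sum and MLP merge the
# condition embedding into every visual token, whereas Concat keeps it as a
# separate token in the decoder's key/value memory.
import torch
import torch.nn as nn

dim = 768
img = torch.randn(2, 50, dim)                      # visual tokens
cond = nn.Embedding(9, dim)(torch.tensor([7, 3]))  # condition embedding for z

fused_sum = img + cond.unsqueeze(1)                                   # Sum
fused_mlp = nn.Linear(2 * dim, dim)(
    torch.cat([img, cond.unsqueeze(1).expand_as(img)], dim=-1))       # MLP over the pair
fused_cat = torch.cat([cond.unsqueeze(1), img], dim=1)                # Concat (selected)
print(fused_sum.shape, fused_mlp.shape, fused_cat.shape)
# torch.Size([2, 50, 768]) torch.Size([2, 50, 768]) torch.Size([2, 51, 768])
```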
  • 18. Details of the loss-weighting baseline
    ■ The loss is computed with per-sample weights given by the CLIP similarity: $\mathcal{L}_{\mathrm{weighting}} = \frac{1}{N}\sum_{i=1}^{N} s_i \log p(c_i \mid I_i)$ (Eq. 5)
    ■ Raw cosine similarities on CC3M range only from -0.037 to 0.482 (mean 0.239)
      • Using them directly shrinks the overall loss scale and slows convergence
    ■ Therefore the similarities are min-max normalized and multiplied by 2 before weighting (see the sketch below)
      • Minimum: 0, Maximum: 2, Mean: about 1 (1.066), so the averaged loss scale is roughly unchanged
    [Figure 11: distribution of CC3M cosine similarities before re-scaling (min -0.037, max 0.482, mean 0.239) and after re-scaling (min 0.0, max 2.0, mean 1.066)]
    [Table 8 (appendix): a Bootstrap baseline improves only slightly over Vanilla (Filtering) on MSCOCO/nocaps (e.g. MSCOCO CIDEr 55.13 vs. 54.66) and remains well below NoC (z=7), 62.04]
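    A minimal sketch of the re-scaling described on this slide: min-max normalize the CLIP similarities and multiply by 2 so that the mean weight is close to 1.

```python
# Minimal sketch of the similarity re-scaling used for the loss-weighting
# baseline: map raw cosine similarities into [0, 2] before using them as weights.
import numpy as np

def rescale_similarities(sims):
    sims = np.asarray(sims, dtype=np.float64)
    normalized = (sims - sims.min()) / (sims.max() - sims.min())  # into [0, 1]
    return 2.0 * normalized                                       # into [0, 2]

weights = rescale_similarities([-0.037, 0.15, 0.239, 0.35, 0.482])
print(weights.min(), weights.max(), weights.round(3))  # 0.0 2.0 [...]
```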
  • 19. Performance per alignment level (zero-shot captioning)
    ■ Zero-shot captioning results
      • With z >= 6 the model outperforms the baselines
      • With z <= 5 it falls below the baselines
        • A reasonable outcome, since those conditions are trained on low-alignment data
      • Performance recovers at z = 1
        • Attributed to the very small number of z = 1 samples (n = 21)
    [Zero-shot captioning per control signal, all models trained on CC3M; columns as in Table 1 (B@4/M/C/CS = BLEU@4/METEOR/CIDEr/CLIPScore)]
    Model | MSCOCO B@4/M/C/CS | nocaps overall C/CS | in-domain C/CS | near-domain C/CS | out-of-domain C/CS
    Vanilla | 10.31/15.48/47.56/62.89 | 41.58/60.49 | 38.60/58.64 | 39.24/59.91 | 51.22/62.15
    Vanilla (Filtering) | 12.81/17.30/54.66/64.84 | 48.96/62.70 | 46.06/60.74 | 46.33/62.35 | 59.50/63.92
    Loss weighting | 11.16/16.15/50.86/63.87 | 43.89/61.18 | 39.30/59.23 | 41.80/60.50 | 53.84/63.04
    NoC (z=1) | 12.65/16.44/52.78/63.23 | 43.95/60.55 | 41.39/58.50 | 42.11/60.30 | 51.70/61.62
    NoC (z=2) | 3.77/8.21/12.83/45.24 | 10.26/43.43 | 11.96/46.29 | 10.76/44.46 | 7.45/40.60
    NoC (z=3) | 3.70/9.19/17.33/49.67 | 15.12/48.05 | 16.45/49.56 | 15.68/48.66 | 12.40/46.48
    NoC (z=4) | 8.19/13.52/39.61/59.86 | 32.25/57.23 | 30.04/56.01 | 31.76/57.23 | 35.39/57.62
    NoC (z=5) | 11.88/16.28/52.11/64.01 | 45.68/61.72 | 41.25/59.57 | 43.74/61.18 | 55.05/63.23
    NoC (z=6) | 14.38/18.27/58.76/65.82 | 51.50/63.52 | 48.02/61.82 | 49.60/63.13 | 60.06/63.52
    NoC (z=7) | 15.96/19.50/62.04/66.70 | 54.94/64.21 | 51.74/62.54 | 53.09/63.92 | 63.15/65.19
    NoC (z=8) | 12.82/17.46/53.50/64.94 | 48.05/62.64 | 42.44/60.16 | 45.86/62.20 | 59.08/64.21
  • 20. Performance per alignment level (self-retrieval)
    ■ Self-retrieval results
      • The same trend as the zero-shot captioning results
    [Self-retrieval per control signal on MSCOCO and Flickr30k]
    Model | MSCOCO R@1/R@5/R@10 | Flickr30k R@1/R@5/R@10
    GT Caption | 34.57/59.30/69.91 | 63.08/86.50/92.00
    Vanilla | 25.44/50.38/61.66 | 47.10/76.60/85.90
    Vanilla (Filtering) | 31.64/58.90/70.36 | 56.50/85.50/92.50
    Loss weighting | 28.78/54.44/65.44 | 48.00/78.90/87.50
    NoC (z=1) | 25.92/49.92/61.98 | 44.60/77.20/86.60
    NoC (z=2) | 2.44/7.86/11.46 | 8.30/21.50/29.20
    NoC (z=3) | 5.02/13.12/18.98 | 12.70/30.90/40.00
    NoC (z=4) | 17.34/38.86/50.30 | 32.30/66.70/75.30
    NoC (z=5) | 29.02/54.32/66.26 | 50.30/84.10/91.10
    NoC (z=6) | 35.26/62.78/73.88 | 60.00/89.90/95.40
    NoC (z=7) | 40.00/66.78/77.53 | 65.10/92.00/96.20
    NoC (z=8) | 32.30/57.60/69.72 | 54.90/85.80/93.50