論文紹介：Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Noise-aware Learning from Web-
crawled Image-Text Data for Image
Captioning
Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh, ICCV2023
橋口凌大（名工大）
2023/10/30

概要
nImage Captioningとは
• 入力画像に適した説明文を生成させる
nウェブから収集した多量のデータ [Changpinyo+, CVPR2021], [Hu+, CVPR2022]で事
前学習を行う
• ノイズが多く含まれるという問題がある
nノイズを活用したImage Captioningモデルの学習
• データ数を落とさないようにノイズも同時に学習させるフレームワークを提案
this is about
what it looked like
filtered noise sample
0.171
Figure 2: Examples of web-crawled im
a threshold of 0.3, a selected threshold
noise samples (left). However, accordin

関連研究
n人間がアノテーションしたキャプションデータでの学習 [Lin+, ECCV2014],
[Krishna+, IJCV2017]
• データセットサイズをスケールできない（アノテーションが高価）
• 学習データが小さいので限られた知識の獲得
nウェブクローリングされたデータセットを用いた学習 [Li+, ECCV2020],
[Zhang+, CVPR2021]
• データセットをスケールアップできる
• データの性質上ノイズが多い
nノイズの多いデータを効果的に扱う手法は存在しない

観察
nCLIP [Radford+, ICML2021]を用いたImageとTextの類似度の例
• 類似度が低いものは低品質なものが多い
• しかし，類似度は低いが人間が観察する限り有用なデータも存在する
n単純なフィルタリングだと除去されてしまうデータも有効に活用した
い
this is about
what it looked like
people walk
on the sidewalk
a girl scrubs her shoes clean
in a dish of soapy water
ambulances line up outside the
international airport terminal
filtered sample, but informative
filtered noise sample non-filtered sample
0.289
0.171 0.353
0.193

概観
nImage-Textペアのアライメントレベル𝑧を定義
nアライメントレベルを埋め込んで学習
n最上位のアライメントレベルでキャプションを生成
Bucketing
Frozen CLIP
Image-Text Pairs
Captioning Model
z
z=3
The restaurant is a nod to the late 19th century,
with a large glass - walled dining room
z=2
the restaurant is a popular place for locals and
tourists alike to eat and drink
z=1
the view from the dining room
(c) Quality-controllable Caption Generation at Infererence Time
(a) Alignment Level Assign
Captioning Model
(b) Alignment-level-conditioned Training
Batch
Bucket
z=3
Less-aligned
pairs
( =2)
Well-aligned
pairs
( =3)
Noisy pairs
( =1)
z=1 z=2
hes, the noise transition
n matrix is added on the
mates noisy class poste-
e underlying label transi-
transition matrix for im-
to the numerous words
ver, other noise-handling
dditional co-trained net-
able for our large-scale
computational costs.
tion, BLIP [24] tackled
led data with a data boot-
addition to web-crawled
ntary clean human anno-
the captioning and filter-
ontrast, without any clean
with the proposed noise-
nly web-crawled data.
r Image Captioning
fully benefit from the rich information of web-crawled data,
we aim to make a captioning model robust to noise while
using the whole dataset.
As illustrated in Fig. 3, we propose our NoC frame-
work with an alignment-level-controllable captioner. Dur-
ing training, we identify an alignment level l of the given
image-text pair and use it as a control signal z to make
a captioning model be conditioned on the alignment level,
which leads to a new objective as follows:
L = log p(c|I, z) =
T
X
t=0
log p(wt+1|wt, I, z). (2)
This objective encourages the model to learn the capabil-
ity of generating captions of different alignment levels de-
pending on a control signal. Then, at the inference time, we
can steer the captioning model to generate accurate captions
by simply feeding a control signal, meaning a top-level of
alignment. With this model design, as discussed in our ex-
periments, we can take the following two advantages: (1)
it can learn well-aligned data with less distraction by noisy
data and (2) it can still take advantage of data containing

アライメントレベルの決め方
nCLIP [Radford+, ICML2021]を用いてアライメントレベルを決定
• Image-Textの相関を定量的に表す
nCLIP類似度スコア𝑠によって𝐾個のアライメントレベルに変換
• 類似度スコアの高いデータは𝑙 = 𝐾
nバケッティングの種類
• 一様バケッティング
• 分位バケッティング
web-crawled data; contrastive learning typically drives a
model to learn the correlation between images and texts in
a data-driven manner, so the similarity scores by CLIP can
be used as alignment scores for image-text pairs.
Given the CLIP similarity scores, s 2 R, from all train-
ing samples, we convert them into K discrete alignment lev-
els l 2 {1, ..., K} via a bucketing technique:
l = fbucket(s), (3)
where fbucket is a bucketing function with K bins of equally
spaced boundaries. This bucketing makes more noisy data
of low CLIP similarity assigned to a bucket of the lower
alignment level (e.g., l = 1) and well-aligned data of high
CLIP similarity allocated to a bucket of the higher align-
Each model is
VirTex-like [1
due to its sim
the encoder fi
age. Next, we
ment level (i.
scribed in Se
ence time—in
then concaten
image embed
into each cros
and captions
more detailed
<latexit sha1_base64="S0VSKpZYjp21A+soHQsQiqqZTDM=">AAACf3icSyrIySwuMTC4ycjEzMLKxs7BycXNw8vHLyAoFFacX1qUnBqanJ+TXxSRlFicmpOZlxpaklmSkxpRUJSamJuUkxqelO0Mkg8vSy0qzszPCympLEiNzU1Mz8tMy0xOLAEKxQtI5SjEZOYpxFQb6igY6SjE5KTklxTrKHjH1HLFCygb6BmAgQImwxDKUGaAgoB8geUMMQwpDPkMyQylDLkMqQx5DCVAdg5DIkMxEEYzGDIYMBQAxWIZqoFiRUBWJlg+laGWgQuotxSoKhWoIhEomg0k04G8aKhoHpAPMrMYrDsZaEsOEBcBdSowqBpcNVhp8NnghMFqg5cGf3CaVQ02A+SWSiCdBNGbWhDP3yUR/J2grlwgXcKQgdCF180lDGkMFmC3ZgLdXgAWAfkiGaK/rGr652CrINVqNYNFBq+B7l9ocNPgMNAHeWVfkpcGpgbNZgBFgCF6cGMywoz0DM30TAJNlB2coFHBwSDNoMSgAQxvcwYHBg+GAIZQoL0NDMsY1jNsYGJkUmfSYzKAKGVihOoRZkABTJYAzuyRYg==</latexit>
l 2 {1, 2, . . . , K}

アライメントレベルを埋め込んで学習
nアライメントレベルを受け取るよ
うに修正
• CLIPから出力された類似度から
Bucketingにより𝑧を生成
• 𝑧を学習可能な埋め込み層によりImage
Encoderと同じ次元のベクトルを獲得
• Image Encoderとzから生成した埋め込
みベクトルからDecoderに入力する
Key, Valueを生成
ndix
aining Details for All Experiments
provide detailed explanations for experimental set-
f all our reported results in the main paper. We have
ted our experiments under two different settings: 1)
ble-Dense, 2) Frozen-CLS. The main difference be-
the two configurations depends on the way of uti-
he visual encoder. Regarding the text decoder, we
the same decoder model, which consists of a 6-layer
rmer network initialized randomly for both settings.
the main experiments i.e., Tabs. 1 to 3, we use the
ble-Dense setting where the visual encoder, i.e., pre-
CLIP ViT-L/14, is tunned during the training phase
“The people … sitting <EOS>”
“The people are all
in a restaurant some
are sitting”
key, value
Image
Encoder
Caption Decoder
Bucketing
Condition Embedder
Frozen CLIP
“<BOS> The people are all in
a restaurant some are sitting”
CLIP
Similarity
Control signal
Figure 10: The network architecture of our alignment-lev

実験設定
nタスク
• Zero-shot captioning
• Self-retrival
n評価指標
• Zero-shot captioning
• BLEU@4 [Papineni+, ACL2002]
• METEOR [Banerjee & Lavie, ACL
Workshop2005]
• CIDEr [Vedantam+, CVPR2015]
• SPCIE [Fernando+, ECCV2016]
• CLIP Score [Hessel+, EMNLP2021]
• Self-retrival
• CLIP Score, Recall
nデータセット
• 学習
• Conceptual Captions 3M (CC3M)
[Sharma+, ACL2018]
• 評価
• MSCOCO [Lin+, ECCV2014]
• Nocaps [Agrawal+, ICCV2019]
• Flickr30k [Plummer+, ICCV2015]
nベースライン (transformer)
• Vanila
• Vanila (Filterring)
• CLIP similalityによる重み付け
z = K) is specialized to a bucket of highly aligned image-
text pairs (like filtering), while parameter sharing between
experts allows sharing of learned knowledge for rich visual
concepts across all experts (e.g., our model with z  K).
4. Experiment
4.1. Evaluation Setting
Recall that our goal is to learn a captioning model from
the web-crawled data, including inherent noises; therefore,
we evaluate the quality of models without fine-tuning on
clean data to check how effectively to tackle noisy data. To
assess the quality of models, we consider two aspects of
generated captions: descriptiveness (i.e., how well aligned
with given images) and distinctiveness (i.e., how well de-
scribe images with their unique aspects). To be specific, we
the noise issue. Following the previous convention [42], we
calculate cosine similarity for each pair of (image, text) us-
ing a pre-trained CLIP ViT-B/32 model [38], then leave
pairs having similarity larger than 0.3 as training data. The
threshold of 0.3 is selected the following [42], which pro-
vides the best performance among thresholds from 0.1 to
0.35 with steps of 0.05 as presented in Fig. 4.
Loss weighting. The final baseline is a vanilla caption-
ing model trained with a loss re-weighting strategy to tackle
the noise issue. In this baseline, instead of filtering, we use
CLIP similarity score to re-weight sample-wise loss as
Lweighting =
1
N
N
X
i=1
si log p(ci|Ii), (4)

ベースラインとの比較 (Zero-shot Captioning)
nVanilaは一番性能が低い
• ノイズによる影響が大きい
nコサイン類似度によるフィルタリングによる性能向上が大きい
• 類似度のフィルタリングによってノイズが除去される
nすべての指標において提案手法の性能が高い
Models
MSCOCO
nocaps
overall in-domain near-domain out-of-domain
B@4 M C CS C CS C CS C CS C CS
Vanilla 10.31 15.48 47.56 62.89 41.58 60.49 38.60 58.64 39.24 59.91 51.22 62.15
Vanilla (Filtering) 12.81 17.30 54.66 64.84 48.96 62.70 46.06 60.74 46.33 62.35 59.50 63.92
Loss weighting 11.16 16.15 50.86 63.87 43.89 61.18 39.30 59.23 41.80 60.50 53.84 63.04
NoC (z=7) 15.96 19.50 62.04 66.70 54.94 64.21 51.74 62.54 53.09 63.92 63.15 65.19

従来手法との比較 (Zero-shot Captioning)
n提案手法は少ないパラメータで高性能
Vanilla 10.31 15.48 47.56 62.89 41.58 60.49 38.60 58.64 39.24 59.91 51.22 62.15
Loss weighting 11.16 16.15 50.86 63.87 43.89 61.18 39.30 59.23 41.80 60.50 53.84 63.04
NoC (z=7) 15.96 19.50 62.04 66.70 54.94 64.21 51.74 62.54 53.09 63.92 63.15 65.19
Table 1: Caption generation performance comparison with baselines on MSCOCO and nocaps datasets where all models are trained
with CC3M without fine-tuning on the target dataset. B@4, M, C, and CS mean BLEU@4, METEOR, CIDEr, and CLIPScore metrics,
respectively. Numbers in bold indicates the best method.
Method Visual Encoder Text Decoder Data
MSCOCO
BLEU@4 METEOR CIDEr SPICE
Inference time optimization or un-paired training
ZeroCap [44] CLIP ViT-B/32 GPT2 (345M) [39] - 2.60 11.50 14.60 5.50
Socratic Models [56] CLIP ViT-L/14 GPT3 (175B) [4] - 10.00 16.20 50.10 10.80
DeCAP [25] CLIP ViT-B/32 Transformer4-layer (76.5M) CC3M-text 8.80 16.00 42.10 10.90
Supervised training with image-text paired data
Re-ViLM [51] CLIP ViT-L/14 RETRO (410M) [3] CCS + COYO [5] 17.90 - 53.60 -
SimVLM1.4B [49] - - ALIGN 1.8B [20] 11.20 14.70 32.20 8.50
NoC (z=7)†
CLIP ViT-B/32 Transformer4-layer (76.5M) CC3M 14.10 18.12 48.66 12.57
NoC (z=7)†
CLIP ViT-L/14 Transformer6-layer (94.5M) CC3M 15.96 19.50 62.04 14.37
Table 2: Comparison with other models reporting the zero-shot captioning scores. The CCS is a combination of CC3M, CC12M [6], and
SBU [35] datasets. † indicates a method that explicitly handles the problem of noisy samples. Numbers in bold indicate the best method.
which indicates learning a model using well-aligned cap-
tions is important for descriptive caption generation. On the
other hand, our model significantly outperforms all base-
lines on both MSCOCO and nocaps datasets. Especially, the
notable gain on the nocaps, which covers more diverse vi-
sual concepts, implies the importance of noise-aware learn-
ing across the entire dataset for acquiring a broader range of
visual knowledge compared to learning from filtered data.
Models
MSCOCO Flickr30k
R@1 R@5 R@10 R@1 R@5 R@10
GT Caption 34.57 59.30 69.91 63.08 86.50 92.00
Vanilla 25.44 50.38 61.66 47.10 76.60 85.90
Vanilla (Filtering) 31.64 58.90 70.36 56.50 85.50 92.50
Loss weighting 28.78 54.44 65.44 48.00 78.90 87.50
NoC (z=7) 40.00 66.78 77.53 65.10 92.00 96.20
Table 3: Comparison of self-retrieval capability on MSCOCO and
Cattle graze in a field.
Cattle grazing in a field.
Person and I had a lot of fun.
A desk full of computers.
A desk in the office.
The desk of people, person.
Vanilla
Vanilla
(Filtering)
Ours(z=3)
(z=5)
(z=7)
Vanilla
Vanilla
(Filtering)
Ours(z=3)
(z=5)
(z=7)
Cattle graze in a pasture with storm
cloud overhead.
Cows grazing in a field. A desk full of computer.
A photo of a cluttered desk with
computers, keyboard and mouse.
Figure 8: Examples of generated captions. Compared to the baselines, our model can generate captions of different quality by adjusting

ベースラインとの比較 (Self-retrival)
nすべてのベースラインと
比較して高性能
n提案手法はGT Captionよ
り高性能
• 人間によるアノテーション
が統一的ではない
NoC (z=7)†
CLIP ViT-B/32 Transformer4-layer (76.5M) CC3M 14.10 18.12 48.66 12.57
NoC (z=7)†
CLIP ViT-L/14 Transformer6-layer (94.5M) CC3M 15.96 19.50 62.04 14.37
e 2: Comparison with other models reporting the zero-shot captioning scores. The CCS is a combination of CC3M, CC12M [6], and
[35] datasets. † indicates a method that explicitly handles the problem of noisy samples. Numbers in bold indicate the best method.
ch indicates learning a model using well-aligned cap-
s is important for descriptive caption generation. On the
r hand, our model significantly outperforms all base-
s on both MSCOCO and nocaps datasets. Especially, the
ble gain on the nocaps, which covers more diverse vi-
concepts, implies the importance of noise-aware learn-
across the entire dataset for acquiring a broader range of
al knowledge compared to learning from filtered data.
2 Comparison with other concurrent works
le the primary goal of our experiments is to mea-
the noise-robustness, we provide additional compar-
s with other works that report zero-shot performance
he MSCOCO to better contextualize the effectiveness
our work in Tab. 2. The reported works are divided
two groups: 1) captioning with inference time opti-
ation [44, 56] or leveraging an unpaired training strat-
[25], and 2) directly training entire networks [49] or
Models
MSCOCO Flickr30k
R@1 R@5 R@10 R@1 R@5 R@10
GT Caption 34.57 59.30 69.91 63.08 86.50 92.00
Vanilla 25.44 50.38 61.66 47.10 76.60 85.90
Loss weighting 28.78 54.44 65.44 48.00 78.90 87.50
NoC (z=7) 40.00 66.78 77.53 65.10 92.00 96.20
Table 3: Comparison of self-retrieval capability on MSCOCO and
Flickr30k datasets. Numbers in bold indicate the best method.
the n-grams in retrieved captions by their retrieval augmen-
tation technique at inference time. While the much higher
CIDEr score of our model compared to Re-ViLM indicates
NoC generates captions with more diverse expressions con-
sidering the TF-IDF weighting of the CIDEr.
4.5. Self-retrieval Task
We compare self-retrieval capability to measure how
well each algorithm distinctively describes given images.
Cattle graze in a field.
Cattle grazing in a field.
Person and I had a lot of fun.
A desk full of computers.
A desk in the office.
The desk of people, person.
Vanilla
Vanilla
(Filtering)
Ours(z=3)
(z=5)
(z=7)
Vanilla
Vanilla
(Filtering)
Ours(z=3)
(z=5)
(z=7)
Cattle graze in a pasture with storm
cloud overhead.
Cows grazing in a field. A desk full of computer.
A photo of a cluttered desk with
computers, keyboard and mouse.
Figure 8: Examples of generated captions. Compared to the baselines, our model can generate captions of different quality by adjusting
a control signal (z); as we feed z meaning higher alignment levels (3 ! 5 ! 7), captions become more descriptive and distinct with
expressions (highlighted in red) capturing fine details from images.

ファインチューニングの効果
nCC3Mで事前学習したモデルをMSCOCOでファインチューニング
n提案手法が高性能
• 提案手法のノイズ耐性が高い
• 人間がアノテーションしたデータ
も一貫したアノテーションに
なっていない
• 提案手法が有効である
Table 4: Performances on MSCOCO and nocaps after fine-
tuning on MSCOCO.
Method
MSCOCO test split nocaps (overall)
CIDEr SPICE CIDEr SPICE
Vanilla (Filtering) 126.14 22.37 87.03 12.52
NoC 129.09 23.23 93.33 13.40
Figure 5: Distribution of CLIP scores on MSCOCO dataset and
examples at different alignment scores. Best viewed in zoom-in.
uning on MSCOCO.
Method
NoC 129.09 23.23 93.33 13.40

データセットサイズによる性能
nCOYOデータセット
• 700MのImage-Textデータ
nデータセットを分割して様々なサイズの
データセットを作成
• 3M, 10M, 23M, 125M
nVanilaモデルは性能が頭打ち
nVanila (Filtering)はデータが増えると性
能向上
n提案手法は一貫して高性能
tuning on MSCOCO.
Method
NoC 129.09 23.23 93.33 13.40
Figure 5: Distribution of CLIP scores on MSCOCO dataset and
examples at different alignment scores. Best viewed in zoom-in.
in a way that they are distinguishable from others. In con-
trast, our model can generate highly aligned captions with
fine-grained details for the given images by adjusting the
control signal with a high alignment level. This result im-
plies our model is effective in generating distinct captions.
The qualitative examples of the retrieval results are illus-
trated in Fig. 9 and Appendix H.

なぜノイズにロバストなのか？
n長い期間学習をして，学習データの記憶によるノイズ解析を行う
n学習データ
• CC3M train
n検証データ
• CC3M train (CLIP similality: 0 ~ 0.15)
• CC3M train (CLIP similality: 0.35 ~ )
n𝑧 = 7のモデル
• 類似度の高いデータを中心に学習している
• Noisy GroupのEM (Exact Matching: 完全一致)の数が少ない（well-aligned
Groupは多い）
• ノイズの多いデータの影響を受けない
nVanilaモデル
• ノイズの多いデータの記憶によって類似度の高いデータの学習が妨げられる
Method
Noisy Group Well-aligned Group
EM B@4 CIDEr CS EM B@4 CIDEr CS
Vanilla 5350 12.63 122.24 52.49 8092 43.46 407.10 82.67
NoC (z=3) 10527 20.41 210.83 42.87 225 7.43 53.33 56.35
NoC (z=5) 939 5.83 48.47 55.86 4882 33.32 304.18 77.69
NoC (z=7) 47 2.10 18.28 59.57 10562 52.78 504.49 86.57
Table 5: Comparison of memorization capability on CC3M training samples of two
different noise levels. Note that higher EM (# Exact Matching), B@4 (BLEU@4), and
CIDEr scores in the noisy group mean over-memorization to noisy data. Higher CS
(CLIPScore) means better alignment with given images.
Options CIDEr CLIPScore
Quantile 49.22 65.72
Uniform 49.18 66.65
(a) Bucketing strategy
Sum 48.99 65.19
MLP 48.70 66.94
Concat. 49.18 66.65
(b) Control fusion method
Table 6: Ablations on the zero-shot MSCOCO captioning after training on the CC3M data

Ablation study
nバケッティングとアライメントレベル埋め込みのablation study
nバケッティング
• Uniformかつ𝑏𝑖𝑛𝑠 = 8が高性能
nアライメントレベル埋め込み
• Concatが高性能
• Sum, MLPでは平滑化のような操作になる
• アライメントレベルとしての条件の意味合いを持たすにはConcatが一番良い
Method
Noisy Group Well-aligned Group
EM B@4 CIDEr CS EM B@4 CIDEr CS
Vanilla 5350 12.63 122.24 52.49 8092 43.46 407.10 82.67
NoC (z=3) 10527 20.41 210.83 42.87 225 7.43 53.33 56.35
NoC (z=5) 939 5.83 48.47 55.86 4882 33.32 304.18 77.69
NoC (z=7) 47 2.10 18.28 59.57 10562 52.78 504.49 86.57
Table 5: Comparison of memorization capability on CC3M training samples of two
different noise levels. Note that higher EM (# Exact Matching), B@4 (BLEU@4), and
CIDEr scores in the noisy group mean over-memorization to noisy data. Higher CS
(CLIPScore) means better alignment with given images.
Figure 7: Performances on two groups of dif-
ferent similarities from CC3M validation set.
Quantile 49.22 65.72
Uniform 49.18 66.65
(a) Bucketing strategy
Sum 48.99 65.19
MLP 48.70 66.94
Concat. 49.18 66.65
(b) Control fusion method
4 49.28 65.67
8 49.18 66.65
16 48.58 65.67
(c) The number of bucket bins
Table 6: Ablations on the zero-shot MSCOCO captioning after training on the CC3M dataset. The options, highlighted in bold, are selected

まとめ
nウェブクローリングから集めたノイズが含まれるデータをうまく活用
しながら学習させるフレームワークの提案
• 人間がアノテーションする方法はデータセットをスケールアップすることが困
難
• ウェブから収集した大規模なデータを用いて学習する
• 低品質のデータが含まれるものも利用しながら学習するフレームワーク
n実験から提案手法が一貫してベースラインを上回った

重み付けフィルタリングの詳細
nCLIP similalityで重み付けをして損失を計算
n類似度が高いデータで0.482
• 収束が遅くなる
n0-1正規化して2倍した類似度を使用
• 平均: 1
• 最小: 0
• 最大: 2
om
ore,
on
To
s of
ned
de-
we
i.e.,
de-
ely.
her
ure
@4,
ner-
we
ated
ing
Loss weighting. The final baseline is a vanilla caption-
ing model trained with a loss re-weighting strategy to tackle
the noise issue. In this baseline, instead of filtering, we use
CLIP similarity score to re-weight sample-wise loss as
Lweighting =
1
N
N
X
i=1
where si indicates a cosine similarity computed by CLIP for
ith
image-text pair in a minibatch of N instances. Further
details for this baseline are explained in Appendix D.
4.3. Implementation details
We employ a pre-trained CLIP ViT-L/14 for encoding
visual features and computing the alignment level z, and
use 6 randomly-initialized transformer blocks as the cap-
tion decoder. Except for results in Tabs. 1 to 3, we freeze
the visual encoder and take a single CLS token as the visual
feature due to its efficiency in all ablation and analytical ex-
periments. More details regarding training settings are de-
Bootstrap 13.51 17.46 55.13 64.31 49.46 62.16 45.23 60.60 47.14 61.62 59.93 63.62
NoC (z=7) 15.96 19.50 62.04 66.70 54.94 64.21 51.74 62.54 53.09 63.92 63.15 65.19
Table 8: Comparison with the Bootstrap baseline. All models are trained on CC3M and evaluated on MSCOCO and nocaps datasets. B@4,
M, C, and CS mean BLEU@4, METEOR, CIDEr, and CLIPScore metrics, respectively. Numbers in bold indicate the best method.
D. Details for Loss Weighting Baseline
We have defined the Loss weighting baseline in Sec. 4.2
as follows:
Lweighting =
1
N
N
X
i=1
where si indicates a cosine similarity computed by CLIP
for ith
image-text pair in a minibatch of N instances.
Additionally, since the range of cosine similarity from
training samples is approximately ranged between 0 and 0.5
as shown in Fig. 11, we apply min-max normalization to s
and multiply by 2 for equalizing the averaged loss scale.
This loss reweighting strategy makes the model be updated
by less for miss-aligned samples and more on well-aligned
ones.
For a better understanding, we visualize the distribution
of cosine similarities of the CC3M training split in Fig. 11.
The range of raw cosine similarities is distributed from -
0.037 to 0.482, and the mean of the distribution is 0.239. If
we use the cosine similarities directly for loss re-weighting
defined in Eq. (5), the scale of the overall loss value be-
comes low, which could result in slow convergence of the
training phase. Thus, to equalize the averaged loss scale,
we re-scale the cosine similarities for the loss re-weighting
strategy by applying min-max normalization and multiply-
Min: -0.037
Max: 0.482
Mean: 0.239
(a) Distribution of cosine similarities before re-scaling.
Min: 0.0
Max: 2.0
Mean: 1.066
(b) Distribution of cosine similarities after re-scaling.

アライメントレベルごとの性能
nZero-shotの結果
• 𝑧 ≥ 6でベースラインを上回る
• 𝑧 ≤ 5でベースラインより性能低下
• アライメントレベルが低いデータで学習されているので妥当な結果
• 𝑧 = 1だと性能が向上
• 𝑧 = 1のサンプル (n = 21)が少ないことが要因
Models
MSCOCO
nocaps
overall in-domain near-domain out-of-domain
B@4 M C CS C CS C CS C CS C CS
Vanilla 10.31 15.48 47.56 62.89 41.58 60.49 38.60 58.64 39.24 59.91 51.22 62.15
Loss weighting 11.16 16.15 50.86 63.87 43.89 61.18 39.30 59.23 41.80 60.50 53.84 63.04
NoC (z=1) 12.65 16.44 52.78 63.23 43.95 60.55 41.39 58.50 42.11 60.30 51.70 61.62
NoC (z=2) 3.77 8.21 12.83 45.24 10.26 43.43 11.96 46.29 10.76 44.46 7.45 40.60
NoC (z=3) 3.70 9.19 17.33 49.67 15.12 48.05 16.45 49.56 15.68 48.66 12.40 46.48
NoC (z=4) 8.19 13.52 39.61 59.86 32.25 57.23 30.04 56.01 31.76 57.23 35.39 57.62
NoC (z=5) 11.88 16.28 52.11 64.01 45.68 61.72 41.25 59.57 43.74 61.18 55.05 63.23
NoC (z=6) 14.38 18.27 58.76 65.82 51.50 63.52 48.02 61.82 49.60 63.13 60.06 63.52
NoC (z=7) 15.96 19.50 62.04 66.70 54.94 64.21 51.74 62.54 53.09 63.92 63.15 65.19
NoC (z=8) 12.82 17.46 53.50 64.94 48.05 62.64 42.44 60.16 45.86 62.20 59.08 64.21

アライメントレベルごとの性能
nSelf-retrievalの結果
• Zero-shotの結果と同様の傾向
52.11 64.01 45.68 61.72 41.25 59.57 43.74 61.18 55.05 63.23
58.76 65.82 51.50 63.52 48.02 61.82 49.60 63.13 60.06 63.52
62.04 66.70 54.94 64.21 51.74 62.54 53.09 63.92 63.15 65.19
53.50 64.94 48.05 62.64 42.44 60.16 45.86 62.20 59.08 64.21
bin indices from models trained on CC3M. Numbers in bold and underlined indicate the best
After separating the
ring) model to gen-
he second group, as
Vanilla model with
nable-Dense setting.
e Bootstrap baseline
res than the Vanilla
seline shows signif-
model. We carefully
hnique would yield
the target domain is
filter. On the other
ly mitigate the neg-
Models
MSCOCO Flickr30k
R@1 R@5 R@10 R@1 R@5 R@10
GT Caption 34.57 59.30 69.91 63.08 86.50 92.00
Vanilla 25.44 50.38 61.66 47.10 76.60 85.90
Loss weighting 28.78 54.44 65.44 48.00 78.90 87.50
NoC (z=1) 25.92 49.92 61.98 44.60 77.20 86.60
NoC (z=2) 2.44 7.86 11.46 8.30 21.50 29.20
NoC (z=3) 5.02 13.12 18.98 12.70 30.90 40.00
NoC (z=4) 17.34 38.86 50.30 32.30 66.70 75.30
NoC (z=5) 29.02 54.32 66.26 50.30 84.10 91.10
NoC (z=6) 35.26 62.78 73.88 60.00 89.90 95.40
NoC (z=7) 40.00 66.78 77.53 65.10 92.00 96.20
NoC (z=8) 32.30 57.60 69.72 54.90 85.80 93.50

論文紹介：Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Recommended

Recommended

More Related Content

Similar to 論文紹介：Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Similar to 論文紹介：Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning (20)

More from Toru Tamaki

More from Toru Tamaki (20)

Recently uploaded

Recently uploaded (20)

論文紹介：Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning