The document proposes a method to improve image-to-image translation tasks by enforcing consistency of patch-wise semantic relations between the input and output images. It introduces a consistency loss to preserve the semantic relation distribution of patches during translation. It also uses a contrastive loss with hard negative mining, where negatives are weighted based on their semantic closeness to query patches. Experimental results on single-modal, multi-modal, and GAN compression tasks show the method enhances spatial correspondence and improves output quality by better retaining patch-wise semantic relations between input and output images.
1. Exploring Patch-wise Semantic Relation for Contrastive Learning in Image-to-Image Translation Tasks
Chanyong Jung*, Gihyun Kwon*, Jong Chul Ye (*: equal contribution)
Bio Imaging, Signal Processing & Learning Lab, KAIST
CVPR 2022
2. Introduction
Heterogeneous semantic relation

We claim:
1. Patch-wise semantic relations should be preserved, to enhance spatial correspondence.
2. Negative samples for the contrastive loss should be treated differently, since they have heterogeneous semantics.

Proposed method:
1. Consistency of the semantic relation
2. Contrastive loss using hard negative mining by the semantic relation
[Figure: a shared encoder maps patches from the input and output images to embeddings {z_k}_{k=1}^{K} and {w_k}_{k=1}^{K} in a shared embedding space. Patches from the horse are semantically related to one another, and semantically unrelated to patches from the background.]

We propose: consistency of the contrastive semantic relation, with hard negative mining.
3. Method
1. Consistency of the semantic relation distribution

[Figure: similarity distribution P_k over input patches (z_1 … z_K) and Q_k over output patches (w_1 … w_K), made consistent between input and output.]

The semantic relation of the i-th patch to the k-th patch is defined as the similarity distribution:

P_k(i) = exp(z_k⊤z_i) / Σ_{j=1}^{K} exp(z_k⊤z_j)   (input)
Q_k(i) = exp(w_k⊤w_i) / Σ_{j=1}^{K} exp(w_k⊤w_j)   (output)

The Jensen-Shannon divergence (JSD) between P_k and Q_k is minimized for semantic relation consistency (SRC):

L_SRC = Σ_{k=1}^{K} JSD(P_k || Q_k)
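As a sanity check, the SRC loss can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: in the paper the embeddings come from the generator's encoder, while here `z` and `w` are arbitrary (K, D) arrays.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def src_loss(z, w):
    """Semantic relation consistency (SRC) loss sketch.

    z, w: (K, D) arrays of K patch embeddings from the input and
    output images. Returns the mean Jensen-Shannon divergence
    between the per-patch similarity distributions P_k and Q_k.
    """
    P = softmax(z @ z.T)   # P_k(i) = exp(z_k.z_i) / sum_j exp(z_k.z_j)
    Q = softmax(w @ w.T)   # Q_k(i), same construction on the output
    M = 0.5 * (P + Q)
    eps = 1e-12            # avoid log(0)
    kl = lambda a, b: (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=-1)
    jsd = 0.5 * kl(P, M) + 0.5 * kl(Q, M)   # JSD(P_k || Q_k) per patch
    return jsd.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(src_loss(z, z))                         # identical relations -> 0.0
print(src_loss(z, rng.normal(size=(8, 16))))  # positive when relations differ
```

Minimizing this quantity pushes the output's patch-wise similarity structure toward the input's, which is the consistency the slide describes.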
4. Method
2. Contrastive loss with hard negative mining

[Figure: embedding space of the input image 𝒳, showing the query point z, negative samples, and hard negatives lying close to the query.]

Sampling negatives z⁻ for a query z is modeled as the von Mises-Fisher distribution:

z⁻ ∼ q(z⁻; z, γ) = (1/N_q) exp(γ z⊤z⁻) p_Z(z⁻)

For a positive pair (w, z) and negative pairs (w, z⁻), we use the contrastive loss by decoupled InfoNCE (DCE) with hard negatives (hDCE):

L_hDCE(γ, τ) = −log[ exp(w⊤z/τ) / E_q[exp(w⊤z⁻/τ)] ]
             = −log[ exp(w⊤z/τ) / (E_p[exp(γ z⊤z⁻) exp(w⊤z⁻/τ)] / N_q) ]

τ: temperature parameter
γ: hardness of the negatives

Negatives are weighted by their semantic closeness, exp(γ z⊤z⁻).
The hardness of the negatives is explicitly controlled by γ: we train the networks by curriculum learning with varying γ.
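A minimal numpy sketch of this hard-negative-weighted contrastive loss. Here the vMF expectation is approximated by self-normalized importance weights exp(γ z⊤z⁻) over a finite set of sampled negatives; this finite-sample weighting scheme is an illustration, not the authors' code.

```python
import numpy as np

def hdce_loss(w, z, z_neg, gamma=1.0, tau=0.07):
    """Sketch of decoupled InfoNCE with hard negatives (hDCE).

    w:     (D,)   output-patch embedding (positive, paired with the query)
    z:     (D,)   query input-patch embedding
    z_neg: (N, D) sampled negative patch embeddings
    Negatives are importance-weighted by exp(gamma * z.z_neg), so
    semantically close (hard) negatives contribute more; gamma
    controls the hardness, tau is the temperature.
    """
    pos = np.exp(w @ z / tau)
    hard_w = np.exp(gamma * (z_neg @ z))      # hardness weights from the query
    hard_w = hard_w / hard_w.sum()            # self-normalized E_q[.]
    neg = (hard_w * np.exp(z_neg @ w / tau)).sum()
    return -np.log(pos / neg)
```

Setting gamma=0 recovers uniform weighting over negatives (plain DCE); the curriculum learning mentioned above would correspond to increasing gamma over the course of training.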
7. Results
2. Multi-modal translation

[Figure: latent-guided and reference-guided translation (Source vs. Ours); diverse outputs by random style codes of each class (Input → Spring, Summer, Autumn, Winter).]

Improved output by retaining the patch-wise semantic relation.
8. Results
3. GAN Compression

[Figure: Input, Teacher, Ours, and Baseline outputs on Horse-to-Zebra, Map-to-Satellite, and Cityscapes.]

Our student inherits the patch-wise semantic relation from the teacher, and its output shows improved correspondence with the teacher.
9. Results
- Similarity Map

[Figure: similarity maps between a query point on the input and the output, for InfoNCE, DCE, DCE+SRC, and DCE+Hneg+SRC.]

Semantic relation consistency (SRC) enhances the input-output correspondence.
Hard negative mining (Hneg) sharpens the semantic relations.
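A similarity map of this kind is just the inner product between one query patch embedding and the embeddings at every spatial location. A minimal numpy sketch, where the (H, W, D) feature layout and function name are illustrative choices, not from the paper:

```python
import numpy as np

def similarity_map(feat, query_yx):
    """Cosine-similarity map between one query location and all others.

    feat:     (H, W, D) patch/feature embeddings of an image.
    query_yx: (row, col) of the query point.
    Returns an (H, W) map; a sharper, more selective map is the
    effect hard negative mining is meant to produce.
    """
    f = feat / np.linalg.norm(feat, axis=-1, keepdims=True)  # unit-normalize
    q = f[query_yx]                                          # (D,) query vector
    return f @ q                                             # (H, W) cosine map

feat = np.random.default_rng(0).normal(size=(4, 5, 8))
sm = similarity_map(feat, (1, 2))
# sm[1, 2] is 1.0: a point is maximally similar to itself
```

Comparing such maps computed on the input and on the output visualizes exactly the consistency that the SRC loss enforces.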
10. Thank You
Jong Chul Ye, E-mail: jong.ye@kaist.ac.kr
Gihyun Kwon, E-mail: cyclomon@kaist.ac.kr
Chanyong Jung, E-mail: jcy@kaist.ac.kr
Editor's Notes
Hi, I’m Chanyong Jung. I would like to introduce our work, which investigates patch-wise semantic relations for image translation tasks.
The motivation of the work is the heterogeneous semantic relations between the patches of a single image. We claim that the semantic relations should be preserved during the image translation procedure, and that the negative samples for the patch-wise contrastive loss should be treated differently.
Following this claim, our method has two parts. In the first part, we enhance the consistency of the semantic relation between the input and the output. Next, we introduce a contrastive loss using hard negatives sampled by the semantic relation.
We first impose the consistency to enhance the spatial correspondence between the input and the output.
The figure shows the semantic relation between the k-th patch and the other patches.
z and w indicate the embedding vectors from the input and the output.
The semantic relational distribution is defined as the similarity distribution.
We denote the distribution P_k for the input, and Q_k for the output.
We minimize the Jensen-Shannon divergence between the distributions, to enhance the consistency.
Next, we introduce the contrastive loss with hard negative mining, considering the semantic relation.
We sample the hard negatives by the von Mises Fisher distribution, as shown in the figure.
Then, the contrastive loss is defined by the decoupled infoNCE with hard negatives.
In the loss function, the hardness of the negative mining can be controlled by gamma. Using gamma, we apply curriculum learning by progressively increasing the hardness during training.
We verified our method on three tasks: single-modal translation, multi-modal translation, and GAN compression. For GAN compression, we distill the patch-wise relational knowledge to enhance the spatial correspondence between the teacher and the student. For single-modal translation, we verified our method on the horse-to-zebra and Cityscapes datasets.
Our method improved the output in both qualitative and quantitative evaluation. Specifically, the consistency of the semantic relation enhances the spatial correspondence between the images, and yields output images with better visual quality.
For the multi-modal translation, the visual quality and the evaluation metrics also verified the improvement by our method.
Similarly to the single-modal translation, the correspondence between the input and the output is enhanced, which yields satisfactory visual quality in the output images.
We also demonstrate the diverse outputs by the random style codes for each class.
In the case of GAN compression, we applied our method to enhance the correspondence between the teacher and the student. In our method, the student model additionally receives the patch-wise relational knowledge, which leads to better performance. The visual assessment also verifies the enhanced correspondence between the teacher and the student, and the quantitative scores demonstrate the improvement.
Lastly, we demonstrate the consistency of the semantic relation, by showing the similarity maps.
As expected, the SRC loss enhanced the consistency of the semantic relation.
Also, the proposed hard negative mining sharpens the semantic relation, reducing the redundant similarity.