Deep learning paper review ppt source - DirectCLR

Understanding Dimensional Collapse in Contrastive Self-Supervised Learning
Li Jing et al.
Facebook AI Research
Presenter: 이재윤
Fundamental Team: 김동희, 김지연, 김창연, 송헌, 이근배

Contents
1. Motivation
2. Two Mechanisms of Collapse
3. DirectCLR
4. Conclusion

Self-Supervision
• Define a pretext task
 Context Prediction: learn to recognize spatial relationships
 Jigsaw Puzzle: restore the puzzle
 Joint Embedding Vector: match the embedding vectors of augmented views
<Context Prediction> <Jigsaw Puzzle> <Joint Embedding Vector>

Collapsing Problem
• Two types of collapse
 Complete collapse: all embedding vectors shrink to a single vector
 Dimensional collapse: embedding vectors span only a lower-dimensional subspace
• Self-supervised methods prevent complete collapse
 Dimensional collapse still occurs

Contrastive Learning
• Compare training samples
 Positive pairs are encouraged to be close
 Negative pairs are pushed apart
• It seems intuitive to speculate that negative pairs prevent dimensional collapse
• Contrary to this intuition, contrastive learning still suffers from dimensional collapse
<"Negative" pairs: two CNN branches producing f_i and f_j>

Contrastive Learning
• Singular value spectrum of the embedding space of SimCLR
 $C = \frac{1}{N}\sum_i (z_i - \bar{z})(z_i - \bar{z})^T = \frac{1}{N}\sum_i W (x_i - \bar{x})(x_i - \bar{x})^T W^T$
 Singular value decomposition of the covariance matrix: $C = U S V^T$
• Embedding vectors span only a lower-dimensional subspace
 About 30 singular values drop to zero
• This hurts downstream tasks (e.g., classification)
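
The spectrum described on this slide can be computed in a few lines. The following is a minimal NumPy sketch, not the presenters' or the paper's code: the function name `covariance_spectrum` and the toy data (8-dimensional embeddings confined to a 3-dimensional subspace) are assumptions made for illustration.

```python
import numpy as np

def covariance_spectrum(z):
    """Return the singular values (descending) of the covariance of embeddings z (N x d)."""
    z = np.asarray(z, dtype=float)
    centered = z - z.mean(axis=0, keepdims=True)      # z_i - z_bar
    cov = centered.T @ centered / z.shape[0]          # C = sum_i (z_i - z_bar)(z_i - z_bar)^T / N
    return np.linalg.svd(cov, compute_uv=False)

rng = np.random.default_rng(0)
# Toy embeddings: 8-dimensional vectors that actually live in a 3-dimensional subspace.
latent = rng.normal(size=(500, 3))
basis = rng.normal(size=(3, 8))
z = latent @ basis

spectrum = covariance_spectrum(z)
# Count singular values that are negligible relative to the largest one.
collapsed = int(np.sum(spectrum < 1e-8 * spectrum[0]))
print("singular values:", np.round(spectrum, 4))
print("collapsed dimensions:", collapsed)
```

Singular values far below the largest one mark collapsed dimensions; on this toy data, five of the eight are numerically zero.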

DirectCLR
• Show that contrastive learning also suffers from dimensional collapse
• Explain why dimensional collapse occurs in contrastive learning
 Data augmentation
 Implicit regularization
• Propose a novel contrastive learning method called DirectCLR

Data Augmentation
• Assume a simple linear network, $z = Wx$
 Trained with contrastive learning
 InfoNCE loss:
$L = -\sum_{i=1}^{N} \log \frac{\exp(-|z_i - z'_i|^2/2)}{\sum_{j \neq i} \exp(-|z_i - z_j|^2/2) + \exp(-|z_i - z'_i|^2/2)}$
 $z_i, z'_i$: pair of embedding vectors from the two branches
 $z_j$: negative sample paired with $z_i$
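
As an illustration of this loss, here is a minimal NumPy sketch for a single linear layer. It assumes the squared-distance form of InfoNCE used on this slide (negative exponents, positive pair in the numerator); the function name `infonce_loss`, the toy sizes, and the additive-noise "augmentation" are assumptions, not part of the original deck.

```python
import numpy as np

def infonce_loss(z, z_prime):
    """InfoNCE loss with squared-distance similarity for paired embeddings (each N x d)."""
    z = np.asarray(z, dtype=float)
    z_prime = np.asarray(z_prime, dtype=float)
    # exp(-|z_i - z_j|^2 / 2) for every pair (i, j); the diagonal is dropped below.
    diff = z[:, None, :] - z[None, :, :]
    neg = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(neg, 0.0)                                   # exclude j == i
    # exp(-|z_i - z'_i|^2 / 2) for the positive pair of each sample.
    pos = np.exp(-0.5 * np.sum((z - z_prime) ** 2, axis=-1))
    denom = neg.sum(axis=1) + pos                                # Z_i
    return float(-np.sum(np.log(pos / denom)))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                        # linear network z = W x
x = rng.normal(size=(8, 5))                        # 8 toy inputs
x_prime = x + 0.1 * rng.normal(size=x.shape)       # toy "augmented" views
loss_value = infonce_loss(x @ W.T, x_prime @ W.T)
print("InfoNCE loss:", loss_value)
```

The loss is never negative, because the positive term in the numerator also appears in the denominator.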

Data Augmentation
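• Calculate the gradient $G$ that drives the weight matrix $W$ (gradient descent: $\dot{W} = -G$)
$G = \frac{\partial L}{\partial W} = \sum_i \left( \frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W} + \frac{\partial L}{\partial z'_i}\frac{\partial z'_i}{\partial W} \right) = \sum_i \left( g_{z_i} x_i^T + g_{z'_i} {x'_i}^T \right)$
 $L$: InfoNCE loss
 $W$: weight matrix
 $g_{z_i}$: gradient on the embedding vector $z_i$
• Specifically, $g_{z_i}$ and $g_{z'_i}$ are written as:
$g_{z_i} = \sum_{j \neq i} \alpha_{ij}(z_j - z'_i) + \sum_{j \neq i} \alpha_{ji}(z_j - z_i), \quad g_{z'_i} = \sum_{j \neq i} \alpha_{ij}(z'_i - z_i)$
 $\alpha_{ij} = \exp(-|z_i - z_j|^2/2)/Z_i$, $\alpha_{ii} = \exp(-|z_i - z'_i|^2/2)/Z_i$
 $Z_i = \sum_{j \neq i} \exp(-|z_i - z_j|^2/2) + \exp(-|z_i - z'_i|^2/2)$ (denominator of the InfoNCE loss)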
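
The expressions for $g_{z_i}$ and $g_{z'_i}$ on this slide can be checked numerically. The sketch below treats them as the derivatives of $L$ with respect to the embeddings, builds $\sum_i (g_{z_i} x_i^T + g_{z'_i} {x'_i}^T)$, and compares it with a finite-difference estimate of $\partial L / \partial W$. All function names, sizes, and the noise-based augmentation are assumptions for illustration.

```python
import numpy as np

def infonce_loss(z, z_prime):
    """Squared-distance InfoNCE loss for paired embeddings (each N x d)."""
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    e = np.exp(-0.5 * d2)
    pos = np.exp(-0.5 * np.sum((z - z_prime) ** 2, axis=-1))
    np.fill_diagonal(e, pos)                     # row sums become Z_i
    return float(-np.sum(np.log(pos / e.sum(axis=1))))

def embedding_gradients(z, z_prime):
    """Return (g_z, g_z_prime), row i holding g_{z_i} and g_{z'_i}."""
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    e = np.exp(-0.5 * d2)
    np.fill_diagonal(e, np.exp(-0.5 * np.sum((z - z_prime) ** 2, axis=-1)))
    alpha = e / e.sum(axis=1, keepdims=True)     # alpha[i, j] = alpha_ij, alpha[i, i] = alpha_ii
    off = alpha.copy()
    np.fill_diagonal(off, 0.0)                   # keep only the j != i terms
    # g_{z_i} = sum_{j!=i} alpha_ij (z_j - z'_i) + sum_{j!=i} alpha_ji (z_j - z_i)
    g_z = (off @ z - off.sum(axis=1, keepdims=True) * z_prime
           + off.T @ z - off.sum(axis=0)[:, None] * z)
    # g_{z'_i} = sum_{j!=i} alpha_ij (z'_i - z_i)
    g_zp = off.sum(axis=1, keepdims=True) * (z_prime - z)
    return g_z, g_zp

def analytic_weight_gradient(W, x, x_prime):
    """sum_i (g_{z_i} x_i^T + g_{z'_i} x'_i^T) for the linear network z = W x."""
    g_z, g_zp = embedding_gradients(x @ W.T, x_prime @ W.T)
    return g_z.T @ x + g_zp.T @ x_prime

def numerical_weight_gradient(W, x, x_prime, eps=1e-6):
    """Central finite-difference estimate of dL/dW."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        G[idx] = (infonce_loss(x @ Wp.T, x_prime @ Wp.T)
                  - infonce_loss(x @ Wm.T, x_prime @ Wm.T)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
x_prime = x + 0.1 * rng.normal(size=x.shape)
W = 0.5 * rng.normal(size=(3, 4))

max_err = np.max(np.abs(analytic_weight_gradient(W, x, x_prime)
                        - numerical_weight_gradient(W, x, x_prime)))
print("max abs difference:", max_err)
```

A small difference indicates that the closed-form embedding gradients agree with the loss they were derived from.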

Data Augmentation
• We have
$G = -WX$
$X := \sum_{i,j} \alpha_{ij} (x_i - x_j)(x_i - x_j)^T - \sum_{i} (1 - \alpha_{ii}) (x'_i - x_i)(x'_i - x_i)^T$
• $X$ is the difference of two PSD matrices ($X = \Sigma_0 - \Sigma_1$)
 $\Sigma_0$: weighted data distribution covariance matrix
 $\Sigma_1$: weighted augmentation covariance matrix
• If $X$ has negative eigenvalues, the weight matrix $W$ has vanishing singular values after $t$ updates
• The covariance matrix $C$ is then also low-rank
$C = \frac{1}{N}\sum_i (z_i - \bar{z})(z_i - \bar{z})^T = \frac{1}{N}\sum_i W (x_i - \bar{x})(x_i - \bar{x})^T W^T$

Data Augmentation
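• What matters is the eigenvalues of $X$
• The weight matrix after $t$ updates, treating $X$ as fixed with eigendecomposition $X = U \Lambda U^T$, is:
$W(t) = W(0)\exp(Xt) = W(0)\, U \exp(\Lambda t)\, U^T$
• If $\Lambda$ has negative eigenvalues, the weight matrix $W$ has vanishing singular values
 As $t \to \infty$, the corresponding entries of $\exp(\Lambda t)$ decay to zero, so $U \exp(\Lambda t) U^T$ approaches a rank-deficient matrix
 $W(t)$ becomes rank-deficient
• Finally, $C$ becomes low-rank: dimensional collapse
 $C = \frac{1}{N}\sum_i (z_i - \bar{z})(z_i - \bar{z})^T = \frac{1}{N}\sum_i W (x_i - \bar{x})(x_i - \bar{x})^T W^T$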
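
The effect of negative eigenvalues in $W(t) = W(0)\exp(Xt)$ can be seen with a small numerical experiment. This is a sketch under stated assumptions: $X$ is a fixed symmetric matrix with hand-picked eigenvalues (two of them negative), and `evolve_weights` is a hypothetical helper that evaluates the matrix exponential through the eigendecomposition.

```python
import numpy as np

def evolve_weights(W0, X, t):
    """Return W(t) = W(0) exp(X t) for a symmetric X, using X = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(X)
    return W0 @ U @ np.diag(np.exp(lam * t)) @ U.T

rng = np.random.default_rng(0)
d = 4
eigenvalues = np.array([0.5, 0.2, -0.3, -1.0])       # two negative eigenvalues (assumed)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))          # random orthonormal eigenvectors
X = Q @ np.diag(eigenvalues) @ Q.T
W0 = rng.normal(size=(d, d))                          # random initial weights

for t in (0.0, 5.0, 20.0):
    s = np.linalg.svd(evolve_weights(W0, X, t), compute_uv=False)
    print(f"t = {t:5.1f}  singular values relative to the largest: {s / s[0]}")

s_final = np.linalg.svd(evolve_weights(W0, X, 20.0), compute_uv=False)
```

As `t` grows, the singular values tied to the negative eigenvalues of `X` shrink toward zero relative to the largest one, which is the rank deficiency described on this slide.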

Implicit Regularization
• The first scenario rarely happens in practice
 It requires strong augmentation
 It assumes a single layer
• Even with weak augmentation, dimensional collapse still happens in deep networks
 Implicit regularization
 Over-parameterized linear networks tend to find low-rank solutions, so $C$ becomes low-rank
• Assume a two-layer linear MLP

• The gradients that drive the weight matrices $W_1$ and $W_2$ (each layer follows $\dot{W}_l = -G_l$) are as follows:
$G_1 = W_2^T G, \quad G_2 = G W_1^T$
 $G = \sum_i \left( g_{z_i} x_i^T + g_{z'_i} {x'_i}^T \right)$
 $G = -W_2 W_1 X$
• The interaction between the two weight matrices is the key
 It is governed by the adjacent orthonormal matrices $V_2^T$ and $U_1$
$W_2 W_1 X = U_2 S_2 V_2^T U_1 S_1 V_1^T X$
 Theorem 2: If for all $t$, $W_2(t) W_1(t) \neq 0$, $X(t)$ is positive-definite, and $W_1(+\infty)$, $W_2(+\infty)$ have distinct singular values, then the alignment matrix $A = V_2^T U_1 \to I$
<Visualization of Matrix $A$>
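
To make the alignment matrix concrete, the sketch below computes $A = V_2^T U_1$ from the SVDs of two weight matrices. It does not simulate training: the "aligned" pair is constructed by hand so that the left singular vectors of $W_1$ equal the right singular vectors of $W_2$, and all names and sizes are assumptions. Absolute values are compared because singular vectors are only defined up to sign.

```python
import numpy as np

def alignment_matrix(W1, W2):
    """Return A = V2^T U1, where W1 = U1 S1 V1^T and W2 = U2 S2 V2^T."""
    U1, _, _ = np.linalg.svd(W1)
    _, _, V2t = np.linalg.svd(W2)
    return V2t @ U1

rng = np.random.default_rng(0)
d = 4
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # shared singular vectors
V1, _ = np.linalg.qr(rng.normal(size=(d, d)))
U2, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Aligned pair with distinct singular values, listed in descending order.
W1_aligned = Q @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V1.T
W2_aligned = U2 @ np.diag([4.0, 2.5, 1.5, 0.7]) @ Q.T
A_aligned = np.abs(alignment_matrix(W1_aligned, W2_aligned))

# Unaligned pair: two independent random matrices.
A_random = np.abs(alignment_matrix(rng.normal(size=(d, d)), rng.normal(size=(d, d))))

print("aligned |A|:\n", np.round(A_aligned, 3))
print("random  |A|:\n", np.round(A_random, 3))
```

For the aligned pair `|A|` is the identity, while for random matrices the mass is spread over off-diagonal entries; a heat map of `|A|` during training is one way to produce a visualization like the one on this slide.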

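• In a real scenario,
 Singular values are initialized with random values
 Alignment is not perfect
 The alignment matrix is a block-diagonal matrix
 Each block corresponds to a group of degenerate singular values
• The singular values of each weight matrix evolve as follows:
$\dot{\sigma}_1^k = \sigma_1^k (\sigma_2^k)^2 \left( {v_1^k}^T X v_1^k \right), \quad \dot{\sigma}_2^k = \sigma_2^k (\sigma_1^k)^2 \left( {v_1^k}^T X v_1^k \right)$
 Both equations give $\frac{d}{dt}\left[ (\sigma_2^k)^2 - (\sigma_1^k)^2 \right] = 0$, so $(\sigma_2^k)^2 = (\sigma_1^k)^2 + C$ for a constant $C$ (not the covariance matrix), and
$\dot{\sigma}_1^k = \sigma_1^k \left( (\sigma_1^k)^2 + C \right) \left( {v_1^k}^T X v_1^k \right)$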

• Singular values grow in proportion to themselves
$\dot{\sigma}_1^k = \sigma_1^k \left( (\sigma_1^k)^2 + C \right) \left( {v_1^k}^T X v_1^k \right)$
 Small singular values grow significantly more slowly
• The embedding space is characterized by the singular values of the covariance matrix
 $C = \frac{1}{N}\sum_i (z_i - \bar{z})(z_i - \bar{z})^T = \frac{1}{N}\sum_i W_2 W_1 (x_i - \bar{x})(x_i - \bar{x})^T W_1^T W_2^T$
<Three panels: log singular values vs. iteration>
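
The claim that small singular values grow much more slowly can be illustrated by integrating the coupled equations $\dot{\sigma}_1^k = \sigma_1^k (\sigma_2^k)^2 ({v_1^k}^T X v_1^k)$ and $\dot{\sigma}_2^k = \sigma_2^k (\sigma_1^k)^2 ({v_1^k}^T X v_1^k)$ with a simple Euler scheme. This is a sketch under assumptions: the projection ${v_1^k}^T X v_1^k$ is held at one fixed positive value `x_proj` for every $k$, and the initial values, step size, and step count are arbitrary toy choices.

```python
import numpy as np

def simulate_singular_values(sigma1_init, sigma2_init, x_proj, dt=1e-3, steps=4000):
    """Euler-integrate the two coupled singular-value equations for several modes k at once."""
    s1 = np.array(sigma1_init, dtype=float)
    s2 = np.array(sigma2_init, dtype=float)
    history = [s1.copy()]
    for _ in range(steps):
        ds1 = s1 * s2 ** 2 * x_proj       # d(sigma_1^k)/dt
        ds2 = s2 * s1 ** 2 * x_proj       # d(sigma_2^k)/dt
        s1 = s1 + dt * ds1
        s2 = s2 + dt * ds2
        history.append(s1.copy())
    return np.array(history), s1, s2

sigma1_init = np.array([0.5, 0.1, 0.01])   # large, medium, small mode of W1 (toy values)
sigma2_init = np.array([0.6, 0.2, 0.05])   # matching modes of W2 (toy values)
history, s1_final, s2_final = simulate_singular_values(sigma1_init, sigma2_init, x_proj=0.25)

growth = s1_final / sigma1_init
log_history = np.log(history)              # curves of log singular value vs. iteration
print("growth factor per mode:", growth)
print("sigma_2^2 - sigma_1^2 (initial):", sigma2_init ** 2 - sigma1_init ** 2)
print("sigma_2^2 - sigma_1^2 (final):  ", s2_final ** 2 - s1_final ** 2)
```

The largest mode grows by a visibly larger factor than the smaller ones, and $(\sigma_2^k)^2 - (\sigma_1^k)^2$ stays approximately constant, matching the constant $C$ in the slide's last equation.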

• With an over-parameterized network (more than 2 layers),
 The collapsing effect is amplified because more weight matrices are multiplied together
• Dimensional collapse also occurs in the nonlinear scenario (ReLU)
<Multiple Layers> <Nonlinear> (log singular values vs. singular value rank index)

Using projector
• The embedding vectors of SimCLR suffer from dimensional collapse
• The representation, in contrast, suffers less from dimensional collapse
 The projector prevents the collapse in the representation space
• For downstream tasks, only the representation is used
<Representation and embedding> <Representation space spectrum>

Using projector
• The effect of the projector
I. The projector weight matrix only needs to be diagonal
 As alignment occurs ($V_2^T U_1 \to I$), the matrix reduces to a simple diagonal matrix
$W_2 W_1 X = U_2 S_2 V_2^T U_1 S_1 V_1^T X$, with $V_2^T U_1 \to I$
II. The projector weight matrix only needs to be low-rank
 Because the projector weight matrix is low-rank, the gradient is applied only to a subspace of the representation

Main Idea
• DirectCLR
 Remove the projector
 Directly send a sub-vector of the representation to the loss
 Simplified training framework!
• The InfoNCE loss is calculated only with $z = r[0:d_0]$
• Comparison with SimCLR
 DirectCLR is trained with the standard SimCLR recipe for 100 epochs
 ResNet-50 as the backbone
<Test accuracy on ImageNet>
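
The core of the idea, taking $z = r[0:d_0]$ instead of passing $r$ through a projector, is a single slicing operation. The sketch below is illustrative only: `directclr_subvector` is a hypothetical helper, `d0 = 8` and the 32-dimensional toy representations are arbitrary, and the loss is the squared-distance InfoNCE from the earlier slides rather than the actual SimCLR training recipe.

```python
import numpy as np

def directclr_subvector(r, d0):
    """Return z = r[0:d0], the part of each representation (row of r) sent to the loss."""
    r = np.asarray(r, dtype=float)
    if not 0 < d0 <= r.shape[1]:
        raise ValueError("d0 must be between 1 and the representation dimension")
    return r[:, :d0]

def infonce_loss(z, z_prime):
    """Squared-distance InfoNCE loss (assumed form, taken from the earlier slides)."""
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    e = np.exp(-0.5 * d2)
    pos = np.exp(-0.5 * np.sum((z - z_prime) ** 2, axis=-1))
    np.fill_diagonal(e, pos)                       # row sums become Z_i
    return float(-np.sum(np.log(pos / e.sum(axis=1))))

rng = np.random.default_rng(0)
r = rng.normal(size=(16, 32))                      # toy backbone outputs, first view
r_prime = r + 0.1 * rng.normal(size=r.shape)       # toy backbone outputs, second view
d0 = 8                                             # hypothetical sub-vector size

loss = infonce_loss(directclr_subvector(r, d0), directclr_subvector(r_prime, d0))

# Changing the remaining entries of r leaves the loss untouched:
# the loss only ever sees r[0:d0].
r_changed = r.copy()
r_changed[:, d0:] += 5.0
loss_changed = infonce_loss(directclr_subvector(r_changed, d0),
                            directclr_subvector(r_prime, d0))
print(loss, loss_changed)
```

Because the loss depends only on the first `d0` entries, any learning signal for the rest of the representation has to arrive through the backbone itself.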

Main Idea
• Why does the rest of the representation, $r[d_0:]$, contain useful information?
 $r[d_0:]$ is copied from the layer before the last residual block through the residual connection
 DirectCLR takes advantage of the ResNet architecture

Conclusion
• Provides a theoretical understanding of dimensional collapse
I. Strong augmentation
II. Implicit regularization
• Proposes a novel contrastive self-supervised learning method, DirectCLR
 Prevents dimensional collapse without a projector
 Better performance than SimCLR with a trainable linear projector
• Limitations
 DirectCLR does not outperform SimCLR with a 2-layer projector
 Limited generalization to other architectures
