Video #101: a review of the paper
Restricting the Flow: Information Bottlenecks for Attribution
presented by 김준호 of the 펀디멘탈팀.
This paper is about explainable AI (XAI); we hope it helps the many of you interested in the topic! The method uses an attribution map to provide visual explanations, estimating how much each image region contributes to the network's prediction. 김준호 of the 펀디멘탈팀 kindly walks through the paper in detail, from the ground up!
Thank you, as always, for your interest and support!
Restricting the Flow: Information Bottlenecks for Attribution (ICLR '20)
펀디멘탈팀: 고형권, 김동희, 김창연, 송헌, 이민경, 이재윤
2021.02.07
딥러닝읽기모임
- What is an attribution map?: providing insight into a DNN's decision-making
Understanding Deep Neural Networks

How do we "explain" the network's prediction of Dog or Cat?
Find the region the network is looking at (visual explanation).

[Figure: Input Image → DNN → Dog / Cat]
- Visualizing Attribution Maps
Visual Explanation

White-box approach
- Pros: simple and fast.
- Cons: needs tractable internal components (e.g., gradients, activations); depends on the network architecture.

Black-box approach
- Pros: model-agnostic; provides interpretability even for black-box models.
- Cons: difficult to optimize.

[Selvaraju et al. 2017; Fong et al. 2017]
- Restricting the Flow: Information Bottlenecks for Attribution (ICLR '20)
Introduction

Existing attribution heatmaps can highlight areas that look subjectively irrelevant, and this might correctly reflect the network's unexpected way of processing the data.
So the authors propose a novel attribution method that estimates the amount of information an image region provides to the network's prediction, using the information-bottleneck concept.
- Information Bottleneck [Tishby et al. 2000]:
Preliminaries

$$\max\; I[Y; Z] - \beta\, I[X; Z]$$

Here $Z$ is a new random variable, $Y$ the label, and $X$ the input; $\beta$ controls the trade-off between predicting the labels well and using little information of $X$.

Goal: minimizing the information flow + maximizing the original model objective.

Common way to reduce the amount of information: mix in noise,

$$Z = \lambda(X)\, R + (1 - \lambda(X))\, \epsilon, \qquad \lambda_i(X) \in [0, 1]$$

where $R = f_l(X)$ is the $l$-th layer output and $\epsilon \sim \mathcal{N}(\mu_R, \sigma_R^2)$ has the same mean and variance as $R$.

1. $\lambda_i(X) = 1$: transmit all information ($Z_i = R_i$)
2. $\lambda_i(X) = 0$: all information in $R_i$ is lost ($Z_i = \epsilon_i$)
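A minimal NumPy sketch of this interpolation (the shapes and the fixed value of lam are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for R = f_l(X), the l-th layer output (shape is illustrative).
R = rng.normal(loc=2.0, scale=3.0, size=(8, 8))

# Noise with the same mean and variance as R.
mu_R, sigma_R = R.mean(), R.std()
eps = rng.normal(mu_R, sigma_R, size=R.shape)

# lam controls, per element, how much of R is transmitted.
lam = np.full(R.shape, 0.5)          # illustrative; in the paper lam is learned
Z = lam * R + (1.0 - lam) * eps

# The two limiting cases from the slide:
Z_keep = 1.0 * R + 0.0 * eps         # lam = 1: transmit everything, Z = R
Z_drop = 0.0 * R + 1.0 * eps         # lam = 0: all information lost, Z = eps
```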
- Information Bottleneck for attribution:
Proposed Method

Now it is required to estimate how much information about $R$ is contained in $Z$:

$$I[R, Z] = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,P(Z)]\big]$$

But $p(z) = \int p(z|r)\, p(r)\, dr$ is intractable, so use a variational approximation $Q(Z) = \mathcal{N}(\mu_R, \sigma_R^2)$ (assumption: all dimensions of $Z$ are normally distributed and independent):

$$
\begin{aligned}
I[R, Z] &= \int p(r) \int p(z|r) \log \frac{p(z|r)}{p(z)}\, dz\, dr \\
&= \iint p(r, z) \log \left( \frac{p(z|r)}{p(z)} \cdot \frac{q(z)}{q(z)} \right) dz\, dr \\
&= \iint p(r, z) \log \frac{p(z|r)}{q(z)}\, dz\, dr + \iint p(r, z) \log \frac{q(z)}{p(z)}\, dz\, dr \\
&= \iint p(r, z) \log \frac{p(z|r)}{q(z)}\, dz\, dr + \int p(z) \left( \int p(r|z)\, dr \right) \log \frac{q(z)}{p(z)}\, dz \\
&= \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big] - D_{KL}[P(Z)\,\|\,Q(Z)] \\
&\leq \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big]
\end{aligned}
$$

[Klambauer et al. 2017]
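Under the Gaussian assumption, both $P(Z|R)$ and $Q(Z)$ are normal, so the upper bound on the right-hand side can be evaluated in closed form. A sketch in the scalar case (the function names and toy numbers are mine, not the paper's):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ), elementwise."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def info_loss(r, lam, mu_R, var_R):
    """D_KL[P(Z|R=r) || Q(Z)] for Z = lam*r + (1-lam)*eps, eps ~ N(mu_R, var_R).

    Conditioned on r, Z is Gaussian: N(lam*r + (1-lam)*mu_R, (1-lam)^2 * var_R),
    and the variational marginal is Q(Z) = N(mu_R, var_R).
    """
    mu_z = lam * r + (1.0 - lam) * mu_R
    var_z = (1.0 - lam) ** 2 * var_R
    return kl_gauss(mu_z, var_z, mu_R, var_R)

# lam = 0 transmits nothing: Z then follows Q(Z) exactly, so the KL is zero.
# As lam grows toward 1, Z carries more information about r and the KL grows.
low = info_loss(r=3.0, lam=0.0, mu_R=0.0, var_R=1.0)
high = info_loss(r=3.0, lam=0.9, mu_R=0.0, var_R=1.0)
```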
- Information Bottleneck for attribution:
Proposed Method

$$I[R, Z] = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big] - D_{KL}[P(Z)\,\|\,Q(Z)]$$

Writing $\mathcal{L}_I = \mathbb{E}_R\big[D_{KL}[P(Z|R)\,\|\,Q(Z)]\big]$ for the first term, we have $\mathcal{L}_I \geq I[R, Z]$.
If $\mathcal{L}_I = 0$ for an area, the information from this area is not necessary for the network's prediction.
Their goal is to keep only the information necessary for correct classification: the mutual information should be minimal while the classification score remains high.

Total objective function:

$$\mathcal{L} = \mathcal{L}_{CE} + \beta\, \mathcal{L}_I$$

$\beta$ controls the relative importance of the two objectives (e.g., for a small $\beta$ more bits of information flow, and fewer for a higher $\beta$).

[Alemi et al. 2017]
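A compact sketch of this combined objective (the logits, target, KL map, and the function name `total_loss` are toy values of mine, not from the paper):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def total_loss(logits, target, kl_map, beta):
    """L = L_CE + beta * L_I, with L_I the mean KL over bottleneck positions."""
    l_ce = -np.log(softmax(logits)[target])
    l_i = kl_map.mean()
    return l_ce + beta * l_i

logits = np.array([2.0, 0.5, -1.0])   # toy classifier outputs
kl_map = np.full((4, 4), 0.2)         # toy per-position information loss

# A higher beta penalizes information flow more strongly.
low_beta = total_loss(logits, 0, kl_map, beta=1.0)
high_beta = total_loss(logits, 0, kl_map, beta=10.0)
```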
- Per-Sample Bottleneck
Proposed Method

Parameterization: the bottleneck parameters $\lambda$ have to be in $[0, 1]$, so they parametrize $\lambda = \mathrm{sigmoid}(\alpha)$ with $\alpha \in \mathbb{R}^d$.

Initialization: at the start they want to retain all the information, so they initialize $\alpha_i = 5$, giving $\lambda \approx 0.993 \Rightarrow Z \approx R$. At first, the bottleneck has practically no impact on the model performance; it then deviates from this starting point to suppress unimportant regions.

Optimization: 10 iterations of Adam with lr = 1 to fit the mask $\alpha$. To stabilize the training, they copy the single sample 10 times and apply different noise to each copy.
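The fitting loop can be sketched on a toy problem. Everything here is illustrative: a fixed linear readout stands in for the classifier, finite differences replace backprop, and plain gradient descent replaces Adam; only the loop structure (10 steps, 10 noisy copies of the sample, lr = 1, init alpha = 5) mirrors the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4
R = np.full(d, 3.0)                    # toy layer output
mu_R, var_R = 0.0, 1.0                 # toy noise statistics
w = np.array([1.0, 0.0, 0.0, 0.0])     # toy readout: only feature 0 matters
beta, copies = 0.01, 10

def kl_term(lam):
    # Closed-form KL[P(Z|R) || Q(Z)] per feature (Gaussian assumption).
    mu_z = lam * R + (1 - lam) * mu_R
    var_z = (1 - lam) ** 2 * var_R
    return 0.5 * (np.log(var_R / np.maximum(var_z, 1e-12))
                  + (var_z + (mu_z - mu_R) ** 2) / var_R - 1.0)

def loss(alpha, eps):
    lam = 1 / (1 + np.exp(-alpha))
    Z = lam * R + (1 - lam) * eps      # (copies, d), different noise per copy
    l_ce = -(Z @ w).mean()             # stand-in for the classification loss
    return l_ce + beta * kl_term(lam).sum()

alpha = np.full(d, 5.0)                # init: lam ~ 0.993, so Z ~ R
for _ in range(10):                    # 10 iterations, as in the slide
    eps = rng.normal(mu_R, np.sqrt(var_R), size=(copies, d))
    grad = np.zeros(d)
    for i in range(d):                 # finite-difference gradient (sketch only)
        h = np.zeros(d); h[i] = 1e-4
        grad[i] = (loss(alpha + h, eps) - loss(alpha - h, eps)) / 2e-4
    alpha -= 1.0 * grad                # lr = 1, but plain GD instead of Adam

lam = 1 / (1 + np.exp(-alpha))
# The mask keeps the feature the readout uses and suppresses the others.
```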
- Per-Sample Bottleneck
Proposed Method

Measure of information in $Z$ (i.e., $D_{KL}(P(Z|R)\,\|\,Q(Z))$ per dimension): sum over the channel axis,

$$m_{h,w} = \sum_{i=0}^{c} D_{KL}\big(P(Z_{i,h,w} \mid R_{i,h,w})\,\|\,Q(Z_{i,h,w})\big)$$

Enforcing local smoothness: pooling and convolution strides ignore parts of the input, which makes the Per-Sample Bottleneck overfit to a grid structure. They therefore convolve the sigmoid output with a fixed Gaussian kernel of standard deviation $\sigma_s$:

$$\lambda = \mathrm{blur}(\sigma_s, \mathrm{sigmoid}(\alpha))$$
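A sketch of both steps: summing the per-position KL over channels to get the map $m$, and blurring $\mathrm{sigmoid}(\alpha)$ with a fixed Gaussian kernel. The KL volume, mask, shapes, and $\sigma_s$ are toy values, and the hand-rolled separable blur is a stand-in for the paper's fixed-kernel convolution:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Toy per-position KL volume, shape (c, h, w): one KL value per bottleneck unit.
kl = rng.random((16, 4, 4))

# Attribution map: sum the KL over the channel axis.
m = kl.sum(axis=0)                          # shape (h, w)

def gaussian_blur(x, sigma):
    """Separable 2-D Gaussian blur with edge padding ('same' output size)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    pad = np.pad(x, radius, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 0, rows)

# lam = blur(sigma_s, sigmoid(alpha)): smooth the mask before mixing in noise.
alpha = rng.normal(size=(4, 4))
lam = gaussian_blur(sigmoid(alpha), sigma=1.0)
```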
- Readout Bottleneck
Proposed Method

Collect feature maps from different depths, then combine them with a 1×1 convolution:
① In a first forward pass, no noise is added; the different feature maps are collected and interpolated bilinearly to match the spatial dimension.
② In a second forward pass, the bottleneck layer is inserted into the network and restricts the flow of information.
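A rough NumPy sketch of the two-pass structure. The shapes and 1×1-conv weights are random toys, and nearest-neighbour resizing stands in for the paper's bilinear interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pass 1 (no noise): feature maps collected at different depths (toy shapes).
feats = [rng.random((8, 16, 16)), rng.random((16, 8, 8)), rng.random((32, 4, 4))]

def resize_nearest(x, h, w):
    """Nearest-neighbour resize; a stand-in for bilinear interpolation."""
    c, H, W = x.shape
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return x[:, rows][:, :, cols]

# Match every map to the bottleneck's spatial size, then stack on channels.
H, W = 16, 16
stacked = np.concatenate([resize_nearest(f, H, W) for f in feats], axis=0)

# A 1x1 convolution is a per-pixel linear map over channels.
w1 = rng.normal(size=(1, stacked.shape[0])) * 0.1
alpha = np.einsum("oc,chw->ohw", w1, stacked)[0]    # (H, W) mask logits

# Pass 2 would insert the bottleneck with lam = sigmoid(alpha) into the network.
lam = 1 / (1 + np.exp(-alpha))
```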
- Sanity Check (Randomization of Model Parameters)
Evaluation

Starting from the last layer, an increasing proportion of the network parameters is re-initialized until all parameters are random. The difference between the original heatmap and the heatmap obtained from the randomized model is quantified using SSIM.
For their methods, randomizing the final dense layer alone drops the mean SSIM by around 0.4.

[Adebayo et al. 2018]
- Sensitivity-N
Evaluation

Mask the network's input randomly, then measure how strongly the amount of attribution in the mask correlates with the drop in classifier score:

$$\mathrm{corr}\Big( \sum_{i \in T_n} R_i(x),\; S_c(x) - S_c\big(x_{[x_{T_n} = 0]}\big) \Big)$$

where $S_c$ is the classifier logit output for class $c$ and $x_{[x_{T_n}=0]}$ is the input with all pixels in $T_n$ set to zero.

The Per-Sample Bottleneck ($\beta = 10/k$) performs best for both models above $n = 2 \cdot 10^3$ pixels (i.e., when more than 2% of all pixels are masked).

[Ancona et al. 2018]
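A sketch of the metric on a toy linear "classifier", where the true attribution of pixel $i$ is $w_i x_i$, so zeroing $T_n$ drops the score by exactly the attribution mass inside $T_n$ and the correlation should be essentially perfect (all names and shapes here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sensitivity_n(attribution, score_fn, x, n, trials=50):
    """Correlate the attribution summed over a random pixel set T_n with the
    drop in the class score when those pixels are set to zero."""
    base = score_fn(x)
    sums, drops = [], []
    for _ in range(trials):
        idx = rng.choice(x.size, size=n, replace=False)
        masked = x.ravel().copy()
        masked[idx] = 0.0
        sums.append(attribution.ravel()[idx].sum())
        drops.append(base - score_fn(masked.reshape(x.shape)))
    return np.corrcoef(sums, drops)[0, 1]

# Toy linear classifier: the logit is a weighted pixel sum.
w = rng.random((10, 10))
x = rng.random((10, 10))
score = lambda img: float((w * img).sum())
attr = w * x

r = sensitivity_n(attr, score, x, n=20)
```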
- Localization
Evaluation

1. Bounding Box: if the bbox contains $n$ pixels, measure how many of the $n$ highest-scored pixels are contained in the bounding box, then divide by $n$ (a ratio).
2. Image Degradation: either the tiles ranked most relevant by the attribution method are removed first (MoRF), or the tiles ranked least relevant are removed first (LeRF). The model output is normalized as

$$s(x) = \frac{p(y|x) - b}{t_1 - b}$$

where $t_1$ is the top-1 probability on the original samples and $b$ is the mean model output on the fully degraded images.

The LeRF and MoRF degradations yield curves measuring different qualities of the attribution method; the final score is the integral between the two curves.
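Both metrics are easy to sketch. `bbox_ratio` implements the bounding-box score and `degradation_score` the normalization $s(x)$; the toy arrays and function names are mine:

```python
import numpy as np

def bbox_ratio(attribution, bbox_mask):
    """Fraction of the n highest-scored pixels inside the box (n = box size)."""
    n = int(bbox_mask.sum())
    top = np.argsort(attribution.ravel())[-n:]   # indices of the n top pixels
    return bbox_mask.ravel()[top].sum() / n

def degradation_score(p, t1, b):
    """s(x) = (p(y|x) - b) / (t1 - b): 1 on intact inputs, 0 when degraded."""
    return (p - b) / (t1 - b)

# Toy example: all attribution mass lies inside a 2x2 box, so the ratio is 1.
attr = np.zeros((4, 4))
attr[1:3, 1:3] = 1.0
box = np.zeros((4, 4), dtype=bool)
box[1:3, 1:3] = True
ratio = bbox_ratio(attr, box)
```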