2. Concern: Model Transparency & Interpretability
Despite unprecedented breakthroughs of CNNs in a variety of
computer vision tasks, their lack of decomposability into
individually intuitive components makes them hard to interpret.
Purpose:
1. Visualizing CNNs
visualize CNN predictions by highlighting 'important' pixels (i.e., pixels whose
changes in intensity have the most impact on the prediction score)
2. Help Users Build Trust in AI
we must build 'transparent' models that have the ability to explain why they
predict what they predict.
3. What makes a good visual explanation?
1. Class-discriminative – localizes the category in the image
Class Activation Mapping (CAM)
Gradient-weighted Class Activation Mapping (Grad-CAM)
2. High-resolution – captures fine-grained detail (pixel-space
gradient visualizations), but not class-discriminative
Guided Backpropagation
Deconvolution
3. Both –
Guided Grad-CAM
4. Brief Introduction to CNN Visualization Tools
CAM
Grad-CAM
Guided Backpropagation
Guided Grad-CAM
6. Convolutional layers of CNNs actually behave as object
detectors (i.e., they localize objects), even though no
supervision on the location of the object is provided.
In other words, convolutional layers naturally retain spatial
information.
E.g., for action classification, a CNN is able to localize the
discriminative regions as the objects that the humans are
interacting with rather than the humans themselves.
7. However, this ability (spatial information / object detection)
is lost in fully-connected layers.
So we expect the last convolutional layer to have the most
detailed spatial information.
The higher the convolutional layer, the higher the level of
semantics extracted.
8. For a particular category (𝑐), a Class Activation Map
(CAM) indicates the discriminative image regions used by
the CNN to identify that category.
Characteristics
Replace fully-connected layers with global average pooling
(GAP) layers
1. to minimize the number of parameters while maintaining
high performance
2. to act as a structural regularizer, preventing overfitting
during training
9. CNN Architecture
1. For each feature map (𝑓_𝑘(𝑥, 𝑦), 𝑘 = 1, …, 𝑛) at the last convolutional
layer, GAP outputs the spatial average of each feature map:
𝐹_𝑘 = Σ_{𝑥,𝑦} 𝑓_𝑘(𝑥, 𝑦)
2. For a given class 𝑐, the input to the output layer is
𝑆_𝑐 = Σ_𝑘 𝑤_𝑘^𝑐 𝐹_𝑘   (𝑤_𝑘^𝑐: importance of 𝐹_𝑘 for class 𝑐)
3. Output score for class 𝑐:
𝑃_𝑐 = exp(𝑆_𝑐) / Σ_𝑐 exp(𝑆_𝑐)   (i.e., softmax)
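The GAP-to-softmax pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original CAM implementation; the slide's 𝐹_𝑘 is the spatial sum, which differs from the usual GAP average only by the constant 1/(H·W) absorbed into the weights.

```python
import numpy as np

def gap_scores(feature_maps, weights):
    """GAP over each feature map, then class scores and softmax.

    feature_maps: (n, H, W) activations f_k(x, y) of the last conv layer
    weights:      (C, n) output-layer weights w_k^c
    """
    F = feature_maps.sum(axis=(1, 2))   # F_k: spatial sum of each map (slide's definition)
    S = weights @ F                     # S_c = sum_k w_k^c * F_k
    P = np.exp(S - S.max())             # subtract max for numerical stability
    return P / P.sum()                  # softmax probabilities P_c
```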
10. CAM Procedure
The weights (𝑤_1^𝑐, 𝑤_2^𝑐, …, 𝑤_𝑛^𝑐) of the output layer indicate
the importance of the image regions (𝐹_𝑘) to a specific class (𝑐).
Compute CAM:
𝑀_𝑐(𝑥, 𝑦) = Σ_𝑘 𝑤_𝑘^𝑐 𝑓_𝑘(𝑥, 𝑦)
Note: if the shape (H, W) of the CAM (𝑀_𝑐) is different from that of the input images, up-sampling is needed to equalize the
shapes.
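The CAM formula is a weighted sum of feature maps; a minimal NumPy sketch (with nearest-neighbour up-sampling standing in for the interpolation used in practice):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """M_c(x, y) = sum_k w_k^c * f_k(x, y).

    feature_maps:  (n, H, W) last-conv-layer activations f_k
    class_weights: (n,) output-layer weights w_k^c for target class c
    """
    return np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)

def upsample(cam, factor):
    # nearest-neighbour up-sampling to match the input-image resolution
    return np.kron(cam, np.ones((factor, factor)))
```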
11. CAM trades off model complexity and performance (by using
global average pooling (GAP)) for more transparency.
Shortcoming
To apply CAM, any CNN-based network must change its
architecture so that GAP comes right before the output layer,
i.e., architectural changes and hence re-training are needed.
12. Gradient-weighted Class Activation Mapping (Grad-CAM)
generalizes CAM to a wide variety of CNN-based
architectures, i.e., without requiring architectural changes
or re-training.
Characteristics
Without a GAP layer, we need another way to define the weights 𝑤_𝑘^𝑐.
Grad-CAM uses the gradients of any target concept (𝑐) (e.g., 'dog'
in a classification network) flowing into the final convolutional layer,
and derives summary statistics from them to represent the weights
(importance).
Source: Selvaraju, Ramprasaath R., et al. "Grad-cam: Visual explanations from deep
networks via gradient-based localization." Proceedings of the IEEE international conference
on computer vision. 2017.
13. Procedure
For a given class 𝑐, compute the gradient of its score 𝑦^𝑐 (before the
softmax) w.r.t. each feature-map activation 𝐴^𝑘 ∈ ℝ^{𝑢×𝑣}, 𝑘 = 1, …, 𝑛, of a
convolutional layer, i.e.
∂𝑦^𝑐 / ∂𝐴^𝑘 ∈ ℝ^{𝑢×𝑣}
Define the importance weight of feature map 𝑘 via GAP:
𝛼_𝑘^𝑐 = (1/𝑍) Σ_𝑖 Σ_𝑗 ∂𝑦^𝑐 / ∂𝐴_{𝑖𝑗}^𝑘
(∂𝑦^𝑐/∂𝐴_{𝑖𝑗}^𝑘: influence of 𝐴^𝑘(𝑖, 𝑗) on 𝑦^𝑐; 𝑍 = 𝑢·𝑣)
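The importance weights 𝛼_𝑘^𝑐 are just a global average pool over the gradient tensor. A minimal NumPy sketch, assuming the gradients ∂𝑦^𝑐/∂𝐴^𝑘 have already been obtained from an autodiff framework:

```python
import numpy as np

def gradcam_weights(grads):
    """alpha_k^c = (1/Z) * sum_{i,j} dy^c / dA_ij^k  (GAP over gradients).

    grads: (n, u, v) gradients of class score y^c w.r.t. feature maps A^k
    Returns one importance weight per feature map, shape (n,).
    """
    return grads.mean(axis=(1, 2))  # mean = sum / Z with Z = u * v
```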
14. Procedure
Compute Grad-CAM:
𝐿^𝑐_{Grad-CAM}(𝑥, 𝑦) = ReLU(Σ_𝑘 𝛼_𝑘^𝑐 𝐴^𝑘(𝑥, 𝑦)) ∈ ℝ^{𝑢×𝑣}
ReLU is applied because we are only interested in the features (neurons) that
have a positive influence on the class of interest,
i.e., pixels whose intensity should be increased in order to increase 𝑦^𝑐.
Note: if the shape (u, v) of 𝐿^𝑐_{Grad-CAM} is different from that of the input images, up-sampling is needed to equalize the
shapes.
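Putting both steps together, the full Grad-CAM map is a ReLU over the 𝛼-weighted sum of activations; a minimal NumPy sketch (again assuming the gradients come from an autodiff framework):

```python
import numpy as np

def grad_cam(activations, grads):
    """L^c(x, y) = ReLU(sum_k alpha_k^c * A^k(x, y)).

    activations: (n, u, v) feature maps A^k of the chosen conv layer
    grads:       (n, u, v) gradients of the class score y^c w.r.t. A^k
    """
    alphas = grads.mean(axis=(1, 2))                 # GAP of gradients -> alpha_k^c
    cam = np.tensordot(alphas, activations, axes=1)  # weighted sum, shape (u, v)
    return np.maximum(cam, 0)                        # ReLU: keep positive influence only
```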
15. Grad-CAM generates visual explanations for a wide
variety of CNN-based networks without requiring
architectural changes or re-training.
16. Grad-CAM can help identify biases in a dataset.
Models trained on biased datasets may not generalize to
real-world scenarios, or worse, may perpetuate biases and
stereotypes (w.r.t. gender, race, age, etc.).
E.g., for a "doctor" vs. "nurse" binary classification task:
The biased model had learned to look at the person's face
and hairstyle to distinguish nurses from doctors,
thus learning a gender stereotype.
The unbiased model made the right prediction by looking
at the white coat and the stethoscope.
17. Shortcoming
The localization map (heatmap) generated by Grad-CAM (and
CAM) is coarse (low-resolution), so it is not clear enough why
the network predicts a particular instance (e.g., "tiger cat").
Guided Backpropagation is another approach that provides a
high-resolution map, i.e., fine-grained detail, or pixel-space
gradient visualizations.
18. Guided Backpropagation visualizes gradients of the
network's prediction (i.e., output neuron) w.r.t. the input
image.
This determines which pixels need to be changed the least to affect
the prediction the most (i.e., higher absolute gradients).
Negative gradients are suppressed through ReLU when
backpropagating, because we are only interested in the pixels that
increase the activation of the output neuron rather than suppress it.
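The backward rule at each ReLU layer can be sketched in one NumPy function: a gradient passes only where the forward input was positive (ordinary ReLU backprop) and where the incoming gradient itself is positive (the extra guidance signal). This is an illustrative sketch; in practice it is implemented via framework backward hooks.

```python
import numpy as np

def guided_relu_backward(grad_out, forward_input):
    """Guided-backprop rule for one ReLU layer.

    grad_out:      gradient arriving from the layer above
    forward_input: the input this ReLU saw during the forward pass

    Keep a gradient only where BOTH the forward input and the
    incoming gradient are positive; zero it elsewhere.
    """
    return grad_out * (forward_input > 0) * (grad_out > 0)
```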
19. Guided Backpropagation is high-resolution
since it derives gradients directly w.r.t. the input image instead of
w.r.t. the last convolutional layer (as Grad-CAM does).
Shortcoming
Not class-discriminative.
Guided Grad-CAM combines Guided Backpropagation and
Grad-CAM, and thus becomes class-discriminative.
20. Characteristics
Guided Grad-CAM is both high-resolution and class-
discriminative.
Procedure
Fuse Guided Backpropagation with Grad-CAM via element-wise
multiplication to create Guided Grad-CAM visualizations
(Guided Backpropagation × Grad-CAM = Guided Grad-CAM).
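The fusion step is an element-wise product of the guided-backpropagation map with the Grad-CAM heatmap up-sampled to the input resolution. A minimal NumPy sketch using nearest-neighbour up-sampling (real pipelines typically use bilinear interpolation):

```python
import numpy as np

def guided_grad_cam(guided_bp_map, grad_cam_map):
    """Fuse the two maps: high-resolution AND class-discriminative.

    guided_bp_map: (H, W) pixel-space map from guided backpropagation
    grad_cam_map:  (u, v) coarse Grad-CAM heatmap, with H % u == 0, W % v == 0
    """
    h, w = guided_bp_map.shape
    u, v = grad_cam_map.shape
    # nearest-neighbour up-sampling of the coarse heatmap to (H, W)
    cam_up = np.kron(grad_cam_map, np.ones((h // u, w // v)))
    return guided_bp_map * cam_up  # element-wise product
```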
21. Guided Grad-CAM also helps untrained users successfully
discern a 'stronger' network from a 'weaker' one, even
when both make identical predictions.
(Figure: two models, A vs. B, with the same prediction
accuracy; the visualizations reveal the stronger network.)
24. This works because guided backpropagation adds an additional
guidance signal from the higher layers to the usual
backpropagation.
This prevents the backward flow of negative gradients, which
correspond to the neurons that decrease the activation of the
higher-layer unit we aim to visualize.
Editor's Notes
just before the final output layer (softmax in the case of categorization), we perform global average pooling on the convolutional feature maps and use those as features for a fully-connected layer that produces the desired output (categorical or otherwise). Given this simple connectivity structure, we can identify the importance of the image regions by projecting back the weights of the output layer on to the convolutional feature maps, a technique we call class activation mapping.
Feature map: f_k is the map of the presence of a particular visual pattern.
CAM: M_c(x, y) directly indicates the importance of the activation at spatial grid (x, y) leading to the classification of an image to class c.