2. Concern: Model Transparency & Interpretability
Despite unprecedented breakthroughs of CNNs in a variety of
computer vision tasks, their lack of decomposability into
individually intuitive components makes them hard to interpret.
Purpose:
1. Visualizing CNNs
visualize CNN predictions by highlighting 'important' pixels (i.e., pixels whose
changes in intensity have the most impact on the prediction score)
2. Help Users Build Trust in AI
we must build 'transparent' models that have the ability to explain why they
predict what they predict.
3. What makes a good visual explanation?
1. Class-discriminative – localizes the category in the image
Class Activation Mapping (CAM)
Gradient-weighted Class Activation Mapping (Grad-CAM)
2. High-resolution – captures fine-grained detail (pixel-space
gradient visualizations), but not class-discriminative
Guided Backpropagation
Deconvolution
3. Both –
Guided Grad-CAM
4. Brief Introduction to CNN Visualization Tools
CAM
Grad-CAM
Guided Backpropagation
Guided Grad-CAM
6. Convolutional layers of CNNs actually behave as object
detectors (i.e., they localize objects), even though no
supervision on the location of the object is provided.
In other words, convolutional layers naturally retain spatial
information.
E.g., for action classification, a CNN is able to localize the
discriminative regions as the objects that the humans are
interacting with rather than the humans themselves.
7. However, this ability (spatial information / object detection)
is lost in fully-connected layers.
So we expect the last convolutional layer to have the most
detailed spatial information.
The higher the convolutional layer, the higher the level of
semantics extracted.
8. For a particular category (𝑐), a Class Activation Map
(CAM) indicates the discriminative image regions used by
the CNN to identify that category.
Characteristics
Replace fully-connected layers with global average pooling
(GAP) layers
1. to minimize the number of parameters while maintaining
high performance
2. to act as a structural regularizer, preventing overfitting
during training
9. CNN Architecture
1. For each feature map (𝑓_𝑘(𝑥, 𝑦), 𝑘 = 1, …, 𝑛) at the last convolutional
layer, GAP outputs the spatial average of each feature map:
𝐹_𝑘 = Σ_{𝑥,𝑦} 𝑓_𝑘(𝑥, 𝑦)
2. For a given class 𝑐, the input to the output layer is
𝑆_𝑐 = Σ_𝑘 𝑤_𝑘^𝑐 𝐹_𝑘   (𝑤_𝑘^𝑐: importance of 𝐹_𝑘 for class 𝑐)
3. Output score for class 𝑐:
𝑃_𝑐 = exp(𝑆_𝑐) / Σ_𝑐 exp(𝑆_𝑐)   (i.e., softmax)
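The GAP-to-softmax pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original CAM implementation; the slide's 𝐹_𝑘 is the spatial sum, which differs from the usual GAP average only by the constant 1/(H·W) absorbed into the weights.

```python
import numpy as np

def gap_scores(feature_maps, weights):
    """GAP over each feature map, then class scores and softmax.

    feature_maps: (n, H, W) activations f_k(x, y) of the last conv layer
    weights:      (C, n) output-layer weights w_k^c
    """
    F = feature_maps.sum(axis=(1, 2))   # F_k: spatial sum of each map (slide's definition)
    S = weights @ F                     # S_c = sum_k w_k^c * F_k
    P = np.exp(S - S.max())             # subtract max for numerical stability
    return P / P.sum()                  # softmax probabilities P_c
```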
10. CAM Procedure
The weights (𝑤_1^𝑐, 𝑤_2^𝑐, …, 𝑤_𝑛^𝑐) of the output layer indicate
the importance of the image regions (𝐹_𝑘) to a specific class (𝑐).
Compute CAM:
𝑀_𝑐(𝑥, 𝑦) = Σ_𝑘 𝑤_𝑘^𝑐 𝑓_𝑘(𝑥, 𝑦)
Note: if the shape (H, W) of the CAM (𝑀_𝑐) is different from that of the input images, up-sampling is needed to equalize the
shapes.
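The CAM formula is a weighted sum of feature maps; a minimal NumPy sketch (with nearest-neighbour up-sampling standing in for the interpolation used in practice):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """M_c(x, y) = sum_k w_k^c * f_k(x, y).

    feature_maps:  (n, H, W) last-conv-layer activations f_k
    class_weights: (n,) output-layer weights w_k^c for target class c
    """
    return np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)

def upsample(cam, factor):
    # nearest-neighbour up-sampling to match the input-image resolution
    return np.kron(cam, np.ones((factor, factor)))
```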
11. CAM trades off model complexity and performance (by using
global average pooling (GAP)) for more transparency.
Shortcoming
To apply CAM, any CNN-based network must change its
architecture so that GAP comes right before the output layer,
i.e., architectural changes and hence re-training are needed.
12. Gradient-weighted Class Activation Mapping (Grad-CAM)
generalizes CAM to a wide variety of CNN-based
architectures, i.e., without requiring architectural changes
or re-training.
Characteristics
Without a GAP layer, we need another way to define the weights 𝑤_𝑘^𝑐.
Grad-CAM uses the gradients of any target concept (𝑐) (e.g., 'dog'
in a classification network) flowing into the final convolutional layer,
and derives summary statistics from them to represent the weights
(importance).
Source: Selvaraju, Ramprasaath R., et al. "Grad-cam: Visual explanations from deep
networks via gradient-based localization." Proceedings of the IEEE international conference
on computer vision. 2017.
13. Procedure
For a given class 𝑐, compute the gradient of its score 𝑦^𝑐 (before the
softmax) w.r.t. each feature-map activation 𝐴^𝑘 ∈ ℝ^{𝑢×𝑣}, 𝑘 = 1, …, 𝑛, of a
convolutional layer, i.e.
∂𝑦^𝑐 / ∂𝐴^𝑘 ∈ ℝ^{𝑢×𝑣}
Define the importance weight of feature map 𝑘 via GAP:
𝛼_𝑘^𝑐 = (1/𝑍) Σ_𝑖 Σ_𝑗 ∂𝑦^𝑐 / ∂𝐴_{𝑖𝑗}^𝑘
(∂𝑦^𝑐/∂𝐴_{𝑖𝑗}^𝑘: influence of 𝐴^𝑘(𝑖, 𝑗) on 𝑦^𝑐; 𝑍 = 𝑢·𝑣)
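The importance weights 𝛼_𝑘^𝑐 are just a global average pool over the gradient tensor. A minimal NumPy sketch, assuming the gradients ∂𝑦^𝑐/∂𝐴^𝑘 have already been obtained from an autodiff framework:

```python
import numpy as np

def gradcam_weights(grads):
    """alpha_k^c = (1/Z) * sum_{i,j} dy^c / dA_ij^k  (GAP over gradients).

    grads: (n, u, v) gradients of class score y^c w.r.t. feature maps A^k
    Returns one importance weight per feature map, shape (n,).
    """
    return grads.mean(axis=(1, 2))  # mean = sum / Z with Z = u * v
```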
14. Procedure
Compute Grad-CAM:
𝐿^𝑐_{Grad-CAM}(𝑥, 𝑦) = ReLU(Σ_𝑘 𝛼_𝑘^𝑐 𝐴^𝑘(𝑥, 𝑦)) ∈ ℝ^{𝑢×𝑣}
ReLU is applied because we are only interested in the features (neurons) that
have a positive influence on the class of interest,
i.e., pixels whose intensity should be increased in order to increase 𝑦^𝑐.
Note: if the shape (u, v) of 𝐿^𝑐_{Grad-CAM} is different from that of the input images, up-sampling is needed to equalize the
shapes.
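Putting both steps together, the full Grad-CAM map is a ReLU over the 𝛼-weighted sum of activations; a minimal NumPy sketch (again assuming the gradients come from an autodiff framework):

```python
import numpy as np

def grad_cam(activations, grads):
    """L^c(x, y) = ReLU(sum_k alpha_k^c * A^k(x, y)).

    activations: (n, u, v) feature maps A^k of the chosen conv layer
    grads:       (n, u, v) gradients of the class score y^c w.r.t. A^k
    """
    alphas = grads.mean(axis=(1, 2))                 # GAP of gradients -> alpha_k^c
    cam = np.tensordot(alphas, activations, axes=1)  # weighted sum, shape (u, v)
    return np.maximum(cam, 0)                        # ReLU: keep positive influence only
```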
15. Grad-CAM generates visual explanations for a wide
variety of CNN-based networks without requiring
architectural changes or re-training.
16. Grad-CAM can help identify biases in a dataset.
Models trained on biased datasets may not generalize to
real-world scenarios, or worse, may perpetuate biases and
stereotypes (w.r.t. gender, race, age, etc.).
E.g., for a "doctor" vs. "nurse" binary classification task:
The biased model had learned to look at the person's face
and hairstyle to distinguish nurses from doctors,
thus learning a gender stereotype.
The unbiased model made the right prediction by looking
at the white coat and the stethoscope.
17. Shortcoming
The localization map (heatmap) generated by Grad-CAM (and
CAM) is coarse (low-resolution), so it is not clear enough why
the network predicts a particular instance (e.g., "tiger cat").
Guided Backpropagation is another approach that provides a
high-resolution map, i.e., fine-grained detail, or pixel-space
gradient visualizations.
18. Guided Backpropagation visualizes gradients of the
network's prediction (i.e., output neuron) w.r.t. the input
image.
This determines which pixels need to be changed the least to affect
the prediction the most (i.e., higher absolute gradients).
Negative gradients are suppressed through ReLU when
backpropagating, because we are only interested in the pixels that
increase the activation of the output neuron rather than suppress it.
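The backward rule at each ReLU layer can be sketched in one NumPy function: a gradient passes only where the forward input was positive (ordinary ReLU backprop) and where the incoming gradient itself is positive (the extra guidance signal). This is an illustrative sketch; in practice it is implemented via framework backward hooks.

```python
import numpy as np

def guided_relu_backward(grad_out, forward_input):
    """Guided-backprop rule for one ReLU layer.

    grad_out:      gradient arriving from the layer above
    forward_input: the input this ReLU saw during the forward pass

    Keep a gradient only where BOTH the forward input and the
    incoming gradient are positive; zero it elsewhere.
    """
    return grad_out * (forward_input > 0) * (grad_out > 0)
```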
19. Guided Backpropagation is high-resolution
since it derives gradients directly w.r.t. the input image instead of
w.r.t. the last convolutional layer (as Grad-CAM does).
Shortcoming
Not class-discriminative.
Guided Grad-CAM combines Guided Backpropagation and
Grad-CAM, and thus becomes class-discriminative.
20. Characteristics
Guided Grad-CAM is both high-resolution and class-
discriminative.
Procedure
Fuse Guided Backpropagation with Grad-CAM via element-wise
multiplication to create Guided Grad-CAM visualizations
(Guided Backpropagation × Grad-CAM = Guided Grad-CAM).
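The fusion step is an element-wise product of the guided-backpropagation map with the Grad-CAM heatmap up-sampled to the input resolution. A minimal NumPy sketch using nearest-neighbour up-sampling (real pipelines typically use bilinear interpolation):

```python
import numpy as np

def guided_grad_cam(guided_bp_map, grad_cam_map):
    """Fuse the two maps: high-resolution AND class-discriminative.

    guided_bp_map: (H, W) pixel-space map from guided backpropagation
    grad_cam_map:  (u, v) coarse Grad-CAM heatmap, with H % u == 0, W % v == 0
    """
    h, w = guided_bp_map.shape
    u, v = grad_cam_map.shape
    # nearest-neighbour up-sampling of the coarse heatmap to (H, W)
    cam_up = np.kron(grad_cam_map, np.ones((h // u, w // v)))
    return guided_bp_map * cam_up  # element-wise product
```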
21. Guided Grad-CAM also helps untrained users successfully
discern a 'stronger' network from a 'weaker' one, even
when both make identical predictions.
(Figure: two models, A vs. B, with the same prediction
accuracy; the visualizations reveal the stronger network.)
24. This works because guided backpropagation adds an additional
guidance signal from the higher layers to the usual
backpropagation.
This prevents the backward flow of negative gradients, which
correspond to the neurons that decrease the activation of the
higher-layer unit we aim to visualize.
Editor's Notes
just before the final output layer (softmax in the case of categorization), we perform global average pooling on the convolutional feature maps and use those as features for a fully-connected layer that produces the desired output (categorical or otherwise). Given this simple connectivity structure, we can identify the importance of the image regions by projecting back the weights of the output layer on to the convolutional feature maps, a technique we call class activation mapping.
Feature map: f_k is the map of the presence of a particular visual pattern.
CAM: M_c(x, y) directly indicates the importance of the activation at spatial grid (x, y) leading to the classification of an image to class c.