Detecting adversarial example attacks to deep neural networks
CBMI, 19-21 June 2017, Florence, Italy
Fabio Carrara, Fabrizio Falchi, Roberto Caldelli, Giuseppe Amato, Roberta Fumarola, Rudy Becarelli
Outline
● Introduction
○ Adversarial examples for deep neural network image classifiers
○ Risk of attacks to vision systems
● Our defense strategy
○ Detecting adversarial examples
○ CBIR approach
● Evaluation
Deep Neural Networks as Image Classifiers
● DNNs as image classifiers for several vision tasks
○ image annotation, face recognition, etc.
● Increasingly used in sensitive applications (safety- or security-related)
○ content filtering (spam, porn, violence, terrorist propaganda images, etc.)
○ malware detection
○ self-driving cars
[Figure: a deep convolutional network (CONV1 → RELU1 → POOL1 → … → FC8) used as an image classifier; given a photo, it answers: "It's a stop sign. I'm pretty sure."]
Adversarial images
● DNN image classifiers are vulnerable to adversarial images
malicious images crafted by adding a small but intentional (not random!)
perturbation
○ adversarial images fool DNNs into predicting a wrong class with high confidence
○ imperceptible to the human eye, like an optical illusion for the DNN
○ efficient algorithms exist to find them (a sketch follows below)
[Figure: original image + adversarial perturbation (5× amplified for visualization) = adversarial image; the attacked DNN answers: "It's a roundabout sign! No doubt."]
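FGS (fast gradient sign) [2] is one such efficient generation algorithm. A minimal PyTorch sketch, assuming a differentiable `model` and an image tensor with values in [0, 1] (the `eps` value is illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def fgs_adversarial(model, image, true_label, eps=0.007):
    # One gradient step that increases the classification loss as
    # fast as possible: x_adv = x + eps * sign(grad_x loss).
    image = image.clone().detach().requires_grad_(True)
    logits = model(image.unsqueeze(0))          # add batch dimension
    loss = F.cross_entropy(logits, torch.tensor([true_label]))
    loss.backward()
    # eps bounds the per-pixel change, keeping it imperceptible.
    adversarial = image + eps * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```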
Risk of attacks to DNNs
● Attacks are possible:
○ if you have the model [1,2]
○ if you only have access to the input and output! [3] (error rates achieved by such black-box attacks: 84.24%, 88.94%, 96.19%)
○ in the physical world (printed-out adversarial images) [4]
● Attacks can bypass filters
○ e.g., NSFW image filters: https://github.com/yahoo/open_nsfw
● Safety-related issues
○ e.g., a self-driving car crash
[1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[3] Papernot, Nicolas, et al. "Practical black-box attacks against deep learning systems using adversarial examples." arXiv preprint arXiv:1602.02697 (2016).
[4] Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. "Adversarial examples in the physical world." arXiv preprint arXiv:1607.02533 (2016).
How to defend against adversarial attacks?
● Make more robust classifiers
○ e.g., include adversarial images in the training phase
○ "Every law has a loophole" → every model has adversarial images it is vulnerable to
● Detect adversarial inputs
○ Understand when the network is talking nonsense
Our Adversarial Detection Approach
[Figure, built up over slides 9-13: the input image is fed to the Image Classifier (DNN), which predicts a class ("DNN says: Stop Sign"); the same image is then used to search a set of labelled images (the train set) for the most similar ones. When the retrieved neighbors agree with the predicted class, the classification is accepted (✓); when they disagree, as happens for an adversarial input, it is rejected (✗).]
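A minimal sketch of this accept/reject step in NumPy (the array and function names are illustrative, not the paper's code):

```python
import numpy as np

def accept_prediction(feature, predicted_label,
                      train_features, train_labels,
                      k=10, threshold=0.5):
    # Retrieve the k labelled training images closest to the input
    # image in deep-feature space (Euclidean distance, next slide).
    dists = np.linalg.norm(train_features - feature, axis=1)
    nearest = np.argsort(dists)[:k]
    # Accept the DNN's prediction only if enough of the retrieved
    # neighbors carry the same label (the actual score is
    # distance-weighted, see the kNN scoring slide).
    agreement = np.mean(train_labels[nearest] == predicted_label)
    return bool(agreement >= threshold)
```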
Deep Features as Similarity Measure
● Reuse the intermediate output of the network (deep features)
○ an intermediate representation of the visual aspects of the image
○ the Euclidean distance between deep features can be used to evaluate visual similarity
[Figure: intermediate activations of the network (CONV1 … POOL5 … FC8) are read out as deep-feature vectors, e.g. (0.2, 1.5, 5.4, …, 8.3).]
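OverFeat (used later in the evaluation) is not bundled with common libraries, so this sketch substitutes torchvision's AlexNet to illustrate the idea: read an intermediate (pool5-like) activation as the deep feature and compare images by Euclidean distance:

```python
import torch
from torchvision.models import alexnet, AlexNet_Weights

net = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1).eval()

def deep_feature(batch):
    # net.features is the conv/pool stack; its output plays the role
    # of the pool5 activation, flattened to one vector per image.
    with torch.no_grad():
        return torch.flatten(net.features(batch), start_dim=1)

def visual_distance(img_a, img_b):
    # Euclidean distance between deep features as visual dissimilarity.
    fa = deep_feature(img_a.unsqueeze(0))
    fb = deep_feature(img_b.unsqueeze(0))
    return torch.linalg.norm(fa - fb).item()
```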
kNN scoring
● Use a k-Nearest-Neighbors score to evaluate how trustworthy the classification is
● The score is assigned by looking at the classes of the k neighbors
● Neighbor distance matters: closer neighbors weigh more
● A threshold on the score decides whether to accept or reject the classification
The score combines the classes of the k neighbors through the indicator

$$\delta_i = \begin{cases} 1 & \text{if the label of the } i\text{-th neighbor equals the predicted label}\\ 0 & \text{otherwise}\end{cases}$$
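A sketch of the scoring in code. The indicator δᵢ is as defined above; since the exact distance weighting is not spelled out on this slide, the 1/(1 + d) weights below are an illustrative choice that simply makes closer neighbors count more:

```python
import numpy as np

def knn_score(neighbor_labels, neighbor_dists, predicted_label):
    # delta_i = 1 when the i-th neighbor's label matches the DNN's
    # prediction, 0 otherwise (the indicator defined above).
    delta = (np.asarray(neighbor_labels) == predicted_label).astype(float)
    # Illustrative distance weighting (not necessarily the paper's):
    # closer neighbors contribute more to the score.
    weights = 1.0 / (1.0 + np.asarray(neighbor_dists))
    return float(np.sum(weights * delta) / np.sum(weights))
```

The classification is then accepted when the score clears the chosen threshold and rejected otherwise.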
Evaluation (I)
● OverFeat classifier on ImageNet, pool5 layer (1024-D) as deep feature
● the set of images to be classified is generated from the ImageNet validation set:
(a) 1000 authentic images correctly classified by the DCNN (1 per class, randomly chosen)
(b) ~2000 adversarial images generated from (a) with the L-BFGS and FGS generation algorithms
(c) 1000 authentic images incorrectly classified by the DCNN (errors, 1 per class, randomly chosen)
● the model is evaluated as a binary classifier (authentic / spurious)
[Figure: evaluation pipeline: the OverFeat DCNN, pretrained on ImageNet, outputs the predicted class and the POOL5 deep feature; kNN scoring against the ImageNet train set, followed by a threshold, accepts or rejects the prediction.]
Evaluation (II)
[Plot: detection performance as a function of the score threshold. Curves: % of authentic images correctly retained; % of FGS adversarials correctly discarded; % of L-BFGS adversarials correctly discarded; % of wrong classifications of authentic images correctly discarded.]
Evaluation (II)
● Two operating points marked on the plot:
○ [1] (low threshold): filter out 50% of adversarial images (and 10% of errors) while retaining almost all the authentic images
○ [2] (higher threshold): filter out 80% of adversarial images (and 30% of errors) while retaining 90% of the authentic images
● the aggressiveness of the filter can be adjusted via the threshold
[Plot: same curves as above, with the operating points [1] and [2] marked along the threshold axis.]
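Curves and operating points like the ones above come from sweeping the threshold over the kNN scores of the three test sets; a sketch of such a sweep (array names are illustrative):

```python
import numpy as np

def threshold_sweep(authentic_scores, adversarial_scores, error_scores,
                    thresholds=np.linspace(0.0, 1.0, 101)):
    # Authentic images are correctly retained when score >= t;
    # adversarials and misclassified authentic images ("errors")
    # are correctly discarded when score < t.
    for t in thresholds:
        retained = np.mean(authentic_scores >= t)
        adv_out = np.mean(adversarial_scores < t)
        err_out = np.mean(error_scores < t)
        print(f"t={t:.2f}  retained={retained:.1%}  "
              f"adversarials discarded={adv_out:.1%}  "
              f"errors discarded={err_out:.1%}")
```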
Evaluation (IV) - Good Detections
● Examples of content that might be filtered
● Our approach successfully identifies adversarial images, assigning them low scores
http://deepfeatures.org/adversarials/
Evaluation (IV) - Bad Detections
● The most difficult adversarial images to detect (the ones having the highest kNN scores)
● Note the visual similarity and common aspects among the classes
http://deepfeatures.org/adversarials/
Conclusions
● We presented an approach to cope with adversarial images
○ with a satisfactory level of accuracy
○ without changing the model (no retrain)
○ without using additional data
● Future work
○ test more network architectures
○ test more generation algorithms for adversarial images
○ compare with other defense methodologies
Thanks for your attention!
Questions?
http://deepfeatures.org/adversarials/
Fabio Carrara <fabio.carrara@isti.cnr.it>