Sapienza – University of Rome
MSc in Engineering in Computer Science
Neural Networks, AY 2018/19
Submitted to Prof. A. Uncini
S. Clinciu – R. Falconi
Practical Black-Box Attacks against Machine Learning
0. Summary
1. Introduction
2. How to run the code
3. Deep Neural Networks
4. Threat model
5. Black Box attack strategy
6. Attack validation
7. Generalization of the attack
8. Defense strategies
9. Conclusions
10. References
1. Introduction
This is the technical report for the Neural Networks course held by Professor A. Uncini. The report is about Practical Black-Box Attacks against Machine Learning, a scientific paper by N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik and A. Swami.
The work was carried out by S. Clinciu and R. Falconi, students of the MSc in Engineering in Computer Science at Sapienza University of Rome.
The paper's goal is to demonstrate that black-box attacks against deep neural network (DNN) classifiers are practical for real-world adversaries with no knowledge about the model. Our achievement is to implement the discussed algorithms assuming the adversary has no information about the structure or parameters of the DNN and no access to any large training dataset.
The threat model thus corresponds to the real-world scenario of users interacting with classifiers hosted remotely by a third party that keeps the model internals secret.
Indeed, the authors of the paper instantiate attacks against classifiers automatically trained by Amazon and Google, gaining access to them only after training was completed. Thus, they provide the first correctly blinded experiments concerning adversarial examples as a security risk. This shows that the black-box attack is applicable to many remote systems making decisions based on ML, because it combines three key properties: the capabilities required are limited to observing output class labels, the number of labels queried is limited, and the approach applies and scales to different ML classifier types, in addition to state-of-the-art DNNs.
2. How to run the code
Running the code is easy; all that is needed is to:
a. Clone the GitHub repository using the command
‘git clone https://github.com/RobertoFalconi/BlackBoxAttackDNN’
b. Access the repository with ‘cd BlackBoxAttackDNN’
c. Use the command ‘pip3 install <framework name>’ to install each required library
d. Run the FGSM strategy with ‘python FastGradientSignMethods’ or the JSMA strategy
with ‘python JacobianSaliencyMapApproach’.
Tested on Python 3.7.3 64-bit edition and NVIDIA 425.31 drivers, using a GeForce RTX 2080.
3. Deep Neural Networks
To fully understand the threat model and the attack, it is useful to provide some preliminary information.
As reported in the paper, a Deep Neural Network (DNN) is an ML technique that uses a hierarchical composition of $n$ parametric functions to model an input $\bar{x}$. Each function $f_i$, for $i \in 1, \ldots, n$, is modeled using a layer of neurons: elementary computing units that apply an activation function to the previous layer's weighted representation of the input to generate a new representation. Each layer is parameterized by a weight vector. Such weights hold the knowledge of the DNN model $F$ and are evaluated during its training phase.
$F(\bar{x}) = f_n(\theta_n, f_{n-1}(\theta_{n-1}, \ldots, f_2(\theta_2, f_1(\theta_1, \bar{x}))))$
The training phase of a DNN $F$ learns values for its parameters $\theta_F = \{\theta_1, \ldots, \theta_n\}$. During the test phase, the DNN is deployed with a fixed set of parameters $\theta_F$ to make predictions on inputs unseen during training.
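To make the composition concrete, here is a minimal NumPy sketch of the forward pass $F(\bar{x})$; the ReLU hidden activations, softmax output, and layer sizes are illustrative assumptions of ours, not taken from the paper:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, thetas):
    """Apply f_n(theta_n, ..., f_1(theta_1, x)): each layer is an
    affine map followed by an activation function."""
    h = x
    for i, (W, b) in enumerate(thetas):
        z = W @ h + b
        h = softmax(z) if i == len(thetas) - 1 else relu(z)
    return h  # probability vector over the N classes

# Example: a random two-layer network mapping 784 inputs to 10 classes.
rng = np.random.default_rng(0)
thetas = [(rng.normal(size=(64, 784)) * 0.01, np.zeros(64)),
          (rng.normal(size=(10, 64)) * 0.01, np.zeros(10))]
probs = forward(rng.normal(size=784), thetas)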
4. Threat model
In our work, the adversary pursues to force a classifier to misclassify inputs in any class
dissimilar from their right class. In order to accomplish our project, considering a weak
adversary with access to the DNN output only. The opponent has no information of the
architectural selections made to design the DNN, which include the number, type and size
of layers, nor the training data used to learn the DNN’s parameters.
Such attacks are referred to as black box, where adversaries don’t need to know internal
details of a system to compromise it.
The targeted model is a multiclass DNN classifier. It outputs probability vectors, where each vector component encodes the DNN's belief that the input belongs to one of the predefined classes. The running example is a DNN classifying images: such DNNs can be used to classify handwritten digits into classes associated with the digits 0 to 9, images of objects into a fixed number of categories, or images of traffic signs into classes identifying their type (STOP, yield, ...).
The adversarial capabilities are limited to accessing the label $\bar{O}(\bar{x})$ for any input $\bar{x}$ by querying the oracle $O$. The output label $\bar{O}(\bar{x})$ is the index of the class assigned the largest probability by the DNN:
$\bar{O}(\bar{x}) = \arg\max_{j \in 0, \ldots, N-1} O_j(\bar{x})$
where $O_j(\bar{x})$ is the $j$-th component of the probability vector $O(\bar{x})$ output by the DNN $O$.
Accessing the labels $\bar{O}$ produced by the DNN $O$ is the only capability assumed in our threat model: the adversary cannot access the oracle's internals or training data.
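In code, the black-box constraint means the attacker only ever observes the arg-max label. Below is a minimal sketch of such a label-only oracle wrapper; the wrapped model object and its predict_proba method are illustrative assumptions:

import numpy as np

class LabelOnlyOracle:
    """Wraps a trained classifier so that the only observable output
    is the index of the most likely class, i.e. the label O(x)."""
    def __init__(self, model):
        self._model = model  # internals stay hidden from the attacker

    def label(self, x):
        probs = self._model.predict_proba(x)  # assumed probability vector
        return int(np.argmax(probs))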
The adversary's goal is to produce a minimally altered version of any input $\bar{x}$, named adversarial sample and denoted $\bar{x}^*$, that is misclassified by the oracle $O$: $\bar{O}(\bar{x}^*) \neq \bar{O}(\bar{x})$. This corresponds to an attack on the oracle's output integrity. Adversarial samples solve the following optimization problem:
$\bar{x}^* = \bar{x} + \arg\min\{\bar{z} : \bar{O}(\bar{x} + \bar{z}) \neq \bar{O}(\bar{x})\} = \bar{x} + \delta_{\bar{x}}$
Examples of adversarial samples can be found in the following figure.
5. Black Box attack strategy
We are going to implement a black-box attack following the paper's strategy. The adversary wants to craft inputs misclassified by the ML model using the sole capability of accessing the label $\bar{O}(\bar{x})$ assigned by the classifier to any chosen input $\bar{x}$. The strategy is to learn a substitute for the target model using a dataset generated by the adversary and labeled by observing the oracle's output. Adversarial examples are then crafted using this substitute, and the target DNN is expected to misclassify them due to the transferability of adversarial examples between architectures.
Usually, ML models need large training sets. For instance, one can consider models trained with several tens of thousands of labeled examples. This makes attacks based on that paradigm infeasible for adversaries without large labeled datasets. In this paper, the authors show that black-box attacks can be accomplished at a much lower cost, without labeling an independent training set. In our approach, to enable the adversary to train a substitute model without a real labeled dataset, the target DNN is used as an oracle to construct a synthetic dataset.
The authors propose the following two strategies: substitute model training and adversarial sample crafting.
Substitute model training: the attacker queries the oracle with synthetic inputs selected by a Jacobian-based heuristic to build a model F approximating the oracle model O's decision boundaries. Training a substitute model F approximating oracle O is challenging because we must: (1) select an architecture for our substitute without knowledge of the targeted oracle's architecture, and (2) limit the number of queries made to the oracle in order to ensure that the approach is tractable.
The approach, illustrated in Figure 3, overcomes these challenges mainly by introducing a
synthetic data generation technique, the Jacobian-based Dataset Augmentation.
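The following is a minimal sketch of one augmentation step, following the paper's update rule $S_{\rho+1} = \{\bar{x} + \lambda \cdot \mathrm{sgn}(J_F(\bar{x})[\bar{O}(\bar{x})]) : \bar{x} \in S_\rho\} \cup S_\rho$; the TensorFlow substitute model and the step size lambda are illustrative assumptions:

import tensorflow as tf

def jacobian_augment(substitute, X, oracle_labels, lam=0.1):
    """One Jacobian-based dataset augmentation step: move each point
    in the direction in which the substitute's output for the
    oracle-assigned class varies most (sign of the Jacobian row)."""
    X = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(X)
        probs = substitute(X)  # substitute F, assumed to output probabilities
        # select, for each sample, the component for the oracle label O(x)
        selected = tf.gather(probs, oracle_labels, axis=1, batch_dims=1)
    grads = tape.gradient(selected, X)  # rows J_F(x)[O(x)] of the Jacobian
    X_new = X + lam * tf.sign(grads)
    return tf.concat([X, X_new], axis=0)  # S_rho union the new points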
Adversarial sample crafting: the attacker uses the substitute network F to craft adversarial samples, which are then misclassified by oracle O due to the transferability of adversarial samples. In other words, once the adversary has trained a substitute DNN, it uses it to craft adversarial samples.
In our project, we provide an overview of two approaches discussed in the paper, namely the Goodfellow et al. algorithm (also known as the Fast Gradient Sign Method, or FGSM) and the Papernot et al. algorithm (also known as the Jacobian-based Saliency Map Attack, or JSMA). Both share a similar intuition: evaluate the model's sensitivity to input modifications in order to select a small perturbation achieving the misclassification goal.
In the Goodfellow et al. approach, given a model $F$ with an associated cost function $c(F, \bar{x}, y)$, the adversary crafts an adversarial sample $\bar{x}^* = \bar{x} + \delta_{\bar{x}}$ for a given legitimate sample $\bar{x}$ by computing the following perturbation:
$\delta_{\bar{x}} = \epsilon \, \mathrm{sgn}(\nabla_{\bar{x}} c(F, \bar{x}, y))$
where $\mathrm{sgn}$ denotes the sign of the model's cost function gradient, computed with respect to $\bar{x}$ using sample $\bar{x}$ and label $y$ as inputs.
Figure 1: example of FGSM
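A minimal FGSM sketch against the substitute, assuming a TensorFlow model that outputs logits and the standard cross-entropy cost; the epsilon value and the [0, 1] clipping range are illustrative assumptions:

import tensorflow as tf

def fgsm(substitute, x, y, epsilon=0.1):
    """Craft x* = x + epsilon * sgn(grad_x c(F, x, y))."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        logits = substitute(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y, logits, from_logits=True)
    grad = tape.gradient(loss, x)
    x_adv = x + epsilon * tf.sign(grad)
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixels in a valid range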
The Papernot et al. algorithm is suitable for source-target misclassification attacks, where adversaries seek to move samples from any legitimate source class to any chosen target class. Misclassification attacks are a special case of source-target misclassifications, where the target class can be any class different from the legitimate source class. Given model $F$, the adversary crafts an adversarial sample $\bar{x}^* = \bar{x} + \delta_{\bar{x}}$ for a given legitimate sample $\bar{x}$ by adding a perturbation $\delta_{\bar{x}}$ to a subset of the input components $\bar{x}_i$.
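A simplified single-feature JSMA step is sketched below, assuming a TensorFlow substitute that outputs class probabilities; note that the paper's full algorithm perturbs pairs of features and iterates until misclassification is reached or a distortion budget is exhausted:

import tensorflow as tf

def jsma_step(substitute, x, target, theta=1.0):
    """One simplified JSMA step: compute the Jacobian of the outputs
    with respect to the input, score each input component with the
    saliency map, and perturb the single most salient component."""
    x = tf.convert_to_tensor(x[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        probs = substitute(x)
    jac = tape.jacobian(probs, x)[0, :, 0]     # shape: (N_classes, *x_dims)
    jac = tf.reshape(jac, (jac.shape[0], -1))  # flatten the input dims
    dt = jac[target]                           # dF_target / dx_i
    do = tf.reduce_sum(jac, axis=0) - dt       # summed over other classes
    saliency = tf.where((dt > 0) & (do < 0), dt * tf.abs(do), 0.0)
    i = tf.argmax(saliency)                    # most salient component
    flat = tf.reshape(x, [-1]) + theta * tf.one_hot(i, tf.size(x))
    return tf.reshape(tf.clip_by_value(flat, 0.0, 1.0), x.shape[1:])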
Both algorithms have benefits and drawbacks: the Goodfellow approach is suitable for fast crafting of many adversarial samples but with rather large perturbations, which are thus potentially easier to detect, while the Papernot approach tends to produce smaller perturbations in exchange for a greater computational cost.
6. Attack validation
To validate the attack, we tried it against different classifiers and with different types of attack. We first ran an FGSM attack against a target DNN trained on the MNIST dataset, then another attack against a DNN trained on the CIFAR dataset; both attacks aim to misclassify most of the adversarial examples crafted with a perturbation that does not affect human recognition. Finally, we repeated both attacks using the JSMA type of attack.
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems, and it is widely used for training and testing in the field of machine learning.
The CIFAR dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes, representing airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks, with 6,000 images per class.
MNIST and CIFAR are two of the most widely used datasets for machine learning research.
Finally, we ran two more attacks, FGSM and JSMA, against a locally trained model on a dataset composed of real-world pictures and photos, to better convey the effect on real-world elements and subjects instead of relying on the MNIST and CIFAR datasets only.
Our goal is to verify whether these samples are also misclassified by the oracle. The transferability of adversarial samples therefore refers to the oracle's misclassification rate on adversarial samples crafted using the substitute DNN.
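A small sketch of how that transferability rate can be measured, assuming craft(x, y) is a closure over one of the attack functions above and oracle is the label-only wrapper sketched earlier:

def transferability(oracle, craft, samples, labels):
    """Fraction of adversarial samples, crafted on the substitute,
    that the remote oracle misclassifies."""
    wrong = 0
    for x, y in zip(samples, labels):
        x_adv = craft(x, y)                # crafted on the substitute
        wrong += oracle.label(x_adv) != y  # one query to the oracle
    return wrong / len(samples)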
7. Generalization of the attack
The substitutes and oracles considered so far were learned with DNNs, but the attack is not limited to this ML technique. For example, substitutes can also be learned with logistic regression, and the attack generalizes to additional ML models.
8. Defense strategies
According to the paper Practical Black-Box Attacks against Machine Learning, on which we based our work, there are two types of defense strategies: (1) reactive, where one seeks to detect adversarial examples, and (2) proactive, where one makes the model itself more robust.
Adversarial training. The attack is not more easily detectable than a classic adversarial example attack. Oracle queries may be distributed among a set of colluding users and, as such, remain hard to detect, but the defender may increase the attacker's cost by training models with higher input dimensionality or modeling complexity. Indeed, the authors' experimental results indicate that these two factors increase the number of queries required to train substitutes. Nevertheless, the black-box attack based on transfer from a substitute model overcomes gradient masking defenses.
Defensive distillation. It is a defense that makes models robust in a small neighborhood of the training manifold by performing gradient masking: it smooths the decision surface and reduces the gradients used by adversarial crafting in small neighborhoods. However, using a substitute, our black-box approach evades this defense, as the substitute model is not trained to be robust to said small perturbations.
9. Conclusions
Defending against finite perturbations is a more promising avenue for future work than defending against infinitesimal perturbations.
Our implementation reflects what the paper is about: the authors present a novel substitute training algorithm using synthetic data generation to craft adversarial examples that are misclassified by black-box DNNs. This work is a significant step towards relaxing the strong assumptions about adversarial capabilities made by previous attacks.
10. References
1. Alexey Kurakin, Ian J. Goodfellow, Samy Bengio. Adversarial Examples in the Physical World. [Online] 2017. https://arxiv.org/pdf/1607.02533.pdf.
2. Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, Ananthram Swami. Practical Black-Box Attacks against Machine Learning. [Online] 2017. https://arxiv.org/pdf/1602.02697.pdf.
3. Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. [Online] 2015. https://arxiv.org/pdf/1412.6572.pdf.
4. Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. [Online] 2018. https://arxiv.org/pdf/1802.00420v4.pdf.
5. Nicolas Papernot. Gradient Masking in Machine Learning. [Online] https://seclab.stanford.edu/AdvML2017/slides/17-09-aro-aml.pdf.
6. Ian Goodfellow and Nicolas Papernot. Is attacking machine learning easier than defending it? [Online] http://www.cleverhans.io/security/privacy/ml/2017/02/15/why-attacking-machine-learning-is-easier-than-defending-it.html.
