This presentation shows some of the work carried out as part of my master's thesis on "Mathematical Analysis of Neural Networks" at the TUM Chair of Applied Numerical Analysis under Prof. Dr. Massimo Fornasier. The thesis is a literature review that analyses and contrasts several approaches in the mathematical analysis of neural networks. It focuses on three key aspects: modern and classical approximation theory, robustness and stability of neural networks, and unique identification of network weights. While the three themes carry approximately equal weight in the thesis, this presentation gives only a very short overview of the first and third chapters and focuses on the robustness chapter. See also the full-text version available on SlideShare/LinkedIn.
Main Idea
Overview of the status quo of mathematics of neural networks
Central Question: Why are neural networks so successful in practice?
Focus on three areas:
1 Approximation Theory
2 Stability
3 Unique Identification of Network Parameters
Outline
Overview of the Thesis
1 Approximation Theory
2 Stability
  Adversarial Examples
  Scattering Networks
3 Unique Identification of Network Parameters
Approximation Theory
Universal approximation: Theoretic ability to approximate any (reasonable) function to any arbitrary degree of accuracy by a shallow neural network under mild assumptions on the non-linearity [6][2].
Order of Approximation: Upper and lower bounds on
E(X; Σ_r(σ)) := sup_{f ∈ X} inf_{g ∈ Σ_r(σ)} ‖f − g‖_X
are shown in [11] to be of order
O(r^{−s/d})
for d the dimension, r the number of hidden units, s the degree of smoothness.
Curse of Dimensionality: The number of units in the hidden layer necessary for a fixed accuracy ε is of order O(ε^{−d/s}).
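As a quick numerical illustration of the last bound (my own back-of-the-envelope sketch, not part of the thesis; all constants are ignored), the number of hidden units r ≈ ε^{−d/s} needed for a fixed accuracy explodes with the dimension d:

# Rough illustration of r = O(eps**(-d/s)): hidden units needed for accuracy
# eps when approximating an s-times differentiable function of d variables.

def units_needed(eps: float, d: int, s: int) -> float:
    """Order-of-magnitude estimate only; all constants are dropped."""
    return eps ** (-d / s)

eps, s = 0.1, 2
for d in (2, 4, 8, 16, 32):
    print(f"d = {d:>2}: r ~ {units_needed(eps, d, s):.0e}")
# Prints 1e+01, 1e+02, 1e+04, 1e+08, 1e+16: exponential growth in d.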
Approximation Theory - Breaking the Curse of Dimensionality
Central Question: Why are deep neural networks preferred over shallow ones in practice?
Linear instead of exponential order of approximation with deep neural networks for compositional functions¹ (e.g. f(x1, x2, x3, x4) = h2(h11(x1, x2), h12(x3, x4)))
Locality of the constituent functions as a key to the success of CNNs, rather than weight sharing.
¹ Tomaso Poggio et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. 2017. DOI: 10.1007/s11633-017-1054-2. arXiv: 1611.00740.
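For concreteness, here is a toy compositional function with the binary-tree structure above (the constituent functions h11, h12, h2 are my own arbitrary choices, purely for illustration). A deep network that mirrors this tree only ever has to approximate functions of two variables, which is what yields the improved rates:

import math

# Toy compositional target f(x1, x2, x3, x4) = h2(h11(x1, x2), h12(x3, x4)).
# Each constituent depends on only two variables.

def h11(x1, x2): return math.tanh(x1 * x2)
def h12(x3, x4): return math.sin(x3 + x4)
def h2(u, v):    return u * u + v

def f(x1, x2, x3, x4):
    return h2(h11(x1, x2), h12(x3, x4))

print(f(0.1, 0.2, 0.3, 0.4))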
Unique Identification of Neural Network Parameters
Approximation of a target function given by a one layer neural network using polynomially many samples
Unique identification of the network parameters
Comparison of two approaches using tensor decomposition²,³ and spectral norm optimisation in matrix subspaces⁴
² Hanie Sedghi and Anima Anandkumar. "Provable methods for training neural networks with sparse connectivity". In: arXiv preprint arXiv:1412.2693 (2014).
³ Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods". In: arXiv preprint arXiv:1506.08473 (2015).
⁴ Massimo Fornasier, Jan Vybíral, and Ingrid Daubechies. "Robust and Resource Efficient Identification of Shallow Neural Networks by Fewest Samples". In: (2019). arXiv: 1804.01592v2.
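To fix the setting (a sketch in my own notation, not the cited algorithms themselves): the target is a one-hidden-layer network f(x) = Σ_i a_i σ(⟨w_i, x⟩), and the task is to recover {a_i, w_i} (up to the usual permutation and sign/scaling symmetries) from polynomially many point queries of f.

import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3                       # input dimension, number of hidden units
W = rng.standard_normal((m, d))   # hidden-layer weights w_i (rows)
a = rng.standard_normal(m)        # output weights a_i

def f(x: np.ndarray) -> float:
    """Shallow network f(x) = sum_i a_i * tanh(<w_i, x>)."""
    return float(a @ np.tanh(W @ x))

# The identification methods above query f at sample points x and
# reconstruct (a, W) from those values.
xs = rng.standard_normal((10, d))
samples = [f(x) for x in xs]
print(samples[:3])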
Adversarial Perturbations
Adversarial examples generalise across models [15], [5]
Hypotheses on distribution of adversarial examples in input space [15], [5]
Fast Gradient Sign Method by Goodfellow [5]
Maximise the network loss:
Δx = arg max_{‖r‖_∞ ≤ λ} J(θ, x + r, y)
DeepFool [12] for minimal adversarial perturbations in the l_p norm
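In code, FGSM approximates this maximisation by a single step of size λ in the direction of the sign of the input gradient of the loss. A minimal PyTorch sketch under my own assumptions (any differentiable classifier `model`, cross-entropy standing in for J(θ, x, y), `lam` as the ℓ∞ budget):

import torch
import torch.nn.functional as F

def fgsm(model, x, y, lam):
    """Return x + lam * sign(grad_x J), the FGSM approximation of Delta x."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + lam * x_adv.grad.sign()).detach()

# Usage with a toy linear classifier on random data:
model = torch.nn.Linear(10, 3)
x = torch.randn(4, 10)
y = torch.randint(0, 3, (4,))
x_adv = fgsm(model, x, y, lam=0.1)
print((x_adv - x).abs().max())   # ≈ 0.1, the l_inf perturbation budget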
ADef⁵ - Adversarial Deformations
Deformation of an input image f: [0, 1]² → R with respect to a vector field τ: [0, 1]² → R² as
L_τ f(x) = f(x − τ(x)) ∀x ∈ [0, 1]²
In general, unboundedness of r = f − L_τ f in the L^p norm even for imperceptible deformations
Size of the deformation measured as
‖τ‖_T := max_{s,t ∈ [W]} ‖τ(s, t)‖_2 .
⁵ Rima Alaifari, Giovanni S. Alberti, and Tandri Gauksson. "ADef: an Iterative Algorithm to Construct Adversarial Deformations". In: (2018). arXiv: 1804.07729.
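A minimal sketch of how such a deformation acts on a discrete image (my own implementation choice: bilinear resampling via scipy's map_coordinates; ADef only requires some differentiable interpolation scheme):

import numpy as np
from scipy.ndimage import map_coordinates

def deform(f: np.ndarray, tau: np.ndarray) -> np.ndarray:
    """f: (H, W) image; tau: (2, H, W) vector field. Returns L_tau f."""
    h, w = f.shape
    grid = np.mgrid[0:h, 0:w].astype(float)   # pixel coordinates x
    coords = grid - tau                        # sample f at x - tau(x)
    return map_coordinates(f, coords, order=1, mode="nearest")

rng = np.random.default_rng(0)
f = rng.random((28, 28))                                        # toy "image"
tau = 0.5 * np.stack([np.ones((28, 28)), np.zeros((28, 28))])   # half-pixel shift
g = deform(f, tau)
print(np.abs(f - g).max())   # r = f - L_tau f can be large even for a tiny tau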
ADef⁷
Iterative Construction of Adversarial Examples with Gradient Descent
ADef successfully fools a CNN on MNIST and an Inception-v3 and a ResNet-101 on ImageNet.
Deformations larger in ‖·‖_T than common perturbations in the ‖·‖_∞ norm, though imperceptible to the human eye.
⁷ Rima Alaifari, Giovanni S. Alberti, and Tandri Gauksson. "ADef: an Iterative Algorithm to Construct Adversarial Deformations". In: (2018). arXiv: 1804.07729.
Adversarial Training - A Game Theory Perspective⁸
Cast the optimisation problem of FGSM
Δx = arg max_{‖r‖_∞ ≤ λ} J(θ, x + r, y)
into a game theory framework
(π*, ρ*) := arg min_π arg max_ρ E_{p∼π, r∼ρ} [J(M_p(θ), x + r, y)]
Defence strategy M_p(θ): Stochastic Activation Pruning (SAP)
Sample the activations to keep on the forward pass with probability proportional to their absolute value; rescale the surviving activations.
Apply SAP post-hoc to the pretrained model.
⁸ Guneet S Dhillon et al. "Stochastic activation pruning for robust adversarial defense". In: arXiv preprint arXiv:1803.01442 (2018).
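A minimal sketch of the SAP pruning step under my reading of the description above (the exact sampling scheme and rescaling constants are in Dhillon et al.): activations are sampled with probability proportional to their magnitude, the rest are zeroed, and survivors are rescaled by the inverse of their keep-probability so the layer output stays unbiased in expectation.

import numpy as np

def sap(h: np.ndarray, n_samples: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Stochastic Activation Pruning applied to an activation vector h."""
    p = np.abs(h) / np.abs(h).sum()                  # sampling probabilities
    idx = rng.choice(h.size, size=n_samples, replace=True, p=p)
    keep = np.zeros(h.size, dtype=bool)
    keep[idx] = True                                  # activations that survive
    scale = np.zeros_like(h)
    scale[keep] = 1.0 / (1.0 - (1.0 - p[keep]) ** n_samples)  # inverse keep-probability
    return h * keep * scale

h = np.array([0.1, -2.0, 0.5, 3.0, -0.05])
print(sap(h, n_samples=3))   # small activations tend to be pruned, large ones kept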
Scattering Networks⁹
Aim: Find an embedding Φ for a signal f that is translation invariant and stable to deformations
Assumption: Labels do not vary under translation, scaling or (slight) deformation
Deformation: L_τ f(x) = f(x − τ(x))
Lipschitz continuity w.r.t. the deformation L_τ: For compact Ω ⊂ R^d, there exists C > 0 with
‖Φ(L_τ f) − Φ(f)‖ ≤ C ‖f‖ (sup_{x∈R^d} |∇τ(x)| + sup_{x∈R^d} |Hτ(x)|)
for all f ∈ L²(R^d) with support in Ω and for all τ ∈ C²(R^d).
⁹ Stéphane Mallat. Understanding deep convolutional networks. 2016. DOI: 10.1098/rsta.2015.0203. arXiv: 1601.04920.
Scattering Networks¹⁰
Due to Lipschitz continuity, linearisation of the deformation in the embedding domain
Interpretation of stability as an operator that maps a non-linear deformation to a linear movement in a linear space that can be captured by a linear classifier
After application of Φ, classification with a linear classifier despite the deformation τ being potentially very non-linear
¹⁰ Stéphane Mallat. Understanding deep convolutional networks. 2016. DOI: 10.1098/rsta.2015.0203. arXiv: 1601.04920.
Why use wavelets?¹¹
Inspiration for the construction of Φ from the Littlewood-Paley wavelet transform
Function representation
W_J f := {f ∗ φ_{2^J}, (f ∗ ψ_λ)_{λ∈Λ_J}}
where Λ_J = {λ = 2^j r : r ∈ G⁺, 2^j > 2^{−J}}.
¹¹ Stéphane Mallat. Scattering Invariant Deep Networks for Classification.
The Scattering Transform
Definition (Scattering Transform)
Define the windowed scattering transform for all paths p = (λ_1, · · · , λ_m) as
S_J[p]f(x) = | · · · ||f ∗ ψ_{λ_1}| ∗ ψ_{λ_2}| · · · | ∗ ψ_{λ_m}| ∗ φ_{2^J}(x)
where φ_{2^J} is a local averaging.
Properties¹²:
Preservation of the L² norm
Translation Invariance
Stability to deformations
¹² Stéphane Mallat. Understanding deep convolutional networks. 2016. DOI: 10.1098/rsta.2015.0203. arXiv: 1601.04920.
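A minimal one-dimensional sketch of the windowed scattering transform along a single path p = (λ1, λ2): iterate "convolve with a wavelet, take the modulus", then average with a low-pass window φ. The Morlet-like filters below are my own simple stand-ins, not Mallat's exact Littlewood-Paley filter bank.

import numpy as np

def morlet(n, freq, sigma):
    """Complex band-pass filter centred at `freq` (cycles/sample)."""
    t = np.arange(n) - n // 2
    return np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma**2))

def gaussian(n, sigma):
    t = np.arange(n) - n // 2
    g = np.exp(-t**2 / (2 * sigma**2))
    return g / g.sum()

def conv(f, h):
    """Circular convolution via FFT, with the kernel h centred at the origin."""
    return np.fft.ifft(np.fft.fft(f) * np.fft.fft(np.fft.ifftshift(h)))

def scattering_path(f, wavelets, phi):
    u = f.astype(complex)
    for psi in wavelets:              # U[lambda]f = |f * psi_lambda|, iterated along the path
        u = np.abs(conv(u, psi))
    return np.real(conv(u, phi))      # final local averaging with phi_{2^J}

n = 256
f = np.sin(2 * np.pi * 0.05 * np.arange(n)) + 0.1 * np.random.default_rng(0).standard_normal(n)
psis = [morlet(n, 0.05, 8), morlet(n, 0.1, 4)]
print(scattering_path(f, psis, gaussian(n, 32))[:5])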
Connection to CNNs¹³
To compute the scattering transform, iterate on the operator
U_J f = {f ∗ φ_{2^J}, (U[λ]f)_{λ∈Λ_J}}
where U[λ]f = |f ∗ ψ_λ|.
Figure: Black nodes: averaging operation. White nodes: application of the operators U[λ_1], U[λ_1, λ_2] etc.
¹³ Stéphane Mallat. "Group invariant scattering". In: Communications on Pure and Applied Mathematics 65.10 (2012), pp. 1331–1398.
Insights - Progressive Linearisation¹⁴
Local linearisation of L_τ from one layer to the next
Lipschitz continuity to deformations preferred over invariance
Assumption that translations/deformations are local symmetries, i.e. that the target function does not vary locally under them.
After linearising local symmetries, project linearly
Classification using a linear classifier, as is usually done in CNNs at the last layer.
¹⁴ Stéphane Mallat. Understanding deep convolutional networks. 2016. DOI: 10.1098/rsta.2015.0203. arXiv: 1601.04920.
References
[1] Rima Alaifari, Giovanni S. Alberti, and Tandri Gauksson. "ADef: an Iterative Algorithm to Construct Adversarial Deformations". In: (2018). arXiv: 1804.07729.
[2] George Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals and Systems 2.4 (1989), pp. 303–314.
[3] Guneet S Dhillon et al. "Stochastic activation pruning for robust adversarial defense". In: arXiv preprint arXiv:1803.01442 (2018).
[4] Massimo Fornasier, Jan Vybíral, and Ingrid Daubechies. "Robust and Resource Efficient Identification of Shallow Neural Networks by Fewest Samples". In: (2019). arXiv: 1804.01592v2.
[5] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples". In: arXiv preprint arXiv:1412.6572 (2014).
[6] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators". In: Neural Networks 2.5 (1989), pp. 359–366.
[7] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods". In: arXiv preprint arXiv:1506.08473 (2015).
[8] Stéphane Mallat. "Group invariant scattering". In: Communications on Pure and Applied Mathematics 65.10 (2012), pp. 1331–1398.
[9] Stéphane Mallat. Scattering Invariant Deep Networks for Classification.
[10] Stéphane Mallat. Understanding deep convolutional networks. 2016. DOI: 10.1098/rsta.2015.0203. arXiv: 1601.04920.
[11] H. N. Mhaskar. Neural Networks for Optimal Approximation of Smooth and Analytic Functions. Tech. rep. 1996, pp. 164–177.
[12] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. "DeepFool: a simple and accurate method to fool deep neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2574–2582.
[13] Tomaso Poggio et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. 2017. DOI: 10.1007/s11633-017-1054-2. arXiv: 1611.00740.
[14] Hanie Sedghi and Anima Anandkumar. "Provable methods for training neural networks with sparse connectivity". In: arXiv preprint arXiv:1412.2693 (2014).
[15] Christian Szegedy et al. "Intriguing properties of neural networks". In: arXiv preprint arXiv:1312.6199 (2013).
ADef - Finding Adversarial Deformations¹⁵
For the score function F = (F_1, · · · , F_C): Y → R^C define F* = F_k − F_l, where l is the true label and k a target label. Then F*(y) < 0.
Define g: T → R as the map τ ↦ F*(y_τ). g(0) = F*(y) < 0.
Aim: Find a small τ ∈ T such that g(τ) ≥ 0.
Approximate g around 0 as
g(τ) ≈ g(0) + (D_0 g)τ    (1)
where D_0 g is the derivative of g evaluated at zero.
Solve (D_0 g)τ = −g(0).
¹⁵ Rima Alaifari, Giovanni S. Alberti, and Tandri Gauksson. "ADef: an Iterative Algorithm to Construct Adversarial Deformations". In: (2018). arXiv: 1804.07729.
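A minimal numerical sketch of one such linearised step (assumptions mine: the deformation is flattened to a vector, so D_0 g acts by an inner product): the smallest-norm τ solving (D_0 g)τ = −g(0) is a rescaled copy of the derivative.

import numpy as np

def linearised_step(g0: float, d0g: np.ndarray) -> np.ndarray:
    """Smallest l2-norm tau with <d0g, tau> = -g0."""
    return -g0 * d0g / np.dot(d0g, d0g)

g0 = -1.5                                   # F*(y) < 0: currently classified correctly
d0g = np.array([0.2, -0.4, 0.1, 0.05])      # toy derivative of tau -> F*(y_tau)
tau = linearised_step(g0, d0g)
print(tau, np.dot(d0g, tau))                # second value = -g0 = 1.5, i.e. g(tau) ≈ 0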
Insights¹⁶
Theorem (Lipschitz continuity to diffeomorphisms)
There exists C > 0 such that for all f ∈ L²(R^d) with ‖f‖_1 < ∞ and all τ ∈ C²(R^d) with ‖∇τ‖_∞ ≤ 1/2 the following statement holds:
‖S_J[P_J]L_τ f − S_J[P_J]f‖ ≤ C ‖f‖_1 K(τ)    (2)
where
K(τ) = 2^{−J} ‖τ‖_∞ + ‖∇τ‖_∞ max( log(‖Δτ‖_∞ / ‖∇τ‖_∞), 1 ) + ‖Hτ‖_∞.
2^{−J} ‖τ‖_∞ and ‖∇τ‖_∞ quantify the extent to which τ translates and deforms the input respectively.
¹⁶ Stéphane Mallat. "Group invariant scattering". In: Communications on Pure and Applied Mathematics 65.10 (2012), pp. 1331–1398.