1. Mathematics of neural networks
Anirbit
AMS
Johns Hopkins University
Invited talk at MIT Maths
“Seminar on Applied Algebra and Geometry”, 14th November 2017
Anirbit (AMS, Johns Hopkins University), 14th November 2017, 1 / 30
2. Outline
1 Introduction
2 The questions (which have some answers!)
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
3. Introduction
This talk is based on the following three papers:
https://arxiv.org/abs/1711.03073
“Lower bounds over Boolean inputs for deep neural networks with
ReLU gates”
https://arxiv.org/abs/1708.03735
“Sparse Coding and Autoencoders”
https://eccc.weizmann.ac.il/report/2017/098/
“Understanding Deep Neural Networks with Rectified Linear Units”
4. Introduction
The collaborators!
These are joint works with Amitabh Basu (AMS, JHU)
and different subsets of:
Akshay Rangamani (ECE, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
6. Introduction
Activation gates of the neural network
The building blocks of a “neural net” are its activation gates, which do the
basic analogue computations (as opposed to the Boolean gates in Boolean
circuits, which compute the AND, OR, NOT and threshold functions).
The above is an R3 → R neural gate evaluating the “activation” function
f : R → R on a linear (in general, affine) transformation of the input
vector. The w's are the ‘weights’. If there were many Y's coming out of
the gate, it would pass the same value to all of them.
9. Introduction
The ReLU activation function
It is now almost uniformly believed that the “best” activation function to
use is the “Rectified Linear Unit (ReLU)”,
ReLU : R → R, x → max{0, x}.
At this point it is useful to also define the more-studied activation in
Boolean complexity, the “Linear Threshold Function (LTF)”, to which we
shall at times compare later,
LTF : R → R, x → 1_{x≥0} or 2·1_{x≥0} − 1.
If LTF(x) = 1_{x≥0}, then it's easy to see that ReLU(x) = x · LTF(x).
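As a quick sanity check, the two activations and the identity ReLU(x) = x · LTF(x) can be verified numerically (a minimal sketch; the function names are ours):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max{0, x}
    return np.maximum(0.0, x)

def ltf(x):
    # Linear Threshold Function (0/1 convention): 1 if x >= 0, else 0
    return (x >= 0).astype(float)

xs = np.linspace(-3.0, 3.0, 13)
# The identity from the slide: ReLU(x) = x * LTF(x)
assert np.allclose(relu(xs), xs * ltf(xs))
```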
10. Introduction
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of this as representing a certain class (set) of R4 → R3
functions which can be computed by the above “architecture” for a
*fixed* choice of “activation functions” at each of the blue nodes. The
yellow nodes are where the input vector comes in and the orange nodes are
where the output vector comes out.
12. Introduction
An example of a neurally representable function
[Figure: a 1-DNN with four ReLU gates, edge weights ±1 and output weights ±1/2, computing (x1 + x2)/2 + |x1 − x2|/2 from inputs x1, x2.]
In the above we see a “1-DNN” with ReLU activation computing the
R2 → R function given as (x1, x2) → max{x1, x2}. The above neural
network would be said to be of size 4. These “max” functions are
particularly interesting, and we will soon come back to them!
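The construction above can be written out explicitly: max{x1, x2} = (x1 + x2)/2 + |x1 − x2|/2, and each summand is implementable by ReLU gates with weights ±1 and output coefficients ±1/2, for a total of 4 gates. A minimal numerical sketch (function names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_1dnn(x1, x2):
    # One hidden layer of 4 ReLU gates computing
    # max{x1, x2} = (x1 + x2)/2 + |x1 - x2|/2, using
    #   (x1 + x2)/2 = ReLU(x1 + x2)/2 - ReLU(-x1 - x2)/2
    #   |x1 - x2|/2 = ReLU(x1 - x2)/2 + ReLU(-x1 + x2)/2
    gates = [relu(x1 + x2), relu(-x1 - x2), relu(x1 - x2), relu(-x1 + x2)]
    return 0.5 * gates[0] - 0.5 * gates[1] + 0.5 * gates[2] + 0.5 * gates[3]

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
assert np.allclose(max_1dnn(a, b), np.maximum(a, b))
```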
13. Introduction
Neural nets used in real life!
Neural nets deployed in the real world, which are creating engineering
miracles every day, come in various complicated designs. “The Asimov
Institute” recently compiled a beautiful chart summarizing many of the
architectures in use.
17. Introduction
When do real weights matter?
Consider a very restricted class of neural networks built out of
piecewise-linear gates like ReLU, where all inputs as well as the gates'
descriptions are restricted to rational numbers requiring at most m bits
each, the network ends with a threshold gate at the top, and every gate
has fan-out at most 1.
It was shown by Wolfgang Maass in 1997 that for such networks one can
trade off real weights for rational weights such that the corresponding
integers are in absolute value at most (2s + 1)! · 2^{2m(2s+1)}, where s
is the total number of weights.
But for the usual nets, where there is no restriction on the fan-out,
we do not know of such a transformation, and it is also not clear to
us whether we can always simulate these with LTF gates without
blowing up the size!
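Taking the bound at face value (as reconstructed above: (2s + 1)! · 2^{2m(2s+1)}), even tiny choices of s and m give astronomically large, yet finite, integer weights. A hypothetical sketch:

```python
from math import factorial

def maass_bound(s, m):
    # Integer-weight bound (2s+1)! * 2^(2m(2s+1)) from the slide,
    # with s = total number of weights, m = bits per rational weight.
    return factorial(2 * s + 1) * 2 ** (2 * m * (2 * s + 1))

# A net with just 3 weights described by 2-bit rationals:
print(maass_bound(3, 2))
```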
19. Introduction
Neural networks : recent resurgence as a “universal algorithm”!
They are revolutionizing the techniques in various fields ranging from
particle physics to genomics to signal processing to computer vision.
Most recently, the Google DeepMind group has shown that, within hours
of training, one can make a neural network capable of finding new
strategies for winning the game of “Go” that people haven't seen
despite (more than?) a thousand years of playing the game! Most
importantly, this seems possible without feeding the network any
human game histories!
Mathematically, nets are still extremely hard to analyze. A nice survey
of many of the recent ideas has been compiled in a 3-part series
of articles by the “Center for Brains, Minds, and Machines (CBMM)” at
MIT. Here we report on some of the attempts we have been making
to rigorously understand neural networks.
20. The questions (which have some answers!)
Formalizing the questions about neural nets
Broadly it seems that we can take 3 different directions of study,
(1) Generalization
It needs an explanation why the training of neural nets generally
does not overfit despite their being over-parametrized. Some recent
attempts at explaining this can be seen in the works of Kaelbling,
Kawaguchi (MIT) and Bengio (UMontreal); Brutzkus and Globerson (Tel
Aviv University); and Malach and Shai Shalev-Shwartz (The Hebrew
University).
22. The questions (which have some answers!)
Formalizing the questions about neural nets
(2a) Trainability of the nets
Theorem (Ours)
Empirical risk minimization on a 1-DNN with a convex loss, like
min_{w_p, a_p, b_p, b} (1/S) Σ_{i=1}^{S} || y_i − Σ_{p=1}^{width} a_p · max{0, ⟨w_p, x_i⟩ + b_p} − b ||²_2,
can be done in time poly(number of data points) · exp(width, dimension).
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks, or of
ameliorating the dependency on the width, remains wide open!
23. The questions (which have some answers!)
Formalizing the questions about neural nets
(2b) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that nets
can often be set up in such a way that they give a function to
optimize whose optima/critical points reveal this hidden structure.
In one classic scenario, called “sparse coding”, we will show proofs
of how the net's loss function has certain nice properties which
possibly help reveal the hidden data-generation model (the
“dictionary”).
25. The questions (which have some answers!)
Formalizing the questions about neural nets
(3) The deep-net functions.
One of the themes we have looked into a lot is trying to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
27. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture?
Theorem (Ours)
Any ReLU deep net always computes a piecewise linear function,
and every R^n → R piecewise linear function is a ReLU net function of
depth at most 1 + ⌈log2(n + 1)⌉. For n = 1 there is also a sharp
width lower bound.
29. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A very small part of “The Big Question”
A simple (but somewhat surprising!) example is the following fact,
Theorem (Ours)
1-DNN ⊊ 2-DNN, and the following R2 → R function,
(x1, x2) → max{0, x1, x2}, is in the gap.
Proof.
That 1-DNN ⊆ 2-DNN is obvious. Now observe that any R2 → R
1-DNN function is non-differentiable on a union of full lines (one line
along each ReLU gate's argument), but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
30. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A small part of “The Big Question” which is already unclear!
It is easy to see that max{0, x1, x2, ..., x_{2^k − 1}} ∈ k-DNN. But
is this in (k−1)-DNN? The corresponding statement at higher
depths (k ≥ 3) remains unresolved as of now!
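One way to see the membership claim is the standard binary-tree construction: each pairwise max{a, b} = (a + b)/2 + |a − b|/2 costs one hidden ReLU layer, so a balanced tree over 2^k inputs (one of them the constant 0) uses k ReLU layers. A sketch under these assumptions (function names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def pairwise_max(a, b):
    # max{a, b} = (a + b)/2 + |a - b|/2, one hidden ReLU layer (4 gates)
    return 0.5 * (relu(a + b) - relu(-a - b) + relu(a - b) + relu(b - a))

def relu_max(xs):
    # Binary tree of pairwise maxes: 2^k numbers need k ReLU layers.
    xs = list(xs)
    while len(xs) > 1:
        xs = [pairwise_max(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

rng = np.random.default_rng(1)
v = rng.normal(size=7)  # 2^3 - 1 numbers
# Appending the constant 0 gives max{0, x_1, ..., x_{2^k - 1}} in k = 3 layers
assert np.isclose(relu_max(np.append(v, 0.0)), max(0.0, v.max()))
```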
32. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (Ours)
∀k ∈ N, there exists a continuum of R → R neural net functions
of depth 1 + k² (and size k³) which needs size Ω(k^{k+1}) at depths
≤ 1 + k.
Here the basic intuition is that if one starts with a small-depth
function which is oscillating, then *without* blowing up the width too
much, higher depths can be set up to recursively increase the number
of oscillations. Such functions then become very hard for the smaller
depths to even approximate in ℓ1 norm unless they blow up in size.
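This oscillation-doubling intuition can be illustrated with the classic tent-map construction (a Telgarsky-style example, not necessarily the exact family in the theorem): a width-2 ReLU layer computes the tent map on [0, 1], and composing k copies yields a function that swings between 0 and 1 at every dyadic point j/2^k, so the number of affine pieces grows exponentially in depth while the size grows only linearly.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Tent map on [0,1] as a width-2 ReLU layer:
    # t(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def iterated_tent(x, k):
    # k-fold composition: a depth-k ReLU net with only 2 gates per layer
    for _ in range(k):
        x = tent(x)
    return x

# The k-fold composition alternates between 0 and 1 at the dyadic
# points j/2^k, i.e. it has 2^(k-1) full oscillations.
for k in (1, 2, 3, 4):
    j = np.arange(2 ** k + 1)
    assert np.allclose(iterated_tent(j / 2 ** k, k), j % 2)
```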
36. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,
Theorem (Ours)
There is at least an Ω(n) gap between Sum-of-ReLU and
Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube {0, 1}^n
given as f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This has 2^n level sets on the
discrete cube, and hence needs that many polyhedral cells to be
produced by the hyperplanes of the Sum-of-LTF circuit, whereas,
being a linear function, it can be implemented by just 2 ReLU gates!
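The proof's witness function is easy to check by machine: f is linear, so two ReLU gates suffice via the identity z = ReLU(z) − ReLU(−z), and f takes 2^n distinct values on {0, 1}^n. A sketch (names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

n = 8
w = 2.0 ** np.arange(n)  # weights (1, 2, 4, ..., 2^(n-1))

def f_two_relus(x):
    # The linear function f(x) = <w, x> written with just 2 ReLU gates,
    # using the identity z = ReLU(z) - ReLU(-z).
    z = x @ w
    return relu(z) - relu(-z)

# Enumerate the hypercube {0,1}^n: f is injective there,
# so it has 2^n level sets on the discrete cube.
cube = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)
assert len(np.unique(f_two_relus(cube))) == 2 ** n
```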
37. The questions (which have some answers!) What functions does a deep net represent?
The next set of results we will present are more recent and seem to
need significantly more effort than those till now.
39. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
The *ideal* depth separation!
Can one show neural functions at every depth such that representing
them by circuits of even one depth less necessarily requires size
Ω(e^{dimension})? This is a major open question, and over real inputs
such a separation is currently known only between 2-DNN and 1-DNN,
from the works of Eldan-Shamir and of Amit Daniely.
We go beyond small depth lower bounds in the following restricted sense,
41. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1}
circuits require size
Ω( (d − 1) · 2^{(dimension)^{1/8}/(d−1)} / ((dimension) · W)^{1/(d−1)} )
when the bottom-most layer weight vectors are such that their coordinates
are integers of size at most W, and these weight vectors induce the same
ordering on the set {−1, 1}^{dimension} when ranked by the value of their
inner product with the input.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” of the functions computed by such circuits, thought of
as matrices of dimension 2^{dimension/2} × 2^{dimension/2}, is
quadratically (in dimension) bounded. (And we recall that small-depth,
small-size functions are known which have exponentially large sign-rank.)
43. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions
Despite the results of Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ)
and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that
matches the Andreev function on n bits for at least a 1/2 + ε
fraction of the inputs has size Ω(ε · 2^{(1−δ) n^{1−δ}}).
This is proven by the “method of random restrictions”, in particular a very
recent version of it due to Daniel Kane (UCSD) and Ryan Williams (MIT)
based on the Littlewood-Offord theorem.
46. The questions (which have some answers!) Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able
to solve so many diverse kinds of real-life optimization problems.
It is a serious mathematical challenge to understand how the deep net
“sees” the classic optimization questions.
For a net N and a distribution D, let us call its “landscape” L,
corresponding to a loss function ℓ (typically the squared loss),
L(D, N) = E_{(x,y)∼D}[ ℓ(y, N(x)) ].
Why is this L so often a nice function to optimize over, so as to solve an
optimization problem which on its own had nothing to do with nets?
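In code, the landscape is just an expectation of the loss over the data distribution, estimated here by sampling. A toy sketch (the net, the loss, and the data below are illustrative choices of ours, not from the talk):

```python
import numpy as np

def landscape(net, loss, xs, ys):
    # Empirical estimate of L(D, N) = E_{(x,y)~D}[ loss(y, N(x)) ]
    return float(np.mean([loss(y, net(x)) for x, y in zip(xs, ys)]))

def sq_loss(y, y_hat):
    return (y - y_hat) ** 2

def net(x):
    # A single summed ReLU gate, purely for illustration
    return np.maximum(0.0, x).sum()

rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 3))
ys = np.array([net(x) for x in xs])  # labels generated by the net itself

# A perfectly realizable distribution gives a zero-valued landscape minimum
assert landscape(net, sq_loss, xs, ys) == 0.0
```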
48. The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where, given access
to vectors y = A* x* and some distributional (sparsity) guarantees
about x*, we try to infer A*. Breakthrough work by Spielman, Wang
and Wright (2012): this is sometimes provably doable in poly-time!
In this work we attempt to make progress towards giving a rigorous
explanation for the observation that nets seem to solve sparse coding!
49. The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
The defining equations of our autoencoder computing ỹ ∈ R^n from y ∈ R^n.
The generative model: sparse x* ∈ R^h, y = A* x* ∈ R^n, with h ≫ n.
h = ReLU(W y − ε) = max{0, W y − ε} ∈ R^h
ỹ = W^T h ∈ R^n
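The forward pass of this weight-tied autoencoder is two lines. The sketch below treats the bias ε, the dimensions, the sparsity level, and the scaling of A* as illustrative choices of ours; only the shapes and the generative model follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 100   # h >> n: overcomplete dictionary
eps = 0.1        # ReLU bias (illustrative value, not from the talk)

# Generative model: y = A* x* with x* sparse and nonnegative
A_star = rng.normal(size=(n, h)) / np.sqrt(n)  # illustrative normalization
x_star = np.zeros(h)
support = rng.choice(h, size=5, replace=False)
x_star[support] = rng.uniform(0.5, 1.0, size=5)
y = A_star @ x_star

# The autoencoder: hidden code and reconstruction, with tied weights W, W^T
W = A_star.T                             # the encoder at the "right answer"
code = np.maximum(0.0, W @ y - eps)      # h = ReLU(W y - eps)
y_tilde = W.T @ code                     # y~ = W^T h
assert code.shape == (h,) and y_tilde.shape == (n,)
```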
52. The questions (which have some answers!) Why can the deep net do dictionary learning?
Our TensorFlow experiments provide evidence for the power of
autoencoders
Software : TensorFlow (with complicated gradient updates!)
6000 training examples and 1000 testing examples for each digit
n = 784, and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 for the 2-DNN.
55. The questions (which have some answers!) Why can the deep net do dictionary learning?
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors x* ∈ R^h are such that their non-zero
coordinates are sampled from an interval in R+, the support has size
at most h^p with p < 1/2, and A* ∈ R^{n×h} is incoherent enough, then a
constant ε can be chosen such that the autoencoder landscape,
E_{y=A*x*}[ || y − W^T ReLU(W y − ε) ||²_2 ],
is asymptotically (in h) critical in a neighbourhood of A*.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find it! Experiments in fact suggest that,
asymptotically in h, A* might even be a global minimum, but as of now
we have no clue how to prove such a thing!
62. Open questions
Even for the specific case of sparse coding, how does one analyze all the
critical points of the landscape, or even just (dis?)prove that the right
answer is a global minimum?
Can one exactly characterize the set of functions parameterized by
the architecture?
How does one (dis?)prove the existence of dimension-exponential gaps
between consecutive depths? (This isn't clear even with just Boolean
inputs and unrestricted weights!)
Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates?
(A negative answer would immediately show that the deep net function
class strictly increases with depth!)
Can one show that *every* low-depth function has a “simpler” (maybe
smaller circuit size or smaller weights) representation at higher depths?
In the space of deep net functions, how dense are the “high complexity”
functions? (like those with Ω(size^{dimension}) affine pieces)
Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)