1. Mathematics of neural networks
Anirbit
AMS
Johns Hopkins University
Invited talk at MIT Maths
“Seminar on Applied Algebra and Geometry”, 14th November 2017
Anirbit (AMS, Johns Hopkins University), 14th November 2017, 1 / 30
2. Outline
1 Introduction
2 The questions (which have some answers!)
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
3. Introduction
This talk is based on the following three papers:
https://arxiv.org/abs/1711.03073
“Lower bounds over Boolean inputs for deep neural networks with
ReLU gates”
https://arxiv.org/abs/1708.03735
“Sparse Coding and Autoencoders”
https://eccc.weizmann.ac.il/report/2017/098/
“Understanding Deep Neural Networks with Rectified Linear Units”
4. Introduction
The collaborators!
These are joint works with Amitabh Basu (AMS, JHU)
and different subsets of:
Akshay Rangamani (ECE, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
6. Introduction
Activation gates of the neural network
The building blocks of a “neural net” are its activation gates, which do the
basic analogue computations (as opposed to the Boolean gates in Boolean
circuits, which compute the AND, OR, NOT and threshold functions).
The above is an R3 → R neural gate evaluating the “activation” function
f : R → R on a linear (in general, affine) transformation of the input
vector. The w's are the ‘weights’. If there were many Y's coming out of
the gate, it would pass the same value to all of them.
9. Introduction
The ReLU activation function
It is now almost uniformly believed that the “best” activation function to
use is the “Rectified Linear Unit (ReLU)”,
ReLU : R → R, x → max{0, x}.
At this point it is useful to also define the more-studied activation in
Boolean complexity, the “Linear Threshold Function (LTF)”, to which we
shall at times compare later,
LTF : R → R, x → 1_{x≥0} or 2·1_{x≥0} − 1.
If LTF(x) = 1_{x≥0}, then it's easy to see that ReLU(x) = x · LTF(x).
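As a quick sanity check, the two activations and the identity ReLU(x) = x · LTF(x) can be verified numerically (a minimal sketch; the function names are ours):

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max{0, x}
    return np.maximum(0.0, x)

def ltf(x):
    # Linear Threshold Function (0/1 convention): 1 if x >= 0, else 0
    return (x >= 0).astype(float)

xs = np.linspace(-3.0, 3.0, 13)
# The identity from the slide: ReLU(x) = x * LTF(x)
assert np.allclose(relu(xs), xs * ltf(xs))
```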
10. Introduction
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of this as representing a certain class (set) of R4 → R3
functions which can be computed by the above “architecture” for a
*fixed* choice of “activation functions” at each of the blue nodes. The
yellow nodes are where the input vector comes in and the orange nodes are
where the output vector comes out.
12. Introduction
An example of a neurally representable function
[Figure: a 1-DNN with four ReLU gates, edge weights ±1 and output weights ±1/2, computing (x1 + x2)/2 + |x1 − x2|/2 from inputs x1, x2.]
In the above we see a “1-DNN” with ReLU activation computing the
R2 → R function given as (x1, x2) → max{x1, x2}. The above neural
network would be said to be of size 4. These “max” functions are
particularly interesting, and we will soon come back to them!
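The construction above can be written out explicitly: max{x1, x2} = (x1 + x2)/2 + |x1 − x2|/2, and each summand is implementable by ReLU gates with weights ±1 and output coefficients ±1/2, for a total of 4 gates. A minimal numerical sketch (function names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_1dnn(x1, x2):
    # One hidden layer of 4 ReLU gates computing
    # max{x1, x2} = (x1 + x2)/2 + |x1 - x2|/2, using
    #   (x1 + x2)/2 = ReLU(x1 + x2)/2 - ReLU(-x1 - x2)/2
    #   |x1 - x2|/2 = ReLU(x1 - x2)/2 + ReLU(-x1 + x2)/2
    gates = [relu(x1 + x2), relu(-x1 - x2), relu(x1 - x2), relu(-x1 + x2)]
    return 0.5 * gates[0] - 0.5 * gates[1] + 0.5 * gates[2] + 0.5 * gates[3]

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
assert np.allclose(max_1dnn(a, b), np.maximum(a, b))
```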
13. Introduction
Neural nets used in real life!
Neural nets deployed in the real world, which are creating engineering
miracles every day, come in various complicated designs. “The Asimov
Institute” recently compiled a beautiful chart summarizing many of the
architectures in use.
17. Introduction
When do real weights matter?
Consider a very restricted class of neural networks built out of
piecewise-linear gates like ReLU, where all inputs as well as the gates'
descriptions are restricted to rational numbers requiring at most m bits
each, the network ends with a threshold gate at the top, and every gate
has fan-out at most 1.
It was shown by Wolfgang Maass in 1997 that for such networks one can
trade off real weights for rational weights such that the corresponding
integers are in absolute value at most (2s + 1)! · 2^{2m(2s+1)}, where s
is the total number of weights.
But for the usual nets, where there is no restriction on the fan-out,
we do not know of such a transformation, and it is also not clear to
us whether we can always simulate these with LTF gates without
blowing up the size!
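Taking the bound at face value (as reconstructed above: (2s + 1)! · 2^{2m(2s+1)}), even tiny choices of s and m give astronomically large, yet finite, integer weights. A hypothetical sketch:

```python
from math import factorial

def maass_bound(s, m):
    # Integer-weight bound (2s+1)! * 2^(2m(2s+1)) from the slide,
    # with s = total number of weights, m = bits per rational weight.
    return factorial(2 * s + 1) * 2 ** (2 * m * (2 * s + 1))

# A net with just 3 weights described by 2-bit rationals:
print(maass_bound(3, 2))
```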
19. Introduction
Neural networks : recent resurgence as a “universal algorithm”!
They are revolutionizing the techniques in various fields ranging from
particle physics to genomics to signal processing to computer vision.
Most recently, the Google DeepMind group has shown that, within hours
of training, one can make a neural network capable of finding new
strategies for winning the game of “Go” that people haven't seen
despite (more than?) a thousand years of playing the game! Most
importantly, this seems possible without feeding the network any
human game histories!
Mathematically, nets are still extremely hard to analyze. A nice survey
of many of the recent ideas has been compiled in a 3-part series
of articles by the “Center for Brains, Minds, and Machines (CBMM)” at
MIT. Here we report on some of the attempts we have been making
to rigorously understand neural networks.
20. The questions (which have some answers!)
Formalizing the questions about neural nets
Broadly it seems that we can take 3 different directions of study,
(1) Generalization
It needs an explanation why the training of neural nets generally
does not overfit despite their being over-parametrized. Some recent
attempts at explaining this can be seen in the works of Kaelbling,
Kawaguchi (MIT) and Bengio (UMontreal); Brutzkus and Globerson (Tel
Aviv University); and Malach and Shai Shalev-Shwartz (The Hebrew
University).
22. The questions (which have some answers!)
Formalizing the questions about neural nets
(2a) Trainability of the nets
Theorem (Ours)
Empirical risk minimization on a 1-DNN with a convex loss, like
min_{w_p, a_p, b_p, b} (1/S) Σ_{i=1}^{S} || y_i − Σ_{p=1}^{width} a_p · max{0, ⟨w_p, x_i⟩ + b_p} − b ||²_2,
can be done in time poly(number of data points) · exp(width, dimension).
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks, or of
ameliorating the dependency on the width, remains wide open!
23. The questions (which have some answers!)
Formalizing the questions about neural nets
(2b) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that nets
can often be set up in such a way that they give a function to
optimize whose optima/critical points reveal this hidden structure.
In one classic scenario, called “sparse coding”, we will show proofs
of how the net's loss function has certain nice properties which
possibly help reveal the hidden data-generation model (the
“dictionary”).
25. The questions (which have some answers!)
Formalizing the questions about neural nets
(3) The deep-net functions.
One of the themes we have looked into a lot is trying to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
27. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture?
Theorem (Ours)
Any ReLU deep net always computes a piecewise linear function,
and every R^n → R piecewise linear function is a ReLU net function of
depth at most 1 + ⌈log2(n + 1)⌉. For n = 1 there is also a sharp
width lower bound.
29. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A very small part of “The Big Question”
A simple (but somewhat surprising!) example is the following fact,
Theorem (Ours)
1-DNN ⊊ 2-DNN, and the following R2 → R function,
(x1, x2) → max{0, x1, x2}, is in the gap.
Proof.
That 1-DNN ⊆ 2-DNN is obvious. Now observe that any R2 → R
1-DNN function is non-differentiable on a union of full lines (one line
along each ReLU gate's argument), but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
30. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A small part of “The Big Question” which is already unclear!
It is easy to see that max{0, x1, x2, ..., x_{2^k − 1}} ∈ k-DNN. But
is this in (k−1)-DNN? The corresponding statement at higher
depths (k ≥ 3) remains unresolved as of now!
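One way to see the membership claim is the standard binary-tree construction: each pairwise max{a, b} = (a + b)/2 + |a − b|/2 costs one hidden ReLU layer, so a balanced tree over 2^k inputs (one of them the constant 0) uses k ReLU layers. A sketch under these assumptions (function names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def pairwise_max(a, b):
    # max{a, b} = (a + b)/2 + |a - b|/2, one hidden ReLU layer (4 gates)
    return 0.5 * (relu(a + b) - relu(-a - b) + relu(a - b) + relu(b - a))

def relu_max(xs):
    # Binary tree of pairwise maxes: 2^k numbers need k ReLU layers.
    xs = list(xs)
    while len(xs) > 1:
        xs = [pairwise_max(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

rng = np.random.default_rng(1)
v = rng.normal(size=7)  # 2^3 - 1 numbers
# Appending the constant 0 gives max{0, x_1, ..., x_{2^k - 1}} in k = 3 layers
assert np.isclose(relu_max(np.append(v, 0.0)), max(0.0, v.max()))
```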
32. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (Ours)
∀k ∈ N, there exists a continuum of R → R neural net functions
of depth 1 + k² (and size k³) which needs size Ω(k^{k+1}) at depths
≤ 1 + k.
Here the basic intuition is that if one starts with a small-depth
function which is oscillating, then *without* blowing up the width too
much, higher depths can be set up to recursively increase the number
of oscillations. Such functions then become very hard for the smaller
depths to even approximate in ℓ1 norm unless they blow up in size.
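This oscillation-doubling intuition can be illustrated with the classic tent-map construction (a Telgarsky-style example, not necessarily the exact family in the theorem): a width-2 ReLU layer computes the tent map on [0, 1], and composing k copies yields a function that swings between 0 and 1 at every dyadic point j/2^k, so the number of affine pieces grows exponentially in depth while the size grows only linearly.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Tent map on [0,1] as a width-2 ReLU layer:
    # t(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def iterated_tent(x, k):
    # k-fold composition: a depth-k ReLU net with only 2 gates per layer
    for _ in range(k):
        x = tent(x)
    return x

# The k-fold composition alternates between 0 and 1 at the dyadic
# points j/2^k, i.e. it has 2^(k-1) full oscillations.
for k in (1, 2, 3, 4):
    j = np.arange(2 ** k + 1)
    assert np.allclose(iterated_tent(j / 2 ** k, k), j % 2)
```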
36. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,
Theorem (Ours)
There is at least an Ω(n) gap between Sum-of-ReLU and
Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube {0, 1}^n
given as f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This has 2^n level sets on the
discrete cube, and hence needs that many polyhedral cells to be
produced by the hyperplanes of the Sum-of-LTF circuit, whereas,
being a linear function, it can be implemented by just 2 ReLU gates!
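The proof's witness function is easy to check by machine: f is linear, so two ReLU gates suffice via the identity z = ReLU(z) − ReLU(−z), and f takes 2^n distinct values on {0, 1}^n. A sketch (names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

n = 8
w = 2.0 ** np.arange(n)  # weights (1, 2, 4, ..., 2^(n-1))

def f_two_relus(x):
    # The linear function f(x) = <w, x> written with just 2 ReLU gates,
    # using the identity z = ReLU(z) - ReLU(-z).
    z = x @ w
    return relu(z) - relu(-z)

# Enumerate the hypercube {0,1}^n: f is injective there,
# so it has 2^n level sets on the discrete cube.
cube = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)
assert len(np.unique(f_two_relus(cube))) == 2 ** n
```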
37. The questions (which have some answers!) What functions does a deep net represent?
The next set of results we will present are more recent and seem to
need significantly more effort than those till now.
39. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
The *ideal* depth separation!
Can one show neural functions at every depth such that representing
them by circuits of even one depth less necessarily requires size
Ω(e^{dimension})? This is a major open question, and over real inputs
such a separation is currently known only between 2-DNN and 1-DNN,
from the works of Eldan-Shamir and of Amit Daniely.
We go beyond small depth lower bounds in the following restricted sense,
41. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1}
circuits require size
Ω( (d − 1) · 2^{(dimension)^{1/8}/(d−1)} / ((dimension) · W)^{1/(d−1)} )
when the bottom-most layer weight vectors are such that their coordinates
are integers of size at most W, and these weight vectors induce the same
ordering on the set {−1, 1}^{dimension} when ranked by the value of their
inner product with the input.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” of the functions computed by such circuits, thought of
as matrices of dimension 2^{dimension/2} × 2^{dimension/2}, is
quadratically (in dimension) bounded. (And we recall that small-depth,
small-size functions are known which have exponentially large sign-rank.)
43. The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions
Despite the results of Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ)
and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that
matches the Andreev function on n bits for at least a 1/2 + ε
fraction of the inputs has size Ω(ε · 2^{(1−δ) n^{1−δ}}).
This is proven by the “method of random restrictions”, in particular a very
recent version of it due to Daniel Kane (UCSD) and Ryan Williams (MIT)
based on the Littlewood-Offord theorem.
46. The questions (which have some answers!) Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able
to solve so many diverse kinds of real-life optimization problems.
It is a serious mathematical challenge to understand how the deep net
“sees” the classic optimization questions.
For a net N and a distribution D, let us call its “landscape” L,
corresponding to a loss function ℓ (typically the squared loss),
L(D, N) = E_{(x,y)∼D}[ ℓ(y, N(x)) ].
Why is this L so often a nice function to optimize over, so as to solve an
optimization problem which on its own had nothing to do with nets?
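In code, the landscape is just an expectation of the loss over the data distribution, estimated here by sampling. A toy sketch (the net, the loss, and the data below are illustrative choices of ours, not from the talk):

```python
import numpy as np

def landscape(net, loss, xs, ys):
    # Empirical estimate of L(D, N) = E_{(x,y)~D}[ loss(y, N(x)) ]
    return float(np.mean([loss(y, net(x)) for x, y in zip(xs, ys)]))

def sq_loss(y, y_hat):
    return (y - y_hat) ** 2

def net(x):
    # A single summed ReLU gate, purely for illustration
    return np.maximum(0.0, x).sum()

rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 3))
ys = np.array([net(x) for x in xs])  # labels generated by the net itself

# A perfectly realizable distribution gives a zero-valued landscape minimum
assert landscape(net, sq_loss, xs, ys) == 0.0
```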
48. The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where, given access
to vectors y = A* x* and some distributional (sparsity) guarantees
about x*, we try to infer A*. Breakthrough work by Spielman, Wang
and Wright (2012): this is sometimes provably doable in poly-time!
In this work we attempt to make progress towards giving a rigorous
explanation for the observation that nets seem to solve sparse coding!
49. The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
The defining equations of our autoencoder computing ỹ ∈ R^n from y ∈ R^n.
The generative model: sparse x* ∈ R^h, y = A* x* ∈ R^n, with h ≫ n.
h = ReLU(W y − ε) = max{0, W y − ε} ∈ R^h
ỹ = W^T h ∈ R^n
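The forward pass of this weight-tied autoencoder is two lines. The sketch below treats the bias ε, the dimensions, the sparsity level, and the scaling of A* as illustrative choices of ours; only the shapes and the generative model follow the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 100   # h >> n: overcomplete dictionary
eps = 0.1        # ReLU bias (illustrative value, not from the talk)

# Generative model: y = A* x* with x* sparse and nonnegative
A_star = rng.normal(size=(n, h)) / np.sqrt(n)  # illustrative normalization
x_star = np.zeros(h)
support = rng.choice(h, size=5, replace=False)
x_star[support] = rng.uniform(0.5, 1.0, size=5)
y = A_star @ x_star

# The autoencoder: hidden code and reconstruction, with tied weights W, W^T
W = A_star.T                             # the encoder at the "right answer"
code = np.maximum(0.0, W @ y - eps)      # h = ReLU(W y - eps)
y_tilde = W.T @ code                     # y~ = W^T h
assert code.shape == (h,) and y_tilde.shape == (n,)
```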
52. The questions (which have some answers!) Why can the deep net do dictionary learning?
Our TensorFlow experiments provide evidence for the power of
autoencoders
Software : TensorFlow (with complicated gradient updates!)
6000 training examples and 1000 testing examples for each digit
n = 784, and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 for the 2-DNN.
55. The questions (which have some answers!) Why can the deep net do dictionary learning?
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors x* ∈ R^h are such that their non-zero
coordinates are sampled from an interval in R+, the support has size
at most h^p with p < 1/2, and A* ∈ R^{n×h} is incoherent enough, then a
constant ε can be chosen such that the autoencoder landscape,
E_{y=A*x*}[ || y − W^T ReLU(W y − ε) ||²_2 ],
is asymptotically (in h) critical in a neighbourhood of A*.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find it! Experiments in fact suggest that,
asymptotically in h, A* might even be a global minimum, but as of now
we have no clue how to prove such a thing!
62. Open questions
Even for the specific case of sparse coding, how does one analyze all the
critical points of the landscape, or even just (dis?)prove that the right
answer is a global minimum?
Can one exactly characterize the set of functions parameterized by
the architecture?
How does one (dis?)prove the existence of dimension-exponential gaps
between consecutive depths? (This isn't clear even with just Boolean
inputs and unrestricted weights!)
Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates?
(A negative answer would immediately show that the deep net function
class strictly increases with depth!)
Can one show that *every* low-depth function has a “simpler” (maybe
smaller circuit size or smaller weights) representation at higher depths?
In the space of deep net functions, how dense are the “high complexity”
functions? (like those with Ω(size^{dimension}) affine pieces)
Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)