This document provides an overview of the author's research on neural networks. It begins with an introduction to the papers the overview is based on and the collaborators involved. It then discusses open questions about characterizing the functions representable by neural networks and some of the author's results, including: showing that the ReLU deep-net functions on R^n are exactly the continuous piecewise linear functions, with depth 1 + ⌈log₂(n + 1)⌉ sufficing; depth separations between networks of different depths; and gaps between different architectures for Boolean functions. The author outlines ongoing work on fully characterizing neural-network functions and establishing stronger depth separations.
My invited talk at the 23rd International Symposium on Mathematical Programming (ISMP 2018)
1. Mathematics Of Neural Networks
Anirbit
AMS
Johns Hopkins University
( AMS Johns Hopkins University ) 1 / 25
2. Outline
1 Introduction
2 An overview of our results about neural nets
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
4. Introduction
This overview is based on the following 4 papers of ours,
“Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders”
(ICML 2018 Workshop On Non-Convex Optimization; not yet public)
“Lower bounds over Boolean inputs for deep neural networks with ReLU gates”
https://eccc.weizmann.ac.il/report/2017/190/
“Sparse Coding and Autoencoders”
https://arxiv.org/abs/1708.03735 (ISIT 2018)
“Understanding Deep Neural Networks with Rectified Linear Units”
https://eccc.weizmann.ac.il/report/2017/098/ (ICLR 2018)
5. Introduction
The collaborators!
These are works with Amitabh Basu (AMS, JHU)
and different subsets of:
Akshay Rangamani (ECE, JHU)
Soham De (CS, UMD)
Enayat Ullah (CS, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
6. Introduction
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since no “weights” are assigned to the edges of the above graph,
one should think of it as representing a certain class (set) of R^4 → R^3
functions: those computable by this “architecture” for a
*fixed* choice of “activation functions” (like ReLU(x) = max{0, x}) at
each of the blue nodes. The yellow nodes are where the input vector
comes in and the orange nodes are where the output vector comes out.
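As a concrete illustration (the hidden-layer width of 5 below is an assumption, not read off the diagram), a minimal Python sketch of one member of such an R^4 → R^3 class:

```python
import random

def relu(z):
    # ReLU applied coordinate-wise: ReLU(x) = max{0, x}
    return [max(0.0, v) for v in z]

def affine(W, b, x):
    # W x + b for a weight matrix W (a list of rows) and a bias vector b
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def net(x, W1, b1, W2, b2):
    # One member of the R^4 -> R^3 class: 4 inputs (the yellow nodes),
    # one hidden layer of ReLU gates (blue), 3 affine outputs (orange).
    return affine(W2, b2, relu(affine(W1, b1, x)))

rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
b1 = [rng.gauss(0, 1) for _ in range(5)]
W2 = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(3)]
b2 = [rng.gauss(0, 1) for _ in range(3)]
y = net([1.0, -2.0, 0.5, 3.0], W1, b1, W2, b2)
```

Fixing the architecture but letting (W1, b1, W2, b2) range over all values traces out exactly the class of functions the slide describes.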
8. An overview of our results about neural nets
Formalizing the questions about neural nets
(1) Exact trainability of the nets
Theorem (Ours)
Empirical risk minimization on a 1-DNN with a convex loss, like
min_{a_p, w_p, b_p} (1/S) Σ_{i=1}^{S} ‖ y_i − Σ_{p=1}^{width} a_p max{0, ⟨w_p, x_i⟩ + b_p} ‖_2² ,
can be done in time 2^width · S^(n·width) · poly(n, S, width).
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks or
ameliorating the dependency on width remains wildly
open!
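The idea behind such exact-training results is to enumerate the cells of the hyperplane arrangement induced by the gates and solve a convex problem inside each cell. A toy sketch of that idea, for width 1 and n = 1 only (a hypothetical simplification for illustration, not the full algorithm of the theorem): the single "hyperplane" x = t cuts the S data points into O(S) distinct cells, and within each cell the least-squares problem in the outer coefficient a is convex, here even closed-form:

```python
def relu(z):
    return max(0.0, z)

def fit_width1_exact(xs, ys):
    # Enumerate the cells that the single ReLU 'hyperplane' x = t cuts the
    # data into (at most O(S) of them in 1-D), and solve the convex
    # (one-parameter, closed-form) least-squares problem in each cell.
    pts = sorted(set(xs))
    cands = list(pts)
    cands += [(u + v) / 2 for u, v in zip(pts, pts[1:])]
    cands += [pts[0] - 1.0, pts[-1] + 1.0]
    best = None
    for t in cands:
        for s in (1.0, -1.0):                      # orientation of the gate
            phi = [relu(s * (x - t)) for x in xs]  # the gate's output feature
            denom = sum(p * p for p in phi)
            a = (sum(y * p for y, p in zip(ys, phi)) / denom) if denom else 0.0
            loss = sum((y - a * p) ** 2 for y, p in zip(ys, phi))
            if best is None or loss < best[0]:
                best = (loss, a, s, t)
    return best

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2.0 * relu(x - 1.0) for x in xs]   # data realizable by one ReLU gate
loss, a, s, t = fit_width1_exact(xs, ys)
```

Because the data above is generated by a single ReLU gate, the cell enumeration recovers the global minimum (zero loss) exactly, which a local search need not do.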
9. An overview of our results about neural nets
Formalizing the questions about neural nets
(2) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that they can
often be set up so that they give a function to optimize whose
optima/critical points reveal this hidden structure. In one classic
scenario, “sparse coding”, we will prove that the net’s loss function
has certain nice properties which plausibly help reveal the hidden
data-generation model (the “dictionary”).
11. An overview of our results about neural nets
Formalizing the questions about neural nets
(3) The deep-net functions.
One of the themes that we have looked into a lot is to try to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
14. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture? No Clue!
Theorem (Ours)
A function f : R^n → R is continuous piecewise linear iff it is
representable by a ReLU deep net. Further, a ReLU deep net of
depth at most 1 + ⌈log₂(n + 1)⌉ suffices to represent any such f. For
n = 1 there is also a sharp width lower bound.
16. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
A very small part of “The Big Question”
A simple (but somewhat surprising!) example is the following fact,
Theorem (Ours)
1-DNN ⊊ 2-DNN, and the R^2 → R function
(x1, x2) ↦ max{0, x1, x2} is in the gap.
Proof.
That 1-DNN ⊆ 2-DNN is obvious. Now observe that any R^2 → R
1-DNN function is non-differentiable on a union of lines (one line
along each ReLU gate’s argument), but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
18. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
A small part of “The Big Question” which is already unclear!
The family of 2-DNN functions is parameterized as follows by
(dimension-compatible) choices of matrices W1, W2, vectors
a, b1, b2, and a number b3,
f_{2-DNN}(x) = b3 + ⟨a, max{0, b2 + W2 max{0, b1 + W1 x}}⟩
Can the R^4 → R function given as x ↦ max{0, x1, x2, x3, x4} be
written in the above form?
(While it’s easy to see that max{0, x1, x2, ..., x_{2^k}} ∈ (k+1)-DNN.)
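For the parenthetical claim, a sketch of why max{0, x1, ..., x_{2^k}} ∈ (k+1)-DNN: run a pairwise-max tournament, where each round costs one layer of ReLU gates via max{a, b} = ReLU(a − b) + ReLU(b) − ReLU(−b) (the last two gates carry b through the layer, since z = ReLU(z) − ReLU(−z)):

```python
def relu(z):
    return max(0.0, z)

def max2(a, b):
    # max{a, b} with one layer of three ReLU gates and a linear readout:
    # max{a, b} = b + ReLU(a - b), with b carried as ReLU(b) - ReLU(-b)
    return relu(a - b) + relu(b) - relu(-b)

def deep_max(xs):
    # Tournament of pairwise maxes: 2^k inputs need k ReLU layers,
    # plus one final ReLU layer for the max with 0, giving a (k+1)-DNN.
    layers = 0
    while len(xs) > 1:
        xs = [max2(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        layers += 1
    return relu(xs[0]), layers + 1

val, depth = deep_max([-3.0, 1.5, 0.25, -0.5])   # 2^2 = 4 inputs
```

Here val is max{0, −3, 1.5, 0.25, −0.5} = 1.5 and depth = k + 1 = 3. Whether the 2^k inputs can be handled in fewer than k + 1 ReLU layers (e.g., the R^4 question above in 2 layers) is exactly what is unclear.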
20. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (We generalize a result by Matus Telgarsky (UIUC))
∀k ∈ N, there exists a continuum of R → R neural net functions
of depth 1 + k² (and size k³) which need size Ω(k^{k+1}) at depths
≤ 1 + k.
Here the basic intuition is that if one starts with a small-depth
function which is oscillating, then *without* blowing up the width too
much, higher depths can be set up to recursively increase the number
of oscillations. Such functions then become very hard for the smaller
depths to even approximate in the L¹ norm unless they blow up in size.
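A minimal numerical illustration of this oscillation-doubling intuition, using the tent map (the gadget in Telgarsky's construction): each composition costs only one more layer of two ReLU gates, yet doubles the number of monotone pieces:

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    # The 'tent map' on [0,1], written with two ReLU gates:
    # t(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def iterate(x, k):
    # Composing the tent map k times costs k ReLU layers but
    # produces 2^k monotone pieces.
    for _ in range(k):
        x = tent(x)
    return x

def count_switches(k, grid=256):
    # Count monotonicity switches of the k-fold composition on a fine grid.
    ys = [iterate(i / grid, k) for i in range(grid + 1)]
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return sum(1 for d, e in zip(diffs, diffs[1:]) if d * e < 0)

switches = count_switches(4)   # 2^4 monotone pieces => 15 switches
```

A shallow net must pay in width for every one of these pieces, which is the engine behind the size lower bound at small depths.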
24. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,
Theorem (Ours)
There is at least an Ω(n) gap between Sum-of-ReLU and
Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube {0, 1}^n
given as f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This has 2^n level sets on the
discrete cube, and hence needs that many polyhedral cells to be
produced by the hyperplanes of the Sum-of-LTF circuit, whereas,
being a linear function, it can be implemented by just 2 ReLU gates!
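The two claims in the proof are easy to check numerically; the sketch below (with an assumed n = 6) verifies that f takes 2^n distinct values on {0, 1}^n, and that, being linear, it is computed by just 2 ReLU gates via f = ReLU(f) − ReLU(−f):

```python
from itertools import product

def relu(z):
    return max(0.0, z)

n = 6
w = [2 ** (i - 1) for i in range(1, n + 1)]   # weights 1, 2, 4, ...

def f(x):
    # f(x) = sum_i 2^(i-1) x_i  -- reads x as a binary number
    return sum(wi * xi for wi, xi in zip(w, x))

def two_relu(x):
    # any linear l satisfies l(x) = ReLU(l(x)) - ReLU(-l(x)): 2 ReLU gates
    s = sum(wi * xi for wi, xi in zip(w, x))
    return relu(s) - relu(-s)

values = {f(x) for x in product([0, 1], repeat=n)}        # 2^n level sets
agree = all(abs(f(x) - two_relu(x)) < 1e-12
            for x in product([0, 1], repeat=n))
```

Since each level set of a Sum-of-LTF circuit is a union of cells of its hyperplane arrangement, 2^n level sets force Ω(n) many LTF gates, while the ReLU side stays at size 2.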
25. An overview of our results about neural nets What functions does a deep net represent?
Now that we are done with the preliminaries, we move on to
the results which seem to need significantly more effort.
27. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
The *ideal* depth separation!
Can one show neural functions at every depth such that representing
them by circuits of even one depth less necessarily requires size
Ω(e^{dimension})? This is a major open question, and over real inputs
such a separation is currently known only between 2-DNN and
1-DNN, from the works of Eldan-Shamir and Amit Daniely.
We go beyond small-depth lower bounds in the following restricted sense,
29. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1}
circuits require size
Ω( (d − 1) · 2^{(dimension)^{1/8} / (d−1)} / ((dimension) · W)^{1/(d−1)} )
when the bottom-most layer weight vectors are such that their coordinates
are integers of size at most W, and these weight vectors induce the same
ordering on the set {−1, 1}^{dimension} when ranked by the value of their
inner product with the input.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” of the functions computed by such circuits, thought of as
matrices of dimension 2^{dimension/2} × 2^{dimension/2}, is quadratically
(in dimension) bounded. (And we recall that small-depth, small-size
functions are known which have exponentially large sign-rank.)
31. An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions
Despite the results of Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ)
and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that
matches the Andreev function on n bits for at least a 1/2 + ε
fraction of the inputs has size Ω(ε · 2^{(1−δ) n^{1−δ}}).
This is proven by the “method of random restrictions” and in particular a very
recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on
the Littlewood-Offord theorem.
34. An overview of our results about neural nets Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able
to solve so many diverse kinds of real-life learning problems. It is a
serious mathematical challenge to understand how the deep net “sees”
these as optimization questions.
For a net N and a distribution D, let us call its “landscape” L,
corresponding to a loss function ℓ (typically the squared loss),
L(D, N) = E_{(x,y)∼D}[ ℓ(y, N(x)) ]
Why is this L so often a nice function to optimize over, solving a
question which a priori had nothing to do with nets?
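A minimal sketch of this definition, with an assumed toy distribution D (x uniform on [−1, 1] and y = ReLU(x) exactly) and a one-gate net, estimating L(D, N) by Monte Carlo:

```python
import random

def relu(z):
    return max(0.0, z)

def net(x, w, b, a):
    # a toy one-ReLU-gate net: N(x) = a * ReLU(w*x + b)
    return a * relu(w * x + b)

def landscape(w, b, a, samples=5000, seed=0):
    # Monte Carlo estimate of L(D, N) = E_{(x,y)~D}[ (y - N(x))^2 ],
    # for a synthetic D: x ~ Uniform[-1, 1] and y = ReLU(x) exactly.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = relu(x)                      # the data-generating rule
        total += (y - net(x, w, b, a)) ** 2
    return total / samples

at_truth = landscape(1.0, 0.0, 1.0)    # N reproduces D exactly here
elsewhere = landscape(0.5, 0.1, 2.0)   # a mismatched parameter setting
```

The parameters that realize the data-generating rule sit at the bottom of L; the question above is why real landscapes so often cooperate this way.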
36. An overview of our results about neural nets Why can the deep net do dictionary learning?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse coding” is a classic learning challenge: given access
to vectors y = A* x* and some distributional (sparsity) guarantees
about x*, we try to infer A*. Breakthrough work by Spielman, Wang
and Wright (2012): this is sometimes provably doable in poly-time!
In this work we attempt to make progress towards a rigorous
explanation of the observation that nets seem to solve sparse coding!
37. An overview of our results about neural nets Why can the deep net do dictionary learning?
Sparse coding
The defining equations of our autoencoder computing ỹ ∈ R^n from y ∈ R^n.
The generative model: sparse x* ∈ R^h and y = A* x* ∈ R^n, with h ≫ n.
h = ReLU(W y − ε) = max{0, W y − ε} ∈ R^h
ỹ = W^T h ∈ R^n
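A sketch of these defining equations in Python, with assumed (hypothetical) dimensions n = 8, h = 32 and a fixed support size k standing in for the sparsity guarantee; the generative model, the encoder (the hidden code the slide calls h, named r below to avoid clashing with the dimension), and the decoder are written out coordinate-wise:

```python
import random

rng = random.Random(0)
n, h, k = 8, 32, 3   # observed dim n, code dim h >> n, sparsity k

# the generative model: y = A* x*, with x* supported on k coordinates
# and non-zero entries drawn from an interval in R+
A = [[rng.gauss(0.0, 1.0) for _ in range(h)] for _ in range(n)]

def sample_y():
    support = rng.sample(range(h), k)
    xstar = [rng.uniform(0.5, 1.5) if j in support else 0.0
             for j in range(h)]
    return [sum(A[i][j] * xstar[j] for j in range(h)) for i in range(n)]

def autoencode(y, W, eps):
    # encoder: r = ReLU(W y - eps) in R^h
    r = [max(0.0, sum(W[j][i] * y[i] for i in range(n)) - eps)
         for j in range(h)]
    # decoder: y~ = W^T r in R^n  (the weights W are tied)
    return [sum(W[j][i] * r[j] for j in range(h)) for i in range(n)]

W = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(h)]
y = sample_y()
ytilde = autoencode(y, W, eps=0.1)
loss = sum((yi - yti) ** 2 for yi, yti in zip(y, ytilde))  # squared loss
```

Training drives this squared reconstruction loss down over W; the results later in the talk concern how this landscape behaves near W = (A*)^T.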
40. An overview of our results about neural nets Why can the deep net do dictionary learning?
The power of autoencoders is surprisingly easy to demonstrate!
Software : TensorFlow (with a complicated iterative technique
called “RMSProp” which we shall explain in the next slide!)
6000 training examples and 1000 testing examples for each digit
n = 784, and the number of ReLU gates was 10000 for the 1-DNN,
and 5000 and 784 for the 2-DNN.
42. An overview of our results about neural nets Why can the deep net do dictionary learning?
What exactly do algorithms like ADAM and RMSProp do?
Algorithm: ADAM on a differentiable function f : R^d → R
1: function ADAM(x1, β1, β2, α, ξ)
2:   Initialize: m0 = 0, v0 = 0
3:   for t = 1, 2, . . . do
4:     g_t = ∇f(x_t)
5:     m_t = β1 m_{t−1} + (1 − β1) g_t
6:     v_t = β2 v_{t−1} + (1 − β2) g_t²   (coordinate-wise square)
7:     V_t = diag(v_t)
8:     x_{t+1} = x_t − α_t (V_t^{1/2} + diag(ξ 1_d))^{−1} m_t
9:   end for
10: end function
These “adaptive gradient” algorithms like ADAM (or RMSProp =
ADAM at β1 = 0) which seem to work the best on autoencoder
neural nets are currently very poorly understood!
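A direct transcription of the pseudocode above into Python; the test function and the decaying step size α_t = α/√t are assumptions made for this demo, and RMSProp is recovered as the β1 = 0 case:

```python
import math

def adam(grad, x0, alpha=0.05, beta1=0.9, beta2=0.999, xi=1e-8, steps=3000):
    # The iteration of the pseudocode, with alpha_t = alpha / sqrt(t)
    # (one common schedule); set beta1 = 0 to get RMSProp.
    x, d = list(x0), len(x0)
    m, v = [0.0] * d, [0.0] * d
    for t in range(1, steps + 1):
        g = grad(x)                                   # g_t = grad f(x_t)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # x_{t+1} = x_t - alpha_t (V_t^{1/2} + xi I)^{-1} m_t, V_t = diag(v_t)
        step = alpha / math.sqrt(t)
        x = [xc - step * mi / (math.sqrt(vi) + xi)
             for xc, mi, vi in zip(x, m, v)]
    return x

# demo: a smooth non-convex function f(x) = sum_i (x_i^2 - 1)^2,
# whose minima are the corners x_i = +/- 1
grad_f = lambda x: [4.0 * c * (c * c - 1.0) for c in x]
xmin = adam(grad_f, [0.3, -0.2, 0.7])
```

Note the per-coordinate preconditioning by (V_t^{1/2} + ξ I)^{−1}: it is exactly this adaptivity whose interaction with non-convex landscapes is poorly understood.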
45. An overview of our results about neural nets Why can the deep net do dictionary learning?
What exactly do algorithms like ADAM and RMSProp do?
Our experimental conclusions and proofs about ADAM
We have shown controlled experiments suggesting that, for
large enough autoencoders, standard methods possibly cannot
surpass ADAM’s ability to reduce training as well as test
losses, particularly when its parameters are set as β1 ∼ 0.99, in
both full-batch and mini-batch settings.
[Theorem] There exists a sequence of step-size choices and
ranges of values of ξ and β1 for which ADAM provably
converges to criticality with no convexity assumptions.
(The proof technique here might be of independent interest!)
Now let us try to gain some mathematical control on the neural
net landscape, at least in the depth-2 case, where RMSProp
and ADAM have almost similar performance.
48. An overview of our results about neural nets Why can the deep net do dictionary learning?
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors x* ∈ R^h are such that their non-zero
coordinates are sampled from an interval in R^+, their support is of size
at most h^p with p < 1/2, and A* ∈ R^{n×h} is incoherent enough, then a
constant ε can be chosen such that the autoencoder landscape,
E_{y=A* x*}[ ‖ y − W^T ReLU(W y − ε) ‖_2² ]
is asymptotically (in h) critical in a neighbourhood of A*.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find the right answer! Experiments in fact
suggest that, asymptotically in h, A* might even be a global minimum;
but as of now we have no clue how to prove such a thing!
56. Open questions
Explain ADAM! Why is ADAM so good at minimizing the
generalization error on autoencoders? (and many other nets!)
Even for the specific case of sparse coding, how does one analyze all
the critical points of the landscape, or even just (dis?)prove that the
right answer is a global minimum?
We have shown an example of a manifold of “high complexity” neural
functions. But in the space of deep net functions how dense are such
complex functions?
Can one exactly characterize the set of functions parameterized by
the architecture?
How to (dis?)prove the existence of dimension exponential gaps
between consecutive depths? (This isn’t clear even when restricted to
Boolean inputs and with unrestricted weights!)
Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates?
(A negative answer would immediately show that the deep-net
function class strictly increases with depth!)
Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)