Mathematics Of Neural Networks
Anirbit
AMS
Johns Hopkins University
( AMS Johns Hopkins University ) 1 / 25
Outline
1 Introduction
2 An overview of our results about neural nets
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
( AMS Johns Hopkins University ) 2 / 25
Introduction
This overview is based on the following 4 papers of ours,
ICML 2018 Workshop On Non-Convex Optimization (not yet public)
“Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders”
https://eccc.weizmann.ac.il/report/2017/190/
“Lower bounds over Boolean inputs for deep neural networks with ReLU gates”
https://arxiv.org/abs/1708.03735 (ISIT 2018)
“Sparse Coding and Autoencoders”
https://eccc.weizmann.ac.il/report/2017/098/ (ICLR 2018)
“Understanding Deep Neural Networks with Rectified Linear Units”
( AMS Johns Hopkins University ) 3 / 25
Introduction
The collaborators!
These are works with Amitabh Basu (AMS, JHU)
and different subsets of,
Akshay Rangamani (ECE, JHU)
Soham De (CS, UMD)
Enayat Ullah (CS, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
( AMS Johns Hopkins University ) 4 / 25
Introduction
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of this as representing a certain class (set) of $\mathbb{R}^4 \to \mathbb{R}^3$
functions which can be computed by the above “architecture” for a
*fixed* choice of “activation functions” (like, ReLU(x) = max{0, x}) at
each of the blue nodes. The yellow nodes are where the input vector
comes in and the orange nodes are where the output vector comes out.
( AMS Johns Hopkins University ) 5 / 25
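To make this concrete, here is a minimal Python/NumPy sketch of one member of such a class: a hypothetical $\mathbb{R}^4 \to \mathbb{R}^3$ architecture with a single hidden layer of 5 ReLU nodes. The widths and the random weights below are illustrative assumptions, not read off the diagram.

```python
import numpy as np

def relu(z):
    # The fixed activation at each blue (hidden) node.
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# One *choice* of edge weights picks out one function from the class
# that the fixed architecture (here: 4 -> 5 -> 3) represents.
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)   # input layer -> hidden layer
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)   # hidden layer -> output layer

def net(x):
    """An R^4 -> R^3 function computed by the architecture above."""
    return W2 @ relu(W1 @ x + b1) + b2

print(net(np.array([1.0, -2.0, 0.5, 3.0])))   # a vector in R^3
```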
An overview of our results about neural nets
Formalizing the questions about neural nets
(1) Exact trainability of the nets
Theorem (Ours)
Empirical risk minimization on 1-DNN with a convex loss, like
$$\min_{w_i, a_i, b_i, b} \; \frac{1}{S} \sum_{i=1}^{S} \Big\| y_i - \sum_{p=1}^{\text{width}} a_p \max\{0, \langle w_p, x_i \rangle + b_p\} \Big\|_2^2,$$
can be done in time $2^{\text{width}}\, S^{n \times \text{width}}\, \mathrm{poly}(n, S, \text{width})$.
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks or
ameliorating the dependency on width remains wildly
open!
( AMS Johns Hopkins University ) 6 / 25
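To fix the notation, here is a short NumPy sketch that simply evaluates the 1-DNN empirical risk of the theorem at a given parameter setting. The random data and parameters are placeholders, and this is only the objective, not the exact-minimization algorithm of the theorem.

```python
import numpy as np

def one_dnn_empirical_risk(W, a, b, X, y):
    """(1/S) * sum_i ( y_i - sum_p a_p * max{0, <w_p, x_i> + b_p} )^2 .

    W : (width, n) with rows w_p, a : (width,), b : (width,), X : (S, n), y : (S,).
    """
    hidden = np.maximum(0.0, X @ W.T + b)   # ReLU responses, shape (S, width)
    preds = hidden @ a                      # network outputs, shape (S,)
    return float(np.mean((y - preds) ** 2))

# Placeholder data and parameters, just to show the call.
rng = np.random.default_rng(1)
S, n, width = 100, 8, 4
X, y = rng.standard_normal((S, n)), rng.standard_normal(S)
W, a, b = rng.standard_normal((width, n)), rng.standard_normal(width), rng.standard_normal(width)
print(one_dnn_empirical_risk(W, a, b, X, y))
```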
An overview of our results about neural nets
Formalizing the questions about neural nets
(2) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that they can often
be set up so that the function they give us to optimize reveals this hidden
structure at its optima/critical points. In one classic scenario called “sparse
coding”, we will show proofs that the net’s loss function has certain nice
properties which possibly help reveal the hidden data-generation model
(the “dictionary”).
( AMS Johns Hopkins University ) 7 / 25
An overview of our results about neural nets
Formalizing the questions about neural nets
(3) The deep-net functions.
One of the themes that we have looked into a lot is to try to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
( AMS Johns Hopkins University ) 8 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture? No Clue!
Theorem (Ours)
A function $f : \mathbb{R}^n \to \mathbb{R}$ is continuous piecewise linear iff it is
representable by a ReLU deep net. Further, a ReLU deep net of depth at most
$1 + \lceil \log_2(n + 1) \rceil$ suffices to represent $f$. For $n = 1$ there is
also a sharp width lower bound.
( AMS Johns Hopkins University ) 9 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
A very small part of “The Big Question”
A simple (but somewhat surprising!) observation is the following fact,
Theorem (Ours)
1-DNN $\subsetneq$ 2-DNN, and the following $\mathbb{R}^2 \to \mathbb{R}$ function,
$(x_1, x_2) \mapsto \max\{0, x_1, x_2\}$, is in the gap.
Proof.
That 1-DNN ⊂ 2-DNN is obvious. Now observe that any $\mathbb{R}^2 \to \mathbb{R}$
1-DNN function is non-differentiable on a union of full lines (one line per
ReLU gate, where its argument vanishes), but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
( AMS Johns Hopkins University ) 10 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
A small part of “The Big Question” which is already unclear!
The family of 2-DNN functions is parameterized as follows by
(dimension compatible) choices of matrices $W_1, W_2$, vectors
$a, b_1, b_2$ and a number $b_3$,
$$f_{\text{2-DNN}}(x) = b_3 + \big\langle a,\; \max\{0,\, b_2 + W_2 \max\{0,\, b_1 + W_1 x\}\} \big\rangle$$
Can the $\mathbb{R}^4 \to \mathbb{R}$ function given as $x \mapsto \max\{0, x_1, x_2, x_3, x_4\}$ be
written in the above form?
(While it is easy to see that $\max\{0, x_1, x_2, \ldots, x_{2^k}\} \in$ (k+1)-DNN, as sketched below.)
( AMS Johns Hopkins University ) 11 / 25
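One standard way to see the parenthetical claim (a sketch of a folklore construction, not necessarily the exact one we use) is the following pair of identities, which let one hidden layer of ReLU gates halve the number of arguments of a max:
$$\max\{a, b\} \;=\; \frac{a + b}{2} + \frac{|a - b|}{2}, \qquad |z| = \max\{0, z\} + \max\{0, -z\}, \qquad z = \max\{0, z\} - \max\{0, -z\}.$$
Applying the first two identities to disjoint pairs reduces a max of $2^k$ numbers to a max of $2^{k-1}$ numbers using one hidden ReLU layer (the third identity passes the surviving linear terms through that layer), so $k$ layers compute $\max\{x_1, \ldots, x_{2^k}\}$ and one final ReLU gives $\max\{0, x_1, \ldots, x_{2^k}\}$, i.e. a (k+1)-DNN.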
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (We generalize a result by Matus Telgarsky (UIUC))
$\forall k \in \mathbb{N}$, there exists a continuum of $\mathbb{R} \to \mathbb{R}$ neural net functions
of depth $1 + k^2$ (and size $k^3$) which need size $\Omega(k^{k+1})$ at depths
$\leq 1 + k$.
Here the basic intuition is that if one starts with a small-depth function
which is oscillating, then *without* blowing up the width too much, higher
depths can be set up to recursively increase the number of oscillations. Such
functions then become very hard for the smaller depths to even approximate
in $\ell_1$ norm unless they blow up in size.
( AMS Johns Hopkins University ) 12 / 25
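To make the oscillation intuition concrete, here is a small numerical sketch in the spirit of Telgarsky's classic construction (not necessarily the exact family in our theorem): a width-2 ReLU "tent" block composed with itself doubles the number of linear pieces with every extra layer.

```python
import numpy as np

def tent(x):
    """One oscillation-doubling layer: two ReLU gates computing the tent map on [0, 1]."""
    return np.maximum(0.0, 2.0 * x) - np.maximum(0.0, 4.0 * x - 2.0)

def iterated_tent(x, depth):
    """Compose the tent map `depth` times: a ReLU net of depth ~`depth` and width 2."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_oscillations(f, grid):
    """Count sign changes of the discrete slope, a crude proxy for the number of linear pieces."""
    y = f(grid)
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1]))

grid = np.linspace(0.0, 1.0, 200001)
for d in (1, 2, 3, 4, 5):
    print(d, count_oscillations(lambda x: iterated_tent(x, d), grid))
# The count roughly doubles with each extra layer (~2^d pieces) while the width stays 2:
# exactly the kind of growth that shallow nets cannot match without a large blow-up in size.
```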
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,
Theorem (Ours)
There is at least an $\Omega(n)$ gap between Sum-of-ReLU and
Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube $\{0, 1\}^n$
given as $f(x) = \sum_{i=1}^{n} 2^{i-1} x_i$. This has $2^n$ level sets on the discrete
cube and hence needs that many polyhedral cells to be produced by
the hyperplanes of the Sum-of-LTF circuit, whereas being a linear
function it can be implemented by just 2 ReLU gates (see the sketch below)!
( AMS Johns Hopkins University ) 13 / 25
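A quick numerical check of the "2 ReLU gates" part of this proof, using the identity $w^\top x = \max\{0, w^\top x\} - \max\{0, -w^\top x\}$ (the bit ordering of the weights is an inessential relabeling):

```python
import numpy as np

n = 10
w = 2.0 ** np.arange(n)            # weights 2^{i-1}, i = 1..n
X = np.array([[int(b) for b in np.binary_repr(k, n)] for k in range(2 ** n)], dtype=float)

linear = X @ w                                                # f(x) = sum_i 2^{i-1} x_i (up to bit ordering)
two_relus = np.maximum(0, X @ w) - np.maximum(0, -(X @ w))    # Sum-of-ReLU with just 2 gates

assert np.allclose(linear, two_relus)
print(len(set(linear.tolist())))   # 2^n distinct values, i.e. 2^n level sets on {0,1}^n
```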
An overview of our results about neural nets What functions does a deep net represent?
Now that we are done with the preliminaries, we move on to
the results which seem to need significantly more effort.
( AMS Johns Hopkins University ) 14 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
The *ideal* depth separation!
Can one show neural functions at every depth such that representing them
by circuits of even one depth less necessarily requires size $\Omega(e^{\text{dimension}})$?
This is a major open question, and over real inputs this is currently known
only between 2-DNN and 1-DNN from the works of Eldan-Shamir and Amit Daniely.
We go beyond small depth lower bounds in the following restricted sense,
( AMS Johns Hopkins University ) 15 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)$^{d-1}$
circuits require size
$$\Omega\left( (d-1)\, \frac{2^{\frac{(\text{dimension})^{1/8}}{d-1}}}{\big((\text{dimension})\, W\big)^{\frac{1}{d-1}}} \right)$$
when the bottom-most layer weight vectors are such that their coordinates are
integers of size at most $W$ and these weight vectors all induce the same
ordering on the set $\{-1, 1\}^{\text{dimension}}$ when its points are ranked by the
value of their inner product with the weight vector.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” is quadratically (in dimension) bounded for the functions
computed by such circuits, thought of as matrices of dimension
$2^{\frac{\text{dimension}}{2}} \times 2^{\frac{\text{dimension}}{2}}$. (And we recall that small-depth, small-size
functions are known which have exponentially large sign-rank.)
( AMS Johns Hopkins University ) 16 / 25
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions
Despite the results of Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any $\delta \in (0, \tfrac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \geq N(\delta)$
and $\epsilon > \frac{2 \log^{\frac{2}{2-\delta}}(n)}{n}$, any LTF-of-ReLU circuit on $n$ bits that
matches the Andreev function on $n$ bits for at least a $\tfrac{1}{2} + \epsilon$
fraction of the inputs has size $\Omega\big(\epsilon^{2(1-\delta)}\, n^{1-\delta}\big)$.
This is proven by the “method of random restrictions” and in particular a very
recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on
the Littlewood-Offord theorem.
( AMS Johns Hopkins University ) 17 / 25
An overview of our results about neural nets Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able to
solve so many diverse kinds of real-life learning problems. It is a serious
mathematical challenge to understand how the deep net “sees” these as
optimization questions.
For a net, say $N$, and a distribution $\mathcal{D}$, let us call its “landscape” ($L$)
corresponding to a “loss function $\ell$” (typically the squared loss),
$$L(\mathcal{D}, N) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(y, N(x))\big]$$
Why is this $L$ so often a nice function to optimize in order to solve a
question which a priori had nothing to do with nets?
( AMS Johns Hopkins University ) 18 / 25
An overview of our results about neural nets Why can the deep net do dictionary learning?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where given access
to vectors y = A∗x∗ and some distributional (sparsity) guarantees
about x∗ we try to infer A∗. Breakthrough work by Spielman, Wang
and Wright (2012) : This is sometimes provably doable in poly-time!
In this work we attempt to progress towards giving some rigorous
explanation for the observation that nets seem to solve sparse coding!
( AMS Johns Hopkins University ) 19 / 25
An overview of our results about neural nets Why can the deep net do dictionary learning?
Sparse coding
The defining equations of our autoencoder computing $\tilde{y} \in \mathbb{R}^n$ from $y \in \mathbb{R}^n$:
The generative model: sparse $x^* \in \mathbb{R}^h$ and $y = A^* x^* \in \mathbb{R}^n$, with $h \gg n$.
$$h = \mathrm{ReLU}(W y - \epsilon) = \max\{0,\, W y - \epsilon\} \in \mathbb{R}^h$$
$$\tilde{y} = W^T h \in \mathbb{R}^n$$
( AMS Johns Hopkins University ) 20 / 25
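A minimal NumPy sketch of this generative model and of the autoencoder's forward pass. The sizes, the stand-in dictionary, and the threshold value below are illustrative assumptions rather than the settings from our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k = 100, 256, 3           # assumed sizes: observed dim, code dim, sparsity of x*

# A stand-in dictionary A* with unit-norm columns.
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)

# Generative model: y = A* x* with a k-sparse x* whose non-zero entries are positive.
x_star = np.zeros(h)
x_star[rng.choice(h, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
y = A_star @ x_star

# Autoencoder forward pass at some weights W and threshold eps.
W, eps = A_star.T, 0.7                       # W set at the transposed ground truth; eps hand-picked
hidden = np.maximum(0.0, W @ y - eps)        # h = ReLU(W y - eps)
y_tilde = W.T @ hidden                       # y~ = W^T h

print(np.linalg.norm(y - y_tilde) / np.linalg.norm(y))   # relative reconstruction error
```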
An overview of our results about neural nets Why can the deep net do dictionary learning?
The power of autoencoders is surprisingly easy to demonstrate!
Software : TensorFlow (with a complicated iterative technique
called “RMSProp” which we shall explain in the next slide!)
6000 training examples and 1000 testing examples for each digit
n = 784, and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 (across the two hidden layers) for the 2-DNN.
( AMS Johns Hopkins University ) 21 / 25
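For reference, here is a minimal TensorFlow/Keras sketch of the kind of 1-DNN autoencoder experiment described above (784 → 10000 ReLU → 784, squared loss, RMSProp). This is an illustrative reconstruction of the setup, not the exact script behind the reported experiments, and the hyperparameters shown are assumptions.

```python
import tensorflow as tf

# MNIST digits, flattened to 784-dimensional vectors in [0, 1].
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# 1-DNN autoencoder: one hidden layer of 10000 ReLU gates, linear output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10000, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(784),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="mse")

# Train to reconstruct the input (targets = inputs).
model.fit(x_train, x_train, epochs=5, batch_size=256,
          validation_data=(x_test, x_test))
```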
An overview of our results about neural nets Why can the deep net do dictionary learning?
What exactly do algorithms like ADAM and RMSProp do?
Algorithm ADAM on a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$
1: function ADAM($x_1, \beta_1, \beta_2, \alpha, \xi$)
2:   Initialize: $m_0 = 0$, $v_0 = 0$
3:   for $t = 1, 2, \ldots$ do
4:     $g_t = \nabla f(x_t)$
5:     $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
6:     $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
7:     $V_t = \mathrm{diag}(v_t)$
8:     $x_{t+1} = x_t - \alpha_t \big( V_t^{1/2} + \mathrm{diag}(\xi \mathbf{1}_d) \big)^{-1} m_t$
9:   end for
10: end function
These “adaptive gradient” algorithms like ADAM (or RMSProp = ADAM
at $\beta_1 = 0$), which seem to work best on autoencoder
neural nets, are currently very poorly understood!
( AMS Johns Hopkins University ) 22 / 25
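For concreteness, here is a direct NumPy transcription of the pseudocode above (no bias correction, just as in the listing); setting $\beta_1 = 0$ recovers RMSProp. The quadratic test problem at the end is only a usage illustration.

```python
import numpy as np

def adam(grad, x1, beta1=0.9, beta2=0.999, alpha=1e-3, xi=1e-8, T=1000):
    """ADAM as in the pseudocode above; `grad` returns the gradient of f at x.

    `alpha` is used as a constant step size here, whereas the pseudocode
    allows a schedule alpha_t. beta1 = 0 gives RMSProp.
    """
    x = np.asarray(x1, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment EMA (m_t)
        v = beta2 * v + (1 - beta2) * g ** 2     # coordinate-wise second-moment EMA (v_t)
        x = x - alpha * m / (np.sqrt(v) + xi)    # (V_t^{1/2} + xi I)^{-1} m_t, diagonal case
    return x

# Usage example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
x_final = adam(lambda x: x, x1=np.ones(5), T=5000)
print(np.linalg.norm(x_final))   # close to 0
```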
An overview of our results about neural nets Why can the deep net do dictionary learning?
What exactly do algorithms like ADAM and RMSProp do?
Our experimental conclusions and proofs about ADAM
We have shown controlled experiments which suggest that for
large enough autoencoders, standard methods possibly cannot
surpass ADAM’s ability to reduce training as well as test
losses, particularly when its parameters are set as $\beta_1 \sim 0.99$, in
both full-batch and mini-batch settings.
[Theorem] There exists a sequence of step-size choices and
ranges of values of $\xi$ and $\beta_1$ for which ADAM provably
converges to criticality with no convexity assumptions.
(The proof technique here might be of independent interest!)
Now let’s try to gain some mathematical control on the neural
net landscape, at least in the depth-2 case where RMSProp
and ADAM have almost similar performance.
( AMS Johns Hopkins University ) 23 / 25
An overview of our results about neural nets Why can the deep net do dictionary learning?
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors $x^* \in \mathbb{R}^h$ are such that their non-zero
coordinates are sampled from an interval in $\mathbb{R}^+$, the support is of size
at most $h^p$ with $p < \tfrac{1}{2}$, and $A^* \in \mathbb{R}^{n \times h}$ is incoherent enough, then a
constant $\epsilon$ can be chosen such that the autoencoder landscape,
$$\mathbb{E}_{y = A^* x^*}\Big[\, \big\| y - W^T\, \mathrm{ReLU}(W y - \epsilon) \big\|_2^2 \,\Big],$$
is asymptotically (in $h$) critical in a neighbourhood of $A^*$.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find the right answer! Experiments in fact
suggest that asymptotically in $h$, $A^*$ might even be a global minimum,
but as of now we have no clue how to prove such a thing!
( AMS Johns Hopkins University ) 24 / 25
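To read the objective in the theorem concretely, here is a NumPy sketch of a Monte Carlo estimator of this landscape $L(W)$ under the generative model above. The sizes, the stand-in dictionary, and the threshold are illustrative assumptions, and evaluating $L$ like this does not by itself check the asymptotic-criticality claim (which is about the gradient near $A^*$).

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k = 100, 256, 3          # assumed sizes; the theorem's regime is support ~ h^p, p < 1/2

A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)      # stand-in incoherent dictionary

def sample_y(batch):
    """Draw y = A* x* with k-sparse x* whose non-zero entries lie in a positive interval."""
    Y = np.zeros((batch, n))
    for b in range(batch):
        x = np.zeros(h)
        x[rng.choice(h, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
        Y[b] = A_star @ x
    return Y

def landscape(W, eps, Y):
    """Monte Carlo estimate of E_{y = A* x*} || y - W^T ReLU(W y - eps) ||_2^2 ."""
    H = np.maximum(0.0, Y @ W.T - eps)        # hidden codes, one per row of Y
    return float(np.mean(np.sum((Y - H @ W) ** 2, axis=1)))

Y = sample_y(2000)
eps = 0.7                                      # hand-picked threshold, not the one from the proof
print("L at W = A*^T   :", landscape(A_star.T, eps, Y))
print("L at a random W :", landscape(rng.standard_normal((h, n)) / np.sqrt(n), eps, Y))
```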
Open questions
Explain ADAM! Why is ADAM so good at minimizing the
generalization error on autoencoders? (and many other nets!)
Even for the specific case of sparse coding, how to analyze all the
critical points of the landscape, or even just (dis?)prove that the right
answer is a global minimum?
We have shown an example of a manifold of “high complexity” neural
functions. But in the space of deep net functions how dense are such
complex functions?
Can one exactly characterize the set of functions parameterized by
the architecture?
How to (dis?)prove the existence of dimension exponential gaps
between consecutive depths? (This isn’t clear even when restricted to
Boolean inputs and with unrestricted weights!)
Can the max of $2^k + 1$ numbers be taken using $k$ layers of ReLU gates?
(A negative answer immediately shows that with depth the deep net
function class strictly increases!)
Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)
( AMS Johns Hopkins University ) 25 / 25
More Related Content

What's hot

SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...IJNSA Journal
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronMostafa G. M. Mostafa
 
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Universitat Politècnica de Catalunya
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)SungminYou
 
Understanding RNN and LSTM
Understanding RNN and LSTMUnderstanding RNN and LSTM
Understanding RNN and LSTM健程 杨
 
TypeScript and Deep Learning
TypeScript and Deep LearningTypeScript and Deep Learning
TypeScript and Deep LearningOswald Campesato
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
mohsin dalvi artificial neural networks questions
mohsin dalvi   artificial neural networks questionsmohsin dalvi   artificial neural networks questions
mohsin dalvi artificial neural networks questionsAkash Maurya
 
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo,  EPFL, SwitzerlandHybrid neural networks for time series learning by Tian Guo,  EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo, EPFL, SwitzerlandEuroIoTa
 
RNN Explore
RNN ExploreRNN Explore
RNN ExploreYan Kang
 
Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...SungminYou
 
Synthetic dialogue generation with Deep Learning
Synthetic dialogue generation with Deep LearningSynthetic dialogue generation with Deep Learning
Synthetic dialogue generation with Deep LearningS N
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksTaegyun Jeon
 

What's hot (20)

AlexNet
AlexNetAlexNet
AlexNet
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
 
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer Perceptron
 
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 
Understanding RNN and LSTM
Understanding RNN and LSTMUnderstanding RNN and LSTM
Understanding RNN and LSTM
 
TypeScript and Deep Learning
TypeScript and Deep LearningTypeScript and Deep Learning
TypeScript and Deep Learning
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
mohsin dalvi artificial neural networks questions
mohsin dalvi   artificial neural networks questionsmohsin dalvi   artificial neural networks questions
mohsin dalvi artificial neural networks questions
 
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo,  EPFL, SwitzerlandHybrid neural networks for time series learning by Tian Guo,  EPFL, Switzerland
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland
 
RNN Explore
RNN ExploreRNN Explore
RNN Explore
 
Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...Recurrent neural networks for sequence learning and learning human identity f...
Recurrent neural networks for sequence learning and learning human identity f...
 
Synthetic dialogue generation with Deep Learning
Synthetic dialogue generation with Deep LearningSynthetic dialogue generation with Deep Learning
Synthetic dialogue generation with Deep Learning
 
rnn BASICS
rnn BASICSrnn BASICS
rnn BASICS
 
Lec10new
Lec10newLec10new
Lec10new
 
Deepwalk vs Node2vec
Deepwalk vs Node2vecDeepwalk vs Node2vec
Deepwalk vs Node2vec
 
The impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classificationThe impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classification
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural Networks
 

Similar to My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018)

Neural networks and deep learning
Neural networks and deep learningNeural networks and deep learning
Neural networks and deep learningRADO7900
 
20141003.journal club
20141003.journal club20141003.journal club
20141003.journal clubHayaru SHOUNO
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Oswald Campesato
 
[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun YooJaeJun Yoo
 
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RUnderstanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RManish Saraswat
 
DEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdfDEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdfAamirMaqsood8
 
Talk at MIT, Maths on deep neural networks
Talk at MIT, Maths on deep neural networks Talk at MIT, Maths on deep neural networks
Talk at MIT, Maths on deep neural networks Anirbit Mukherjee
 
Neural Networks Ver1
Neural  Networks  Ver1Neural  Networks  Ver1
Neural Networks Ver1ncct
 
ADFUNN
ADFUNNADFUNN
ADFUNNadfunn
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdfnyomans1
 
Machine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronMachine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronAndrew Ferlitsch
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksAndrew Ferlitsch
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionIJCSIS Research Publications
 
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...cscpconf
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...csandit
 

Similar to My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018) (20)

tutorial.ppt
tutorial.ppttutorial.ppt
tutorial.ppt
 
Neural networks and deep learning
Neural networks and deep learningNeural networks and deep learning
Neural networks and deep learning
 
20141003.journal club
20141003.journal club20141003.journal club
20141003.journal club
 
Java and Deep Learning
Java and Deep LearningJava and Deep Learning
Java and Deep Learning
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)
 
[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo[PR12] Inception and Xception - Jaejun Yoo
[PR12] Inception and Xception - Jaejun Yoo
 
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in RUnderstanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
 
DEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdfDEEPLEARNING recurrent neural networs.pdf
DEEPLEARNING recurrent neural networs.pdf
 
Talk at MIT, Maths on deep neural networks
Talk at MIT, Maths on deep neural networks Talk at MIT, Maths on deep neural networks
Talk at MIT, Maths on deep neural networks
 
Neural Networks Ver1
Neural  Networks  Ver1Neural  Networks  Ver1
Neural Networks Ver1
 
ADFUNN
ADFUNNADFUNN
ADFUNN
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdf
 
Machine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronMachine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - Perceptron
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
 
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
Deep Learning for Computer Vision: Deep Networks (UPC 2016)Deep Learning for Computer Vision: Deep Networks (UPC 2016)
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware Detection
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
 
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
X-TREPAN : A Multi Class Regression and Adapted Extraction of Comprehensible ...
 

Recently uploaded

A Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on EarthA Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on EarthSérgio Sacani
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSELF-EXPLANATORY
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...Health Advances
 
FAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable PredictionsFAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable PredictionsMichel Dumontier
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxANONYMOUS
 
SAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniquesSAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniquesrodneykiptoo8
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxmuralinath2
 
Jet reorientation in central galaxies of clusters and groups: insights from V...
Jet reorientation in central galaxies of clusters and groups: insights from V...Jet reorientation in central galaxies of clusters and groups: insights from V...
Jet reorientation in central galaxies of clusters and groups: insights from V...Sérgio Sacani
 
NuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent UniversityNuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent Universitypablovgd
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionpablovgd
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxRUDYLUMAPINET2
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsmuralinath2
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Sérgio Sacani
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Sérgio Sacani
 
mixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategymixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategyMansiBishnoi1
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinossaicprecious19
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...muralinath2
 
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPirithiRaju
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
 
GBSN - Microbiology (Lab 1) Microbiology Lab Safety Procedures
GBSN -  Microbiology (Lab  1) Microbiology Lab Safety ProceduresGBSN -  Microbiology (Lab  1) Microbiology Lab Safety Procedures
GBSN - Microbiology (Lab 1) Microbiology Lab Safety ProceduresAreesha Ahmad
 

Recently uploaded (20)

A Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on EarthA Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on Earth
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
FAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable PredictionsFAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable Predictions
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptx
 
SAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniquesSAMPLING.pptx for analystical chemistry sample techniques
SAMPLING.pptx for analystical chemistry sample techniques
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Jet reorientation in central galaxies of clusters and groups: insights from V...
Jet reorientation in central galaxies of clusters and groups: insights from V...Jet reorientation in central galaxies of clusters and groups: insights from V...
Jet reorientation in central galaxies of clusters and groups: insights from V...
 
NuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent UniversityNuGOweek 2024 full programme - hosted by Ghent University
NuGOweek 2024 full programme - hosted by Ghent University
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
Seminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptxSeminar on Halal AGriculture and Fisheries.pptx
Seminar on Halal AGriculture and Fisheries.pptx
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...Climate extremes likely to drive land mammal extinction during next supercont...
Climate extremes likely to drive land mammal extinction during next supercont...
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
 
mixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategymixotrophy in cyanobacteria: a dual nutritional strategy
mixotrophy in cyanobacteria: a dual nutritional strategy
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdfPests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
Pests of Green Manures_Bionomics_IPM_Dr.UPR.pdf
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
GBSN - Microbiology (Lab 1) Microbiology Lab Safety Procedures
GBSN -  Microbiology (Lab  1) Microbiology Lab Safety ProceduresGBSN -  Microbiology (Lab  1) Microbiology Lab Safety Procedures
GBSN - Microbiology (Lab 1) Microbiology Lab Safety Procedures
 

My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018)

  • 1. Mathematics Of Neural Networks Anirbit AMS Johns Hopkins University ( AMS Johns Hopkins University ) 1 / 25
  • 2. Outline 1 Introduction 2 An overview of our results about neural nets What functions does a deep net represent? Why can the deep net do dictionary learning? 3 Open questions ( AMS Johns Hopkins University ) 2 / 25
  • 3. Introduction This overview is based on the following 4 papers of ours, ( AMS Johns Hopkins University ) 3 / 25
  • 4. Introduction This overview is based on the following 4 papers of ours, ICML 2018 Workshop On Non-Convex Optimization (Not yet public) “Convergence guarantees for RMSProp and ADAM in non-convex optimiza- tion and their comparison to Nesterov acceleration on autoencoders” https://eccc.weizmann.ac.il/report/2017/190/ “Lower bounds over Boolean inputs for deep neural networks with ReLU gates” https://arxiv.org/abs/1708.03735 (ISIT 2018) “Sparse Coding and Autoencoders” https://eccc.weizmann.ac.il/report/2017/098/(ICLR 2018) “Understanding Deep Neural Networks with Rectified Linear Units” ( AMS Johns Hopkins University ) 3 / 25
  • 5. Introduction The collaborators! These are works with Amitabh Basu (AMS, JHU) and different subsets of, Akshay Rangamani (ECE, JHU) Soham De (CS, UMD) Enayat Ullah (CS, JHU) Tejaswini Ganapathy (Salesforce, San Francisco Bay Area) Ashish Arora, Trac D.Tran (ECE, JHU) Raman Arora, Poorya Mianjy (CS, JHU) Sang (Peter) Chin (CS, BU) ( AMS Johns Hopkins University ) 4 / 25
  • 6. Introduction What is a neural network? The following diagram (imagine it as a directed acyclic graph where all edges are pointing to the right) represents an instance of a “neural network”. Since there are no “weights” assigned to the edges of the above graph, one should think of this as representing a certain class (set) of R4 → R3 functions which can be computed by the above “architecture” for a *fixed* choice of “activation functions” (like, ReLU(x) = max{0, x}) at each of the blue nodes. The yellow nodes are where the input vector comes in and the orange nodes are where the output vector comes out. ( AMS Johns Hopkins University ) 5 / 25
  • 7. An overview of our results about neural nets Formalizing the questions about neural nets (1) Exact trainability of the nets Theorem (Ours) Empirical risk minimization on 1-DNN with a convex loss, like minwi ,ai ,bi ,b 1 S S i=1 yi − width p=1 ap max{0, wp, xi + bp} 2 2 can be done in time, 2width Sn×width poly(n, S, width).
  • 8. An overview of our results about neural nets Formalizing the questions about neural nets (1) Exact trainability of the nets Theorem (Ours) Empirical risk minimization on 1-DNN with a convex loss, like minwi ,ai ,bi ,b 1 S S i=1 yi − width p=1 ap max{0, wp, xi + bp} 2 2 can be done in time, 2width Sn×width poly(n, S, width). This is the *only* algorithm we are aware of which gets exact global minima of the empirical risk of some net in time polynomial in any of the parameters. The possibility of a similar result for deeper networks or ameliorating the dependency on width remains wildly open! ( AMS Johns Hopkins University ) 6 / 25
  • 9. An overview of our results about neural nets Formalizing the questions about neural nets (2) Structure discovery by the nets Real-life data can be modeled as observations of some structured distribution. One view of the success of neural nets can be to say that somehow nets can often be set up in such a way that they give a function to optimize over which reveals this hidden structure at its optima/critical points. In one classic scenario called the “sparse coding” we will show proofs about how the net’s loss function has certain nice properties which are possibly helping towards revealing the hidden data generation model (the “dictionary”). ( AMS Johns Hopkins University ) 7 / 25
  • 10. An overview of our results about neural nets Formalizing the questions about neural nets (3) The deep-net functions. One of the themes that we have looked into a lot is to try to find good descriptions of the functions that nets can compute. ( AMS Johns Hopkins University ) 8 / 25
  • 11. An overview of our results about neural nets Formalizing the questions about neural nets (3) The deep-net functions. One of the themes that we have looked into a lot is to try to find good descriptions of the functions that nets can compute. Let us start with this last kind of questions! ( AMS Johns Hopkins University ) 8 / 25
• 14. What functions does a deep net represent? The questions about the function space

"The Big Question!" Can one find a complete characterization of the neural functions parametrized by architecture? No clue!

Theorem (Ours). A function f : Rⁿ → R is continuous piecewise linear iff it is representable by a ReLU deep net. Further, a ReLU deep net of depth at most 1 + ⌈log₂(n + 1)⌉ suffices to represent f. For n = 1 there is also a sharp width lower bound.
• 16. What functions does a deep net represent? A very small part of "The Big Question"

A simple (but somewhat surprising!) observation is the following fact.

Theorem (Ours). 1-DNN ⊊ 2-DNN, and the R² → R function (x₁, x₂) ↦ max{0, x₁, x₂} is in the gap.

Proof. That 1-DNN ⊆ 2-DNN is obvious. Now observe that any R² → R 1-DNN function is non-differentiable on a union of full lines (one line along each ReLU gate's argument), but the given function is non-differentiable on a union of 3 half-lines. Hence proved!
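As a quick sanity check on the containment side of this statement (the code is ours, not from the slides), here is a numpy sketch verifying that max{0, x₁, x₂} is computable with two layers of ReLU gates, via the identity max{a, b} = a + ReLU(b − a):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_0_x1_x2_depth2(x1, x2):
    # Two nested ReLU layers: max{0, x1, x2} = max{ReLU(x1), x2}
    #                                        = ReLU(x1) + ReLU(x2 - ReLU(x1)).
    h = relu(x1)                 # first hidden layer
    return h + relu(x2 - h)      # second hidden layer plus linear output

rng = np.random.default_rng(1)
pts = rng.standard_normal((1000, 2))
lhs = np.array([max_0_x1_x2_depth2(a, b) for a, b in pts])
rhs = np.maximum(0.0, np.maximum(pts[:, 0], pts[:, 1]))
assert np.allclose(lhs, rhs)
print("max{0, x1, x2} matches the 2-layer ReLU formula on random points")
```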
• 18. What functions does a deep net represent? A small part of "The Big Question" which is already unclear!

The family of 2-DNN functions is parameterized as follows by (dimension-compatible) choices of matrices W₁, W₂, vectors a, b₁, b₂ and a number b₃,
$f_{\text{2-DNN}}(x) = b_3 + \big\langle a,\ \max\{0,\ b_2 + W_2 \max\{0,\ b_1 + W_1 x\}\} \big\rangle$

Can the R⁴ → R function x ↦ max{0, x₁, x₂, x₃, x₄} be written in the above form? (While it's easy to see that max{0, x₁, x₂, ..., x_{2^k}} ∈ (k+1)-DNN.)
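For concreteness, a minimal numpy sketch of this parameterization (the hidden widths and random weights below are purely illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_2dnn(x, W1, b1, W2, b2, a, b3):
    """b3 + <a, ReLU(b2 + W2 ReLU(b1 + W1 x))>, the 2-DNN family from the slide."""
    return b3 + a @ relu(b2 + W2 @ relu(b1 + W1 @ x))

# Example: a random member of the family on R^4 with hidden widths 6 and 5.
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((6, 4)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((5, 6)), rng.standard_normal(5)
a, b3 = rng.standard_normal(5), rng.standard_normal()
print(f_2dnn(rng.standard_normal(4), W1, b1, W2, b2, a, b3))
```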
• 20. What functions does a deep net represent? Depth separation for R → R nets

Can one show neural functions at every depth such that lower depths necessarily require a much larger size to represent them?

Theorem (We generalize a result by Matus Telgarsky (UIUC)). For every k ∈ N, there exists a continuum of R → R neural net functions of depth 1 + k² (and size k³) which need size Ω(k^{k+1}) at depths ≤ 1 + k.

Here the basic intuition is that if one starts with a small-depth function which oscillates, then *without* blowing up the width too much, higher depths can be set up to recursively increase the number of oscillations. Such functions then become very hard for the smaller depths to even approximate in the ℓ₁ norm unless they blow up in size.
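A small illustration of that intuition (ours, not from the papers): the "tent" map is a one-hidden-layer ReLU function, and composing it with itself doubles the number of oscillations with each extra layer of depth.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One hidden layer of ReLU gates computing the tent map on [0, 1]:
    # tent(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def iterated_tent(x, depth):
    # Composing the tent map `depth` times: a ReLU net of depth `depth`
    # whose number of linear pieces grows like 2^depth.
    for _ in range(depth):
        x = tent(x)
    return x

xs = np.linspace(0.0, 1.0, 100001)
for depth in [1, 2, 3, 4]:
    ys = iterated_tent(xs, depth)
    # Count slope sign changes as a crude proxy for the number of oscillations.
    slopes = np.sign(np.diff(ys))
    pieces = 1 + np.count_nonzero(np.diff(slopes))
    print(f"depth {depth}: ~{pieces} monotone pieces")
```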
• 24. What functions does a deep net represent? Separations for Boolean functions with one layer of gates

For real-valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation? The best gap we know of is the following.

Theorem (Ours). There is at least an Ω(n) gap in size between Sum-of-ReLU and Sum-of-LTF.

Proof. This follows by looking at the function on the hypercube {0, 1}ⁿ given by f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This has 2ⁿ level sets on the discrete cube and hence needs that many polyhedral cells to be produced by the hyperplanes of the Sum-of-LTF circuit, whereas, being a linear function, it can be implemented by just 2 ReLU gates!
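A quick check of the last step (our sketch, not from the slides): any linear function ⟨w, x⟩ equals ReLU(⟨w, x⟩) − ReLU(−⟨w, x⟩), so two ReLU gates suffice for f, even though f takes 2ⁿ distinct values on the cube.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

n = 10
w = 2.0 ** np.arange(n)                                                # weights 2^{i-1}, i = 1..n
X = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)  # all of {0,1}^n

f_linear = X @ w                              # f(x) = sum_i 2^{i-1} x_i
f_two_relus = relu(X @ w) - relu(-(X @ w))    # the same function built from 2 ReLU gates

assert np.allclose(f_linear, f_two_relus)
assert len(np.unique(f_linear)) == 2 ** n     # 2^n distinct level sets, as claimed
print("2 ReLU gates implement f, and f takes 2^n distinct values")
```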
• 25. What functions does a deep net represent?

Now that we are done with the preliminaries, we move on to the results which seem to need significantly more effort.
• 27. What functions does a deep net represent? The *ideal* depth separation!

Can one show neural functions at every depth such that representing them by circuits of even one depth less necessarily requires size Ω(e^{dimension})? This is a major open question, and over real inputs it is currently known only between 2-DNN and 1-DNN, from the works of Eldan-Shamir and Amit Daniely.

We go beyond small-depth lower bounds in the following restricted sense,
• 29. What functions does a deep net represent? The questions about the function space

Theorem (Ours). There exist Boolean functions computable by small depth-2 circuits such that LTF-of-(ReLU)^{d−1} circuits require size
$\Omega\!\left( \frac{(d-1)\; 2^{\frac{(\mathrm{dimension})^{1/8}}{d-1}}}{\big((\mathrm{dimension})\, W\big)^{\frac{1}{d-1}}} \right)$
when the bottom-most layer weight vectors are such that their coordinates are integers of size at most W, and these weight vectors induce the same ordering on the set {−1, 1}^{dimension} when it is ranked by the value of the inner product with them. (Note that all other weights are left completely free!)

This is achieved by showing that, under the above restriction, the "sign-rank" is quadratically (in the dimension) bounded for the functions computed by such circuits, thought of as matrices of dimension $2^{\mathrm{dimension}/2} \times 2^{\mathrm{dimension}/2}$. (And we recall that small-depth, small-size functions are known which have exponentially large sign-rank.)
• 31. What functions does a deep net represent? Separations for Boolean functions

Despite the results of Eldan-Shamir and Amit Daniely, the curiosity still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than LTF-of-ReLU for Boolean functions.

Theorem (Ours). For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ) and $\epsilon > \frac{2 \log^{\frac{2}{2-\delta}}(n)}{n}$, any LTF-of-ReLU circuit on n bits that matches the Andreev function on n bits on at least a 1/2 + ε fraction of the inputs has size $\Omega\big(2^{(1-\delta)\, n^{1-\delta}}\big)$.

This is proven by the "method of random restrictions", and in particular a very recent version of it due to Daniel Kane (UCSD) and Ryan Williams (MIT), based on the Littlewood-Offord theorem.
• 34. Why can the deep net do dictionary learning? What makes the deep-net landscape special?

A fundamental challenge with deep nets is to explain why they are able to solve so many diverse kinds of real-life learning problems. It is a serious mathematical challenge to understand how the deep net "sees" these as optimization questions.

For a net N and a data distribution D, let us call its "landscape" L, corresponding to a loss function ℓ (typically the squared loss),
$L(\mathcal{D}, N) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\, \ell(y, N(x)) \,\big]$

Why is this L so often, somehow, a nice function to optimize on, in order to solve a question which a priori had nothing to do with nets?
• 36. Why can the deep net do dictionary learning? Sparse coding

We isolate one special optimization question where we can attempt to offer some mathematical explanation for this phenomenon.

"Sparse coding" is a classic learning challenge: given access to vectors y = A* x* and some distributional (sparsity) guarantees about x*, we try to infer A*. Breakthrough work by Spielman, Wang and Wright (2012): this is sometimes provably doable in poly-time!

In this work we attempt to make progress towards a rigorous explanation of the observation that nets seem to solve sparse coding!
• 37. Why can the deep net do dictionary learning? Sparse coding

The generative model: sparse x* ∈ R^h and y = A* x* ∈ Rⁿ, with h ≫ n.

The defining equations of our autoencoder computing ỹ ∈ Rⁿ from y ∈ Rⁿ:
h = ReLU(W y − ε) = max{0, W y − ε} ∈ R^h
ỹ = Wᵀ h ∈ Rⁿ
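A minimal numpy sketch of this generative model and autoencoder (the dimensions, sparsity level and threshold value below are our illustrative assumptions, not the paper's settings):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(3)
n, h, sparsity = 50, 400, 5           # observed dim, code dim (h >> n), support size

# Generative model: y = A* x* with x* sparse and nonnegative on its support.
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)          # unit-norm dictionary columns
x_star = np.zeros(h)
support = rng.choice(h, size=sparsity, replace=False)
x_star[support] = rng.uniform(0.5, 1.0, size=sparsity)
y = A_star @ x_star

# The autoencoder: encode with one ReLU layer, decode with the transposed weights.
W = rng.standard_normal((h, n))
eps = 0.1                                          # illustrative threshold
code = relu(W @ y - eps)                           # h = ReLU(W y - eps)
y_tilde = W.T @ code                               # y~ = W^T h
print("reconstruction error:", np.linalg.norm(y - y_tilde))
```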
• 40. Why can the deep net do dictionary learning? The power of autoencoders is surprisingly easy to demonstrate!

Software: TensorFlow (with a complicated iterative technique called "RMSProp", which we shall explain in the next slide!). 6000 training examples and 1000 testing examples for each digit. n = 784, and the number of ReLU gates was 10000 for the 1-DNN, and 5000 and 784 for the 2-DNN.
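A hedged tf.keras sketch of such an experiment, using the layer widths quoted above for the 2-DNN; every other choice here (the final linear layer, optimizer settings, epochs, batch size) is our assumption and not necessarily what was used:

```python
import tensorflow as tf

# MNIST: roughly 6000 training and 1000 test examples per digit, flattened to n = 784.
(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# A 2-DNN autoencoder with ReLU layers of width 5000 and 784, trained with RMSProp
# on the squared reconstruction loss (inputs are used as their own targets).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(5000, activation="relu"),
    tf.keras.layers.Dense(784, activation="relu"),
    tf.keras.layers.Dense(784, activation=None),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4), loss="mse")
model.fit(x_train, x_train, validation_data=(x_test, x_test), epochs=5, batch_size=128)
```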
• 42. Why can the deep net do dictionary learning? What exactly do algorithms like ADAM and RMSProp do?

Algorithm: ADAM on a differentiable function f : R^d → R
function ADAM(x₁, β₁, β₂, α, ξ)
  Initialize: m₀ = 0, v₀ = 0
  for t = 1, 2, . . . do
    g_t = ∇f(x_t)
    m_t = β₁ m_{t−1} + (1 − β₁) g_t
    v_t = β₂ v_{t−1} + (1 − β₂) g_t²   (elementwise square)
    V_t = diag(v_t)
    x_{t+1} = x_t − α_t (V_t^{1/2} + diag(ξ 1_d))^{−1} m_t
  end for
end function

These "adaptive gradient" algorithms like ADAM (or RMSProp = ADAM at β₁ = 0), which seem to work the best on autoencoder neural nets, are currently very poorly understood!
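For reference, a runnable Python sketch of exactly this update rule (no bias correction, constant step size for simplicity), applied here to a simple quadratic; the test function and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def adam(grad, x0, beta1=0.9, beta2=0.999, alpha=0.01, xi=1e-8, steps=2000):
    """The update from the slide: x <- x - alpha * m / (sqrt(v) + xi), with EMAs m, v."""
    x = x0.astype(float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        x = x - alpha * m / (np.sqrt(v) + xi)   # (V_t^{1/2} + xi I)^{-1} m_t, elementwise
        # RMSProp is recovered by setting beta1 = 0.
    return x

# Usage: minimize f(x) = ||x - c||^2, whose gradient is 2(x - c).
c = np.array([1.0, -2.0, 3.0])
x_min = adam(lambda x: 2 * (x - c), x0=np.zeros(3))
print(x_min)   # should end up close to c
```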
• 45. Why can the deep net do dictionary learning? What exactly do algorithms like ADAM and RMSProp do?

Our experimental conclusions and proofs about ADAM:

We have shown controlled experiments suggesting that, for large enough autoencoders, standard methods possibly cannot surpass ADAM's ability to reduce training as well as test losses, particularly when its parameters are set as β₁ ≈ 0.99, for both full-batch as well as mini-batch settings.

[Theorem] There exists a sequence of step-size choices and ranges of values of ξ and β₁ for which ADAM provably converges to criticality with no convexity assumptions. (The proof technique here might be of independent interest!)

Now let us try to gain some mathematical control on the neural-net landscape, at least in the depth-2 case, where RMSProp and ADAM have almost similar performance.
• 48. Why can the deep net do dictionary learning? Why can deep nets do sparse coding?

After laborious algebra (over months!) we can offer the following insight,

Theorem (Ours). If the source sparse vectors x* ∈ R^h are such that their non-zero coordinates are sampled from an interval in R₊, each x* has support of size at most h^p with p < 1/2, and A* ∈ R^{n×h} is incoherent enough, then a constant ε can be chosen such that the autoencoder landscape
$\mathbb{E}_{y = A^* x^*}\big[\, \| y - W^{\top}\,\mathrm{ReLU}(W y - \varepsilon) \|_2^2 \,\big]$
is asymptotically (in h) critical in a neighbourhood of A*.

Such criticality around the right answer is clearly a plausible reason why gradient descent might find the right answer! Experiments in fact suggest that, asymptotically in h, A* might even be a global minimum, but as of now we have no clue how to prove such a thing!
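A Monte-Carlo sketch of this landscape (ours; all parameter values are illustrative assumptions, and this only estimates the expectation at two particular weight settings; it does not verify the criticality claim):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sample_y(A_star, p, rng):
    """One sample from the generative model: an h^p-sparse nonnegative code times A*."""
    n, h = A_star.shape
    x = np.zeros(h)
    support = rng.choice(h, size=int(h ** p), replace=False)
    x[support] = rng.uniform(0.5, 1.0, size=support.size)   # nonzeros from an interval in R+
    return A_star @ x

def landscape(W, A_star, eps, p, rng, num_samples=500):
    """Monte-Carlo estimate of E_y || y - W^T ReLU(W y - eps) ||_2^2."""
    losses = []
    for _ in range(num_samples):
        y = sample_y(A_star, p, rng)
        losses.append(np.sum((y - W.T @ relu(W @ y - eps)) ** 2))
    return np.mean(losses)

rng = np.random.default_rng(4)
n, h, p, eps = 50, 1000, 0.4, 0.2
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)             # a roughly incoherent random dictionary

print("loss at W = A*^T:  ", landscape(A_star.T, A_star, eps, p, rng))
print("loss at a random W:", landscape(rng.standard_normal((h, n)) / np.sqrt(n), A_star, eps, p, rng))
```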
• 56. Open questions

- Explain ADAM! Why is ADAM so good at minimizing the generalization error on autoencoders? (and many other nets!)
- Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum?
- We have shown an example of a manifold of "high complexity" neural functions. But in the space of deep-net functions, how dense are such complex functions?
- Can one exactly characterize the set of functions parameterized by the architecture?
- How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn't clear even when restricted to Boolean inputs and with unrestricted weights!)
- Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that the deep-net function class strictly increases with depth!)
- Are there Boolean functions which have smaller representations using ReLU gates than LTF gates? (A peculiarly puzzling question!)