Mathematics of neural networks
Anirbit
AMS
Johns Hopkins University
Invited talk at MIT Maths
“Seminar on Applied Algebra and Geometry”, 14th November 2017
Outline
1 Introduction
2 The questions (which have some answers!)
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
Introduction
This talk is based on the following 3 papers,
https://arxiv.org/abs/1711.03073
“Lower bounds over Boolean inputs for deep neural networks with
ReLU gates”
https://arxiv.org/abs/1708.03735
“Sparse Coding and Autoencoders”
https://eccc.weizmann.ac.il/report/2017/098/
“Understanding Deep Neural Networks with Rectified Linear Units”
Introduction
The collaborators!
These are works with Amitabh Basu (AMS, JHU)
and different subsets of,
Akshay Rangamani (ECE, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
Introduction
Activation gates of the neural network
The building blocks of a “neural net” are its activation gates, which do the
basic analogue computations (as opposed to Boolean gates in Boolean
circuits, which compute the AND, OR, NOT and threshold functions).
The above is an R^3 → R neural gate evaluating the “activation” function
f : R → R on a linear (in general, affine) transformation of the
input vector. The w's are the ‘weights’. If there were many Y's coming
out of the gate, it would pass on the same value to all of them.
Introduction
The ReLU activation function
Almost uniformly it is now believed that the “best” activation function to
use is the “Rectified Linear Unit (ReLU)”:
ReLU : R → R
x → max{0, x}
At this point it is also useful to define the “Linear Threshold Function (LTF)”,
the more studied activation in Boolean complexity, to which we shall at
times compare later:
LTF : R → R
x → 1_{x≥0} or 2·1_{x≥0} − 1
If LTF(x) = 1_{x≥0}, then it's easy to see that ReLU(x) = x · LTF(x).
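A quick numerical check of that identity, as a minimal Python/NumPy sketch (the function names here are just for illustration):

    import numpy as np

    def relu(x):
        # ReLU(x) = max{0, x}
        return np.maximum(0.0, x)

    def ltf(x):
        # LTF(x) = 1_{x >= 0}, using the 0/1 convention
        return (x >= 0).astype(float)

    xs = np.linspace(-3.0, 3.0, 13)
    assert np.allclose(relu(xs), xs * ltf(xs))   # ReLU(x) = x * LTF(x)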
Introduction
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of this as representing a certain class (set) of R^4 → R^3
functions which can be computed by the above “architecture” for a
*fixed* choice of “activation functions” at each of the blue nodes. The
yellow nodes are where the input vector comes in and the orange nodes are
where the output vector comes out.
Introduction
An example of a neurally representable function
[Figure: a 1-DNN with inputs x1, x2; four ReLU gates with first-layer weights ±1 and output-layer weights ±1/2, computing (x1 + x2)/2 + |x1 − x2|/2.]
In the above we see a “1-DNN” with ReLU activation computing the
R^2 → R function given as (x1, x2) → max{x1, x2}. The above neural
network would be said to be of size 4. These “max” functions are
particularly interesting, and we will soon come back to them!
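A minimal sketch checking this size-4 construction numerically, with the first-layer weights ±1 and output weights ±1/2 read off the figure (illustrative code, not from the papers):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def one_dnn_max(x1, x2):
        # Four ReLU gates with input weights (+1,+1), (-1,-1), (-1,+1), (+1,-1)
        # and output weights 1/2, -1/2, 1/2, 1/2; the affine combination equals
        # (x1 + x2)/2 + |x1 - x2|/2 = max{x1, x2}.
        return (0.5 * relu(x1 + x2) - 0.5 * relu(-x1 - x2)
                + 0.5 * relu(-x1 + x2) + 0.5 * relu(x1 - x2))

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=1000), rng.normal(size=1000)
    assert np.allclose(one_dnn_max(a, b), np.maximum(a, b))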
Introduction
Neural nets used in real life!
Neural nets deployed in the real world, which are creating engineering
miracles every day, come in various complicated designs. “The Asimov
Institute” recently compiled a beautiful chart summarizing many of the
architectures in use.
Introduction
When do real weights matter?
Consider a very restricted class of neural networks built out of piecewise-linear
gates like ReLU, where all inputs as well as the gates' descriptions are
restricted to rational numbers requiring at most m bits to describe, AND the
neural network ends with a threshold gate at the top, AND every gate has
fan-out at most 1.
It was shown by Wolfgang Maass in 1997 that for such networks one can
trade off real weights for rational weights such that the corresponding
integers are in absolute value at most (2s + 1)! · 2^{2m(2s+1)}, where s is the
total number of weights.
But for most usual nets, where there is no restriction on the fan-out,
we do not know of such a transformation, and it's also not clear to us
whether we can always simulate these with LTF gates without blowing up
the size!
Introduction
Neural networks: recent resurgence as a “universal algorithm”!
They are revolutionizing techniques in various fields ranging from
particle physics to genomics to signal processing to computer vision.
Most recently, the Google DeepMind group has shown that within hours of
training one can make a neural network capable of finding new strategies
for winning the game of “Go” that people haven't seen despite (more than?)
a thousand years of having played the game! Most importantly, this seems
possible without feeding the network any human game histories!
Mathematically, nets are still extremely hard to analyze. A nice survey
of many of the recent ideas has been compiled in a 3-part series
of articles by the “Center for Brains, Minds, and Machines (CBMM)” at
MIT. Here we report on some of the attempts we have been making
to rigorously understand neural networks.
The questions (which have some answers!)
Formalizing the questions about neural nets
Broadly it seems that we can take 3 different directions of study,
(1) Generalization
It needs an explanation why the training of neural nets generally does
not overfit despite being over-parametrized. Some recent attempts at
explaining this can be seen in the works by Kaelbling, Kawaguchi (MIT)
and Bengio (UMontreal); Brutzkus, Globerson (Tel Aviv University);
Malach and Shai Shalev-Shwartz (The Hebrew University).
The questions (which have some answers!)
Formalizing the questions about neural nets
(2a) Trainability of the nets
Theorem (Ours)
Empirical risk minimization on a 1-DNN with a convex loss, like
min_{w_p, a_p, b_p, b} (1/S) Σ_{i=1}^{S} ( y_i − Σ_{p=1}^{width} a_p · max{0, ⟨w_p, x_i⟩ + b_p} − b )²,
can be done in time polynomial in the number of data points (but
exponential in the width and the dimension).
This is the *only* algorithm we are aware of which gets an exact
global minimum of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks, or of
ameliorating the dependency on the width, remains wildly open!
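For concreteness, a minimal NumPy sketch of the objective appearing in the theorem; it only evaluates the empirical risk of a width-k 1-DNN and is not the exact ERM algorithm the theorem refers to (all variable names are illustrative):

    import numpy as np

    def one_dnn_empirical_risk(W, a, b_hidden, b_out, X, y):
        # W: (k, d) hidden weights, a: (k,) output weights,
        # b_hidden: (k,) hidden biases, b_out: scalar output bias,
        # X: (S, d) data points, y: (S,) targets.
        pre = X @ W.T + b_hidden            # <w_p, x_i> + b_p for every (i, p)
        hidden = np.maximum(0.0, pre)       # ReLU, i.e. max{0, .}
        preds = hidden @ a + b_out          # sum_p a_p * ReLU(...) + b
        return np.mean((y - preds) ** 2)    # (1/S) * sum_i (y_i - prediction_i)^2

    S, d, k = 200, 3, 5
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(S, d)), rng.normal(size=S)
    risk = one_dnn_empirical_risk(rng.normal(size=(k, d)), rng.normal(size=k),
                                  rng.normal(size=k), 0.0, X, y)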
The questions (which have some answers!)
Formalizing the questions about neural nets
(2b) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that they can
often be set up in such a way that they give a function to optimize
whose optima/critical points reveal this hidden structure. In one classic
scenario called “sparse coding” we will show proofs about how the net's
loss function has certain nice properties which possibly help towards
revealing the hidden data-generation model (the “dictionary”).
The questions (which have some answers!)
Formalizing the questions about neural nets
(3) The deep-net functions.
One of the themes that we have looked into a lot is to try to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture?
Theorem (Ours)
Any ReLU deep net always computes a piecewise linear function,
and all R^n → R piecewise linear functions are ReLU net functions of
depth at most 1 + ⌈log2(n + 1)⌉. For n = 1 there is also a sharp
width lower bound.
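For the n = 1 case, a quick illustration of why one hidden layer of ReLUs suffices: a continuous piecewise linear f : R → R with slopes s_i and breakpoints t_i can be written as a linear term plus one ReLU per breakpoint (the example function below is arbitrary, not from the paper; the leading linear term x can itself be written as ReLU(x) − ReLU(−x)):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    # An arbitrary CPWL target: slopes 1, -2, 0.5 on (-inf,0], [0,1], [1,inf), f(0) = 0.
    def f(x):
        return np.where(x <= 0, x, np.where(x <= 1, -2.0 * x, 0.5 * (x - 1.0) - 2.0))

    # One-hidden-layer form: f(x) = s_0 * x + sum_i (s_i - s_{i-1}) * ReLU(x - t_i).
    slopes, breaks = [1.0, -2.0, 0.5], [0.0, 1.0]
    def g(x):
        out = slopes[0] * x
        for s_prev, s_next, t in zip(slopes, slopes[1:], breaks):
            out = out + (s_next - s_prev) * relu(x - t)
        return out

    xs = np.linspace(-3.0, 3.0, 601)
    assert np.allclose(f(xs), g(xs))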
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A very small part of “The Big Question”
A simple (but somewhat surprising!) example is the following fact:
Theorem (Ours)
1-DNN ⊊ 2-DNN, and the following R^2 → R function,
(x1, x2) → max{0, x1, x2}, is in the gap.
Proof.
That 1-DNN ⊂ 2-DNN is obvious. Now observe that any R^2 → R
1-DNN function is non-differentiable on a union of full lines (one line
along each ReLU gate's argument), but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
A small part of “The Big Question” which is already unclear!
It is easy to see that max{0, x1, x2, …, x_{2^k − 1}} ∈ k-DNN (a sketch of
this easy direction follows below). But is this in (k−1)-DNN? The
corresponding statement at higher depths (k ≥ 3) remains unresolved as of now!
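A sketch of that easy direction: the max of 2^k numbers via a balanced tree of pairwise maxes, where each tree level is one ReLU layer since max{a, b} = (a + b)/2 + (ReLU(a − b) + ReLU(b − a))/2 and affine combinations between layers are free (illustrative code, not taken from the papers):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def pairwise_max(a, b):
        # max{a, b} = (a + b)/2 + |a - b|/2, with |a - b| = ReLU(a - b) + ReLU(b - a):
        # one layer of ReLU gates followed by an affine combination.
        return 0.5 * (a + b) + 0.5 * (relu(a - b) + relu(b - a))

    def tree_max(vals):
        # Each level of the tree is one more ReLU layer, so the max of
        # 2^k numbers is computed with k layers.
        vals = list(vals)
        while len(vals) > 1:
            vals = [pairwise_max(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]
        return vals[0]

    k = 3
    x = np.random.default_rng(1).normal(size=2 ** k)
    x[0] = 0.0   # the list {0, x_1, ..., x_{2^k - 1}}
    assert np.isclose(tree_max(x), np.max(x))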
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (Ours)
∀k ∈ N, there exists a continuum of R → R neural net functions
of depth 1 + k² (and size k³) which need size Ω(k^{k+1}) at depths
≤ 1 + k.
Here the basic intuition is that if one starts with a small-depth function
which is oscillating, then *without* blowing up the width too much, higher
depths can be set up to recursively increase the number of oscillations.
Such functions then become very hard for the smaller depths to even
approximate in ℓ1 norm unless they blow up in size.
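An illustration of the oscillation-doubling intuition using the standard “tent map” t(x) = 2·ReLU(x) − 4·ReLU(x − 1/2) on [0, 1]: each composition adds one ReLU layer of constant width while the number of affine pieces doubles. (This is the textbook example of the phenomenon, offered only as an illustration; it is not the exact family of hard functions in the theorem.)

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def tent(x):
        # On [0, 1]: t(x) = 2x for x <= 1/2 and 2 - 2x for x >= 1/2, via two ReLU gates.
        return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

    def count_pieces(ys, xs):
        # Count affine pieces by counting slope changes along a grid whose points
        # include all the breakpoints (a dyadic grid, so the slopes are exact).
        slopes = np.diff(ys) / np.diff(xs)
        return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

    xs = np.linspace(0.0, 1.0, 1025)        # dyadic grid: kinks of the compositions land on grid points
    ys = xs.copy()
    for depth in range(1, 6):
        ys = tent(ys)                       # one more width-2 ReLU layer
        print(depth, count_pieces(ys, xs))  # prints 2, 4, 8, 16, 32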
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions with one layer of gates
For real-valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following:
Theorem (Ours)
There is at least an Ω(n) gap between Sum-of-ReLU and Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube {0, 1}^n
given as f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This has 2^n level sets on the discrete
cube and hence needs that many polyhedral cells to be produced by
the hyperplanes of the Sum-of-LTF circuit, whereas, being a linear
function, it can be implemented by just 2 ReLU gates!
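A small numerical illustration of the counting in that proof (illustrative only): on {0, 1}^n the function f takes 2^n distinct values, while, being linear, it is reproduced by just two ReLU gates via ReLU(f) − ReLU(−f) (and since f ≥ 0 on the cube, even one gate would do):

    import numpy as np
    from itertools import product

    n = 6
    w = 2.0 ** np.arange(n)                        # weights 2^{i-1} for i = 1..n
    cube = np.array(list(product([0, 1], repeat=n)), dtype=float)

    f = cube @ w                                   # f(x) = sum_i 2^{i-1} x_i
    assert len(np.unique(f)) == 2 ** n             # 2^n level sets on the discrete cube

    relu = lambda z: np.maximum(0.0, z)
    assert np.allclose(f, relu(cube @ w) - relu(-(cube @ w)))   # two ReLU gates suffice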
The questions (which have some answers!) What functions does a deep net represent?
The next set of results we will present is more recent and seems to
need significantly more effort than those presented so far.
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
The *ideal* depth separation!
Can one show neural functions at every depth such that representing
them by circuits of even one depth less necessarily requires size
Ω(e^{dimension})? This is a major open question, and over real inputs
this is currently known only between 2-DNN and 1-DNN, from the works
of Eldan-Shamir and Amit Daniely.
We go beyond small-depth lower bounds in the following restricted sense:
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1}
circuits require size
Ω( (d − 1) · 2^{(dimension)^{1/8}/(d−1)} / ((dimension) · W)^{1/(d−1)} )
when the bottom-most layer weight vectors are such that their coordinates
are integers of size at most W and these weight vectors induce the same
ordering on the set {−1, 1}^{dimension} when it is ranked by the value of the
inner product with them. (Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” is quadratically (in the dimension) bounded for the functions
computed by such circuits, thought of as matrices of dimension
2^{dimension/2} × 2^{dimension/2}. (And we recall that small-depth,
small-size functions are known which have exponentially large sign-rank.)
The questions (which have some answers!) What functions does a deep net represent?
The questions about the function space
Separations for Boolean functions
Despite the result by Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ)
and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that
matches the Andreev function on n bits for at least a 1/2 + ε
fraction of the inputs has size Ω(ε · 2^{(1−δ) n^{1−δ}}).
This is proven by the “method of random restrictions” and in particular a very
recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on
the Littlewood-Offord theorem.
The questions (which have some answers!) Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able
to solve so many diverse kinds of real-life optimization problems.
It is a serious mathematical challenge to understand how the deep net
“sees” the classic optimization questions.
For a net, say N, and a distribution D, let us call its “landscape” (L)
corresponding to a loss function ℓ (typically the squared loss) the quantity
L(D, N) = E_{(x,y)∼D}[ℓ(y, N(x))].
Why is this L so often somehow a nice function to optimize over in order
to solve an optimization problem which on its own had nothing to do with nets?
The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where, given access
to vectors y = A∗x∗ and some distributional (sparsity) guarantees
about x∗, we try to infer A∗. Breakthrough work by Spielman, Wang
and Wright (2012): this is sometimes provably doable in poly-time!
In this work we attempt to progress towards giving some rigorous
explanation for the observation that nets seem to solve sparse coding!
The questions (which have some answers!) Why can the deep net do dictionary learning?
Sparse coding
The defining equations of our autoencoder computing ỹ ∈ R^n from y ∈ R^n.
The generative model: sparse x∗ ∈ R^h and y = A∗ x∗ ∈ R^n, with h ≫ n.
h = ReLU(W y − ε) = max{0, W y − ε} ∈ R^h
ỹ = W^T h ∈ R^n
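A minimal sketch of this autoencoder's forward pass and its squared reconstruction loss (NumPy; the shapes and the bias ε follow the equations above, while the particular dictionary, sparse code, and weights below are random placeholders, not a trained or provably good choice):

    import numpy as np

    rng = np.random.default_rng(0)
    n, h, support = 50, 256, 5                        # observed dim, code dim (h >> n), sparsity

    A_star = rng.normal(size=(n, h)) / np.sqrt(n)     # stand-in for the dictionary A*
    x_star = np.zeros(h)                              # sparse, non-negative code x*
    x_star[rng.choice(h, support, replace=False)] = rng.uniform(0.5, 1.0, support)
    y = A_star @ x_star                               # observation y = A* x*

    W = rng.normal(size=(h, n)) / np.sqrt(n)          # autoencoder weights (placeholder)
    eps = 0.1                                         # the threshold/bias epsilon

    code = np.maximum(0.0, W @ y - eps)               # h = ReLU(W y - eps)
    y_tilde = W.T @ code                              # y~ = W^T h
    loss = np.sum((y - y_tilde) ** 2)                 # || y - W^T ReLU(W y - eps) ||_2^2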
The questions (which have some answers!) Why can the deep net do dictionary learning?
Our TensorFlow experiments provide evidence for the power of
autoencoders
Software : TensorFlow (with complicated gradient updates!)
6000 training examples and 1000 testing examples for each digit
n = 784, and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 for the 2-DNN.
The questions (which have some answers!) Why can the deep net do dictionary learning?
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors x∗ ∈ R^h are such that their non-zero
coordinates are sampled from an interval in R⁺ and their support is of size
at most h^p with p < 1/2, and A∗ ∈ R^{n×h} is incoherent enough, then a
constant ε can be chosen such that the autoencoder landscape,
E_{y=A∗x∗}[ || y − W^T ReLU(W y − ε) ||_2^2 ],
is such that it is asymptotically (in h) critical in a neighbourhood of A∗.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find the right answer! Experiments in fact
suggest that, asymptotically in h, A∗ might even be a global minimum,
but as of now we have no clue how to prove such a thing!
Open questions
Even for the specific case of sparse coding, how to analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum?
Can one exactly characterize the set of functions parameterized by the architecture?
How to (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn't clear even with just Boolean inputs and unrestricted weights!)
Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that with depth the deep net function class strictly increases!)
Can one show that *every* low-depth function has a “simpler” (maybe smaller circuit size or smaller weight) representation at higher depths?
In the space of deep net functions, how dense are the “high complexity” functions? (like those with Ω(size^{dimension}) affine pieces)
Are there Boolean functions which have smaller representations using ReLU gates than LTF gates? (A peculiarly puzzling question!)
Anirbit ( AMS Johns Hopkins University ) 14th
November 2017 30 / 30

More Related Content

Similar to Talk at MIT, Maths on deep neural networks

Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationMohammed Bennamoun
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKSESCOM
 
Summary Of Thesis
Summary Of ThesisSummary Of Thesis
Summary Of Thesisguestb452d6
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...DurgadeviParamasivam
 
Neural networks
Neural networksNeural networks
Neural networksBasil John
 
Artificial Neural Networks.pdf
Artificial Neural Networks.pdfArtificial Neural Networks.pdf
Artificial Neural Networks.pdfBria Davis
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịHong Ong
 
CNN Structure: From LeNet to ShuffleNet
CNN Structure: From LeNet to ShuffleNetCNN Structure: From LeNet to ShuffleNet
CNN Structure: From LeNet to ShuffleNetDalin Zhang
 
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecturegerogepatton
 
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architectureijaia
 
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecturegerogepatton
 
INTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKSINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKSPrashant Srivastav
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksGiuseppe Broccolo
 

Similar to Talk at MIT, Maths on deep neural networks (20)

Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKS
 
Summary Of Thesis
Summary Of ThesisSummary Of Thesis
Summary Of Thesis
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
 
SoftComputing5
SoftComputing5SoftComputing5
SoftComputing5
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural Network
 
Neural networks
Neural networksNeural networks
Neural networks
 
Artificial Neural Networks.pdf
Artificial Neural Networks.pdfArtificial Neural Networks.pdf
Artificial Neural Networks.pdf
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
 
SCA ANN-01.pdf
SCA ANN-01.pdfSCA ANN-01.pdf
SCA ANN-01.pdf
 
CNN Structure: From LeNet to ShuffleNet
CNN Structure: From LeNet to ShuffleNetCNN Structure: From LeNet to ShuffleNet
CNN Structure: From LeNet to ShuffleNet
 
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
 
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored Using Transformer Architecture
 
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer ArchitectureFoundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecture
Foundations of ANNs: Tolstoy’s Genius Explored using Transformer Architecture
 
K0363063068
K0363063068K0363063068
K0363063068
 
INTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKSINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKS
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural Networks
 
3 Ayhan Esi.pdf
3 Ayhan Esi.pdf3 Ayhan Esi.pdf
3 Ayhan Esi.pdf
 
3 Ayhan Esi.pdf
3 Ayhan Esi.pdf3 Ayhan Esi.pdf
3 Ayhan Esi.pdf
 

Recently uploaded

如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives23050636
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchersdarmandersingh4580
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...ssuserf63bd7
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...yulianti213969
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证ppy8zfkfm
 
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证a8om7o51
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesBoston Institute of Analytics
 
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisData Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisBoston Institute of Analytics
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunksgmuir1066
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证pwgnohujw
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证dq9vz1isj
 
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethDigital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethSamantha Rae Coolbeth
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一fztigerwe
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...ssuserf63bd7
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证ju0dztxtn
 

Recently uploaded (20)

如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
 
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
如何办理加州大学伯克利分校毕业证(UCB毕业证)成绩单留信学历认证
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisData Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethDigital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 

Talk at MIT, Maths on deep neural networks

  • 1. Mathematics of neural networks Anirbit AMS Johns Hopkins University Invited talk at MIT Maths “Seminar on Applied Algebra and Geometry”, 14th November 2017 Anirbit ( AMS Johns Hopkins University ) 14th November 2017 1 / 30
  • 2. Outline 1 Introduction 2 The questions (which have some answers!) What functions does a deep net represent? Why can the deep net do dictionary learning? 3 Open questions Anirbit ( AMS Johns Hopkins University ) 14th November 2017 2 / 30
  • 3. Introduction This talk is based on the following 3 papers, https://arxiv.org/abs/1711.03073 “Lower bounds over Boolean inputs for deep neural networks with ReLU gates” https://arxiv.org/abs/1708.03735 “Sparse Coding and Autoencoders” https://eccc.weizmann.ac.il/report/2017/098/ “Understanding Deep Neural Networks with Rectified Linear Units” Anirbit ( AMS Johns Hopkins University ) 14th November 2017 3 / 30
  • 4. Introduction The collaborators! These are works with Amitabh Basu (AMS, JHU) and different subsets of, Akshay Rangamani (ECE, JHU) Tejaswini Ganapathy (Salesforce, San Francisco Bay Area) Ashish Arora, Trac D.Tran (ECE, JHU) Raman Arora, Poorya Mianjy (CS, JHU) Sang (Peter) Chin (CS, BU) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 4 / 30
  • 5. Introduction Activation gates of the neural network The building block of a “neural net” are its activation gates which do the basic analogue computations (as opposed to Boolean gates in Boolean circuits which compute the AND, OR, NOT and Threshold functions.) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 5 / 30
  • 6. Introduction Activation gates of the neural network The building block of a “neural net” are its activation gates which do the basic analogue computations (as opposed to Boolean gates in Boolean circuits which compute the AND, OR, NOT and Threshold functions.) The above is a R3 → R neural gate evaluating the “activation” function f : R → R on a linear (in general can be affine) transformation of the input vector. The ws are the ‘weights’. If there were many Y s coming out of the gate then it would pass on the same value to all of them. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 5 / 30
  • 7. Introduction The ReLU activation function Almost uniformly it is now believed that the “best” activation function to use is the “Rectified Linear Unit (ReLU)” ReLU : R → R x → max{0, x} Anirbit ( AMS Johns Hopkins University ) 14th November 2017 6 / 30
  • 8. Introduction The ReLU activation function Almost uniformly it is now believed that the “best” activation function to use is the “Rectified Linear Unit (ReLU)” ReLU : R → R x → max{0, x} At this point it is useful to also define the more studied activation, “Linear Threshold Function (LTF)”, in Boolean complexity to which we shall at times compare later, LTF : R → R x → 1x≥0 or 21x≥0 − 1 Anirbit ( AMS Johns Hopkins University ) 14th November 2017 6 / 30
  • 9. Introduction The ReLU activation function Almost uniformly it is now believed that the “best” activation function to use is the “Rectified Linear Unit (ReLU)” ReLU : R → R x → max{0, x} At this point it is useful to also define the more studied activation, “Linear Threshold Function (LTF)”, in Boolean complexity to which we shall at times compare later, LTF : R → R x → 1x≥0 or 21x≥0 − 1 If LTF(x) = 1x≥0, then its easy to see that, ReLU(x) = xLTF(x). Anirbit ( AMS Johns Hopkins University ) 14th November 2017 6 / 30
  • 10. Introduction What is a neural network? The following diagram (imagine it as a directed acyclic graph where all edges are pointing to the right) represents an instance of a “neural network”. Since there are no “weights” assigned to the edges of the above graph, one should think of this as representing a certain class (set) of R4 → R3 functions which can be computed by the above “architecture” for a *fixed* choice of “activation functions” at each of the blue nodes. The yellow nodes are where the input vector comes in and the orange nodes are where the output vector comes out. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 7 / 30
  • 11. Introduction An example of a neurally representable function Input x1 Input x2 x1+x2 2 + |x1−x2| 2 1 1 -1 -1 -1 1 1 -1 1 2 −1 2 1 2 1 2 In the above we see a “1-DNN” with ReLU activation computing the R2 → R function given as (x1, x2) → max{x1, x2}. The above neural network would be said to be of size 4. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 8 / 30
  • 12. Introduction An example of a neurally representable function Input x1 Input x2 x1+x2 2 + |x1−x2| 2 1 1 -1 -1 -1 1 1 -1 1 2 −1 2 1 2 1 2 In the above we see a “1-DNN” with ReLU activation computing the R2 → R function given as (x1, x2) → max{x1, x2}. The above neural network would be said to be of size 4.These “max” functions are kind of particularly interesting and we would soon come back to this! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 8 / 30
  • 13. Introduction Neural nets used in real life! Neural nets deployed in the real world which are creating the engineering miracles everyday come in various complicated designs. “The Asimov Institute” recently compiled this beautiful chart summarizing many of the architectures in use, Anirbit ( AMS Johns Hopkins University ) 14th November 2017 9 / 30
  • 14. Introduction Neural nets used in real life! Neural nets deployed in the real world which are creating the engineering miracles everyday come in various complicated designs. “The Asimov Institute” recently compiled this beautiful chart summarizing many of the architectures in use, Anirbit ( AMS Johns Hopkins University ) 14th November 2017 9 / 30
  • 15. Introduction When do real weights matter? Consider a very restricted class of neural networks built out of such piecewise-linear gates like ReLU AND all inputs as well as the gates’ description is restricted to be coming from rational numbers requiring at most m bits for their description AND the neural network ends with a threshold gate at the top AND every gate has fan-out at most 1. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 10 / 30
  • 16. Introduction When do real weights matter? Consider a very restricted class of neural networks built out of such piecewise-linear gates like ReLU AND all inputs as well as the gates’ description is restricted to be coming from rational numbers requiring at most m bits for their description AND the neural network ends with a threshold gate at the top AND every gate has fan-out at most 1. It was shown by Wolfgang Maass in 1997 that for such networks one can trade off real weights for rational weights such that the corresponding integers are in absolute value at most (2s + 1)!22m(2s+1) where s is the total number of weights. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 10 / 30
  • 17. Introduction When do real weights matter? Consider a very restricted class of neural networks built out of such piecewise-linear gates like ReLU AND all inputs as well as the gates’ description is restricted to be coming from rational numbers requiring at most m bits for their description AND the neural network ends with a threshold gate at the top AND every gate has fan-out at most 1. It was shown by Wolfgang Maass in 1997 that for such networks one can trade off real weights for rational weights such that the corresponding integers are in absolute value at most (2s + 1)!22m(2s+1) where s is the total number of weights. But for most usual nets where there is no restriction on the fan-out we do not know of such a transformation as above and its also not clear to us if we can always simulate these with LTF gates without blowing up the size! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 10 / 30
  • 18. Introduction Neural networks : recent resurgence as an “universal algorithm”! They are revolutionizing the techniques in various fields ranging from particle physics to genomics to signal processing to computer vision. Most recently it has been shown by the Google DeepMind group that one can within hours of training make a neural network capable of finding new strategies of winning the game of “Go” that people havent seen despite (more than?) a 1000 years of having been playing the game! Most importantly this seems possible without feeding the network with any human game histories! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 11 / 30
  • 19. Introduction Neural networks : recent resurgence as an “universal algorithm”! They are revolutionizing the techniques in various fields ranging from particle physics to genomics to signal processing to computer vision. Most recently it has been shown by the Google DeepMind group that one can within hours of training make a neural network capable of finding new strategies of winning the game of “Go” that people havent seen despite (more than?) a 1000 years of having been playing the game! Most importantly this seems possible without feeding the network with any human game histories! Mathematically nets are still extremely hard to analyze. A nice survey of many of the recent ideas have been compiled in a 3 part series of articles by “Center for Brains, Minds, and Machines (CBMM)” at MIT. Here we report on some of the attempts we have been making to rigorously understand neural networks. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 11 / 30
  • 20. The questions (which have some answers!) Formalizing the questions about neural nets Broadly it seems that we can take 3 different directions of study, (1) Generalization It needs an explanation why the training of neural nets generally does not overfit despite being over-parametrized. Some recent attempts at explaining this can be seen in the works by Kaelbling, Kawaguchi (MIT) and Bengio (UMontreal), Brutzkus, Globerson (Tel Aviv University), Malach and Shai Shalev-Shwartz (The Hebrew University) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 12 / 30
  • 21. The questions (which have some answers!) Formalizing the questions about neural nets (2a) Trainability of the nets Theorem (Ours) Empirical risk minimization on a 1-DNN with a convex loss, like min_{w_p, a_p, b_p, b} (1/S) Σ_{i=1}^{S} ‖ y_i − Σ_{p=1}^{width} a_p max{0, ⟨w_p, x_i⟩ + b_p} − b ‖_2^2, can be done in time poly(number of data points), where the exponent of the polynomial depends on the width and the input dimension.
  • 22. The questions (which have some answers!) Formalizing the questions about neural nets (2a) Trainability of the nets Theorem (Ours) Empirical risk minimization on a 1-DNN with a convex loss, like min_{w_p, a_p, b_p, b} (1/S) Σ_{i=1}^{S} ‖ y_i − Σ_{p=1}^{width} a_p max{0, ⟨w_p, x_i⟩ + b_p} − b ‖_2^2, can be done in time poly(number of data points), where the exponent of the polynomial depends on the width and the input dimension. This is the *only* algorithm we are aware of which gets the exact global minimum of the empirical risk of some net in time polynomial in any of the parameters. The possibility of a similar result for deeper networks, or of ameliorating the dependency on the width, remains wildly open! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 13 / 30
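Not the paper's algorithm, just a minimal numpy sketch of the objective the theorem is about: the empirical risk of a one-hidden-layer ReLU net with scalar outputs, evaluated on made-up toy data.

```python
import numpy as np

def net_1dnn(x, W, a, b_hidden, b_out):
    """One-hidden-layer ReLU net: sum_p a_p * max{0, <w_p, x> + b_p} + b_out."""
    return a @ np.maximum(0.0, W @ x + b_hidden) + b_out

def empirical_risk(X, y, W, a, b_hidden, b_out):
    """(1/S) * sum_{i=1}^S || y_i - net(x_i) ||_2^2, written here for scalar outputs."""
    preds = np.array([net_1dnn(x, W, a, b_hidden, b_out) for x in X])
    return float(np.mean((y - preds) ** 2))

# Toy instance with made-up sizes; the theorem says this objective can be
# globally minimized, which this sketch does not attempt to do.
rng = np.random.default_rng(0)
S, dim, width = 50, 3, 4
X, y = rng.normal(size=(S, dim)), rng.normal(size=S)
W, a, b_hidden = rng.normal(size=(width, dim)), rng.normal(size=width), rng.normal(size=width)
print(empirical_risk(X, y, W, a, b_hidden, b_out=0.0))
```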
  • 23. The questions (which have some answers!) Formalizing the questions about neural nets (2b) Structure discovery by the nets Real-life data can be modeled as observations of some structured distribution. One view of the success of neural nets is that nets can often be set up in such a way that they give a function to optimize over which reveals this hidden structure at its optima/critical points. In one classic scenario, called “sparse coding”, we will show proofs about how the net’s loss function has certain nice properties which possibly help towards revealing the hidden data-generation model (the “dictionary”). Anirbit ( AMS Johns Hopkins University ) 14th November 2017 14 / 30
  • 24. The questions (which have some answers!) Formalizing the questions about neural nets (3) The deep-net functions. One of the themes that we have looked into a lot is trying to find good descriptions of the functions that nets can compute. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 15 / 30
  • 25. The questions (which have some answers!) Formalizing the questions about neural nets (3) The deep-net functions. One of the themes that we have looked into a lot is trying to find good descriptions of the functions that nets can compute. Let us start with this last kind of question! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 15 / 30
  • 26. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space “The Big Question!” Can one find a complete characterization of the neural functions parametrized by architecture?
  • 27. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space “The Big Question!” Can one find a complete characterization of the neural functions parametrized by architecture? Theorem (Ours) Any ReLU deep net always computes a piecewise linear function, and every R^n → R piecewise linear function is a ReLU net function of depth at most 1 + ⌈log2(n + 1)⌉. For n = 1 there is also a sharp width lower bound. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 16 / 30
  • 28. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space A very small part of “The Big Question” A simple (but somewhat surprising!) observation is the following fact, Theorem (Ours) 1-DNN ⊊ 2-DNN, and the following R^2 → R function, (x1, x2) → max{0, x1, x2}, is in the gap.
  • 29. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space A very small part of “The Big Question” A simple (but somewhat surprising!) observation is the following fact, Theorem (Ours) 1-DNN ⊊ 2-DNN, and the following R^2 → R function, (x1, x2) → max{0, x1, x2}, is in the gap. Proof. That 1-DNN ⊆ 2-DNN is obvious. Now observe that any R^2 → R 1-DNN function is non-differentiable on a union of (full) lines (one line along each ReLU gate’s argument), but the given function is non-differentiable on a union of 3 half-lines. Hence proved! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 17 / 30
  • 30. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space A small part of “The Big Question” which is already unclear! It is easy to see that max{0, x1, x2, ..., x_{2^k − 1}} ∈ k-DNN. But is this in (k−1)-DNN? The corresponding statement at higher depths (k ≥ 3) remains unresolved as of now! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 18 / 30
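A sketch of the intuition behind the upper bound (not the construction from the paper verbatim): since max{a, b} = b + ReLU(a − b), a balanced tree of pairwise maxes handles 2^k numbers in k ReLU rounds; the affine "pass-through" of already-computed values is ignored in the round count below.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_of_two(a, b):
    # The basic ReLU identity: max{a, b} = b + ReLU(a - b).
    return b + relu(a - b)

def tree_max(values):
    """Pairwise-max tree: the max of 2^k numbers in k ReLU 'rounds'.
    (Passing a value unchanged through a layer costs two extra ReLU gates
    via x = ReLU(x) - ReLU(-x); that overhead is not counted here.)"""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        vals = [max_of_two(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

xs = np.array([3.0, -1.0, 7.5, 0.2, 4.4, 4.4, -9.0, 2.0])  # 2^3 numbers
m, k = tree_max(xs)
print(m, k)                      # 7.5 after 3 rounds
assert np.isclose(m, xs.max())
```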
  • 31. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Depth separation for R → R nets Can one show neural functions at every depth such that lower depths will necessarily require a much larger size to represent them? Theorem (Ours) ∀k ∈ N, there exists a continuum of R → R neural net functions of depth 1 + k^2 (and size k^3) which need size Ω(k^{k+1}) at depths ≤ 1 + k. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 19 / 30
  • 32. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Depth separation for R → R nets Can one show neural functions at every depth such that lower depths will necessarily require a much larger size to represent them? Theorem (Ours) ∀k ∈ N, there exists a continuum of R → R neural net functions of depth 1 + k^2 (and size k^3) which need size Ω(k^{k+1}) at depths ≤ 1 + k. Here the basic intuition is that if one starts with a small-depth function which is oscillating, then *without* blowing up the width too much, higher depths can be set up to recursively increase the number of oscillations. Such functions then become very hard for the smaller depths to even approximate in the ℓ1 norm unless they blow up in size. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 19 / 30
  • 33. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions with one layer of gates For real valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation?
  • 34. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions with one layer of gates For real valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation? The best gap we know of is the following,
  • 35. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions with one layer of gates For real-valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation? The best gap we know of is the following, Theorem (Ours) There is at least an Ω(n) gap between Sum-of-ReLU and Sum-of-LTF
  • 36. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions with one layer of gates For real-valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation? The best gap we know of is the following, Theorem (Ours) There is at least an Ω(n) gap between Sum-of-ReLU and Sum-of-LTF Proof. This follows by looking at the function on the hypercube {0, 1}^n given by f(x) = Σ_{i=1}^{n} 2^{i−1} x_i. This function takes 2^n distinct values, so it has 2^n level sets on the discrete cube, and hence that many polyhedral cells must be produced by the hyperplanes of a Sum-of-LTF circuit; since m hyperplanes cut R^n into at most Σ_{i≤n} (m choose i) < 2^n cells when m < n, this forces Ω(n) LTF gates. Whereas, being a linear function, f can be implemented by just 2 ReLU gates! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 20 / 30
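A quick numerical illustration of this proof idea (a sketch; the gate count is the point, not the code): f is linear, so two ReLU gates compute it everywhere, while it already takes 2^n distinct values on the cube.

```python
import numpy as np
from itertools import product

n = 4
w = 2.0 ** np.arange(n)          # weights 2^(i-1), i = 1..n

def relu(z):
    return np.maximum(0.0, z)

def f_via_two_relus(x):
    # Any linear function is a difference of two ReLU gates: z = ReLU(z) - ReLU(-z).
    z = w @ x
    return relu(z) - relu(-z)

values = {float(f_via_two_relus(np.array(x))) for x in product([0.0, 1.0], repeat=n)}
print(len(values))               # 2^n = 16 distinct values, i.e. 2^n level sets on the cube
```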
  • 37. The questions (which have some answers!) What functions does a deep net represent? The next set of results we will present are more recent and seem to need significantly more effort than those till now. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 21 / 30
  • 38. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space The *ideal* depth separation! Can one show neural functions at every depth such that it will necessarily require size Ω(e^dimension) to represent them by circuits of even one depth less? This is a major open question and over real inputs this is currently known only between 2-DNN and 1-DNN from the works of Eldan-Shamir and Amit Daniely. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 22 / 30
  • 39. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space The *ideal* depth separation! Can one show neural functions at every depth such that it will necessarily require size Ω(e^dimension) to represent them by circuits of even one depth less? This is a major open question and over real inputs this is currently known only between 2-DNN and 1-DNN from the works of Eldan-Shamir and Amit Daniely. We go beyond small depth lower bounds in the following restricted sense, Anirbit ( AMS Johns Hopkins University ) 14th November 2017 22 / 30
  • 40. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Theorem (Ours) There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1} circuits require size Ω( (d − 1) · ( 2^{(dimension)^{1/8}} / ((dimension) · W) )^{1/(d−1)} ) when the bottommost layer’s weight vectors are such that their coordinates are integers of absolute value at most W, and these weight vectors all induce the same ordering on the set {−1, 1}^{dimension} when its points are ranked by their inner product with them. (Note that all other weights are left completely free!) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 23 / 30
  • 41. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Theorem (Ours) There exist small depth-2 Boolean functions such that LTF-of-(ReLU)^{d−1} circuits require size Ω( (d − 1) · ( 2^{(dimension)^{1/8}} / ((dimension) · W) )^{1/(d−1)} ) when the bottommost layer’s weight vectors are such that their coordinates are integers of absolute value at most W, and these weight vectors all induce the same ordering on the set {−1, 1}^{dimension} when its points are ranked by their inner product with them. (Note that all other weights are left completely free!) This is achieved by showing that, under the above restriction, the “sign-rank” of the functions computed by such circuits, viewed as 2^{dimension/2} × 2^{dimension/2} matrices, is bounded by a quadratic in the dimension. (And we recall that small-depth, small-size functions are known which have exponentially large sign-rank.) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 23 / 30
  • 42. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions Despite the result by Eldan-Shamir and Amit Daniely, the curiosity still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than LTF-of-ReLU for Boolean functions. Theorem (Ours) For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ) and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that matches the Andreev function on n bits on at least a 1/2 + ε fraction of the inputs has size Ω(ε^{2(1−δ)} n^{1−δ}). Anirbit ( AMS Johns Hopkins University ) 14th November 2017 24 / 30
  • 43. The questions (which have some answers!) What functions does a deep net represent? The questions about the function space Separations for Boolean functions Despite the result by Eldan-Shamir and Amit Daniely, the curiosity still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than LTF-of-ReLU for Boolean functions. Theorem (Ours) For any δ ∈ (0, 1/2), there exists N(δ) ∈ N such that for all n ≥ N(δ) and ε > 2 log^{2/(2−δ)}(n) / n, any LTF-of-ReLU circuit on n bits that matches the Andreev function on n bits on at least a 1/2 + ε fraction of the inputs has size Ω(ε^{2(1−δ)} n^{1−δ}). This is proven by the “method of random restrictions”, and in particular a very recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on the Littlewood-Offord theorem. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 24 / 30
  • 44. The questions (which have some answers!) Why can the deep net do dictionary learning? What makes the deep net landscape special? A fundamental challenge with deep nets is to explain why they are able to solve so many diverse kinds of real-life optimization problems. It is a serious mathematical challenge to understand how the deep net “sees” the classic optimization questions. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 25 / 30
  • 45. The questions (which have some answers!) Why can the deep net do dictionary learning? What makes the deep net landscape special? A fundamental challenge with deep nets is to explain why they are able to solve so many diverse kinds of real-life optimization problems. It is a serious mathematical challenge to understand how the deep net “sees” the classic optimization questions. For a net, say N, and a distribution D, let us call its “landscape” (L) corresponding to a “loss function ℓ” (typically the squared loss) the quantity L(D, N) = E_{(x,y)∼D}[ℓ(y, N(x))]. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 25 / 30
  • 46. The questions (which have some answers!) Why can the deep net do dictionary learning? What makes the deep net landscape special? A fundamental challenge with deep nets is to explain why they are able to solve so many diverse kinds of real-life optimization problems. It is a serious mathematical challenge to understand how the deep net “sees” the classic optimization questions. For a net, say N, and a distribution D, let us call its “landscape” (L) corresponding to a “loss function ℓ” (typically the squared loss) the quantity L(D, N) = E_{(x,y)∼D}[ℓ(y, N(x))]. Why is this L so often, somehow, a nice function to optimize over in order to solve an optimization problem which on its own had nothing to do with nets? Anirbit ( AMS Johns Hopkins University ) 14th November 2017 25 / 30
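Not in the slides: a minimal Python sketch of this definition, with a made-up toy net and the squared loss, just to make the notation concrete (a finite-sample average stands in for the expectation over D).

```python
import numpy as np

def landscape(net, loss, samples):
    """Monte-Carlo estimate of L(D, N) = E_{(x,y)~D}[ loss(y, N(x)) ]."""
    return float(np.mean([loss(y, net(x)) for x, y in samples]))

# A toy instance: squared loss and a fixed single-ReLU-gate "net".
squared_loss = lambda y, y_hat: (y - y_hat) ** 2
w = np.array([1.0, -2.0])
net = lambda x: max(0.0, float(w @ x))

rng = np.random.default_rng(0)
samples = [(x, float(x.sum())) for x in rng.normal(size=(100, 2))]
print(landscape(net, squared_loss, samples))
```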
  • 47. The questions (which have some answers!) Why can the deep net do dictionary learning? Sparse coding We isolate one special optimization question where we can attempt to offer some mathematical explanation for this phenomenon. “Sparse coding” is a classic learning challenge where, given access to vectors y = A∗x∗ and some distributional (sparsity) guarantees about x∗, we try to infer A∗. Breakthrough work by Spielman, Wang and Wright (2012): this is sometimes provably doable in poly-time! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 26 / 30
  • 48. The questions (which have some answers!) Why can the deep net do dictionary learning? Sparse coding We isolate one special optimization question where we can attempt to offer some mathematical explanation for this phenomenon. “Sparse coding” is a classic learning challenge where, given access to vectors y = A∗x∗ and some distributional (sparsity) guarantees about x∗, we try to infer A∗. Breakthrough work by Spielman, Wang and Wright (2012): this is sometimes provably doable in poly-time! In this work we attempt to make progress towards giving some rigorous explanation for the observation that nets seem to solve sparse coding! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 26 / 30
  • 49. The questions (which have some answers!) Why can the deep net do dictionary learning? Sparse coding The defining equations of our autoencoder computing ỹ ∈ R^n from y ∈ R^n. The generative model: sparse x∗ ∈ R^h and y = A∗x∗ ∈ R^n, with h ≫ n. The autoencoder: h = ReLU(Wy − ε) = max{0, Wy − ε} ∈ R^h and ỹ = W^T h ∈ R^n, where ε is the bias (threshold) vector. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 27 / 30
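A minimal numpy sketch of these defining equations; the dictionary, the sparsity level, and the value of the bias ε below are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, sparsity = 50, 200, 5              # h >> n (overcomplete dictionary)

# Generative model: y = A* x* with x* sparse and nonnegative.
A_star = rng.normal(size=(n, h))
A_star /= np.linalg.norm(A_star, axis=0)  # unit-norm columns
x_star = np.zeros(h)
support = rng.choice(h, size=sparsity, replace=False)
x_star[support] = rng.uniform(0.5, 1.0, size=sparsity)
y = A_star @ x_star

# The autoencoder: code = ReLU(W y - eps), y_hat = W^T code.
W = A_star.T.copy()                      # e.g. evaluate at the "right" answer
eps = 0.1 * np.ones(h)
code = np.maximum(0.0, W @ y - eps)
y_hat = W.T @ code
print(np.linalg.norm(y - y_hat))         # reconstruction error at W = A*^T
```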
  • 50. The questions (which have some answers!) Why can the deep net do dictionary learning? Our TensorFlow experiments provide evidence for the power of autoencoders Software : TensorFlow (with complicated gradient updates!) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 28 / 30
  • 51. The questions (which have some answers!) Why can the deep net do dictionary learning? Our TensorFlow experiments provide evidence for the power of autoencoders Software : TensorFlow (with complicated gradient updates!) 6000 training examples and 1000 testing examples for each digit Anirbit ( AMS Johns Hopkins University ) 14th November 2017 28 / 30
  • 52. The questions (which have some answers!) Why can the deep net do dictionary learning? Our TensorFlow experiments provide evidence for the power of autoencoders Software : TensorFlow (with complicated gradient updates!) 6000 training examples and 1000 testing examples for each digit n = 784, and the number of ReLU gates was 10000 for the 1-DNN, and 5000 and 784 (in the two hidden layers) for the 2-DNN. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 28 / 30
  • 53. The questions (which have some answers!) Why can the deep net do dictionary learning? Why can deep nets do sparse coding? After laborious algebra (over months!) we can offer the following insight, Anirbit ( AMS Johns Hopkins University ) 14th November 2017 29 / 30
  • 54. The questions (which have some answers!) Why can the deep net do dictionary learning? Why can deep nets do sparse coding? After laborious algebra (over months!) we can offer the following insight, Theorem (Ours) If the source sparse vectors x∗ ∈ R^h are such that their non-zero coordinates are sampled from an interval in R_+, the support is of size at most h^p with p < 1/2, and A∗ ∈ R^{n×h} is incoherent enough, then a constant ε can be chosen such that the autoencoder landscape E_{y=A∗x∗}[ ‖ y − W^T max{0, Wy − ε} ‖_2^2 ] is asymptotically (in h) critical in a neighbourhood of A∗. Anirbit ( AMS Johns Hopkins University ) 14th November 2017 29 / 30
  • 55. The questions (which have some answers!) Why can the deep net do dictionary learning? Why can deep nets do sparse coding? After laborious algebra (over months!) we can offer the following insight, Theorem (Ours) If the source sparse vectors x∗ ∈ R^h are such that their non-zero coordinates are sampled from an interval in R_+, the support is of size at most h^p with p < 1/2, and A∗ ∈ R^{n×h} is incoherent enough, then a constant ε can be chosen such that the autoencoder landscape E_{y=A∗x∗}[ ‖ y − W^T max{0, Wy − ε} ‖_2^2 ] is asymptotically (in h) critical in a neighbourhood of A∗. Such criticality around the right answer is clearly a plausible reason why gradient descent might find the right answer! Experiments in fact suggest that, asymptotically in h, A∗ might even be a global minimum, but as of now we have no clue how to prove such a thing! Anirbit ( AMS Johns Hopkins University ) 14th November 2017 29 / 30
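Not from the paper: a rough Monte-Carlo sketch of what such near-criticality would look like numerically, estimating a directional derivative of the empirical landscape at W = A∗ᵀ by finite differences. All sizes and the value of ε here are arbitrary choices, so this is only an illustration of the statement, not a verification of it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h, sparsity, eps_val, n_samples = 30, 300, 3, 0.1, 2000

A_star = rng.normal(size=(n, h))
A_star /= np.linalg.norm(A_star, axis=0)     # incoherent-ish random dictionary

def sample_y():
    # y = A* x* with a sparse, nonnegative x*.
    x = np.zeros(h)
    idx = rng.choice(h, size=sparsity, replace=False)
    x[idx] = rng.uniform(0.5, 1.0, size=sparsity)
    return A_star @ x

Y = np.stack([sample_y() for _ in range(n_samples)])

def loss(W):
    codes = np.maximum(0.0, Y @ W.T - eps_val)   # ReLU(W y - eps) for each sample
    recon = codes @ W                            # W^T code for each sample
    return np.mean(np.sum((Y - recon) ** 2, axis=1))

W0 = A_star.T                                    # the "right answer" point
D = rng.normal(size=W0.shape)
D /= np.linalg.norm(D)                           # a random unit direction
t = 1e-3
directional_derivative = (loss(W0 + t * D) - loss(W0 - t * D)) / (2 * t)
print(directional_derivative)    # expected to be small if W0 is (near) critical
```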
  • 56. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 57. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 58. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even with just Boolean inputs and unrestricted weights!) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 59. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even with just Boolean inputs and unrestricted weights!) Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that the deep net function class strictly increases with depth!) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 60. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even with just Boolean inputs and unrestricted weights!) Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that the deep net function class strictly increases with depth!) Can one show that *every* low-depth function has a “simpler” (maybe smaller circuit size or smaller weight) representation at higher depths? Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 61. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even with just Boolean inputs and unrestricted weights!) Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that the deep net function class strictly increases with depth!) Can one show that *every* low-depth function has a “simpler” (maybe smaller circuit size or smaller weight) representation at higher depths? In the space of deep net functions, how dense are the “high complexity” functions? (like those with Ω(size^dimension) affine pieces) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30
  • 62. Open questions Even for the specific case of sparse coding, how does one analyze all the critical points of the landscape, or even just (dis?)prove that the right answer is a global minimum? Can one exactly characterize the set of functions parameterized by the architecture? How does one (dis?)prove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even with just Boolean inputs and unrestricted weights!) Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer immediately shows that the deep net function class strictly increases with depth!) Can one show that *every* low-depth function has a “simpler” (maybe smaller circuit size or smaller weight) representation at higher depths? In the space of deep net functions, how dense are the “high complexity” functions? (like those with Ω(size^dimension) affine pieces) Are there Boolean functions which have smaller representations using ReLU gates than LTF gates? (A peculiarly puzzling question!) Anirbit ( AMS Johns Hopkins University ) 14th November 2017 30 / 30