6. Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima? What are the characteristics of a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of data points?
8. Composite functions
Shallow nets: # of parameters grows exponentially with the dimension of the function (curse of dimensionality).
Deep nets on compositional functions: # of units grows only linearly with the dimension.
For non-compositional functions, deep learning has no such advantage and performs worse.
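To make the contrast concrete, the bounds in [3] have roughly the following flavor (a sketch of the statement for binary-tree compositional targets whose constituent functions have smoothness m, not the exact theorem): the number of units N needed to reach approximation accuracy ε on an n-variable function is

```latex
% Sketch of the flavor of the bounds in [3], not the exact statement.
% N = number of units needed for accuracy \epsilon on an n-variable target
% whose constituent functions have smoothness m.
\[
  N_{\mathrm{shallow}} = O\!\left(\epsilon^{-n/m}\right),
  \qquad
  N_{\mathrm{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right).
\]
% The deep bound applies only to compositional targets, e.g. a binary tree
% of two-variable constituent functions:
\[
  f(x_1, x_2, x_3, x_4) = h_2\bigl(h_{11}(x_1, x_2),\, h_{12}(x_3, x_4)\bigr)
\]
```

The exponential dependence on n disappears for deep nets only because the network can mirror the compositional structure of the target.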
9. Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima? What are the characteristics of a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of data points?
11. Optimization 1
Linear equations: # of unknowns > # of equations ⇒ more than one solution.
Neural nets for ImageNet: # of parameters (millions) ≫ # of training samples ⇒ overparameterization.
Bézout's theorem: the number of solutions can exceed the number of atoms in the universe ⇒ degenerate: each solution sits inside an infinite solution set.
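To make the linear-algebra analogy concrete, here is a minimal numpy sketch (the matrix, sizes, and variable names are illustrative, not taken from the slides): with more unknowns than equations, any particular solution plus any null-space direction of A is again a solution, so the solution set is an entire affine subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined system: 3 equations, 10 unknowns (more unknowns than equations).
A = rng.normal(size=(3, 10))
b = rng.normal(size=3)

# One particular solution: the minimum-norm solution from the pseudoinverse.
x_particular = np.linalg.pinv(A) @ b

# A basis for the null space of A: directions that leave A @ x unchanged.
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[3:]                      # 7 directions orthogonal to the row space of A

# Any point x_particular + (combination of null-space directions) is also a solution.
x_other = x_particular + null_basis.T @ rng.normal(size=7)

print(np.allclose(A @ x_particular, b))  # True
print(np.allclose(A @ x_other, b))       # True, yet x_other != x_particular
```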
12. Optimization 2
Overparameterization: neural nets have infinitely many global optima, which form flat (plateau-like) valleys in the loss landscape.
SGD stays in these degenerate valleys with high probability.
Good news: optimization is easy; global optima exist, there are many of them, and they are easy for optimization algorithms to find.
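As a hedged toy illustration of "many global optima, easy to find" (my example, not an experiment from the references, with plain gradient descent standing in for SGD): on an overparameterized least-squares problem, gradient descent reaches near-zero training loss from different random initializations, and the two runs end at clearly different minimizers, i.e., different points in the same flat valley.

```python
import numpy as np

rng = np.random.default_rng(1)

# Overparameterized least squares: 5 samples, 50 parameters.
X = rng.normal(size=(5, 50))
y = rng.normal(size=5)

def run_gd(seed, steps=2000, lr=0.01):
    """Plain gradient descent on the squared loss from a random initialization."""
    w = np.random.default_rng(seed).normal(size=50)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def loss(w):
    return np.mean((X @ w - y) ** 2)

w_a = run_gd(seed=2)
w_b = run_gd(seed=3)

print(loss(w_a), loss(w_b))          # both ~0: two global minimizers
print(np.linalg.norm(w_a - w_b))     # clearly nonzero: different points in the valley
```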
13. Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima? What are the characteristics of a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of data points?
15. Generalization 1
Overparameterization: good for optimization, but (classically) bad for generalization.
Deep learning: the tasks mix reasonably well with the loss functions used.
Srebro's work: cross entropy wins, i.e., overfitting the test loss ⇏ overfitting the classification error.
Viewing training as a dynamical system (differential equation): near a global minimum, a deep NN behaves like a linear network.
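One way to read the "behaves like a linear network" remark (my paraphrase, not a quotation from [4, 6]): near a minimizer w*, the network is well approximated by its first-order Taylor expansion in the weights,

```latex
\[
  f(x; w) \;\approx\; f(x; w^{\star}) + \nabla_w f(x; w^{\star})^{\top} (w - w^{\star}),
\]
```

so the local training dynamics are those of a model that is linear in its parameters.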
16. Generalization 2
Srebro's work: cross entropy wins, i.e., overfitting the test loss ⇏ overfitting the classification error.
Cross entropy belongs to the family of exponential-type losses.
Does this asymmetry imply a special property? (open question)
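A toy sketch of why the exponential tail of the cross-entropy/logistic loss matters (my illustration, not Srebro's experiment): on linearly separable data the loss never reaches zero at finite weights, so gradient descent keeps shrinking the loss and growing the weight norm long after the 0-1 classification decisions have stopped changing; the loss value and the classification error therefore decouple.

```python
import numpy as np

# Linearly separable 1-D data: positive points to the right, negative to the left.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def logistic_loss(w):
    return np.mean(np.log1p(np.exp(-y * (w * x))))

w, lr = 0.0, 0.5
for step in range(1, 20001):
    margins = y * (w * x)
    grad = np.mean(-y * x / (1.0 + np.exp(margins)))  # gradient of the logistic loss
    w -= lr * grad
    if step in (10, 1000, 20000):
        err = np.mean(np.sign(w * x) != y)
        print(f"step={step:6d}  w={w:8.3f}  loss={logistic_loss(w):.4f}  0-1 error={err}")

# The 0-1 error hits zero almost immediately and stays there, while the loss
# keeps shrinking (and |w| keeps growing) without ever reaching zero.
```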
18. Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima? What are the characteristics of a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of data points?
What's More?
19. Plateau optimum ⇒ better generalization?
Overfitting? Look out!
Do we need priors?
Is brain research useful for DL?
20. References
1. Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1-49.
2. Neyshabur, B., Tomioka, R., Salakhutdinov, R., & Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.
3. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5), 503-519.
4. Liao, Q., & Poggio, T. (2017). Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv preprint arXiv:1703.09833.
5. Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., & Poggio, T. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. arXiv preprint arXiv:1801.02254.
6. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... & Mhaskar, H. (2017). Theory of Deep Learning III: Explaining the Non-overfitting Puzzle. arXiv preprint arXiv:1801.00173.
7. Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., & Poggio, T. (2017). Theory of Deep Learning III: Generalization Properties of SGD. Center for Brains, Minds and Machines (CBMM).
8. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
9. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.