1. Basic Concepts of Large Scale
Optimization for Machine Learning
Devdatt Dubhashi
AI and Data Science
Computer Science and Engineering
Chalmers
Machine Intelligence Sweden AB
2. Behind the Cat Pictures …
• Amazing successes of ML in
computer vision, natural
language processing …
• Under the hood is optimization
• Large scale machine learning:
– large n (data points)
– large d (dimension)
4. Empirical Risk Minimization (ERM)
• Labelled training data: (x_1, y_1), …, (x_n, y_n)
• Parametrized class of prediction functions: h(·; w), w ∈ R^d
• Empirical loss: F(w) = (1/n) Σ_{i=1}^n f_i(w), where f_i(w) is the loss of h(·; w) on (x_i, y_i)
• ERM: min_w F(w)
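The slide leaves the loss and predictor abstract; the minimal NumPy sketch below instantiates them with a linear predictor h(x; w) = wᵀx and squared loss on synthetic data (the function name and data are illustrative assumptions, not from the slides).

```python
import numpy as np

def empirical_risk(w, X, y):
    """F(w) = (1/n) * sum_i f_i(w) for a linear predictor with squared loss."""
    residuals = X @ w - y               # h(x_i; w) - y_i for every data point
    return 0.5 * np.mean(residuals**2)  # average per-example loss

# Illustrative synthetic data: n = 100 points in d = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y))
```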
5. Data Driven Clustering
• Given data points, cluster them
• Classic K-means algorithm (minimal sketch below)
• Needs to know k, the number of clusters
• Data driven clustering: find the
right number of clusters driven
by data. (Panahi, D: ICML 2017)
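For reference, a minimal sketch of the classic K-means (Lloyd's) iteration mentioned above; it must be told k up front, which is precisely what the data-driven approach avoids. The implementation details here are generic, not taken from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Classic Lloyd's algorithm: alternate nearest-centre assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialise from the data
    for _ in range(n_iter):
        # Assign every point to its closest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each centre to the mean of its assigned points (keep it if empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```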
9. Stochastic Gradient Descent (SGD)
Robbins and Monro, 1951
• Update: w_{k+1} = w_k - α_k ∇f_{i_k}(w_k)
• Index i_k sampled uniformly at random with replacement from [n]
• Cost per iteration is O(d)
• Hugely successful in machine learning!
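A minimal sketch of the update above, assuming the least-squares f_i(w) = ½(x_iᵀw - y_i)² so the per-example gradient has a closed form; the step size and iteration count are placeholder choices, not recommendations.

```python
import numpy as np

def sgd(X, y, step=0.01, n_iter=10_000, seed=0):
    """SGD for least squares: one uniformly sampled index per iteration, O(d) work."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                # i_k uniform on [n], with replacement
        grad_i = (X[i] @ w - y[i]) * X[i]  # ∇f_i(w) for the squared loss
        w -= step * grad_i
    return w
```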
10. Stochastic, Batch and Full Gradient Descent
• Full GD: w_{k+1} = w_k - α_k (1/n) Σ_{i=1}^n ∇f_i(w_k)
• Minibatch GD: w_{k+1} = w_k - α_k (1/|B_k|) Σ_{i∈B_k} ∇f_i(w_k)
• Stochastic GD: w_{k+1} = w_k - α_k ∇f_{i_k}(w_k)
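The three updates differ only in how many per-example gradients are averaged each step. A sketch in the same least-squares setting: batch_size = n recovers full GD, 1 < batch_size < n minibatch GD, and batch_size = 1 plain SGD (the function and its arguments are illustrative).

```python
import numpy as np

def gradient_step(w, X, y, step, batch_size, rng):
    """One update averaging batch_size per-example gradients (squared loss)."""
    n = len(y)
    idx = rng.choice(n, size=batch_size, replace=False)   # batch_size = n uses every point
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # averaged gradient over the batch
    return w - step * grad
```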
11. The Unreasonable Effectiveness of SGD
• Very fast initial convergence
• Cheap O(d) per iteration as
opposed to O(nd) for full GD
• Very slow at the end ... Convergence is only O(1/√k) for smooth convex and O(1/k) for smooth, strongly convex functions.
• … but we do not need to run the iterations to optimum; better to stop early (Bottou and Bousquet)
12. SGD: Have the Cake and Eat it Too!
(Bottou and Bousquet 2008)
24. MP-SAGA: Stochastic Prox with Variance Reduction
Panahi, Dubhashi (ICML 2017; 2019, under review): proximal operator in closed form!
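For background, a sketch of the standard SAGA gradient estimator (Defazio et al., 2014) that MP-SAGA builds on; the proximal, closed-form step of MP-SAGA itself is not reproduced here, and the least-squares loss and step size are assumptions.

```python
import numpy as np

def saga(X, y, step=0.05, n_iter=10_000, seed=0):
    """SAGA for least squares: SGD with a stored-gradient table to reduce variance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    table = np.zeros((n, d))        # last gradient seen for each example
    table_mean = table.mean(axis=0)
    for _ in range(n_iter):
        j = rng.integers(n)
        g_new = (X[j] @ w - y[j]) * X[j]
        # Variance-reduced direction: fresh gradient minus stale one plus table average.
        w -= step * (g_new - table[j] + table_mean)
        table_mean += (g_new - table[j]) / n
        table[j] = g_new
    return w
```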
25. SGD for Deep Learning
• SGD variants (Adagrad, RMSprop, Adam …) used to train
neural networks.
• Use aggressive adaptation with different learning rates for
different parameters.
• Theory says it shouldn’t work for highly nonconvex problems!
• But Adagrad greatly improved the robustness of SGD, and Google used it to train large-scale neural nets to recognize cats.
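A sketch of the AdaGrad-style per-parameter adaptation on the same least-squares toy problem; in deep learning the same rule is applied to minibatch gradients of the network loss (the function name and constants here are illustrative).

```python
import numpy as np

def adagrad(X, y, step=0.5, eps=1e-8, n_iter=10_000, seed=0):
    """SGD with AdaGrad scaling: each coordinate gets its own effective learning rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    acc = np.zeros(d)                             # running sum of squared gradients
    for _ in range(n_iter):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]              # stochastic gradient of f_i
        acc += g**2
        w -= step * g / (np.sqrt(acc) + eps)      # aggressive per-coordinate adaptation
    return w
```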