1. Basic Concepts of Large Scale
Optimization for Machine Learning
Devdatt Dubhashi
AI and Data Science
Computer Science and Engineering
Chalmers
Machine Intelligence Sweden AB
2. Behind the Cat Pictures …
• Amazing successes of ML in
computer vision, natural
language processing …
• Under the hood is optimization
• Large scale machine learning:
– large n (data points)
– large d (dimension)
4. Empirical Risk Minimization (ERM)
• Labelled training data: (x_1, y_1), …, (x_n, y_n)
• Parametrized class of prediction functions: h(·; w), w ∈ R^d
• Empirical loss: F(w) = (1/n) Σ_{i=1}^n f_i(w), where f_i(w) is the loss of h(·; w) on (x_i, y_i)
• ERM: min_w F(w)
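The slide leaves the loss and predictor abstract; the minimal NumPy sketch below instantiates them with a linear predictor h(x; w) = wᵀx and squared loss on synthetic data (the function name and data are illustrative assumptions, not from the slides).

```python
import numpy as np

def empirical_risk(w, X, y):
    """F(w) = (1/n) * sum_i f_i(w) for a linear predictor with squared loss."""
    residuals = X @ w - y               # h(x_i; w) - y_i for every data point
    return 0.5 * np.mean(residuals**2)  # average per-example loss

# Illustrative synthetic data: n = 100 points in d = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y))
```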
5. Data Driven Clustering
• Given data points, cluster them
• Classic K-means algorithm (minimal sketch below)
• Needs to know k, the number of clusters
• Data driven clustering: find the
right number of clusters driven
by data. (Panahi, D: ICML 2017)
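For reference, a minimal sketch of the classic K-means (Lloyd's) iteration mentioned above; it must be told k up front, which is precisely what the data-driven approach avoids. The implementation details here are generic, not taken from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Classic Lloyd's algorithm: alternate nearest-centre assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialise from the data
    for _ in range(n_iter):
        # Assign every point to its closest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each centre to the mean of its assigned points (keep it if empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```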
9. Stochastic Gradient Descent (SGD)
Robbins and Monro, 1951
• Update: w_{k+1} = w_k - α_k ∇f_{i_k}(w_k)
• Index i_k sampled uniformly at random with replacement from [n]
• Cost per iteration is O(d)
• Hugely successful in machine learning!
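A minimal sketch of the update above, assuming the least-squares f_i(w) = ½(x_iᵀw - y_i)² so the per-example gradient has a closed form; the step size and iteration count are placeholder choices, not recommendations.

```python
import numpy as np

def sgd(X, y, step=0.01, n_iter=10_000, seed=0):
    """SGD for least squares: one uniformly sampled index per iteration, O(d) work."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                # i_k uniform on [n], with replacement
        grad_i = (X[i] @ w - y[i]) * X[i]  # ∇f_i(w) for the squared loss
        w -= step * grad_i
    return w
```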
10. Stochastic, Batch and Full Gradient Descent
• Full GD: w_{k+1} = w_k - α_k (1/n) Σ_{i=1}^n ∇f_i(w_k)
• Minibatch GD: w_{k+1} = w_k - α_k (1/|B_k|) Σ_{i∈B_k} ∇f_i(w_k)
• Stochastic GD: w_{k+1} = w_k - α_k ∇f_{i_k}(w_k)
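The three updates differ only in how many per-example gradients are averaged each step. A sketch in the same least-squares setting: batch_size = n recovers full GD, 1 < batch_size < n minibatch GD, and batch_size = 1 plain SGD (the function and its arguments are illustrative).

```python
import numpy as np

def gradient_step(w, X, y, step, batch_size, rng):
    """One update averaging batch_size per-example gradients (squared loss)."""
    n = len(y)
    idx = rng.choice(n, size=batch_size, replace=False)   # batch_size = n uses every point
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # averaged gradient over the batch
    return w - step * grad
```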
11. The Unreasonable Effectiveness of SGD
• Very fast initial convergence
• Cheap O(d) per iteration as
opposed to O(nd) for full GD
• Very slow at the end ... Convergence is only O(1/√k) for smooth convex and O(1/k) for smooth, strongly convex functions.
• … but we do not need to run the iterations to optimum; better to stop early (Bottou and Bousquet)
12. SGD: Have the Cake and Eat it Too!
(Bottou and Bousquet 2008)
24. MP-SAGA: Stochastic Prox with Variance Reduction
Panahi, Dubhashi (ICML 2017; 2019, under review): proximal operator in closed form!
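For background, a sketch of the standard SAGA gradient estimator (Defazio et al., 2014) that MP-SAGA builds on; the proximal, closed-form step of MP-SAGA itself is not reproduced here, and the least-squares loss and step size are assumptions.

```python
import numpy as np

def saga(X, y, step=0.05, n_iter=10_000, seed=0):
    """SAGA for least squares: SGD with a stored-gradient table to reduce variance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    table = np.zeros((n, d))        # last gradient seen for each example
    table_mean = table.mean(axis=0)
    for _ in range(n_iter):
        j = rng.integers(n)
        g_new = (X[j] @ w - y[j]) * X[j]
        # Variance-reduced direction: fresh gradient minus stale one plus table average.
        w -= step * (g_new - table[j] + table_mean)
        table_mean += (g_new - table[j]) / n
        table[j] = g_new
    return w
```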
25. SGD for Deep Learning
• SGD variants (Adagrad, RMSprop, Adam …) used to train
neural networks.
• Use aggressive adaptation with different learning rates for
different parameters.
• Theory says it shouldn’t work for highly nonconvex problems!
• But Adagrad greatly improved the robustness of SGD, and Google used it to train large-scale neural nets to recognize cats.
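A sketch of the AdaGrad-style per-parameter adaptation on the same least-squares toy problem; in deep learning the same rule is applied to minibatch gradients of the network loss (the function name and constants here are illustrative).

```python
import numpy as np

def adagrad(X, y, step=0.5, eps=1e-8, n_iter=10_000, seed=0):
    """SGD with AdaGrad scaling: each coordinate gets its own effective learning rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    acc = np.zeros(d)                             # running sum of squared gradients
    for _ in range(n_iter):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]              # stochastic gradient of f_i
        acc += g**2
        w -= step * g / (np.sqrt(acc) + eps)      # aggressive per-coordinate adaptation
    return w
```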