Stochastic Gradient Descent with
Exponential Convergence Rates of
Expected Classification Errors
Atsushi Nitanda and Taiji Suzuki
AISTATS
April 18th, 2019
Naha, Okinawa
RIKEN AIP
Overview
• Topic
Convergence analysis of (averaged) SGD for binary classification
problems.
• Key assumption
Strongest version of low noise condition (margin condition) on the
conditional label probability.
• Result
Exponential convergence rates of expected classification errors
2
Background
• Stochastic Gradient Descent (SGD)
Simple and effective method for training machine learning models.
Significantly faster than vanilla gradient descent.
• Convergence Rates
Expected risk: sublinear convergence $O(1/n^{\alpha})$, $\alpha \in [1/2, 1]$.
Expected classification error: how fast does it converge?
GD vs. SGD
SGD: $g_{t+1} \leftarrow g_t - \eta\, G_\lambda(g_t, Z_t)$, $Z_t \sim \rho$,
GD: $g_{t+1} \leftarrow g_t - \eta\, \mathbb{E}_{Z_t \sim \rho}[G_\lambda(g_t, Z_t)]$.
Cost per iteration: 1 example (SGD) vs. #data examples (GD)
(a toy sketch comparing the two steps follows below)
3
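To make the cost gap concrete, here is a minimal NumPy sketch (an illustration with a plain linear model, not the kernel setting analyzed later) comparing one SGD step against one full-batch GD step for the regularized logistic loss; the data `X`, labels `Y` in {-1, +1}, the step size `eta`, and the regularization `lam` are placeholder assumptions.

```python
import numpy as np

def grad_logistic(w, x, y, lam):
    """Gradient of l(w.x, y) + (lam/2)||w||^2 with the logistic loss, y in {-1, +1}."""
    return -y * x / (1.0 + np.exp(y * (x @ w))) + lam * w

def sgd_step(w, X, Y, lam, eta, rng):
    """One SGD step: gradient of a single randomly drawn example (cost O(d))."""
    i = rng.integers(len(Y))
    return w - eta * grad_logistic(w, X[i], Y[i], lam)

def gd_step(w, X, Y, lam, eta):
    """One GD step: gradient averaged over all n examples (cost O(n d))."""
    full_grad = np.mean([grad_logistic(w, X[i], Y[i], lam) for i in range(len(Y))], axis=0)
    return w - eta * full_grad
```

Both updates move along the same kind of gradient; SGD simply replaces the expectation over the data by a single sample, which is why its per-iteration cost does not grow with the number of examples.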
Background
Common way to bound classification error.
• Classification error bound via consistency of loss functions:
[T. Zhang(2004), P. Bartlett+(2006)]
$\mathbb{P}[\operatorname{sgn}(g(X)) \neq Y] - \mathbb{P}[\operatorname{sgn}(2\rho(1|X) - 1) \neq Y] \;\lesssim\; (\mathcal{L}(g) - \mathcal{L}^*)^{p}$,
(left-hand side: excess classification error; right-hand side: excess risk)
$g$: predictor, $\mathcal{L}^*$: Bayes optimal risk for $\mathcal{L}$,
$\rho(1|X)$: conditional probability of label $Y = 1$.
𝑝 = 1/2 for logistic, exponential, and squared losses.
• Sublinear convergence $O(1/n^{\alpha p})$ of the excess classification error.
4
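Composing the two displays above gives the sublinear rate explicitly; a short derivation (using Jensen's inequality for the concave map $u \mapsto u^{p}$, $p \le 1$):

```latex
\mathbb{E}[\mathcal{L}(g_n)] - \mathcal{L}^* = O(n^{-\alpha})
\ \Longrightarrow\
\mathbb{E}\big[\text{excess classification error of } g_n\big]
\;\lesssim\; \mathbb{E}\big[(\mathcal{L}(g_n) - \mathcal{L}^*)^{p}\big]
\;\le\; \big(\mathbb{E}[\mathcal{L}(g_n)] - \mathcal{L}^*\big)^{p}
\;=\; O(n^{-\alpha p}).
```

For the logistic loss ($p = 1/2$) and the fastest risk rate $\alpha = 1$, this gives $O(n^{-1/2})$.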
Background
Faster convergence rates of excess classification error.
• Low noise condition on $\rho(Y = 1 \mid X)$
[A.B. Tsybakov (2004), P. Bartlett+ (2006)]
improves the consistency property,
resulting in faster rates: $O(1/n)$ (still sublinear convergence).
• Low noise condition (strongest version)
[V. Koltchinskii & O. Beznosova (2005), J-Y. Audibert & A.B. Tsybakov (2007)]
accelerates the rates for ERM to linear rates $O(\exp(-n))$.
5
Background
Faster convergence rates of excess classification error for SGD.
• Linear convergence rate
[L. Pillaud-Vivien, A. Rudi, & F. Bach (2018)]
has been shown for the squared loss function under the strong low
noise condition.
• This work
shows the linear convergence for more suitable loss functions (e.g.,
logistic loss) under the strong low noise condition.
6
Outline
• Problem Settings and Assumptions
• (Averaged) Stochastic Gradient Descent
• Main Results: Linear Convergence Rates of SGD and ASGD
• Proof Idea
• Toy Experiment
7
Problem Setting
• Regularized expected risk minimization problems
$$\min_{g \in \mathcal{H}_k} \; \mathcal{L}_\lambda(g) = \mathbb{E}_{(X,Y)}\big[\,l(g(X), Y)\,\big] + \frac{\lambda}{2}\,\|g\|_{\mathcal{H}_k}^{2},$$
$(\mathcal{H}_k, \langle\cdot,\cdot\rangle_{\mathcal{H}_k})$: reproducing kernel Hilbert space,
$l$: differentiable loss,
$(X, Y)$: random variables on the feature space and the label set $\{-1, +1\}$,
$\lambda$: regularization parameter.
8
Loss Function
Example: $\exists\, \phi: \mathbb{R} \to \mathbb{R}_{\geq 0}$ convex s.t. $l(\zeta, y) = \phi(y\zeta)$,
$$\phi(v) = \begin{cases} \log(1 + \exp(-v)) & \text{(logistic loss)}, \\ \exp(-v) & \text{(exponential loss)}, \\ (1 - v)^{2} & \text{(squared loss)}. \end{cases}$$
(A Python sketch of these losses follows below.)
9
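A small Python sketch of these margin losses and their derivatives in $\zeta$ (the derivatives are what the stochastic gradient introduced later uses); writing the squared loss in the margin form $(1 - v)^2$ is an assumption of this sketch.

```python
import numpy as np

# phi(v) for losses of the form l(zeta, y) = phi(y * zeta), and their derivatives.
def logistic(v):      return np.log1p(np.exp(-v))
def d_logistic(v):    return -1.0 / (1.0 + np.exp(v))

def exponential(v):   return np.exp(-v)
def d_exponential(v): return -np.exp(-v)

def squared(v):       return (1.0 - v) ** 2      # margin form of the squared loss (assumption)
def d_squared(v):     return -2.0 * (1.0 - v)

def dloss_dzeta(dphi, zeta, y):
    """Chain rule: d/dzeta l(zeta, y) = y * phi'(y * zeta)."""
    return y * dphi(y * zeta)
```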
Assumption
- $\sup_{x \in \mathcal{X}} k(x, x) \leq R^{2}$,
- $\exists M > 0$: $|\partial_\zeta l(\zeta, y)| \leq M$,
- $\exists L > 0$: $\forall g, h \in \mathcal{H}_k$, $\mathcal{L}(g + h) - \mathcal{L}(g) - \langle \nabla\mathcal{L}(g), h\rangle_{\mathcal{H}_k} \leq \frac{L}{2}\,\|h\|_{\mathcal{H}_k}^{2}$,
- $\rho(Y = 1 \mid X) \in (0, 1)$ a.e.,
- $h_*$: increasing function on $(0, 1)$,
- $\operatorname{sgn}(\mu - 0.5) = \operatorname{sgn}(h_*(\mu))$,
- $g_* := \operatorname*{arg\,min}_{g:\,\text{measurable}} \mathcal{L}(g) \in \mathcal{H}_k$.
Remark The logistic loss satisfies these assumptions (a quick check follows below).
The other loss functions also satisfy them after restricting the hypothesis space.
10
Link function:
$$h_*(\mu) = \operatorname*{arg\,min}_{h \in \mathbb{R}} \big\{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\big\}.$$
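As a quick check of the remark, a sketch of why the logistic loss $\phi(v) = \log(1 + e^{-v})$ fits the assumptions, using the kernel bound $\sup_x k(x, x) \le R^2$; the resulting constants $M = 1$ and $L \le R^2/4$ come from this calculation, not from the slide.

```latex
|\partial_\zeta l(\zeta, y)| = \frac{1}{1 + e^{y\zeta}} \le 1 \ (\text{so } M = 1),
\qquad 0 \le \phi''(v) = \frac{e^{-v}}{(1 + e^{-v})^{2}} \le \frac{1}{4},
\\[4pt]
\mathcal{L}(g + h) - \mathcal{L}(g) - \langle \nabla\mathcal{L}(g), h\rangle_{\mathcal{H}_k}
= \mathbb{E}\Big[\tfrac{1}{2}\,\phi''(\xi_X)\,h(X)^{2}\Big]
\le \tfrac{1}{8}\,\mathbb{E}\big[h(X)^{2}\big]
\le \tfrac{R^{2}}{8}\,\|h\|_{\mathcal{H}_k}^{2},
```

where $\xi_X$ lies between $Y g(X)$ and $Y(g + h)(X)$, and $|h(X)| \le \sqrt{k(X, X)}\,\|h\|_{\mathcal{H}_k} \le R\,\|h\|_{\mathcal{H}_k}$ is used in the last step; hence the smoothness assumption holds with $L \le R^2/4$. The logistic link $h_*(\mu) = \log\frac{\mu}{1-\mu}$ is increasing on $(0, 1)$ with $\operatorname{sgn}(h_*(\mu)) = \operatorname{sgn}(\mu - 0.5)$.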
Strongest Low Noise Condition
Assumption $\exists\, \delta \in (0, 1/2)$ such that, for a.e. $X$ w.r.t. $\rho_{\mathcal{X}}$,
$$|\rho(Y = 1 \mid X) - 0.5| > \delta.$$
11
[Figure: conditional probability $\rho(Y = 1 \mid x)$ over $\mathcal{X}$, staying at least $\delta$ above or below $0.5$; regions labeled $Y = -1$ and $Y = +1$.]
Strongest Low Noise Condition
Assumption $\exists\, \delta \in (0, 1/2)$ such that, for a.e. $X$ w.r.t. $\rho_{\mathcal{X}}$, $|\rho(Y = 1 \mid X) - 0.5| > \delta$.
Example [Figures: MNIST and the toy data used in the experiment satisfy this condition.]
12
(Averaged) Stochastic Gradient Descent
13
• Stochastic gradient in the RKHS:
$$G_\lambda(g, X, Y) = \partial_\zeta l(g(X), Y)\, k(X, \cdot) + \lambda g.$$
• Step size and averaging weights:
$$\eta_t = \frac{2}{\lambda(\gamma + t)}, \qquad \alpha_t = \frac{2(\gamma + t - 1)}{(2\gamma + T)(T + 1)},$$
so SGD iterates $g_{t+1} \leftarrow g_t - \eta_t\, G_\lambda(g_t, X_t, Y_t)$ and ASGD returns the weighted average $\bar{g}_{T+1} = \sum_{t=1}^{T+1} \alpha_t g_t$ (the weights $\alpha_t$ sum to one).
Note: the averaging can be updated iteratively (see the sketch below).
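A minimal Python sketch of kernel SGD with the logistic loss and this iterative averaging, storing the iterate through its coefficients on the sampled points. The Gaussian bandwidth, `lam`, `gamma`, and the sampling routine `sample()` are placeholder assumptions; the running average uses weights proportional to $\gamma + t - 1$, which reproduces the $\alpha_t$ above at the end.

```python
import numpy as np

def gaussian_kernel(x, xs, bandwidth=1.0):
    """Vector of k(x, x_i) for the stored points xs (Gaussian kernel)."""
    if len(xs) == 0:
        return np.zeros(0)
    d2 = np.sum((np.asarray(xs) - np.asarray(x)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_asgd(sample, T, lam=1e-3, gamma=2.0, bandwidth=1.0):
    """Kernel SGD for the regularized logistic loss, with on-the-fly averaging.

    sample() returns one (x, y) with x a 1-D feature vector and y in {-1, +1}.
    The iterate g_t = sum_i coef[i] * k(x_i, .) is stored via its coefficients.
    Returns the support points, the coefficients of g_{T+1}, and those of bar_g_{T+1}.
    """
    xs = []
    coef = np.zeros(T + 1)   # coefficients of the current iterate g_t (g_1 = 0)
    avg = np.zeros(T + 1)    # coefficients of the running weighted average
    weight_sum = 0.0
    for t in range(1, T + 2):
        # Fold g_t into the average with weight (gamma + t - 1); after t = T + 1 this
        # equals sum_t alpha_t g_t with alpha_t = 2(gamma + t - 1)/((2 gamma + T)(T + 1)).
        w = gamma + t - 1.0
        weight_sum += w
        avg += (w / weight_sum) * (coef - avg)
        if t == T + 1:
            break
        # SGD step: g_{t+1} = (1 - eta_t * lam) g_t - eta_t * d/dzeta l(g_t(x), y) * k(x, .)
        x, y = sample()
        eta = 2.0 / (lam * (gamma + t))
        g_x = float(coef[: len(xs)] @ gaussian_kernel(x, xs, bandwidth))
        dloss = -y / (1.0 + np.exp(y * g_x))   # logistic loss derivative in zeta
        coef *= 1.0 - eta * lam
        xs.append(np.asarray(x, dtype=float))
        coef[len(xs) - 1] = -eta * dloss
    return xs, coef, avg
```

Prediction then uses $\operatorname{sgn}\big(\sum_i \texttt{avg}[i]\, k(x_i, x)\big)$; the results below concern exactly this averaged (or last) iterate.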
Convergence Analyses
• For simplicity, we focus on the following case:
𝑔1 = 0,
𝑘: Gaussian kernel,
𝜙 𝑣 = log(1 + exp(−𝑣)): Logistic loss.
• We analyze convergence rates of excess classification error:
$$\mathcal{R}(g) - \mathcal{R}^* := \mathbb{P}[\operatorname{sgn}(g(X)) \neq Y] - \mathbb{P}[\operatorname{sgn}(g_*(X)) \neq Y].$$
14
Main Result 1: Linear Convergence of SGD
Theorem There exists $\lambda > 0$ such that the following holds.
Assume $\eta_1 \leq \min\{1/(L + \lambda),\, 1/(2\lambda)\}$ and $\mathbb{V}[\partial_\zeta l(g(X), Y)\, k(X, \cdot)] \leq \sigma^{2}$.
Set $\nu := \max\big\{\, 2(L + \lambda)\sigma^{2}/\lambda^{2},\ (1 + \gamma)\big(\mathcal{L}_\lambda(g_1) - \mathcal{L}_\lambda(g_\lambda)\big) \big\}$.
Then, for $T \geq \dfrac{\nu}{\lambda}\, \log^{-1}\!\Big(\dfrac{1 + 2\delta}{1 - 2\delta}\Big) - \gamma$, we have
$$\mathbb{E}[\mathcal{R}(g_{T+1})] - \mathcal{R}^* \;\leq\; 2\exp\!\Big(-\frac{\lambda^{2}(\gamma + T)}{C}\,\log^{2}\!\Big(\frac{1 + 2\delta}{1 - 2\delta}\Big)\Big)$$
for an explicit numerical constant $C > 0$.
#iterations for an $\epsilon$-solution: $O\!\Big(\dfrac{1}{\lambda^{2}}\,\log\dfrac{1}{\epsilon}\,\log^{-2}\!\Big(\dfrac{1 + 2\delta}{1 - 2\delta}\Big)\Big)$
(a short derivation of this count from the bound is given below).
15
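The iteration count follows directly from the exponential bound; a sketch, writing $m(\delta) = \log\frac{1 + 2\delta}{1 - 2\delta}$ and using the constant $C$ from the bound as stated above:

```latex
2\exp\!\Big(-\frac{\lambda^{2}(\gamma + T)}{C}\, m(\delta)^{2}\Big) \le \epsilon
\quad\Longleftrightarrow\quad
\gamma + T \;\ge\; \frac{C}{\lambda^{2}}\, m(\delta)^{-2}\, \log\frac{2}{\epsilon},
```

which matches the stated $O\big(\frac{1}{\lambda^{2}} \log\frac{1}{\epsilon}\, \log^{-2}\frac{1 + 2\delta}{1 - 2\delta}\big)$ complexity; the $\epsilon$-dependence is only logarithmic, which is what the linear (exponential) rate buys.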
Main Result 2: Linear Convergence of ASGD
Theorem There exists $\lambda > 0$ such that the following holds.
Assume $\eta_1 \leq \min\{1/L,\, 1/(2\lambda)\}$. Then, if
$$\max\Big\{\, \frac{\nu}{\lambda^{2}(\gamma + T)},\ \frac{\gamma(\gamma - 1)\,\|g_\lambda\|_{\mathcal{H}_k}^{2}}{(\gamma + T)(T + 1)} \Big\} \;\lesssim\; \log^{2}\!\Big(\frac{1 + 2\delta}{1 - 2\delta}\Big),$$
we have
$$\mathbb{E}[\mathcal{R}(\bar{g}_{T+1})] - \mathcal{R}^* \;\leq\; 2\exp\!\Big(-\frac{\lambda^{2}(2\gamma + T)}{C'}\,\log^{2}\!\Big(\frac{1 + 2\delta}{1 - 2\delta}\Big)\Big)$$
for an explicit numerical constant $C' > 0$.
#iterations for an $\epsilon$-solution: $O\!\Big(\dfrac{1}{\lambda^{2}}\,\log\dfrac{1}{\epsilon}\,\log^{-2}\!\Big(\dfrac{1 + 2\delta}{1 - 2\delta}\Big)\Big)$.
Remark The condition on $T$ is much improved compared with plain SGD;
its dominant term is already satisfied for moderately small $\epsilon$.
16
Toy Experiment
• 2-dimensional toy dataset.
• $\delta \in \{0.1, 0.25, 0.4\}$.
• Linearly separable.
• Logistic loss.
• $\lambda$ was determined by validation.
Right figure
Generated samples for $\delta = 0.4$;
$x_1 = 1$ is the Bayes optimal decision boundary (a data-generation sketch follows below).
17
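A hedged sketch of a generator matching this description (the feature distribution is not given on the slide; the uniform distribution below is an assumption), with $P(Y = 1 \mid x) = 0.5 + \delta$ on one side of the boundary $x_1 = 1$ and $0.5 - \delta$ on the other:

```python
import numpy as np

def generate_toy_data(n, delta=0.4, seed=0):
    """2-D toy data satisfying the strong low-noise condition with margin delta.

    P(Y = +1 | x) = 0.5 + delta if x[0] > 1 else 0.5 - delta,
    so sgn(x[0] - 1) is the Bayes optimal classifier.
    The uniform feature distribution on [0, 2]^2 is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 2.0, size=(n, 2))
    p_pos = np.where(X[:, 0] > 1.0, 0.5 + delta, 0.5 - delta)
    Y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
    return X, Y

def excess_error(pred, X, Y):
    """Empirical excess classification error of predictions vs. the Bayes rule sgn(x1 - 1)."""
    bayes = np.where(X[:, 0] > 1.0, 1, -1)
    return float(np.mean(pred != Y) - np.mean(bayes != Y))
```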
18
From top to bottom:
1. Risk value
2. Classification error
3. Excess classification error / excess risk value
Purple line: SGD; blue line: ASGD.
ASGD is much faster, especially when $\delta = 0.4$.
Summary
• We explained that convergence rates of expected classification
errors for (A)SGD are sublinear, $O(1/n^{\alpha})$, in general.
• We showed that these rates can be accelerated to linear rates
$O(\exp(-n))$ under the strong low noise condition.
Future Work
• Faster convergence under additional assumptions.
• Variants of SGD (acceleration, variance reduction).
• Non-convex models such as deep neural networks.
• Random Fourier features (ongoing work with collaborators).
19
References
- T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.
- P. Bartlett, M. Jordan, & J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.
- A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004.
- V. Koltchinskii & O. Beznosova. Exponential convergence rates in classification. In International Conference on Computational Learning Theory, 2005.
- J-Y. Audibert & A.B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007.
- L. Bottou & O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.
- L. Pillaud-Vivien, A. Rudi, & F. Bach. Exponential convergence of testing error for stochastic gradient methods. In International Conference on Computational Learning Theory, 2018.
20
Appendix
21
Link Function
Definition (Link function) $h_*: (0, 1) \to \mathbb{R}$,
$$h_*(\mu) = \operatorname*{arg\,min}_{h \in \mathbb{R}} \big\{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\big\}.$$
$h_*$ connects the conditional probability of the label to model outputs.
Example (logistic loss)
$$h_*(\mu) = \log\frac{\mu}{1 - \mu}, \qquad h_*^{-1}(a) = \frac{1}{1 + \exp(-a)}.$$
22
[Figure: the pointwise expected risk $\mu\,\phi(h) + (1 - \mu)\,\phi(-h)$ as a function of $h$, minimized at $h = h_*(\mu)$.]
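For the logistic loss the minimizer in this definition has a closed form; a short derivation (the first-order condition suffices by convexity in $h$):

```latex
\frac{d}{dh}\Big[\mu\,\log(1 + e^{-h}) + (1 - \mu)\,\log(1 + e^{h})\Big]
= \frac{1}{1 + e^{-h}} - \mu \;=\; 0
\quad\Longleftrightarrow\quad
h = \log\frac{\mu}{1 - \mu} = h_*(\mu).
```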
Proof Idea
Set $m(\delta) := \max\big\{|h_*(0.5 + \delta)|,\ |h_*(0.5 - \delta)|\big\}$.
Example (logistic loss): $m(\delta) = \log\dfrac{1 + 2\delta}{1 - 2\delta}$.
Through $h_*$, the noise condition is converted to: $|g_*(X)| \geq m(\delta)$.
Set $g_\lambda := \operatorname*{arg\,min}_{g \in \mathcal{H}_k} \mathcal{L}_\lambda(g)$.
When $\lambda$ is sufficiently small, $g_\lambda$ is close to $g_*$. Moreover,
Proposition
There exists $\lambda$ s.t. $\|g - g_\lambda\|_{\mathcal{H}_k} \leq \dfrac{m(\delta)}{2R} \;\Longrightarrow\; \mathcal{R}(g) = \mathcal{R}^*$.
23
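The proposition rests on the standard fact that RKHS-norm closeness controls pointwise closeness; a sketch of the mechanism (the exact constant in the proposition is as reconstructed above):

```latex
|g(x) - g_\lambda(x)|
= |\langle g - g_\lambda,\ k(x, \cdot)\rangle_{\mathcal{H}_k}|
\;\le\; \sqrt{k(x, x)}\ \|g - g_\lambda\|_{\mathcal{H}_k}
\;\le\; R\,\|g - g_\lambda\|_{\mathcal{H}_k},
```

so $\|g - g_\lambda\|_{\mathcal{H}_k} \le m(\delta)/(2R)$ forces $g$ to stay within $m(\delta)/2$ of $g_\lambda$ pointwise; once $\lambda$ is small enough that $g_\lambda$ keeps a margin of order $m(\delta)$ with the same sign as $g_*$, this gives $\operatorname{sgn}(g(X)) = \operatorname{sgn}(g_*(X))$ a.e., i.e. $\mathcal{R}(g) = \mathcal{R}^*$.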
24
Proof Idea
Analyze the convergence speed of SGD and the probability of the iterates entering a small ball in the RKHS.
[Figure: the small ball around $\rho(1|X)$ in the space of conditional probabilities that yields the Bayes rule is mapped by $h_*$ into a small ball around $g_*$ (and $g_\lambda$) in the RKHS of predictors; SGD must enter this ball.]
Recall $h_*(\mu) = \operatorname*{arg\,min}_{h \in \mathbb{R}} \big\{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\big\}$.
Proof Sketch
1. Let $Z_1, \ldots, Z_T \sim \rho$ be i.i.d. random variables and set
$D_t := \mathbb{E}[\bar{g}_{T+1} \mid Z_1, \ldots, Z_t] - \mathbb{E}[\bar{g}_{T+1} \mid Z_1, \ldots, Z_{t-1}]$,
so that $\bar{g}_{T+1} = \mathbb{E}[\bar{g}_{T+1}] + \sum_{t=1}^{T} D_t$.
2. Convergence of $\mathbb{E}[\bar{g}_{T+1}]$ can be analyzed through the $\lambda$-strong convexity of $\mathcal{L}_\lambda$:
$\|\mathbb{E}[\bar{g}_{T+1}] - g_\lambda\|_{\mathcal{H}_k}^{2} \leq \frac{2}{\lambda}\big(\mathcal{L}_\lambda(\mathbb{E}[\bar{g}_{T+1}]) - \mathcal{L}_\lambda(g_\lambda)\big)$.
3. Bound $\sum_{t=1}^{T} D_t$ by a martingale inequality: for $c_T$ such that $\sum_{t=1}^{T} \|D_t\|_{\mathcal{H}_k}^{2} \leq c_T^{2}$ almost surely,
$\mathbb{P}\Big[\Big\|\sum_{t=1}^{T} D_t\Big\|_{\mathcal{H}_k} \geq \epsilon\Big] \leq 2\exp\Big(-\frac{\epsilon^{2}}{2 c_T^{2}}\Big)$.
4. Bound $c_T$ by the stability of (A)SGD.
5. Combining steps 1-4, the probability that $\bar{g}_{T+1}$ yields the Bayes rule is obtained.
6. Finally, $\mathbb{E}[\mathcal{R}(\bar{g}_{T+1})] - \mathcal{R}^* \leq \mathbb{P}[\bar{g}_{T+1}\ \text{is not Bayes}]$.
25
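Step 6 is the elementary observation that the excess classification error vanishes on the event where $\bar{g}_{T+1}$ realizes the Bayes rule and is at most one otherwise:

```latex
\mathbb{E}[\mathcal{R}(\bar{g}_{T+1})] - \mathcal{R}^*
= \mathbb{E}\big[(\mathcal{R}(\bar{g}_{T+1}) - \mathcal{R}^*)\,\mathbf{1}\{\bar{g}_{T+1}\ \text{is not Bayes}\}\big]
\;\le\; \mathbb{P}\big[\bar{g}_{T+1}\ \text{is not Bayes}\big],
```

and the right-hand side decays exponentially in $T$ by steps 1-5, which yields the main theorems.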