Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
The document presents two main results:
1) Stochastic Gradient Descent (SGD) achieves exponential (i.e., linear) convergence of the expected classification error under a strong low noise condition. The number of iterations needed for an epsilon-accurate solution is O(log(1/epsilon)).
2) Averaged SGD (ASGD) achieves the same exponential convergence under the same condition, likewise requiring O(log(1/epsilon)) iterations, and is markedly faster in the experiments.
The results improve upon prior work, which established such rates only for the squared loss, by proving exponential rather than sublinear convergence for more suitable loss functions such as the logistic loss. Toy experiments illustrate the theoretical findings.
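For reference, the equivalence between an exponential (linear) rate and an O(log(1/epsilon)) iteration count is the standard conversion below; C and c are generic constants, not quantities taken from the paper.

```latex
% If the expected excess classification error decays exponentially in the number of
% iterations n, then accuracy epsilon is reached after only logarithmically many iterations:
\mathbb{E}[\mathrm{err}(g_n)] - \mathrm{err}^{*} \;\le\; C e^{-c n} \;\le\; \epsilon
\quad \text{whenever} \quad
n \;\ge\; \frac{1}{c}\log\frac{C}{\epsilon} \;=\; O\!\left(\log\tfrac{1}{\epsilon}\right).
```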
Anti-differentiating approximation algorithms: A case study with min-cuts, sp... (David Gleich)
This talk covers anti-differentiating approximation algorithms, an approach to explaining the success of widely used heuristic procedures. Formally, this involves finding an optimization problem that is solved exactly by an approximation algorithm or heuristic.
Asynchronous parallel algorithms are developed to solve massive optimization problems on distributed data systems; they can run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully implemented to solve a range of difficult problems in practice. However, the existing theories are mostly based on fairly restrictive assumptions on the delays and cannot explain the convergence and speedup properties of such algorithms. In this talk we give an overview of distributed optimization and discuss new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Fast relaxation methods for the matrix exponential (David Gleich)
The matrix exponential is a matrix computing primitive used in link prediction and community detection. We describe a fast method to compute it using relaxation on a large linear system of equations. This enables us to compute a column of the matrix exponential in sublinear time, or under a second on a standard desktop computer.
Tensor Train (TT) decomposition [3] is a generalization of the SVD from matrices to tensors (multidimensional arrays).
It represents a tensor compactly in terms of factors and allows one to work with the tensor via its factors without materializing the tensor itself.
For example, we can compute the elementwise product of two TT-tensors of size 2^100 and get the result in the TT-format as well (a small sketch follows the references below).
In the talk, we will show how Tensor Train decomposition can be used to represent the parameters of neural networks [1] and polynomial models [2].
This parametrization allows exponentially many 'virtual' parameters while working only with the small factors of the TT-format.
To train the model, i.e. to optimize the objective subject to the constraint that the parameters are in the TT-format, [2] uses stochastic Riemannian optimization.
[1] Novikov, A., Podoprikhin, D., Osokin, A., & Vetrov, D. P. (2015). Tensorizing neural networks. In Advances in Neural Information Processing Systems.
[2] Novikov, A., Trofimov, M., & Oseledets, I. (2016). Tensor Train polynomial models via Riemannian optimization. arXiv:1605.03795.
[3] Oseledets, I. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing.
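As a concrete illustration of working with a tensor 'via its factors', here is a minimal NumPy sketch (my own illustration, not the authors' code or the TT-Toolbox/t3f libraries) of the TT format and of the element-wise product of two TT tensors: each core slice of the product is the Kronecker product of the corresponding input slices, so the full tensor is never materialized.

```python
import numpy as np

def tt_random(shape, rank):
    """Random TT cores: core k has shape (r_{k-1}, n_k, r_k), with boundary ranks equal to 1."""
    ranks = [1] + [rank] * (len(shape) - 1) + [1]
    return [np.random.randn(ranks[k], n, ranks[k + 1]) for k, n in enumerate(shape)]

def tt_entry(cores, index):
    """One tensor entry: the product of the selected core slices (a 1x1 matrix at the end)."""
    v = np.ones((1, 1))
    for core, i in zip(cores, index):
        v = v @ core[:, i, :]
    return v[0, 0]

def tt_hadamard(a_cores, b_cores):
    """Element-wise product in TT format: core slices are Kronecker products of the inputs'
    slices, so the TT ranks multiply but the full tensor is never formed."""
    out = []
    for a, b in zip(a_cores, b_cores):
        slices = [np.kron(a[:, i, :], b[:, i, :]) for i in range(a.shape[1])]
        out.append(np.stack(slices, axis=1))
    return out

if __name__ == "__main__":
    shape = (4, 3, 5, 2)            # small, so we can check one entry against a direct product
    A, B = tt_random(shape, 2), tt_random(shape, 3)
    C = tt_hadamard(A, B)
    idx = (1, 2, 4, 0)
    assert np.isclose(tt_entry(C, idx), tt_entry(A, idx) * tt_entry(B, idx))
```

The same factor-wise view is what allows the 'virtual' parameters of [1] and [2]: only the small cores are ever stored, although the ranks of derived tensors (such as this product) grow and are typically re-compressed (rounded) in practice.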
Learning a nonlinear embedding by preserving class neighbourhood structure (final) (WooSung Choi)
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics. 2007.
PageRank Centrality of dynamic graph structures (David Gleich)
A talk I gave at the SIAM Annual Meeting Mini-symposium on the mathematics of the power grid organized by Mahantesh Halappanavar. I discuss a few ideas on how our dynamic centrality could help analyze such situations.
In this talk we consider the question of how to use QMC with an empirical dataset, such as a set of points generated by MCMC. Using ideas from partitioning for parallel computing, we apply recursive bisection to reorder the points, and then interleave the bits of the QMC coordinates to select the appropriate point from the dataset. Numerical tests show that in the case of known distributions this is almost as effective as applying QMC directly to the original distribution. The same recursive bisection can also be used to thin the dataset, by recursively bisecting down to many small subsets of points, and then randomly selecting one point from each subset. This makes it possible to reduce the size of the dataset greatly without significantly increasing the overall error. Co-author: Fei Xie
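A rough sketch of the reorder-and-select idea described above, under simplifying assumptions of my own (median splits that cycle through the coordinates, and one bit read per level from the corresponding QMC coordinate); the authors' construction may differ in detail.

```python
import numpy as np

def bisect_buckets(points, depth, dim=0):
    """Recursively bisect the point set at the median of one coordinate, cycling through
    coordinates, and return the resulting 2**depth buckets in tree order."""
    if depth == 0:
        return [points]
    order = np.argsort(points[:, dim])
    half = len(points) // 2
    nxt = (dim + 1) % points.shape[1]
    return (bisect_buckets(points[order[:half]], depth - 1, nxt) +
            bisect_buckets(points[order[half:]], depth - 1, nxt))

def select_point(buckets, u, depth, rng):
    """Map a QMC point u in [0,1)^d to a stored point: level j of the bisection tree consumes
    the next bit of coordinate (j mod d), which interleaves the bits of the QMC coordinates."""
    d = len(u)
    bits_used = [0] * d
    idx = 0
    for j in range(depth):
        c = j % d
        bits_used[c] += 1
        bit = int(u[c] * 2 ** bits_used[c]) % 2     # the bits_used[c]-th binary digit of u[c]
        idx = 2 * idx + bit
    bucket = buckets[idx]
    return bucket[rng.integers(len(bucket))]        # thinning: one representative per bucket

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(4096, 2))               # stand-in for an MCMC-style empirical sample
    depth = 6
    buckets = bisect_buckets(data, depth)
    u = rng.random(2)                               # stand-in for one QMC point
    print(select_point(buckets, u, depth, rng))
```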
* Logistic regression, logistic loss (log loss)
* Stochastic optimization
* Adding new features, generalized linear model
* Kernel trick, intro to SVM
* Overfitting
* Decision trees for classification and regression
* Building trees greedily: Gini index, entropy
* Fighting overfitting in trees: pre-stopping and post-pruning
* Feature importances
Basic knowhow of several techniques commonly used in deep learning and neural networks -- activation functions, cost functions, optimizers, regularization, parameter initialization, normalization, data handling, hyperparameter selection. Presented as lecture material for the course EE599 Deep Learning in Spring 2019 at University of Southern California.
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon... (MLconf)
Anima Anandkumar has been a faculty member in the EECS Dept. at U.C. Irvine since August 2010. Her research interests are in the areas of large-scale machine learning and high-dimensional statistics. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a visiting faculty member at Microsoft Research New England in 2012 and a postdoctoral researcher in the Stochastic Systems Group at MIT from 2009 to 2010. She is the recipient of the Microsoft Faculty Fellowship, the ARO Young Investigator Award, the NSF CAREER Award, and the IBM Fran Allen PhD Fellowship.
Spacey random walks and higher-order Markov chains (David Gleich)
My talk at the SIAM NetSci workshop (2015) on our new spacey random walk and spacey random surfer models and how we derived them. There are many potential extensions and opportunities to use this for analyzing big data as tensors.
Paper Study: Melding the data decision pipeline (ChenYiHuang5)
Melding the data decision pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derive the equations myself and, applying the same derivation procedure, obtain the same results as the two cited CMU papers [Donti et al. 2017, Amos et al. 2017].
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, which are normally seen as unrelated, have challenging connections, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies on synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro... (Chiheb Ben Hammouda)
Conference talk at the SIAM Conference on Financial Mathematics and Engineering, held in virtual format, June 1-4 2021, about our recently published work "Hierarchical adaptive sparse grids and quasi-Monte Carlo for option pricing under the rough Bergomi model".
- Link of the paper: https://www.tandfonline.com/doi/abs/10.1080/14697688.2020.1744700
We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
- Proceedings: https://arxiv.org/abs/1804.03065
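As a hedged illustration of the sketching idea (not the paper's exact algorithm, scores, or guarantees; see the arXiv link above for those), the following combines a small Frequent Directions sketch with a rank-k projection-distance anomaly score.

```python
import numpy as np

class FrequentDirections:
    """Streaming sketch B (at most 2*ell rows, d columns) with A^T A approximately B^T B."""
    def __init__(self, d, ell):
        self.ell = ell
        self.B = np.zeros((2 * ell, d))
        self.next_row = 0

    def update(self, row):
        if self.next_row == self.B.shape[0]:
            self._shrink()
        self.B[self.next_row] = row
        self.next_row += 1

    def _shrink(self):
        # Shrink all singular values by the (ell+1)-th one; at most ell rows stay nonzero.
        U, s, Vt = np.linalg.svd(self.B, full_matrices=False)
        s_shrunk = np.sqrt(np.maximum(s ** 2 - s[self.ell] ** 2, 0.0))
        self.B = s_shrunk[:, None] * Vt
        self.next_row = self.ell

def projection_distance_scores(sketch, X, k):
    """Anomaly score: squared distance of each row of X to the top-k right-singular subspace
    estimated from the sketch (larger means more anomalous)."""
    _, _, Vt = np.linalg.svd(sketch.B, full_matrices=False)
    Vk = Vt[:k].T                                   # d x k approximate principal directions
    residual = X - (X @ Vk) @ Vk.T
    return np.sum(residual ** 2, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 200)) @ rng.normal(size=(200, 200)) * 0.1
    X[::500] += 5.0                                 # plant a few anomalous rows
    fd = FrequentDirections(d=200, ell=20)
    for x in X:
        fd.update(x)
    scores = projection_distance_scores(fd, X, k=5)
    print(np.argsort(scores)[-10:])                 # indices of the most anomalous rows
```

For brevity the scoring pass re-reads the data; a genuinely streaming variant would score each point on arrival against the current sketch.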
Many decision problems in business and social systems can be modeled using mathematical optimization, which seeks to maximize or minimize an objective that is a function of the decisions.
Stochastic optimization problems are mathematical programs in which some of the data incorporated into the objective or constraints are uncertain,
whereas deterministic optimization problems are formulated with known parameters.
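A toy illustration of the distinction (my own example, not from the talk): choosing a newsvendor order quantity by sample-average approximation of uncertain demand, versus the deterministic formulation that plugs in a known demand.

```python
import numpy as np

rng = np.random.default_rng(0)
price, cost = 5.0, 3.0
demand = rng.gamma(shape=9.0, scale=10.0, size=10_000)       # samples of the uncertain demand

def profit(order, d):
    return price * np.minimum(order, d) - cost * order

orders = np.linspace(0.0, 200.0, 401)
# Stochastic formulation: maximize the sample-average (expected) profit over the demand samples.
best_stochastic = orders[np.argmax([profit(q, demand).mean() for q in orders])]
# Deterministic formulation: pretend demand is known and equal to its mean.
best_deterministic = orders[np.argmax([profit(q, demand.mean()) for q in orders])]
print(best_stochastic, best_deterministic)
```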
Distributed solution of stochastic optimal control problem on GPUs (Pantelis Sopasakis)
Stochastic optimal control problems arise in many applications and are, in principle, large-scale, involving up to millions of decision variables. Their applicability in control applications is often limited by the availability of algorithms that can solve them efficiently and within the sampling time of the controlled system. In this paper we propose a dual accelerated proximal gradient algorithm which is amenable to parallelization and demonstrate that its GPU implementation affords high speed-up values (with respect to a CPU implementation) and greatly outperforms well-established commercial optimizers such as Gurobi.
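The talk's method is a dual accelerated proximal gradient algorithm with a GPU implementation; purely as a generic reference point, here is a minimal accelerated proximal gradient (FISTA-style) loop in NumPy for min f(x) + g(x), which is not the paper's algorithm or code.

```python
import numpy as np

def accelerated_proximal_gradient(grad_f, prox_g, x0, step, iters=500):
    """FISTA-style accelerated proximal gradient for min_x f(x) + g(x)."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = prox_g(y - step * grad_f(y), step)                 # forward (gradient) + backward (prox) step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)            # Nesterov extrapolation
        x_prev, t = x, t_next
    return x_prev

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(200, 50))
    b = rng.normal(size=200)
    L = np.linalg.norm(A, 2) ** 2                              # Lipschitz constant of grad f
    grad_f = lambda x: A.T @ (A @ x - b)                       # f(x) = 0.5 * ||Ax - b||^2
    prox_box = lambda v, s: np.clip(v, -1.0, 1.0)              # g = indicator of the box [-1, 1]^n
    x = accelerated_proximal_gradient(grad_f, prox_box, np.zeros(50), 1.0 / L)
    print(np.linalg.norm(A @ x - b))
```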
The new frontiers of AI in RPA with UiPath Autopilot™ (UiPathCommunity)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of automations.
📕 Together we will look at some examples of using Autopilot in different tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
1. Stochastic Gradient Descent with
Exponential Convergence Rates of
Expected Classification Errors
Atsushi Nitanda and Taiji Suzuki
AISTATS
April 18th, 2019
Naha, Okinawa
RIKEN AIP
2. Overview
• Topic
Convergence analysis of (averaged) SGD for binary classification
problems.
• Key assumption
Strongest version of low noise condition (margin condition) on the
conditional label probability.
• Result
Exponential convergence rates of expected classification errors
2
3. Background
• Stochastic Gradient Descent (SGD)
Simple and effective method for training machine learning models.
Significantly faster than vanilla gradient descent.
• Convergence Rates
Expected risk: sublinear convergence $O(1/n^{\alpha})$, $\alpha \in [1/2, 1]$.
Expected classification error: How fast does it converge?
SGD: $g_{t+1} \leftarrow g_t - \eta\, G_{\lambda}(g_t, Z_t)$ $(Z_t \sim \rho)$,
GD:  $g_{t+1} \leftarrow g_t - \eta\, \mathbb{E}_{Z \sim \rho}[G_{\lambda}(g_t, Z)]$
Cost per iteration: 1 (SGD) vs. #data examples (GD)
3
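To make the two update rules concrete, here is a minimal sketch (my own illustration, not the authors' code) for a linear model with the lambda-regularized logistic loss; the per-iteration cost difference on the slide is visible directly: one example for SGD versus a full pass for GD.

```python
import numpy as np

def grad_example(w, x, y, lam):
    """Gradient of the lambda-regularized logistic loss on one example, i.e. G_lambda(w, z)."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x))) + lam * w

def sgd_step(w, X, Y, eta, lam, rng):
    i = rng.integers(len(Y))                          # one random example: O(1) work per step
    return w - eta * grad_example(w, X[i], Y[i], lam)

def gd_step(w, X, Y, eta, lam):
    grads = np.stack([grad_example(w, x, y, lam) for x, y in zip(X, Y)])
    return w - eta * grads.mean(axis=0)               # empirical expectation: O(#data) work per step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    Y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000))
    w = np.zeros(5)
    for _ in range(2000):
        w = sgd_step(w, X, Y, eta=0.1, lam=1e-3, rng=rng)
    print(np.mean(np.sign(X @ w) == Y))
```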
4. Background
Common way to bound classification error.
• Classification error bound via consistency of loss functions:
[T. Zhang(2004), P. Bartlett+(2006)]
$\mathbb{P}(\mathrm{sgn}(g(X)) \neq Y) - \mathbb{P}(\mathrm{sgn}(2\rho(1|X) - 1) \neq Y) \lesssim (\mathcal{L}(g) - \mathcal{L}^{*})^{p}$,
$g$: predictor, $\mathcal{L}^{*}$: Bayes optimal value for $\mathcal{L}$,
$\rho(1|X)$: conditional probability of label $Y = 1$.
$p = 1/2$ for logistic, exponential, and squared losses.
• Sublinear convergence $O(1/n^{\alpha p})$ of excess classification error.
4
[Figure: excess classification error is bounded via the excess risk.]
5. Background
Faster convergence rates of excess classification error.
• Low noise condition on $\rho(Y = 1 \mid X)$
[A.B. Tsybakov (2004), P. Bartlett+ (2006)]
improves the consistency property,
resulting in faster rates: $O(1/n)$ (still sublinear convergence).
• Low noise condition (strongest version)
[V. Koltchinskii & O. Beznosova (2005), J-Y. Audibert & A.B. Tsybakov (2007)]
accelerates the rates for ERM to linear rates $O(\exp(-n))$.
5
6. Background
Faster convergence rates of excess classification error for SGD.
• Linear convergence rate
[L. Pillaud-Vivien, A. Rudi, & F. Bach (2018)]
has been shown for the squared loss function under the strong low
noise condition.
• This work
shows linear convergence for more suitable loss functions (e.g.,
logistic loss) under the strong low noise condition.
6
7. Outline
• Problem Settings and Assumptions
• (Averaged) Stochastic Gradient Descent
• Main Results: Linear Convergence Rates of SGD and ASGD
• Proof Idea
• Toy Experiment
7
8. Problem Setting
• Regularized expected risk minimization problems
$\min_{g \in \mathcal{H}_k} \ \mathcal{L}_{\lambda}(g) = \mathbb{E}_{(X,Y)}[\, l(g(X), Y) \,] + \frac{\lambda}{2}\|g\|_{\mathcal{H}_k}^{2}$,
$(\mathcal{H}_k, \langle \cdot, \cdot \rangle_k)$: reproducing kernel Hilbert space,
$l$: differentiable loss,
$(X, Y)$: random variables on the feature space and label set $\{-1, 1\}$,
$\lambda$: regularization parameter.
8
9. Loss Function
Example: $\exists\, \phi: \mathbb{R} \to \mathbb{R}_{\geq 0}$ convex s.t. $l(\zeta, y) = \phi(y\zeta)$,
$\phi(v) = \begin{cases} \log(1 + \exp(-v)) & \text{(logistic loss)}, \\ \exp(-v) & \text{(exponential loss)}, \\ (1 - v)^{2} & \text{(squared loss)}. \end{cases}$
9
10. Assumption
- $\sup_{x \in \mathcal{X}} k(x, x) \leq R^{2}$,
- $\exists M > 0$: $|\partial_{\zeta} l(\zeta, y)| \leq M$,
- $\exists L > 0$: $\forall g, h \in \mathcal{H}_k$, $\mathcal{L}(g + h) - \mathcal{L}(g) - \langle \nabla\mathcal{L}(g), h \rangle_k \leq \frac{L}{2}\|h\|_{\mathcal{H}_k}^{2}$,
- $\rho(Y = 1 \mid X) \in (0, 1)$ a.e.,
- $h^{*}$: increasing function on $(0, 1)$,
- $\mathrm{sgn}(\mu - 0.5) = \mathrm{sgn}(h^{*}(\mu))$,
- $g^{*} := \arg\min_{g:\ \text{measurable}} \mathcal{L}(g) \in \mathcal{H}_k$.
Remark: The logistic loss satisfies these assumptions.
The other loss functions also satisfy them after restricting the hypothesis space.
10
Link function:
$h^{*}(\mu) = \arg\min_{h \in \mathbb{R}} \{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\}$.
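A hedged sketch of how an SGD iterate can be maintained in the RKHS setting above: the predictor is a kernel expansion, the lambda-regularization becomes a shrinkage of the expansion coefficients, and the averaged iterate used by ASGD is tracked alongside. The kernel, step size, and sampling interface are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def kernel_asgd(sample, T, lam, eta, kernel=rbf):
    """SGD on L_lambda(g) = E[phi(Y g(X))] + (lam/2)||g||^2 with the logistic phi, keeping
    g_t = sum_s alpha_s k(x_s, .); returns predictors for the last and the averaged iterate."""
    xs, alphas, avg = [], [], []
    for t in range(T):
        x, y = sample()                                       # draw (X_t, Y_t) ~ rho
        g_x = sum(a * kernel(xp, x) for a, xp in zip(alphas, xs))
        m = np.clip(y * g_x, -30.0, 30.0)
        grad_out = -y / (1.0 + np.exp(m))                     # d/dg(x) of phi(y * g(x))
        alphas = [(1.0 - eta * lam) * a for a in alphas]      # shrinkage from the lam * g term
        xs.append(x)
        alphas.append(-eta * grad_out)                        # new expansion coefficient
        avg = [(t * b + a) / (t + 1) for b, a in zip(avg + [0.0], alphas)]   # running average
    def predictor(coef):
        return lambda z: np.sign(sum(a * kernel(xp, z) for a, xp in zip(coef, xs)))
    return predictor(alphas), predictor(avg)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    def sample():
        x = rng.uniform(-1.0, 1.0, size=2)
        y = 1.0 if rng.random() < (0.9 if x[0] > 0 else 0.1) else -1.0
        return x, y
    predict_sgd, predict_asgd = kernel_asgd(sample, T=500, lam=1e-2, eta=0.5)
    test = [sample() for _ in range(500)]
    print(np.mean([predict_asgd(x) == y for x, y in test]))
```

The expansion grows by one term per step, so this is only for illustration; practical variants truncate the expansion or use random features (one of the future-work items on the summary slide).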
17. Toy Experiment
• 2-dim toy dataset.
• $\delta \in \{0.1, 0.25, 0.4\}$.
• Linearly separable.
• Logistic loss.
• $\lambda$ was determined by validation.
Right figure:
generated samples for $\delta = 0.4$;
$x_1 = 1$ is the Bayes optimal boundary.
17
18. [Results figures] From top to bottom:
1. Risk value
2. Classification error
3. Excess classification error / excess risk value
Purple line: SGD; blue line: ASGD.
ASGD is much faster, especially when $\delta = 0.4$.
18
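The slides do not spell out the data-generating process, so the following is a hedged reconstruction of the toy setup: a 2-d distribution whose conditional label probability stays delta away from 1/2 (the strong low noise condition), Bayes boundary x1 = 1, and SGD versus averaged SGD on the regularized logistic objective with a linear model. Sample sizes and step sizes are illustrative.

```python
import numpy as np

def make_toy(n, delta, rng):
    """Assumed generator: P(Y=+1 | x) = 0.5 + delta if x1 > 1 and 0.5 - delta otherwise,
    so |rho(1|x) - 1/2| = delta everywhere and x1 = 1 is the Bayes boundary."""
    X = rng.uniform(low=[-1.0, -1.0], high=[3.0, 1.0], size=(n, 2))
    p_pos = np.where(X[:, 0] > 1.0, 0.5 + delta, 0.5 - delta)
    Y = np.where(rng.random(n) < p_pos, 1.0, -1.0)
    return X, Y

def classification_error(w, X, Y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.mean(np.sign(Xb @ w) != Y)

def run_sgd_asgd(delta, iters=20_000, lam=1e-3, seed=0):
    """SGD and its running average (ASGD) on the lambda-regularized logistic objective."""
    rng = np.random.default_rng(seed)
    X_test, Y_test = make_toy(5_000, delta, rng)
    w = np.zeros(3)                                  # 2 weights + bias so the boundary can sit at x1 = 1
    w_avg = np.zeros(3)
    for t in range(iters):
        x, y = make_toy(1, delta, rng)               # fresh sample each step (online setting)
        xb = np.append(x[0], 1.0)
        margin = np.clip(y[0] * xb @ w, -30.0, 30.0)
        grad = -y[0] * xb / (1.0 + np.exp(margin)) + lam * w
        w = w - (1.0 / (lam * (t + 1))) * grad       # classic 1/(lambda t) schedule, for illustration
        w_avg = (t * w_avg + w) / (t + 1)
    return classification_error(w, X_test, Y_test), classification_error(w_avg, X_test, Y_test)

if __name__ == "__main__":
    for delta in (0.1, 0.25, 0.4):
        e_sgd, e_asgd = run_sgd_asgd(delta)
        print(f"delta={delta}: SGD={e_sgd:.3f}  ASGD={e_asgd:.3f}  Bayes={0.5 - delta:.3f}")
```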
19. Summary
• We explained that convergence rates of expected classification
errors for (A)SGD are sublinear $O(1/n^{\alpha})$ in general.
• We showed that these rates can be accelerated to linear rates
$O(\exp(-n))$ under the strong low noise condition.
Future Work
• Faster convergence under additional assumptions.
• Variants of SGD (acceleration, variance reduction).
• Non-convex models such as deep neural networks.
• Random Fourier features (ongoing work with collaborators).
19
20. References
- T. Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. The Annals of Statistics, 2004.
- P. Bartlett, M. Jordan, & J. McAuliffe. Convexity, classification, and risk bounds. Journal of the
American Statistical Association, 2006.
- A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004.
- V. Koltchinskii & O. Beznosova. Exponential convergence rates in classification. In International
Conference on Computational Learning Theory, 2005.
- J-Y. Audibert & A.B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of statistics, 2007.
- L. Bottou & O. Bousquet. The Tradeoffs of Large Scale Learning, Advances in Neural Information
Processing Systems, 2008.
- L. Pillaud-Vivien, A. Rudi, & Francis Bach. Exponential convergence of testing error for stochastic
gradient methods. In International Conference on Computational Learning Theory, 2018.
20
22. Link Function
Definition (Link function): $h^{*}: (0, 1) \to \mathbb{R}$,
$h^{*}(\mu) = \arg\min_{h \in \mathbb{R}} \{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\}$.
$h^{*}$ connects the conditional probability of the label to model outputs.
Example (logistic loss):
$h^{*}(\mu) = \log\frac{\mu}{1 - \mu}$, $(h^{*})^{-1}(a) = \frac{1}{1 + \exp(-a)}$.
22
[Figure: the expected risk defined by conditional probability $\mu$ is minimized at $h^{*}(\mu)$.]
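For completeness, the logistic-loss instance of the link function can be checked by a one-line calculus argument (not shown on the slide):

```latex
% Minimizing  mu*phi(h) + (1-mu)*phi(-h)  for the logistic  phi(v) = log(1 + e^{-v}):
\frac{d}{dh}\Bigl[\mu \log(1 + e^{-h}) + (1 - \mu)\log(1 + e^{h})\Bigr]
  = -\frac{\mu\, e^{-h}}{1 + e^{-h}} + \frac{(1 - \mu)\, e^{h}}{1 + e^{h}}
  = \frac{1}{1 + e^{-h}} - \mu = 0
\;\Longleftrightarrow\; h = \log\frac{\mu}{1 - \mu} = h^{*}(\mu).
```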
23. Proof Idea
Set $m(\delta) := \max\{|h^{*}(0.5 + \delta)|, |h^{*}(0.5 - \delta)|\}$.
Example (logistic loss): $m(\delta) = \log\frac{1 + 2\delta}{1 - 2\delta}$.
Through $h^{*}$, the noise condition is converted to: $|g^{*}(X)| \geq m(\delta)$.
Set $g_{\lambda} := \arg\min_{g \in \mathcal{H}_k} \mathcal{L}_{\lambda}(g)$.
When $\lambda$ is sufficiently small, $g_{\lambda}$ is close to $g^{*}$. Moreover,
Proposition:
There exists $\lambda$ s.t. $\|g - g_{\lambda}\|_{\mathcal{H}_k} \leq \frac{m(\delta)}{2R}$ $\Rightarrow$ $\mathcal{R}(g) = \mathcal{R}^{*}$.
23
24. Proof Idea
Analyze the convergence speed and the probability of getting into the small ball in the RKHS.
[Diagram: in the space of conditional probabilities $\rho(1|X)$, a small ball provides the Bayes rule;
the link function $h^{*}$ maps it into a small ball around $g_{\lambda}$ (near $g^{*}$) in the RKHS of predictors,
and SGD must land in that ball.]
Recall $h^{*}(\mu) = \arg\min_{h \in \mathbb{R}} \{\mu\,\phi(h) + (1 - \mu)\,\phi(-h)\}$.
24
25. Proof Sketch
1. Let $Z_1, \dots, Z_n \sim \rho$ be i.i.d. random variables,
$D_t := \mathbb{E}[\bar{g}_{n+1} \mid Z_1, \dots, Z_t] - \mathbb{E}[\bar{g}_{n+1} \mid Z_1, \dots, Z_{t-1}]$,
$\bar{g}_{n+1} = \mathbb{E}[\bar{g}_{n+1}] + \sum_{t=1}^{n} D_t$.
2. Convergence of $\mathbb{E}[\bar{g}_{n+1}]$ can be analyzed by
$\|\mathbb{E}[\bar{g}_{n+1}] - g_{\lambda}\|_{\mathcal{H}_k}^{2} \lesssim \mathbb{E}[\mathcal{L}_{\lambda}(\bar{g}_{n+1})] - \mathcal{L}_{\lambda}(g_{\lambda})$.
3. Bound $\sum_{t=1}^{n} D_t$ by a martingale inequality: for $c_n$ s.t. $\sum_{t=1}^{n} \|D_t\|^{2} \leq c_n$,
$\mathbb{P}\left( \left\| \sum_{t=1}^{n} D_t \right\|_{\mathcal{H}_k} \geq \epsilon \right) \leq 2 \exp\left( -\frac{\epsilon^{2}}{2 c_n} \right)$.
4. Bound $c_n$ by the stability of (A)SGD.
5. Combining 1 and 2, the probability of obtaining the Bayes rule is obtained.
6. Finally, the expected excess classification error of $\bar{g}_{n+1}$ is bounded by $\mathbb{P}(\bar{g}_{n+1}\ \text{is not Bayes optimal})$.
25