This document provides an overview of asynchronous stochastic optimization methods and algorithms. It discusses asynchronous parallel stochastic gradient descent (SGD) and how it can minimize idle time. It also introduces asynchronous variance-reduced optimization methods like asynchronous SAGA that provide faster convergence than SGD. The document analyzes the convergence properties of asynchronous optimization methods and presents empirical results demonstrating the speedups achieved by asynchronous proximal SAGA (ProxASAGA) on large datasets.
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization (Fabian Pedregosa)
The document proposes a new parallel method called Proximal Asynchronous SAGA (ProxASAGA) for solving composite optimization problems. ProxASAGA extends SAGA to handle nonsmooth objectives using proximal operators, and runs asynchronously in parallel without locks. It is shown theoretically to converge at the same linear rate as the sequential algorithm, and in practice achieves speedups of 6-12x on a 20-core machine on large datasets, with greater speedups on sparser problems, as predicted by theory.
The document discusses building robust machine learning systems that can handle concept drift. It introduces the challenges of concept drift when the underlying data distribution changes over time. It proposes using Gaussian process classifiers with an adaptive training window approach. The approach monitors for concept drift and retrains the model if detected. It tests the approach on artificial data streams with different drift scenarios and finds the adaptive approach performs better than a static model at handling concept drift. Future work could explore other drift detection methods and ensembles of adaptive Gaussian process classifiers.
Learning to discover Monte Carlo algorithm on spin ice manifold (Kai-Wen Zhao)
A global-update Monte Carlo sampler can be discovered naturally by a trained machine using the policy gradient method on a topologically constrained environment.
Dictionary Learning for Massive Matrix Factorization (recsysfr)
The document presents a new algorithm called Subsampled Online Dictionary Learning (SODL) for solving very large matrix factorization problems with missing values efficiently. SODL adapts an existing online dictionary learning algorithm to handle missing values by only using the known ratings for each user, allowing it to process large datasets with billions of ratings in linear time with respect to the number of known ratings. Experiments on movie rating datasets show that SODL achieves similar prediction accuracy as the fastest existing solver but with a speed up of up to 6.8 times on the largest Netflix dataset tested.
Bayesian Nonparametrics: Models Based on the Dirichlet Process (Alessandro Panella)
This document summarizes an introduction to Bayesian nonparametric models presented by Alessandro Panella. It discusses Bayesian learning and De Finetti's theorem, which shows that any exchangeable sequence of random variables can be represented as conditionally independent given a random variable. Finite mixture models are introduced as a Bayesian approach to clustering. Dirichlet process mixture models provide a nonparametric generalization that allows for an unbounded number of clusters.
The document discusses recommender systems and sequential recommendation problems. It covers several key points:
1) Matrix factorization and collaborative filtering techniques are commonly used to build recommender systems, but they have limitations, such as cold-start problems and difficulty incorporating additional constraints.
2) Sequential recommendation problems can be framed as multi-armed bandit problems, where past recommendations influence future recommendations.
3) Various bandit algorithms like UCB, Thompson sampling, and LinUCB can be applied, but extending guarantees to models like matrix factorization is challenging. Offline evaluation on real-world datasets is important.
Topic of presentation: Variational autoencoders for speech processing
The main points of the presentation: Variational autoencoders (or VAEs) have become one of the most popular unsupervised learning techniques for modelling complex data distributions, such as images and audio. In this talk I'll begin with a general introduction to VAEs and then review a recent technique called VQ-VAE, which is capable of learning a rudimentary phoneme-level language model from raw audio without any supervision.
http://dataconf.com.ua/speaker-page/dmytro-bielievtsov.php
https://www.youtube.com/watch?v=euYSAL-aKMI&list=PL5_LBM8-5sLjbRFUtXaUpg84gtJtyc4Pu&t=0s&index=9
This document presents a dissertation on improving the baby step giant step algorithm for solving the elliptic curve discrete logarithm problem. It begins with an overview of cryptography, symmetric and asymmetric encryption, and elliptic curve cryptography. It then discusses the elliptic curve discrete logarithm problem and surveys the existing literature. The proposed approach improves the baby step giant step algorithm by using a smaller baby step set size. Experimental results on two examples show that the proposed approach has a faster runtime than the previous method. A complexity analysis is also presented.
This document summarizes results on analyzing stochastic gradient descent (SGD) algorithms for minimizing convex functions. It shows that a continuous-time version of SGD (SGD-c) can strongly approximate the discrete-time version (SGD-d) under certain conditions. It also establishes that SGD achieves the minimax optimal convergence rate of O(t^-1/2) for α=1/2 by using an "averaging from the past" procedure, closing the gap between previous lower and upper bound results.
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal... (Kohei Hayashi)
1) The document presents a new method called generalized factorized asymptotic Bayesian inference (gFAB) that extends previous work on factorized asymptotic Bayesian inference (FAB) to allow it to be applied to general latent variable models, not just binary latent variable models.
2) gFAB involves defining a new criterion called the generalized factorized information criterion (gFIC) that can be used for model selection. gFIC approximates the marginal likelihood and adds a penalty term involving the Hessian of the log joint distribution with respect to the model parameters.
3) gFAB can be optimized using an alternating updating procedure similar to expectation-maximization (EM) and provides an asymptotically accurate approximation to the marginal likelihood.
This document summarizes and analyzes first-order meta-learning algorithms. It discusses first-order MAML (FOMAML), which approximates the MAML objective using only first-order information; FOMAML is equivalent to applying the last gradient of the inner loop to the initial parameters. Reptile is also analyzed, which simply averages the parameter updates. In expectation, the gradients of MAML, FOMAML, and Reptile all depend on the average gradient and the average inner product between gradients. Experiments show similar performance for FOMAML and Reptile. The analysis suggests SGD may generalize well because it approximates MAML.
A Gentle Introduction to Bayesian Nonparametrics (Julyan Arbel)
The document provides an introduction to Bayesian nonparametrics and the Dirichlet process. It explains that Bayesian nonparametrics aims to fit models that can adapt their complexity based on the data, without strictly imposing a fixed structure. The Dirichlet process is described as a prior distribution on the space of all probability distributions, allowing the model to utilize an infinite number of parameters. Nonparametric mixture models using the Dirichlet process provide a flexible approach to density estimation and clustering.
This document discusses macrocanonical models for texture synthesis. It begins by introducing the goal of texture synthesis and providing a brief history. It then describes the parametric question of combining randomness and structure in images. Specifically, it discusses maximizing entropy under geometric constraints. The document goes on to discuss links to statistical physics, defining microcanonical and macrocanonical models. It focuses on studying the macrocanonical model, describing how to find optimal parameters through gradient descent and how to sample from the model using Langevin dynamics. The document provides examples of texture synthesis and compares results to other methods.
This document provides an overview of the key topics covered in Lecture 9 of an Artificial Intelligence course on fuzzy logic. The lecture introduces fuzzy sets and membership functions as a way to represent ambiguous or uncertain values. It covers fuzzy set operations, fuzzy numbers, fuzzy rules for reasoning, and fuzzy inference. An example is provided to illustrate how fuzzy logic can be applied to control the speed of a vehicle based on road curvature. The homework assignments involve problems working with the concepts introduced in the lecture.
A discussion on sampling graphs to approximate network classification functions (LARCA UPC)
The problem of network classification consists of assigning a finite set of labels to the nodes of a graph; the underlying assumption is that nodes with the same label tend to be connected via strong paths in the graph. This is similar to the assumptions made by graph-based semi-supervised learning algorithms, which build an artificial graph from vectorial data. Such semi-supervised algorithms are based on label propagation principles, and their accuracy relies heavily on the structure (presence of edges) of the graph.
In this talk I will discuss ideas on how to perform sampling in the network graph, thus sparsifying the structure in order to apply semi-supervised algorithms and compute the classification function on the network efficiently. I will show very preliminary experiments indicating that the sampling technique has an important effect on the final results, and discuss open theoretical and practical questions that are yet to be solved.
This document outlines Hadi Sinaee's seminar on Restricted Boltzmann Machines (RBMs) from scratch. The seminar covers:
1. Unsupervised learning and using Markov Random Fields (MRFs) to learn unknown data distributions.
2. Maximum likelihood estimation cannot be done analytically for MRFs, so numerical approximation is required.
3. Introducing latent variables in the form of hidden units allows modeling high-dimensional distributions like images.
4. Computing the log-likelihood gradient involves taking expectations that require summing over all possible latent variable assignments, so approximation is needed.
New Insights and Perspectives on the Natural Gradient Method (Yoonho Lee)
The document discusses the natural gradient method for optimizing neural networks. It explains that the natural gradient finds the direction of steepest descent in function space rather than parameter space. The natural gradient is invariant to reparameterization. For most neural networks, natural gradient descent is equivalent to a second-order optimization method called the generalized Gauss-Newton method. The natural gradient takes into account the geometry of the parameter space defined by the Fisher information matrix.
RuleML2015: Learning Characteristic Rules in Geographic Information Systems (RuleML)
We provide a general framework for learning characterization rules of a set of objects in Geographic Information Systems (GIS), relying on the definition of distance-quantified paths. Such expressions specify how to navigate between the different layers of the GIS, starting from the target set of objects to characterize. We have defined a generality relation between quantified paths and proved that it is monotone with respect to the notion of coverage, thus allowing us to develop an interactive and effective algorithm to explore the search space of possible rules. We describe GISMiner, an interactive system that we have developed based on our framework. Finally, we present our experimental results from a real GIS about mineral exploration.
The document presents algorithms for finding the largest induced q-colorable subgraph of a given graph G. It first describes a randomized algorithm that runs in time proportional to enumerating maximal independent sets and a polynomial in n and q. For perfect graphs, where maximum independent sets can be found efficiently, it gives a deterministic algorithm running in similar time. It also shows that the problem does not admit a polynomial kernel when parameterized by the solution size for split and perfect graphs under standard assumptions.
We consider the problem of model estimation in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are interested in estimating the latent state decoding function (the mapping from the observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretical lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP.
We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of contexts.
This document discusses Latent Dirichlet Allocation (LDA), a probabilistic topic modeling technique. It begins with an introduction to topic models and their use in understanding large collections of documents. It then describes LDA's generative process using Dirichlet distributions to represent document-topic and topic-term distributions. Approximate inference methods for LDA like Gibbs sampling are also summarized. The document concludes by outlining the implementation of an LDA model, including preprocessing of documents and collapsed Gibbs sampling.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they can be run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully implemented to solve a range of difficult problems in practice. However, the existing theories are mostly based on fairly restrictive assumptions on the delays, and cannot explain the convergence and speedup properties of such algorithms. In this talk we will give an overview of distributed optimization, and discuss some new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data will be used to demonstrate the practical implications of these theoretical results.
Metaheuristic Algorithms: A Critical Analysis (Xin-She Yang)
The document discusses metaheuristic algorithms and their application to optimization problems. It provides an overview of several nature-inspired algorithms including particle swarm optimization, firefly algorithm, harmony search, and cuckoo search. It describes how these algorithms were inspired by natural phenomena like swarming behavior, flashing fireflies, and bird breeding. The document also discusses applications of these algorithms to engineering design problems like pressure vessel design and gear box design optimization.
Reading review of "Inferring Multiple Graphical Structures" (tuxette)
This document summarizes and reviews methods for inferring gene co-expression networks from gene expression data, as presented in related articles including Chiquet et al. It describes various statistical approaches implemented in packages like GeneNet and glasso, including graphical Gaussian models using shrinkage and sparse linear regression. It compares the resulting network densities produced by different methods.
This document discusses applying deep learning techniques like variational autoencoders to cyber security and anomaly detection in network traffic. It notes that while deep learning has made progress in related areas, modeling categorical network flow data poses unique challenges. It proposes using variational inference with a Gumbel softmax relaxation to train a generative model on network flows in an unsupervised manner. The trained model could then be used for tasks like anomaly detection based on the model's predictions or a sample's reconstruction error.
This document proposes a method for linear regression on symbolic data where each observation is represented by a Gaussian distribution. It derives the likelihood function for such "Gaussian symbols" and shows that it can be maximized using gradient descent. Simulation results demonstrate that the maximum likelihood estimator performs better than a naive least squares regression on the mean of each symbol. The method extends classical linear regression to the symbolic data setting.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affect the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems that normally are seen as unrelated, have challenging connections since the priors proposed in the literature are specifically designed to have posterior modes in the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies with synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
This document provides an overview of automated theorem proving. It discusses:
1) The history and background of automated theorem proving, from Hobbes and Leibniz proposing algorithmic logic to modern computer-based approaches.
2) The theoretical limitations of automated reasoning due to results like Gödel's incompleteness theorems, but also practical applications like verifying mathematics and computer systems.
3) How automated reasoning involves expressing statements formally and then manipulating those expressions algorithmically, as anticipated by Leibniz centuries ago.
The document provides an introduction to deep learning, including the following key points:
- Deep learning uses neural networks inspired by the human brain to perform machine learning tasks. The basic unit is an artificial neuron that takes weighted inputs and applies an activation function.
- Popular deep learning libraries and frameworks include TensorFlow, Keras, PyTorch, and Caffe. Common activation functions are sigmoid, tanh, and ReLU.
- Neural networks are trained using forward and backpropagation. Forward propagation feeds inputs through the network while backpropagation calculates errors to update weights.
- Convolutional neural networks are effective for image and visual data tasks due to their use of convolutional and pooling layers. Recurrent neural networks can process sequential data due to their recurrent connections, which carry information across time steps.
Cuckoo Search Algorithm: An Introduction (Xin-She Yang)
This presentation explains the fundamental ideas of the standard Cuckoo Search (CS) algorithm. It also contains links to free Matlab code at the MathWorks File Exchange and animations of numerical simulations (video on YouTube). An example of multi-objective cuckoo search (MOCS) is also given, with a link to the Matlab code.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, recursion, stacks and common stack operations like push and pop. Examples are provided to illustrate factorial calculation using recursion and implementation of a stack.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like arrays, stacks and the factorial function to illustrate recursive and iterative implementations. Problem solving techniques like defining the problem, designing algorithms, analyzing and testing solutions are also covered.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm analysis including time and space complexity, and common algorithm design techniques like recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
This document discusses data structures and algorithms. It begins by defining data structures as the logical organization of data and primitive data types like integers that hold single pieces of data. It then discusses static versus dynamic data structures and abstract data types. The document outlines the main steps in problem solving as defining the problem, designing algorithms, analyzing algorithms, implementing, testing, and maintaining solutions. It provides examples of space and time complexity analysis and discusses analyzing recursive algorithms through repeated substitution and telescoping methods.
Learning for Optimization: EDAs, probabilistic modelling, or ... (butest)
Marcus Gallagher gave a talk on explicit modelling in metaheuristic optimization. He discussed estimation of distribution algorithms which use probabilistic models to represent promising regions of the search space. He provided examples of modelling approaches like PBIL, MIMIC, COMIT and BOA. Finally, he summarized that EDAs take an explicit modelling approach to optimization using existing statistical models and can solve challenging problems by visualizing the model.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
The document discusses portfolio methods for optimization problems with uncertainty. It introduces noisy optimization problems where the objective function includes random variables. It then discusses various optimization criteria and methods for noisy optimization problems, including resampling methods to reduce noise. The document also covers portfolio approaches that combine or select among multiple optimization solvers to handle uncertainty.
Similar to Asynchronous Stochastic Optimization, New Analysis and Algorithms (20)
Random Matrix Theory and Machine Learning - Part 4 (Fabian Pedregosa)
Deep learning models with millions or billions of parameters should overfit according to classical theory, but they do not. The emerging theory of double descent seeks to explain why larger neural networks can generalize well. Random matrix theory provides a tractable framework to model double descent through random feature models, where the number of random features controls model capacity. In the high-dimensional limit, the test error of random feature regression exhibits a double descent shape that can be computed analytically.
Random Matrix Theory and Machine Learning - Part 3 (Fabian Pedregosa)
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: 1. Motivation: Average-case versus worst-case in high dimensions 2. Algorithm halting times (runtimes) 3. Outlook
Random Matrix Theory and Machine Learning - Part 1 (Fabian Pedregosa)
This document provides an introduction to random matrix theory and its applications in machine learning. It discusses several classical random matrix ensembles like the Gaussian Orthogonal Ensemble (GOE) and Wishart ensemble. These ensembles are used to model phenomena in fields like number theory, physics, and machine learning. Specifically, the GOE is used to model Hamiltonians of heavy nuclei, while the Wishart ensemble relates to the Hessian of least squares problems. The tutorial will cover applications of random matrix theory to analyzing loss landscapes, numerical algorithms, and the generalization properties of machine learning models.
Average case acceleration through spectral density estimation (Fabian Pedregosa)
We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step size strategy.
This document discusses an adaptive step-size method for the Frank-Wolfe algorithm that eliminates the need for a manually selected step-size parameter. It presents the standard Frank-Wolfe algorithm and the Demyanov-Rubinov variant that uses a step-size based on sufficient decrease. It then proposes an adaptive Frank-Wolfe algorithm that replaces the global Lipschitz constant L with a local constant Lt, allowing for potentially larger step sizes. This adaptive approach is shown to maintain sufficient decrease and can be extended to other Frank-Wolfe variants like away-steps Frank-Wolfe.
Hyperparameter optimization with approximate gradient (Fabian Pedregosa)
This document discusses hyperparameter optimization using approximate gradients. It introduces the problem of optimizing hyperparameters along with model parameters. While model parameters can be estimated from data, hyperparameters require methods like cross-validation. The document proposes using approximate gradients to optimize hyperparameters more efficiently than costly methods like grid search. It derives the gradient of the objective with respect to hyperparameters and presents an algorithm called HOAG that approximates this gradient using inexact solutions. The document analyzes HOAG's convergence and provides experimental results comparing it to other hyperparameter optimization methods.
Lightning: large scale machine learning in python (Fabian Pedregosa)
Lightning is a Python library for large-scale machine learning that incorporates recent advances in optimization algorithms. It is compatible with scikit-learn and supports both dense and sparse data as well as structured sparsity penalties. Lightning scales to large datasets using stochastic optimization methods like SGD, SVRG, SDCA, and SAGA. It also efficiently handles large feature spaces using coordinate descent algorithms. The API is similar to scikit-learn but is based on optimization algorithms rather than machine learning models. Lightning is part of the scikit-learn-contrib project.
Profiling in Python: a concise summary of key profiling tools.
cProfile and line_profiler profile execution time and identify slow lines of code. memory_profiler profiles memory usage with line-by-line or time-based outputs. YEP extends profiling to compiled C/C++ extensions like Cython modules, which are not covered by the standard Python profilers.
2. Where I Come From
ML / Optimization / Software Guy
• Engineer (2010–2012): first contact with ML, developing the ML library scikit-learn.
• ML and Neuroscience (2012–2015): PhD applying ML to neuroscience.
• ML and Optimization (2015–): stochastic / parallel / constrained / hyperparameter optimization.
3. Outline
Goal: Review recent work in asynchronous parallel optimization for machine learning¹,².
1. Asynchronous parallel optimization, Asynchronous SGD.
2. Asynchronous variance-reduced optimization.
3. Analysis of asynchronous methods: what we can prove.
¹ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). "Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: to appear in Journal of Machine Learning Research.
² Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems 30 (NIPS).
8. 40 years of CPU trends
• The speed of CPUs has stagnated since 2005.
• At the same time, the number of cores increases exponentially.
Parallel algorithms are needed to take advantage of modern CPUs.
9. Parallel Optimization: Not a new topic
• Most of the principles and methods already appear in (Bertsekas and Tsitsiklis, 1989).
• For linear systems it can be traced back even earlier (Arrow and Hurwicz, 1958).
10. Asynchronous vs Synchronous methods
Synchronous methods
• Wait for the slowest worker.
• Limited speedup due to synchronization cost.
Asynchronous methods
• Workers receive work as needed.
• Minimize idle time.
• Challenging analysis.
(Figure: timelines of four workers. Synchronous: workers sit idle between synchronization points t0, t1, t2. Asynchronous: updates land at t0 through t8 with no idle time.)
11. Optimization for machine learning
Many problems in machine learning can be framed as
$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
Gradient descent (Cauchy, 1847). Descend along the steepest direction:
$x^+ = x - \gamma \nabla f(x)$
Stochastic gradient descent (SGD) (Robbins and Monro, 1951). Select a random index $i$ and descend along $-\nabla f_i(x)$:
$x^+ = x - \gamma \nabla f_i(x)$
(Figure source: Francis Bach)
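To make the two update rules concrete, here is a minimal sketch (not from the talk; the synthetic data and step-size choices are illustrative) running both on a least-squares objective:

```python
import numpy as np

# Synthetic least-squares instance: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, p = 1000, 20
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p)

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return A.T @ (A @ x - b) / n

def grad_i(x, i):
    """Gradient of a single f_i."""
    return (A[i] @ x - b[i]) * A[i]

gamma_gd = n / np.linalg.norm(A, 2) ** 2        # ~ 1/L for the averaged objective
gamma_sgd = 1.0 / np.sum(A ** 2, axis=1).max()  # ~ 1/max_i L_i, safe for SGD

x_gd, x_sgd = np.zeros(p), np.zeros(p)
for t in range(5000):
    x_gd = x_gd - gamma_gd * full_grad(x_gd)      # gradient descent: one full pass per step
    i = rng.integers(n)
    x_sgd = x_sgd - gamma_sgd * grad_i(x_sgd, i)  # SGD: one random f_i per step
```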
12. Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986)
Recent revival due to applications in machine learning (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild.
Problem: $\min_x f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
General Algorithm. All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i \in \{1, \ldots, n\}$ and compute $\nabla f_i(\hat{x})$.
3. Perform the SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.
Note that $x$ and $\hat{x}$ might be different.
14. Asynchronous SGD
• The write is performed with an old version of the coefficients.
• The update requires a lock on the vector of coefficients.
15. Hogwild! (Niu et al., 2011): Lock-free Async. SGD
Algorithm 1 Hogwild
loop
  $\hat{x}$ = inconsistent read of $x$
  Sample $i$ uniformly in $\{1, \ldots, n\}$
  Let $S_i$ be $f_i$'s support
  $[\delta x]_{S_i} := -\gamma \nabla f_i(\hat{x})$
  for $v$ in $S_i$ do
    $[x]_v \leftarrow [x]_v + [\delta x]_v$  // atomic
  end for
end loop
• All read/write operations to shared memory are inconsistent, i.e., no vector-level locks are taken while updating shared memory.
• Key assumption: sparse gradients ($|S_i| \ll$ dimension).
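The control flow above can be sketched in a few lines of Python; the snippet below (a didactic toy with made-up data and step size, not the paper's implementation) mimics the inconsistent reads and per-coordinate writes with threads. Note that CPython's GIL prevents any real parallel speedup here; actual Hogwild implementations run in C/C++ or OpenMP.

```python
import threading

import numpy as np
from scipy import sparse

# Toy sparse least-squares instance, so each update touches few coordinates.
rng = np.random.default_rng(0)
A = sparse.random(5000, 200, density=0.01, format="csr", random_state=0)
b = rng.standard_normal(5000)
x = np.zeros(200)   # shared iterate: no lock is ever taken on it
gamma = 0.01

def worker(n_steps, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(A.shape[0])
        row = A.getrow(i)              # support S_i = row.indices
        x_hat = x[row.indices]         # inconsistent read (no lock)
        delta = -gamma * (row.data @ x_hat - b[i]) * row.data
        x[row.indices] += delta        # per-coordinate writes, not atomic in
                                       # CPython -- precisely the Hogwild gamble

threads = [threading.Thread(target=worker, args=(2000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```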
16. Hogwild: when does it converge?
Sparse $f_i$: is this a reasonable assumption?
• If $f_i(x) = \varphi(a_i^T x)$ then $\nabla f_i(x) = a_i\, \varphi'(a_i^T x)$.
• Gradients are sparse whenever the data $a_i$ are sparse.
• This is the case for generalized linear models (least squares, logistic regression, linear SVMs, etc.).
In this class of models, Hogwild enjoys almost linear speedups.
[Figure 1: Speedup of Hogwild. Image source: (Niu et al., 2011)]
17. Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives.
20. Variance-reduced Stochastic Optimization
Problem: finite sum
$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$, where $n < \infty$.
The SAGA algorithm (Defazio, Bach, and Lacoste-Julien, 2014). Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = x - \gamma \underbrace{\big(\nabla f_i(x) - \alpha_i + \bar{\alpha}\big)}_{\text{gradient estimate}} \quad ; \quad \alpha_i^+ = \nabla f_i(x)$,
where $\bar{\alpha} = \frac{1}{n}\sum_{i=1}^{n} \alpha_i$ is the average of the memory terms.
This variance-reduction technique is known under different names, e.g., control variates in Monte Carlo methods.
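A minimal sequential sketch of this update (the function names and signatures are illustrative, not from the talk) shows the bookkeeping that keeps each iteration at $O(p)$ cost:

```python
import numpy as np

def saga(grad_i, n, p, gamma, n_steps, seed=0):
    """Minimal SAGA loop. alpha[i] stores the last gradient evaluated for f_i;
    alpha_bar tracks their average so each update costs O(p), not O(n*p)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(p)
    alpha = np.zeros((n, p))
    alpha_bar = np.zeros(p)
    for _ in range(n_steps):
        i = rng.integers(n)
        g = grad_i(x, i)
        x = x - gamma * (g - alpha[i] + alpha_bar)  # unbiased, variance-reduced estimate
        alpha_bar = alpha_bar + (g - alpha[i]) / n  # keep the running average exact
        alpha[i] = g
    return x
```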
21. The SAGA Algorithm
Theory: linear (i.e., exponential) convergence on strongly convex problems.
Practical algorithm: converges with a fixed step size $1/(3L)$.
[Figure: function suboptimality vs. time for SAGA, SGD with constant step size, and SGD with decreasing step size.]
Already used in scikit-learn.
24. Asynchronous SAGA
Motivation: can we design an asynchronous version of SAGA?
The SAGA update is inefficient (without tricks) for sparse gradients:
$x^+ = x - \gamma\big(\underbrace{\nabla f_i(x)}_{\text{sparse}} - \underbrace{\alpha_i}_{\text{sparse}} + \underbrace{\bar{\alpha}}_{\text{dense!}}\big)$
Need for a sparse variant of SAGA:
• Many large-scale datasets are sparse.
• Asynchronous algorithms work best when updates are sparse.
25. Sparse SAGA
We can get away with “sparsifying” the gradient estimate.
• Let $P_i$ be the projection onto $\mathrm{supp}(\nabla f_i)$.
• Let $D_i = P_i \big/ \big(\frac{1}{n}\sum_{i=1}^{n} P_i\big)$.
• Crucial property: $\mathbb{E}_i[D_i] = I$.
Sparse SAGA algorithm³. Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = x - \gamma\big(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha}\big) \quad ; \quad \alpha_i^+ = \nabla f_i(x)$
³ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
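A sketch of how the $D_i$ enter in practice, assuming a generalized linear model so that $\mathrm{supp}(\nabla f_i)$ is the nonzero pattern of the data row $a_i$ (the least-squares loss and helper names are illustrative):

```python
import numpy as np
from scipy import sparse

def support_scaling(A):
    """Diagonal of D = ((1/n) * sum_i P_i)^{-1}, where P_i projects onto the
    nonzeros of row a_i (= supp(grad f_i) for generalized linear models)."""
    n = A.shape[0]
    counts = np.asarray((A != 0).sum(axis=0)).ravel()  # rows touching coordinate j
    return n / np.maximum(counts, 1)                   # guard never-touched coordinates

def sparse_saga_step(x, alpha, alpha_bar, A, b, d, gamma, rng):
    """One Sparse SAGA update for f_i(x) = 0.5 * (a_i^T x - b_i)^2; every
    operation touches only the support S of grad f_i."""
    i = rng.integers(A.shape[0])
    row = A.getrow(i)
    S = row.indices
    g = (row.data @ x[S] - b[i]) * row.data                  # gradient restricted to S
    x[S] -= gamma * (g - alpha[i, S] + d[S] * alpha_bar[S])  # D_i applied to alpha_bar
    alpha_bar[S] += (g - alpha[i, S]) / A.shape[0]           # average stays exact
    alpha[i, S] = g
```

Since $\frac{1}{n}\sum_i P_i$ is diagonal with entries equal to the fraction of gradients touching each coordinate, dividing by it on the support gives $\mathbb{E}_i[D_i] = I$, which is what keeps the update unbiased.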
28. Sparse SAGA
• All operations are sparse; the cost per iteration is $O(\text{nonzeros in } \nabla f_i)$.
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
30. Proximal Sparse SAGA
Problem: composite finite sum
$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + g(x)$, where
• $g$ is potentially nonsmooth (think $\lambda \|\cdot\|_1$ or an indicator function), but we have access to $\mathrm{prox}_{\gamma g}(x) = \arg\min_z \big\{\gamma g(z) + \tfrac{1}{2}\|x - z\|^2\big\}$.
• For some $g$, the proximal operator is available in closed form. Examples: $\ell_1$ norm (soft thresholding), indicator function (projection).
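The two closed-form examples just mentioned, in code (a standard sketch, not tied to the paper's implementation):

```python
import numpy as np

def prox_l1(x, step):
    """prox of g = lambda * ||.||_1, evaluated with step = gamma * lambda:
    coordinate-wise soft thresholding (valid because g is separable)."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def prox_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^p: Euclidean projection."""
    return np.clip(x, lo, hi)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))  # [1.0, -0.0, 0.2]
```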
31. Sparse Proximal SAGA
We can extend Sparse SAGA to incorporate the proximal term.
• Assume $g$ is separable: $g(x) = \sum_{j=1}^{p} g_j(x_j)$.
• Let $\varphi_i = \sum_j (D_i)_{jj}\, g_j(x_j)$.
• Crucial properties: $\mathbb{E}_i[D_i] = I$ and $\mathbb{E}_i[\varphi_i] = g$.
Sparse Proximal SAGA algorithm⁴. Sample uniformly $i \in \{1, \ldots, n\}$ and compute $(x^+, \alpha^+)$ as
$x^+ = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha})\big) \quad ; \quad \alpha_i^+ = \nabla f_i(x)$
⁴ Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
33. Sparse Proximal SAGA
As SAGA, linear convergence under strong convexity.
Theorem. For step size $\gamma = \frac{1}{5L}$ and $f$ $L$-smooth and $\mu$-strongly convex ($\mu > 0$), at iteration $t$ we have
$\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{\mu}{L}\}\big)^t\, C_0$,
with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n} \|\alpha_i^0 - \nabla f_i(x^*)\|^2$.
Implications
• Same convergence rate as SAGA, with cheaper updates in the presence of sparsity.
• Adaptivity to strong convexity: no need to know the strong convexity parameter to obtain linear convergence.
34. Asynchronous Proximal SAGA
ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017)
1. Read the information in shared memory ($\hat{x}$, $\hat{\alpha}$, $\hat{\bar{\alpha}}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform the Sparse Proximal SAGA update on shared memory:
$x = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(\hat{x}) - \hat{\alpha}_i + D_i \hat{\bar{\alpha}})\big) \quad ; \quad \alpha_i = \nabla f_i(\hat{x})$
• As in Hogwild!, reads and writes are inconsistent.
• Same convergence rate as the sequential version under sparsity of the gradients (delays $\le \frac{1}{10\sqrt{\text{sparsity}}}$).
37. Empirical Results - Speedup
$\text{Speedup} = \dfrac{\text{Time to } 10^{-10} \text{ suboptimality on one core}}{\text{Time to the same suboptimality on } k \text{ cores}}$
[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12, and Criteo datasets, comparing Ideal, ProxASAGA, AsySPCD, and FISTA.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
41. Analysis
Active Research Topic
• Lock-free asynchronous SGD: Hogwild! (Niu et al., 2011)
• Stochastic approximation (Duchi, Chaturapruek, and Ré, 2015)
• Nonconvex losses (De Sa et al., 2015; Lian et al., 2015)
• Variance-reduced stochastic methods (Reddi et al., 2015)
Claim #1: there are fundamental flaws in these analyses.
43. Analysis
Analyzing an optimization algorithm requires proving progress from one iterate to the next. How do we define an iterate?
Asynchronous SGD. All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform the SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.
45. Naming Scheme and Unbiasedness Assumption
“After write” labeling (Niu et al., 2011). Each time a worker has finished writing to shared memory, increment the iteration counter.
⇔ $\hat{x}_t$ = the $(t+1)$-th successful update to shared memory.
Unbiasedness assumption. Asynchronous SGD-like algorithms crucially rely on the unbiasedness property
$\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$.
Issue: the naming scheme and the unbiasedness assumption are incompatible.
48. A Problematic Example
Problem: $\min_x \frac{1}{2}(f_1(x) + f_2(x))$ with 2 workers.
Suppose $f_1$ takes less time to compute than $f_2$. What is $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)]$?
The four equally likely assignments of samples to workers, and the gradient that gets written first in each case:

worker 1 draws   worker 2 draws   first gradient written
f_1              f_1              ∇f_1(x̂_0)
f_1              f_2              ∇f_1(x̂_0)
f_2              f_1              ∇f_1(x̂_0)
f_2              f_2              ∇f_2(x̂_0)

In all, $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)] = \frac{3}{4}\nabla f_1(\hat{x}_0) + \frac{1}{4}\nabla f_2(\hat{x}_0) \ne \nabla f(\hat{x}_0)$.
• This scheme does not satisfy the crucial unbiasedness condition.
• Can we fix it?
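A quick simulation (under the example's assumption that $f_1$ always finishes first and both workers draw independently and uniformly) confirms the 3/4 vs. 1/4 split:

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.integers(1, 3, size=(1_000_000, 2))  # each worker samples i in {1, 2}
# f_1 is cheaper, so the first gradient written is for f_1 unless *both*
# workers happened to draw i = 2.
i0 = np.where((draws == 2).all(axis=1), 2, 1)
print((i0 == 1).mean())  # ~0.75: "after write" labeling makes i_0 biased
```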
55. A New Labeling Scheme
“After read” labeling scheme⁵. Each time a worker has finished reading from shared memory, increment the iteration counter.
⇔ $\hat{x}_t$ = the $(t+1)$-th successful read from shared memory.
With this scheme there is no dependency between $i_t$ and the cost of computing $\nabla f_{i_t}$.
Full analysis of Hogwild, asynchronous SVRG, and asynchronous SAGA in⁵.
⁵ Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
57. Convergence results – preliminaries
Some notation:
• $\Delta = \max_{j \in \{1,\ldots,p\}} |\{i : j \in \mathrm{supp}(\nabla f_i)\}| / n$, the maximum fraction of gradients that touch any given coordinate. We always have $1/n \le \Delta \le 1$.
• $\tau$ = number of updates between the time the vector of coefficients is read from memory and the time the update is finished.
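For generalized linear models, $\Delta$ can be read directly off the nonzero pattern of the data matrix; a small sketch (illustrative data) follows:

```python
import numpy as np
from scipy import sparse

def delta(A):
    """Delta = max_j |{i : j in supp(grad f_i)}| / n, using the nonzero
    pattern of the data matrix A as the gradient supports."""
    counts = np.asarray((A != 0).sum(axis=0)).ravel()  # rows touching coordinate j
    return counts.max() / A.shape[0]

A = sparse.random(10_000, 500, density=1e-3, format="csr", random_state=0)
print(delta(A))  # small for uniformly sparse data (mean column density ~ 1e-3)
```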
58. A rigorous analysis of Hogwild (Niu et al., 2011)
• Inconsistent reads.
• Unlike (Niu et al., 2011), allows for inconsistent writes.
• Unlike (Niu et al., 2011; Mania et al., 2017), no global bound on the gradient.
Main result for Hogwild (hand-waving). Let $f$ be $\mu$-strongly convex and $L$-smooth, and assume (for simplicity) $\sqrt{\Delta} \le \mu/L$. Then Hogwild converges at the same rate as SGD with step size $\gamma = a/L$, with
$a \le \min\Big\{\dfrac{1}{5(1 + 2\tau\sqrt{\Delta})},\ \dfrac{L}{\mu\Delta}\Big\}$.
⇒ theoretical linear speedup.
59. Main result for ASAGA
Main result for ASAGA (hand-waving). Let $f$ be $\mu$-strongly convex and $L$-smooth, and assume (for simplicity) $\sqrt{\Delta} \le \mu/L$. Then ASAGA converges at the same rate as SAGA with step size $\gamma = a/L$, with
$a \le \dfrac{1}{32(1 + \tau\sqrt{\Delta})}$.
⇒ theoretical linear speedup, with a step size independent of $\mu$.
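To get a feel for the constants (illustrative numbers, not from the talk): with sparsity measure $\Delta = 10^{-4}$ and delay bound $\tau = 20$, we get $\tau\sqrt{\Delta} = 0.2$, so the condition reads $a \le \frac{1}{32(1 + 0.2)} \approx \frac{1}{38}$, i.e., a step size of roughly $\frac{1}{38L}$. This is within a constant factor of the sequential SAGA step size, which is why the rate, and hence the linear speedup, is preserved.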
60. Perspectives
• Better scalability ⇔ communication efficiency.
• Tighter analysis with better constants / step size independent of $\Delta$.
• Large gap between theory and practice.
• Interplay with generalization and momentum.
Thanks for your attention!
61. References
Arrow, Kenneth Joseph and Leonid Hurwicz (1958). Decentralization and computation in resource allocation. Stanford University, Department of Economics.
Bertsekas, Dimitri P. and John N. Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Athena Scientific.
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d'équations simultanées”. In: Comp. Rend. Sci. Paris.
De Sa, Christopher M et al. (2015). “Taming the wild: A unified analysis of Hogwild-style algorithms”. In: Advances in Neural Information Processing Systems.
Dean, Jeffrey et al. (2012). “Large scale distributed deep networks”. In: Advances in Neural Information Processing Systems.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems.
Duchi, John C, Sorathan Chaturapruek, and Christopher Ré (2015). “Asynchronous stochastic convex optimization”. In: arXiv preprint arXiv:1508.00882.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
Lian, Xiangru et al. (2015). “Asynchronous parallel stochastic gradient for nonconvex optimization”. In: Advances in Neural Information Processing Systems.
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In: SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its asynchronous variants”. In: Advances in Neural Information Processing Systems.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist.
Tsitsiklis, John, Dimitri Bertsekas, and Michael Athans (1986). “Distributed asynchronous deterministic and stochastic gradient optimization algorithms”. In: IEEE Transactions on Automatic Control.
64. Supervised Machine Learning
Data: $n$ observations $(a_i, b_i) \in \mathbb{R}^p \times \mathbb{R}$.
Prediction function: $h(a, x) \in \mathbb{R}$.
Motivating examples:
• Linear prediction: $h(a, x) = x^T a$.
• Neural networks: $h(a, x) = x_m^T\, \sigma\big(x_{m-1}\, \sigma(\cdots x_2^T\, \sigma(x_1^T a))\big)$.
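For concreteness, a small sketch of the two prediction functions (the choice of ReLU for $\sigma$ and the layer shapes are illustrative):

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = x^T a."""
    return x @ a

def h_mlp(a, weights):
    """Neural-network prediction h(a, x) = x_m^T s(x_{m-1} s(... s(x_1^T a))),
    with s = ReLU chosen here for illustration."""
    z = a
    for W in weights[:-1]:
        z = np.maximum(W @ z, 0.0)   # sigma applied elementwise
    return weights[-1] @ z

rng = np.random.default_rng(0)
a = rng.standard_normal(5)
weights = [rng.standard_normal((4, 5)),  # x_1
           rng.standard_normal((3, 4)),  # x_2
           rng.standard_normal(3)]       # x_m (output layer)
print(h_linear(a, rng.standard_normal(5)), h_mlp(a, weights))
```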
65. Sparse Proximal SAGA
For step size $\gamma = \frac{1}{5L}$, with $f$ having $L$-Lipschitz gradient and being $\mu$-strongly convex ($\mu > 0$), Sparse Proximal SAGA converges geometrically in expectation. At iteration $t$ we have
$\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{1}{\kappa}\}\big)^t\, C_0$,
with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n} \|\alpha_i^0 - \nabla f_i(x^*)\|^2$ and $\kappa = L/\mu$ (the condition number).
Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data” regime ($n \ge \kappa$): rate in $O(1/n)$.
• In the “ill-conditioned” regime ($n \le \kappa$): rate in $O(1/\kappa)$.