Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization (Fabian Pedregosa)
The document proposes a new parallel method, Proximal Asynchronous SAGA (ProxASAGA), for solving composite optimization problems. ProxASAGA extends SAGA to handle nonsmooth objectives using proximal operators and runs asynchronously in parallel without locks. Theoretically, it converges at the same linear rate as the sequential algorithm; in practice it achieves speedups of 6-12x on a 20-core machine on large datasets, with greater speedups on sparser problems, as predicted by theory.
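ProxASAGA handles the nonsmooth term through its proximal operator; for an ℓ1 regularizer this reduces to elementwise soft-thresholding. A minimal sketch of that operator (illustrative, not the paper's implementation):

```python
import numpy as np

def prox_l1(x, step):
    """Proximal operator of step * ||.||_1: elementwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

# Entries whose magnitude is below the threshold are zeroed out;
# larger entries are shrunk toward zero by the threshold.
print(prox_l1(np.array([3.0, -0.5, 1.2]), 1.0))
```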
Hyperparameter optimization with approximate gradient (Fabian Pedregosa)
This document discusses hyperparameter optimization using approximate gradients. It introduces the problem of optimizing hyperparameters along with model parameters. While model parameters can be estimated from data, hyperparameters require methods like cross-validation. The document proposes using approximate gradients to optimize hyperparameters more efficiently than costly methods like grid search. It derives the gradient of the objective with respect to hyperparameters and presents an algorithm called HOAG that approximates this gradient using inexact solutions. The document analyzes HOAG's convergence and provides experimental results comparing it to other hyperparameter optimization methods.
Safe and Efficient Off-Policy Reinforcement Learning (mooopan)
This document summarizes the Retrace(λ) reinforcement learning algorithm presented by Remi Munos, Thomas Stepleton, Anna Harutyunyan and Marc G. Bellemare. Retrace(λ) is an off-policy multi-step reinforcement learning algorithm that is safe (converges for any policy), efficient (makes best use of samples when policies are close), and has lower variance than importance sampling. Empirical results on Atari 2600 games show Retrace(λ) outperforms one-step Q-learning and existing multi-step methods.
Improving Variational Inference with Inverse Autoregressive Flow (Tatsuya Shirakawa)
These slides were created for a NIPS 2016 study meetup.
IAF and related research are briefly explained.
paper:
Diederik P. Kingma et al., "Improving Variational Inference with Inverse Autoregressive Flow", 2016
https://papers.nips.cc/paper/6581-improving-variational-autoencoders-with-inverse-autoregressive-flow
The document discusses building robust machine learning systems that can handle concept drift. It introduces the challenges of concept drift when the underlying data distribution changes over time. It proposes using Gaussian process classifiers with an adaptive training window approach. The approach monitors for concept drift and retrains the model if detected. It tests the approach on artificial data streams with different drift scenarios and finds the adaptive approach performs better than a static model at handling concept drift. Future work could explore other drift detection methods and ensembles of adaptive Gaussian process classifiers.
This document discusses speaker diarization, which is the process of segmenting an audio stream into homogeneous segments according to speaker identity. It covers feature extraction methods like MFCCs, segmentation using Bayesian Information Criteria to compare Gaussian mixture models, and clustering algorithms like k-means and hierarchical agglomerative clustering. Dendrogram visualizations are used to identify natural speaker clusters. The overall goal is to partition audio recordings of discussions or debates into homogeneous segments to attribute speech segments to individual speakers.
Introduction of the "TrailBlazer" algorithm (Katsuki Ohto)
Slides introducing the paper "Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning", presented at the NIPS 2016 paper-reading meetup @ PFN (2017/1/19): https://connpass.com/event/47580/
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME (HONGJOO LEE)
A 45-minute talk about collecting home network performance measures, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we go through the whole process of data mining and knowledge discovery. First we write a script that runs a speed test periodically and logs the metric. Then we parse the log data, convert it into a time series, and visualize the data for a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
Profiling in Python summarizes key profiling tools: cProfile and line_profiler profile execution time and identify slow lines of code; memory_profiler profiles memory usage with line-by-line or time-based output; and YEP extends profiling to compiled C/C++ extensions such as Cython modules, which the standard Python profilers do not cover.
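A quick illustration of the first of these, using only the standard library (cProfile and pstats; line_profiler and memory_profiler are third-party packages):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately wasteful: materializes a full list before summing."""
    return sum([i * i for i in range(n)])

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Report the five most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```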
"Scalable Link Discovery for Modern Data-Driven Applications", as presented at the 15th International Semantic Web Conference (ISWC), Doctoral Consortium, October 18th, 2016, Kobe, Japan.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems and can run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully applied to a range of difficult problems in practice. However, existing theories are mostly based on fairly restrictive assumptions on the delays and cannot explain the convergence and speedup properties of such algorithms. In this talk we give an overview of distributed optimization and discuss new theoretical results on the convergence of asynchronous parallel stochastic gradient algorithms with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
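As a toy illustration of one of the classifiers mentioned, a k-nearest-neighbors sketch on made-up 2-D points (the data and function name here are illustrative only):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # near the "a" cluster
print(knn_predict(train, (5.5, 5.5)))  # near the "b" cluster
```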
The document summarizes computational aspects of vehicle routing problems. It discusses time complexity and space complexity, and how they are measured as functions of problem size. It provides examples of calculating complexity for different algorithms. It also discusses common data structures for representing routes, including array lists, doubly linked lists, and their pros and cons for different operations. The document outlines Java code examples for comparing route representation using these data structures.
Accelerating Random Forests in Scikit-Learn (Gilles Louppe)
Random Forests are without contest one of the most robust, accurate and versatile tools for solving machine learning tasks. However, implementing this algorithm properly and efficiently remains a challenging task, involving issues that are easily overlooked if not considered with care. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:
- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.
Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
The variational Gaussian process (VGP) is a Bayesian nonparametric model which adapts its shape to match complex posterior distributions. The VGP generates approximate posterior samples by generating latent inputs and warping them through random non-linear mappings; the distribution over random mappings is learned during inference, enabling the transformed outputs to adapt to varying complexity.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
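The Hoeffding bound it refers to can be computed directly: with probability 1 − δ, the true mean of a variable with range R lies within ε = √(R² ln(1/δ) / 2n) of the mean of n observations. A minimal sketch:

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Hoeffding bound: with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n samples whose values
    span `value_range`."""
    return math.sqrt(value_range**2 * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding tree splits on the current best attribute once the observed
# gap to the second-best attribute's split score exceeds epsilon.
print(hoeffding_epsilon(value_range=1.0, n=1000, delta=1e-6))
```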
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
Asynchronous Stochastic Optimization, New Analysis and Algorithms (Fabian Pedregosa)
This document provides an overview of asynchronous stochastic optimization methods and algorithms. It discusses asynchronous parallel stochastic gradient descent (SGD) and how it can minimize idle time. It also introduces asynchronous variance-reduced optimization methods like asynchronous SAGA that provide faster convergence than SGD. The document analyzes the convergence properties of asynchronous optimization methods and presents empirical results demonstrating the speedups achieved by asynchronous proximal SAGA (ProxASAGA) on large datasets.
Overview of the course. Introduction to image sciences, image processing and computer vision. Basics of machine learning, terminologies, paradigms. No-free lunch theorem. Supervised versus unsupervised learning. Clustering and K-Means. Classification and regression. Linear least squares and polynomial curve fitting. Model complexity and overfitting. Curse of dimensionality. Dimensionality reduction and principal component analysis. Image representation, semantic gap, image features, and classical computer vision pipelines.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, recursion, stacks and common stack operations like push and pop. Examples are provided to illustrate factorial calculation using recursion and implementation of a stack.
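The two examples it mentions, a stack and a recursive factorial, fit in a few lines of Python (illustrative, not the document's own code):

```python
def factorial(n):
    """Recursive definition: n! = n * (n - 1)!, with base case 0! = 1."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

# A Python list gives stack semantics directly: append is push, pop is pop.
stack = []
stack.append(3)    # push 3
stack.append(5)    # push 5
top = stack.pop()  # pop returns 5 (last in, first out)

print(factorial(5), top)
```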
Chap 8. Optimization for training deep models (Young-Geun Choi)
Slides for an internal lab seminar. A summary/excerpt of Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press, introducing the optimization methods commonly used to optimize the objective function when training deep neural networks.
Metaheuristic Algorithms: A Critical Analysis (Xin-She Yang)
The document discusses metaheuristic algorithms and their application to optimization problems. It provides an overview of several nature-inspired algorithms including particle swarm optimization, firefly algorithm, harmony search, and cuckoo search. It describes how these algorithms were inspired by natural phenomena like swarming behavior, flashing fireflies, and bird breeding. The document also discusses applications of these algorithms to engineering design problems like pressure vessel design and gear box design optimization.
Nature-Inspired Optimization Algorithms (Xin-She Yang)
This document discusses nature-inspired optimization algorithms. It begins with an overview of the essence of optimization algorithms and their goal of moving to better solutions. It then discusses some issues with traditional algorithms and how nature-inspired algorithms aim to address these. Several nature-inspired algorithms are described in detail, including particle swarm optimization, firefly algorithm, cuckoo search, and bat algorithm. These are inspired by behaviors in swarms, fireflies, cuckoos, and bats respectively. Examples of applications to engineering design problems are also provided.
The document proposes and evaluates two techniques for attention in multi-source sequence-to-sequence learning: flat attention combination and hierarchical attention combination. Both techniques achieved results comparable to existing context vector concatenation approaches on tasks of multimodal translation and automatic post-editing. Hierarchical attention combination performed best on multimodal translation and allows inspecting individual input attentions. The techniques provide a way to model the importance of each input sequence.
This document discusses using particle swarm optimization based on variable neighborhood search (PSO-VNS) to attack classical cryptography ciphers. PSO is a population-based optimization algorithm inspired by bird flocking behavior. VNS is a metaheuristic algorithm that explores neighborhoods of solutions to escape local optima. The paper proposes improving PSO with VNS to find better solutions. It evaluates PSO-VNS on substitution and transposition ciphers, finding it recovers keys better than standard PSO and other variants.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Slides by Alexander März:

The language of statistics is probabilistic in nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to provide an incomplete and potentially misleading picture. While this is an irrevocable consensus in statistics, machine learning approaches usually lack proper ways of quantifying uncertainty. In fact, a possible distinction between the two modelling cultures can be attributed to the (non-)existence of uncertainty estimates that allow for, e.g., hypothesis testing or the construction of estimation/prediction intervals. However, quantification of uncertainty in general, and probabilistic forecasting in particular, doesn't just provide an average point forecast; rather, it equips the user with a range of outcomes and the probability of each of those occurring.

In an effort to bring both disciplines closer together, the audience is introduced to a new framework for XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows one to gain additional insight into the data-generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. As such, XGBoostLSS contributes to the growing literature on statistical machine learning that aims at weakening the separation between Breiman's "Data Modelling Culture" and "Algorithmic Modelling Culture", so that models designed mainly for prediction can also be used to describe and explain the underlying data-generating process of the response of interest.
How to easily find the optimal solution without exhaustive search using Genetic Algorithms (Viach Kakovskyi)
Genetic algorithms are a type of stochastic optimization technique inspired by biological evolution. They can be used to find optimal or suboptimal solutions to problems that are difficult to solve directly. The document discusses using genetic algorithms to solve optimization problems in software projects, including minimizing nutrition plan differences and maximizing the intersection of traffic data. It provides an example of using genetic algorithms to solve a Diophantine equation and notes that genetic algorithms are easy to start with but computationally expensive, and may only find local optima or suboptimal solutions.
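A toy version of the Diophantine example: a tiny genetic algorithm searching for non-negative integer solutions of a + 2b + 3c + 4d = 30 (the equation, parameters, and operators here are illustrative, not the document's code):

```python
import random

def fitness(genes):
    """Distance from satisfying a + 2b + 3c + 4d = 30; zero means solved."""
    a, b, c, d = genes
    return abs(a + 2 * b + 3 * c + 4 * d - 30)

def evolve(pop_size=100, generations=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 30) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)        # selection: best individuals first
        if fitness(population[0]) == 0:
            break
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randint(1, 3)
            child = mom[:cut] + dad[cut:]   # one-point crossover
            if rng.random() < 0.3:          # mutation: reset one gene
                child[rng.randrange(4)] = rng.randint(0, 30)
            children.append(child)
        population = parents + children
    return min(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Keeping the best half each generation makes the search elitist, so the best fitness never worsens; mutation keeps it from stalling in a local optimum.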
Parallel Computing 2007: Bring your own parallel application (Geoffrey Fox)
This document discusses parallelizing several algorithms and applications including k-means clustering, frequent itemset mining, integer programming, computer chess, and support vector machines (SVM). For k-means and frequent itemset mining, the algorithms can be parallelized by partitioning the data across processors and performing partial computations locally before combining results with an allreduce operation. Computer chess can be parallelized by exploring different game tree branches simultaneously on different processors. SVM problems involve large dense matrices that are difficult to solve in parallel directly due to their size exceeding memory; alternative approaches include solving smaller subproblems independently.
Differential game theory for Traffic Flow Modelling (Serge Hoogendoorn)
Lecture given at the INdAM symposium in Rome, 2017. The lecture shows how differential games can be used to model traffic flows, focusing on pedestrian simulation.
Paper Study: Melding the data decision pipeline (ChenYiHuang5)
Melding the data decision pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derived the math equations myself and matched the results of the two CMU papers mentioned [Donti et al. 2017, Amos et al. 2017] by applying the same derivation procedure.
This document provides an introduction to deep learning. It discusses sources for further reading on deep learning. It then summarizes key topics that will be covered in Unit 1, including the relationship between AI, machine learning, and deep learning. It provides definitions and examples of neural networks, gradient descent optimization, and stochastic gradient descent. It also discusses neurons, the history of the perceptron, and the basic components of artificial neural networks.
Similar to Parallel Optimization in Machine Learning (20)
Random Matrix Theory and Machine Learning - Part 4Fabian Pedregosa
Deep learning models with millions or billions of parameters should overfit according to classical theory, but they do not. The emerging theory of double descent seeks to explain why larger neural networks can generalize well. Random matrix theory provides a tractable framework to model double descent through random feature models, where the number of random features controls model capacity. In the high-dimensional limit, the test error of random feature regression exhibits a double descent shape that can be computed analytically.
Random Matrix Theory and Machine Learning - Part 3Fabian Pedregosa
ICML 2021 tutorial on random matrix theory and machine learning.
Part 3 covers: 1. Motivation: Average-case versus worst-case in high dimensions 2. Algorithm halting times (runtimes) 3. Outlook
Random Matrix Theory and Machine Learning - Part 1Fabian Pedregosa
This document provides an introduction to random matrix theory and its applications in machine learning. It discusses several classical random matrix ensembles like the Gaussian Orthogonal Ensemble (GOE) and Wishart ensemble. These ensembles are used to model phenomena in fields like number theory, physics, and machine learning. Specifically, the GOE is used to model Hamiltonians of heavy nuclei, while the Wishart ensemble relates to the Hessian of least squares problems. The tutorial will cover applications of random matrix theory to analyzing loss landscapes, numerical algorithms, and the generalization properties of machine learning models.
Average case acceleration through spectral density estimationFabian Pedregosa
We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step size strategy.
This document discusses an adaptive step-size method for the Frank-Wolfe algorithm that eliminates the need for a manually selected step-size parameter. It presents the standard Frank-Wolfe algorithm and the Demyanov-Rubinov variant that uses a step-size based on sufficient decrease. It then proposes an adaptive Frank-Wolfe algorithm that replaces the global Lipschitz constant L with a local constant Lt, allowing for potentially larger step sizes. This adaptive approach is shown to maintain sufficient decrease and can be extended to other Frank-Wolfe variants like away-steps Frank-Wolfe.
Lightning: large scale machine learning in pythonFabian Pedregosa
Lightning is a Python library for large-scale machine learning that incorporates recent advances in optimization algorithms. It is compatible with scikit-learn and supports both dense and sparse data as well as structured sparsity penalties. Lightning scales to large datasets using stochastic optimization methods like SGD, SVRG, SDCA, and SAGA. It also efficiently handles large feature spaces using coordinate descent algorithms. The API is similar to scikit-learn but is based on optimization algorithms rather than machine learning models. Lightning is part of the scikit-learn-contrib project.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...AbdullaAlAsif1
The pygmy halfbeak Dermogenys colletei, is known for its viviparous nature, this presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the Pygmy Halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the Pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study lends to a better understanding of viviparous fish in Borneo and contributes to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Current Ms word generated power point presentation covers major details about the micronuclei test. It's significance and assays to conduct it. It is used to detect the micronuclei formation inside the cells of nearly every multicellular organism. It's formation takes place during chromosomal sepration at metaphase.
Phenomics assisted breeding in crop improvementIshaGoswami9
As the population is increasing and will reach about 9 billion upto 2050. Also due to climate change, it is difficult to meet the food requirement of such a large population. Facing the challenges presented by resource shortages, climate
change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional
genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding
the complex characteristics of multiple gene, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can
be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus,
high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes
during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology,
and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Leonel Morgado
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
4. Motivation
Computer ad in 1993 vs. computer ad in 2006.
What has changed? The 2006 ad no longer mentions processor speed.
Its primary selling point: the number of cores.
6. 40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• Multi-core architectures are here to stay.
Parallel algorithms are needed to take advantage of modern CPUs.
10. Parallel optimization
Parallel algorithms can be divided into two large categories: synchronous and asynchronous. Image credits: (Peng et al. 2016)
Synchronous methods
• Easy to implement (mature software packages exist).
• Well understood.
• Limited speedup due to synchronization costs.
Asynchronous methods
• Faster, with typically larger speedups.
• Less well understood; large gap between theory and practice.
• No mature software solutions.
11. Outline
Synchronous methods
• Synchronous (stochastic) gradient descent.
Asynchronous methods
• Asynchronous stochastic gradient descent (Hogwild) (Niu et al. 2011).
• Asynchronous variance-reduced stochastic methods (Leblond, P., and Lacoste-Julien 2017), (Pedregosa, Leblond, and Lacoste-Julien 2017).
• Analysis of asynchronous methods.
• Code and implementation aspects.
Leaving out many parallel synchronous methods: ADMM (Glowinski and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro, and Zhang 2014), to name a few.
12. Outline
Most of the following is joint work with Rémi Leblond and Simon Lacoste-Julien.
14. Optimization for machine learning
A large part of the problems in machine learning can be framed as optimization problems of the form

    minimize_x f(x) := (1/n) ∑_{i=1}^n f_i(x)

Gradient descent (Cauchy 1847). Descend along the steepest direction (−∇f(x)):

    x⁺ = x − γ∇f(x)

Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descend along −∇f_i(x):

    x⁺ = x − γ∇f_i(x)

Image source: Francis Bach
15. Parallel synchronous gradient descent
Computation of the gradient is distributed among k workers.
• Workers can be different computers, CPUs or GPUs.
• Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop.
16. Parallel synchronous gradient descent
1. Choose n_1, …, n_k that sum to n.
2. Distribute the computation of ∇f(x) among k nodes:

    ∇f(x) = (1/n) ∑_{i=1}^n ∇f_i(x)
          = (1/k) [ (1/n_1) ∑_{i=1}^{n_1} ∇f_i(x) + … + (1/n_k) ∑_{i=n_{k−1}+1}^{n} ∇f_i(x) ]
                     done by worker 1                    done by worker k

3. A master node performs the gradient descent update:

    x⁺ = x − γ∇f(x)

Trivial parallelization; same analysis as gradient descent.
Requires a synchronization step at every iteration (step 3).
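The partition-and-average scheme above can be sketched in a few lines. This is a minimal sketch for a least-squares objective using a thread pool to play the role of the k workers; the function name and chunking strategy are illustrative, not taken from any of the frameworks mentioned.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_gradient(A, b, x, k=4):
    """Gradient of f(x) = 1/(2n) ||Ax - b||^2, with the sum over samples
    partitioned into k chunks, one chunk per worker."""
    n = A.shape[0]
    chunks = np.array_split(np.arange(n), k)

    def partial_grad(idx):
        # each worker computes the (unnormalized) gradient over its chunk
        return A[idx].T @ (A[idx] @ x - b[idx])

    with ThreadPoolExecutor(max_workers=k) as pool:
        parts = pool.map(partial_grad, chunks)
    # "master node" step: combine the partial results
    return sum(parts) / n
```

The combination step is exact, so the result matches the sequential gradient; only the wall-clock cost of the partial sums is shared among workers.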
18. Parallel synchronous SGD
Can also be extended to stochastic gradient descent.
1. Select k samples i_1, …, i_k uniformly at random.
2. Compute ∇f_{i_t} in parallel on worker t.
3. Perform the (mini-batch) stochastic gradient descent update:

    x⁺ = x − γ (1/k) ∑_{t=1}^k ∇f_{i_t}(x)

Trivial parallelization; same analysis as (mini-batch) stochastic gradient descent.
This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.).
Requires a synchronization step at every iteration (step 3).
22. Asynchronous SGD
Synchronization is the bottleneck. What if we just ignore it?
Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients.
In theory: convergence under very strong assumptions.
In practice: it just works.
23. Hogwild in more detail
Each core follows the same procedure:
1. Read the coefficients x̂ from shared memory.
2. Sample i ∈ {1, …, n} uniformly at random.
3. Compute the partial gradient ∇f_i(x̂).
4. Write the SGD update to shared memory: x = x − γ∇f_i(x̂).
24. Hogwild is fast
Hogwild can be very fast. But it's still SGD…
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives (Emilie already mentioned some).
26. Analysis of asynchronous methods
Simple things become counter-intuitive, e.g., how should we name the iterates?
The iterates change depending on the speed of the processors.
27. Naming scheme in Hogwild
Simple, intuitive, and wrong: each time a core finishes writing to shared memory, increment the iteration counter.
⇐⇒ x̂_t = the (t + 1)-th successful update to shared memory.
The values of x̂_t and i_t are not determined until the iteration has finished.
=⇒ x̂_t and i_t are not necessarily independent.
28. Unbiased gradient estimate
SGD-like algorithms crucially rely on the unbiasedness property

    E_i[∇f_i(x)] = ∇f(x).

For synchronous algorithms, this follows from the uniform sampling of i:

    E_i[∇f_i(x)] = ∑_{i=1}^n Proba(selecting i) ∇f_i(x)
                 = ∑_{i=1}^n (1/n) ∇f_i(x)        (uniform sampling)
                 = ∇f(x)
29. A problematic example
This labeling scheme is incompatible with the unbiasedness assumption used in the proofs.
Illustration: a problem with two samples and two cores, f = (1/2)(f_1 + f_2), where computing ∇f_1 is much more expensive than computing ∇f_2.
Start at x_0. Because of the random sampling there are 4 equally likely scenarios:
1. Core 1 selects f_1, Core 2 selects f_1 =⇒ x_1 = x_0 − γ∇f_1(x_0)
2. Core 1 selects f_1, Core 2 selects f_2 =⇒ x_1 = x_0 − γ∇f_2(x_0)
3. Core 1 selects f_2, Core 2 selects f_1 =⇒ x_1 = x_0 − γ∇f_2(x_0)
4. Core 1 selects f_2, Core 2 selects f_2 =⇒ x_1 = x_0 − γ∇f_2(x_0)
So for the first update we have

    E_i[∇f_i] = (1/4)∇f_1 + (3/4)∇f_2 ≠ (1/2)∇f_1 + (1/2)∇f_2 !!
33. A new labeling scheme
A new way to name iterates: “after read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment the counter each time the vector of coefficients is read from shared memory.
There is no dependency between i_t and the cost of computing ∇f_{i_t}.
Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted).
36. The SAGA algorithm
Setting:

    minimize_x (1/n) ∑_{i=1}^n f_i(x)

The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, …, n} and compute (x⁺, α⁺) as

    x⁺ = x − γ(∇f_i(x) − α_i + ᾱ) ;  α_i⁺ = ∇f_i(x)

where ᾱ = (1/n) ∑_i α_i is the average of the memory terms.
• Like SGD, the update is unbiased, i.e., E_i[∇f_i(x) − α_i + ᾱ] = ∇f(x).
• Unlike SGD, because of the memory terms α_i, the variance → 0.
• Unlike SGD, it converges with a fixed step size (1/(3L)).
Super easy to use in scikit-learn.
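The SAGA update above can be written out in a few lines. This is a minimal NumPy sketch for a least-squares objective, meant to illustrate the memory terms and the incremental average; it is not the scikit-learn implementation.

```python
import numpy as np

def saga(A, b, n_steps=5000, seed=0):
    """SAGA for f(x) = 1/(2n) sum_i (a_i^T x - b_i)^2.
    Keeps a memory alpha_i of the last gradient seen for each sample."""
    n, p = A.shape
    L = np.max(np.sum(A**2, axis=1))  # per-sample Lipschitz constant
    gamma = 1.0 / (3 * L)             # fixed step size 1/(3L), as on the slide
    x = np.zeros(p)
    alpha = np.zeros((n, p))          # gradient memory, one row per sample
    alpha_bar = alpha.mean(axis=0)    # running average of the memories
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]
        x -= gamma * (g - alpha[i] + alpha_bar)  # unbiased SAGA update
        alpha_bar += (g - alpha[i]) / n          # maintain the average incrementally
        alpha[i] = g                             # overwrite the memory term
    return x
```

Because the memory terms converge to the gradients at the optimum, the variance of the update vanishes, which is what allows the fixed step size.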
38. Sparse SAGA
Need for a sparse variant of SAGA:
• A large part of large-scale datasets are sparse.
• For sparse datasets and generalized linear models (e.g., least squares, logistic regression), the partial gradients ∇f_i are sparse too.
• Asynchronous algorithms work best when updates are sparse.
The SAGA update is inefficient for sparse data:

    x⁺ = x − γ(∇f_i(x) − α_i + ᾱ) ;  α_i⁺ = ∇f_i(x)
             (sparse)  (sparse) (dense!)

[scikit-learn uses many tricks to make this efficient that we cannot use in an asynchronous version.]
39. Sparse SAGA
A sparse variant of SAGA. It relies on:
• the diagonal matrix P_i = projection onto the support of ∇f_i;
• the diagonal matrix D defined by D_{j,j} = n / (number of i for which ∇_j f_i is nonzero).
Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017):

    x⁺ = x − γ(∇f_i(x) − α_i + P_i D ᾱ) ;  α_i⁺ = ∇f_i(x)

• All operations are sparse; the cost per iteration is O(nonzeros in ∇f_i).
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
• Crucial property: E_i[P_i D] = I.
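The matrices P_i and D are cheap to build from the sparsity pattern of the data. A minimal sketch on a tiny hand-written matrix (the variable names are illustrative), checking the crucial property E_i[P_i D] = I:

```python
import numpy as np

# toy data matrix: row i gives the support of the partial gradient grad f_i
A = np.array([[1., 0., 2.],
              [0., 3., 0.],
              [4., 0., 5.],
              [0., 0., 6.]])
n, p = A.shape

# D_jj = n / (number of samples whose partial gradient touches feature j)
counts = (A != 0).sum(axis=0)
D = n / counts

def P(i):
    # P_i as a 0/1 vector: projection onto the support of grad f_i
    return (A[i] != 0).astype(float)

# crucial property: averaging P_i D over uniformly sampled i gives the identity
avg = np.mean([P(i) * D for i in range(n)], axis=0)
assert np.allclose(avg, np.ones(p))
```

Intuitively, D inflates each coordinate of ᾱ by how rarely it is touched, so that on average the projected, reweighted term P_i D ᾱ equals the dense term ᾱ.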
40. Asynchronous SAGA (ASAGA)
• Each core runs an instance of Sparse SAGA.
• All cores update the same vectors x, α_i and ᾱ.
Theory: under standard assumptions (bounded delays), same convergence rate as the sequential version
=⇒ theoretical linear speedup with respect to the number of cores.
41. Experiments
• Improved convergence of variance-reduced methods with respect to SGD.
• Significant improvement between 1 and 10 cores.
• The speedup is significant, but far from ideal.
43. Composite objective
The previous methods assume the objective function is smooth, so they cannot be applied to the Lasso, group Lasso, box constraints, etc.
Objective: minimize a composite objective function

    minimize_x (1/n) ∑_{i=1}^n f_i(x) + ‖x‖₁

where the f_i are smooth (and ‖·‖₁ is not). For simplicity we take the nonsmooth term to be the ℓ₁ norm, but this generalizes to any convex function for which we have access to the proximal operator.
44. (Prox)SAGA
The ProxSAGA update is inefficient:

    x⁺ = prox_{γh}( x − γ(∇f_i(x) − α_i + ᾱ) ) ;  α_i⁺ = ∇f_i(x)
         (dense!)        (sparse)  (sparse) (dense!)

=⇒ a sparse variant is needed as a prerequisite for a practical parallel method.
45. Sparse Proximal SAGA
Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017): an extension of Sparse SAGA to composite optimization problems.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:

    v_i = ∇f_i(x) − α_i + P_i D ᾱ ;  x⁺ = prox_{γφ_i}(x − γ v_i) ;  α_i⁺ = ∇f_i(x)

where P_i and D are as in Sparse SAGA and φ_i := ∑_{j=1}^d (P_i D)_{j,j} |x_j|.
φ_i has two key properties: i) support of φ_i = support of ∇f_i (sparse updates), and ii) E_i[φ_i] = ‖x‖₁ (unbiasedness).
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
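The proximal operator of a weighted ℓ₁ term such as φ_i is coordinate-wise soft-thresholding, which is what keeps the proximal step sparse. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def prox_weighted_l1(x, gamma, w):
    """prox of gamma * sum_j w_j |x_j|: coordinate-wise soft-thresholding.
    Coordinates with w_j = 0 (outside the support of grad f_i) pass
    through unchanged, so the update only touches the support."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * w, 0.0)
```

With w set to the diagonal of P_i D, the zero weights leave all coordinates outside the support of ∇f_i untouched, in contrast to the dense prox of the plain ℓ₁ norm.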
50. Proximal Asynchronous SAGA (ProxASAGA)
Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α_i and ᾱ in shared memory.
All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading or writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm =⇒ theoretical linear speedup with respect to the number of cores.
52. Empirical results - Speedup

    Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores)

[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12 and Criteo datasets, comparing Ideal, ProxASAGA, AsySPCD and FISTA.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between degree of sparsity and speedup.
55. Perspectives
• Scale above 20 cores.
• Asynchronous optimization on the GPU.
• Acceleration.
• Software development.
56. Code
The code is on GitHub: https://github.com/fabianp/ProxASAGA.
The computational code is C++ (using the atomic type) but wrapped in Python.
A very efficient implementation of SAGA can be found in the scikit-learn and lightning (https://github.com/scikit-learn-contrib/lightning) libraries.
57. References
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations simultanées”. In: Comp. Rend. Sci. Paris.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems.
Glowinski, Roland and A. Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In: Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique.
Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In: Advances in Neural Information Processing Systems 27.
Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30.
Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate updates”. In: SIAM Journal on Scientific Computing.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist.
Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed optimization using an approximate newton-type method”. In: International Conference on Machine Learning.
59. Supervised Machine Learning
Data: n observations (a_i, b_i) ∈ ℝᵖ × ℝ.
Prediction function: h(a, x) ∈ ℝ.
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = x_mᵀ σ(x_{m−1} σ(⋯ x_2ᵀ σ(x_1ᵀ a)))
[Figure: feedforward network with an input layer (a_1, …, a_5), a hidden layer, and an output layer.]
Minimize some distance (e.g., quadratic) between the prediction and the target:

    minimize_x (1/n) ∑_{i=1}^n ℓ(b_i, h(a_i, x)) =: (1/n) ∑_{i=1}^n f_i(x)

where popular examples of ℓ are:
• Squared loss: ℓ(b_i, h(a_i, x)) := (b_i − h(a_i, x))²
• Logistic (softmax) loss: ℓ(b_i, h(a_i, x)) := log(1 + exp(−b_i h(a_i, x)))
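The two losses above are one-liners; a minimal sketch (function names are illustrative):

```python
import numpy as np

def squared_loss(b, h):
    """(b - h)^2 -- for regression targets b in R."""
    return (b - h) ** 2

def logistic_loss(b, h):
    """log(1 + exp(-b*h)) -- for labels b in {-1, +1};
    log1p is the numerically stable way to evaluate log(1 + .)."""
    return np.log1p(np.exp(-b * h))
```

For example, a zero prediction under the logistic loss costs log 2 regardless of the label, while the squared loss of predicting 0.5 for a target of 1 is 0.25.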
61. Sparse Proximal SAGA
For step size γ = 1/(5L) and f μ-strongly convex (μ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have

    E‖x_t − x*‖² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C₀ ,

with C₀ = ‖x₀ − x*‖² + (1/(5L²)) ∑_{i=1}^n ‖α_i⁰ − ∇f_i(x*)‖² and κ = L/μ (the condition number).
Implications:
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data” regime (n ≥ κ): rate in O(1/n).
• In the “ill-conditioned” regime (n ≤ κ): rate in O(1/κ).
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
62. Convergence ProxASAGA
Suppose τ ≤ 1/(10√Δ). Then:
• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ).
• If κ < n, then with step size γ = 1/(36nμ), ProxASAGA converges geometrically with rate factor Ω(1/n).
In both cases, the convergence rate is the same as Sparse Proximal SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ.
If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to Sparse Proximal SAGA, making it adaptive to local strong convexity (knowledge of κ is not required).