1. SGD was tested on convex learning problems such as support vector machines and conditional random fields and showed performance competitive with specialized solvers.
2. On text categorization with SVMs, SGD achieved the same test error as SVMLight and SVMPerf but was over 100 times faster to train.
3. On log-loss classification, SGD achieved a slightly better test error than LibLinear and was over 10 times faster to train.
4. The results demonstrate that SGD can optimize simple learning problems very efficiently without the need for more complex optimization methods.
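The kind of SGD update behind these benchmarks is easy to sketch. Below is a minimal hinge-loss linear SVM trained by SGD on a toy separable dataset; the data, step-size schedule, and regularization constant are illustrative assumptions, not the settings from the experiments above.

```python
import random

def sgd_svm(data, epochs=50, lam=0.01, eta0=0.5):
    """Train a linear SVM by SGD on the regularized hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = eta0 / (1 + eta0 * lam * t)      # decaying step size
            margin = y * (w[0]*x[0] + w[1]*x[1] + b)
            w = [wi * (1 - eta * lam) for wi in w]  # regularization shrinkage
            if margin < 1:                          # hinge loss is active
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

random.seed(0)
pos = [((1.0 + 0.1*i, 1.0 + 0.05*i), +1) for i in range(20)]
neg = [((-1.0 - 0.1*i, -1.0 - 0.05*i), -1) for i in range(20)]
data = pos + neg
w, b = sgd_svm(list(data))
acc = sum(1 for x, y in data
          if y * (w[0]*x[0] + w[1]*x[1] + b) > 0) / len(data)
```

Each pass touches one example at a time, which is what makes the method so much cheaper per iteration than batch solvers.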
The document discusses the principle of maximum entropy. It explains that maximum entropy is an approach for making probability assignments where the assigned probability distribution should have the largest entropy or uncertainty possible, subject to whatever information is known. It describes applications of maximum entropy modeling such as part-of-speech tagging and logistic regression. Maximum entropy and maximum likelihood methods are related as they both aim to make distributions as uniform as possible based on available information.
This document discusses maximum entropy models and their application to natural language processing tasks. It introduces maximum entropy modeling, describing how the approach works by choosing a probability distribution with maximum entropy subject to constraints from observed data. Training involves using algorithms like generalized iterative scaling to find optimal feature weights that satisfy the constraints. Maximum entropy models have been applied successfully to tasks like part-of-speech tagging, machine translation, and language modeling.
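As a tiny illustration of the principle described above (not from the slides): among all distributions on a die's outcomes with a prescribed mean, the entropy-maximizing one has exponential-family form p(x) ∝ exp(λx), and the single multiplier λ can be found by a one-dimensional search.

```python
import math

def maxent_die(target_mean, lo=-5.0, hi=5.0, iters=100):
    """Max-entropy distribution over {1..6} with a fixed mean:
    p(x) ∝ exp(lam * x); solve for lam by bisection (mean is
    monotonically increasing in lam)."""
    def mean(lam):
        ws = [math.exp(lam * x) for x in range(1, 7)]
        z = sum(ws)
        return sum(x * w for x, w in zip(range(1, 7), ws)) / z
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    ws = [math.exp(lam * x) for x in range(1, 7)]
    z = sum(ws)
    return [w / z for w in ws]

p = maxent_die(4.5)
m = sum(x * px for x, px in zip(range(1, 7), p))
```

With more features the same structure appears, only with one weight per feature, which is what iterative scaling algorithms solve for.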
A Maximum Entropy Approach to the Loss Data Aggregation Problem (Erika G. G.)
This document discusses using a maximum entropy approach to model loss distributions for operational risk modeling. It begins by motivating the need to accurately model loss distributions given challenges with limited datasets, including heavy tails and dependence between risks. It then provides an overview of the loss distribution approach commonly used in operational risk modeling and its limitations. The document introduces the maximum entropy approach, which frames the problem as maximizing entropy subject to moment constraints. It discusses using the Laplace transform of loss distributions to compress information into moments and how these can be estimated from sample data or fitted distributions to serve as constraints for the maximum entropy approach.
Let me summarize the key points:
- The min-entropy of a state ρAB can be expressed as the optimal value of a semi-definite program (SDP)
- The SDP has a primal and dual problem
- Solving the dual problem gives an operator XAB that achieves the maximum of the dual objective ⟨ρAB, XAB⟩ (the Hilbert-Schmidt inner product Tr(ρAB XAB))
- The optimal dual states satisfy XB = 1B and correspond to Choi-Jamiolkowski states of unital CPMs
- This formulation allows the min-entropy to be computed by solving an SDP, which can be done efficiently for small systems
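The bullet points above correspond to the standard SDP formulation of the conditional min-entropy; sketched here for reference, in the usual notation:

```latex
H_{\min}(A|B)_\rho
  = -\log_2 \min_{\sigma_B \succeq 0}
    \left\{ \operatorname{Tr}\sigma_B \;:\; \mathbb{1}_A \otimes \sigma_B \succeq \rho_{AB} \right\}
  = -\log_2 \max_{X_{AB} \succeq 0}
    \left\{ \operatorname{Tr}(\rho_{AB} X_{AB}) \;:\; \operatorname{Tr}_A X_{AB} = \mathbb{1}_B \right\}
```

Strong duality makes the two optimal values coincide, which is why solving either program suffices.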
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search (Ashwin Rao)
The document summarizes the Adaptive Multistage Sampling (AMS) algorithm, which is considered the "spiritual origin" of Monte Carlo Tree Search (MCTS). AMS is a generic simulation-based algorithm for solving finite-horizon Markov Decision Processes when the state space is very large. It works by allocating a fixed number of action selections per state over time steps, and using an explore-exploit formula to select actions during simulations. The estimates produced by AMS are asymptotically unbiased and converge to the optimal values.
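The explore-exploit formula AMS uses for action selection is UCB-style. A minimal sketch on a single state with two deterministic-reward actions follows; the constants and reward values are illustrative, not from the paper.

```python
import math

def ams_select(counts, values, t):
    """UCB-style explore-exploit rule: try every action once, then pick
    the argmax of (value estimate) + sqrt(2 ln t / n_a)."""
    for a, n in counts.items():
        if n == 0:
            return a
    return max(counts, key=lambda a: values[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

# toy single-state setting with deterministic rewards
rewards = {0: 0.3, 1: 0.7}
counts = {0: 0, 1: 0}
values = {0: 0.0, 1: 0.0}
for t in range(1, 101):
    a = ams_select(counts, values, t)
    counts[a] += 1
    values[a] += rewards[a]
```

The exploration bonus shrinks as an action is sampled more, so the budget concentrates on the apparently better action while still revisiting the other one occasionally.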
A Quick and Terse Introduction to Efficient Frontier Mathematics (Ashwin Rao)
A Quick and Terse Introduction to Efficient Frontier Mathematics. Only a basic background in Linear Algebra, Probability, and Optimization is needed to cover this material and gain a reasonable understanding of the topic within one hour.
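For reference, the core optimization behind efficient-frontier mathematics, in the standard mean-variance setup (the notation here is the conventional one, not necessarily the slides'):

```latex
\min_{w}\; \tfrac{1}{2}\, w^\top \Sigma w
\quad \text{s.t.} \quad w^\top \mu = r, \qquad w^\top \mathbf{1} = 1
```

The Lagrangian solution is \(w^* = \Sigma^{-1}(\lambda \mu + \gamma \mathbf{1})\), with the scalars \(\lambda, \gamma\) fixed by the two constraints; sweeping the target return \(r\) traces out the frontier.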
RSS discussion of Girolami and Calderhead, October 13, 2010 (Christian Robert)
1. The document discusses discretizing Hamiltonians for Markov chain Monte Carlo (MCMC) methods. Specifically, it examines reproducing Hamiltonian equations through discretization, such as via generalized leapfrog.
2. However, the invariance and stability properties of the continuous-time process may not carry over to the discretized version. Approximations can be corrected with a Metropolis-Hastings step, so exactly reproducing the continuous behavior is not necessarily useful.
3. Discretization induces a calibration problem of determining the appropriate step size. Convergence issues for the MCMC algorithm should not be impacted by imperfect renderings of the continuous-time process in discrete time.
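For a separable Hamiltonian H(q, p) = U(q) + p²/2, the generalized leapfrog reduces to the standard leapfrog scheme. The sketch below, on a harmonic oscillator (U(q) = q²/2, an illustrative choice), shows the near-conservation of H that makes the discretization attractive as a proposal mechanism.

```python
def leapfrog(q, p, grad_U, eps, steps):
    """Leapfrog discretization of Hamiltonian dynamics for
    H(q, p) = U(q) + p^2 / 2."""
    p = p - 0.5 * eps * grad_U(q)      # initial half step in momentum
    for _ in range(steps - 1):
        q = q + eps * p                # full step in position
        p = p - eps * grad_U(q)        # full step in momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)      # final half step in momentum
    return q, p

grad_U = lambda q: q                   # U(q) = q^2 / 2
q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, grad_U, eps=0.1, steps=100)
H0 = q0**2 / 2 + p0**2 / 2
H1 = q1**2 / 2 + p1**2 / 2
```

The residual energy error, of order eps², is exactly what the Metropolis-Hastings correction absorbs, which is why an exact rendering of the continuous dynamics is unnecessary.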
The document summarizes the policy gradient theorem, which provides a way to perform policy improvement in reinforcement learning using gradient ascent on the expected returns with respect to the policy parameters. It begins by motivating policy gradients as a way to do policy improvement when the action space is large or continuous. It then defines the necessary notation, expected returns objective function, and discounted state visitation measure. The main part of the document proves the policy gradient theorem, which expresses the policy gradient as an expectation over the discounted state visitation measure and action-value function. It notes that in practice the action-value function must be estimated, and proves the compatible function approximation theorem, which ensures the policy gradient is computed correctly when using an estimated action-value function.
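The theorem's statement, in the commonly used notation (here ρ is the discounted state-visitation measure and Q the action-value function, as in the summary above):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```

The key point is that the gradient does not involve differentiating the state distribution itself, which is what makes sample-based estimation feasible.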
This document provides an introduction to concepts in differential geometry including manifolds, tangent spaces, vector fields, differential forms, and operations on differential forms such as the exterior product and integration. It outlines key definitions and properties for differential geometry, Riemannian geometry, and applications to probability and statistics. The document is divided into three main sections on differential geometry, Riemannian geometry, and settings without Riemannian geometry.
Evolutionary Strategies as an alternative to Reinforcement Learning (Ashwin Rao)
This document introduces Evolutionary Strategies (ES), a black-box optimization technique inspired by natural evolution. ES frames optimization as maximizing the expected value of an objective function over a parameterized probability distribution of candidate solutions. It uses stochastic gradient ascent to update the distribution's parameters based on sampled solutions and their objective values. The document applies ES to solving Markov Decision Processes by sampling candidate policies and estimating the gradient of expected return. While simpler than reinforcement learning, ES can be effective for optimization problems and more robust than advanced policy gradient methods.
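The gradient estimator at the heart of ES can be sketched in a few lines. The one-dimensional objective below is an illustrative stand-in for expected return, and the hyperparameters are arbitrary choices, not the document's.

```python
import random

def es_step(theta, f, sigma=0.1, n=50, lr=0.05):
    """One evolution-strategies update: perturb the parameter with
    Gaussian noise, evaluate, and move along the score-weighted
    average of the perturbations (a stochastic gradient of
    E[f(theta + sigma * eps)])."""
    grad = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        grad += f(theta + sigma * eps) * eps
    grad /= n * sigma
    return theta + lr * grad

random.seed(1)
f = lambda x: -(x - 3.0) ** 2      # toy objective, maximized at x = 3
theta = 0.0
for _ in range(300):
    theta = es_step(theta, f)
```

Note that only function evaluations are needed, never gradients of f itself, which is what makes ES a black-box method.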
This document discusses approximate Bayesian computation (ABC) methods for Bayesian model choice. It begins by introducing ABC as a technique for performing Bayesian inference when the likelihood function is intractable. It then describes how ABC can be used for model choice by sampling models from their prior and parameters from the prior conditional on the model. The document outlines the generic ABC algorithm for model choice and discusses how it estimates posterior model probabilities. It also discusses applications of ABC for model choice in Gibbs random fields and Potts models.
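The generic ABC rejection step described above can be sketched on a deliberately trivial coin-flip model; the model, summary statistic, and tolerance are illustrative choices, not from the document.

```python
import random

def abc_rejection(obs_heads, n_flips, n_sims, tol):
    """ABC rejection: draw p from the uniform prior, simulate data,
    and keep p when the simulated summary (number of heads) is
    within tol of the observed one."""
    accepted = []
    for _ in range(n_sims):
        p = random.random()                            # prior draw
        heads = sum(random.random() < p for _ in range(n_flips))
        if abs(heads - obs_heads) <= tol:
            accepted.append(p)
    return accepted

random.seed(42)
post = abc_rejection(obs_heads=7, n_flips=10, n_sims=20000, tol=0)
mean = sum(post) / len(post)   # should approach the Beta(8, 4) mean 2/3
```

With tol = 0 and a discrete summary this samples the exact posterior; in realistic problems the tolerance is positive and its calibration becomes the central issue.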
This document discusses approximation algorithms and introduces several combinatorial optimization problems. It begins by explaining that approximation algorithms are needed to find near-optimal solutions for problems that cannot be solved in polynomial time, such as set cover and bin packing. It then provides examples of problems that are in P, NP, and NP-complete. Several techniques for designing approximation algorithms are outlined, including greedy algorithms, linear programming, and semidefinite programming. Specific NP-complete problems like vertex cover, set cover, and independent set are introduced, and approximation algorithms with performance guarantees are provided for set cover and vertex cover.
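The greedy strategy for set cover, one of the guaranteed approximations mentioned above, is short enough to show in full; the instance below is an arbitrary example.

```python
def greedy_set_cover(universe, subsets):
    """Greedy set cover: repeatedly pick the subset covering the most
    still-uncovered elements (achieves an H(n) ~ ln n approximation
    guarantee)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not (uncovered & best):
            raise ValueError("universe not coverable by given subsets")
        chosen.append(best)
        uncovered -= best
    return chosen

cover = greedy_set_cover(
    range(1, 8),
    [{1, 2, 3}, {4, 5, 6}, {7}, {1, 4, 7}, {2, 5}, {3, 6}])
```

On this instance greedy happens to find an optimal 3-set cover; in general it can be off by the logarithmic factor, and that factor is essentially tight.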
This document outlines Hadi Sinaee's seminar on Restricted Boltzmann Machines (RBMs) from scratch. The seminar covers:
1. Unsupervised learning and using Markov Random Fields (MRFs) to learn unknown data distributions.
2. Maximum likelihood estimation cannot be done analytically for MRFs, so numerical approximation is required.
3. Introducing latent variables in the form of hidden units allows modeling high-dimensional distributions like images.
4. Computing the log-likelihood gradient involves taking expectations that require summing over all possible latent variable assignments, so approximation is needed.
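The gradient mentioned in point 4 has the familiar two-expectation form; for an RBM weight $w_{ij}$ coupling visible unit $v_i$ and hidden unit $h_j$:

```latex
\frac{\partial \log p(v)}{\partial w_{ij}}
  = \mathbb{E}_{p(h \mid v)}\left[ v_i h_j \right]
  - \mathbb{E}_{p(v, h)}\left[ v_i h_j \right]
```

The first (data-driven) expectation is cheap because the hidden units are conditionally independent given $v$; the second (model) expectation is the intractable sum over all configurations, which motivates approximations such as contrastive divergence.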
This document describes a new machine learning algorithm called the Balancing Board Machine (BBM). BBM approximates the centroid of the polyhedral cone that represents the version space in kernel machines. It does this by computing the centroid of the intersection between the version space cone and a hyperplane, and then iteratively "balancing" the hyperplane towards the centroid. This process improves the generalization performance over support vector machines. The document provides mathematical details on exactly computing polytope centroids and deriving efficient approximation algorithms for implementing BBM.
The document summarizes a talk given by Mark Girolami on manifold Monte Carlo methods. It discusses using stochastic diffusions and geometric concepts to improve MCMC methods. Specifically, it proposes using discretized Langevin and Hamiltonian diffusions across a Riemann manifold as an adaptive proposal mechanism. This is founded on deterministic geodesic flows on the manifold. Examples presented include a warped bivariate Gaussian, Gaussian mixture model, and log-Gaussian Cox process.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
This document summarizes Andrew Ng's lecture notes on supervised learning and linear regression. It begins with examples of supervised learning problems like predicting housing prices from living area size. It introduces key concepts like training examples, features, hypotheses, and cost functions. It then describes using linear regression to predict prices from area and bedrooms. Gradient descent and stochastic gradient descent are introduced as algorithms to minimize the cost function. Finally, it discusses an alternative approach using the normal equations to explicitly minimize the cost function without iteration.
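The normal-equations approach mentioned at the end solves (XᵀX)θ = Xᵀy in closed form. A minimal pure-Python sketch for a two-parameter model follows; the toy data is invented, not Ng's housing examples.

```python
def normal_equations(X, y):
    """Solve least squares via the normal equations (X^T X) theta = X^T y,
    specialized to a 2-column design matrix so the 2x2 system can be
    inverted by hand."""
    a = b = d = e = f = 0.0
    for (x0, x1), yi in zip(X, y):
        a += x0 * x0; b += x0 * x1; d += x1 * x1   # entries of X^T X
        e += x0 * yi; f += x1 * yi                  # entries of X^T y
    c = b                                           # X^T X is symmetric
    det = a * d - b * c
    return ((d * e - b * f) / det, (a * f - c * e) / det)

# exact line y = 2 + 3x, with an intercept column of ones
X = [(1.0, x) for x in [0.0, 1.0, 2.0, 3.0]]
y = [2.0 + 3.0 * x for _, x in X]
theta = normal_equations(X, y)
```

Unlike gradient descent this requires no iteration or step-size tuning, at the cost of solving (and for many features, inverting) a d×d system.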
This document discusses approximate Bayesian computation (ABC) methods for Bayesian model choice. It begins with an overview of ABC and how it can be used for model choice by simulating parameters from each model's prior and accepting simulations that meet a tolerance threshold based on summary statistics. It then discusses how ABC can estimate posterior model probabilities and issues that can arise, such as whether tolerances and summary statistics should vary across models. The document also discusses how ABC performs in limiting cases and when summary statistics are sufficient. ABC provides a way to perform Bayesian model choice and estimation when likelihoods are intractable.
The document discusses several computational methods for Bayesian model choice, including importance sampling, bridge sampling, nested sampling, and approximate Bayesian computation (ABC) model choice. Importance sampling can be used to approximate Bayes factors by sampling from an importance function. Bridge sampling provides an alternative estimator that performs better when the models being compared have little overlap. Nested sampling and ABC can also be applied to Bayesian model choice by approximating the marginal likelihood of each model.
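The importance-sampling estimator of a model's marginal likelihood (evidence), used to approximate Bayes factors as described above, draws from an importance function $g$:

```latex
m(x) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
    \frac{p(x \mid \theta_i)\, \pi(\theta_i)}{g(\theta_i)},
  \qquad \theta_i \sim g
```

Its variance depends on how well $g$ matches the integrand, which is precisely the weakness that bridge sampling addresses when the compared models overlap little.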
1) The document discusses PAC (probably approximately correct) learning and the VC (Vapnik-Chervonenkis) dimension.
2) The VC dimension measures the complexity of a hypothesis space or learning machine. It is defined as the maximum number of points that can be arranged so that the hypotheses shatter them, meaning every possible labeling of the points is represented by some hypothesis.
3) For linear classifiers of the form f(x) = sign(w·x + b) in R^d, the VC dimension is d+1. This provides an upper bound on the complexity of the hypothesis space.
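The claim in point 3 can be checked by brute force for d = 2: three points in general position are shattered by linear classifiers. The grid search below is an illustrative verification, not an efficient method; the points and grid are arbitrary choices.

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # not collinear
grid = [-1.0, -0.5, 0.5, 1.0]                   # candidate w1, w2, b values

def realizable(labels):
    """Is there a classifier sign(w1*x + w2*y + b) producing these labels?"""
    for w1, w2, b in product(grid, grid, grid):
        outs = [1 if w1 * x + w2 * y + b > 0 else -1 for x, y in points]
        if outs == list(labels):
            return True
    return False

# all 2^3 labelings must be realizable for the points to be shattered
shattered = all(realizable(lab) for lab in product([-1, 1], repeat=3))
```

No set of four points in the plane can be shattered this way, which is what pins the VC dimension of 2-D linear classifiers at exactly 3 = d + 1.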
A Numerical Analytic Continuation and Its Application to Fourier Transform (Hidenori Ogata)
These are the slides for a talk given at the conference "ApplMath18" (9th Conference on Applied Mathematics and Scientific Computing, 17-20 September 2018, Solaris, Sibenik, Croatia). We propose a numerical method of analytic continuation based on continued fractions. Theoretical analysis and numerical examples show that the method is effective, exhibiting exponential convergence. We also apply it to the computation of Fourier transforms.
Overview of Stochastic Calculus Foundations (Ashwin Rao)
1) Sample paths of Brownian motion are continuous but almost surely nowhere differentiable. They have infinite total variation but finite quadratic variation.
2) Ito's lemma (the stochastic chain rule) and the Ito isometry relate stochastic integrals with respect to Brownian motion to ordinary integrals, allowing stochastic processes to be analyzed using tools from calculus and probability theory.
3) The Fokker-Planck and Kolmogorov equations link stochastic differential equations to partial differential equations governing the probability density function of the process, while the Feynman-Kac formula relates certain PDEs to conditional expectations of the process.
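Ito's lemma, for a process $dX_t = \mu\, dt + \sigma\, dW_t$ and a smooth function $f(t, x)$ (the standard statement, included for reference):

```latex
df(t, X_t) = \left( \frac{\partial f}{\partial t}
  + \mu \frac{\partial f}{\partial x}
  + \tfrac{1}{2}\, \sigma^2 \frac{\partial^2 f}{\partial x^2} \right) dt
  + \sigma \frac{\partial f}{\partial x}\, dW_t
```

The extra second-derivative term, absent from the classical chain rule, is exactly the contribution of the finite quadratic variation noted in point 1.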
This document provides an overview of PAC (Probably Approximately Correct) learning theory. It discusses how PAC learning relates the probability of successful learning to the number of training examples, complexity of the hypothesis space, and accuracy of approximating the target function. Key concepts explained include training error vs true error, overfitting, the VC dimension as a measure of hypothesis space complexity, and how PAC learning bounds can be derived for finite and infinite hypothesis spaces based on factors like the training size and VC dimension.
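A representative finite-hypothesis-space bound of the kind such notes derive: for a learner that outputs a hypothesis consistent with the training data, drawn from a finite space $H$, a sample size of

```latex
m \;\ge\; \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)
```

suffices for the true error to be at most $\varepsilon$ with probability at least $1 - \delta$. For infinite hypothesis spaces, $\ln |H|$ is replaced by a term in the VC dimension.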
This document introduces modern variational inference techniques. It discusses:
1. The goal of variational inference is to approximate the posterior distribution p(θ|D) over latent parameters θ given data D.
2. This is done by positing a variational distribution qλ(θ) and optimizing its parameters λ to minimize the KL divergence between qλ(θ) and p(θ|D).
3. The evidence lower bound (ELBO) is used as a variational objective that can be optimized using stochastic gradient descent, with gradients estimated using Monte Carlo sampling and reparametrization.
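The decomposition behind point 3, for data $D$ and variational family $q_\lambda$:

```latex
\log p(D)
  = \underbrace{\mathbb{E}_{q_\lambda}\!\left[ \log p(D, \theta) - \log q_\lambda(\theta) \right]}_{\mathrm{ELBO}(\lambda)}
  \;+\; \mathrm{KL}\!\left( q_\lambda(\theta) \,\middle\|\, p(\theta \mid D) \right)
```

Since $\log p(D)$ does not depend on $\lambda$, maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior.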
The document discusses approximation algorithms for NP-hard problems. It begins with an introduction that defines approximation algorithms as algorithms that find feasible but not necessarily optimal solutions to optimization problems in polynomial time.
It then discusses different types of approximation schemes - absolute approximation where the approximate solution is within a constant of optimal, epsilon (ε)-approximation where the approximate solution is within a factor of ε times optimal, and polynomial time approximation schemes that run in polynomial time for any fixed ε.
The document provides examples of problems that admit absolute approximation algorithms, such as planar graph coloring and maximum programs stored on disks. It also discusses Graham's theorem, which proves that the largest-processing-time (LPT) scheduling rule generates schedules within 1/3 of optimal (its makespan is at most 4/3 - 1/(3m) times the optimum on m machines).
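The LPT rule analyzed by Graham's theorem can be sketched in a few lines; the job list below is an illustrative instance, not one from the document.

```python
import heapq

def lpt_schedule(jobs, m):
    """Graham's LPT rule: sort jobs by decreasing processing time and
    assign each to the currently least-loaded of the m machines."""
    loads = [0.0] * m
    heapq.heapify(loads)                 # min-heap of machine loads
    for t in sorted(jobs, reverse=True):
        least = heapq.heappop(loads)     # least-loaded machine
        heapq.heappush(loads, least + t)
    return max(loads)                    # makespan of the schedule

makespan = lpt_schedule([7, 7, 6, 6, 5, 5, 4, 4, 4], 3)
```

On this instance LPT happens to achieve the optimal makespan (the total work of 48 splits evenly across 3 machines); the 4/3 - 1/(3m) factor is a worst-case bound, not the typical behavior.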
1. Recurrent neural networks (RNNs) allow information to persist from previous time steps through hidden states and can process input sequences of variable lengths. Common RNN architectures include LSTMs and GRUs which address the vanishing gradient problem of traditional RNNs.
2. RNNs are commonly used for natural language processing tasks like machine translation, sentiment classification, and named entity recognition. They learn distributed word representations through techniques like word2vec, GloVe, and negative sampling.
3. Machine translation models use an encoder-decoder architecture with an RNN encoder and decoder. Beam search is commonly used to find high-probability translation sequences. Performance is evaluated using metrics like BLEU score.
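Beam search as used by these decoders can be sketched against a toy next-token log-probability table; the table, tokens, and end-of-sequence marker below are invented for illustration.

```python
import math

# toy next-token log-probabilities; "</s>" terminates a hypothesis
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a":   {"a": math.log(0.1), "b": math.log(0.5), "</s>": math.log(0.4)},
    "b":   {"a": math.log(0.3), "b": math.log(0.2), "</s>": math.log(0.5)},
}

def beam_search(k=2, max_len=4):
    """Keep only the k highest-scoring partial sequences at each step;
    completed hypotheses are set aside and the best one is returned."""
    beams = [(0.0, ["<s>"])]
    done = []
    for _ in range(max_len):
        cand = []
        for score, seq in beams:
            for tok, lp in LOGP[seq[-1]].items():
                if tok == "</s>":
                    done.append((score + lp, seq))
                else:
                    cand.append((score + lp, seq + [tok]))
        beams = sorted(cand, reverse=True)[:k]
    return max(done)

best = beam_search()
```

With k = 1 this degenerates to greedy decoding; larger beams trade computation for a better chance of finding the highest-probability sequence.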
The document discusses approximation algorithms for solving hard combinatorial optimization problems. It defines optimization problems and covers NP-hard problems like the clique, independent set, vertex cover, and traveling salesman problems. Approaches for solving NP-hard problems include exact algorithms, approximation algorithms that provide guaranteed good solutions, and heuristics without guarantees. Approximation algorithms aim to settle for good enough solutions rather than optimal ones.
The presentation is an introduction to decision making with approximate Bayesian Methods. It consists of a review of Bayesian Decision Theory and Variational Inference along with a description of Loss Calibrated Variational Inference.
ICCV2009: MAP Inference in Discrete Models: Part 2 (zukun)
The course program includes sessions on discrete models in computer vision, message passing algorithms like dynamic programming and tree-reweighted message passing, quadratic pseudo-boolean optimization, transformation and move-making methods, speed and efficiency of algorithms, and a comparison of inference methods. Recent advances like dual decomposition and higher-order models will also be discussed. All materials from the tutorial will be made available online after the conference.
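Dynamic programming on a chain, the simplest of the message-passing algorithms listed, computes an exact MAP labeling. A min-sum sketch on a toy 3-node, 2-label chain follows; the potentials are invented for illustration.

```python
def chain_map(unary, pairwise):
    """Exact MAP assignment of a chain MRF by min-sum dynamic
    programming (Viterbi): forward pass accumulates minimal costs,
    backward pass recovers the argmin labeling."""
    n, k = len(unary), len(unary[0])
    cost = list(unary[0])
    back = []
    for t in range(1, n):
        new, arg = [], []
        for j in range(k):
            best_i = min(range(k), key=lambda i: cost[i] + pairwise[i][j])
            arg.append(best_i)
            new.append(cost[best_i] + pairwise[best_i][j] + unary[t][j])
        cost = new
        back.append(arg)
    j = min(range(k), key=lambda j: cost[j])
    best_cost = cost[j]
    path = [j]
    for arg in reversed(back):       # trace back-pointers
        j = arg[j]
        path.append(j)
    return path[::-1], best_cost

unary = [[0, 2], [2, 0], [0, 2]]     # per-node label costs
pairwise = [[0, 1], [1, 0]]          # Potts-style disagreement cost
path, total = chain_map(unary, pairwise)
```

Tree-structured models admit the same exact procedure; it is loopy graphs that force the approximate methods (TRW, QPBO, move-making) covered later in the course.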
1) Recent advances in object recognition focused on detection but further improvements require better segmentation and mid-level representations to better localize objects.
2) The next challenges are in object interpretation, requiring new representations that can handle unfamiliar objects, physical context, and task-specific relevance.
3) Building complex, multi-faceted representations of objects that describe categories, pose, materials, characteristics, and relationships will be needed to make progress on object interpretation.
This document provides an introduction to concepts in differential geometry including manifolds, tangent spaces, vector fields, differential forms, and operations on differential forms such as the exterior product and integration. It outlines key definitions and properties for differential geometry, Riemannian geometry, and applications to probability and statistics. The document is divided into three main sections on differential geometry, Riemannian geometry, and settings without Riemannian geometry.
Evolutionary Strategies as an alternative to Reinforcement LearningAshwin Rao
This document introduces Evolutionary Strategies (ES), a black-box optimization technique inspired by natural evolution. ES frames optimization as maximizing the expected value of an objective function over a parameterized probability distribution of candidate solutions. It uses stochastic gradient ascent to update the distribution's parameters based on sampled solutions and their objective values. The document applies ES to solving Markov Decision Processes by sampling candidate policies and estimating the gradient of expected return. While simpler than reinforcement learning, ES can be effective for optimization problems and more robust than advanced policy gradient methods.
This document discusses approximate Bayesian computation (ABC) methods for Bayesian model choice. It begins by introducing ABC as a technique for performing Bayesian inference when the likelihood function is intractable. It then describes how ABC can be used for model choice by sampling models from their prior and parameters from the prior conditional on the model. The document outlines the generic ABC algorithm for model choice and discusses how it estimates posterior model probabilities. It also discusses applications of ABC for model choice in Gibbs random fields and Potts models.
This document discusses approximation algorithms and introduces several combinatorial optimization problems. It begins by explaining that approximation algorithms are needed to find near-optimal solutions for problems that cannot be solved in polynomial time, such as set cover and bin packing. It then provides examples of problems that are in P, NP, and NP-complete. Several techniques for designing approximation algorithms are outlined, including greedy algorithms, linear programming, and semidefinite programming. Specific NP-complete problems like vertex cover, set cover, and independent set are introduced and approximations algorithms with performance guarantees are provided for set cover and vertex cover.
This document outlines Hadi Sinaee's seminar on Restricted Boltzmann Machines (RBMs) from scratch. The seminar covers:
1. Unsupervised learning and using Markov Random Fields (MRFs) to learn unknown data distributions.
2. Maximum likelihood estimation cannot be done analytically for MRFs, so numerical approximation is required.
3. Introducing latent variables in the form of hidden units allows modeling high-dimensional distributions like images.
4. Computing the log-likelihood gradient involves taking expectations that require summing over all possible latent variable assignments, so approximation is needed.
This document describes a new machine learning algorithm called the Balancing Board Machine (BBM). BBM approximates the centroid of the polyhedral cone that represents the version space in kernel machines. It does this by computing the centroid of the intersection between the version space cone and a hyperplane, and then iteratively "balancing" the hyperplane towards the centroid. This process improves the generalization performance over support vector machines. The document provides mathematical details on exactly computing polytope centroids and deriving efficient approximation algorithms for implementing BBM.
The document summarizes a talk given by Mark Girolami on manifold Monte Carlo methods. It discusses using stochastic diffusions and geometric concepts to improve MCMC methods. Specifically, it proposes using discretized Langevin and Hamiltonian diffusions across a Riemann manifold as an adaptive proposal mechanism. This is founded on deterministic geodesic flows on the manifold. Examples presented include a warped bivariate Gaussian, Gaussian mixture model, and log-Gaussian Cox process.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
This document summarizes Andrew Ng's lecture notes on supervised learning and linear regression. It begins with examples of supervised learning problems like predicting housing prices from living area size. It introduces key concepts like training examples, features, hypotheses, and cost functions. It then describes using linear regression to predict prices from area and bedrooms. Gradient descent and stochastic gradient descent are introduced as algorithms to minimize the cost function. Finally, it discusses an alternative approach using the normal equations to explicitly minimize the cost function without iteration.
This document discusses approximate Bayesian computation (ABC) methods for Bayesian model choice. It begins with an overview of ABC and how it can be used for model choice by simulating parameters from each model's prior and accepting simulations that meet a tolerance threshold based on summary statistics. It then discusses how ABC can estimate posterior model probabilities and issues that can arise, such as whether tolerances and summary statistics should vary across models. The document also discusses how ABC performs in limiting cases and when summary statistics are sufficient. ABC provides a way to perform Bayesian model choice and estimation when likelihoods are intractable.
The document discusses several computational methods for Bayesian model choice, including importance sampling, bridge sampling, nested sampling, and approximate Bayesian computation (ABC) model choice. Importance sampling can be used to approximate Bayes factors by sampling from an importance function. Bridge sampling provides an alternative estimator that performs better when the models being compared have little overlap. Nested sampling and ABC can also be applied to Bayesian model choice by approximating the marginal likelihood of each model.
1) The document discusses PAC (probably approximately correct) learning and the VC (Vapnik-Chervonenkis) dimension.
2) The VC dimension measures the complexity of a hypothesis space or learning machine. It is defined as the maximum number of points that can be arranged so that the hypotheses shatter them, meaning every possible labeling of the points is represented by some hypothesis.
3) For linear classifiers of the form f(x) = sign(w·x + b) in R^d, the VC dimension is d+1. This provides an upper bound on the complexity of the hypothesis space.
A Numerical Analytic Continuation and Its Application to Fourier TransformHidenoriOgata
It is a slide for a talk given in the conference "ApplMath18" (9th Conference on Applied Mathematics and Scientific Computing, 17-20 September, 2018, Solaris, Sibenik, Croatia). We propose a numerical method of analytic continuation using continued fraction. From theoretical analysis and numerical examples, our method is so effective that it shows exponential convergence. We also apply our method to the computation of Fourier transforms.
Overview of Stochastic Calculus Foundations
1) Sample paths of Brownian motion are continuous but almost surely nowhere differentiable. They have infinite total variation but finite quadratic variation.
2) Ito's lemma and the Ito isometry relate stochastic integrals with respect to Brownian motion to ordinary deterministic integrals, allowing stochastic processes to be analyzed using tools from calculus and probability theory.
3) The Fokker-Planck and Kolmogorov equations link stochastic differential equations to partial differential equations governing the probability density function of the process, while the Feynman-Kac formula relates certain PDEs to conditional expectations of the process.
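A quick simulation illustrates the first point above. This sketch (the horizon, step count, and seed are illustrative choices, not from the slides) discretizes a Brownian path and shows that the sum of squared increments concentrates near the horizon T, while the sum of absolute increments grows without bound as the grid is refined.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 200_000                           # horizon and number of grid steps
dW = rng.normal(0.0, np.sqrt(T / n), size=n)  # independent Brownian increments

quad_var = np.sum(dW ** 2)        # converges to T as n grows (finite quadratic variation)
total_var = np.sum(np.abs(dW))    # grows like sqrt(n) (infinite total variation in the limit)

print(round(quad_var, 2))  # close to T = 1.0
print(total_var > 100.0)   # already large at this resolution
```

Refining the grid (increasing `n`) leaves `quad_var` near T but makes `total_var` arbitrarily large, which is the discrete shadow of the continuous-time statement.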
This document provides an overview of PAC (Probably Approximately Correct) learning theory. It discusses how PAC learning relates the probability of successful learning to the number of training examples, complexity of the hypothesis space, and accuracy of approximating the target function. Key concepts explained include training error vs true error, overfitting, the VC dimension as a measure of hypothesis space complexity, and how PAC learning bounds can be derived for finite and infinite hypothesis spaces based on factors like the training size and VC dimension.
This document introduces modern variational inference techniques. It discusses:
1. The goal of variational inference is to approximate the posterior distribution p(θ|D) over latent parameters θ given data D.
2. This is done by positing a variational distribution qλ(θ) and optimizing its parameters λ to minimize the KL divergence between qλ(θ) and p(θ|D).
3. The evidence lower bound (ELBO) is used as a variational objective that can be optimized using stochastic gradient descent, with gradients estimated using Monte Carlo sampling and reparametrization.
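The three steps can be sketched end to end on a toy problem where the posterior is itself Gaussian. Everything below is a hypothetical setup chosen only to exhibit the reparametrized ELBO gradient: the target posterior N(2, 1), the variational family q = N(mu, sigma^2), and the learning rate and sample count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(theta):
    """d/dtheta of log N(theta; 2, 1) -- stands in for the true posterior score."""
    return -(theta - 2.0)

mu, log_sigma = 0.0, 0.0          # variational parameters lambda = (mu, log sigma)
lr, n_samples = 0.05, 32
for _ in range(2000):
    eps = rng.normal(size=n_samples)
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps                   # reparametrization trick
    g = grad_log_p(theta)
    grad_mu = g.mean()                         # pathwise Monte Carlo gradient wrt mu
    grad_ls = (g * sigma * eps).mean() + 1.0   # pathwise term + d(entropy)/d(log sigma)
    mu += lr * grad_mu                         # stochastic gradient ASCENT on the ELBO
    log_sigma += lr * grad_ls

print(round(mu, 1), round(np.exp(log_sigma), 1))  # should approach the true (2, 1)
```

Because the target is Gaussian, the ELBO optimum recovers the posterior exactly, so the fitted (mu, sigma) drifting to roughly (2, 1) confirms the estimator is unbiased.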
The document discusses approximation algorithms for NP-hard problems. It begins with an introduction that defines approximation algorithms as algorithms that find feasible but not necessarily optimal solutions to optimization problems in polynomial time.
It then discusses different types of approximation schemes: absolute approximation, where the approximate solution is within an additive constant of optimal; epsilon (ε)-approximation, where the relative error of the approximate solution is at most ε; and polynomial time approximation schemes, which run in polynomial time for any fixed ε.
The document provides examples of problems that admit absolute approximation algorithms, such as planar graph coloring and the maximum programs stored on disks problem. It also discusses Graham's theorem, which proves that the longest processing time (LPT) scheduling rule generates schedules whose finish time is at most (4/3 - 1/(3m)) times optimal, i.e. within one-third of the optimum.
1. Recurrent neural networks (RNNs) allow information to persist from previous time steps through hidden states and can process input sequences of variable lengths. Common RNN architectures include LSTMs and GRUs which address the vanishing gradient problem of traditional RNNs.
2. RNNs are commonly used for natural language processing tasks like machine translation, sentiment classification, and named entity recognition. They learn distributed word representations through techniques like word2vec, GloVe, and negative sampling.
3. Machine translation models use an encoder-decoder architecture with an RNN encoder and decoder. Beam search is commonly used to find high-probability translation sequences. Performance is evaluated using metrics like BLEU score.
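The beam search mentioned above can be sketched over a toy next-token model. The probability table and beam width below are invented purely for illustration; a real decoder would score tokens with the RNN instead.

```python
import math

def next_token_probs(prefix):
    """A hypothetical hand-written next-token distribution (stands in for a decoder)."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
        ("a",): {"dog": 0.7, "<eos>": 0.3},
    }
    return table.get(prefix, {"<eos>": 1.0})

def beam_search(beam_width=2, max_len=3):
    beams = [((), 0.0)]                     # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))   # finished hypothesis: keep as-is
                continue
            for tok, p in next_token_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # keep only the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(beam_search())  # highest-probability sequence under the toy model
```

With width 2 the search keeps both "the …" and "a …" alive after the first step, illustrating how beam search hedges against greedy mistakes.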
The document discusses approximation algorithms for solving hard combinatorial optimization problems. It defines optimization problems and covers NP-hard problems like the clique, independent set, vertex cover, and traveling salesman problems. Approaches for solving NP-hard problems include exact algorithms, approximation algorithms that provide guaranteed good solutions, and heuristics without guarantees. Approximation algorithms aim to settle for good enough solutions rather than optimal ones.
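As a concrete instance of an approximation algorithm with a guarantee, here is a minimal sketch of the classic 2-approximation for vertex cover (the example graph is arbitrary, chosen here for illustration): repeatedly pick an uncovered edge and add both endpoints. Any optimal cover must contain at least one endpoint of each picked edge, so the result is at most twice optimal.

```python
def vertex_cover_2approx(edges):
    """Greedy matching-based 2-approximation for minimum vertex cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # edge still uncovered
            cover.update((u, v))               # take BOTH endpoints
    return cover

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
cover = vertex_cover_2approx(edges)
print(all(u in cover or v in cover for u, v in edges))  # True: every edge is covered
```

The picked edges form a matching, and any cover needs one vertex per matching edge, which is exactly where the factor-2 bound comes from.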
The presentation is an introduction to decision making with approximate Bayesian Methods. It consists of a review of Bayesian Decision Theory and Variational Inference along with a description of Loss Calibrated Variational Inference.
ICCV2009: MAP Inference in Discrete Models: Part 2
The course program includes sessions on discrete models in computer vision, message passing algorithms like dynamic programming and tree-reweighted message passing, quadratic pseudo-boolean optimization, transformation and move-making methods, speed and efficiency of algorithms, and a comparison of inference methods. Recent advances like dual decomposition and higher-order models will also be discussed. All materials from the tutorial will be made available online after the conference.
1) Recent advances in object recognition focused on detection but further improvements require better segmentation and mid-level representations to better localize objects.
2) The next challenges are in object interpretation, requiring new representations that can handle unfamiliar objects, physical context, and task-specific relevance.
3) Building complex, multi-faceted representations of objects that describe categories, pose, materials, characteristics, and relationships will be needed to make progress on object interpretation.
Cvpr2010 open source vision software, intro and training part iv generic imag...
This document discusses the Generic Image Library (GIL), which is part of the Boost C++ libraries. GIL abstracts image representations and algorithms, allowing code to work with any image format. It introduces key GIL concepts like image views, iterators, and locators that provide uniform access to image pixels. Examples are given of using GIL to write algorithms that compute the horizontal and vertical gradients of a grayscale image in a format-independent way.
CVPR2010: Advanced ITinCVPR in a Nutshell: part 5: Shape, Matching and Diverg...
The document discusses using divergence measures like the Jensen-Shannon divergence to align multiple point sets represented as probability density functions. It motivates using the JS divergence by modeling point sets as mixtures of density functions, and shows how the likelihood ratio between models leads to the JS divergence. It then formulates the problem of group-wise point set registration as minimizing the JS divergence between density functions, combined with a regularization term. Experimental results on aligning multiple 3D hippocampus point sets are also presented.
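For discrete distributions, the Jensen-Shannon divergence used above is straightforward to compute as a symmetrized KL divergence against the mixture. A minimal sketch (the two example distributions are arbitrary):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)                       # the mixture midpoint

    def kl(a, b):
        mask = a > 0                        # 0 * log 0 contributes nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]
d = js_divergence(p, q)
print(0.0 <= d <= np.log(2))  # JS divergence is symmetric and bounded by log 2
```

Unlike plain KL, the JS divergence stays finite even when the supports differ, which is one reason it is attractive for aligning point-set density models.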
ICCV2009: MAP Inference in Discrete Models: Part 6: Recent Advances in Convex...
The document summarizes recent advances in convex relaxations for MAP inference in discrete models. It discusses the linear programming relaxation and how the primal formulation is useful for analysis. It then describes randomized rounding and move-making schemes for obtaining integer solutions from the fractional LP solution. Finally, it discusses going beyond the LP relaxation by incorporating cycle inequalities to handle cases where the LP is not tight.
This document provides an introduction to a tutorial on structured prediction and learning in computer vision. The schedule outlines topics that will be covered including graphical models, probabilistic inference, conditional random fields, structured support vector machines, and structured prediction and energy minimization. The tutorial material is also available in published book form. The introduction defines structured output learning as predicting complex structured outputs from inputs, unlike normal machine learning which predicts a single number. Examples of structured data and structured output prediction from various domains are given.
CVPR2010: higher order models in computer vision: Part 4
The schedule included an introduction from 8:30 to 9:00, a discussion of models with small cliques and special potentials from 9:00 to 10:00, a tea break from 10:00 to 10:30, a discussion of inference techniques like LP relaxation, Lagrangian relaxation and dual decomposition from 10:30 to 12:00, and a discussion of models with global potentials and parameters from 12:00 to 12:30. The document concluded by outlining some open questions for future work, including exploiting new ideas in applications, exploring other higher-order cliques, comparing different inference techniques, and learning higher-order random field models.
The document discusses challenges in computer vision presented by Pietro Perona at the NSF Frontiers in Vision Workshop. It covers topics like scene understanding, visual recognition of materials, objects, actions and scenes, recognizing geometry and materials, hierarchical representation of behavior, and grand challenges like building a visual encyclopedia called Visipedia and enabling autonomous driving. The goal is to move beyond just describing visual data to understanding intentions, plans, causes and consequences to achieve true scene understanding and vision-enabled actions.
The document summarizes research on hierarchical models for object recognition, including:
- Hierarchies are inspired by the primate visual system which uses hierarchies of features of increasing complexity.
- Convolutional neural networks and the Neocognitron model use hierarchical architectures with layers of feature extraction.
- Learning hierarchical compositional representations allows constructing objects from reusable parts.
- Identifying images from brain activity showed it is possible to predict fMRI activity and identify images based on a voxel activity model.
Principal component analysis and matrix factorizations for learning (part 2) ...
1) Spectral clustering is a technique for clustering data based on the eigenvectors of the similarity matrix of the data. 2) It works by computing the generalized eigenvectors of the normalized graph Laplacian matrix, which leads to a low-dimensional embedding of the data that can then be clustered using k-means. 3) Spectral clustering is related to other graph clustering techniques like normalized cut that aim to minimize similarities between clusters while balancing cluster sizes.
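The pipeline in the summary can be sketched in a few lines of NumPy. The two synthetic blobs, the Gaussian similarity, and the sign-split on the second eigenvector (which plays the role of k-means on a 1-D embedding) are illustrative choices made here, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 20 points each (synthetic data).
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])

# Gaussian similarity matrix and normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)
d = W.sum(1)
L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))

# The eigenvector for the second-smallest eigenvalue embeds the points in 1-D;
# splitting at zero then separates the two clusters.
vals, vecs = np.linalg.eigh(L)       # eigh returns eigenvalues in ascending order
labels = (vecs[:, 1] > 0).astype(int)
print(labels)
```

For well-separated clusters the similarity matrix is nearly block-diagonal, so this eigenvector is close to a (signed) cluster indicator, which is why the zero threshold suffices here.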
CVPR2010: Semi-supervised Learning in Vision: Part 3: Algorithms and Applicat...
The document summarizes research on semi-supervised learning techniques in computer vision, including SemiBoost. SemiBoost is an algorithm that uses a small amount of labeled data and a large amount of unlabeled data to train classifiers. It works by iteratively computing pseudo-labels and weights for unlabeled data based on a similarity measure, then retraining a weak learner. The document discusses extensions of SemiBoost, including learning distance functions from labeled data to define similarities, reusing priors from previous classifiers, and applications to tasks like car detection.
Principal component analysis and matrix factorizations for learning (part 3) ...
Nonnegative matrix factorization (NMF) decomposes a data matrix into two nonnegative matrices and can be shown to be equivalent to k-means clustering and spectral clustering. NMF provides parts-based representations of data and soft clustering. Experiments on text datasets demonstrated NMF achieved better clustering accuracy than k-means.
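The factorization itself can be sketched with the standard Lee-Seung multiplicative updates, which are one common way to fit NMF (not necessarily the method in the slides). The random data matrix and rank below are illustrative; the point is that the updates keep both factors nonnegative while the reconstruction error shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((12, 8))                    # nonnegative data matrix (random, for illustration)
k = 3                                      # factorization rank
W, H = rng.random((12, k)), rng.random((k, 8))

# Lee-Seung multiplicative updates for the Frobenius objective ||V - WH||^2.
# Each update multiplies by a nonnegative ratio, so W and H never go negative.
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.2f}")
```

The rows of `H` act as nonnegative "parts", and the rows of `W` give the soft cluster memberships that the summary refers to.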
A general survey of previous works on action recognition
1) The document summarizes several papers on action recognition from computer vision and pattern recognition conferences.
2) Key topics discussed include statistical analysis of dynamic actions using space-time gradients, space-time interest points for action detection, unsupervised learning of actions using spatial-temporal words, and recognizing actions in movies and medium resolution videos.
3) New datasets introduced include the KTH action dataset with videos of common human actions like walking and running, as well as datasets of actions extracted from real movies and videos.
ECCV2010: distance function and metric learning part 2
The document summarizes Brian Kulis's tutorial on distance functions and metric learning. It introduces the concept of Mahalanobis distances and how they can be used for metric learning by parametrizing the distance function with a positive semi-definite matrix. It provides examples of how metric learning can transform the space to better suit tasks like classification. It also outlines formulations for metric learning problems and algorithms like Xing's MMC and Schultz and Joachims' relative distance approach.
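The "transform the space" view of Mahalanobis metric learning follows from a one-line factorization: any PSD matrix A factors as A = Lᵀ L, so the Mahalanobis distance under A equals the Euclidean distance after mapping x → Lx. A minimal numeric check (the matrix A here is a hypothetical "learned" metric, chosen only for illustration):

```python
import numpy as np

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

A = np.array([[2.0, 0.0],          # hypothetical learned PSD metric
              [0.0, 0.5]])
L = np.linalg.cholesky(A).T        # A = L^T L
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])

same = np.isclose(mahalanobis(x, y, A), np.linalg.norm(L @ x - L @ y))
print(same)  # True: metric learning == learning a linear transform of the space
```

This equivalence is exactly what lets formulations like MMC optimize over A while interpreting the result as a task-adapted embedding.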
Cvpr2010 open source vision software, intro and training part vii point cloud...
The document provides an introduction and overview of point cloud data and the Point Cloud Library (PCL). It discusses point clouds, how they are acquired from sensors and represented as data structures. It covers common file formats for storing point clouds like ROS messages and PCD files. The document introduces PCL as a C++ library for processing point clouds, describing its modular structure and capabilities for features, surface reconstruction, filtering, I/O and segmentation.
The document discusses recent advances in computer vision that go beyond object recognition to incorporate reasoning. It presents examples where accounting for contextual and causal cues is needed to recognize complex or fine-grained concepts. Recent works have used probabilistic first-order logic and graphical models to represent relational knowledge and enable reasoning about causality. Open issues remain around how to evaluate reasoning capabilities and characterize the representational power, learnability, and performance of these approaches.
This document provides an overview of key algorithm analysis concepts including:
- Common algorithmic techniques like divide-and-conquer, dynamic programming, and greedy algorithms.
- Data structures like heaps, graphs, and trees.
- Analyzing the time efficiency of recursive and non-recursive algorithms using orders of growth, recurrence relations, and the master theorem.
- Examples of specific algorithms that use techniques like divide-and-conquer, decrease-and-conquer, dynamic programming, and greedy strategies.
- Complexity classes like P, NP, and NP-complete problems.
This document discusses analysis of algorithms and asymptotic notation. It introduces key concepts like worst case running time, pseudo-code, experimental studies and their limitations, asymptotic notation like Big-O, and using asymptotic analysis to evaluate an algorithm's efficiency based on input size independently of hardware. Examples are provided to illustrate counting primitive operations and analyzing algorithms' asymptotic running times as O(log n), O(n), O(n^2) etc.
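Counting primitive operations, as described above, can be made concrete with a short experiment (the pairwise loop is an example chosen here): doubling the input size roughly quadruples the operation count, which is the empirical signature of O(n^2).

```python
def count_pairs(n):
    """Count the comparisons made by a brute-force all-pairs loop."""
    ops = 0
    for i in range(n):
        for j in range(i + 1, n):
            ops += 1          # one primitive operation per pair
    return ops

# n(n-1)/2 pairs: 100 -> 4950, 200 -> 19900, a ratio of ~4 when n doubles.
print(count_pairs(100), count_pairs(200))  # 4950 19900
```

Reading the count as a function of n, `count_pairs(n) = n(n-1)/2`, whose leading term n²/2 places the loop in O(n²) regardless of hardware, exactly the point of asymptotic analysis.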
The document discusses machine learning and provides an itinerary for a tour through some favorite machine learning results, directions, and open problems. The itinerary includes stops to discuss batch learning algorithms and sample complexity, online learning, strong complexity results using SQ and Fourier analysis, and current practical issues in machine learning. The tour guide aims to emphasize connections between machine learning and other areas of theory of computation.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
Innovations in technology have revolutionized financial services to such an extent that large financial institutions like Goldman Sachs claim to be technology companies! It is no secret that technological innovations like data science and AI are fundamentally changing how financial products are created, tested and delivered. While it is exciting to learn about the technologies themselves, there is very little guidance available on how companies and financial professionals should retool and gear themselves for the upcoming revolution.
In this master class, we will discuss key innovations in Data Science and AI and connect applications of these novel fields in forecasting and optimization. Through case studies and examples, we will demonstrate why now is the time you should invest to learn about the topics that will reshape the financial services industry of the future!
Topic
- Frontier topics in Optimization
This document introduces feature selection in data mining and knowledge discovery. It discusses various feature selection methods like filters, wrappers and embedded methods. It presents variable selection in regression using penalized least squares with l0, l1 and l2 penalties. Maximizing penalized likelihood can also lead to sparse solutions and variable selection. The document outlines desired properties for penalty functions and discusses applications in taxonomy, financial engineering and challenges in high dimensional feature selection.
This document provides an introduction to deep learning. It discusses sources for further reading on deep learning. It then summarizes key topics that will be covered in Unit 1, including the relationship between AI, machine learning, and deep learning. It provides definitions and examples of neural networks, gradient descent optimization, and stochastic gradient descent. It also discusses neurons, the history of the perceptron, and the basic components of artificial neural networks.
The presentation covered time and space complexity, average and worst case analysis, and asymptotic notations. It defined the key concepts: time complexity measures the number of operations, space complexity measures memory usage, and worst case analysis provides an upper bound on running time. Common asymptotic notations like Big-O, Omega, and Theta were explained, along with how they are used to compare how functions grow relative to each other as input size increases.
This talk was based on my Master's thesis, which I had completed earlier that year. It gives an overview of how certain dynamic programs can be computed efficiently in parallel, and of what "efficiently" should mean in that setting.
The plots in "Performance Examples" show speedup S on the left and efficiency E on the right, both against input size.
Read more over here: http://reitzig.github.io/publications/Reitzig2012
Big Data analysis involves building predictive models from high-dimensional data using techniques like variable selection, cross-validation, and regularization to avoid overfitting. The document discusses an example analyzing web browsing data to predict online spending, highlighting challenges with large numbers of variables. It also covers summarizing high-dimensional data through dimension reduction and model building for prediction versus causal inference.
The document discusses machine learning models that learn when to give up, i.e. to reject classifying an example. It proposes three main approaches: 1) using calibrated confidence estimates to account for natural confusability in the training data, 2) handling input noise with a secondary confidence predictor, and 3) making classifiers aware of which examples are unknown relative to the training data. The focus is on classification problems, where rejecting allows transferring the example to a human for assistance. The goal is to minimize rejections while maintaining a fixed error rate.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, recursion, stacks and common stack operations like push and pop. Examples are provided to illustrate factorial calculation using recursion and implementation of a stack.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm design, analysis of time and space complexity, and recursion. It provides examples of algorithms and data structures like arrays, stacks and the factorial function to illustrate recursive and iterative implementations. Problem solving techniques like defining the problem, designing algorithms, analyzing and testing solutions are also covered.
The document discusses data structures and algorithms. It defines key concepts like primitive data types, data structures, static vs dynamic structures, abstract data types, algorithm analysis including time and space complexity, and common algorithm design techniques like recursion. It provides examples of algorithms and data structures like stacks and using recursion to calculate factorials. The document covers fundamental topics in data structures and algorithms.
This document discusses data structures and algorithms. It begins by defining data structures as the logical organization of data and primitive data types like integers that hold single pieces of data. It then discusses static versus dynamic data structures and abstract data types. The document outlines the main steps in problem solving as defining the problem, designing algorithms, analyzing algorithms, implementing, testing, and maintaining solutions. It provides examples of space and time complexity analysis and discusses analyzing recursive algorithms through repeated substitution and telescoping methods.
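Two of the recurring examples in the summaries above — a stack abstract data type and factorial via recursion — fit in a few lines. A minimal sketch (built on a Python list; the ADT interface shown is a common convention, not necessarily the one in these documents):

```python
class Stack:
    """A stack ADT backed by a Python list (LIFO discipline)."""
    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)          # O(1) amortized

    def pop(self):
        return self._items.pop()       # removes the most recent push

    def is_empty(self):
        return not self._items

def factorial(n):
    """Recursive factorial: T(n) = T(n-1) + O(1), i.e. O(n) time and O(n) stack depth."""
    return 1 if n <= 1 else n * factorial(n - 1)

s = Stack()
s.push(1)
s.push(2)
print(s.pop(), factorial(5))  # 2 120
```

The recurrence in the comment is the kind that repeated substitution (telescoping) solves directly: expanding T(n) = T(n-1) + c down to T(1) yields c(n-1) + T(1), hence O(n).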
Learn from Example and Learn Probabilistic Model
This document summarizes machine learning techniques including learning from examples, probabilistic modeling, and the EM algorithm. It covers nonparametric models, ensemble learning, statistical learning, maximum likelihood parameter estimation, density estimation, Bayesian parameter learning, and clustering with mixtures of Gaussians. The key points are that Bayesian learning calculates hypothesis probabilities given data, predictions average individual hypothesis predictions, and the EM algorithm alternates between expectation and maximization steps to handle hidden variables.
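The EM alternation for a mixture of Gaussians can be sketched in 1-D. The synthetic data (two components with true means 0 and 5), the deliberately poor initialization, and the iteration count are all illustrative choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians with means 0 and 5.
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])                    # deliberately poor initial means
pi, sigma = np.array([0.5, 0.5]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i) under current params.
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    r = pi * dens
    r /= r.sum(1, keepdims=True)
    # M-step: re-estimate parameters as responsibility-weighted statistics.
    nk = r.sum(0)
    mu = (r * x[:, None]).sum(0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk)
    pi = nk / len(x)

print(np.round(np.sort(mu), 1))  # should land close to the true means [0, 5]
```

The E-step is exactly the "expectation" over the hidden component assignments, and the M-step is the maximum likelihood update given those soft assignments, matching the alternation described in the summary.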
Mylyn helps address information overload and context loss when multi-tasking. It integrates tasks into the IDE workflow and uses a degree-of-interest model to monitor user interaction and provide a task-focused UI with features like view filtering, element decoration, automatic folding and content assist ranking. This creates a single view of all tasks that are centrally managed within the IDE.
This document provides an overview of OpenCV, an open source computer vision and machine learning software library. It discusses OpenCV's core functionality for representing images as matrices and directly accessing pixel data. It also covers topics like camera calibration, feature point extraction and matching, and estimating camera pose through techniques like structure from motion and planar homography. Hints are provided for Android developers on required permissions and for planar homography estimation using additional constraints rather than OpenCV's general homography function.
This document provides information about the Computer Vision Laboratory 2012 course at the Institute of Visual Computing. The course focuses on computer vision on mobile devices and will involve 180 hours of project work per person. Students will work in groups of 1-2 people on topics like 3D reconstruction from silhouettes or stereo images on mobile devices. Key dates are provided for submitting a work plan, mid-term presentation, and final report. Contact information is given for the lecturers and teaching assistant.
This document summarizes a presentation on natural image statistics given by Siwei Lyu at the 2009 CIFAR NCAP Summer School. The presentation covered several key topics:
1) It discussed the motivation for studying natural image statistics, which is to understand representations in the visual system and develop computer vision applications like denoising.
2) It reviewed common statistical properties found in natural images like 1/f power spectra and non-Gaussian distributions.
3) Maximum entropy and Bayesian models were presented as approaches to model these statistics, with Gaussian and independent component analysis discussed as specific examples.
4) Efficient coding principles from information theory were introduced as a framework for understanding neural representations that aim to decorrelate and reduce redundancy in sensory inputs.
Camera calibration involves determining the internal camera parameters like focal length, image center, distortion, and scaling factors that affect the imaging process. These parameters are important for applications like 3D reconstruction and robotics that require understanding the relationship between 3D world points and their 2D projections in an image. The document describes estimating internal parameters by taking images of a calibration target with known geometry and solving the equations that relate the 3D target points to their 2D image locations. Homogeneous coordinates and projection matrices are used to represent the calibration transformations mathematically.
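The projection pipeline described above reduces to one matrix product in homogeneous coordinates: x ~ K [R | t] X, followed by division by depth. A minimal sketch (the intrinsic matrix K with focal lengths 800 and principal point (320, 240), and the identity pose, are hypothetical values for illustration):

```python
import numpy as np

# Hypothetical intrinsic matrix: focal lengths fx = fy = 800, center (cx, cy) = (320, 240).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
# Extrinsics [R | t]: camera at the world origin, looking down +z.
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])

X = np.array([0.1, -0.05, 2.0, 1.0])   # a 3-D world point in homogeneous form
x = K @ Rt @ X                          # homogeneous image coordinates
u, v = x[:2] / x[2]                     # perspective divide by depth -> pixels
print(u, v)  # 360.0 220.0
```

Calibration runs this map in reverse: given many known (X, (u, v)) pairs from a target, it solves for the entries of K (and the distortion terms the simple model above omits).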
Brunelli 2008: template matching techniques in computer vision
The document discusses template matching techniques in computer vision. It begins with an overview that defines template matching and discusses some common computer vision tasks it can be used for, like object detection. It then covers topics like detection as hypothesis testing, training and testing techniques, and provides a bibliography.
The HARVEST Programme evaluates feature detectors and descriptors through indirect and direct benchmarks. Indirect benchmarks measure repeatability and matching scores on the affine covariant testbed to evaluate how features persist across transformations. Direct benchmarks evaluate features on image retrieval tasks using the Oxford 5k dataset to measure real-world performance. VLBenchmarks provides software for easily running these benchmarks and reproducing published results. It allows comparing features and selecting the best for a given application.
This document summarizes VLFeat, an open source computer vision library. It provides concise summaries of VLFeat's features, including SIFT, MSER, and other covariant detectors. It also compares VLFeat's performance to other libraries like OpenCV. The document highlights how VLFeat achieves state-of-the-art results in tasks like feature detection, description and matching while maintaining a simple MATLAB interface.
This document summarizes and compares local image descriptors. It begins with an introduction to modern descriptors like SIFT, SURF and DAISY. It then discusses efficient descriptors such as binary descriptors like BRIEF, ORB and BRISK which use comparisons of intensity value pairs. The document concludes with an overview section.
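The binary descriptors mentioned above are compared with the Hamming distance — the number of differing bits — which is why they are cheap to match: it is a single XOR plus a popcount. A toy sketch with invented 8-bit descriptors (real BRIEF/ORB descriptors are 256 bits):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as integers."""
    return bin(a ^ b).count("1")   # XOR leaves 1s exactly where the bits differ

d1 = 0b1011_0010   # two toy 8-bit descriptors (hypothetical values)
d2 = 0b1001_0110
print(hamming(d1, d2))  # 2
```

Matching then reduces to nearest-neighbor search under this distance, which hardware popcount instructions make far faster than the floating-point L2 comparisons SIFT-style descriptors need.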
This document discusses various feature detectors used in computer vision. It begins by describing classic detectors such as the Harris detector and Hessian detector that search scale space to find distinguished locations. It then discusses detecting features at multiple scales using the Laplacian of Gaussian and determinant of Hessian. The document also covers affine covariant detectors such as maximally stable extremal regions and affine shape adaptation. It discusses approaches for speeding up detection using approximations like those in SURF and learning to emulate detectors. Finally, it outlines new developments in feature detection.
The document discusses modern feature detection techniques. It provides an introduction and agenda for a talk on advances in feature detectors and descriptors, including improvements since a 2005 paper. It also discusses software suites and benchmarks for feature detection. Several application domains are described, such as wide baseline matching, panoramic image stitching, 3D reconstruction, image search, location recognition, and object tracking.
System 1 and System 2 were basic early systems for image matching that used color and texture matching. Descriptor-based approaches like SIFT provided more invariance but not perfect invariance. Patch descriptors like SIFT were improved by making them more invariant to lighting changes like color and illumination shifts. The best performance came from combining descriptors with color invariance. Representing images as histograms of visual word occurrences captured patterns in local image patches and allowed measuring similarity between images. Large vocabularies of visual words provided more discriminative power but were costly to compute and store.
This document summarizes a research paper on internet video search. It discusses several key challenges: [1] the large variation in how the same thing can appear in images/videos due to lighting, viewpoint etc., [2] defining what defines different objects, and [3] the huge number of different things that exist. It also notes gaps in narrative understanding, shared concepts between humans and machines, and addressing diverse query contexts. The document advocates developing powerful yet simple visual features that capture uniqueness with invariance to irrelevant changes.
The document discusses computer vision techniques for object detection and localization. It describes methods like selective search that group image regions hierarchically to propose object locations. Large datasets like ImageNet and LabelMe that provide training examples are also discussed. Performance on object detection benchmarks like PASCAL VOC is shown to improve significantly over time. Evaluation standards for concept detection like those used in TRECVID are presented. The document concludes that results are impressively improving each year but that the number of detectable concepts remains limited. It also discusses making feature extraction more efficient using techniques like SURF that take advantage of integral images.
This document provides an outline and overview of Yoshua Bengio's 2012 tutorial on representation learning. The key points covered include:
1) The tutorial will cover motivations for representation learning, algorithms such as probabilistic models and auto-encoders, and analysis and practical issues.
2) Representation learning aims to automatically learn good representations of data rather than relying on handcrafted features. Learning representations can help address challenges like exploiting unlabeled data and the curse of dimensionality.
3) Deep learning algorithms attempt to learn multiple levels of increasingly complex representations, with the goal of developing more abstract, disentangled representations that generalize beyond local patterns in the data.
Advances in discrete energy minimisation for computer vision
This document discusses string algorithms and data structures. It introduces the Knuth-Morris-Pratt algorithm for finding patterns in strings in O(n+m) time where n is the length of the text and m is the length of the pattern. It also discusses common string data structures like tries, suffix trees, and suffix arrays. Suffix trees and suffix arrays store all suffixes of a string and support efficient pattern matching and other string operations in linear time or O(m+logn) time where m is the pattern length and n is the text length.
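The O(n + m) bound for Knuth-Morris-Pratt comes from the failure function, which lets the scan fall back inside the pattern instead of re-reading text characters. A compact sketch of the standard algorithm:

```python
def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text, in O(n + m)."""
    # Failure function: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, using fail[] to shift the pattern on a mismatch.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)   # full match ending at i
            k = fail[k - 1]          # keep going to find overlapping matches
    return hits

print(kmp_search("abababca", "abab"))  # [0, 2] -- overlapping matches are found
```

Each text character is examined once and `k` only decreases by following `fail`, which it can do at most as many times as it was incremented, giving the linear bound.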
This document provides a tutorial on how to use Gephi software to analyze and visualize network graphs. It outlines the basic steps of importing a sample graph file, applying layout algorithms to organize the nodes, calculating metrics, detecting communities, filtering the graph, and exporting/saving the results. The tutorial demonstrates features of Gephi including node ranking, partitioning, and interactive visualization of the graph.
EM algorithm and its application in probabilistic latent semantic analysiszukun
The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
This document describes an efficient framework for part-based object recognition using pictorial structures. The framework represents objects as graphs of parts with spatial relationships. It finds the optimal configuration of parts through global minimization using distance transforms, allowing fast computation despite modeling complex spatial relationships between parts. This enables soft detection to handle partial occlusion without early decisions about part locations.
Iccv2011 learning spatiotemporal graphs of human activities zukun
The document presents a new approach for learning spatiotemporal graphs of human activities from weakly supervised video data. The approach uses 2D+t tubes as mid-level features to represent activities as segmentation graphs, with nodes describing tubes and edges describing various relations. A probabilistic graph mixture model is used to model activities, and learning estimates the model parameters and permutation matrices using a structural EM algorithm. The learned models allow recognizing and segmenting activities in new videos through robust least squares inference. Evaluation on benchmark datasets demonstrates the ability to learn characteristic parts of activities and recognize them under weak supervision.
Temple of Asclepius in Thrace. Excavation resultsKrassimira Luka
The temple and the sanctuary around were dedicated to Asklepios Zmidrenus. This name has been known since 1875 when an inscription dedicated to him was discovered in Rome. The inscription is dated in 227 AD and was left by soldiers originating from the city of Philippopolis (modern Plovdiv).
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...EduSkills OECD
Andreas Schleicher, Director of Education and Skills at the OECD presents at the launch of PISA 2022 Volume III - Creative Minds, Creative Schools on 18 June 2024.
This presentation was provided by Rebecca Benner, Ph.D., of the American Society of Anesthesiologists, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
2. Why Large-scale Datasets?
• Data Mining
  Gain competitive advantages by analyzing data that describes the life of our computerized society.
• Artificial Intelligence
  Emulate cognitive capabilities of humans.
  Humans learn from abundant and diverse data.
3. The Computerized Society Metaphor
• A society with just two kinds of computers:
  – Makers do business and generate revenue. They also produce data in proportion with their activity.
  – Thinkers analyze the data to increase revenue by finding competitive advantages.
• When the population of computers grows:
  – The ratio #Thinkers/#Makers must remain bounded.
  – The Data grows with the number of Makers.
  – The number of Thinkers does not grow faster than the Data.
4. Limited Computing Resources
• The computing resources available for learning do not grow faster than the volume of data.
  – The cost of data mining cannot exceed the revenues.
  – Intelligent animals learn from streaming data.
• Most machine learning algorithms demand resources that grow faster than the volume of data.
  – Matrix operations (n³ time for n² coefficients).
  – Sparse matrix operations are worse.
5. Roadmap
I. Statistical Efficiency versus Computational Cost.
II. Stochastic Algorithms.
III. Learning with a Single Pass over the Examples.
6. Part I
Statistical Efficiency versus
Computational Costs.
This part is based on a joint work with Olivier Bousquet.
7. Simple Analysis
• Statistical Learning Literature:
  “It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases.”
• Optimization Literature:
  “To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong asymptotic properties, e.g. superlinear.”
• Therefore:
  “To address large-scale learning problems, use a superlinear algorithm to optimize an objective function with fast estimation rate. Problem solved.”
The purpose of this presentation is. . .
8. Too Simple an Analysis
• Statistical Learning Literature:
  “It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases.”
• Optimization Literature:
  “To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong asymptotic properties, e.g. superlinear.”
• Therefore: (error)
  “To address large-scale learning problems, use a superlinear algorithm to optimize an objective function with fast estimation rate. Problem solved.”
. . . to show that this is completely wrong!
9. Objectives and Essential Remarks
• Baseline large-scale learning algorithm
Randomly discarding data is the simplest
way to handle large datasets.
– What are the statistical benefits of processing more data?
– What is the computational cost of processing more data?
• We need a theory that joins Statistics and Computation!
– 1967: Vapnik’s theory does not discuss computation.
– 1981: Valiant’s learnability excludes exponential time algorithms,
but (i) polynomial time can be too slow, (ii) few actual results.
– We propose a simple analysis of approximate optimization. . .
10. Learning Algorithms: Standard Framework
• Assumption: examples are drawn independently from an unknown probability distribution P(x, y) that represents the rules of Nature.
• Expected Risk: E(f) = ∫ ℓ(f(x), y) dP(x, y).
• Empirical Risk: En(f) = (1/n) Σi ℓ(f(xi), yi).
• We would like f∗ that minimizes E(f) among all functions.
• In general f∗ ∉ F.
• The best we can have is f∗F ∈ F that minimizes E(f) inside F.
• But P(x, y) is unknown by definition.
• Instead we compute fn ∈ F that minimizes En(f).
  Vapnik-Chervonenkis theory tells us when this can work.
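The framework above can be made concrete with a minimal sketch (toy data and a hypothetical family F of one-dimensional threshold classifiers; none of this comes from the tutorial itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data drawn i.i.d. from an unknown P(x, y): y = sign(x), with 10% label noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sign(x) * np.where(rng.random(200) < 0.9, 1, -1)

def empirical_risk(threshold):
    """En(f) under the 0-1 loss for the classifier f(x) = sign(x - threshold)."""
    return np.mean(np.sign(x - threshold) != y)

# F = a small family of threshold classifiers; fn minimizes En over F.
F = np.linspace(-1.0, 1.0, 41)
f_n = min(F, key=empirical_risk)
```

Here fn approaches f∗F (threshold near 0) as n grows, but its quality is always measured against the unknown expected risk E(f), not the empirical risk it actually minimizes.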
11. Learning with Approximate Optimization
Computing fn = arg minf∈F En(f) is often costly.
Since we already make lots of approximations, why should we compute fn exactly?
Let’s assume our optimizer returns f̃n such that En(f̃n) < En(fn) + ρ.
For instance, one could stop an iterative optimization algorithm long before its convergence.
12. Decomposition of the Error (i)
E(f̃n) − E(f∗) = E(f∗F) − E(f∗)    Approximation error
              + E(fn) − E(f∗F)    Estimation error
              + E(f̃n) − E(fn)    Optimization error
Problem:
Choose F, n, and ρ to make this as small as possible, subject to budget constraints:
  – maximal number of examples n
  – maximal computing time T
13. Decomposition of the Error (ii)
Approximation error bound: (Approximation theory)
– decreases when F gets larger.
Estimation error bound: (Vapnik-Chervonenkis theory)
– decreases when n gets larger.
– increases when F gets larger.
Optimization error bound: (Vapnik-Chervonenkis theory plus tricks)
– increases with ρ.
Computing time T : (Algorithm dependent)
– decreases with ρ
– increases with n
– increases with F
14. Small-scale vs. Large-scale Learning
We can give rigorous definitions.
• Definition 1:
We have a small-scale learning problem when the active
budget constraint is the number of examples n.
• Definition 2:
We have a large-scale learning problem when the active
budget constraint is the computing time T .
15. Small-scale Learning
The active budget constraint is the number of examples.
• To reduce the estimation error, take n as large as the budget allows.
• To reduce the optimization error to zero, take ρ = 0.
• We need to adjust the size of F.
[Figure: as the size of F increases, the estimation error grows while the approximation error shrinks.]
See Structural Risk Minimization (Vapnik 74) and later works.
16. Large-scale Learning
The active budget constraint is the computing time.
• More complicated tradeoffs.
The computing time depends on the three variables: F , n, and ρ.
• Example.
If we choose ρ small, we decrease the optimization error. But we
must also decrease F and/or n with adverse effects on the estimation
and approximation errors.
• The exact tradeoff depends on the optimization algorithm.
• We can compare optimization algorithms rigorously.
17. Executive Summary
[Figure: log(ρ) versus log(T) for three optimizers; the best achievable ρ at any time T depends on the algorithm.]
• Good optimization algorithm (superlinear): ρ decreases faster than exp(−T).
• Mediocre optimization algorithm (linear): ρ decreases like exp(−T).
• Extraordinarily poor optimization algorithm: ρ decreases like 1/T.
18. Asymptotics: Estimation
Uniform convergence bounds (with capacity d + 1):
  Estimation error ≤ O( ((d/n) log(n/d))^α )  with 1/2 ≤ α ≤ 1.
There are in fact three types of bounds to consider:
  – Classical V-C bounds (pessimistic): O( √(d/n) )
  – Relative V-C bounds in the realizable case: O( (d/n) log(n/d) )
  – Localized bounds (variance, Tsybakov): O( ((d/n) log(n/d))^α )
Fast estimation rates are a big theoretical topic these days.
19. Asymptotics: Estimation+Optimization
Uniform convergence arguments give
  Estimation error + Optimization error ≤ O( ((d/n) log(n/d))^α + ρ ).
This is true for all three cases of uniform convergence bounds.
Scaling laws for ρ when F is fixed
The approximation error is constant.
  – No need to choose ρ smaller than O( ((d/n) log(n/d))^α ).
  – Not advisable to choose ρ larger than O( ((d/n) log(n/d))^α ).
20. . . . Approximation+Estimation+Optimization
When F is chosen via a λ-regularized cost
– Uniform convergence theory provides bounds for simple cases
(Massart-2000; Zhang 2005; Steinwart et al., 2004-2007; . . . )
– Computing time depends on both λ and ρ.
– Scaling laws for λ and ρ depend on the optimization algorithm.
When F is realistically complicated
Large datasets matter
– because one can use more features,
– because one can use richer models.
Bounds for such cases are rarely realistic enough.
Luckily there are interesting things to say for F fixed.
21. Case Study
Simple parametric setup
– F is fixed.
– Functions fw (x) linearly parametrized by w ∈ Rd.
Comparing four iterative optimization algorithms for En(f )
1. Gradient descent.
2. Second order gradient descent (Newton).
3. Stochastic gradient descent.
4. Stochastic second order gradient descent.
22. Quantities of Interest
• Empirical Hessian at the empirical optimum wn:
  H = (∂²En/∂w²)(fwn) = (1/n) Σi ∂²ℓ(fwn(xi), yi)/∂w²
• Empirical Fisher Information matrix at the empirical optimum wn:
  G = (1/n) Σi [ ∂ℓ(fwn(xi), yi)/∂w ] [ ∂ℓ(fwn(xi), yi)/∂w ]′
• Condition number
  We assume that there are λmin, λmax and ν such that
  – trace( G H⁻¹ ) ≈ ν.
  – spectrum( H ) ⊂ [λmin, λmax].
  and we define the condition number κ = λmax/λmin.
23. Gradient Descent (GD)
Iterate:
  • wt+1 ← wt − η ∂En(fwt)/∂w
Best speed achieved with fixed learning rate η = 1/λmax.
(e.g., Dennis & Schnabel, 1983)

  Cost per iteration:                  O(nd)
  Iterations to reach ρ:               O( κ log(1/ρ) )
  Time to reach accuracy ρ:            O( ndκ log(1/ρ) )
  Time to reach E(f̃n) − E(f∗F) < ε:    O( (d²κ/ε^(1/α)) log²(1/ε) )

– In the last column, n and ρ are chosen to reach ε as fast as possible.
– Solve for ε to find the best error rate achievable in a given time.
– Remark: abuses of the O() notation.
24. Second Order Gradient Descent (2GD)
Iterate:
  • wt+1 ← wt − H⁻¹ ∂En(fwt)/∂w
We assume H⁻¹ is known in advance.
Superlinear optimization speed (e.g., Dennis & Schnabel, 1983)

  Cost per iteration:                  O( d(d + n) )
  Iterations to reach ρ:               O( log log(1/ρ) )
  Time to reach accuracy ρ:            O( d(d + n) log log(1/ρ) )
  Time to reach E(f̃n) − E(f∗F) < ε:    O( (d²/ε^(1/α)) log(1/ε) log log(1/ε) )

– Optimization speed is much faster.
– Learning speed only saves the condition number κ.
25. Stochastic Gradient Descent (SGD)
Iterate:
  • Draw random example (xt, yt).
  • wt+1 ← wt − (η/t) ∂ℓ(fwt(xt), yt)/∂w
Best decreasing gain schedule with η = 1/λmin.
(see Murata, 1998; Bottou & LeCun, 2004)

  Cost per iteration:                  O(d)
  Iterations to reach ρ:               νk/ρ + o(1/ρ)
  Time to reach accuracy ρ:            O( dνk/ρ )
  Time to reach E(f̃n) − E(f∗F) < ε:    O( dνk/ε )
  with 1 ≤ k ≤ κ²

– Optimization speed is catastrophic.
– Learning speed does not depend on the statistical estimation rate α.
– Learning speed depends on condition number κ but scales very well.
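The SGD iteration above can be sketched in a few lines (a toy least-squares problem, not the tutorial's code; for unit-variance features the Hessian is close to the identity, so η = 1/λmin is roughly 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: per-example loss l(fw(x), y) = 0.5 * (w.x - y)^2.
n, d = 10000, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Plain SGD with decreasing gain eta/t.
eta = 1.0
w = np.zeros(d)
for t in range(1, n + 1):
    i = rng.integers(n)                  # draw a random example
    grad = (X[i] @ w - y[i]) * X[i]      # partial gradient on that example
    w -= (eta / t) * grad
```

Each iteration costs O(d), independently of n, which is the whole point of the table above.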
26. Second Order Stochastic Gradient Descent (2SGD)
Iterate:
  • Draw random example (xt, yt).
  • wt+1 ← wt − (1/t) H⁻¹ ∂ℓ(fwt(xt), yt)/∂w
Replace the scalar gain η/t by the matrix (1/t) H⁻¹.

  Cost per iteration:                  O(d²)
  Iterations to reach ρ:               ν/ρ + o(1/ρ)
  Time to reach accuracy ρ:            O( d²ν/ρ )
  Time to reach E(f̃n) − E(f∗F) < ε:    O( d²ν/ε )

– Each iteration is d times more expensive.
– The number of iterations is reduced by κ² (or less.)
– Second order only changes the constant factors.
28. Benchmarking SGD in Simple Problems
• The theory suggests that SGD is very competitive.
– Many people associate SGD with trouble.
• SGD historically associated with back-propagation.
– Multilayer networks are very hard problems (nonlinear, nonconvex)
– What is difficult, SGD or MLP?
• Try PLAIN SGD on simple learning problems.
– Support Vector Machines
– Conditional Random Fields
Download from http://leon.bottou.org/projects/sgd.
These simple programs are very short.
See also (Shalev-Shwartz et al., 2007; Vishwanathan et al., 2006)
29. Text Categorization with SVMs
• Dataset
  – Reuters RCV1 document corpus.
  – 781,265 training examples, 23,149 testing examples.
  – 47,152 TF-IDF features.
• Task
  – Recognizing documents of category CCAT.
  – Minimize En = (λ/2) ‖w‖² + (1/n) Σi ℓ( w⊤xi + b, yi ).
  – Update w ← w − ηt ∇(wt, xt, yt) = w − ηt ( λw + ∂ℓ(w⊤xt + b, yt)/∂w )
Same setup as (Shalev-Shwartz et al., 2007) but plain SGD.
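The regularized hinge-loss update above looks roughly like this sketch (synthetic sparse stand-in data instead of RCV1, and a t0 gain offset as discussed on slide 36; all numbers here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the task: mostly-zero features, labels in {-1, +1}.
n, d = 2000, 50
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.1)
w_star = rng.normal(size=d)
y = np.sign(X @ w_star + 1e-9)

lam = 1e-4
w = np.zeros(d)
b = 0.0
t0 = 1.0 / lam   # keeps the early gains moderate (an assumption)

for t in range(1, 20 * n + 1):
    i = rng.integers(n)
    eta_t = 1.0 / (lam * (t0 + t))
    margin = y[i] * (X[i] @ w + b)
    if margin < 1.0:
        # subgradient of lam/2 |w|^2 + hinge loss on example i
        w -= eta_t * (lam * w - y[i] * X[i])
        b += eta_t * y[i]
    else:
        w -= eta_t * lam * w
```

A single update touches one example, so the cost of one pass is linear in the dataset size.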
30. Text Categorization with SVMs
• Results: Linear SVM
  ℓ(ŷ, y) = max{0, 1 − yŷ},  λ = 0.0001

              Training Time   Primal cost   Test Error
  SVMLight    23,642 secs     0.2275        6.02%
  SVMPerf     66 secs         0.2278        6.03%
  SGD         1.4 secs        0.2275        6.02%

• Results: Log-Loss Classifier
  ℓ(ŷ, y) = log(1 + exp(−yŷ)),  λ = 0.00001

                         Training Time   Primal cost   Test Error
  LibLinear (ε = 0.01)   30 secs         0.18907       5.68%
  LibLinear (ε = 0.001)  44 secs         0.18890       5.70%
  SGD                    2.3 secs        0.18893       5.66%
31. The Wall
[Figure: testing cost and training time (secs) for SGD and LibLinear, plotted against optimization accuracy (trainingCost − optimalTrainingCost) from 0.1 down to 1e−09; LibLinear’s training time climbs steeply as the optimization accuracy improves, while the testing cost has long since flattened.]
32. More SVM Experiments
From: Patrick Haffner
Date: Wednesday 2007-09-05 14:28:50
. . . I have tried on some of our main datasets. . .
I can send you the example, it is so striking!
– Patrick

  Dataset      Train size   Number of   % non-0    LIBSVM    LLAMA   LLAMA    SGDSVM
                            features    features   (SDot)    SVM     MAXENT
  Reuters      781K         47K         0.1%       210,000   3930    153      7
  Translation  1000K        274K        0.0033%    days      47,700  1,105    7
  SuperTag     950K         46K         0.0066%    31,650    905     210      1
  Voicetone    579K         88K         0.019%     39,100    197     51       1
33. More SVM Experiments
From: Olivier Chapelle
Date: Sunday 2007-10-28 22:26:44
. . . you should really run batch with various training set sizes . . .
– Olivier
[Figure: average test loss on the log-loss problem versus time in seconds (0.001 to 1000, log scale). Batch conjugate gradient curves for training set sizes n = 10000, 30000, 100000, 300000, 781265 are compared with stochastic gradient on the full set.]
34. Text Chunking with CRFs
• Dataset
– CONLL 2000 Chunking Task:
Segment sentences in syntactically correlated chunks
(e.g., noun phrases, verb phrases.)
– 106,978 training segments in 8936 sentences.
– 23,852 testing segments in 2012 sentences.
• Model
– Conditional Random Field (all linear, log-loss.)
– Features are n-grams of words and part-of-speech tags.
– 1,679,700 parameters.
Same setup as (Vishwanathan et al., 2006) but plain SGD.
35. Text Chunking with CRFs
• Results

            Training Time   Primal cost   Test F1 score
  L-BFGS    4335 secs       9042          93.74%
  SGD       568 secs        9098          93.75%

• Notes
  – Computing the gradients with the chain rule runs faster than computing them with the forward-backward algorithm.
  – Graph Transformer Networks are nonlinear conditional random fields trained with stochastic gradient descent (Bottou et al., 1997).
36. Choosing the Gain Schedule
Decreasing gains: wt+1 ← wt − (η/(t + t0)) ∇(wt, xt, yt)
• Asymptotic Theory
  – if s = 2ηλmin < 1 then slow rate O( t⁻ˢ )
  – if s = 2ηλmin > 1 then faster rate O( (s²/(s−1)) t⁻¹ )
• Example: the SVM benchmark
  – Use η = 1/λ because λ ≤ λmin.
  – Choose t0 to make sure that the expected initial updates are comparable with the expected size of the weights.
• Example: the CRF benchmark
  – Use η = 1/λ again.
  – Choose t0 with the secret ingredient.
37. The Secret Ingredient for a good SGD
The sample size n does not change the SGD maths!
Constant gain: wt+1 ← wt − η ∇(wt, xt, yt)
At any moment during training, we can:
– Select a small subsample of examples.
– Try various gains η on the subsample.
– Pick the gain η that most reduces the cost.
– Use it for the next 100000 iterations on the full dataset.
• Examples
– The CRF benchmark code does this to choose t0 before training.
– We could also perform such cheap measurements every so often.
The selected gains would then decrease automatically.
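The subsample trick above can be sketched as follows (toy least-squares data and an arbitrary candidate gain list; both are illustrative assumptions, not the tutorial's CRF code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a constant-gain SGD pass over a given sequence of indices.
n, d = 2000, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0, 0.5, -0.5]) + 0.1 * rng.normal(size=n)

def cost(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def sgd_pass(w, eta, idx):
    w = w.copy()
    for i in idx:
        w -= eta * (X[i] @ w - y[i]) * X[i]
    return w

# The trick from the slide: try several gains on a small subsample,
# keep the one that most reduces the cost, then use it on the full data.
w0 = np.zeros(d)
subsample = rng.integers(n, size=100)
best_eta = min((0.003, 0.01, 0.03, 0.1, 0.5),
               key=lambda eta: cost(sgd_pass(w0, eta, subsample)))
w = sgd_pass(w0, best_eta, rng.permutation(n))
```

Because the subsample is small, this gain search costs a negligible fraction of a full training pass.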
38. Getting the Engineering Right
The very simple SGD update offers lots of engineering opportunities.
Example: Sparse Linear SVM
The update w ← w − η ( λw + ∇ℓ(wxi, yi) ) can be performed in two steps:
  i) w ← w − η ∇ℓ(wxi, yi) (sparse, cheap)
  ii) w ← w (1 − ηλ) (not sparse, costly)
• Solution 1
  Represent vector w as the product of a scalar s and a vector v.
  Perform (i) by updating v and (ii) by updating s.
• Solution 2
  Perform only step (i) for each training example.
  Perform step (ii) with lower frequency and higher gain.
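Solution 1 can be sketched as follows (the class name and the toy example values are illustrative assumptions):

```python
import numpy as np

# Represent w = s * v so that the dense shrinkage step (ii) costs O(1),
# while the sparse loss-gradient step (i) touches only the example's nonzeros.
class ScaledVector:
    def __init__(self, d):
        self.s = 1.0              # scalar factor
        self.v = np.zeros(d)      # direction vector; w = s * v

    def dot(self, idx, vals):
        # w . x for a sparse example x given as (indices, values)
        return self.s * np.dot(self.v[idx], vals)

    def shrink(self, factor):
        # step (ii): w <- factor * w, in O(1) via the scalar
        self.s *= factor

    def add_sparse(self, idx, vals, coeff):
        # step (i): w <- w + coeff * x, touching only the nonzeros of x
        self.v[idx] += (coeff / self.s) * vals

    def to_dense(self):
        return self.s * self.v

# One hinge-loss SGD update on one sparse example (toy numbers):
lam, eta = 1e-4, 0.1
w = ScaledVector(6)
idx, vals, label = np.array([0, 3]), np.array([2.0, -1.0]), 1.0
w.shrink(1.0 - eta * lam)                 # regularization step (ii)
if label * w.dot(idx, vals) < 1.0:        # hinge margin violated
    w.add_sparse(idx, vals, eta * label)  # sparse gradient step (i)
```

In a long run the scalar s shrinks steadily, so a practical implementation would occasionally fold s back into v to avoid underflow.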
39. SGD for Kernel Machines
• SGD for Linear SVM
  – Both w and ∇ℓ(wxt, yt) represented using coordinates.
  – SGD updates w by combining coordinates.
• SGD for SVM with Kernel K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩
  – Represent w with its kernel expansion Σi αi Φ(xi).
  – Usually, ∇ℓ(wxt, yt) = −µ Φ(xt).
  – SGD updates w by combining coefficients:
      αi ← (1 − ηλ) αi + { ηµ if i = t, 0 otherwise }
• So, one just needs a good sparse vector library?
40. SGD for Kernel Machines
• Sparsity Problems.
      αi ← (1 − ηλ) αi + { ηµ if i = t, 0 otherwise }
  – Each iteration potentially makes one α coefficient non zero.
  – Not all of them should be support vectors.
  – Their α coefficients take a long time to reach zero (Collobert, 2004).
• Dual algorithms related to primal SGD avoid this issue.
  – Greedy algorithms (Vincent et al., 2000; Keerthi et al., 2007)
  – LaSVM and related algorithms (Bordes et al., 2005)
  More on them later. . .
• But they still need to compute the kernel values!
  – Computing kernel values can be slow.
  – Caching kernel values can require lots of memory.
41. SGD for Real Life Applications
A Check Reader
Examples are pairs (image,amount).
Problem with strong structure:
– Field segmentation
– Character segmentation
– Character recognition
– Syntactical interpretation.
• Define differentiable modules.
• Pretrain modules with hand-labelled data.
• Define global cost function (e.g., CRF).
• Train with SGD for a few weeks.
Industrially deployed in 1996. Ran billions of checks over 10 years.
Credits: Bengio, Bottou, Burges, Haffner, LeCun, Nohl, Simard, et al.
42. Part III
Learning with a Single Pass
over the Examples
This part is based on joint works with
Antoine Bordes, Seyda Ertekin, Yann LeCun, and Jason Weston.
43. Why learning with a Single Pass?
• Motivation
– Sometimes there is too much data to store.
– Sometimes retrieving archived data is too expensive.
• Related Topics
– Streaming data.
– Tracking nonstationarities.
– Novelty detection.
• Outline
– One-pass learning with second order SGD.
– One-pass learning with kernel machines.
– Comparisons
44. Effect of one Additional Example (i)
Compare
  w∗n   = arg minw En(fw)
  w∗n+1 = arg minw En+1(fw) = arg minw [ En(fw) + (1/n) ℓ( fw(xn+1), yn+1 ) ]
[Figure: the curves En(fw) and En+1(fw) with their respective minima w∗n and w∗n+1.]
45. Effect of one Additional Example (ii)
• First Order Calculation
  w∗n+1 = w∗n − (1/n) H⁻¹n+1 ∂ℓ( fw∗n(xn+1), yn+1 )/∂w + O(1/n²)
  where Hn+1 is the empirical Hessian on n + 1 examples.
• Compare with Second Order Stochastic Gradient Descent
  wt+1 = wt − (1/t) H⁻¹ ∂ℓ( fwt(xt), yt )/∂w
• Could they converge with the same speed?
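The 2SGD recursion above can be sketched on a single-pass least-squares stream where H is known in advance (the feature scales and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares toy stream; H = E[x x^T] is diagonal because the features
# are independent, and it is assumed known in advance (as on slide 24).
d = 4
scales = np.array([1.0, 2.0, 0.5, 4.0])   # condition number kappa = (4/0.5)^2 = 64
w_star = np.array([1.0, -1.0, 2.0, 0.5])
H_inv = np.diag(1.0 / scales ** 2)

w = np.zeros(d)
T = 50000
for t in range(1, T + 1):
    x = scales * rng.normal(size=d)          # one fresh example: a single pass
    y_t = x @ w_star + 0.1 * rng.normal()
    grad = (x @ w - y_t) * x                 # gradient of 0.5 * (x.w - y)^2
    w -= (1.0 / t) * (H_inv @ grad)          # 2SGD update with gain (1/t) H^-1
```

Despite seeing each example exactly once, the iterate tracks the sequence of empirical optima, which is the content of the theorem on the next slide.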
46. Yes they do! But what does it mean?
• Theorem (Bottou & LeCun, 2003; Murata & Amari, 1998)
  Under “adequate conditions”:
    limn→∞ n ‖w∞ − w∗n‖² = limt→∞ t ‖w∞ − wt‖² = tr( H⁻¹ G H⁻¹ )
    limn→∞ n [ E(fw∗n) − E(f∗F) ] = limt→∞ t [ E(fwt) − E(f∗F) ] = tr( G H⁻¹ )
[Figure: starting from w0 = w∗0, one pass of second order stochastic gradient tracks the empirical optima w∗n (the best training set error) towards w∞ = w∗∞, the best solution in F; the gap behaves like ≅ K/n.]
47. Optimal Learning in One Pass
Given a large enough training set, a Single Pass of Second Order Stochastic Gradient generalizes as well as the Empirical Optimum.
[Figure: experiments on synthetic data; test error (Mse∗ + 1e−1 down to Mse∗ + 1e−4, and 0.342 to 0.366) plotted against the number of examples (1000 to 100000) and against training time in milliseconds (100 to 10000).]
48. Unfortunate Practical Issues
• Second Order SGD is not that fast!
    wt+1 ← wt − (1/t) H⁻¹ ∂ℓ(fwt(xt), yt)/∂w
  – Must estimate and store the d × d matrix H⁻¹.
  – Must multiply the gradient for each example by the matrix H⁻¹.
  – Sparsity tricks no longer work because H⁻¹ is not sparse.
• Research Directions
  Limited storage approximations of H⁻¹.
  – Reduce the number of epochs.
  – Rarely sufficient for fast one-pass learning.
  – Diagonal approximation (Becker & LeCun, 1989)
  – Low rank approximation (e.g., LeCun et al., 1998)
  – Online L-BFGS approximation (Schraudolph, 2007)
49. Digression: Stopping Criteria for SGD

                                          2SGD             SGD
  Time to reach accuracy ρ                ν/ρ + o(1/ρ)     kν/ρ + o(1/ρ)
  Number of epochs to reach the same
  test cost as the full optimization      1                k, with 1 ≤ k ≤ κ²

There are many ways to make the constant k smaller:
  – Exact second order stochastic gradient descent.
  – Approximate second order stochastic gradient descent.
  – Simple preconditioning tricks.
50. Digression: Stopping Criteria for SGD
• Early stopping with cross validation
– Create a validation set by setting some training examples apart.
– Monitor cost function on the validation set.
– Stop when it stops decreasing.
• Early stopping a priori
– Extract two disjoint subsamples of training data.
– Train on the first subsample; stop by validating on the second.
– The number of epochs is an estimate of k.
– Train by performing that number of epochs on the full set.
This is asymptotically correct and gives reasonable results in practice.
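The a-priori variant can be sketched as follows (toy least-squares data, a constant gain, and a small cap on the candidate epoch counts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: constant-gain least-squares SGD (illustration only).
n, d = 4000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def sgd_epochs(Xtr, ytr, epochs, eta=0.02):
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(len(ytr)):
            w -= eta * (Xtr[i] @ w - ytr[i]) * Xtr[i]
    return w

def cost(w, Xs, ys):
    return 0.5 * np.mean((Xs @ w - ys) ** 2)

# Early stopping a priori: estimate the epoch count k on two disjoint
# subsamples, then perform that many epochs on the full training set.
half = n // 2
A, B = slice(0, half), slice(half, n)
costs = [cost(sgd_epochs(X[A], y[A], e), X[B], y[B]) for e in range(1, 6)]
k = 1 + int(np.argmin(costs))
w = sgd_epochs(X, y, k)
```

The validation pass costs only a fraction of the final training run, since it uses half-size subsamples.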
51. One-pass learning for Kernel Machines?
Challenges for Large-Scale Kernel Machines:
  – Bulky kernel matrix (n × n.)
  – Managing the kernel expansion w = Σi αi Φ(xi).
  – Managing memory.
Issues of SGD for Kernel Machines:
  – Conceptually simple.
  – Sparsity issues in kernel expansion.
Stochastic and Incremental SVMs:
  – Iteratively constructing the kernel expansion.
  – Which candidate support vectors to store and discard?
  – Managing the memory required by the kernel values.
  – One-pass learning?
52. Learning in the dual
[Figure: max-margin separation of two point clouds A and B; the margin is the min distance between their convex hulls.]
• Convex, Kernel trick.
• Memory: n nsv
• Time: n^α nsv with 1 < α ≤ 2
• Bad news: nsv ∼ 2Bn (see Steinwart, 2004)
• nsv could be much smaller. (Burges, 1993; Vincent & Bengio, 2002)
  – How to do it fast?
  – How small?
53. An Inefficient Dual Optimizer
[Figure: a hull point N, a target point P, and the projected point N′ obtained by moving N towards a new example x.]
• Both P and N are linear combinations of examples with positive coefficients summing to one.
• Projection: N′ = (1 − γ) N + γ x with 0 ≤ γ ≤ 1.
• Projection time proportional to nsv.
54. Two Problems with this Algorithm
• Eliminating unwanted Support Vectors
  Pattern x already has α > 0, but we found better support vectors.
  – The simple algorithm decreases α too slowly.
  – Same problem as SGD in fact.
  – Solution: Allow γ to be slightly negative, down to γ = −α/(1 − α).
• Processing Support Vectors often enough
  When drawing examples randomly,
  – Most have α = 0 and should remain so.
  – Support vectors (α > 0) need adjustments but are rarely processed.
  – Solution: Draw support vectors more often.
55. The Huller and its Derivatives
• The Huller
Repeat
PROCESS: Pick a random fresh example and project.
REPROCESS: Pick a random support vector and project.
– Compare with incremental learning and retraining.
– PROCESS potentially adds support vectors.
– REPROCESS potentially discards support vectors.
• Derivatives of the Huller
– LASVM handles soft-margins and is connected to SMO.
– LARANK handles multiclass problems and structured outputs.
(Bordes et al., 2005, 2006, 2007)
58. Are we there yet?
– Handwritten digits recognition with on-the-fly generation of distorted training patterns.
– Very difficult problem for local kernels.
– Potentially many support vectors.
– More a challenge than a solution.

  Number of binary classifiers   10
  Memory for the kernel cache    6.5GB
  Examples per classifier        8.1M
  Total training time            8 days
  Test set error                 0.67%

– Trains in one pass: each example gets only one chance to be selected.
– Maybe the largest SVM training on a single CPU. (Loosli et al., 2006)
59. Are we there yet?
[Figure: convolutional network — 29x29 input, two full 5x5 convolutional layers producing 5 (15x15) and then 50 (5x5) feature maps, 100 hidden units, 10 output units.]

  Training algorithm    SGD
  Training examples     ≈ 4M
  Total training time   2-3 hours
  Test set error        0.4%

(Simard et al., ICDAR 2003)
• RBF kernels cannot compete with task specific models.
• The kernel SVM is slower because it needs more memory.
• The kernel SVM trains with a single pass.
60. Conclusion
• Connection between Statistics and Computation.
• Qualitatively different tradeoffs for small– and large–scale.
• Plain SGD rocks in theory and in practice.
• One-pass learning feasible with 2SGD or dual techniques.
Current algorithms still slower than plain SGD.
• Important topics not addressed today:
Example selection, data quality, weak supervision.