This document summarizes a presentation introducing kernel classifiers. It discusses why linear techniques for classification, regression, and dimensionality reduction are often successful: many natural functions are smooth, and linear solutions are fast and intuitive. However, linear classifiers fail when the data are not linearly separable. Kernel methods address this by projecting the data into a higher-dimensional feature space in which it may become linearly separable.
CVPR2009 tutorial: Kernel Methods in Computer Vision: part I: Introduction to Kernel Methods, Selecting and Combining Kernels
1. Kernel Methods in Computer Vision
Christoph Lampert (Max Planck Institute for Biological Cybernetics, Tübingen)
Matthew Blaschko (MPI Tübingen and University of Oxford)
June 20, 2009
2. Overview...
14:00 – 15:00 Introduction to Kernel Classifiers
15:20 – 15:50 Selecting and Combining Kernels
15:50 – 16:20 Other Kernel Methods
16:40 – 17:40 Learning with Structured Outputs
Slides and Additional Material (soon)
http://www.christoph-lampert.de
also watch out for
9. Linear Dimensionality Reduction
Reduce the dimensionality of a dataset while preserving its structure.
Principal Component Analysis
10. Linear Techniques
Three different elementary tasks:
classification,
regression,
dimensionality reduction.
In each case, linear techniques are very successful.
11. Linear Techniques
Linear techniques...
often work well:
most natural functions are smooth,
a smooth function can be approximated, at least locally, by linear functions.
are fast and easy to solve:
elementary maths, even closed-form solutions (see the sketch below),
typically involve only matrix operations.
are intuitive:
the solution can be visualized geometrically,
the solution corresponds to common sense.
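As an illustration of the closed-form, matrix-only flavour of linear methods, here is a minimal sketch (not from the slides; numpy and ordinary least-squares regression as the example are assumptions) of a linear technique solved in a single matrix expression:

```python
# Minimal sketch (assumption: numpy available). Ordinary least-squares
# regression has the closed-form solution w = (X^T X)^{-1} X^T y,
# computed here with one linear solve.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)                # closed-form weights
print(w)                                             # close to (1, -2, 0.5)
```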
12. Example: Maximum Margin Classification
Notation:
data points X = {x1 , . . . , xn }, xi ∈ Rd ,
class labels Y = {y1 , . . . , yn }, yi ∈ {+1, −1}.
linear (decision) function f : Rd → R,
decide classes based on sign f : Rd → {−1, 1}.
parameterize
f(x) = a_1 x_1 + a_2 x_2 + · · · + a_d x_d + a_0
≡ ⟨w, x⟩ + b   with   w = (a_1, . . . , a_d),  b = a_0.
⟨· , ·⟩ is the scalar product in R^d.
f is uniquely determined by w ∈ R^d and b ∈ R,
but we usually ignore b and only study w:
b can be absorbed into w. Set w ← (w, b), x ← (x, 1).
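A minimal sketch (not from the slides, assuming numpy) of the bias-absorption trick just described: appending a constant 1 to every sample makes ⟨w̃, x̃⟩ equal to ⟨w, x⟩ + b.

```python
# Minimal sketch: absorb the bias b into the weight vector.
import numpy as np

X = np.array([[0.5, 1.0], [2.0, 0.2]])               # two samples in R^2
w, b = np.array([1.0, -2.0]), 0.5

X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # x~ = (x, 1)
w_tilde = np.append(w, b)                            # w~ = (w, b)

print(X @ w + b)                                     # <w, x> + b
print(X_tilde @ w_tilde)                             # identical values
```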
13. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }.
14. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Any w partitions the
data space into two half-spaces, i.e. defines a classifier.
[figure: the hyperplane defined by w splits the data space into two half-spaces, f(x) > 0 and f(x) < 0]
“What’s the best w?”
15. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. What’s the best w?
Not these, since they misclassify many examples.
Criterion 1: Enforce sign⟨w, xi⟩ = yi for i = 1, . . . , n.
16. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. What’s the best w?
Better not these, since they would be “risky” for future samples.
Criterion 2: Try to ensure sign⟨w, x⟩ = y for future (x, y) as well.
17. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Assume that future
samples are similar to current ones. What’s the best w?
[Figure: a separating hyperplane with margin γ on both sides]
Maximize “stability”: use w such that we can maximally perturb the
input samples without introducing misclassifications.
18. Example: Maximum Margin Classification
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Assume that future
samples are similar to current ones. What’s the best w?
[Figure: the maximum-margin hyperplane with margin γ and the margin region highlighted]
Maximize “stability”: use w such that we can maximally perturb the
input samples without introducing misclassifications.
Central quantity:
margin(x) = distance of x to the decision hyperplane = ⟨w/‖w‖, x⟩
19. Example: Maximum Margin Classification
Maximum-margin solution is determined by a maximization problem:
    max_{w ∈ R^d, γ ∈ R^+}  γ
subject to
    sign ⟨w, x_i⟩ = y_i    for i = 1, . . . , n,
    |⟨w/‖w‖, x_i⟩| ≥ γ    for i = 1, . . . , n.
Classify new samples using f(x) = ⟨w, x⟩.
20. Example: Maximum Margin Classification
Maximum-margin solution is determined by a maximization problem:
    max_{w ∈ R^d, ‖w‖ = 1, γ ∈ R}  γ
subject to
    y_i ⟨w, x_i⟩ ≥ γ    for i = 1, . . . , n.
Classify new samples using f(x) = ⟨w, x⟩.
21. Example: Maximum Margin Classification
We can rewrite this as a minimization problem:
    min_{w ∈ R^d}  ‖w‖²
subject to
    y_i ⟨w, x_i⟩ ≥ 1    for i = 1, . . . , n.
Classify new samples using f(x) = ⟨w, x⟩.
22. Example: Maximum Margin Classification
From the view of optimization theory,
    min_{w ∈ R^d}  ‖w‖²
subject to
    y_i ⟨w, x_i⟩ ≥ 1    for i = 1, . . . , n
is rather easy:
The objective function is differentiable and convex.
The constraints are all linear.
We can find the globally optimal w in O(n³) (or faster).
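To make this concrete, here is a minimal sketch (not the tutorial's code) that solves the hard-margin problem min ‖w‖² s.t. y_i ⟨w, x_i⟩ ≥ 1 with a generic constrained solver; the tiny separable toy dataset and the choice of scipy's SLSQP solver are illustrative assumptions.

```python
# Hedged sketch: hard-margin SVM as a small constrained quadratic problem.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, -2.0]])  # toy data
y = np.array([1, 1, -1, -1])

objective = lambda w: w @ w                                    # ||w||^2
constraints = [{"type": "ineq", "fun": lambda w, i=i: y[i] * (w @ X[i]) - 1.0}
               for i in range(len(y))]                         # y_i <w, x_i> - 1 >= 0

res = minimize(objective, x0=np.array([1.0, 1.0]),
               constraints=constraints, method="SLSQP")
w = res.x
print("w =", w, " margins:", y * (X @ w))                      # all margins >= 1
```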
28. Linear Separability
The problem
    min_{w ∈ R^d}  ‖w‖²
subject to
    y_i ⟨w, x_i⟩ ≥ 1
has no solution: the constraints contradict each other.
[Figure: a dataset (height vs. width) that is not linearly separable]
We cannot find a maximum-margin hyperplane here, because there
is none. To fix this, we must allow hyperplanes that make mistakes.
30. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane with one sample inside the margin; the margin violation is measured by ξ]
Possibly this one, even though one sample is misclassified.
32. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane that classifies all points correctly but has a very small margin]
Maybe not this one, even though all points are classified correctly.
33. Linear Separability
What is the best w for this dataset?
[Figure: a hyperplane with a large margin and a single margin violation ξ]
Trade-off: large margin vs. few mistakes on training set
34. Solving for Soft-Margin Solution
Mathematically, we formulate the trade-off by slack variables ξ_i:
    min_{w ∈ R^d, ξ_i ∈ R^+}  ‖w‖² + C Σ_{i=1}^n ξ_i
subject to
    y_i ⟨w, x_i⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
We can fulfill every constraint by choosing ξi large enough.
The larger ξi , the larger the objective (that we try to minimize).
C is a regularization/trade-off parameter:
small C → constraints are easily ignored
large C → constraints are hard to ignore
C = ∞ → hard margin case → no training error
Note: The problem is still convex and efficiently solvable.
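A hedged sketch of the soft-margin trade-off using an off-the-shelf solver (scikit-learn's SVC rather than the tutorial's own code); the toy data and the grid of C values are illustrative.

```python
# Hedged sketch: the effect of the regularization parameter C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy {clf.score(X, y):.2f}, "
          f"support vectors {len(clf.support_)}")
```

Small C lets many constraints be violated (wide margin, more errors); large C approaches the hard-margin case.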
35. Linear Separability
So, what is the best soft-margin w for this dataset?
[Figure: a dataset that no linear classifier can separate, even with slack variables]
None. We need something non-linear!
36. Non-Linear Classification: Stacking
Idea 1) Use classifier output as input to other (linear) classifiers:
[Figure: outputs σ(f_1(x)), . . . , σ(f_4(x)) of linear classifiers f_i(x) = ⟨w_i, x⟩ are fed into a further linear classifier f_5]
Multilayer Perceptron (Artificial Neural Network) or Boosting
⇒ decisions depend non-linearly on x and wj .
37. Non-linearity: Data Preprocessing
Idea 2) Preprocess the data:
[Figure: the dataset in Cartesian coordinates (x, y)] This dataset is not (well) linearly separable.
[Figure: the same dataset in polar coordinates (r, θ)] This one is.
In fact, both are the same dataset!
Top: Cartesian coordinates. Bottom: polar coordinates.
38. Non-linearity: Data Preprocessing
[Figure: the separation boundary is non-linear in Cartesian coordinates (x, y) but linear in polar coordinates (r, θ)]
A linear classifier in polar space acts non-linearly in Cartesian space.
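The following sketch illustrates Idea 2 on assumed toy data: a ring-shaped dataset that a linear classifier cannot separate in Cartesian coordinates becomes separable after mapping each point to polar coordinates.

```python
# Hedged sketch: linear classification before and after a polar-coordinate map.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.normal(1.0, 0.1, 100),    # inner ring, class -1
                    rng.normal(3.0, 0.1, 100)])   # outer ring, class +1
X_cart = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 100 + [1] * 100)

def to_polar(X):
    return np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                            np.arctan2(X[:, 1], X[:, 0])])

print("Cartesian:", LinearSVC(C=1.0, max_iter=10000).fit(X_cart, y).score(X_cart, y))
print("Polar:    ", LinearSVC(C=1.0, max_iter=10000).fit(to_polar(X_cart), y).score(to_polar(X_cart), y))
```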
39. Generalized Linear Classifier
Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }.
Given any (non-linear) feature map ϕ : Rk → Rm .
Solve the minimization for ϕ(x_1), . . . , ϕ(x_n) instead of x_1, . . . , x_n:
    min_{w ∈ R^m, ξ_i ∈ R^+}  ‖w‖² + C Σ_{i=1}^n ξ_i
subject to
    y_i ⟨w, ϕ(x_i)⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
The weight vector w now comes from the target space R^m.
Distances/angles are measured by the scalar product ⟨· , ·⟩ in R^m.
The classifier f(x) = ⟨w, ϕ(x)⟩ is linear in w, but non-linear in x.
40. Example Feature Mappings
Polar coordinates:
    ϕ : (x, y) ↦ ( √(x² + y²), ∠(x, y) )
d-th degree polynomials:
    ϕ : (x_1, . . . , x_n) ↦ (1, x_1, . . . , x_n, x_1², . . . , x_n², . . . , x_1^d, . . . , x_n^d)
Distance map:
    ϕ : x ↦ ( ‖x − p_1‖, . . . , ‖x − p_N‖ )
for a set of N prototype vectors p_i, i = 1, . . . , N.
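A short sketch of the three feature maps above (numpy only); the polynomial map is written without cross terms, as in the listing, and the prototype set for the distance map is an arbitrary illustrative choice.

```python
# Hedged sketch of the feature maps listed on the slide.
import numpy as np

def phi_polar(X):
    """(x, y) -> (radius, angle)."""
    return np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                            np.arctan2(X[:, 1], X[:, 0])])

def phi_poly(X, d=3):
    """(x_1, ..., x_n) -> (1, x_1, ..., x_n, ..., x_1^d, ..., x_n^d)."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones] + [X ** k for k in range(1, d + 1)])

def phi_dist(X, prototypes):
    """x -> (||x - p_1||, ..., ||x - p_N||) for prototype vectors p_i."""
    return np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)

X = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[0.0, 0.0], [1.0, 1.0]])   # two illustrative prototypes
print(phi_polar(X), phi_poly(X, d=2), phi_dist(X, P), sep="\n")
```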
42. Is this enough?
In this example, changing the coordinates did help.
Does this trick always work?
[Figure: the dataset shown in Cartesian (x, y) and in polar (r, θ) coordinates]
Answer: In a way, yes!
Lemma
Let (x_i)_{i=1,...,n} with x_i ≠ x_j for i ≠ j. Let ϕ : R^k → R^m be a feature
map. If the set ϕ(x_i)_{i=1,...,n} is linearly independent, then the points
ϕ(x_i)_{i=1,...,n} are linearly separable.
Lemma
If we choose m > n large enough, we can always find a map ϕ.
43. Is this enough?
Caveat: We can separate any set, not just one with “reasonable” yi :
There is a fixed feature map ϕ : R2 → R20001 such that – no matter
how we label them – there is always a hyperplane classifier that has
zero training error.
46. Representer Theorem
Solve the soft-margin minimization for ϕ(x_1), . . . , ϕ(x_n) ∈ R^m:
    min_{w ∈ R^m, ξ_i ∈ R^+}  ‖w‖² + C Σ_{i=1}^n ξ_i        (1)
subject to
    y_i ⟨w, ϕ(x_i)⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
For large m, won’t solving for w ∈ R^m become impossible? No!
Theorem (Representer Theorem)
The minimizing solution w to problem (1) can always be written as
    w = Σ_{j=1}^n α_j ϕ(x_j)    for coefficients α_1, . . . , α_n ∈ R.
47. Kernel Trick
The representer theorem allows us to rewrite the optimization:
    min_{w ∈ R^m, ξ_i ∈ R^+}  ‖w‖² + C Σ_{i=1}^n ξ_i
subject to
    y_i ⟨w, ϕ(x_i)⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
Insert w = Σ_{j=1}^n α_j ϕ(x_j):
48. Kernel Trick
We can minimize over α_i instead of w:
    min_{α_i ∈ R, ξ_i ∈ R^+}  ‖ Σ_{j=1}^n α_j ϕ(x_j) ‖² + C Σ_{i=1}^n ξ_i
subject to
    y_i Σ_{j=1}^n α_j ⟨ϕ(x_j), ϕ(x_i)⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
49. Kernel Trick
Use ‖w‖² = ⟨w, w⟩:
    min_{α_i ∈ R, ξ_i ∈ R^+}  Σ_{j,k=1}^n α_j α_k ⟨ϕ(x_j), ϕ(x_k)⟩ + C Σ_{i=1}^n ξ_i
subject to
    y_i Σ_{j=1}^n α_j ⟨ϕ(x_j), ϕ(x_i)⟩ ≥ 1 − ξ_i    for i = 1, . . . , n.
Note: ϕ only occurs in ⟨ϕ(·), ϕ(·)⟩ pairs.
50. Kernel Trick
Set ⟨ϕ(x), ϕ(x′)⟩ =: k(x, x′), called the kernel function:
    min_{α_i ∈ R, ξ_i ∈ R^+}  Σ_{j,k=1}^n α_j α_k k(x_j, x_k) + C Σ_{i=1}^n ξ_i
subject to
    y_i Σ_{j=1}^n α_j k(x_j, x_i) ≥ 1 − ξ_i    for i = 1, . . . , n.
The maximum-margin classifier in this form with a kernel function is
often called a Support Vector Machine (SVM).
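As a sketch of how the kernelized problem is used in practice, the following trains an SVM directly from a precomputed kernel matrix K_ij = k(x_i, x_j) using scikit-learn's solver; the RBF kernel, toy data, and C value are assumptions for illustration, not part of the slides.

```python
# Hedged sketch: SVM training and prediction from a precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K_train = rbf_kernel(X, X)                       # K_ij = k(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_test = rng.randn(5, 2) + 2                     # new samples
K_test = rbf_kernel(X_test, X)                   # kernel values against training set
print(clf.predict(K_test))                       # decision uses only kernel evaluations
```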
51. Why use k(x, x′) instead of ⟨ϕ(x), ϕ(x′)⟩?
1) Speed:
We might find an expression for k(x_i, x_j) that is faster to
calculate than forming ϕ(x_i) and then ⟨ϕ(x_i), ϕ(x_j)⟩.
Example: 2nd-order polynomial kernel (here for x ∈ R¹):
    ϕ : x ↦ (1, √2 x, x²) ∈ R³
    ⟨ϕ(x_i), ϕ(x_j)⟩ = ⟨(1, √2 x_i, x_i²), (1, √2 x_j, x_j²)⟩ = 1 + 2 x_i x_j + x_i² x_j²
But equivalently (and faster) we can calculate without ϕ:
    k(x_i, x_j) := (1 + x_i x_j)² = 1 + 2 x_i x_j + x_i² x_j²
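A quick numerical check of the identity above (1-dimensional inputs, as on the slide): the explicit feature map and the kernel expression give the same scalar product.

```python
# Sketch: explicit feature map vs. kernel evaluation for the 2nd-order polynomial kernel.
import numpy as np

phi = lambda x: np.array([1.0, np.sqrt(2) * x, x ** 2])
k = lambda xi, xj: (1.0 + xi * xj) ** 2

xi, xj = 0.7, -1.3
print(phi(xi) @ phi(xj))   # 1 + 2*xi*xj + xi^2*xj^2
print(k(xi, xj))           # same value, computed without forming phi
```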
53. Why use k(x, x′) instead of ⟨ϕ(x), ϕ(x′)⟩?
2) Flexibility:
There are kernel functions k(x_i, x_j) for which we know that a
feature transformation ϕ exists, but we don’t know what ϕ is.
How can that be?
Theorem
Let k : X × X → R be a positive definite kernel function. Then there
exists a Hilbert space H and a mapping ϕ : X → H such that
    k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H
where ⟨· , ·⟩_H is the inner product in H.
54. Positive Definite Kernel Function
Definition (Positive Definite Kernel Function)
Let X be a non-empty set. A function k : X × X → R is called a
positive definite kernel function iff
k is symmetric, i.e. k(x, x′) = k(x′, x) for all x, x′ ∈ X, and
for any set of points x_1, . . . , x_n ∈ X, the matrix K_ij = k(x_i, x_j)
is positive (semi-)definite, i.e. for all vectors t ∈ R^n:
    Σ_{i,j=1}^n t_i K_ij t_j ≥ 0.
Note: Instead of “positive definite kernel function”, we will often
just say “kernel”.
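A hedged sketch for checking this condition empirically: compute the kernel matrix on a random sample of points and look at its smallest eigenvalue. A negative value certifies that k is not a kernel; a non-negative value on one sample proves nothing in general. The candidate functions and the data are illustrative.

```python
# Sketch: empirical positive-semidefiniteness check of candidate kernel functions.
import numpy as np

def gram(k, X):
    return np.array([[k(xi, xj) for xj in X] for xi in X])

rbf     = lambda x, y: np.exp(-np.sum((x - y) ** 2))
sigmoid = lambda x, y: np.tanh(2.0 * np.dot(x, y) + 1.0)   # not positive definite in general

rng = np.random.RandomState(0)
X = list(rng.randn(30, 3))
for name, k in [("rbf", rbf), ("sigmoid", sigmoid)]:
    eigs = np.linalg.eigvalsh(gram(k, X))
    print(f"{name}: smallest eigenvalue = {eigs.min():.4f}")
```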
55. Hilbert Spaces
Definition (Hilbert Space)
A Hilbert space H is a vector space H with an inner product
⟨· , ·⟩_H, i.e. a mapping
    ⟨· , ·⟩_H : H × H → R
which is
symmetric: ⟨v, v′⟩_H = ⟨v′, v⟩_H for all v, v′ ∈ H,
positive definite: ⟨v, v⟩_H ≥ 0 for all v ∈ H,
where ⟨v, v⟩_H = 0 only for v = 0 ∈ H,
bilinear: ⟨av, v′⟩_H = a ⟨v, v′⟩_H for v, v′ ∈ H, a ∈ R,
and ⟨v + v′′, v′⟩_H = ⟨v, v′⟩_H + ⟨v′′, v′⟩_H.
We can treat a Hilbert space like some R^n, if we only use concepts
like vectors, angles, distances. Note: dim H = ∞ is possible!
56. Kernels for Arbitrary Sets
Theorem
Let k : X × X → R be a positive definite kernel function. Then there
exists a Hilbert space H and a mapping ϕ : X → H such that
    k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H
where ⟨· , ·⟩_H is the inner product in H.
Translation
Take any set X and any function k : X × X → R.
If k is a positive definite kernel, then we can use k to learn a (soft)
maximum-margin classifier for the elements in X!
Note: X can be any set, e.g. X = { all images }.
57. How to Check if a Function is a Kernel
Problem:
Checking if a given k : X × X → R fulfills the conditions for a
kernel is difficult: we need to prove or disprove
    Σ_{i,j=1}^n t_i k(x_i, x_j) t_j ≥ 0
for any set x_1, . . . , x_n ∈ X, any t ∈ R^n, and any n ∈ N.
Workaround:
It is easy to construct functions k that are positive definite
kernels.
59. Constructing Kernels
1) We can construct kernels from scratch:
For any ϕ : X → R^m, k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_{R^m} is a kernel.
If d : X × X → R is a distance function, i.e.
• d(x, x′) ≥ 0 for all x, x′ ∈ X,
• d(x, x′) = 0 only for x = x′,
• d(x, x′) = d(x′, x) for all x, x′ ∈ X,
• d(x, x′) ≤ d(x, x′′) + d(x′′, x′) for all x, x′, x′′ ∈ X,
then k(x, x′) := exp(−d(x, x′)) is a kernel.
2) We can construct kernels from other kernels:
if k is a kernel and α > 0, then αk and k + α are kernels.
if k_1, k_2 are kernels, then k_1 + k_2 and k_1 · k_2 are kernels.
62. Constructing Kernels
Examples of kernels for X = R^d:
any linear combination Σ_j α_j k_j with α_j ≥ 0,
polynomial kernels k(x, x′) = (1 + ⟨x, x′⟩)^m, m > 0,
Gaussian or RBF kernel k(x, x′) = exp( −‖x − x′‖² / (2σ²) ) with σ > 0.
Examples of kernels for other X:
k(h, h′) = Σ_{i=1}^n min(h_i, h′_i) for n-bin histograms h, h′,
k(p, p′) = exp(−KL(p, p′)) with KL the symmetrized KL-divergence between positive probability distributions,
k(s, s′) = exp(−D(s, s′)) for strings s, s′ and D the edit distance.
Examples of functions X × X → R that are not kernels:
tanh(κ ⟨x, x′⟩ + θ)   (the matrix K_ij can have negative eigenvalues)
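A small sketch of some of these constructions (numpy only, written for clarity rather than speed); the parameter values are illustrative.

```python
# Hedged sketch: building kernels from scratch and combining them.
import numpy as np

def k_rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def k_poly(x, xp, m=2):
    return (1.0 + np.dot(x, xp)) ** m

def k_hist_intersection(h, hp):
    return np.minimum(h, hp).sum()

# sums, products, and conic combinations of kernels are again kernels
def k_combined(x, xp):
    return 0.5 * k_rbf(x, xp) + 0.5 * k_poly(x, xp) * k_rbf(x, xp, sigma=2.0)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
h, hp = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(k_rbf(x, xp), k_poly(x, xp), k_hist_intersection(h, hp), k_combined(x, xp))
```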
63. Kernels in Computer Vision
X = { images }, treat feature extraction as part of kernel definition
OCR/handwriting recognition
resize image, normalize brightness/contrast/rotation/skew
polynomial kernel k(x, x′) = (1 + ⟨x, x′⟩)^d, d > 0
[DeCoste, Schölkopf. ML2002]
Pedestrian detection
resize image, calculate local intensity gradient directions
local thresholding + linear kernel [Dalal, Triggs. CVPR 2005]
or
local L1 -normalization + histogram intersection kernel
[Maji, Berg, Malik. CVPR 2008]
64. Kernels in Computer Vision
X = { images }, treat feature extraction as part of the kernel definition
object category recognition
extract local image descriptors, e.g. SIFT
calculate multi-level pyramid histograms h_{l,k}(x)
pyramid match kernel [Grauman, Darrell. ICCV 2005]:
    k_PMK(x, x′) = Σ_{l=1}^L 2^{l−1} Σ_k min( h_{l,k}(x), h_{l,k}(x′) )
scene/object category recognition
extract local image descriptors, e.g. SIFT
quantize descriptors into bag-of-words histograms
χ²-kernel [Puzicha, Buhmann, Rubner, Tomasi. ICCV 1999]:
    k_{χ²}(h, h′) = exp( −γ χ²(h, h′) )   for γ > 0,   with
    χ²(h, h′) = Σ_{k=1}^K (h_k − h′_k)² / (h_k + h′_k)
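A minimal sketch of the χ²-kernel between histograms, following the formula above; γ is a free parameter, and the small ε guarding against empty bins is an implementation detail not taken from the slides.

```python
# Hedged sketch: chi-squared kernel between bag-of-words histograms.
import numpy as np

def chi2_distance(h, hp, eps=1e-10):
    return np.sum((h - hp) ** 2 / (h + hp + eps))

def chi2_kernel(h, hp, gamma=1.0):
    return np.exp(-gamma * chi2_distance(h, hp))

h  = np.array([0.1, 0.4, 0.5])
hp = np.array([0.2, 0.3, 0.5])
print(chi2_kernel(h, hp, gamma=0.5))
```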
65. Summary
Linear methods are popular and well understood
classification, regression, dimensionality reduction, ...
Kernels are at the same time...
1) Similarity measure between (arbitrary) objects,
2) Scalar products in a (hidden) vector space.
Kernelization can make linear techniques more powerful
implicit preprocessing, non-linear in the original data.
still linear in some feature space ⇒ still intuitive/interpretable
Kernels can be defined over arbitrary inputs, e.g. images
unified framework for all preprocessing steps
different features, normalizations, etc., become kernel choices
66. What did we not see?
We have skipped the largest part of theory on kernel methods:
Optimization
Dualization
Algorithms to train SVMs
Kernel Design
Systematic methods to construct data-dependent kernels.
Statistical Interpretations
What do we assume about samples?
What performance can we expect?
Generalization Bounds
The test error of a (kernelized) linear classifier can be
controlled using its modelling error and its training error.
“Support Vectors”
This and much more in standard references.
70. Selecting From Multiple Kernels
Typically, one has many different kernels to choose from:
different functional forms
linear, polynomial, RBF, . . .
different parameters
polynomial degree, Gaussian bandwidth, . . .
Different image features give rise to different kernels
Color histograms,
SIFT bag-of-words,
HOG,
Pyramid match,
Spatial pyramids, . . .
How to choose?
Ideally, based on the kernels’ performance on the task at hand:
estimate by cross-validation or validation-set error.
Classically, this is part of “Model Selection”.
71. Kernel Parameter Selection
Note: Model Selection makes a difference!
Action Classification, KTH dataset
Method Accuracy
Dollár et al. VS-PETS 2005: ”SVM classifier“ 80.66
Nowozin et al., ICCV 2007: ”baseline RBF“ 85.19
identical features, same kernel function
difference: Nowozin used cross-validation for model selection
(bandwidth and C )
Note: there is no overfitting involved here. Model selection is fully
automatic and uses only training data.
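A hedged sketch of this kind of model selection: cross-validated grid search over the RBF bandwidth and the regularization parameter C, using training data only; the data and the grid values are illustrative assumptions.

```python
# Sketch: model selection over (C, gamma) by cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 5) + 1, rng.randn(40, 5) - 1])
y = np.array([1] * 40 + [-1] * 40)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```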
72. Kernel Parameter Selection
Rule of thumb for kernel parameters
For kernels based on the exponential function,
    k(x, x′) = exp( −(1/γ) ∆(x, x′) )
with any dissimilarity ∆, set
    γ ≈ mean_{i,j=1,...,n} ∆(x_i, x_j).
Sometimes better: use only ∆(x_i, x_j) with y_i ≠ y_j.
In general, if there are several classes, the kernel matrix
    K_ij = k(x_i, x_j)
should have a block structure w.r.t. the classes.
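A short sketch of the rule of thumb on assumed toy data: set the scale of an exponential kernel to roughly the mean pairwise distance of the training points (here, squared Euclidean distance is used as the dissimilarity).

```python
# Sketch: bandwidth heuristic gamma ~ mean pairwise (squared) distance.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 8)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # all pairwise squared distances
gamma = d2[np.triu_indices_from(d2, k=1)].mean()      # mean over distinct pairs

K = np.exp(-d2 / gamma)                               # k(x, x') = exp(-dist(x, x') / gamma)
print("gamma =", gamma, " K range:", K.min(), "to", K.max())
```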
77. Kernel Selection ↔ Kernel Combination
Is there a single best kernel at all?
Kernels are typically designed to capture one aspect of the data:
texture, color, edges, . . .
Choosing one kernel means selecting exactly one such aspect.
Combining aspects is often better than selecting one.
Method Accuracy
Colour 60.9 ± 2.1
Shape 70.2 ± 1.3
Texture 63.7 ± 2.7
HOG 58.5 ± 4.5
HSV 61.3 ± 0.7
siftint 70.6 ± 1.6
siftbdy 59.4 ± 3.3
combination 85.2 ± 1.5
Mean accuracy on Oxford Flowers dataset [Gehler, Nowozin: ICCV2009]
78. Combining Two Kernels
For two kernels k_1, k_2:
the product k = k_1 · k_2 is again a kernel.
    Problem: very small kernel values suppress large ones.
the average k = ½ (k_1 + k_2) is again a kernel.
    Problem: k_1, k_2 on different scales. Re-scale first?
the convex combination k_β = (1 − β) k_1 + β k_2 with β ∈ [0, 1].
    Model selection: cross-validate over β ∈ {0, 0.1, . . . , 1}.
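A sketch of the convex-combination approach with model selection over β, using precomputed kernel matrices and a simple hold-out split (rather than full cross-validation) for brevity; the kernels, data, and C value are illustrative.

```python
# Sketch: selecting the mixing weight beta of k_beta = (1-beta) k1 + beta k2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 4) + 1, rng.randn(30, 4) - 1])
y = np.array([1] * 30 + [-1] * 30)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-d2)                        # RBF kernel matrix
K2 = (1.0 + X @ X.T) ** 2               # 2nd-degree polynomial kernel matrix

tr, te = np.arange(0, 60, 2), np.arange(1, 60, 2)   # alternating hold-out split
for beta in np.linspace(0.0, 1.0, 11):
    K = (1 - beta) * K1 + beta * K2
    clf = SVC(kernel="precomputed", C=1.0).fit(K[np.ix_(tr, tr)], y[tr])
    print(f"beta={beta:.1f}: hold-out accuracy {clf.score(K[np.ix_(te, tr)], y[te]):.2f}")
```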
79. Combining Many Kernels
Multiple kernels k_1, . . . , k_K:
all convex combinations are kernels:
    k = Σ_{j=1}^K β_j k_j    with β_j ≥ 0, Σ_{j=1}^K β_j = 1.
Kernels can be “deactivated” by setting β_j = 0.
Combinatorial explosion forbids cross-validation over all
combinations of β_j.
Proxy: instead of CV, maximize the SVM objective.
Each combined kernel induces a feature space. In which of the
feature spaces can we best
explain the training data, and
achieve a large margin between the classes?
80. Feature Space View of Kernel Combination
Each kernel k_j induces
a Hilbert space H_j and a mapping ϕ_j : X → H_j.
The weighted kernel k_j^{β_j} := β_j k_j induces
the same Hilbert space H_j, but
a rescaled feature mapping ϕ_j^{β_j}(x) := √β_j ϕ_j(x):
    k_j^{β_j}(x, x′) ≡ ⟨ϕ_j^{β_j}(x), ϕ_j^{β_j}(x′)⟩_{H_j} = ⟨√β_j ϕ_j(x), √β_j ϕ_j(x′)⟩_{H_j}
                   = β_j ⟨ϕ_j(x), ϕ_j(x′)⟩_{H_j} = β_j k_j(x, x′).
The linear combination k̂ := Σ_{j=1}^K β_j k_j induces
the product space Ĥ := ⊕_{j=1}^K H_j, and
the product mapping ϕ̂(x) := ( ϕ_1^{β_1}(x), . . . , ϕ_K^{β_K}(x) )^t:
    k̂(x, x′) ≡ ⟨ϕ̂(x), ϕ̂(x′)⟩_Ĥ = Σ_{j=1}^K ⟨ϕ_j^{β_j}(x), ϕ_j^{β_j}(x′)⟩_{H_j} = Σ_{j=1}^K β_j k_j(x, x′).
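A small numerical check of the derivation above: scaling a feature map by √β_j scales the induced kernel by β_j, and stacking the scaled maps yields the kernel Σ_j β_j k_j. The toy feature maps are arbitrary choices for illustration.

```python
# Sketch: rescaled, stacked feature maps reproduce the weighted kernel sum.
import numpy as np

phi1 = lambda x: np.array([x[0], x[1]])             # toy feature maps
phi2 = lambda x: np.array([x[0] * x[1], x[0] ** 2])
k1 = lambda x, xp: phi1(x) @ phi1(xp)
k2 = lambda x, xp: phi2(x) @ phi2(xp)

beta = np.array([0.3, 0.7])
x, xp = np.array([1.0, 2.0]), np.array([-0.5, 1.5])

phi_hat = lambda x: np.concatenate([np.sqrt(beta[0]) * phi1(x),
                                    np.sqrt(beta[1]) * phi2(x)])
print(phi_hat(x) @ phi_hat(xp))                     # <phi_hat(x), phi_hat(x')>
print(beta[0] * k1(x, xp) + beta[1] * k2(x, xp))    # beta_1 k_1 + beta_2 k_2, same value
```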
81. Feature Space View of Kernel Combination
Implicit representation of a dataset using two kernels:
Kernel k1 , feature representation ϕ1 (x1 ), . . . , ϕ1 (xn ) ∈ H1
Kernel k2 , feature representation ϕ2 (x1 ), . . . , ϕ2 (xn ) ∈ H2
Kernel Selection would most likely pick k2 .
For k = (1 − β)k1 + βk2 , top is β = 0, bottom is β = 1.
102. Multiple Kernel Learning
Can we calculate coefficients β_j that realize the largest margin?
Analyze: how does the margin depend on β_j?
Remember the standard SVM (here without slack variables):
    min_{w ∈ H}  ‖w‖²_H
subject to
    y_i ⟨w, x_i⟩_H ≥ 1    for i = 1, . . . , n.
H and ϕ were induced by the kernel k.
New samples are classified by f(x) = ⟨w, x⟩_H.
103. Multiple Kernel Learning
Insert
    k(x, x′) = Σ_{j=1}^K β_j k_j(x, x′)        (2)
with
Hilbert space H = ⊕_j H_j,
feature map ϕ(x) = ( √β_1 ϕ_1(x), . . . , √β_K ϕ_K(x) )^t,
weight vector w = (w_1, . . . , w_K)^t,
such that
    ‖w‖²_H = Σ_j ‖w_j‖²_{H_j}        (3)
    ⟨w, ϕ(x_i)⟩_H = Σ_j √β_j ⟨w_j, ϕ_j(x_i)⟩_{H_j}        (4)
104. Multiple Kernel Learning
For fixed β_j, the largest-margin hyperplane is given by
    min_{w_j ∈ H_j}  Σ_j ‖w_j‖²_{H_j}
subject to
    y_i Σ_j √β_j ⟨w_j, ϕ_j(x_i)⟩_{H_j} ≥ 1    for i = 1, . . . , n.
Renaming v_j = √β_j w_j (and defining 0/0 := 0):
    min_{v_j ∈ H_j}  Σ_j (1/β_j) ‖v_j‖²_{H_j}
subject to
    y_i Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1    for i = 1, . . . , n.
105. Multiple Kernel Learning
Therefore, the best hyperplane for variable β_j is given by:
    min_{v_j ∈ H_j, β_j ≥ 0, Σ_j β_j = 1}  Σ_j (1/β_j) ‖v_j‖²_{H_j}        (5)
subject to
    y_i Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1    for i = 1, . . . , n.        (6)
This optimization problem is jointly convex in v_j and β_j.
There is a unique global minimum, and we can find it efficiently!
106. Multiple Kernel Learning
Same for the soft margin with slack variables:
    min_{v_j ∈ H_j, β_j ≥ 0, Σ_j β_j = 1, ξ_i ∈ R^+}  Σ_j (1/β_j) ‖v_j‖²_{H_j} + C Σ_i ξ_i        (7)
subject to
    y_i Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1 − ξ_i    for i = 1, . . . , n.        (8)
This optimization problem is jointly convex in v_j and β_j.
There is a unique global minimum, and we can find it efficiently!
107. Software for Multiple Kernel Learning
Existing toolboxes allow Multiple-Kernel SVM training:
Shogun (C++ with bindings to Matlab, Python etc.)
http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun
MPI IKL (Matlab with libSVM, CoinIPopt)
http://www.kyb.mpg.de/bs/people/pgehler/ikl-webpage/index.html
SimpleMKL (Matlab)
http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html
SKMsmo (Matlab)
http://www.di.ens.fr/~fbach/ (older and slower than the others)
Typically, one only has to specify the set of candidate kernels
and the regularization parameter C .
108. MKL Toy Example
Support-vector regression to learn samples of f(t) = sin(ωt),
    k_j(x, x′) = exp( −(x − x′)² / (2σ_j²) )    with 2σ_j² ∈ {0.005, 0.05, 0.5, 1, 10}.
Multiple-Kernel Learning correctly identifies the right
bandwidth.
109. Combining Good Kernels
Observation: if all kernels are reasonable, simple combination
methods work as well as difficult ones (and are much faster):
Single features:
Method    Accuracy      Time
Colour    60.9 ± 2.1    3
Shape     70.2 ± 1.3    4
Texture   63.7 ± 2.7    3
HOG       58.5 ± 4.5    4
HSV       61.3 ± 0.7    3
siftint   70.6 ± 1.6    4
siftbdy   59.4 ± 3.3    5
Combination methods:
Method        Accuracy      Time
product       85.5 ± 1.2    2
averaging     84.9 ± 1.9    10
CG-Boost      84.8 ± 2.2    1225
MKL (SILP)    85.2 ± 1.5    97
MKL (Simple)  85.2 ± 1.5    152
LP-β          85.5 ± 3.0    80
LP-B          85.4 ± 2.4    98
Mean accuracy and total runtime (model selection, training, testing) on the
Oxford Flowers dataset [Gehler, Nowozin: ICCV2009]
110. Combining Good and Bad kernels
Observation: if some kernels are helpful, but others are not, smart
techniques are better.
[Figure: “Performance with added noise features”: mean accuracy on the Oxford
Flowers dataset vs. number of added noise features (0 to 50), for product,
average, CG-Boost, MKL (SILP or Simple), LP-β and LP-B; [Gehler, Nowozin: ICCV2009]]
111. Example: Multi-Class Object Localization
MKL for joint prediction of different object classes.
Objects in images do not occur independently of each other.
Chairs and tables often occur together in indoor scenes.
Busses often occur together with cars in street scenes.
Chairs rarely occur together with cars.
One can make use of these dependencies to improve prediction.
112. Example: Multi-Class Object Localization
Predict candidate regions for all object classes.
Train a decision function for each class (red), taking into
account candidate regions for all classes (red and green).
Decide per class which other object categories are worth using:
    k(I, I′) = β_0 k_{χ²}(h, h′) + Σ_{j=1}^{20} β_j k_{χ²}(h_j, h′_j)
h: feature histogram for the full image,
h_j: histogram for the region predicted for object class j.
Use MKL to learn the weights β_j, j = 0, . . . , 20.
[Lampert and Blaschko, DAGM 2008]
113. Example: Multi-Class Object Localization
Benchmark on PASCAL VOC 2006 and VOC 2007.
Combination improves detection accuracy (black vs. blue).
114. Example: Multi-Class Object Localization
Interpretation of Weights (VOC 2007):
Every class decision depends on the full image and on the object box.
High image weights: → scene classification?
Intuitive connections: chair → diningtable, person → bottle, person → dog.
Many classes depend on the person class.
[Figure: matrix of learned weights; rows: class to be detected, columns: class
candidate boxes (0 = full image; 1 to 20 = aeroplane, bicycle, bird, boat, bottle,
bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant,
sheep, sofa, train, tvmonitor)]
115. Example: Multi-Class Object Localization
We can turn the non-zero weights into a dependency graph:
[Figure: dependency graph between the 20 object classes, with edge weights
between 0.04 and 0.27]
Threshold relative weights (without image component) at 0.04
i → j means “Class i is used to predict class j.”
Interpretable clusters: vehicles, indoor, animals.
116. Summary
Kernel Selection and Combination
Model selection is important to achieve highest accuracy
Combining several kernels is often superior to selecting one
Multiple-Kernel Learning
Learn weights for the “best” linear kernel combination:
unified approach to feature selection/combination.
visit [Gehler, Nowozin. CVPR 2009] on Wednesday afternoon
Beware: MKL is no silver bullet.
Other and even simpler techniques might be superior!
Always compare against single best, averaging, product.
Warning: Caltech101/256
Be careful when reading kernel combination results
Many results reported rely on “broken” Bosch kernel matrices