This document provides an introduction to machine learning and empirical inference. It discusses how machine learning draws conclusions from empirical data, using examples such as scientific inference and perception. It also covers hard inference problems that involve processing large, complex datasets with little prior knowledge. The document explains how machine learning can solve problems that humans cannot, by generalizing from data, and how support vector machines provide a unique solution to classification problems using kernels.
Introduction to Machine Learning
1. Introduction to Machine Learning
Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
http://www.tuebingen.mpg.de/bs
2. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
y = Σ_i a_i k(x, x_i) + b
[Figure: scatter plot of observations (x, y) with a fitted straight line y = a·x]
Leibniz, Weyl, Chaitin
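As a minimal sketch of this kind of inference (my illustration, not part of the original slides): estimating the slope a of an assumed law y = a·x from noisy observations by least squares. The data and noise level below are invented.

```python
import numpy as np

# Hypothetical noisy observations of an underlying law y = a * x (here a = 2.0)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + rng.normal(scale=1.0, size=x.shape)

# Least-squares estimate of the slope: a_hat = <x, y> / <x, x>
a_hat = np.dot(x, y) / np.dot(x, x)
print(f"estimated slope a ≈ {a_hat:.3f}")
```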
3. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
“If your experiment needs statistics [inference],
you ought to have done a better experiment.” (Rutherford)
3
4. Empirical Inference, II
• Example 2: perception
“The brain is nothing but a statistical decision organ”
(H. Barlow)
4
5. Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a multiple-kernel Support Vector Machine.
PRC = Precision-Recall Curve (precision: the fraction of correct positive predictions among all positively predicted cases)
• High dimensionality – consider many factors simultaneously to find the regularity
• Complex regularities – nonlinear, nonstationary, etc.
• Little prior knowledge – e.g., no mechanistic models for the data
• Need large data sets – processing requires computers and automatic inference methods
5
6. Hard Inference Problems, II
• We can solve scientific inference problems that humans can’t solve
• Even if it’s just because of data set size / dimensionality, this is a
quantum leap
6
7. Generalization (thanks to O. Bousquet)
• observe 1, 2, 4, 7,..
• What’s next?
(differences so far: +1, +2, +3)
• 1, 2, 4, 7, 11, 16, …: a_{n+1} = a_n + n (“lazy caterer’s sequence”)
• 1, 2, 4, 7, 12, 20, …: a_{n+2} = a_{n+1} + a_n + 1
• 1, 2, 4, 7, 13, 24, …: “Tribonacci” sequence
• 1, 2, 4, 7, 14, 28: set of divisors of 28
• 1, 2, 4, 7, 1, 1, 5, …: decimal expansions of π = 3.14159… and e = 2.718… interleaved
• The On-Line Encyclopedia of Integer Sequences: >600 hits…
7
8. Generalization, II
• Question: which continuation is correct (“generalizes”)?
• Answer: there’s no way to tell (“induction problem”)
• Question of statistical learning theory: how to come up
with a law that is (probably) correct (“demarcation problem”)
(more accurately: a law that is probably as correct on the test data as it is on the training data)
8
9. 2-class classification
Learn f : X → {±1} based on m observations (x_1, y_1), …, (x_m, y_m) generated i.i.d. from some unknown distribution P(x, y).
Goal: minimize the expected error (“risk”) R[f] = ∫ ½ |f(x) − y| dP(x, y).
V. Vapnik
Problem: P is unknown.
Induction principle: minimize the training error (“empirical risk”) R_emp[f] = (1/m) Σ_{i=1}^{m} ½ |f(x_i) − y_i| over some class of functions. Q: is this “consistent”?
9
10. The law of large numbers
For each fixed f and all ε > 0, R_emp[f] → R[f] in probability as m → ∞.
Does this imply “consistency” of empirical risk minimization (optimality in the limit)?
No – we need a uniform law of large numbers:
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } → 0 as m → ∞, for all ε > 0.
10
12. -> LaTeX
12
13. Support Vector Machines
[Figure: two classes of training points (class 1, class 2) are mapped by Φ into a feature space, where the kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩ computes dot products and a separating hyperplane is found]
• sparse expansion of the solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
• unique solution found by convex QP
15. Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al.,
Eurographics’05, ’06, ‘08, ICML ’05,‘08,
NIPS ’07
15
17. Kernel Quiz
17
18. Kernel Methods
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems
19. Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic
regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity
of the function classes that a learning machine can implement.
5. support vector machines use a particular type of function class:
classifiers with large “margins” in a feature space induced by a
kernel.
[47, 48]
20. Example: Regression Estimation
[Figure: noisy input-output data points (x, y) together with a fitted regression function]
• Data: input-output pairs (xi, yi) ∈ R × R
• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)
• Learning: choose a function f : R → R such that the error,
averaged over P, is minimized.
• Problem: P is unknown, so the average cannot be computed
— need an “induction principle”
21. Pattern Recognition
Learn f : X → {±1} from examples
(x1, y1), . . . , (xm, ym) ∈ X×{±1}, generated i.i.d. from P(x, y),
such that the expected misclassification error on a test set, also
drawn from P(x, y),
R[f] = ∫ ½ |f(x) − y| dP(x, y),
is minimal (Risk Minimization (RM)).
Problem: P is unknown. −→ need an induction principle.
Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e. minimize the training error
R_emp[f] = (1/m) Σ_{i=1}^{m} ½ |f(x_i) − y_i|
22. Convergence of Means to Expectations
Law of large numbers:
Remp[f ] → R[f ]
as m → ∞.
Does this imply that empirical risk minimization will give us the
optimal result in the limit of infinite sample size (“consistency”
of empirical risk minimization)?
No.
Need a uniform version of the law of large numbers. Uniform over
all functions that the learning machine can implement.
23. Consistency and Uniform Convergence
[Figure: risk R[f] and empirical risk R_emp[f] plotted over the function class; consistency requires that the empirical minimizer f_m approaches the true minimizer f_opt]
24. The Importance of the Set of Functions
What about allowing all functions from X to {±1}?
Training set (x1, y1), . . . , (xm, ym) ∈ X × {±1}
Test patterns x̄_1, …, x̄_m ∈ X, such that {x̄_1, …, x̄_m} ∩ {x_1, …, x_m} = {}.
For any f there exists an f* s.t.:
1. f*(x_i) = f(x_i) for all i
2. f*(x̄_j) = −f(x̄_j) for all j.
Based on the training set alone, there is no means of choosing
which one is better. On the test set, however, they give opposite
results. There is ’no free lunch’ [24, 56].
−→ a restriction must be placed on the functions that we allow
25. Restricting the Class of Functions
Two views:
1. Statistical Learning (VC) Theory: take into account the ca-
pacity of the class of functions that the learning machine can
implement
2. The Bayesian Way: place Prior distributions P(f ) over the
class of functions
26. Detailed Analysis
• loss ξ_i := ½ |f(x_i) − y_i| ∈ {0, 1}
• the ξ_i are independent Bernoulli trials
• empirical mean (1/m) Σ_{i=1}^{m} ξ_i (by definition: equals R_emp[f])
• expected value E[ξ] (equals R[f])
27. Chernoff ’s Bound
P{ |(1/m) Σ_{i=1}^{m} ξ_i − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²)
• here, P refers to the probability of getting a sample ξ_1, …, ξ_m with the property |(1/m) Σ_{i=1}^{m} ξ_i − E[ξ]| ≥ ε (it is a product measure)
Useful corollary: Given a 2m-sample of Bernoulli trials, we have
P{ |(1/m) Σ_{i=1}^{m} ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i| ≥ ε } ≤ 4 exp(−mε²/2).
28. Chernoff ’s Bound, II
Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than ε > 0 is bounded by
P{ |R_emp[f] − R[f]| ≥ ε } ≤ 2 exp(−2mε²).
• refers to one fixed f
• not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization
29. Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient conditions for nontrivial consistency of
empirical risk minimization (ERM):
One-sided convergence, uniformly over all functions that can be
implemented by the learning machine.
lim_{m→∞} P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = 0 for all ε > 0.
• note that this takes into account the whole set of functions that
can be implemented by the learning machine
• this is hard to check for a learning machine
Are there properties of learning machines (≡ sets of functions)
which ensure uniform convergence of risk?
30. How to Prove a VC Bound
Take a closer look at P{ sup_{f∈F} (R[f] − R_emp[f]) > ε }.
Plan:
• if the function class F contains only one function, then Chernoff’s bound suffices:
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 exp(−2mε²).
• if there are finitely many functions, we use the ’union bound’
• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)
31. The Case of Two Functions
Suppose F = {f_1, f_2}. Rewrite
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ C_ε^2),
where
C_ε^i := {(x_1, y_1), …, (x_m, y_m) | (R[f_i] − R_emp[f_i]) > ε}
denotes the event that the risks of f_i differ by more than ε.
The RHS equals
P(C_ε^1 ∪ C_ε^2) = P(C_ε^1) + P(C_ε^2) − P(C_ε^1 ∩ C_ε^2) ≤ P(C_ε^1) + P(C_ε^2).
Hence, by Chernoff’s bound,
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ P(C_ε^1) + P(C_ε^2) ≤ 2 · 2 exp(−2mε²).
32. The Union Bound
Similarly, if F = {f_1, …, f_n}, we have
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ · · · ∪ C_ε^n),
and
P(C_ε^1 ∪ · · · ∪ C_ε^n) ≤ Σ_{i=1}^{n} P(C_ε^i).
Use Chernoff for each summand, to get an extra factor n in the bound.
Note: this becomes an equality if and only if all the events C_ε^i involved are disjoint.
33. Infinite Function Classes
• Note: the empirical risk only refers to m points. On these points, the functions of F can take at most 2^m different values
• for R_emp, the function class thus “looks” finite
• how about R?
• need to use a trick
34. Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))
For mε² ≥ 2 we have
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 }.
Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, R_emp measures the loss on the first half of the sample, and R′_emp on the second half.
35. Shattering Coefficient
• Hence, we only need to consider the maximum size of F on 2m points. Call it N(F, 2m).
• N(F, 2m) = max. number of different outputs (y_1, …, y_{2m}) that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes.
• N(F, 2m) ≤ 2^{2m}
• if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points.
36. Putting Everything Together
We now use (1) symmetrization, (2) the shattering coefficient, and
(3) the union bound, to get
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε }
≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 }
= 2 P{ (R_emp[f_1] − R′_emp[f_1]) > ε/2 ∨ … ∨ (R_emp[f_{N(F,2m)}] − R′_emp[f_{N(F,2m)}]) > ε/2 }
≤ 2 Σ_{n=1}^{N(F,2m)} P{ (R_emp[f_n] − R′_emp[f_n]) > ε/2 }.
37. ctd.
Use Chernoff’s bound for each term:*
P{ |(1/m) Σ_{i=1}^{m} ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i| ≥ ε } ≤ 2 exp(−mε²/2).
This yields
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 4 N(F, 2m) exp(−mε²/8).
• provided that N(F, 2m) does not grow exponentially in m, this is nontrivial
• such bounds are called VC type inequalities
• two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples.
* Note that the f_i depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
38. Confidence Intervals
Rewrite the bound: specify the probability with which we want R to be close to R_emp, and solve for ε:
With a probability of at least 1 − δ,
R[f] ≤ R_emp[f] + √( (8/m) ( ln N(F, 2m) + ln(4/δ) ) ).
This bound holds independent of f; in particular, it holds for the function f_m minimizing the empirical risk.
39. Discussion
• tighter bounds are available (better constants etc.)
• cannot minimize the bound over f
• other capacity concepts can be used
40. VC Entropy
On an example (x, y), f causes a loss
ξ(x, y, f(x)) = ½ |f(x) − y| ∈ {0, 1}.
For a larger sample (x_1, y_1), …, (x_m, y_m), the different functions f ∈ F lead to a set of loss vectors
ξ_f = (ξ(x_1, y_1, f(x_1)), …, ξ(x_m, y_m, f(x_m))),
whose cardinality we denote by
N(F, (x_1, y_1), …, (x_m, y_m)).
The VC entropy is defined as
H_F(m) = E[ ln N(F, (x_1, y_1), …, (x_m, y_m)) ],
where the expectation is taken over the random generation of the m-sample (x_1, y_1), …, (x_m, y_m) from P.
H_F(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consistency)
41. Further PR Capacity Concepts
• exchange ’E’ and ’ln’: annealed entropy.
H_F^{ann}(m)/m → 0 ⇐⇒ exponentially fast uniform convergence
• take ’max’ instead of ’E’: growth function.
Note that G_F(m) = ln N(F, m).
G_F(m)/m → 0 ⇐⇒ exponential convergence for all underlying distributions P.
G_F(m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that, by using functions of the learning machine, they can be separated in all 2^m possible ways (shattered).
42. Structure of the Growth Function
Either G_F(m) = m · ln(2) for all m ∈ N,
or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For m > h,
G_F(m) ≤ h ( ln(m/h) + 1 ).
Nothing “in between” linear growth and logarithmic growth is possible.
43. VC-Dimension: Example
Half-spaces in R2:
f (x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R
• Clearly, we can shatter three non-collinear points.
• But we can never shatter four points.
• Hence the VC dimension is h = 3 (in this case, equal to the
number of parameters)
[Figure: half-spaces in R² realizing all labelings of three non-collinear points, illustrating that three points can be shattered]
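The claim h = 3 can be checked by brute force. The sketch below (the two point sets and the use of a small feasibility LP are illustrative choices, not part of the lecture) enumerates all labelings and tests whether each can be realized by a half-space f(x) = sgn(a + b·x_1 + c·x_2):

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # feasibility LP: find (a, b, c) with y_i * (a + b*x_1 + c*x_2) >= 1 for all i
    A_ub = -labels[:, None] * np.hstack([np.ones((len(points), 1)), points])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    # try every labeling in {-1, +1}^n
    return all(separable(points, np.array(y)) for y in product([-1.0, 1.0], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # non-collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # four points (a square)
print("3 points shattered:", shattered(three))   # True
print("4 points shattered:", shattered(four))    # False (the XOR labeling cannot be realized)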
44. A Typical Bound for Pattern Recognition
For any f ∈ F and m > h, with a probability of at least 1 − δ,
R[f] ≤ R_emp[f] + √( ( h ( log(2m/h) + 1 ) − log(δ/4) ) / m )
holds.
• does this mean that we can learn anything?
• The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization)
45. SRM
[Figure: structural risk minimization — over a structure of nested function classes S_{n−1} ⊂ S_n ⊂ S_{n+1}, the training error decreases with capacity h while the capacity term grows; the bound on the test error, and the error R(f*), are minimized in between]
46. Finding a Good Function Class
• recall: separating hyperplanes in R2 have a VC dimension of 3.
• more generally: separating hyperplanes in RN have a VC di-
mension of N + 1.
• hence: separating hyperplanes in high-dimensional feature
spaces have extremely large VC dimension, and may not gener-
alize well
• however, margin hyperplanes can still have a small VC dimen-
sion
47. Kernels and Feature Spaces
Preprocess the data with
Φ:X → H
x → Φ(x),
where H is a dot product space, and learn the mapping from Φ(x)
to y [6].
• usually, dim(X) ≪ dim(H)
• “Curse of Dimensionality”?
• crucial issue: capacity, not dimensionality
48. Example: All Degree 2 Monomials
Φ : R² → R³
(x_1, x_2) → (z_1, z_2, z_3) := (x_1², √2 x_1 x_2, x_2²)
[Figure: data that are not linearly separable in the input space (x_1, x_2) become linearly separable in the feature space (z_1, z_2, z_3)]
49. General Product Feature Space
How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d.
E.g. N = 16 × 16, and d = 5 −→ dimension 10^{10}
50. The Kernel Trick, N = d = 2
⟨Φ(x), Φ(x′)⟩ = (x_1², √2 x_1 x_2, x_2²) (x′_1², √2 x′_1 x′_2, x′_2²)^⊤
= ⟨x, x′⟩²
=: k(x, x′)
−→ the dot product in H can be computed in R²
51. The Kernel Trick, II
More generally: for x, x′ ∈ R^N and d ∈ N:
⟨x, x′⟩^d = ( Σ_{j=1}^{N} x_j · x′_j )^d
= Σ_{j_1,…,j_d=1}^{N} x_{j_1} · … · x_{j_d} · x′_{j_1} · … · x′_{j_d} = ⟨Φ(x), Φ(x′)⟩,
where Φ maps into the space spanned by all ordered products of d input directions
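A quick numerical check of the kernel trick for the case N = d = 2 above (the random inputs are made up for this sketch): the explicit monomial map and the squared dot product agree.

import numpy as np

def phi(x):
    # explicit degree-2 monomial feature map for x in R^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)        # dot product computed in the feature space H
rhs = (x @ xp) ** 2           # kernel k(x, x') = <x, x'>^2 computed in R^2
print(lhs, rhs, np.isclose(lhs, rhs))   # the two values agree up to rounding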
52. Mercer’s Theorem
If k is a continuous kernel of a positive definite integral operator on L₂(X) (where X is some compact space), i.e.,
∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L₂(X),
then it can be expanded as
k(x, x′) = Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(x′)
using eigenfunctions ψ_i and eigenvalues λ_i ≥ 0 [30].
53. The Mercer Feature Map
In that case
Φ(x) := (√λ_1 ψ_1(x), √λ_2 ψ_2(x), …)^⊤
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).
Proof:
⟨Φ(x), Φ(x′)⟩ = ⟨(√λ_1 ψ_1(x), √λ_2 ψ_2(x), …)^⊤, (√λ_1 ψ_1(x′), √λ_2 ψ_2(x′), …)^⊤⟩
= Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(x′) = k(x, x′)
54. Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)), and for
• any set of training points x_1, …, x_m ∈ X and
• any a_1, …, a_m ∈ R
satisfy
Σ_{i,j} a_i a_j K_{ij} ≥ 0, where K_{ij} := k(x_i, x_j).
K is called the Gram matrix or kernel matrix.
If, for pairwise distinct points, Σ_{i,j} a_i a_j K_{ij} = 0 =⇒ a = 0, the kernel is called strictly positive definite.
55. The Kernel Trick — Summary
• any algorithm that only depends on dot products can benefit
from the kernel trick
• this way, we can apply linear methods to vectorial as well as
non-vectorial data
• think of the kernel as a nonlinear similarity measure
• examples of common kernels:
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^d
Gaussian: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Kernels are also known as covariance functions [54, 52, 55, 29]
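To make the two examples concrete, here is a small sketch (the parameter values c, d, σ and the random data are arbitrary choices) that builds the corresponding Gram matrices and checks numerically that they are positive semidefinite:

import numpy as np

def poly_kernel(X, Y, c=1.0, d=3):
    return (X @ Y.T + c) ** d

def gauss_kernel(X, Y, sigma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))

for name, K in [("polynomial", poly_kernel(X, X)), ("gaussian", gauss_kernel(X, X))]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(name, "Gram matrix, smallest eigenvalue:", min_eig)   # >= 0 up to rounding error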
56. Properties of PD Kernels, 1
Assumption: Φ maps X into a dot product space H; x, x′ ∈ X
Kernels from Feature Maps.
k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Kernels from Feature Maps, II.
K(A, B) := Σ_{x∈A, x′∈B} k(x, x′), where A, B are finite subsets of X, is also a pd kernel.
(Hint: use the feature map Φ̃(A) := Σ_{x∈A} Φ(x))
57. Properties of PD Kernels, 2 [36, 39]
Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X
k(x, x) ≥ 0 for all x (positivity on the diagonal)
k(x, x′)² ≤ k(x, x) k(x′, x′) (Cauchy-Schwarz inequality)
(Hint: compute the determinant of the 2×2 Gram matrix)
k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (vanishing diagonals)
The following kernels are pd:
• αk, provided α ≥ 0
• k_1 + k_2
• k(x, x′) := lim_{n→∞} k_n(x, x′), provided it exists
• k_1 · k_2
• tensor products, direct sums, convolutions [22]
58. The Feature Space for PD Kernels [4, 1, 35]
• define a feature map
Φ : X → R^X
x → k(·, x).
[Figure: for the Gaussian kernel, each input x is mapped to the bump function Φ(x) = k(·, x) centred at x]
Next steps:
• turn Φ(X) into a linear space
• endow it with a dot product satisfying ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(·, x), k(·, x′)⟩ = k(x, x′)
• complete the space to get a reproducing kernel Hilbert space
59. Turn it Into a Linear Space
Form linear combinations
f(·) = Σ_{i=1}^{m} α_i k(·, x_i),
g(·) = Σ_{j=1}^{m′} β_j k(·, x′_j)
(m, m′ ∈ N, α_i, β_j ∈ R, x_i, x′_j ∈ X).
60. Endow it With a Dot Product
⟨f, g⟩ := Σ_{i=1}^{m} Σ_{j=1}^{m′} α_i β_j k(x_i, x′_j)
= Σ_{i=1}^{m} α_i g(x_i) = Σ_{j=1}^{m′} β_j f(x′_j)
• This is well-defined, symmetric, and bilinear (more later).
• So far, it also works for non-pd kernels
61. The Reproducing Kernel Property
Two special cases:
• Assume
f(·) = k(·, x).
In this case, we have
⟨k(·, x), g⟩ = g(x).
• If moreover
g(·) = k(·, x′),
we have
⟨k(·, x), k(·, x′)⟩ = k(x, x′).
k is called a reproducing kernel
(up to here, we have not used positive definiteness)
62. Endow it With a Dot Product, II
• It can be shown that ⟨·, ·⟩ is a p.d. kernel on the set of functions {f(·) = Σ_{i=1}^{m} α_i k(·, x_i) | α_i ∈ R, x_i ∈ X}:
Σ_{ij} γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_i γ_i f_i, Σ_j γ_j f_j⟩ =: ⟨f, f⟩
= ⟨Σ_i α_i k(·, x_i), Σ_i α_i k(·, x_i)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) ≥ 0
• furthermore, it is strictly positive definite:
f(x)² = ⟨f, k(·, x)⟩² ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩,
hence ⟨f, f⟩ = 0 implies f = 0.
• Complete the space in the corresponding norm to get a Hilbert space H_k.
63. The Empirical Kernel Map
Recall the feature map
Φ : X → R^X
x → k(·, x).
• each point is represented by its similarity to all other points
• how about representing it by its similarity to a sample of points?
Consider
Φ_m : X → R^m
x → k(·, x)|_{(x_1,…,x_m)} = (k(x_1, x), …, k(x_m, x))^⊤
64. ctd.
• Φ_m(x_1), …, Φ_m(x_m) contain all necessary information about Φ(x_1), …, Φ(x_m)
• the Gram matrix G_{ij} := ⟨Φ_m(x_i), Φ_m(x_j)⟩ satisfies G = K², where K_{ij} = k(x_i, x_j)
• modify Φ_m to
Φ_m^w : X → R^m
x → K^{−1/2} (k(x_1, x), …, k(x_m, x))^⊤
• this “whitened” map (“kernel PCA map”) satisfies
⟨Φ_m^w(x_i), Φ_m^w(x_j)⟩ = k(x_i, x_j)
for all i, j = 1, …, m.
65. An Example of a Kernel Algorithm
Idea: classify points x := Φ(x) in feature space according to which
of the two class means is closer.
c_+ := (1/m_+) Σ_{i: y_i=+1} Φ(x_i),  c_− := (1/m_−) Σ_{i: y_i=−1} Φ(x_i)
[Figure: the two class means c_+ and c_− in feature space, the vector w := c_+ − c_−, their midpoint c, and a test point x]
Compute the sign of the dot product between w := c_+ − c_− and x − c.
66. An Example of a Kernel Algorithm, ctd. [36]
f(x) = sgn( (1/m_+) Σ_{i: y_i=+1} ⟨Φ(x), Φ(x_i)⟩ − (1/m_−) Σ_{i: y_i=−1} ⟨Φ(x), Φ(x_i)⟩ + b )
= sgn( (1/m_+) Σ_{i: y_i=+1} k(x, x_i) − (1/m_−) Σ_{i: y_i=−1} k(x, x_i) + b ),
where
b = ½ ( (1/m_−²) Σ_{(i,j): y_i=y_j=−1} k(x_i, x_j) − (1/m_+²) Σ_{(i,j): y_i=y_j=+1} k(x_i, x_j) ).
• provides a geometric interpretation of Parzen windows
67. An Example of a Kernel Algorithm, ctd.
• Demo
• Exercise: derive the Parzen windows classifier by computing the
distance criterion directly
• SVMs (ppt)
68. An example of a kernel algorithm, revisited
[Figure: positive and negative samples in feature space with their RKHS means µ(X), µ(Y) and the difference vector w between them]
X compact subset of a separable metric space, m, n ∈ N.
Positive class X := {x_1, …, x_m} ⊂ X
Negative class Y := {y_1, …, y_n} ⊂ X
RKHS means µ(X) = (1/m) Σ_{i=1}^{m} k(x_i, ·), µ(Y) = (1/n) Σ_{i=1}^{n} k(y_i, ·).
We get a problem if µ(X) = µ(Y)!
69. When do the means coincide?
k(x, x′) = ⟨x, x′⟩: the means coincide
k(x, x′) = (⟨x, x′⟩ + 1)^d: all empirical moments up to order d coincide
k strictly pd: X = Y.
The mean “remembers” each point that contributed to it.
70. Proposition 2 Assume X, Y are defined as above, k is strictly pd, and the x_i are pairwise distinct, as are the y_j.
If for some α_i, β_j ∈ R − {0} we have
Σ_{i=1}^{m} α_i k(x_i, ·) = Σ_{j=1}^{n} β_j k(y_j, ·),   (1)
then X = Y.
71. Proof (by contradiction)
W.l.o.g., assume that x_1 ∉ Y. Subtract Σ_{j=1}^{n} β_j k(y_j, ·) from (1), and make it a sum over pairwise distinct points, to get
0 = Σ_i γ_i k(z_i, ·),
where z_1 = x_1, γ_1 = α_1 ≠ 0, and
z_2, · · · ∈ X ∪ Y − {x_1}, γ_2, · · · ∈ R.
Take the RKHS dot product with Σ_j γ_j k(z_j, ·) to get
0 = Σ_{ij} γ_i γ_j k(z_i, z_j),
with γ ≠ 0, hence k cannot be strictly pd.
72. The mean map
µ : X = (x_1, …, x_m) → (1/m) Σ_{i=1}^{m} k(x_i, ·)
satisfies
⟨µ(X), f⟩ = (1/m) Σ_{i=1}^{m} ⟨k(x_i, ·), f⟩ = (1/m) Σ_{i=1}^{m} f(x_i)
and
‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^{m} f(x_i) − (1/n) Σ_{i=1}^{n} f(y_i) |.
Note: a large distance means we can find a function distinguishing the samples.
73. Witness function
f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, ·)⟩:
[Figure: witness function f for Gauss and Laplace data — the two probability densities and f plotted over X]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.
74. The mean map for measures
p, q Borel probability measures,
E_{x,x′∼p}[k(x, x′)] < ∞, E_{x,x′∼q}[k(x, x′)] < ∞ (‖k(x, ·)‖ ≤ M < ∞ is sufficient)
Define
µ : p → E_{x∼p}[k(x, ·)].
Note
⟨µ(p), f⟩ = E_{x∼p}[f(x)]
and
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now?
[43, 17]
75. Theorem 3 [15, 13]
p = q ⇐⇒ sup_{f∈C(X)} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] | = 0,
where C(X) is the space of continuous bounded functions on X.
Combine this with
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Replace C(X) by the unit ball in an RKHS that is dense in C(X) — universal kernel [45], e.g., Gaussian.
Theorem 4 [19] If k is universal, then
p = q ⇐⇒ ‖µ(p) − µ(q)‖ = 0.
76. • µ is invertible on its image
M = {µ(p) | p is a probability distribution}
(the “marginal polytope”, [53])
• generalization of the moment generating function of a RV x with distribution p:
M_p(·) = E_{x∼p}[ e^{⟨x, ·⟩} ].
This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible.
77. Fourier Criterion
Assume we have densities, the kernel is shift invariant (k(x, y) = k(x − y)), and all Fourier transforms below exist.
Note that µ is invertible iff
∫ k(x − y) p(y) dy = ∫ k(x − y) q(y) dy =⇒ p = q,
i.e.,
k̂ · (p̂ − q̂) = 0 =⇒ p = q
(Sriperumbudur et al., 2008)
E.g., µ is invertible if k̂ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if the support of k̂ has non-empty interior, µ is invertible for all distributions with compact support).
78. Fourier Optics
Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|².
Set k = |Î|²; then this equals µ(p).
This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).
79. Application 1: Two-sample problem [19]
X, Y i.i.d. m-samples from p, q, respectively.
‖µ(p) − µ(q)‖² = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)]
= E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
with
h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
Define
D(p, q)² := E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
D̂(X, Y)² := 1/(m(m−1)) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)).
D̂(X, Y)² is an unbiased estimator of D(p, q)².
It’s easy to compute, and works on structured data.
80. Theorem 5 Assume k is bounded. D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}).
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero-mean Gaussian with variance
σ_u² = 4 ( E_z[(E_{z′} h(z, z′))²] − (E_{z,z′} h(z, z′))² ).
When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to
Σ_{l=1}^{∞} λ_l (q_l² − 2),   (2)
where q_l ∼ N(0, 2) i.i.d., the λ_i are the solutions to the eigenvalue equation
∫_X k̃(x, x′) ψ_i(x) dp(x) = λ_i ψ_i(x′),
and k̃(x_i, x_j) := k(x_i, x_j) − E_x[k(x_i, x)] − E_x[k(x, x_j)] + E_{x,x′}[k(x, x′)] is the centred RKHS kernel.
81. Application 2: Dependence Measures
Assume that (x, y) are drawn from pxy , with marginals px, py .
Want to know whether pxy factorizes.
[2, 16]: kernel generalized variance
[20, 21]: kernel constrained covariance, HSIC
Main idea [25, 34]:
x and y independent ⇐⇒ ∀ bounded continuous functions f, g,
we have Cov(f (x), g(y)) = 0.
82. k kernel on X × Y.
µ(p_{xy}) := E_{(x,y)∼p_{xy}}[k((x, y), ·)]
µ(p_x × p_y) := E_{x∼p_x, y∼p_y}[k((x, y), ·)].
Use ∆ := ‖µ(p_{xy}) − µ(p_x × p_y)‖ as a measure of dependence.
For k((x, y), (x′, y′)) = k_x(x, x′) k_y(y, y′):
∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H K_x H K_y), where H = I − (1/m) 1 1^⊤ [20, 44].
83. Witness function of the equivalent optimisation problem:
[Figure: dependence witness and sample — contour plot of the witness function over the (X, Y) plane, with the sample points overlaid]
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007)
84. Application 3: Covariate Shift Correction and Local Learning
training set X = {(x_1, y_1), …, (x_m, y_m)} drawn from p,
test set X′ = {(x′_1, y′_1), …, (x′_n, y′_n)} drawn from p′ ≠ p.
Assume p_{y|x} = p′_{y|x}.
[40]: reweight training set
85. Minimize
‖ Σ_{i=1}^{m} β_i k(x_i, ·) − µ(X′) ‖² + λ Σ_i β_i²  subject to β_i ≥ 0, Σ_i β_i = 1.
Equivalent QP:
minimize_β  ½ β^⊤ (K + λ1) β − β^⊤ l
subject to β_i ≥ 0 and Σ_i β_i = 1,
where K_{ij} := k(x_i, x_j), l_i = ⟨k(x_i, ·), µ(X′)⟩.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23].
X′ = {x′} leads to a local sample weighting scheme.
86. The Representer Theorem
Theorem 7 Given: a p.d. kernel k on X × X, a training set (x_1, y_1), …, (x_m, y_m) ∈ X × R, a strictly monotonically increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}.
Any f ∈ H_k minimizing the regularized risk functional
c((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) + Ω(‖f‖)   (3)
admits a representation of the form
f(·) = Σ_{i=1}^{m} α_i k(x_i, ·).
87. Remarks
• significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
• original form, with mean squared loss
c((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²,
and Ω(‖f‖) = λ ‖f‖² (λ > 0): [27]
• generalization to non-quadratic cost functions: [10]
• present form: [36]
88. Proof
Decompose f ∈ H into a part in the span of the k(x_i, ·) and an orthogonal one:
f = Σ_i α_i k(x_i, ·) + f⊥,
where for all j
⟨f⊥, k(x_j, ·)⟩ = 0.
Application of f to an arbitrary training point x_j yields
f(x_j) = ⟨f, k(x_j, ·)⟩
= ⟨Σ_i α_i k(x_i, ·) + f⊥, k(x_j, ·)⟩
= Σ_i α_i ⟨k(x_i, ·), k(x_j, ·)⟩,
independent of f⊥.
89. Proof: second part of (3)
Since f⊥ is orthogonal to Σ_i α_i k(x_i, ·), and Ω is strictly monotonic, we get
Ω(‖f‖) = Ω( ‖Σ_i α_i k(x_i, ·) + f⊥‖ )
= Ω( √( ‖Σ_i α_i k(x_i, ·)‖² + ‖f⊥‖² ) )
≥ Ω( ‖Σ_i α_i k(x_i, ·)‖ ),   (4)
with equality occurring if and only if f⊥ = 0.
Hence, any minimizer must have f⊥ = 0. Consequently, any solution takes the form
f = Σ_i α_i k(x_i, ·).
90. Application: Support Vector Classification
Here, y_i ∈ {±1}. Use
c((x_i, y_i, f(x_i))_i) = (1/λ) Σ_i max(0, 1 − y_i f(x_i)),
and the regularizer Ω(‖f‖) = ‖f‖².
λ → 0 leads to the hard margin SVM
91. Further Applications
Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.
• exp(−c((x_i, y_i, f(x_i))_i)) — likelihood of the data
• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k
• minimizer of (3) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
c((x_i, y_i, f(x_i))_{i=1,…,m}) = 0 if (1/m) Σ_i ( f(x_i) − (1/m) Σ_j f(x_j) )² = 1, and ∞ otherwise,
with Ω an arbitrary strictly monotonically increasing function.
92. Conclusion
• the kernel corresponds to
– a similarity measure for the data, or
– a (linear) representation of the data, or
– a hypothesis space for learning,
• kernels allow the formulation of a multitude of geometrical algo-
rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...)
93. Kernel PCA [37]
[Figure: linear PCA with k(x, y) = ⟨x, y⟩ extracts straight-line principal components in R²; kernel PCA with k(x, y) = ⟨x, y⟩^d extracts components that are nonlinear in input space, corresponding to linear PCA in the feature space H reached via Φ]
94. Kernel PCA, II
x_1, …, x_m ∈ X,  Φ : X → H,  C = (1/m) Σ_{j=1}^{m} Φ(x_j) Φ(x_j)^⊤
Eigenvalue problem
λV = CV = (1/m) Σ_{j=1}^{m} ⟨Φ(x_j), V⟩ Φ(x_j).
For λ ≠ 0, V ∈ span{Φ(x_1), …, Φ(x_m)}, thus
V = Σ_{i=1}^{m} α_i Φ(x_i),
and the eigenvalue problem can be written as
λ ⟨Φ(x_n), V⟩ = ⟨Φ(x_n), CV⟩ for all n = 1, …, m
95. Kernel PCA in Dual Variables
In terms of the m × m Gram matrix
K_{ij} := ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j),
this leads to
mλKα = K²α,
where α = (α_1, …, α_m)^⊤.
Solve
mλα = Kα
−→ (λ_n, α^n)
⟨V^n, V^n⟩ = 1 ⇐⇒ λ_n ⟨α^n, α^n⟩ = 1,
thus divide α^n by √λ_n
96. Feature extraction
Compute projections on the eigenvectors
V^n = Σ_{i=1}^{m} α_i^n Φ(x_i)
in H:
for a test point x with image Φ(x) in H we get the features
⟨V^n, Φ(x)⟩ = Σ_{i=1}^{m} α_i^n ⟨Φ(x_i), Φ(x)⟩
= Σ_{i=1}^{m} α_i^n k(x_i, x)
97. The Kernel PCA Map
Recall
Φ_m^w : X → R^m
x → K^{−1/2} (k(x_1, x), …, k(x_m, x))^⊤
If K = U D U^⊤ is K’s diagonalization, then K^{−1/2} = U D^{−1/2} U^⊤. Thus we have
Φ_m^w(x) = U D^{−1/2} U^⊤ (k(x_1, x), …, k(x_m, x))^⊤.
We can drop the leading U (since it leaves the dot product invariant) to get a map
Φ_{KPCA}^w(x) = D^{−1/2} U^⊤ (k(x_1, x), …, k(x_m, x))^⊤.
The rows of U^⊤ are the eigenvectors α^n of K, and the entries of the diagonal matrix D^{−1/2} equal λ_i^{−1/2}.
98. Toy Example with Gaussian Kernel
k(x, x′) = exp(−‖x − x′‖²)
99. Super-Resolution (Kim, Franz, Schölkopf, 2004)
[Figure: comparison between different super-resolution methods — a. original image of resolution 528 × 396; b. low-resolution image (264 × 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on a nearest-neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a-d and f (from left to right)]
100. Support Vector Classifiers
[Figure: two classes of points in input space are mapped by Φ into a feature space, where they are separated by a hyperplane] [6]
101. Separating Hyperplane
[Figure: a separating hyperplane {x | ⟨w, x⟩ + b = 0} with normal vector w; points with ⟨w, x⟩ + b > 0 lie on one side, points with ⟨w, x⟩ + b < 0 on the other]
103. Eliminating the Scaling Freedom [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, …, x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.
104. Canonical Optimal Hyperplane
[Figure: canonical optimal hyperplane {x | ⟨w, x⟩ + b = 0} together with the margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1}]
Note: for points x_1 (with y_i = +1) and x_2 (with y_i = −1) on the margin hyperplanes,
⟨w, x_1⟩ + b = +1 and ⟨w, x_2⟩ + b = −1,
hence ⟨w, (x_1 − x_2)⟩ = 2 and ⟨ w/‖w‖, (x_1 − x_2) ⟩ = 2/‖w‖.
105. Canonical Hyperplanes [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, …, x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.
Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (“margin”) is 1/‖w‖:
min_{x_i∈X*} | ⟨ w/‖w‖, x_i ⟩ + b/‖w‖ | = 1/‖w‖.
106. Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x_1, …, x_r}, i.e.,
min_{i=1,…,r} |⟨w, x_i⟩| = 1.
The set of decision functions f_w(x) = sgn⟨x, w⟩ defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
h ≤ R²Λ².
Here, R is the radius of the smallest sphere around the origin containing X*.
107. [Figure: points contained in a sphere of radius R, with two separating hyperplanes of margins γ_1 and γ_2]
108. Proof Strategy (Gurvits, 1997)
Assume that x_1, …, x_r are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y_1, …, y_r ∈ {±1},
y_i ⟨w, x_i⟩ ≥ 1 for all i = 1, …, r.   (5)
Two steps:
• prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^{r} y_i x_i‖ must be
• upper bound the size of ‖Σ_{i=1}^{r} y_i x_i‖ in terms of R
Combining the two tells us how many points we can at most shatter.
109. Part I
Summing (5) over i = 1, …, r yields
⟨w, Σ_{i=1}^{r} y_i x_i⟩ ≥ r.
By the Cauchy-Schwarz inequality, on the other hand, we have
⟨w, Σ_{i=1}^{r} y_i x_i⟩ ≤ ‖w‖ ‖Σ_{i=1}^{r} y_i x_i‖ ≤ Λ ‖Σ_{i=1}^{r} y_i x_i‖.
Combine both:
r/Λ ≤ ‖Σ_{i=1}^{r} y_i x_i‖.   (6)
110. Part II
Consider independent random labels y_i ∈ {±1}, uniformly distributed (Rademacher variables).
E ‖ Σ_{i=1}^{r} y_i x_i ‖² = E ⟨ Σ_{i=1}^{r} y_i x_i, Σ_{j=1}^{r} y_j x_j ⟩
= E Σ_{i=1}^{r} ( ⟨ y_i x_i, Σ_{j≠i} y_j x_j ⟩ + ⟨ y_i x_i, y_i x_i ⟩ )
= Σ_{i=1}^{r} E ⟨ y_i x_i, Σ_{j≠i} y_j x_j ⟩ + Σ_{i=1}^{r} E[ ⟨ y_i x_i, y_i x_i ⟩ ]
= Σ_{i=1}^{r} E ‖ y_i x_i ‖² = Σ_{i=1}^{r} ‖x_i‖²
(the cross terms vanish in expectation since the labels are independent with zero mean)
111. Part II, ctd.
Since ‖x_i‖ ≤ R, we get
E ‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR².
• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
Hence
‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR².
112. Part I and II Combined
Part I: (r/Λ)² ≤ ‖ Σ_{i=1}^{r} y_i x_i ‖²
Part II: ‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR²
Hence
r²/Λ² ≤ rR²,
i.e.,
r ≤ R²Λ²,
completing the proof.