This paper presents a statistical perspective on retrieval-based models for classification, analyzing them through two frameworks: local empirical risk minimization and classification in an extended feature space. For local empirical risk minimization, the paper states its assumptions and derives an excess risk bound that decomposes the local model's error into terms for the local versus global optimal risk, the sample versus retrieved-set risk, the generalization error of the local model, and a central absolute moment of the local model. It also shows how the bound can be tightened by leveraging the local structure of the data distribution.
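The local empirical risk minimization view has a simple operational reading: retrieve the neighbors of a query, then fit and apply a model on that retrieved set alone. Below is a minimal sketch of that idea, assuming scikit-learn; the choice of k and of a logistic local model is illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def local_erm_predict(X_train, y_train, x_query, k=50):
    """Classify a query by retrieving its k nearest training examples
    and fitting a small model on the retrieved set only."""
    retriever = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = retriever.kneighbors(x_query.reshape(1, -1))
    X_local, y_local = X_train[idx[0]], y_train[idx[0]]
    if len(np.unique(y_local)) == 1:      # retrieved set is pure:
        return y_local[0]                 # no local fit is needed
    local_model = LogisticRegression().fit(X_local, y_local)
    return local_model.predict(x_query.reshape(1, -1))[0]
```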
The document discusses autonomous vehicles and their key features. It describes how autonomous cars use sensors and advanced systems like intelligent cruise control, collision avoidance, lane support and night vision to drive themselves. These systems allow the cars to maintain speed and distance from other vehicles, detect lanes and traffic lights, and park without human assistance. While autonomous vehicles could reduce accidents and increase road capacity, challenges remain around system failures, hacking risks and high costs.
The document discusses the requirements for developing applications for intelligent vehicles. It defines an intelligent vehicle as one equipped with devices that enable automation of driving tasks like lane following, obstacle avoidance, and route determination. It describes several control systems used in intelligent vehicles, like collision warning systems. The key requirements for intelligent vehicle applications are knowledge of the vehicle state, environment state, and driver/passenger state. Sensors are necessary to gain knowledge of the surrounding environment and interpret the situation. The goal of intelligent vehicles is to one day be fully autonomous through improving existing driver assistance technologies.
Manifold regularization is an approach that exploits the geometry of the marginal distribution. The main goal of this paper is to analyze the convergence of such regularization algorithms in learning theory. We propose a more general multi-penalty framework and establish optimal convergence rates under a general smoothness assumption. We present a theoretical analysis of the performance of multi-penalty regularization over a reproducing kernel Hilbert space, and discuss error estimates of the regularization schemes under prior assumptions on the joint probability measure on the sample space. We analyze the convergence rates of the learning algorithms measured both in the reproducing kernel Hilbert space norm and in the norm of the Hilbert space of square-integrable functions; convergence is treated in a probabilistic sense via exponential tail inequalities. In order to optimize the regularization functional, a crucial issue is selecting regularization parameters that ensure good performance of the solution. We propose a new parameter choice rule, the "penalty balancing principle", based on augmented Tikhonov regularization. The superiority of multi-penalty regularization over single-penalty regularization is shown using an academic example and the moon data set.
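As a concrete instance of the two-penalty case, the manifold-regularization functional commonly takes the following standard form from the literature (shown for illustration; the parameters λ₁, λ₂ are what a rule like the penalty balancing principle must choose):

$$f_{\lambda} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^{n} \bigl(f(x_i) - y_i\bigr)^2 + \lambda_1 \lVert f \rVert_{K}^{2} + \lambda_2 \lVert f \rVert_{I}^{2},$$

where ‖·‖_K is the RKHS norm and ‖·‖_I is a data-dependent smoothness penalty (e.g. a graph-Laplacian term); single-penalty regularization is recovered at λ₂ = 0.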
This document provides a summary of supervised learning techniques including linear regression, logistic regression, support vector machines, naive Bayes classification, and decision trees. It defines key concepts such as hypothesis, loss functions, cost functions, and gradient descent. It also covers generative models like Gaussian discriminant analysis, and ensemble methods such as random forests and boosting. Finally, it discusses learning theory concepts such as the VC dimension, PAC learning, and generalization error bounds.
This document provides an introduction to machine learning concepts including loss functions, empirical risk, and two basic learning methods: least squares and nearest neighbor. It describes how machine learning aims to find a function that minimizes empirical risk under a given loss function. Least squares learning is discussed as minimizing the squared differences between predictions and labels, and nearest neighbor is introduced as an alternative method. The document serves as a high-level overview of fundamental machine learning principles.
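A tiny sketch of the two ingredients just described (empirical risk under squared loss, and a nearest-neighbor predictor), assuming NumPy arrays; purely illustrative:

```python
import numpy as np

def empirical_risk(predict, X, y, loss=lambda p, t: (p - t) ** 2):
    """Average loss of a predictor over a labeled sample (squared loss by default)."""
    return float(np.mean([loss(predict(x), t) for x, t in zip(X, y)]))

def nearest_neighbor_predictor(X_train, y_train):
    """1-nearest-neighbor: predict the label of the closest training point."""
    def predict(x):
        i = np.argmin(np.linalg.norm(X_train - x, axis=1))
        return y_train[i]
    return predict
```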
This document summarizes a semi-supervised regression method that combines graph Laplacian regularization with cluster ensemble methodology. It proposes using a weighted averaged co-association matrix from the cluster ensemble as the similarity matrix in graph Laplacian regularization. The method (SSR-LRCM) finds a low-rank approximation of the co-association matrix to efficiently solve the regression problem. Experimental results on synthetic and real-world datasets show SSR-LRCM achieves significantly better prediction accuracy than an alternative method, while also having lower computational costs for large datasets. Future work will explore using a hierarchical matrix approximation instead of low-rank.
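A minimal sketch of the graph-Laplacian regularized regression at the core of such methods, assuming a precomputed similarity matrix W (standing in for the weighted co-association matrix) and omitting the low-rank machinery:

```python
import numpy as np

def laplacian_regularized_fit(W, y_labeled, labeled, lam=1.0):
    """Solve min_f sum_{i labeled} (f_i - y_i)^2 + lam * f^T L f,
    where L = D - W is the graph Laplacian of similarity matrix W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    J = np.zeros((n, n))
    J[labeled, labeled] = 1.0        # diagonal selector for labeled vertices
    b = np.zeros(n)
    b[labeled] = y_labeled           # labels extended by zeros elsewhere
    # Stationarity of the objective gives the linear system (J + lam*L) f = b.
    return np.linalg.solve(J + lam * L, b)
```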
Accelerating Metropolis Hastings with Lightweight Inference Compilation, by Feynman Liang
This document summarizes research on accelerating Metropolis-Hastings sampling with lightweight inference compilation. It discusses background on probabilistic programming languages and Bayesian inference techniques like variational inference and sequential importance sampling. It introduces the concept of inference compilation, where a neural network is trained to construct proposals for MCMC that better match the posterior. The paper proposes a lightweight approach to inference compilation for imperative probabilistic programs that trains proposals conditioned on execution prefixes to address issues with sequential importance sampling.
This document provides an overview of key calculus concepts and formulas taught in a Calculus I course at Miami Dade College - Hialeah Campus. The topics covered include limits and derivatives, integration, optimization techniques, and applications of calculus to economics, business, physics, and other fields. The document is intended as a study guide for students in the Calculus I class taught by Professor Mohammad Shakil.
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC (a minimal rejection-ABC sketch follows this list).
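As a concrete illustration of the basic mechanism described above, here is a minimal rejection-ABC sketch in Python; the Gaussian example, prior range, and tolerance are illustrative choices, not from the original document.

```python
import numpy as np

def abc_rejection(observed, simulate, prior_sample, distance,
                  n_sims=10_000, tolerance=0.1):
    """Basic rejection ABC: keep parameter draws whose simulated data
    fall within `tolerance` of the observed data."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample()          # draw a parameter from the prior
        data = simulate(theta)          # simulate data under that parameter
        if distance(data, observed) < tolerance:
            accepted.append(theta)      # approximate posterior draw
    return np.array(accepted)

# Example: infer the mean of a Gaussian with known unit variance,
# using the sample mean as a summary statistic.
rng = np.random.default_rng(0)
obs_summary = rng.normal(2.0, 1.0, size=100).mean()
posterior = abc_rejection(
    observed=obs_summary,
    simulate=lambda t: rng.normal(t, 1.0, size=100).mean(),
    prior_sample=lambda: rng.uniform(-5.0, 5.0),
    distance=lambda a, b: abs(a - b),
)
```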
Conditional random fields (CRFs) are probabilistic models for segmenting and labeling sequence data. CRFs address limitations of previous models like hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs). CRFs allow incorporation of arbitrary, overlapping features of the observation sequence and label dependencies. Parameters are estimated to maximize the conditional log-likelihood using iterative scaling or tracking partial feature expectations. Experiments show CRFs outperform HMMs and MEMMs on synthetic and real-world tasks by addressing label bias problems and modeling dependencies beyond the previous label.
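For reference, the linear-chain CRF defines the conditional distribution below (standard notation following Lafferty et al.; the f_k are feature functions with weights λ_k and Z(x) is the per-sequence normalizer):

$$p_{\lambda}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Bigl( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Bigr), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Bigl( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Bigr).$$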
Statement of stochastic programming problems, by SSA KPI
AACIMP 2010 Summer School lecture by Leonidas Sakalauskas. "Applied Mathematics" stream. "Stochastic Programming and Applications" course. Part 1.
More info at http://summerschool.ssa.org.ua
Elementary Probability and Information Theory, by KhalidSaghiri2
This document provides an overview of foundational probability and statistics concepts relevant to statistical natural language processing (NLP). It discusses topics like probability theory, random variables, expectation, variance, Bayesian and frequentist statistics, maximum likelihood estimation, entropy, mutual information, and how these concepts can be applied to language modeling tasks in NLP. The document aims to motivate these mathematical foundations and illustrate their use for statistical inference and modeling of language.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
The proposed method uses an online weighted ensemble of one-class SVMs for feature selection in background/foreground separation, automatically selecting the best features for different image regions. Multiple base classifiers are generated using weighted random subspaces; the best base classifiers are selected and combined based on their error rates, and feature importance is computed adaptively from classifier responses. The background model is updated incrementally using a heuristic approach. Experimental results on the MSVS dataset show the proposed method achieves higher precision, recall, and F-score than the compared methods.
Maximum likelihood estimation of regularisation parameters in inverse problem..., by Valentin De Bortoli
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
This document discusses Monte Carlo methods for numerical integration and simulation. It introduces the challenge of sampling from probability distributions and several Monte Carlo techniques to address this, including importance sampling, rejection sampling, and Metropolis-Hastings. It provides pseudocode for rejection sampling and discusses its application to estimating pi. Finally, it outlines using Metropolis-Hastings to simulate the Ising model of magnetization.
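In the spirit of the π-estimation example mentioned above, a hit-or-miss Monte Carlo sketch (the sample count is arbitrary):

```python
import numpy as np

def estimate_pi(n_samples=1_000_000, seed=0):
    """Estimate pi by sampling points uniformly in the unit square
    and counting the fraction that land inside the quarter circle."""
    rng = np.random.default_rng(seed)
    xy = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    inside = (xy ** 2).sum(axis=1) <= 1.0
    return 4.0 * inside.mean()

print(estimate_pi())  # ~3.1416 for large n_samples
```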
We consider stochastic optimization problems arising in deep learning and other areas of statistical and machine learning from a statistical decision theory perspective. In particular, we investigate the admissibility (in the sense of decision theory) of the sample average solution estimator. We show that this estimator can be inadmissible in very simple settings, a phenomenon that is derived from the classical James-Stein estimator. However, for many problems of interest, the sample average estimator is indeed admissible. We will end with several open questions in this research direction.
Error Estimates for Multi-Penalty Regularization under General Source Condition, by csandit
In learning theory, the convergence issues of the regression problem are investigated with the least-squares Tikhonov regularization schemes in both the RKHS-norm and the L2-norm. We consider the multi-penalized least-squares regularization scheme under a general source condition with polynomial decay of the eigenvalues of the integral operator. One motivation for this work is to discuss the convergence issues of the widely considered manifold regularization scheme. The optimal convergence rates of the multi-penalty regularizer are achieved in the interpolation norm using the concept of effective dimension. Further, we propose the penalty balancing principle, based on augmented Tikhonov regularization, for the choice of regularization parameters. The superiority of multi-penalty regularization over single-penalty regularization is shown using an academic example and the moon data set.
The document discusses applications of machine learning for robot navigation and control. It describes how surrogate models can be used for predictive modeling in engineering applications like aircraft design. Dimension reduction techniques are used to reduce high-dimensional design parameters to a lower-dimensional space for faster surrogate model evaluation. For robot navigation, regression models on image manifolds are used for visual localization by mapping images to robot positions. Manifold learning is also applied to find low-dimensional representations of valid human hand poses from images to enable easier robot control.
This document provides an introduction to Bayesian analysis and probabilistic modeling. It begins with an overview of Bayes' theorem and common probability distributions used in Bayesian modeling like the Bernoulli, binomial, beta, Dirichlet, and multinomial distributions. It then discusses how these distributions can be used in Bayesian modeling for problems like estimating probabilities based on observed data. Specifically, it explains how conjugate prior distributions allow the posterior distribution to be of the same family as the prior. The document concludes by discussing how neural networks can quantify classification uncertainty by outputting evidence for different classes modeled with a Dirichlet distribution.
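The conjugacy mentioned above can be made concrete with the standard Beta-Bernoulli pair: with a Beta(α, β) prior and n Bernoulli observations, the posterior stays in the Beta family, with the counts of successes and failures added to the prior parameters:

$$\theta \sim \mathrm{Beta}(\alpha, \beta), \quad x_1, \dots, x_n \mid \theta \sim \mathrm{Bernoulli}(\theta) \;\Longrightarrow\; \theta \mid x_{1:n} \sim \mathrm{Beta}\Bigl(\alpha + \textstyle\sum_i x_i,\; \beta + n - \textstyle\sum_i x_i\Bigr).$$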
An Introduction To Basic Statistics And Probability, by Maria Perkins
This document provides an introduction to basic statistics and probability concepts. It outlines topics including probability distributions, random variables, the central limit theorem, and sampling. Key concepts are defined, such as probability mass functions, expected value, and variance. Common probability distributions like the binomial and normal distributions are also introduced. Examples are provided to illustrate concepts like finding probabilities and distribution parameters.
Slides by Alexander März:
The language of statistics is of a probabilistic nature. Any model that falls short of quantifying the uncertainty attached to its outcome is likely to provide an incomplete and potentially misleading picture. While this is an irrevocable consensus in statistics, machine learning approaches usually lack proper ways of quantifying uncertainty. In fact, a possible distinction between the two modelling cultures can be attributed to the (non-)existence of uncertainty estimates that allow for, e.g., hypothesis testing or the construction of estimation/prediction intervals. Quantification of uncertainty in general, and probabilistic forecasting in particular, does not just provide an average point forecast; rather, it equips the user with a range of outcomes and the probability of each of those occurring.

In an effort to bring both disciplines closer together, the audience is introduced to a new framework for XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution (i.e., mean, location, scale and shape [LSS]) instead of the conditional mean only. By choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows one to gain additional insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. As such, XGBoostLSS contributes to the growing literature on statistical machine learning that aims at weakening the separation between Breiman's "Data Modelling Culture" and "Algorithmic Modelling Culture", so that models designed mainly for prediction can also be used to describe and explain the underlying data generating process of the response of interest.
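To illustrate the kind of output such a distributional model enables, here is a sketch assuming the model has already produced per-observation Gaussian location and scale estimates (the parameter values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical per-observation parameters predicted by a distributional
# model such as XGBoostLSS: location mu and scale sigma of a Gaussian.
mu = np.array([10.2, 12.8, 9.5])
sigma = np.array([1.1, 2.0, 0.8])

# 90% prediction intervals and the median follow directly from the
# predicted conditional distribution, not from a single point forecast.
lower = stats.norm.ppf(0.05, loc=mu, scale=sigma)
upper = stats.norm.ppf(0.95, loc=mu, scale=sigma)
median = stats.norm.ppf(0.50, loc=mu, scale=sigma)
```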
Transfer Learning for the Detection and Classification of traditional pneumon..., by Yusuf Brima
A presentation of my MSc in Mathematical Sciences thesis at the African Institute of Mathematical Sciences (AIMS), Rwanda. This presentation explores the application of Deep Transfer Learning towards the diagnosis and classification of traditional pneumonia and pneumonia induced from COVID-19 using chest X-ray images.
When Classifier Selection meets Information Theory: A Unifying View, by Mohamed Farouk
Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection: it derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to an open issue in ensemble learning, namely classifier selection. We investigate a couple of information-theoretic selection criteria that are used to rank classifiers.
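To make the ranking idea concrete, here is a hypothetical sketch that scores each ensemble member by relevance (mutual information between its predictions and the labels) minus redundancy with the other members; both the criterion and the beta trade-off are illustrative stand-ins in the spirit of information-theoretic selection, not the paper's exact formulas.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def rank_classifiers(predictions, y, beta=0.5):
    """Rank classifiers by I(pred; labels) minus beta-weighted average
    redundancy I(pred_i; pred_j) with the other ensemble members."""
    m = len(predictions)
    scores = []
    for i in range(m):
        relevance = mutual_info_score(predictions[i], y)
        redundancy = (np.mean([mutual_info_score(predictions[i], predictions[j])
                               for j in range(m) if j != i]) if m > 1 else 0.0)
        scores.append(relevance - beta * redundancy)
    return np.argsort(scores)[::-1]   # indices, best classifier first
```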
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B..., by NTNU
The introduction of expert knowledge when learning Bayesian networks from data is known to be an excellent approach to boost the performance of automatic learning methods, especially when data is scarce. Previous Bayesian approaches to this problem introduce the expert knowledge by modifying the prior probability distributions. In this study, we propose a new methodology based on Monte Carlo simulation which starts with non-informative priors and elicits knowledge from the expert a posteriori, when the simulation ends. We also explore a new importance sampling method for Monte Carlo simulation and the definition of new non-informative priors for the structure of the network. All these approaches are experimentally validated with five standard Bayesian networks.
Read more:
http://link.springer.com/chapter/10.1007%2F978-3-642-14049-5_70
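A generic self-normalized importance-sampling sketch of the kind of Monte Carlo machinery the abstract refers to (the densities and proposal below are illustrative placeholders, not the paper's construction):

```python
import numpy as np
from scipy import stats

def importance_sampling_mean(target_pdf, proposal_sample, proposal_pdf,
                             n=100_000, seed=0):
    """Estimate E_target[X] with self-normalized importance weights
    w = target / proposal, using draws from the proposal."""
    rng = np.random.default_rng(seed)
    xs = proposal_sample(rng, n)
    w = target_pdf(xs) / proposal_pdf(xs)
    w /= w.sum()
    return float(np.sum(w * xs))

# Example: mean of N(1, 1), estimated with a deliberately wide N(0, 3) proposal.
est = importance_sampling_mean(
    target_pdf=lambda x: stats.norm.pdf(x, loc=1.0, scale=1.0),
    proposal_sample=lambda rng, n: rng.normal(0.0, 3.0, n),
    proposal_pdf=lambda x: stats.norm.pdf(x, loc=0.0, scale=3.0),
)
```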
Similar to A Statistical Perspective on Retrieval-Based Models.pdf (20)
Effective Structured Prompting by Meta-Learning and Representative Verbalizer..., by Po-Chuan Chen
This paper proposes MetaPrompter, which utilizes meta-learning to learn a prompt pool that can generate effective prompts for complex tasks. It also introduces a new soft verbalizer called Representative Verbalizer (RepVerb) that constructs label embeddings from feature embeddings. In experiments on few-shot classification tasks, MetaPrompter outperforms prior meta-prompt tuning methods while requiring significantly fewer parameters.
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf, by Po-Chuan Chen
This document summarizes a research paper titled "Quark: Controllable Text Generation with Reinforced [Un]learning". The paper introduces Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function to (un)learn unwanted properties from large language models. Quark iteratively collects samples, sorts them into quantiles based on reward, and maximizes the likelihood of high-reward samples while regularizing the model to remain close to the original. Experiments show Quark can effectively reduce toxicity, unwanted sentiment, and repetition in generated text.
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible..., by Po-Chuan Chen
The SEEK paper proposes a new method for empathetic dialogue generation that models the emotion flow between utterances in a conversation. It introduces two tasks - fine-grained emotion recognition of each utterance and predicting the emotion of the response. It also models the bi-directional interaction between emotional context and commonsense knowledge selection to generate appropriate responses. Experiments on the EmpatheticDialogues dataset show the SEEK method outperforms baselines in automatic and human evaluations.
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf, by Po-Chuan Chen
This paper evaluates the effectiveness of offline reinforcement learning methods for dialogue response generation. It finds that decision transformers and implicit Q-learning show improvements over teacher forcing, generating responses that are similar in meaning to the target while not requiring exact matching. Evaluation on several datasets demonstrates these offline RL methods achieve better performance than teacher forcing according to automated metrics and human evaluations, while avoiding issues with online reinforcement learning.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor..., by Po-Chuan Chen
The document summarizes a paper titled "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". It proposes GPTQ, a new one-shot quantization method that can quantize large generative pre-trained models like GPT-3 with 175 billion parameters to 3-4 bits within a few GPU hours with minimal accuracy loss. GPTQ improves upon existing quantization methods by employing arbitrary weight order, lazy batch updates of the Hessian matrix, and a Cholesky reformulation to scale efficiently to huge models, achieving over 2x higher compression than prior work. Experimental results show GPTQ outperforms baseline quantization and enables extremely accurate models to fit in a single GPU.
A Neural Corpus Indexer for Document Retrieval.pdf, by Po-Chuan Chen
The document describes Neural Corpus Indexer (NCI), a sequence-to-sequence neural network that indexes documents by generating relevant document identifiers directly from input queries. NCI represents documents with hierarchical semantic identifiers generated via k-means clustering. It uses a prefix-aware weight-adaptive decoder and consistency-based regularization during training. Experiments on Natural Questions and TriviaQA datasets show NCI outperforms existing retrieval methods by significantly improving recall.
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf, by Po-Chuan Chen
This document summarizes the AdaMix paper, which proposes a new parameter-efficient fine-tuning method called AdaMix. AdaMix uses a mixture of adaptation modules, where it trains multiple views of the task by randomly routing inputs to different adaptation modules. By tuning only 0.1-0.2% of the model parameters, AdaMix outperforms both full model fine-tuning and other state-of-the-art PEFT methods on various NLU and NLG tasks according to experiments on datasets like GLUE, E2E, WebNLG and DART. AdaMix works by introducing a set of adaptation modules in each transformer layer and applying a stochastic routing policy during training, along with consistency regularization and adaptation
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent..., by Po-Chuan Chen
This paper proposes LLaMA-Adapter, a lightweight method to efficiently fine-tune the LLaMA language model into an instruction-following model. It uses learnable adaption prompts prepended to word tokens in higher transformer layers. Additionally, it introduces zero-initialized attention with a gating mechanism that incorporates instructional signals while preserving pre-trained knowledge. Experiments show LLaMA-Adapter can generate high-quality responses comparable to fully fine-tuned models, and it can be extended to multi-modal reasoning tasks.
Active Retrieval Augmented Generation.pdf, by Po-Chuan Chen
The paper proposes Forward-Looking Active REtrieval augmented generation (FLARE), which iteratively retrieves information during text generation based on the predicted upcoming sentence. FLARE uses the predicted next sentence as a query to retrieve documents if it contains low-confidence tokens, then regenerates the sentence. Experiments show FLARE outperforms baselines on multiple knowledge-intensive tasks. However, FLARE did not significantly improve performance on a short-text dataset where continual retrieval of disparate information may not be needed.
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf, by Po-Chuan Chen
The document proposes an approach to generate natural language summaries for online content using offline reinforcement learning. It involves crawling Twitter data, fine-tuning models like RoBERTa and GPT-2, and using a reinforcement learning algorithm (PPO) to further train the text generation model using a reward function. The methodology, planned experiment, related work and conclusion are discussed over multiple sections and figures.
This document summarizes a paper on Cold-Start Reinforcement Learning with Softmax Policy Gradient. It introduces the limitations of existing sequence learning methods like maximum likelihood estimation and reward augmented maximum likelihood. It then describes the softmax policy gradient method which uses a softmax value function to overcome issues with warm starts and sample variance. The method achieves better performance on text summarization and image captioning tasks.
This document describes a Kaggle competition called Image to Prompts that aims to predict the text prompt for a generated image using a generative text-to-image model. The method uses an ensemble of a Vision Transformer, CLIP Interrogator, and OFA models. Analysis shows the CLIP Interrogator and OFA models generate higher quality prompts than the ViT model. Future work to improve methods includes generating a larger dataset of image-prompt pairs and training customized models on this data.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf, by Po-Chuan Chen
The document describes the RAG (Retrieval-Augmented Generation) model for knowledge-intensive NLP tasks. RAG combines a pre-trained language generator (BART) with a dense passage retriever (DPR) to retrieve and incorporate relevant knowledge from Wikipedia. RAG achieves state-of-the-art results on open-domain question answering, abstractive question answering, and fact verification by leveraging both parametric knowledge from the generator and non-parametric knowledge retrieved from Wikipedia. The retrieved knowledge can also be updated without retraining the model.
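For readers who want to try such a model, the snippet below follows the usage documented for the RAG classes in the Hugging Face transformers library; the model name and flags come from the library's documented example and may differ across versions.

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

# Load the pretrained RAG components (dummy index for a quick local test).
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# Generate an answer: retrieval and generation happen inside generate().
inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```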
Evaluating Parameter Efficient Learning for Generation.pdf, by Po-Chuan Chen
This document summarizes a research paper that evaluated parameter efficient learning methods (PERMs) for natural language generation tasks. The researchers compared PERMs like adapter tuning, prefix tuning, and prompt tuning to finetuning large pre-trained language models on several metrics. Their results showed that PERMs can outperform finetuning with fewer training samples or larger models, and that adapter tuning generalizes best across domains while prefix tuning produces the most faithful generations. The study provides insights into how PERMs can help adapt models with limited data.
Off-Policy Deep Reinforcement Learning without Exploration.pdf, by Po-Chuan Chen
BCQ is an algorithm for off-policy reinforcement learning that combines deep Q-learning with a state-conditioned generative model to produce only previously seen actions from a batch of data. BCQ uses the generative model to propose actions similar to the batch, then selects the highest valued action via a Q-network. It addresses overestimation bias through importance sampling and clipped double Q-learning. Experiments show BCQ achieves state-of-the-art performance on benchmark continuous control and discrete action tasks by constraining behavior to the batch data.
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf, by Po-Chuan Chen
This document discusses a mixture of experts (MoE) approach for reinforcement learning-based dialogue management. It introduces a MoE language model consisting of: (1) a primitive language model capable of generating diverse utterances, (2) several specialized expert models trained for different intents, and (3) a dialogue manager that selects utterances from the experts. The experts are constructed by training on labeled conversation data. Reinforcement learning is used to train the dialogue manager to optimize long-term dialogue quality by selecting among the expert utterances. Experiments demonstrate the MoE approach can generate more coherent and engaging conversations than single language models.
Is Reinforcement Learning (Not) for Natural Language Processing.pdf, by Po-Chuan Chen
The document presents RL4LMs, a library for training language models with reinforcement learning, which enables generative models to be optimized with RL algorithms. It also presents the GRUE benchmark for evaluating models, which pairs NLP tasks with reward functions capturing human preferences. Additionally, it introduces the NLPO algorithm, which dynamically learns task-specific constraints to reduce the large action space in language generation. The goal is to facilitate research in building RL methods that better align language models with human preferences.
1. A Statistical Perspective on Retrieval-Based Models
ICML, 2023
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Speaker: Po-Chuan Chen
Oct 12, 2023
2. Table of contents
1 Abstract
2 Introduction
3 Problem setup
4 Local empirical risk minimization
5 Classification in extended feature space
6 Experiments
7 Conclusion and future direction
4. Abstract
This paper uses a formal treatment of retrieval-based models to characterize their performance via a novel statistical perspective. They study the problem from two different perspectives:
- Analyzing a local learning framework
- Learning a global model using kernel methods
6. Introduction
To increase the expressiveness of an ML model, a popular way is to homogeneously scale the size of a parametric model. Such large models, however, have their own limitations:
- High computation cost
- Catastrophic forgetting
- Lack of provenance
- Poor explainability
7. Introduction
Figure 1: An illustration of a retrieval-based classification model.
8. Contributions
1 Setting up a formal framework for classification via retrieval-based models under local structure
2 Finite-sample analysis of an explicit local learning framework
3 Extending the analysis to a globally learnt model
4 Providing the first rigorous treatment of an end-to-end retrieval-based model to study its generalization by using kernel-based learning
11. Problem setup: Multiclass classification
The learner has access to $n$ training examples $S = \{(x_i, y_i)\}_{i \in [n]} \subset \mathcal{X} \times \mathcal{Y}$, sampled i.i.d. from the data distribution $\mathcal{D} := \mathcal{D}_{X,Y}$.
For a scorer $f$, the classifier takes the form
$$h_f(x) = \arg\max_{y \in \mathcal{Y}} f_y(x).$$
Given a class of scorers $\mathcal{F}^{\mathrm{global}} \subseteq \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$, learning a model amounts to finding a scorer in $\mathcal{F}^{\mathrm{global}}$ that minimizes the misclassification error, i.e., the expected 0/1 loss:
$$f^*_{0/1} = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \mathbb{P}_{\mathcal{D}}\big(h_f(X) \neq Y\big).$$
12. Problem setup: Multiclass classification
Here, a surrogate loss [1] $\ell$ is used in place of the misclassification error, and the aim is to minimize the associated population risk
$$R_\ell(f) = \mathbb{E}_{(X,Y) \sim \mathcal{D}}[\ell(f(X), Y)].$$
A good scorer can be learned by minimizing the (global) empirical risk over the function class $\mathcal{F}^{\mathrm{global}}$:
$$\hat{f} = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i), y_i),$$
with $\hat{R}_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i), y_i)$.
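As a concrete, hedged instance of minimizing the empirical surrogate risk over a simple global class, the sketch below fits a linear scorer with the logistic surrogate loss; the synthetic dataset and the use of scikit-learn's `LogisticRegression` as the ERM solver are assumptions made for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 3-class sample standing in for S = {(x_i, y_i)}; purely illustrative.
X = rng.normal(size=(300, 2)) + np.repeat(np.eye(3, 2) * 3.0, 100, axis=0)
y = np.repeat([0, 1, 2], 100)

# ERM with the logistic surrogate over linear scorers (F_global).
f_hat = LogisticRegression().fit(X, y)

# h_f(x) = argmax_y f_y(x): per-class scores, then the argmax classifier.
scores = f_hat.predict_proba(X)
h = scores.argmax(axis=1)
print("empirical 0/1 risk:", np.mean(h != y))
```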
13. Problem setup: Classification with local structure
The data in each local neighborhood is defined via $B_{x,r} := \{x' \in \mathcal{X} : \mathsf{d}(x, x') \leq r\}$, where $x \in \mathcal{X}$ and $r > 0$. $\mathcal{D}_{x,r}$ denotes the data distribution restricted to $B_{x,r}$:
$$\mathcal{D}_{x,r}(A) = \frac{\mathcal{D}(A)}{\mathcal{D}(B_{x,r} \times \mathcal{Y})}, \qquad A \subseteq B_{x,r} \times \mathcal{Y}.$$
14. Problem setup: Classification with local structure
This yields a local structure condition under which the local classification problem approximates the Bayes optimal: for a given $\varepsilon_{\mathcal{X}} > 0$ and all $x \in \mathcal{X}$,
$$\min_{f \in \mathcal{F}_x} R^x_\ell(f) \leq \min_{f \in \mathcal{F}^{\mathrm{global}}} R^x_\ell(f) + \varepsilon_{\mathcal{X}},$$
where the local population risk is defined as
$$R^x_\ell(f) = \mathbb{E}_{(X',Y') \sim \mathcal{D}_{x,r}}[\ell(f(X'), Y')].$$
15. Problem setup: Retrieval-based classification model
This paper focuses on retrieval-based methods. In local empirical risk minimization, given an instance $x$, the local ERM approach first retrieves a neighboring set $R_x = \{(x'_j, y'_j)\} \subseteq S$. It then identifies a scorer $\hat{f}^x$ from a function class $\mathcal{F}^{\mathrm{loc}} \subset \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$:
$$\hat{f}^x = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} \frac{1}{|R_x|} \sum_{(x',y') \in R_x} \ell(f(x'), y').$$
If $|R_x| = 0$, $\hat{f}^x \in \mathcal{F}^{\mathrm{loc}}$ is chosen arbitrarily.
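A minimal sketch of this local ERM procedure, assuming a Euclidean metric for $\mathsf{d}$, radius-$r$ retrieval over the training sample, and a linear (logistic) class standing in for $\mathcal{F}^{\mathrm{loc}}$; all three concrete choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_erm_predict(x, X_train, y_train, r=1.0):
    """Predict at x by local ERM: retrieve R_x, then fit a local scorer.

    R_x = {(x', y') in S : ||x' - x||_2 <= r}. Integer labels assumed.
    When R_x is empty or single-class, fall back to a majority label
    (the paper allows an arbitrary choice when |R_x| = 0)."""
    mask = np.linalg.norm(X_train - x, axis=1) <= r
    X_loc, y_loc = X_train[mask], y_train[mask]
    if len(np.unique(y_loc)) < 2:
        labels = y_loc if len(y_loc) > 0 else y_train
        return int(np.bincount(labels).argmax())
    f_x = LogisticRegression().fit(X_loc, y_loc)  # local ERM over F_loc
    return int(f_x.predict(x.reshape(1, -1))[0])
```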
16. Problem setup: Retrieval-based classification model
Another approach is classification with an extended feature space, where the scorer directly maps the augmented input $(x, R_x) \in \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^*$ to per-class scores. A scorer can be learned over the extended feature space $\mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^*$ as follows:
$$\hat{f}^{\mathrm{ex}} = \arg\min_{f \in \mathcal{F}^{\mathrm{ex}}} \hat{R}^{\mathrm{ex}}_\ell(f),$$
where $\hat{R}^{\mathrm{ex}}_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell(f(x_i, R_{x_i}), y_i)$ and the function class of interest over the extended space is $\mathcal{F}^{\mathrm{ex}} \subset \{f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}^{|\mathcal{Y}|}\}$.
19. Local empirical risk minimization
The goal is to characterize the excess risk of local ERM, i.e., to bound
$$\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big].$$
Note that $\hat{f}^X(X)$ in the above expression is a function of the retrieved set $R_X$.
20. Assumptions
First, they define the margin of scorer $f$ at a given label $y \in \mathcal{Y}$ as
$$\gamma_f(x, y) = f_y(x) - \max_{y' \neq y} f_{y'}(x).$$
To ensure the margin of the scorer $f$ deviates smoothly as $x$ varies, a scorer $f$ is $L$-coordinate Lipschitz iff for all $y \in \mathcal{Y}$ and $x, x' \in \mathcal{X}$,
$$|f_y(x) - f_y(x')| \leq L \|x - x'\|_2.$$
They also define a weak margin condition: given a distribution $\mathcal{D}$, a scorer $f$ satisfies the $(\alpha, c)$-weak margin condition iff, for all $t \geq 0$,
$$\mathbb{P}_{(X,Y) \sim \mathcal{D}}\big(|\gamma_f(X, Y)| \leq t\big) \leq c t^\alpha.$$
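The margin definition translates directly into code. This small sketch (the per-class scores are made up) computes $\gamma_f(x, y)$ and hints at how one could empirically probe the $(\alpha, c)$-weak margin condition on a sample.

```python
import numpy as np

def margin(scores, y):
    """gamma_f(x, y) = f_y(x) - max_{y' != y} f_{y'}(x), for one example."""
    others = np.delete(scores, y)
    return scores[y] - others.max()

# Illustrative per-class scores f(x) for a 4-class problem.
scores = np.array([0.1, 2.3, 0.7, -0.5])
print(margin(scores, y=1))   #  2.3 - 0.7 =  1.6 (correctly classified)
print(margin(scores, y=2))   #  0.7 - 2.3 = -1.6 (misclassified)

# Empirical probe of the (alpha, c)-weak margin condition at level t:
# estimate P(|gamma_f(X, Y)| <= t) by the fraction of such examples.
```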
21. Assumption 3.1 (True scorer function)
There exists a scorer $f_{\mathrm{true}}$ that generates the true label for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$, i.e., $\gamma_{f_{\mathrm{true}}}(x, y) > 0$; moreover, $f_{\mathrm{true}}$ is $L_{\mathrm{true}}$-coordinate Lipschitz and satisfies the $(\alpha_{\mathrm{true}}, c_{\mathrm{true}})$-weak margin condition.
22. Assumption 3.2 (Margin-based Lipschitz loss)
For any example $(x, y)$ and any scorer $f$, $\ell(f(x), y) = \ell(\gamma_f(x, y))$, and $\ell$ is a decreasing function of the margin. Moreover, the loss $\ell$ is an $L_\ell$-Lipschitz function, i.e.,
$$|\ell(\gamma) - \ell(\gamma')| \leq L_\ell |\gamma - \gamma'|, \qquad \forall \gamma \geq \gamma'.$$
23. Assumption 3.3 (Data regularity condition)
Weak density condition: there exist constants $c_{\mathrm{wdc}} > 0$ and $\delta_{\mathrm{wdc}} > 0$ such that for all $x \in \mathcal{X}$ and all $r$ with $\rho_{\mathcal{D}}(x) r^d \leq \delta_{\mathrm{wdc}}^d$,
$$\mathbb{P}_{X' \sim \mathcal{D}}\big[\mathsf{d}(X', x) \leq r\big] \geq c_{\mathrm{wdc}}^d \, \rho_{\mathcal{D}}(x) r^d.$$
Density level-set: there exists a function $f_\rho(\delta)$ with $f_\rho(\delta) \to 0$ as $\delta \to 0$, such that for any $\delta > 0$,
$$\mathbb{P}_{X \sim \mathcal{D}}\big[\rho_{\mathcal{D}}(X) \leq f_\rho(\delta)\big] \leq \delta.$$
24. Assumption 3.4 (Weak+ density condition)
There exist constants $c_{\mathrm{wdc+}} \geq 0$ and $\alpha_{\mathrm{wdc+}} > 0$ such that for all $x \in \mathcal{X}$ and $r \in [0, r_{\max}]$,
$$\frac{\mathbb{P}_{X' \sim \mathcal{D}}\big[\mathsf{d}(X', x) \leq r\big]}{\rho_{\mathcal{D}}(x)\, \mathrm{vol}_d(r)} - 1 \leq c_{\mathrm{wdc+}} r^{\alpha_{\mathrm{wdc+}}}.$$
Under this assumption the local ERM error bounds can be tightened further.
25. Excess risk bound for local ERM
We now proceed to the main results on the excess risk bound of local ERM. At $x \in \mathcal{X}$, $f^{x,*}$ denotes the minimizer of the population version of the local loss, and $f^*$ that of the global loss:
$$f^{x,*} = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} R^x_\ell(f); \qquad f^* = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} R_\ell(f).$$
The next slide shows how the expected excess risk of the local ERM solution $\hat{f}^X$ is bounded; this is called the risk decomposition.
26. Excess risk bound for local ERM: Risk decomposition
$$
\begin{aligned}
\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big]
&\leq \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[R^X_\ell(f^{X,*}) - R^X_\ell(f^*)\big]}_{\text{Local vs Global Population Optimal Risk}} \\
&\quad + \underbrace{\sum_{\mathcal{F} \in \{\mathcal{F}^{\mathrm{global}},\, \mathcal{F}^{\mathrm{loc}}\}} \mathbb{E}_{(X,Y) \sim \mathcal{D}}\bigg[\sup_{f \in \mathcal{F}} \big(R^X_\ell(f) - \ell(f(X), Y)\big)\bigg]}_{\text{Global and Local: Sample vs Retrieved Set Risk}} \\
&\quad + \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\bigg[\sup_{f \in \mathcal{F}^{\mathrm{loc}}} \big(R^X_\ell(f) - \hat{R}^X_\ell(f)\big)\bigg]}_{\text{Generalization of Local ERM}}
+ \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[R^X_\ell(f^{X,*}) - \hat{R}^X_\ell(f^{X,*})\big]}_{\text{Central Absolute Moment of } f^{X,*}}.
\end{aligned}
$$
27. Excess risk bound for local ERM
A tighter bound can be obtained by utilizing the local structure of the distribution $\mathcal{D}_{X,r}$. For any $L > 0$, define
$$M_r(L; \ell, f_{\mathrm{true}}, \mathcal{F}) := 2 L_\ell \big( L r + (2\|\mathcal{F}\|_\infty - L r)\, c_{\mathrm{true}} (2 L_{\mathrm{true}} r)^{\alpha_{\mathrm{true}}} \big).$$
For any $x \in \mathcal{X}$, the weak density condition provides a high-probability lower bound on the size of the retrieved set $R_x$.
28. Proposition 3.6
Under Assumption 3.3, for any $x \in \mathcal{X}$, $r > 0$, and $\delta > 0$,
$$\mathbb{P}_{\mathcal{D}}\big[|R_x| < N(r, \delta)\big] \leq \delta, \qquad \text{for } N(r, \delta) = n \left( c_{\mathrm{wdc}}^d \min\big\{ f_\rho(\delta/2)\, r^d,\, \delta_{\mathrm{wdc}}^d \big\} - \sqrt{\tfrac{\log(2/\delta)}{2n}} \right).$$
The next slide shows how the expected excess risk of the local ERM solution $\hat{f}^X$ is bounded; this is the excess risk bound.
29. Theorem 3.7 (Excess risk bound)
$$
\begin{aligned}
\mathbb{E}_{(X,Y) \sim \mathcal{D}}\big[\ell(\hat{f}^X(X), Y) - \ell(f^*(X), Y)\big]
&\leq \underbrace{(\varepsilon_{\mathcal{X}} + \varepsilon_{\mathrm{loc}})}_{\text{Local vs Global Optimal loss (I)}}
+ \underbrace{M_r\big(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}\big) + M_r\big(L_{\mathrm{global}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{global}}\big)}_{\text{Global and Local: Sample vs Retrieved Set Risk (II)}} \\
&\quad + \underbrace{\mathbb{E}_{(X,Y) \sim \mathcal{D}}\Big[\mathfrak{R}_{R_X}\big(\mathcal{G}(X, Y)\big) \,\Big|\, |R_X| \geq N(r, \delta)\Big]
+ 5 M_r\big(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}\big) \sqrt{\tfrac{2 \ln(4/\delta)}{N(r, \delta)}}
+ 8 \delta L_\ell \big\|\mathcal{F}^{\mathrm{loc}}\big\|_\infty}_{\text{Generalization of Local ERM and Central Absolute Moment of } f^{X,*} \text{ (III)}}.
\end{aligned}
$$
30. Excess risk bound for local ERM
The result shows a trade-off between approximation and generalization error as the retrieval radius $r$ varies.
Approximation error: comprises the two components (I) and (II) in Theorem 3.7.
Generalization error: (III) depends on the size of the retrieved set $R_X$ and the Rademacher complexity of $\mathcal{G}(X, Y)$, which is induced by $\mathcal{F}^{\mathrm{loc}}$.
Under the local ERM setting, for a fixed $\mathcal{F}^{\mathrm{loc}}$ the total approximation error increases with the radius $r$, while the generalization error decreases.
31. Illustrative examples: Local linear models
Consider the setting where $\mathcal{F}^{\mathrm{loc}}$ is the class of linear classifiers in $d$ dimensions:
$$\text{Excess Risk} \leq \underbrace{O\big(r^2\big)}_{\text{(I)}} + \underbrace{O\big(r^{\min\{\alpha_{\mathrm{true}}, 1\}}\big)}_{\text{(II)}} + \underbrace{O\left( \frac{d}{n^{(2d-1)/2d}\, r^{d/2}} + \frac{r^{\min\{\alpha_{\mathrm{true}}, 1\}}}{n^{(2d-1)/4d}\, r^{d/2}} + \frac{1}{n^{1/2d}} \right)}_{\text{(III)}}.$$
32. Illustrative examples: Feed-forward classifiers
As another example, they study the setting where $\mathcal{F}^{\mathrm{loc}}$ is the class of fully connected deep neural networks (FC-DNN):
$$\text{Excess Risk} \leq \underbrace{O\big(r^{q_{\max}+1}\big)}_{\text{(I)}} + \underbrace{O\big(r^{\min\{\alpha_{\mathrm{true}}, 1\}}\big)}_{\text{(II)}} + \underbrace{O\left( \frac{q_{\max}^{3/4} \ln(d q_{\max}/r)^{3/4} \ln(n)^{3/2}}{n^{(2d-1)/2d}\, r^{d/2}} + \frac{r^{\min\{\alpha_{\mathrm{true}}, 1\}}}{n^{(2d-1)/4d}\, r^{d/2}} + \frac{1}{n^{1/2d}} \right)}_{\text{(III)}}.$$
33. Endowing local ERM with global representations
The local ERM method takes a myopic view and does not aim to learn a global hypothesis that explains the entire data distribution. This may result in poor performance in regions of the input domain that are not well represented in the training set. A two-stage approach enables local learning to benefit from good-quality global representations, especially in sparse data regions.
34. Endowing local ERM with global representations
To address this potential shortcoming of local ERM in retrieval-based models, they discuss a two-stage approach. In the first stage, a global representation is learned using the entire dataset. In the second stage, the learned global representation is utilized at test time while solving the local ERM as previously defined.
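A hedged sketch of the two-stage idea: stage one learns a global representation on the full dataset (a truncated PCA stands in for any globally learned embedding), and stage two runs the radius-$r$ local ERM of the earlier slides in that representation space. The embedding choice, the synthetic data, and the reuse of `local_erm_predict` from the earlier sketch are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Illustrative training sample standing in for S.
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Stage 1: learn a global representation phi on the full training set.
# PCA is an illustrative stand-in for any learned global embedding.
phi = PCA(n_components=2).fit(X_train)
Z_train = phi.transform(X_train)

# Stage 2: at test time, embed the query and solve the local ERM in the
# representation space, reusing local_erm_predict from the earlier sketch.
def two_stage_predict(x, r=1.0):
    z = phi.transform(x.reshape(1, -1))[0]
    return local_erm_predict(z, Z_train, y_train, r=r)
```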
36. Classification in extended feature space
The scorer function can implicitly solve the local empirical risk minimization using retrieved neighboring labeled instances to make the classification prediction. The objective is to learn a function $f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}^{|\mathcal{Y}|}$. They also discuss a kernel-based approach to classification in the extended feature space, where the scorer function is represented as a linear combination of kernel functions evaluated on the extended feature space.
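To make the kernel construction concrete, here is one plausible kernel on the extended feature space (an assumption for illustration, not the paper's exact kernel): an RBF kernel on the query combined additively with a mean set kernel over the retrieved (neighbor vector, numeric label) pairs. A scorer could then be fit as a linear combination of such kernel evaluations, e.g. via kernel ridge regression on the resulting Gram matrix.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel between two vectors."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def extended_kernel(x, R_x, xp, R_xp, gamma=1.0):
    """Kernel on the extended space X x (X x Y)*.

    Adds a kernel on the queries to a mean kernel over the retrieved
    (neighbor vector, numeric label) pairs; averaging handles retrieved
    sets of different sizes."""
    k_query = rbf(x, xp, gamma)
    if not R_x or not R_xp:
        return k_query
    k_set = np.mean([
        rbf(np.append(u, yu), np.append(v, yv), gamma)
        for (u, yu) in R_x for (v, yv) in R_xp
    ])
    return k_query + k_set

# Illustrative call: queries in R^2, retrieved sets of (point, label) pairs.
k = extended_kernel(
    x=[0.0, 0.0], R_x=[([0.1, 0.0], 0), ([0.0, 0.2], 1)],
    xp=[0.5, 0.5], R_xp=[([0.4, 0.6], 1)],
)
print(k)
```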
38. Experiments
The paper performs experiments on both synthetic and real datasets to demonstrate the benefits of retrieval-based models in classification tasks:
- Synthetic: binary classification
- CIFAR-10: binary classification
- ImageNet: 1000-way classification
The experiments show that retrieval-based models can achieve good performance with much simpler function classes compared to traditional parametric and nonparametric models.
39. Experiments
Figure 2: Performance of local ERM with size of retrieved set across models of different complexity.
40. Conclusion and future direction
The main contributions of the paper include:
- A formal framework for retrieval-based models
- Analysis of local and global learning frameworks
- Empirical results that support the theoretical findings
Future work could explore the use of retrieval-based models in other machine learning tasks beyond classification.
41. References
[1] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. "Convexity, Classification, and Risk Bounds". Journal of the American Statistical Association 101.473 (2006), pp. 138–156. ISSN: 0162-1459. URL: http://www.jstor.org/stable/30047445.