We consider stochastic optimization problems arising in deep learning and other areas of statistical and machine learning from a statistical decision theory perspective. In particular, we investigate the admissibility (in the sense of decision theory) of the sample average solution estimator. We show that this estimator can be inadmissible in very simple settings, a phenomenon that is derived from the classical James-Stein estimator. However, for many problems of interest, the sample average estimator is indeed admissible. We will end with several open questions in this research direction.
An antiderivative of a function is a function whose derivative is the given function. The problem of antidifferentiation is interesting, complicated, and useful, especially when discussing motion.
The document discusses improper integrals: definite integrals taken over an infinite interval or of an unbounded integrand. An integral of a continuous function over an infinite interval, such as ∫ e^(-x) dx from 0 to ∞, or an integral of an unbounded continuous function, such as ∫ 1/x dx from 0 to 1, is called an improper integral. Improper integrals are evaluated by taking the limit of the integral as the endpoint approaches infinity or the point of unboundedness. If the limit exists and is finite, the improper integral converges; if the limit fails to exist or is infinite, the improper integral diverges. Examples are provided to demonstrate the evaluation and convergence of improper integrals.
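The limit definition above can be sketched numerically for the example ∫ e^(-x) dx from 0 to ∞ (the code and tolerances are ours, not from the document): the antiderivative of e^(-x) is -e^(-x), so the partial integrals 1 - e^(-b) approach 1 as b grows.

```python
import math

# Improper integral ∫_0^∞ e^(-x) dx, taken as lim_{b→∞} ∫_0^b e^(-x) dx.
# The antiderivative of e^(-x) is -e^(-x), so ∫_0^b e^(-x) dx = 1 - e^(-b).
def partial_integral(b):
    return 1.0 - math.exp(-b)

# As b grows, the partial integrals approach 1: the integral converges to 1.
for b in [1, 10, 50]:
    print(b, partial_integral(b))
```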
This document provides an introduction to multivariate and dynamic risk measures. It begins with an overview of probabilistic and measurable spaces, including finite and infinite dimensional probability spaces. It then discusses univariate functional analysis and convexity, including definitions of convex functions and the Legendre-Fenchel transformation. Several examples are provided to illustrate these concepts. The document aims to establish the necessary foundations for understanding multivariate and dynamic risk measures.
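The Legendre-Fenchel transformation mentioned above can be illustrated with a small numerical sketch (the quadratic example is ours, not from the document): f*(y) = sup_x (xy - f(x)), and for f(x) = x²/2 the conjugate is again y²/2.

```python
# Numerical Legendre-Fenchel transform: f*(y) = sup_x (x*y - f(x)).
# For f(x) = x^2/2 the conjugate is known in closed form: f*(y) = y^2/2.
def conjugate(f, y, grid):
    return max(x * y - f(x) for x in grid)

f = lambda x: 0.5 * x * x
grid = [i / 100.0 for i in range(-500, 501)]  # x in [-5, 5]

for y in [-1.0, 0.0, 2.0]:
    print(y, conjugate(f, y, grid), 0.5 * y * y)
```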
The document provides an overview of random forests, including the random forest recipe, why random forests work, and ramifications of random forests. The random forest recipe involves drawing bootstrap samples to grow trees, randomly selecting features at each split, and aggregating predictions. Random forests work by decorrelating trees, which reduces variance and leads to lower prediction error compared to individual trees. Ramifications discussed include using out-of-bag samples to estimate generalization error and calculating variable importance.
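The recipe above (bootstrap the rows, restrict each split to a random feature subset, aggregate by majority vote) can be sketched from scratch; this is a toy illustration with depth-1 trees and made-up data, not the implementation from the document.

```python
import random

def best_stump(X, y, feature_ids):
    # Depth-1 tree: pick the (feature, threshold) with the best training accuracy.
    best = None
    for j in feature_ids:
        for t in sorted(set(row[j] for row in X)):
            acc = sum((1 if row[j] > t else 0) == yi for row, yi in zip(X, y)) / len(y)
            flip = acc < 0.5          # allow the flipped rule as well
            acc = max(acc, 1 - acc)
            if best is None or acc > best[0]:
                best = (acc, j, t, flip)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_feats=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_feats)          # random feature subset
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    votes = 0
    for j, t, flip in forest:
        p = 1 if row[j] > t else 0
        votes += (1 - p) if flip else p
    return 1 if votes * 2 >= len(forest) else 0

# Toy data: label is 1 when the sum of the two features is positive.
X = [[-2, -1], [-1, -2], [-1, 1], [1, -2], [1, 2], [2, 1], [2, -1], [-2, 2]]
y = [1 if a + b > 0 else 0 for a, b in X]
forest = fit_forest(X, y)
acc = sum(predict(forest, row) == yi for row, yi in zip(X, y)) / len(y)
print("training accuracy:", acc)
```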
Lesson 14: Derivatives of Logarithmic and Exponential Functions (slides) - Matthew Leingang
The exponential function is pretty much the only function whose derivative is itself. The derivative of the natural logarithm function is also beautiful as it fills in an important gap. Finally, the technique of logarithmic differentiation allows us to find derivatives without the product rule.
This document discusses sparse non-parametric regression. It notes that while non-parametric regression suffers from the curse of dimensionality, assuming sparsity in the features can alleviate this. Specifically, it proposes estimating functions that depend on at most s out of p total features. However, directly solving this sparse problem is computationally difficult. Neural networks with L1 constraints on the first-layer weights are proposed as one approach, but open questions remain about better function classes and optimization methods for this sparse non-parametric regression problem.
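The L1-penalized objective mentioned above can be sketched as follows (sizes, weights, and data are made up): an L1 penalty on the first-layer weights encourages the network to use only a few of the p input features.

```python
import math

def forward(W1, w2, x):
    # one hidden layer with tanh activations
    hidden = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W1]
    return sum(a * h for a, h in zip(w2, hidden))

def l1_first_layer(W1):
    # L1 norm of all first-layer weights
    return sum(abs(wij) for row in W1 for wij in row)

def penalized_loss(W1, w2, data, lam):
    mse = sum((forward(W1, w2, x) - t) ** 2 for x, t in data) / len(data)
    return mse + lam * l1_first_layer(W1)

# Toy network in which only input feature 0 is used: the weight columns
# for features 1 and 2 are zero, so the L1 penalty is small.
W1 = [[1.0, 0.0, 0.0], [-0.5, 0.0, 0.0]]
w2 = [1.0, 1.0]
data = [([1.0, 5.0, -3.0], 0.3), ([0.5, -2.0, 7.0], 0.2)]
print(penalized_loss(W1, w2, data, lam=0.1))
```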
At times it is useful to consider a function whose derivative is a given function. We look at the general idea of reversing the differentiation process and its applications to rectilinear motion.
Lesson 20: Derivatives and the Shapes of Curves (slides) - Matthew Leingang
This document contains lecture notes on derivatives and the shapes of curves from a Calculus I class taught by Professor Matthew Leingang at New York University. The notes cover using derivatives to determine the intervals where a function is increasing or decreasing, classifying critical points as maxima or minima, using the second derivative to determine concavity, and applying the first and second derivative tests. Examples are provided to illustrate finding intervals of monotonicity for various functions.
This document contains notes from a calculus class lecture on evaluating definite integrals. It discusses using the evaluation theorem to evaluate definite integrals, writing derivatives as indefinite integrals, and interpreting definite integrals as the net change of a function over an interval. The document also contains examples of evaluating definite integrals, properties of integrals, and an outline of the key topics covered.
The document discusses evaluating definite integrals. It begins by reviewing the definition of the definite integral as a limit. It then discusses estimating integrals using the midpoint rule and properties of integrals such as integrals of nonnegative functions being nonnegative and integrals being "increasing" if one function is greater than another. An example is worked out using the midpoint rule to estimate an integral. The document provides an outline of topics and notation for integrals.
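A midpoint-rule estimate like the worked example above can be sketched in a few lines (the integrand ∫₀¹ x² dx = 1/3 is our toy choice, not the document's example).

```python
# Midpoint rule: sample f at the midpoint of each of n equal subintervals.
def midpoint_rule(f, a, b, n):
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

est = midpoint_rule(lambda x: x * x, 0.0, 1.0, 100)
print(est)  # close to the exact value 1/3
```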
There are various reasons why we would want to find the extreme (maximum and minimum) values of a function. Fermat's Theorem tells us we can find local extreme points by looking at critical points; combined with checking the endpoints, this process is known as the Closed Interval Method.
The closed interval method tells us how to find the extreme values of a continuous function defined on a closed, bounded interval: we check the end points and the critical points.
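The closed interval method described above can be sketched directly (the function f(x) = x³ - 3x on [0, 3] is our toy example; its critical points come from f'(x) = 3x² - 3 = 0, i.e. x = ±1).

```python
# Closed Interval Method: the extreme values of a continuous function on
# [a, b] occur at critical points inside the interval or at the endpoints.
def closed_interval_extremes(f, critical_points, a, b):
    candidates = [a, b] + [c for c in critical_points if a < c < b]
    values = [(f(x), x) for x in candidates]
    return min(values), max(values)

f = lambda x: x ** 3 - 3 * x
(mn, argmin), (mx, argmax) = closed_interval_extremes(f, [-1.0, 1.0], 0.0, 3.0)
print("min", mn, "at", argmin, "| max", mx, "at", argmax)
```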
Convex Analysis and Duality (based on "Functional Analysis and Optimization" ...) - Katsuya Ito
In this presentation, we explain the monograph "Functional Analysis and Optimization" by Kazufumi Ito:
https://kito.wordpress.ncsu.edu/files/2018/04/funa3.pdf
Our goal in this presentation is to:
- Understand the basic notions of functional analysis: lower semicontinuity, the subdifferential, and the conjugate functional
- Understand the formulation of the duality problem: the primal (P), perturbed (Py), and dual (P∗) problems
- Understand the primal-dual relationships: sup(P∗) ≤ inf(P), conditions for inf(P) = sup(P∗), and sup inf L ≤ inf sup L
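The weak minimax relationship sup inf L ≤ inf sup L can be checked on a finite example (a small payoff matrix of our choosing stands in for a general Lagrangian).

```python
# Finite check of the minimax inequality sup_i inf_j L[i][j] ≤ inf_j sup_i L[i][j].
L = [[3, 1, 4],
     [1, 5, 9],
     [2, 6, 5]]

sup_inf = max(min(row) for row in L)                              # sup_i inf_j
inf_sup = min(max(L[i][j] for i in range(3)) for j in range(3))   # inf_j sup_i
print(sup_inf, inf_sup)
```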
The document discusses curve sketching of functions by analyzing their derivatives. It provides:
1) A checklist for graphing a function which involves finding where the function is positive/negative/zero, its monotonicity from the first derivative, and concavity from the second derivative.
2) An example of graphing the cubic function f(x) = 2x^3 - 3x^2 - 12x through analyzing its derivatives.
3) Explanations of the increasing/decreasing test and concavity test to determine monotonicity and concavity from a function's derivatives.
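The checklist above can be carried out in code for the cubic from the example: f(x) = 2x³ - 3x² - 12x has f'(x) = 6x² - 6x - 12 = 6(x + 1)(x - 2) and f''(x) = 12x - 6 (the sample points used below are ours).

```python
# Sign analysis of the derivatives of f(x) = 2x^3 - 3x^2 - 12x.
fp = lambda x: 6 * x ** 2 - 6 * x - 12    # f', zero at x = -1 and x = 2
fpp = lambda x: 12 * x - 6                # f'', zero at x = 1/2

# f' is positive, negative, positive on the three intervals cut by x = -1, 2:
signs = [fp(x) > 0 for x in (-2, 0, 3)]
print("increasing pattern:", signs)
print("concave down at x=0:", fpp(0) < 0, "| concave up at x=1:", fpp(1) > 0)
```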
Approximate Bayesian model choice via random forests - Christian Robert
The document describes approximate Bayesian computation (ABC) methods for model choice when likelihoods are intractable. ABC generates parameter-dataset pairs from the prior and retains those where the simulated and observed datasets are similar according to a distance measure on summary statistics. For model choice, ABC approximates posterior model probabilities by the proportion of simulations from each model that are retained. Machine learning techniques can also be used to infer the most likely model directly from the simulated summary statistics.
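The rejection-ABC scheme for model choice described above can be sketched on a toy setup (the two models, priors, and tolerance below are made up): simulate from the prior, keep simulations whose summary statistic lands near the observed one, and approximate posterior model probabilities by the retained proportions.

```python
import random

rng = random.Random(42)
obs_mean = 0.8                       # observed summary statistic
n_sims, eps, n = 20000, 0.05, 50     # simulations, tolerance, sample size

kept = {0: 0, 1: 0}
for _ in range(n_sims):
    m = rng.randrange(2)             # model prior: 1/2 each
    # model 0: p ~ U(0, 0.5); model 1: p ~ U(0.5, 1)
    p = rng.uniform(0.0, 0.5) if m == 0 else rng.uniform(0.5, 1.0)
    sim_mean = sum(rng.random() < p for _ in range(n)) / n
    if abs(sim_mean - obs_mean) < eps:
        kept[m] += 1                 # retained: summaries are close

total = kept[0] + kept[1]
print("posterior prob of model 1 ≈", kept[1] / total)
```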
This document discusses distorted risk measures and copulas in actuarial science. It introduces distorted risk measures as expectations under a distorted probability measure induced by a distortion function. Common distortion functions and associated risk measures are presented, including Value-at-Risk, Tail Value-at-Risk, and the proportional hazard measure. Archimedean copulas are defined using a generator function and can model dependence through a latent factor. Hierarchical and distorted Archimedean copulas are discussed as ways to flexibly model multivariate dependence structures.
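An Archimedean copula built from a generator, as described above, can be sketched with the Clayton family (our choice of example): the generator φ(t) = (t^(-θ) - 1)/θ yields C(u, v) = (u^(-θ) + v^(-θ) - 1)^(-1/θ).

```python
# Clayton copula, an Archimedean copula with generator φ(t) = (t^(-θ) - 1)/θ.
def clayton(u, v, theta):
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

theta = 2.0
# Boundary property of any copula: C(u, 1) = u.
print(clayton(0.3, 1.0, theta))   # ≈ 0.3
print(clayton(0.3, 0.7, theta))   # joint probability under positive dependence
```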
Application of partial derivatives with two variables - Sagar Patel
Application of Partial Derivatives with Two Variables
Maximum And Minimum Values.
Tangent and Normal.
Error And Approximation.
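The "Error and Approximation" topic above is the total-differential estimate; a small numeric sketch (our numbers): for A(x, y) = x·y, the differential dA ≈ (∂A/∂x)dx + (∂A/∂y)dy = y·dx + x·dy approximates the change in A caused by small errors dx, dy.

```python
# Error propagation via the total differential for A(x, y) = x * y.
x, y = 10.0, 6.0
dx, dy = 0.1, -0.05

exact_change = (x + dx) * (y + dy) - x * y   # true change in A
approx_change = y * dx + x * dy              # total-differential estimate
print(exact_change, approx_change)           # the two agree to first order
```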
Bias-variance decomposition in Random Forests - Gilles Louppe
This document discusses bias-variance decomposition in random forests. It explains that combining predictions from multiple randomized models can achieve better results than a single model by reducing variance. Random forests work by constructing decision trees on randomly selected subsets of data and features, averaging their predictions. This randomization increases bias but reduces variance, providing an effective bias-variance tradeoff. The document provides theorems on how the expected generalization error of random forests and individual trees can be decomposed into noise, bias, and variance components.
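The variance-reduction claim above can be checked with a small simulation (the Gaussian toy predictor is ours): averaging M independent randomized predictors leaves the bias unchanged but shrinks the variance term roughly by a factor of M.

```python
import random

rng = random.Random(0)
true_value, M, trials = 1.0, 25, 4000

def one_predictor():
    return true_value + rng.gauss(0, 1)     # unbiased, unit variance

singles = [one_predictor() for _ in range(trials)]
ensembles = [sum(one_predictor() for _ in range(M)) / M for _ in range(trials)]

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

print("single-model variance:   ", variance(singles))     # ≈ 1
print("ensemble variance (M=25):", variance(ensembles))   # ≈ 1/25
```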
The document describes the simplex method for solving linear programming problems. It begins with an example problem of maximizing beer production given constraints on barley and corn supplies. It introduces slack variables to transform inequalities into equalities. The coefficients are written in a tableau and an initial basic feasible solution is chosen. Gaussian elimination is performed to introduce new basic variables while removing others. The process is repeated, moving through the feasible space, until an optimal solution is found without any negative entries in the objective function row. Duality between minimizing costs and maximizing profits is also discussed.
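An LP like the brewing example above can be checked by brute force (the profit and resource numbers below are made up for illustration): the optimum of a linear program lies at a vertex of the feasible region, i.e. at an intersection of constraint boundaries, which is exactly what the simplex tableau walks between.

```python
from itertools import combinations

# Maximize 13*a + 23*b subject to resource constraints ca*a + cb*b <= rhs
# (the last two rows encode a >= 0 and b >= 0).
cons = [(5, 15, 480), (4, 4, 160), (35, 20, 1190), (-1, 0, 0), (0, -1, 0)]
profit = lambda a, b: 13 * a + 23 * b

best = None
for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if det == 0:
        continue                              # parallel boundaries
    a = (r1 * b2 - r2 * b1) / det             # Cramer's rule for the vertex
    b = (a1 * r2 - a2 * r1) / det
    if all(ca * a + cb * b <= rhs + 1e-9 for ca, cb, rhs in cons):
        if best is None or profit(a, b) > profit(*best):
            best = (a, b)

print("optimal vertex:", best, "profit:", profit(*best))
```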
The document discusses applications of differentiation, including:
- How derivatives help locate maximum and minimum values of functions by determining if a function is increasing or decreasing over an interval.
- Examples of optimization problems involving finding maximum/minimum values, such as the optimal shape of a can.
- Key terms related to maximum/minimum values including local/global extrema, critical points, and how the first and second derivatives relate to concavity.
- An example problem involving finding the maximum area of a rectangular temple room given a perimeter constraint.
This document provides an overview of key calculus concepts and formulas taught in a Calculus I course at Miami Dade College - Hialeah Campus. The topics covered include limits and derivatives, integration, optimization techniques, and applications of calculus to economics, business, physics, and other fields. The document is intended as a study guide for students in the Calculus I class taught by Professor Mohammad Shakil.
New Insights and Perspectives on the Natural Gradient Method - Yoonho Lee
The document discusses the natural gradient method for optimizing neural networks. It explains that the natural gradient finds the direction of steepest descent in function space rather than parameter space. The natural gradient is invariant to reparameterization. For most neural networks, natural gradient descent is equivalent to a second-order optimization method called the generalized Gauss-Newton method. The natural gradient takes into account the geometry of the parameter space defined by the Fisher information matrix.
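The Fisher preconditioning described above can be sketched in one dimension (a Bernoulli coin model of our choosing, not from the slides): the Fisher information for n observations is F(θ) = n/(θ(1-θ)), and the natural gradient is F⁻¹ times the ordinary gradient. In this toy case a single natural-gradient step with unit step size lands exactly on the MLE heads/n.

```python
def grad_loglik(theta, heads, n):
    # ordinary gradient of the Bernoulli log-likelihood
    return heads / theta - (n - heads) / (1.0 - theta)

def fisher(theta, n):
    # Fisher information for n Bernoulli observations
    return n / (theta * (1.0 - theta))

theta, heads, n, lr = 0.3, 7, 10, 1.0
g = grad_loglik(theta, heads, n)
natural_step = lr * g / fisher(theta, n)     # F^{-1} g
print("ordinary gradient:", g)
print("updated theta:", theta + natural_step)   # equals heads/n here
```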
Maximum likelihood estimation of regularisation parameters in inverse problem... - Valentin De Bortoli
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
The document discusses derivative-free optimization methods for non-convex problems. It introduces direct-search methods that optimize a function using only function evaluations in different directions, without requiring derivatives. It then covers model-based approaches that fit a polynomial model to approximated points and minimize that instead of the original function. Trust-region methods are discussed that build local models within a region Δ around each point and iteratively improve the model and minimize within each region.
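The direct-search idea above can be sketched as compass search (the quadratic objective is our toy example): poll the objective along coordinate directions, move on improvement, and shrink the step on failure, with no gradients anywhere.

```python
# Compass (coordinate direct) search: a derivative-free optimizer.
def compass_search(f, x, step=1.0, tol=1e-6, max_iter=10000):
    x = list(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for s in (+step, -step):
                trial = list(x)
                trial[i] += s
                if f(trial) < f(x):          # only function values are used
                    x, improved = trial, True
        if not improved:
            step *= 0.5                      # shrink the poll radius
    return x

f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
sol = compass_search(f, [5.0, 5.0])
print("found minimizer ≈", sol)              # near (1, -2)
```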
A Statistical Perspective on Retrieval-Based Models.pdf - Po-Chuan Chen
This paper presents a statistical perspective on retrieval-based models for classification. It analyzes such models using two different frameworks: local empirical risk minimization and classification in an extended feature space. For local empirical risk minimization, the paper provides assumptions and derives an excess risk bound that decomposes the error of the local model into different terms related to the local vs global optimal risk, sample vs retrieved set risk, generalization error of the local model, and central absolute moment of the local model. It also shows how to tighten the bound by leveraging the local structure of the data distribution.
Classification and regression based on derivatives: a consistency result for ... - tuxette
This document summarizes a presentation on using derivatives for classification and regression of functions. It discusses using smoothing splines to estimate functions and their derivatives from discretely sampled data. A consistency result is presented: a classifier or regression function built from the estimated derivatives achieves the optimal Bayes risk as the number of sampling points and examples increases. The key idea is to combine smoothing splines, which consistently estimate functions and their derivatives, with a consistent classifier or regressor applied to the estimated values.
This document summarizes a seminar on kernels and support vector machines. It begins by explaining why kernels are useful for increasing flexibility and speed compared to direct inner product calculations. It then covers definitions of positive definite kernels and how to prove a function is a kernel. Several kernel families are discussed, including translation invariant, polynomial, and non-Mercer kernels. Finally, the document derives the primal and dual problems for support vector machines and explains how the kernel trick allows non-linear classification.
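The positive-definiteness property discussed above can be spot-checked numerically (the RBF kernel and the point set are our choices): the Gram matrix of k(x, y) = exp(-γ‖x - y‖²) on any point set is symmetric and positive semi-definite.

```python
import math

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
K = [[rbf(p, q) for q in pts] for p in pts]      # Gram matrix

# Symmetry, and v^T K v >= 0 for a few arbitrary vectors v:
sym = all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(4) for j in range(4))
def quad_form(v):
    return sum(v[i] * K[i][j] * v[j] for i in range(4) for j in range(4))

print("symmetric:", sym)
print("v^T K v:", quad_form([1, -2, 3, -1]), quad_form([-1, 1, 1, -1]))
```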
Understanding Random Forests: From Theory to Practice - Gilles Louppe
This document provides an overview of random forests and their implementation. It begins with motivating random forests as a way to reduce variance in decision trees. It then discusses growing and interpreting random forests through variable importances. The document presents theoretical results on the decomposition and properties of variable importances. It concludes by describing the efficient implementation of random forests in scikit-learn, including its modular design and optimizations for speed.
This document discusses error analysis for quasi-Monte Carlo methods. It introduces the trio error identity that decomposes the error into three terms: the variation of the integrand, the discrepancy of the sampling measure from the probability measure, and the alignment between the integrand and the difference between the measures. Several examples are provided to illustrate the identity, including integration over a reproducing kernel Hilbert space. The discrepancy term can be evaluated in O(n^2) operations and converges at different rates depending on the sampling method and properties of the integrand.
The low-rank basis problem for a matrix subspace - Tasuku Soma
This document summarizes a presentation on finding low-rank bases for matrix subspaces. It introduces the low-rank basis problem, describes a greedy algorithm to solve it using two phases - rank estimation and alternating projection, and proves local convergence guarantees for the algorithm. Experimental results on synthetic and image data demonstrate the algorithm can recover known low-rank bases and separate mixed images. Comparisons are made to tensor decomposition methods for the special case of rank-1 bases.
An overview of Rademacher Averages, a fundamental concept from statistical learning theory that can be used to derive uniform, sample-dependent bounds on the deviation of sample averages from their expectations.
The document discusses multiobjective optimization and evolutionary algorithms. It defines multiobjective optimization problems as having multiple objective functions to minimize subject to constraints. Pareto optimal solutions are those that are not dominated by any other solutions in terms of all objectives. Evolutionary algorithms are used to approximate the Pareto front and find Pareto optimal solutions. Non-dominated sorting and crowding distance are used to select the next population in NSGA-II. The hypervolume indicator measures the size of the space covered by the Pareto front approximations.
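Pareto dominance, the building block of NSGA-II's non-dominated sorting described above, can be sketched directly (toy minimization points of our choosing): a point dominates another if it is no worse in every objective and strictly better in at least one.

```python
# Pareto dominance for minimization, and extraction of the first front.
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

points = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 4)]
front = [p for p in points if not any(dominates(q, p) for q in points)]
print("first Pareto front:", front)
```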
1. Partial derivatives describe how a function changes with respect to one variable while holding the other variables constant. The partial derivative of Z with respect to x is denoted ∂Z/∂x or f_x.
2. Optimization problems in calculus involve finding the maximum or minimum values of functions, which can be used to determine the best way to do something.
3. A function has a global/absolute maximum at c if it is greater than or equal to the function values at all other points, and a global/absolute minimum if it is less than or equal to all other points.
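The definition in point 1 can be checked numerically (the function Z = x²y is our toy example): a finite difference in x alone, with y held fixed, should agree with ∂Z/∂x = 2xy.

```python
# Finite-difference check of a partial derivative: vary x, hold y constant.
def Z(x, y):
    return x * x * y

def dZ_dx(x, y, h=1e-6):
    return (Z(x + h, y) - Z(x - h, y)) / (2 * h)

x, y = 3.0, 4.0
print(dZ_dx(x, y), 2 * x * y)   # both ≈ 24
```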
Maximizing Submodular Function over the Integer Lattice - Tasuku Soma
The document describes generalizations of submodular function maximization and submodular cover problems from sets to integer lattices. It presents polynomial-time approximation algorithms for maximizing monotone diminishing return (DR) submodular functions subject to constraints like cardinality, polymatroid and knapsack on the integer lattice. It also presents an algorithm for the DR-submodular cover problem of minimizing cost subject to achieving a quality threshold. The results provide useful extensions of submodular optimization to settings that cannot be modeled as set functions.
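The set-function special case of the monotone submodular maximization above can be sketched with greedy max coverage (toy sets of our choosing): coverage f(S) = |union of chosen sets| is monotone submodular, and greedy under a cardinality constraint achieves a (1 - 1/e) approximation.

```python
# Greedy maximization of a monotone submodular coverage function, k = 2.
sets = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}

def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)

k, chosen = 2, []
for _ in range(k):
    gain = lambda name: coverage(chosen + [name]) - coverage(chosen)
    best = max((n for n in sets if n not in chosen), key=gain)  # largest marginal gain
    chosen.append(best)

print("greedy choice:", chosen, "coverage:", coverage(chosen))
```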
This document discusses a theory solver for the theory of uninterpreted functions (UF) in satisfiability modulo theories (SMT). It presents the key components of a UF solver, including union-find algorithms to handle equalities, congruence closure to handle functions, and computing theory conflicts. The solver decides satisfiability of UF formulas in incremental, backtrackable, and theory-propagating manner. It can also be used as a base layer for other theory solvers like LRA.
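The union-find component of the UF solver described above can be sketched as follows (a minimal version, not the solver's actual code): asserted equalities are merged with union, and two terms are known equal exactly when find returns the same representative.

```python
# Union-find with path halving, the core of equality reasoning in a UF solver.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression (halving)
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Assert the equalities a = b and b = c.
union("a", "b")
union("b", "c")
print(find("a") == find("c"))   # a = c now follows by transitivity
print(find("a") == find("d"))   # d is still in its own class
```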
This document proposes a linear programming (LP) based approach for solving maximum a posteriori (MAP) estimation problems on factor graphs that contain multiple-degree non-indicator functions. It presents an existing LP method for problems with single-degree functions, then introduces a transformation to handle multiple-degree functions by introducing auxiliary variables. This allows applying the existing LP method. As an example, it applies this to maximum likelihood decoding for the Gaussian multiple access channel. Simulation results demonstrate the LP approach decodes correctly with polynomial complexity.
Murphy, Machine Learning: A Probabilistic Perspective, Ch. 9 - Daisuke Yoneoka
This document summarizes key concepts about the exponential family and generalized linear models (GLMs). It defines the exponential family and provides examples like the Bernoulli, multinomial, and Gaussian distributions. The exponential family has important properties like finite sufficient statistics, existence of conjugate priors, and convexity. Maximum likelihood estimation for the exponential family involves matching sample moments to population moments. Conjugate priors allow tractable Bayesian inference for the exponential family. The document outlines maximum entropy derivation of the exponential family and how GLMs can generate classifiers.
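The moment-matching property above has a one-line instance (the Bernoulli data are made up): the sufficient statistic of a Bernoulli model is x itself, so matching the sample moment to the model moment E_θ[x] = θ gives the MLE θ̂ = x̄, which a grid search over the log-likelihood confirms.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

sample_moment = sum(data) / len(data)   # empirical E[x]
theta_hat = sample_moment               # model mean E_θ[x] = θ, so θ̂ = x̄

# Sanity check: θ̂ maximizes the log-likelihood over a fine grid.
loglik = lambda t: sum(x * math.log(t) + (1 - x) * math.log(1 - t) for x in data)
best = max((i / 100 for i in range(1, 100)), key=loglik)
print("MLE:", theta_hat, "| grid argmax:", best)
```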
Higher-order (F, α, β, ρ, d)-convexity is considered. A multiobjective programming problem (MP) is considered. Mond-Weir and Wolfe type duals are considered for multiobjective programming problem. Duality results are established for multiobjective programming problem under higher-order (F, α, β, ρ, d)- convexity assumptions. The results are also applied for multiobjective fractional programming problem.
Interpretable Sparse Sliced Inverse Regression for digitized functional datatuxette
The document discusses interpretable sparse sliced inverse regression (IS-SIR) for functional data regression. It begins with background on using metamodels as proxies for computationally expensive agronomic models to understand relationships between climate inputs and plant outputs. SIR is presented as a semi-parametric regression technique that identifies relevant subspaces to predict outputs from functional inputs. The proposal involves combining SIR with automatic interval selection to point out interpretable predictor intervals. Simulations are discussed to evaluate the proposed method.
Here are the steps to solve this problem:
(a) Let t = time and y = height. Then the differential equation is:
dy/dt = -32 ft/sec^2
Integrate both sides:
∫dy = ∫-32 dt
y = -32t + C
Initial conditions: at t = 0, y = 0
0 = -32(0) + C
C = 0
Therefore, the equation is: y = -32t
When y = 0 (the maximum height), t = 0.625 sec
(b) Put t = 0.625 sec into the equation:
y = -32(0.625) = -20 ft
On the Family of Concept Forming Operators in Polyadic FCADmitrii Ignatov
Triadic Formal Concept Analysis (3FCA) was introduced by Lehman and Wille almost two decades ago. And many researchers work in Data Mining and Formal Concept Analysis using the notions of closed sets, Galois and closure operators, closure systems. However, up-to-date even though that different researchers actively work on mining triadic and n-ary relations, a proper closure operator for enumeration of triconcepts, i.e. maximal triadic cliques of tripartite hypergaphs, was not introduced. In this talk we show that the previously introduced operators for obtaining triconcepts are not always consistent, describe their family and study their properties. We also introduce the notion of maximal switching generator to explain why such concept-forming operators are not closure operators due to violation of monotonicity property.
Deep Learning Opening Workshop - Admissibility of Solution Estimators in Stochastic Optimization - Amitabh Basu, August 12, 2019
1. Admissibility of solution estimators to stochastic optimization problems
Amitabh Basu
Joint work with Tu Nguyen and Ao Sun
Foundations of Deep Learning, Opening Workshop, SAMSI, Durham, August 2019
2. A general stochastic optimization problem

F : X × R^m → R, and ξ is a random variable taking values in R^m.

    min_{x ∈ X}  E_ξ[ F(x, ξ) ]

3. Example: Supervised machine learning. One sees samples (z, y) ∈ R^n × R of labeled data from some (joint) distribution, and one aims to find a function f ∈ F in a hypothesis class F that minimizes the expected loss E_{(z,y)}[ ℓ(f(z), y) ], where ℓ : R × R → R_+ is some loss function. Then X = F, m = n + 1, ξ = (z, y), and F(f, (z, y)) = ℓ(f(z), y).

4. Example: (News) vendor problem. The vendor buys some units of a product (newspapers) from a supplier at a cost of c > 0 dollars/unit; at most u units are available. Demand for the product is stochastic. The product is sold at price p > c dollars/unit. At the end of the day, the vendor can return unsold product to the supplier at r < c dollars/unit. Find the number of units to buy that maximizes the expected profit (equivalently, minimizes the expected loss).

5. For the vendor problem: m = 1, X = [0, u], and

    F(x, ξ) = c x − p min{x, ξ} − r max{x − ξ, 0}.
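The vendor example is concrete enough to solve by SAA directly. A minimal Python sketch, with parameter values of my own choosing (not from the talk): it builds the empirical objective from demand samples and minimizes it over [0, u], using the fact that F(·, ξ) is piecewise linear in x, so a minimizer lies at a sample value or an endpoint.

```python
import numpy as np

# Illustrative parameters (my own, not from the talk):
# cost c, sale price p, return (salvage) price r, capacity u, with r < c < p.
c, p, r, u = 3.0, 5.0, 1.0, 100.0

def F(x, xi):
    # Loss for ordering x units when realized demand is xi:
    # purchase cost minus sales revenue minus salvage revenue.
    return c * x - p * np.minimum(x, xi) - r * np.maximum(x - xi, 0.0)

def saa_newsvendor(samples):
    # The empirical objective (1/n) * sum_i F(x, xi_i) is piecewise linear
    # in x, so a minimizer lies at a sample value or an endpoint of [0, u].
    candidates = np.clip(np.concatenate(([0.0, u], samples)), 0.0, u)
    return min(candidates, key=lambda x: F(x, samples).mean())

rng = np.random.default_rng(0)
demand = rng.gamma(shape=4.0, scale=10.0, size=2000)   # stochastic demand
x_saa = saa_newsvendor(demand)

# For r < c < p the true optimizer is the (p - c)/(p - r) quantile of the
# demand distribution, so SAA should land near the empirical quantile.
x_quantile = np.quantile(demand, (p - c) / (p - r))
```

The classical critical-fractile formula is what makes this example checkable: the SAA solution is essentially the empirical (p − c)/(p − r) quantile of the observed demands.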
6. Solving the problem

F : X × R^m → R, and ξ is a random variable taking values in R^m.

    min_{x ∈ X}  E_ξ[ F(x, ξ) ]

7. Solve the problem only given access to n i.i.d. samples of ξ.

8. Natural idea: Given samples ξ^1, …, ξ^n ∈ R^m, solve the deterministic problem

    min_{x ∈ X}  (1/n) Σ_{i=1}^n F(x, ξ^i)

9. Stochastic optimizers call this sample average approximation (SAA); machine learners call this empirical risk minimization.
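As a sanity check on the SAA idea, here is a toy case of my own (not from the slides) where both the true and the empirical problems have closed forms: with F(x, ξ) = (x − ξ)², the true objective E[F(x, ξ)] is minimized at x = E[ξ], and the SAA objective is minimized at the sample mean, so SAA/ERM reduces to ordinary mean estimation.

```python
import numpy as np

# Toy example (mine, not from the slides): F(x, xi) = (x - xi)^2.
# The true objective E[(x - xi)^2] is minimized at x = E[xi]; the SAA
# objective (1/n) * sum_i (x - xi_i)^2 is minimized at the sample mean.
rng = np.random.default_rng(1)
mu_true = 2.5

def saa_quadratic(samples):
    # Setting the derivative of the empirical objective to zero
    # gives x = mean of the samples.
    return samples.mean()

errors = {n: abs(saa_quadratic(rng.normal(mu_true, 1.0, size=n)) - mu_true)
          for n in (10, 100, 10000)}
```

As n grows, the SAA minimizer concentrates around the true minimizer E[ξ] = 2.5 at the usual 1/√n rate.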
10. Concrete problem

F(x, ξ) = ξ^T x.
X ⊆ R^d is a compact set (e.g., a polytope, or the integer points in a polytope). So m = d.
ξ ∼ N(µ, Σ).

    min_{x ∈ X} E_ξ[ F(x, ξ) ]  =  min_{x ∈ X} E_ξ[ ξ^T x ]  =  min_{x ∈ X} µ^T x

11. Solve the problem only given access to n i.i.d. samples of ξ. Important: µ is unknown.

12. Sample average approximation (SAA):

    min_{x ∈ X} (1/n) Σ_{i=1}^n F(x, ξ^i)  =  min_{x ∈ X} ξ̄^T x,   where ξ̄ := (1/n) Σ_{i=1}^n ξ^i.
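For the concrete problem, SAA just replaces the unknown µ by the sample mean ξ̄ and optimizes ξ̄^T x over X. A minimal sketch, taking X = [0, 1]^d (my own illustrative choice of polytope), where a linear objective is minimized coordinatewise at a vertex:

```python
import numpy as np

# Sketch of SAA for the concrete linear problem, with X = [0, 1]^d
# (my own illustrative choice of polytope, not from the talk).
rng = np.random.default_rng(2)
d, n = 5, 1000
mu = rng.normal(size=d)                 # unknown to the decision maker
xi = rng.multivariate_normal(mu, np.eye(d), size=n)
xi_bar = xi.mean(axis=0)                # sample average of the xi^i

# A linear objective over [0, 1]^d is minimized coordinatewise:
# set x_j = 1 exactly when the j-th coefficient is negative.
x_saa = (xi_bar < 0).astype(float)      # SAA decision
x_opt = (mu < 0).astype(float)          # true minimizer of mu^T x

# Decision-theoretic loss L(mu, x_saa) = mu^T x_saa - mu^T x_opt >= 0.
regret = mu @ x_saa - mu @ x_opt
```

The regret is zero unless some coordinate of ξ̄ has the opposite sign from the corresponding coordinate of µ, which becomes increasingly unlikely as n grows.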
13. A quick tour of statistical decision theory

Set of states of nature, modeled by a set Θ.
Set of possible actions to take, modeled by A.
In a particular state of nature θ ∈ Θ, the performance of any action a ∈ A is evaluated by a loss function L(θ, a). Goal: choose an action to minimize loss.
(Partial/incomplete) information about θ is obtained through a random variable y taking values in a sample space χ. The distribution of y depends on the particular state of nature θ, denoted by P_θ.

14. Decision rule: takes y ∈ χ as input and reports an action a ∈ A. Denoted δ : χ → A.
15. Our problem cast as a statistical decision problem

X ⊆ R^d is a compact set; ξ ∼ N(µ, I).

    min_{x ∈ X} E_ξ[ F(x, ξ) ]  =  min_{x ∈ X} E_ξ[ ξ^T x ]  =  min_{x ∈ X} µ^T x

States of nature: Θ = R^d = {all possible µ ∈ R^d}.
Set of actions: X ⊆ R^d.
Loss function:

    L(µ̄, x̄) = E_{ξ ∼ N(µ̄, I)}[ F(x̄, ξ) ] − min_{x ∈ X} E_{ξ ∼ N(µ̄, I)}[ F(x, ξ) ]
             = µ̄^T x̄ − µ̄^T x(µ̄),

where x(µ̄) ∈ arg min_{x ∈ X} µ̄^T x.
Sample space: χ = R^d × R^d × ⋯ × R^d (n times).
Decision rule: δ : χ → X.
SAA: δ(ξ^1, …, ξ^n) ∈ arg min{ ξ̄^T x : x ∈ X }.
29. How does one decide between decision rules?

States of nature Θ, actions A, loss function L : Θ × A → R,
sample space χ with distributions {P_θ : θ ∈ Θ}.

Given a decision rule δ : χ → A, define the risk function of this decision rule as:
    R_δ(θ) := E_{y∼P_θ}[ L(θ, δ(y)) ]

We say that a decision rule δ′ dominates a decision rule δ if R_{δ′}(θ) ≤ R_δ(θ) for all θ ∈ Θ, and R_{δ′}(θ*) < R_δ(θ*) for some θ* ∈ Θ.

If a decision rule δ is not dominated by any other decision rule, we say that δ is admissible. Otherwise, it is inadmissible.
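A toy Monte Carlo experiment (my own illustration, not from the talk) makes the dominance definition concrete: for estimating θ from y ∼ N(θ, 1) under squared loss, the rule δ1(y) = y has constant risk 1, while the shrinkage rule δ2(y) = y/2 has risk θ²/4 + 1/4. Each beats the other for some θ, so neither dominates.

```python
import numpy as np

def risk(delta, theta, trials=200_000, seed=1):
    """Monte Carlo estimate of R_delta(theta) = E_{y~N(theta,1)}[(theta - delta(y))^2]."""
    rng = np.random.default_rng(seed)
    y = rng.normal(theta, 1.0, size=trials)
    return np.mean((theta - delta(y)) ** 2)

d1 = lambda y: y          # use the sample itself
d2 = lambda y: y / 2.0    # shrink toward 0

for theta in [0.0, 1.0, 3.0]:
    print(theta, round(risk(d1, theta), 2), round(risk(d2, theta), 2))
# R_{d1}(theta) = 1 everywhere; R_{d2}(theta) = theta^2/4 + 1/4:
# d2 wins near theta = 0 but loses at theta = 3, so neither rule dominates.
```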
30. Is the Sample Average Approximation (SAA) rule admissible?
32. Admissibility in stochastic optimization

Stochastic optimization setup:
F : X × R^m → R, ξ is a random variable in R^m.

    min_{x∈X} E_ξ[ F(x, ξ) ]

Want to solve with access to n i.i.d. samples of ξ.

Statistical decision theory view:
ξ ∼ N(µ, I); states of nature Θ = R^m = {all possible µ ∈ R^m}.
Set of actions A = X. Sample space χ = R^m × … × R^m (n times).
Loss function:
    L(µ̄, x̄) = E_{ξ∼N(µ̄,I)}[ F(x̄, ξ) ] − min_{x∈X} E_{ξ∼N(µ̄,I)}[ F(x, ξ) ]
Given a decision rule δ : χ → X, the risk of δ is
    R_δ(µ) := E_{ξ^1,…,ξ^n}[ L(µ, δ(ξ^1, …, ξ^n)) ]
34. Admissibility in stochastic optimization

    L(µ̄, x̄) = E_{ξ∼N(µ̄,I)}[ F(x̄, ξ) ] − min_{x∈X} E_{ξ∼N(µ̄,I)}[ F(x, ξ) ]

Given a decision rule δ : χ → X, the risk of δ is
    R_δ(µ) := E_{ξ^1,…,ξ^n}[ L(µ, δ(ξ^1, …, ξ^n)) ]

Sample Average Approximation (SAA):
    min_{x∈X} (1/n) Σ_{i=1}^n F(x, ξ^i)

Sample Average Approximation (SAA) can be inadmissible!!
35. Inadmissibility of SAA: Stein's Paradox

Sample Average Approximation (SAA) can be inadmissible!!

F(x, ξ) = ‖x − ξ‖², X = R^d, ξ ∼ N(µ, I).

    min_{x∈R^d} E_ξ[ F(x, ξ) ] = min_{x∈R^d} E_ξ[ ‖x − ξ‖² ]
38. Inadmissibility of SAA: Stein's Paradox

    min_{x∈R^d} E_{ξ∼N(µ,I)}[ ‖x − ξ‖² ] = min_{x∈R^d} ‖x − µ‖² + V[ξ]

Optimal solution: x(µ̄) = µ̄. Optimal value: V[ξ] = d.
Hence L(µ̄, x̄) = ‖x̄ − µ̄‖².

Sample Average Approximation (SAA):
    min_{x∈R^d} (1/n) Σ_{i=1}^n ‖x − ξ^i‖²

    δ_SAA(ξ^1, …, ξ^n) = ξ̄ := (1/n) Σ_{i=1}^n ξ^i.
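A quick Monte Carlo experiment (a sketch of mine, using the classical James-Stein shrinkage rule as the competing estimator) makes the paradox tangible: since ξ̄ ∼ N(µ, I/n), shrinking it by the James-Stein factor gives strictly smaller risk E‖x − µ‖² than δ_SAA = ξ̄ at every µ whenever d ≥ 3.

```python
import numpy as np

def risks(mu, n=10, trials=100_000, seed=2):
    """Monte Carlo risks E||x - mu||^2 of the SAA rule (sample mean)
    and the James-Stein rule x_JS = (1 - (d-2)/(n ||xi_bar||^2)) xi_bar."""
    d = len(mu)
    rng = np.random.default_rng(seed)
    xi_bar = rng.normal(mu, 1.0 / np.sqrt(n), size=(trials, d))  # xi_bar ~ N(mu, I/n)
    sq = np.sum(xi_bar**2, axis=1, keepdims=True)
    js = (1.0 - (d - 2) / (n * sq)) * xi_bar                     # shrink toward 0
    r_saa = np.mean(np.sum((xi_bar - mu) ** 2, axis=1))
    r_js = np.mean(np.sum((js - mu) ** 2, axis=1))
    return r_saa, r_js

mu = np.zeros(5)            # d = 5 >= 3; the gap is largest near mu = 0
r_saa, r_js = risks(mu)
print(r_saa, r_js)          # r_js < r_saa: the SAA rule is dominated
```

At µ = 0 the exact risks are d/n = 0.5 for SAA and 2/n = 0.2 for James-Stein; for other µ the gap shrinks but never closes, which is exactly the dominance that makes δ_SAA inadmissible here.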
39. Inadmissibility of SAA: Stein's Paradox

Generalized to an arbitrary convex quadratic objective with an uncertain linear term in Davarnia and Cornuéjols 2018. Follow-up work from a Bayesian perspective in Davarnia, Kocuk and Cornuéjols 2018.
40. A class of problems with no Stein's paradox

THEOREM (Basu-Nguyen-Sun 2018)
Consider the problem of optimizing an uncertain linear objective ξ ∼ N(µ, I) over a fixed compact set X ⊆ R^d:

    min_{x∈X} E_{ξ∼N(µ,I)}[ ξ^T x ]

The Sample Average Approximation (SAA) rule is admissible.
44. Main technical ideas/tools

Sufficient Statistic: Let P = {P_θ : θ ∈ Θ} be a family of distributions for a random variable y in sample space χ. A sufficient statistic for this family is a function T : χ → τ such that the conditional probability P(y | T = t) does not depend on θ.

FACT: Let χ = R^d × … × R^d (n times) and P = {N(µ, I) × … × N(µ, I) (n times) : µ ∈ R^d}, i.e., (ξ^1, …, ξ^n) ∈ χ are i.i.d. samples from the normal distribution N(µ, I). Then T(ξ^1, …, ξ^n) = ξ̄ := (1/n) Σ_{i=1}^n ξ^i is a sufficient statistic for P.
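The FACT can be illustrated numerically (a sketch of mine, not a proof): after conditioning on ξ̄, what remains of the data are the residuals ξ^i − ξ̄, whose distribution, N(0, 1 − 1/n) in each coordinate, does not involve µ at all.

```python
import numpy as np

def residuals(mu, n=4, trials=200_000, seed=3):
    """Draw n i.i.d. N(mu, 1) samples per trial and return the
    residual of the first sample about the sample mean."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(mu, 1.0, size=(trials, n))
    return xi[:, 0] - xi.mean(axis=1)

r0 = residuals(mu=0.0)
r5 = residuals(mu=5.0)
# Both residual samples are N(0, 1 - 1/n): same mean and variance
# (here 1 - 1/4 = 0.75) no matter what mu generated the data.
print(r0.var(), r5.var())
```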
45. Main technical ideas/tools

THEOREM (Rao-Blackwell, 1940s)
If the loss function is convex in the action space, then for any decision rule δ, there exists a rule δ′ which is a function only of a sufficient statistic and R_{δ′} ≤ R_δ.
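Rao-Blackwellization in action (my own sketch): start from the wasteful rule δ(ξ^1, …, ξ^n) = ξ^1, which ignores all but the first sample. Conditioning on the sufficient statistic gives δ′ = E[ξ^1 | ξ̄] = ξ̄, and under the convex squared loss its risk drops from 1 to 1/n.

```python
import numpy as np

def mc_risk(rule, mu=2.0, n=10, trials=200_000, seed=4):
    """Monte Carlo risk E[(rule(xi^1,...,xi^n) - mu)^2] for i.i.d. N(mu, 1) data."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(mu, 1.0, size=(trials, n))
    return np.mean((rule(xi) - mu) ** 2)

r_first = mc_risk(lambda xi: xi[:, 0])        # delta: use only the first sample
r_mean = mc_risk(lambda xi: xi.mean(axis=1))  # delta': Rao-Blackwellized rule
print(r_first, r_mean)  # approx 1 vs. 1/n = 0.1: conditioning never hurts
```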
49. Main technical ideas/tools

For any decision rule δ, define the function
    F(µ) = R_δ(µ) − R_{δ_SAA}(µ).
It suffices to show that there exists µ̂ ∈ R^d such that F(µ̂) > 0.

First observe: F(0) = 0.
Then compute the Hessian ∇²F(0) and show it has a strictly positive eigenvalue.

Use the fact from probability theory that for any Lebesgue integrable function f : R^n → R^d, the map
    µ ↦ E_{y∼N(µ,Σ)}[ f(y) ] := (2π)^{−n/2} det(Σ)^{−1/2} ∫_{R^n} f(y) exp( −½ (y−µ)^T Σ^{−1} (y−µ) ) dy
has derivatives of all orders, and these can be computed by differentiating under the integral sign.
53. Open Questions

What about nonlinear objectives over compact feasible regions? For example, what if F(x, ξ) = x^T Qx + ξ^T x for some fixed PSD matrix Q, and X is a compact (convex) set?

What about piecewise linear objectives F(x, ξ)? Recall the Newsvendor Problem.

Objectives coming from machine learning problems, such as neural network training with squared or logistic loss (admissibility of "empirical risk minimization"). Maybe this depends on the hypothesis class that is being learnt?

METATHEOREM (from Gérard Cornuéjols): Admissible if and only if the feasible region is bounded!?