The document discusses interpretable sparse sliced inverse regression (IS-SIR) for functional data regression. It begins with background on using metamodels as proxies for computationally expensive agronomic models to understand relationships between climate inputs and plant outputs. SIR is presented as a semi-parametric regression technique that identifies relevant subspaces to predict outputs from functional inputs. The proposal combines SIR with automatic interval selection to identify interpretable predictor intervals. Simulations are discussed to evaluate the proposed method.
The Bayesian paradigm provides a coherent approach for quantifying uncertainty given available data and prior information. Aspects of uncertainty that arise in practice include uncertainty regarding parameters within a model, the choice of model, and propagation of uncertainty in parameters and models for predictions. In this talk I will present Bayesian approaches for addressing model uncertainty given a collection of competing models including model averaging and ensemble methods that potentially use all available models and will highlight computational challenges that arise in implementation of the paradigm.
A comparison of three learning methods to predict N2O fluxes and N leaching (tuxette)
The document compares three machine learning methods - multi-layer perceptrons (neural networks), support vector machines (SVMs), and random forests - for predicting N2O fluxes and N leaching from various data inputs. It provides background on machine learning for regression problems, describes the three methods and how they are trained and tuned, and discusses the methodology and results of a study comparing the performance of these methods.
The document discusses using unusual data sources in insurance. It provides examples of using pictures, text, social media data, telematics, and satellite imagery in insurance. It also discusses challenges in analyzing complex and high-dimensional data from these sources and introduces machine learning tools like PCA, generalized linear models, and evaluating models using loss, risk, and cross-validation.
This document summarizes a seminar on econometrics and machine learning given by Arthur Charpentier at Università degli studi dell’Insubria in May 2018. It discusses the history and development of econometrics, including its probabilistic foundations. It also covers key econometric techniques like regression, maximum likelihood estimation, and nonparametric methods. Model selection criteria like AIC and BIC are also briefly discussed. The document provides a high-level overview of major topics in econometrics through the lens of its use in large datasets and connection to machine learning.
This document outlines an agenda for a presentation on big data and machine learning from an actuarial perspective. The presentation will include an introduction to statistical learning, covering classification and regression problems. It will discuss model selection, feature engineering, and other related computational topics. Code examples will be provided to illustrate machine learning techniques in action. The goal is to describe philosophical differences between machine learning and standard statistical approaches, as well as explain commonly used algorithms and how to implement them.
The document discusses quantiles and quantile regression. It begins by defining quantiles as the inverse of a cumulative distribution function. Quantile regression models the relationship between covariates and conditional quantiles, similar to how ordinary least squares regression models the conditional mean. The document also discusses median regression, which estimates relationships using the 1-norm rather than the 2-norm used in OLS. Median regression provides consistent estimates when the error term has a symmetric distribution.
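As an illustrative sketch (not from the document), the characterization of quantiles used in quantile regression can be checked in a few lines: the τ-th quantile of a sample minimizes the pinball (check) loss, and for τ = 0.5 the minimizer is the median.

```python
# Pinball (check) loss: the tau-quantile of a sample minimizes this loss.
def pinball_loss(y, q, tau):
    # tau * (v - q) when v >= q, (tau - 1) * (v - q) otherwise
    return sum(tau * (v - q) if v >= q else (tau - 1) * (v - q) for v in y)

y = [1, 2, 3, 4, 100]
# Scanning the sample values, the minimizer for tau = 0.5 is the median,
# which is robust to the outlier 100 (unlike the mean, 22).
best = min(y, key=lambda q: pinball_loss(y, q, 0.5))
print(best)  # 3
```

Quantile regression generalizes this by minimizing the same loss over the residuals of a linear model instead of over a raw sample.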
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
The document proposes using random forests (RF), a machine learning tool, for approximate Bayesian computation (ABC) model choice rather than estimating model posterior probabilities. RF improves on existing ABC model choice methods by having greater discriminative power among models, being robust to the choice and number of summary statistics, requiring less computation, and providing an error rate to evaluate confidence in the model choice. The authors illustrate the power of the RF-based ABC methodology on controlled experiments and real population genetics datasets.
Surrogate models emulate expensive computer simulations. The objective is to approximate a function, $f$, of $d$ variables to a given tolerance, $\varepsilon$, using as few function values as possible, preferably $O(d)$. We explain how tractability theory provides lower bounds on the number of function values required by any possible method. We also propose a method for sampling and approximating $f$ that achieves this objective, and describe the kind of underlying structure that $f$ must have for success.
Pattern learning and recognition on statistical manifolds: An information-geo... (Frank Nielsen)
This document provides an overview of Frank Nielsen's talk on pattern learning and recognition using information geometry and statistical manifolds. The talk focuses on departing from vector space representations and dealing with (dis)similarities that do not have Euclidean or metric properties. This poses new theoretical and computational challenges for pattern recognition. The talk describes using exponential family mixture models defined on dually flat statistical manifolds induced by convex functions. On these manifolds, dual coordinate systems and dual affine geodesics allow for computing-friendly representations of divergences and similarities between probabilistic patterns. The techniques aim to achieve statistical invariance and enable algorithmic approaches to problems like Gaussian mixture modeling, shape retrieval, and diffusion tensor imaging analysis.
This document provides an overview of a 2004 CVPR tutorial on nonlinear manifolds in computer vision. The tutorial is divided into four parts that cover: (1) motivation for studying nonlinear manifolds and how differential geometry can be useful in vision, (2) tools from differential geometry like manifolds, tangent spaces, and geodesics, (3) statistics on manifolds like distributions and estimation, and (4) algorithms and applications in computer vision like pose estimation, tracking, and optimal linear projections. Nonlinear manifolds are important in computer vision because the underlying spaces in problems involving constraints, like objects on circles or matrices with orthogonality constraints, are nonlinear. Differential geometry provides a framework for generalizing tools from vector spaces to nonlinear manifolds.
Although we are often told not to do it, statistical scientists frequently predict the value of outcome measures of physical systems at input points far from the observed data. Since predictions are made in new regions of the input space, statistical theory cannot dictate optimal rules for measures of uncertainty associated with extrapolation. This talk presents several solutions based on simple principles. The solutions are illustrated via the analysis of data generated by dropping spheres of varying radii and masses from different heights. Some of the techniques apply to more complex physical systems. The efficacy of these techniques is demonstrated using data (experimental and simulated) of the level of complexity physical scientists frequently face. Scientists should tailor these techniques to fit the needs of a particular application.
This document provides teaching materials for a lesson on quadratic functions in vertex form. The lesson is designed for senior high school students and will take approximately 80 minutes to complete. It includes a teachers' guide, lesson plan, student worksheet, and instructions for setting up quadratic graphs in a spreadsheet. The lesson introduces vertex form and guides students in exploring how changing the parameters a, p, and q affects the graph of the function. Students will observe transformations of the graph as these parameters are varied and analyze how the vertex, y-intercept, and minimum/maximum values change.
Calibrating Probability with Undersampling for Unbalanced Classification (Andrea Dal Pozzolo)
This study examines how undersampling affects posterior probability estimates in unbalanced classification tasks. It shows that undersampling warps the posterior probabilities away from the true probabilities. However, the study presents a method to correct the warped probabilities using a simple formula, which provides calibrated probabilities without loss of predictive performance. Experiments on real-world datasets demonstrate that the corrected probabilities have better calibration than uncorrected probabilities while maintaining ranking quality.
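A correction of the kind the study describes has a simple closed form. The sketch below is ours, not the paper's code; `beta` is assumed to denote the fraction of majority-class examples kept by undersampling, and `ps` a posterior probability estimated on the undersampled data.

```python
def calibrate(ps, beta):
    """Map a posterior probability estimated after undersampling back
    toward the original class balance. beta is the fraction of
    majority-class examples kept by undersampling (0 < beta <= 1)."""
    return beta * ps / (beta * ps - ps + 1)

print(calibrate(0.5, 1.0))  # 0.5: no undersampling, probability unchanged
print(calibrate(0.5, 0.1))  # ~0.091: heavy undersampling inflated the raw estimate
```

Because the mapping is strictly increasing in `ps`, it recalibrates the probabilities without changing the ranking of the examples, which matches the study's claim that ranking quality is preserved.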
This document discusses various types of regression modeling and linear regression. It provides examples of linear regression analysis on fraud data and discusses assessing goodness of fit. It also briefly covers non-linear regression, problem areas like heteroskedasticity and collinearity, and model selection methods. Linear regression is presented geometrically and the assumptions and computations of ordinary least squares regression are explained.
This document provides an overview of regression models and their use in business analytics. It discusses simple and multiple linear regression models, how to develop regression equations from sample data, and how to interpret key outputs like the slope, intercept, coefficient of determination, and correlation coefficient. Regression analysis is presented as a valuable tool for managers to understand relationships between variables and predict outcomes. The document outlines the key steps in regression including developing scatter plots, calculating regression equations, and measuring the fit of regression models.
This document provides an overview of key concepts in regression analysis, including simple and multiple linear regression models. It outlines 10 learning objectives for the chapter, which cover topics like developing regression equations from sample data, interpreting regression outputs, assessing model fit, and addressing violations of regression assumptions. The document also includes sample regression calculations and residual plots for a case study on predicting home renovation sales from area payroll levels.
When is undersampling effective in unbalanced classification tasks? (Andrea Dal Pozzolo)
This document analyzes when undersampling is effective for addressing class imbalance in classification tasks. It introduces the concepts of warping in posterior distributions and increased variance due to sample removal with undersampling. It presents a theoretical condition under which undersampling is expected to improve classification accuracy based on comparing the ranking error probability with and without undersampling. Experiments on synthetic univariate and bivariate datasets are used to illustrate factors influencing whether the condition holds.
The document describes univariate and multivariate analyses for forecasting. For univariate analysis, Ornstein-Uhlenbeck and autoregressive models are used to analyze time series data. For multivariate analysis, linear regression is used to analyze correlations between multiple time series and to predict values. The analyses generate forecasts, confidence bands around predictions, and evaluations of prediction errors. The conclusion indicates that the methods provide useful predictions and that inflation rates are correlated across measures.
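As a hedged sketch (ours, not the document's code), an autoregressive model of the kind mentioned can be fit by least squares on lagged values, and the fit yields a one-step forecast with a rough confidence band:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, n = 0.8, 500

# Simulate an AR(1) series: x_t = phi * x_{t-1} + noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Least-squares estimate of phi from the lagged regression
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])

# One-step-ahead forecast with a rough 95% band (unit innovation variance)
forecast = phi_hat * x[-1]
band = (forecast - 1.96, forecast + 1.96)
print(round(phi_hat, 2))
```

The same lagged-regression idea extends to the multivariate case by stacking several series into the design matrix.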
This document provides an overview of linear regression models. It discusses using linear regression to analyze the relationship between one or more independent variables and a dependent variable. Key points covered include:
- Linear regression can be used to measure relationships between variables, determine causal direction, and forecast variable values.
- The linear regression model relates a dependent variable to independent variables using a best fitting straight line.
- Ordinary least squares estimation is used to estimate the slope and intercept of the regression line by minimizing the sum of squared residuals.
- Diagnostic tests on residuals can check if assumptions like linearity, normality and equal variance are met.
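The steps above can be sketched in a few lines (illustrative, with numpy; not the document's own code):

```python
import numpy as np

# Toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with an intercept column; OLS minimizes ||y - X beta||^2
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals are (numerically) zero here; in practice one would plot them
# to check the linearity, normality and equal-variance assumptions
resid = y - X @ beta
print(beta)  # [1. 2.]  (intercept, slope)
```

On noisy data the residual vector is where the diagnostic checks listed above start: a residuals-versus-fitted plot for linearity and equal variance, a Q-Q plot for normality.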
Inferring networks from multiple samples with consensus LASSO (tuxette)
This document provides an overview of biological concepts and network inference methods. It discusses DNA, transcription, gene expression, and how transcriptomic data is obtained. Gene networks can be inferred from expression data using correlations or partial correlations between genes. Network inference focuses on direct relationships between genes and can identify interactions for previously unannotated genes.
Visualizing and mining networks - Methods and examples in R (tuxette)
PEPI IBIS general assembly, April 1st, 2014
This talk introduces the notion of networks and the basic problems generally associated with them (visualization, identification of important vertices, identification of modules). The notions are illustrated with examples on a real network using the R software.
Inferring networks from multiple samples with consensus LASSO (tuxette)
The document discusses network inference from gene expression data. It provides background on DNA, transcription, and gene expression. Gene expression data from microarrays contains measurements of thousands of genes across multiple samples. The goal is to infer a gene network or graph with nodes as genes and edges as strong links between gene expressions. Graphical Gaussian models (GGMs) are commonly used, where the concentration matrix encodes conditional independence relationships between genes. Several approaches are discussed for estimating the concentration matrix from data, including graphical lasso methods that promote sparse solutions.
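To make the concentration-matrix idea concrete, here is an illustrative numpy sketch (ours, not the document's code): for a Gaussian "chain" X1 - X2 - X3, the concentration (precision) matrix has a zero at position (1, 3), encoding that X1 and X3 are conditionally independent given X2.

```python
import numpy as np

# Covariance of a Gaussian chain X1 - X2 - X3 (AR(1)-like, rho = 0.6)
rho = 0.6
Sigma = np.array([[1.0,    rho,  rho**2],
                  [rho,    1.0,  rho],
                  [rho**2, rho,  1.0]])

# Concentration (precision) matrix: zero entries encode conditional independence
Theta = np.linalg.inv(Sigma)

# Partial correlation between variables i and j given all the others
d = np.sqrt(np.diag(Theta))
partial = -Theta / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

print(partial[0, 2])  # numerically zero: X1 and X3 are linked only through X2
```

Graphical lasso methods estimate a sparse version of `Theta` directly from data, so that the surviving nonzero entries define the edges of the inferred gene network.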
Graph mining 2: Statistical approaches for graph mining (tuxette)
This document summarizes a talk on statistical approaches for graph mining. It introduces basic graph terminology and describes some standard global and local numerical characteristics for describing graph structure. These characteristics are calculated for a toy graph dataset and compared to random graph null models to identify which characteristics have unexpectedly high or low values compared to the random graphs. Clustering methods for graph mining are also outlined but not described in detail.
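Two of the standard characteristics mentioned (density and the local clustering coefficient) computed on a toy graph, as a stdlib-only sketch (ours, not the talk's):

```python
# Toy undirected graph: triangle 1-2-3 with a pendant vertex 4
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

n = len(adj)
m = sum(len(nb) for nb in adj.values()) // 2  # each edge counted twice
density = 2 * m / (n * (n - 1))

def clustering(v):
    # Fraction of pairs of neighbours of v that are themselves connected
    nb = list(adj[v])
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nb[j] in adj[nb[i]])
    return 2 * links / (k * (k - 1))

print(density)        # 4 edges on 4 vertices: 2/3
print(clustering(3))  # neighbours {1, 2, 4}: only the pair (1, 2) is linked, so 1/3
```

Comparing such values against their distribution under a random-graph null model, as the talk describes, indicates which features of the observed graph are unexpectedly high or low.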
Inferring networks from multiple samples with consensus LASSO (tuxette)
This document provides a short overview of network inference using graphical Gaussian models (GGMs). It discusses inferring networks from multiple samples, with the motivation being to identify genes that are linked independently or depending on different conditions. A naive approach of performing independent estimations on each sample is described. Joint network inference using the consensus LASSO method is then introduced to better identify common and condition-specific network structures across multiple related samples.
Classification and regression based on derivatives: a consistency result for ... (tuxette)
This document summarizes a presentation on using derivatives for the classification and regression of functions. It discusses using smoothing splines to estimate functions and their derivatives from discretely sampled data. A consistency result is presented: a classifier or regression function built from the estimated derivatives achieves the optimal Bayes risk asymptotically, as the number of sampling points and training examples increases. The key idea is to combine smoothing splines, which consistently estimate functions and their derivatives, with a consistent classifier or regressor applied to the estimated values.
Maximum likelihood estimation of regularisation parameters in inverse problem... (Valentin De Bortoli)
This document discusses an empirical Bayesian approach for estimating regularization parameters in inverse problems using maximum likelihood estimation. It proposes the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm, which uses Markov chain sampling to approximate gradients in a stochastic projected gradient descent scheme for optimizing the regularization parameter. The algorithm is shown to converge to the maximum likelihood estimate under certain conditions on the log-likelihood and prior distributions.
This document provides an overview of a tutorial on intelligent information gathering and submodular function optimization. The tutorial discusses how many artificial intelligence problems can be formulated as submodular optimization problems, including sensor placement, active learning, and structure learning. It introduces key concepts such as submodular set functions, examples of submodular functions including set cover and mutual information, and properties of submodular functions including closedness under nonnegative linear combinations and the relationship between submodularity and concavity.
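For intuition (an illustrative sketch, not code from the tutorial), the classic greedy algorithm for maximizing a monotone submodular function such as set coverage picks, at each step, the element with the largest marginal gain:

```python
def greedy_cover(sets, k):
    """Pick k sets greedily by marginal coverage gain; for monotone
    submodular objectives this achieves a (1 - 1/e) approximation."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(sets, key=lambda s: len(sets[s] - covered))
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}}
chosen, covered = greedy_cover(sets, 2)
print(chosen)  # ['C', 'A']: C adds 4 new elements, then A adds 3 more
```

The diminishing-returns property of submodular functions (each set's marginal gain can only shrink as coverage grows) is exactly what makes this greedy rule provably near-optimal for problems like sensor placement.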
Image sciences, image processing, image restoration, photo manipulation. Image and video representation. Digital versus analog imagery. Quantization and sampling. Sources and models of noise in digital CCD imagery: photon, thermal and readout noise. Sources and models of blur. Convolutions and point spread functions. Overview of other standard models, problems and tasks: salt-and-pepper and impulse noise, halftoning, inpainting, super-resolution, compressed sensing, high dynamic range imagery, demosaicing. Short introduction to other types of imagery: SAR, sonar, ultrasound, CT and MRI. Linear and ill-posed restoration problems.
This document summarizes Arthur Charpentier's presentation at the Rennes Risk Workshop in April 2015. It discusses extending concepts of risk from univariate to multivariate prospects, including characterizing attitudes to multivariate notions of increasing risk like the Rothschild-Stiglitz mean preserving increase in risk and Quiggin's monotone mean preserving increase in risk. It also generalizes the Bickel-Lehmann dispersion order to multivariate risks and examines its implications for risk sharing.
Several nonlinear models and methods for FDA (tuxette)
This document summarizes several nonlinear models and methods for functional data analysis (FDA), including nonparametric kernel models. It describes the Nadaraya-Watson kernel estimator for regression with functional data. This estimator takes a weighted average of the observed y-values, with weights based on a kernel function of the distance between the observed curves. The document outlines the assumptions needed for the estimator to converge pointwise and uniformly, and states the optimal rates of convergence. It also discusses choosing the kernel and bandwidth parameters and extending the estimator to functional data in Hilbert spaces.
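The Nadaraya-Watson estimator described can be written in a few numpy lines. This is our sketch, with a Gaussian kernel on scalar inputs for simplicity; in the functional setting the kernel argument would be a distance between curves rather than `X - x0`.

```python
import numpy as np

def nadaraya_watson(x0, X, y, h):
    """Kernel-weighted average of the observed y-values; the weights
    decay with the Gaussian-kernel distance between x0 and each X,
    and h is the bandwidth parameter."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return (w @ y) / w.sum()

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=X.size)

# The estimate at 0.25 should sit near the peak sin(pi/2) = 1
print(nadaraya_watson(0.25, X, y, 0.05))
```

The bandwidth `h` plays the role discussed in the document: too small and the estimate follows the noise, too large and it oversmooths, which is why it is usually chosen by cross-validation.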
This document summarizes Chris Swierczewski's general exam presentation on computational applications of Riemann surfaces and Abelian functions. The presentation covered the geometry and algebra of Riemann surfaces, including bases of cycles, holomorphic differentials, and period matrices. Applications discussed include using Riemann theta functions to find periodic solutions to integrable PDEs like the Kadomtsev–Petviashvili equation. The talk also discussed linear matrix representations of algebraic curves and the constructive Schottky problem of realizing a Riemann matrix as the period matrix of a curve.
This document discusses quantile regression and loss functions used in regression analysis, including ordinary least squares (OLS) and quantile regression. It provides mathematical definitions of quantiles, OLS regression using the L2 norm and expected value, and median regression using the L1 norm and median. Examples are given of how OLS regression minimizes the squared errors while median regression minimizes the absolute errors. References to early works on quantiles, regression, and estimation methods are also provided.
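The contrast between the squared-error and quantile criteria can be checked numerically. In this small sketch (our own toy data and helper names, not from the talk), minimizing the pinball loss over a constant predictor recovers an empirical quantile, and the outlier at 100 barely moves it:

```python
def pinball(u, tau):
    """Quantile (pinball) loss for a residual u at level tau."""
    return tau * u if u >= 0 else (tau - 1) * u

def best_constant(ys, loss):
    """Among the observed values, pick the constant minimizing the total loss."""
    return min(sorted(ys), key=lambda c: sum(loss(y - c) for y in ys))

data = [1, 2, 3, 4, 100]                               # one large outlier
print(best_constant(data, lambda u: pinball(u, 0.5)))  # median → 3
print(best_constant(data, lambda u: pinball(u, 0.7)))  # 0.7-quantile → 4
```

With the L2 (squared-error) criterion the fitted constant would instead be pulled toward the mean (22 here), which is the robustness argument for median and quantile regression.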
The document discusses composite infimal convolutions, which combine infimal convolutions and infimal postcompositions. It provides an equation that defines the composite infimal convolution and notes some special cases. It also lists several properties that have already been investigated for this operation, including topological, algebraic, convex analytical, and its amenability to proximal splitting algorithms. It concludes by posing some open questions about norm interpolation properties and refining splitting algorithms for related problems.
This document summarizes a talk given by Yoshihiro Mizoguchi on developing a Coq library for relational calculus. The talk introduces relational calculus and its applications. It describes implementing definitions and proofs about relations, Boolean algebras, relation algebras, and Dedekind categories in Coq. The library provides a formalization of basic notions in relational theory and can be used to formally verify properties of relations and to prove theorems automatically.
This document summarizes Arthur Charpentier's presentation on econometrics and statistical learning techniques. It discusses different perspectives on modeling data, including the causal story, conditional distribution story, and explanatory data story. It also covers topics like high dimensional data, computational econometrics, generalized linear models, goodness of fit, stepwise procedures, and testing in high dimensions. The presentation provides an overview of various statistical and econometric modeling techniques.
Numerical solution of boundary value problems by piecewise analysis method (Alexander Decker)
This document presents a numerical method called Piecewise-Homotopy Analysis Method (P-HAM) for solving fourth-order boundary value problems. P-HAM is based on the Homotopy Analysis Method (HAM) but uses multiple auxiliary parameters, with each parameter applied over a sub-range of the domain for improved accuracy. The document outlines the basic steps of P-HAM, including constructing the zero-order deformation equation and deriving the governing equations. It then applies P-HAM to solve two example problems and compares the results to other numerical methods.
This document summarizes a talk on inference on treatment effects after model selection. It discusses challenges with inferring treatment effects after refitting a model selected via a procedure like lasso. Specifically, refitting can lead to bias due to overfitting or underfitting the model. The document proposes using repeated data splitting to remove the overfitting bias. In each split, part of the data is used for model selection and the other part for estimating treatment effects without overfitting bias. This approach reduces bias compared to simply refitting the full model.
This document provides an introduction to key concepts in probability and statistics for machine learning. It covers topics such as sample spaces, events, axioms of probability, permutations, combinations, conditional probability, Bayes' rule, random variables, probability distributions, expectations, variance, transformations of random variables, jointly distributed random variables, parameter estimation, and the central limit theorem.
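As a worked illustration of Bayes' rule from the topics listed above (the prevalence and test characteristics below are made-up numbers, not from the document):

```python
# P(disease | positive test) via Bayes' rule, for a 1% prevalence disease
# tested with 95% sensitivity and 90% specificity.
prior = 0.01
sensitivity = 0.95        # P(+ | disease)
false_positive = 0.10     # 1 - specificity = P(+ | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # → 0.0876
```

Even with a fairly accurate test, a positive result implies less than a 9% chance of disease, because the low prior dominates: a standard illustration of why the prior matters.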
Density theorems for Euclidean point configurations (VjekoslavKovac1)
1. The document discusses density theorems for point configurations in Euclidean space. Density theorems study when a measurable set A contained in Euclidean space can be considered "large".
2. One classical result is that for any measurable set A ⊆ R² of positive upper Banach density, every sufficiently large real number is realized as the distance between some pair of points of A. This has been generalized to higher dimensions and to other point configurations.
3. Open questions remain about determining all point configurations P for which one can show that a sufficiently large measurable set A contained in high dimensional Euclidean space must contain a scaled copy of P.
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
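The Monte Carlo route mentioned above can be illustrated in a few lines: plain Monte Carlo integration of an unnormalized Gaussian density recovers its normalizing constant √(2π). This is a toy sketch with our own function names; the methods covered in the talk (reverse logistic regression, Meng's maximum likelihood approach) are considerably more sophisticated:

```python
import math, random

def unnormalized(x):
    """Unnormalized Gaussian density; its normalizing constant is sqrt(2*pi)."""
    return math.exp(-x * x / 2)

def mc_normalizing_constant(n, a=-10.0, b=10.0, seed=0):
    """Plain Monte Carlo: Z ≈ (b - a) * mean of f(U) for U uniform on [a, b]."""
    rng = random.Random(seed)
    total = sum(unnormalized(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

z_hat = mc_normalizing_constant(200_000)
print(z_hat, math.sqrt(2 * math.pi))  # estimate vs. exact value ≈ 2.5066
```

The uniform proposal is deliberately naive; importance sampling with a proposal closer to the integrand would reduce the variance, which is one motivation for the more refined estimators discussed in the workshop.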
Elementary Landscape Decomposition of the Hamiltonian Path Optimization Problem (jfrchicanog)
The document describes research on decomposing optimization problem landscapes into elementary components. It defines key landscape concepts like configuration space, neighborhood operators, and objective functions. It then introduces the idea of elementary landscapes where the objective function is a linear combination of eigenfunctions. The paper discusses decomposing general landscapes into a sum of elementary components and proposes using average neighborhood fitness for selection in non-elementary landscapes. It applies these concepts to the Hamiltonian Path Optimization problem, analyzing the problem's reversals and swaps neighborhoods.
Similar to Interpretable Sparse Sliced Inverse Regression for digitized functional data (20)
Roots at the top and leaves at the bottom: trees in maths (tuxette)
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Kernel methods for the integration of heterogeneous data (tuxette)
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
Omics data integration methodologies (tuxette)
This document summarizes a presentation on multi-omics data integration methods given by Nathalie Vialaneix on December 13, 2023. The presentation discusses different types of omics data that can be integrated, both vertically across different levels of omics data on the same samples and horizontally across similar types of omics data on different samples. It also discusses different analysis approaches that can be taken, including supervised and unsupervised methods. The rest of the presentation focuses on unsupervised transformation-based integration methods using kernels.
The document discusses current and future work on analyzing Hi-C data and differential analysis of Hi-C matrices. It describes a clustering method developed to partition chromosomes based on Hi-C matrix similarity. It also introduces a new method called treediff for differential analysis of Hi-C data that calculates the distance between hierarchical clusterings. Current work includes reviewing differential analysis methods, investigating differential subtrees with multiple testing control, and inferring chromatin interaction networks.
Can deep learning learn chromatin structure from sequence? (tuxette)
This document discusses a deep learning model called ORCA that can predict chromatin structure from DNA sequence. The model uses a neural network with an encoder to extract features from sequence and a decoder to predict Hi-C matrices. It was trained on Hi-C data from multiple cell types and can predict interactions between regions at various resolutions. The model accurately captures features like CTCF-mediated loops and can predict effects of structural variants on chromatin structure. It allows for in silico mutagenesis to study how mutations may alter 3D genome organization.
Multi-omics data integration methods: kernel and other machine learning appro... (tuxette)
The document discusses multi-omics data integration methods, particularly kernel methods. It describes how kernel methods transform data into similarity matrices between samples rather than relying on variable space. Multiple kernel integration approaches are presented that combine multiple similarity matrices into a consensus kernel in an unsupervised manner, such as through a STATIS-like framework that maximizes the similarity between kernels. Examples of applications to datasets from the TARA Oceans expedition are given.
This document provides an overview of the MetaboWean and Idefics projects. MetaboWean aims to study the co-evolution of gut microbiota and epithelium during suckling-to-weaning transition in rabbits, using metabolomics, metagenomics, and single-cell RNA sequencing data. Idefics integrates multiple omics datasets from human skin samples to understand relationships between microorganisms and molecules and how they are structured in patient groups. The datasets include metagenomics, metabolomics, and proteomics from host and microbiota.
Rserve, renv, flask, Vue.js in a Docker container to integrate omics data ... (tuxette)
ASTERICS is an interactive and integrative data analysis tool for omics data. It uses Rserve and PyRserve with Flask and Vue.js in a Docker container to integrate omics data. The backend uses Rserve and PyRserve with Flask on the server side, while the frontend uses Vue.js. This architecture was chosen for its open source and light design. Data communication between Rserve and PyRserve is limited, requiring an object database. ASTERICS is deployed using three Docker containers for R, Python, and
Machine learning for molecular biology and omics data analysis (tuxette)
This document summarizes a scientific presentation about molecular biology and omics data analysis. The presentation covers topics related to analyzing large omics datasets using methods like kernel methods, graphical models, and neural networks to learn gene regulation networks and predict phenotypes. Key challenges addressed are handling big data, missing values, non-Gaussian data types like counts and compositional data. The goal is to better understand complex biological systems from multi-omics data.
Some preliminary results from the evaluation of r... inference methods (tuxette)
The document summarizes preliminary results from evaluating methods for inferring gene regulatory networks from expression data in Bacillus subtilis. It finds that recall of the known network is generally poor (<20% for random forest), but inferred clusters still retain biological information about common regulators. It plans to confirm results, test restricting edges to sigma factors, and explore other inference methods like Bayesian networks and ARACNE.
Multi-scale omics data integration: kernel methods and other ap... (tuxette)
The document discusses methods for integrating multi-scale omics data using kernel and machine learning approaches. It describes how omics data is large, heterogeneous, and multi-scaled, creating bottlenecks for analysis. Methods discussed for data integration include multiple kernel learning to combine different relational datasets in an unsupervised way. The methods are applied to integrate different datasets from the TARA Oceans expedition to identify patterns in ocean microbial communities. Improving interpretability of the methods and making them more accessible to biological users is discussed.
Journal club: Validation of cluster analysis results on validation data (tuxette)
This document presents a framework for validating cluster analysis results on validation data. It describes situations where clustering is inferential versus descriptive and recommends using validation data separate from the data used for clustering. A typology of validation methods is provided, including validation based on the clustering method or results, and evaluation using internal validation, external validation, visual properties, or stability measures.
The document discusses the differences between overfitting and overparametrization in machine learning models. It explores how random forests may exhibit a phenomenon known as "double descent" where test error initially decreases then increases with more parameters before decreasing again. While double descent has been observed in other models, the document questions whether it is directly due to model complexity in random forests since very large trees may be unable to fully interpolate extremely large datasets.
Selective inference and single-cell differential analysis (tuxette)
This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
SOMbrero: an R package for self-organizing maps (tuxette)
SOMbrero is an R package that implements self-organizing map (SOM) algorithms. It can handle numeric, non-numeric, and relational data. The package contains functions for training SOMs, diagnosing results, and plotting maps. It also includes tools like a shiny app and vignettes to aid users without programming experience. SOMbrero supports missing data imputation and extends SOM to relational datasets through non-Euclidean distance measures.
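The online SOM algorithm that SOMbrero implements can be caricatured on scalar data. This is a stdlib-only sketch with our own names, learning-rate, and neighborhood schedules; SOMbrero itself is an R package with far more machinery (diagnostics, plots, relational data):

```python
import random

def train_som(data, n_units=5, epochs=50, seed=1):
    """Online SOM on scalar data with units arranged on a line (minimal sketch)."""
    rng = random.Random(seed)
    proto = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):            # shuffled pass over data
            lr = 0.5 * (1 - t / steps)                   # decreasing learning rate
            radius = max(1.0 * (1 - t / steps), 0.01)    # shrinking neighborhood
            bmu = min(range(n_units), key=lambda j: abs(proto[j] - x))
            for j in range(n_units):
                h = 1.0 if abs(j - bmu) <= radius else 0.0
                proto[j] += lr * h * (x - proto[j])      # pull neighbors toward x
            t += 1
    return proto

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
protos = train_som(data)
print([round(p, 1) for p in protos])
```

The prototypes end up spread over the range of the data, with neighboring units mapped to nearby values, which is the topology-preservation property the package's plots are designed to inspect.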
Graph Neural Network for Phenotype Prediction (tuxette)
This document describes a study on using graph neural networks (GNNs) for phenotype prediction from gene expression data. The objectives are to determine if including network information can improve predictions, which network types work best, and if GNNs can learn network inferences. It provides background on GNNs and how they generalize convolutional layers to graph data. The authors implemented a GNN model from previous work as a starting point and tested it on different network types to see which network information is most useful for predictions. Their methodology involves comparing GNN performance to other methods like random forests using 10-fold cross validation.
A short and naive introduction to using networks in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
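The Laplacian referred to here is L = D − A. A small sketch (our own toy graph, not from the talk) that builds it and checks the identity x^T L x = Σ_{(i,j)∈E} (x_i − x_j)², which is exactly what ties Laplacian regularization to "similar contributions from connected nodes":

```python
# Build the combinatorial Laplacian L = D - A of a small graph and verify
# that x^T L x equals the sum of squared differences over the edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1
deg = [sum(row) for row in A]
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

x = [1.0, 2.0, 0.5, -1.0]
quad = sum(x[i] * sum(L[i][j] * x[j] for j in range(n)) for i in range(n))
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges)
print(all(sum(row) == 0 for row in L), quad == edge_sum)  # prints: True True
```

The zero row sums mean the constant vector is an eigenvector with eigenvalue 0; eigenvectors of the next-smallest eigenvalues vary smoothly over the graph, which is why they carry the structural information the talk exploits.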
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS (Sérgio Sacani)
The pathway(s) to seeding the massive black holes (MBHs) that exist at the heart of galaxies in the present and distant Universe remains an unsolved problem. Here we categorise, describe and quantitatively discuss the formation pathways of both light and heavy seeds. We emphasise that the most recent computational models suggest that rather than a bimodal-like mass spectrum between light and heavy seeds, with light at one end and heavy at the other, a continuum exists. Light seeds are more ubiquitous and the heavier seeds become less and less abundant due to the rarer environmental conditions required for their formation. We therefore examine the different mechanisms that give rise to different seed mass spectrums. We show how and why the mechanisms that produce the heaviest seeds are also among the rarest events in the Universe and are hence extremely unlikely to be the seeds for the vast majority of the MBH population. We quantify, within the limits of the current large uncertainties in the seeding processes, the expected number densities of the seed mass spectrum. We argue that light seeds must be at least 10^3 to 10^5 times more numerous than heavy seeds to explain the MBH population as a whole. Based on our current understanding of the seed population this makes heavy seeds (Mseed > 10^3 M⊙) a significantly more likely pathway given that heavy seeds have an abundance pattern that is close to, and likely in excess of, 10^-4 compared to light seeds. Finally, we examine the current state-of-the-art in numerical calculations and recent observations and plot a path forward for near-future advances in both domains.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
A microbial interaction may be positive, such as mutualism, proto-cooperation, or commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Amensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
The mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
The mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
The mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are the association of specific fungi and certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
Compound A → (utilized by population 1) → Compound B → (utilized by population 2) → Compound C → (utilized by both populations 1 and 2) → Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations together are then able to carry out the metabolic reactions leading to an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other, fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arabinosus occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus requires phenylalanine, which is produced by E. faecalis.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub... (Sérgio Sacani)
Context. The observation of several L-band emission sources in the S cluster has led to a rich discussion of their nature. However, a definitive answer to the classification of the dusty objects requires an explanation for the detection of compact Doppler-shifted Brγ emission. The ionized hydrogen in combination with the observation of mid-infrared L-band continuum emission suggests that most of these sources are embedded in a dusty envelope. These embedded sources are part of the S-cluster, and their relationship to the S-stars is still under debate. To date, the question of the origin of these two populations has been vague, although all explanations favor migration processes for the individual cluster members. Aims. This work revisits the S-cluster and its dusty members orbiting the supermassive black hole SgrA* on bound Keplerian orbits from a kinematic perspective. The aim is to explore the Keplerian parameters for patterns that might imply a nonrandom distribution of the sample. Additionally, various analytical aspects are considered to address the nature of the dusty sources. Methods. Based on the photometric analysis, we estimated the individual H−K and K−L colors for the source sample and compared the results to known cluster members. The classification revealed a noticeable contrast between the S-stars and the dusty sources. To fit the flux-density distribution, we utilized the radiative transfer code HYPERION and implemented a young stellar object Class I model. We obtained the position angle from the Keplerian fit results; additionally, we analyzed the distribution of the inclinations and the longitudes of the ascending node. Results. The colors of the dusty sources suggest a stellar nature consistent with the spectral energy distribution in the near and mid-infrared domains. Furthermore, the evaporation timescales of dusty and gaseous clumps in the vicinity of SgrA* are much shorter (≲ 2 yr) than the epochs covered by the observations (≈ 15 yr).
In addition to the strong evidence for the stellar classification of the D-sources, we also find a clear disk-like pattern following the arrangements of S-stars proposed in the literature. Furthermore, we find a global intrinsic inclination for all dusty sources of 60 ± 20°, implying a common formation process. Conclusions. The pattern of the dusty sources manifested in the distribution of the position angles, inclinations, and longitudes of the ascending node strongly suggests two different scenarios: the main-sequence stars and the dusty stellar S-cluster sources share a common formation history or migrated with a similar formation channel in the vicinity of SgrA*. Alternatively, the gravitational influence of SgrA* in combination with a massive perturber, such as a putative intermediate mass black hole in the IRS 13 cluster, forces the dusty objects and S-stars to follow a particular orbital arrangement. Key words: stars: black holes – stars: formation – Galaxy: center – galaxies: star formation
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi... (Sérgio Sacani)
We present the JWST discovery of SN 2023adsy, a transient object located in the host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9), despite a host galaxy with low extinction, and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx (shubhijain836)
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
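The spinning described above is usually quantified through the relative centrifugal force, using the standard formula RCF = 1.118 × 10⁻⁵ × r(cm) × rpm². A small sketch (the rotor radius and speed in the example are our own illustrative values):

```python
def rcf(radius_cm, rpm):
    """Relative centrifugal force (in multiples of g) from rotor radius and speed."""
    return 1.118e-5 * radius_cm * rpm ** 2

# e.g. a 10 cm rotor spun at 3000 rpm:
print(round(rcf(10, 3000)))  # → 1006
```

Because RCF grows with the square of the rotational speed, doubling the rpm quadruples the separating force, which is why protocols specify ×g rather than rpm alone.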
PPT on Sustainable Land Management presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Anti-Universe And Emergent Gravity and the Dark Universe (Sérgio Sacani)
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Interpretable Sparse Sliced Inverse Regression for digitized functional data
1. Interpretable Sparse Sliced Inverse Regression for
digitized functional data
Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix
nathalie.villa@toulouse.inra.fr
http://www.nathalievilla.org
Seminar, Institut de Mathématiques de Bordeaux
8 April 2016
Nathalie Villa-Vialaneix | IS-SIR 1/26
2. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
8. A typical case study: meta-model in agronomy
climate (daily time series: rain, temperature, ...) → [Agronomic model] → plant phenotype predictions (yield, N leaching, ...)
Agronomic model:
based on biological and chemical knowledge;
computationally expensive to use;
useful for realistic predictions but not to understand the link between the inputs and the outputs.
Metamodeling: train a simplified, fast and interpretable model which can be used as a proxy for the agronomic model.
9. A first case study: SUNFLO [Casadebaig et al., 2011]
Inputs: 5 daily time series (length: one year) and 8 phenotypes for different sunflower types
Output: sunflower yield
Data: 1000 sunflower types × 190 climatic series (different places and years), i.e., n = 190 000 observations of variables in R^(5×183) × R^8
11. Main facts obtained from a preliminary study
R. Kpekou internship
The study focused on the influence of the climate on the yield: 5 functional variables digitized at 183 points.
Main result: using summaries of the variables (mean, sd, ...) over several weeks, together with an automatic aggregation procedure in a random forest, gave good prediction accuracy.
14. Question and mathematical framework
A functional regression problem: X: random variable (functional) & Y:
random real variable
E(Y|X)?
Data: n i.i.d. observations (xi, yi)i=1,...,n.
xi is not perfectly known but sampled at (fixed) points
xi = (xi(t1), . . . , xi(tp))T
∈ Rp
. We denote: X =
xT
1
...
xT
n
.
Question: Find a model which is easily interpretable and points out
relevant intervals for the prediction within the range of X.
Nathalie Villa-Vialaneix | IS-SIR 7/26
15. Related works (variable selection in FDA)
LASSO / L1 regularization in linear models: [Ferraty et al., 2010, Aneiros and Vieu, 2014] (isolated evaluation points), [Matsui and Konishi, 2011] (selects elements of an expansion basis), [James et al., 2009] (sparsity on derivatives: piecewise constant predictors)
[Fraiman et al., 2015]: blinding approach usable for various problems (PCA, regression...)
[Gregorutti et al., 2015]: adaptation of the variable importance in random forests to groups of variables
Our proposal: a semi-parametric (not entirely linear) model which selects relevant intervals, combined with an automatic procedure to define the intervals.
17. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
18. SIR in multidimensional framework
SIR: a semi-parametric regression model for X ∈ R^p:
Y = F(a_1^T X, ..., a_d^T X, ε)
for a_1, ..., a_d ∈ R^p (to be estimated), F: R^(d+1) → R unknown, and ε an error independent of X.
Standard assumption for SIR:
Y ⊥ X | P_A(X),
in which A is the so-called EDR space, spanned by (a_k)_{k=1,...,d}.
20. Estimation
Equivalence between SIR and eigendecomposition
A is included in the space spanned by the first d Σ-orthogonal eigenvectors of the generalized eigendecomposition problem
Γa = λΣa,
with Σ = E[(X − E(X))(X − E(X))^T] (the covariance of X) and Γ = E[(E(X|Y) − E(X))(E(X|Y) − E(X))^T] (the covariance of E(X|Y)).
Estimation (when n > p):
compute X̄ = (1/n) Σ_{i=1}^n x_i and Σ̂ = (1/n) (X − 1_n X̄^T)^T (X − 1_n X̄^T);
split the range of Y into H different slices τ_1, ..., τ_H and estimate Ê(X|Y) as the (H × p) matrix with rows X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, h = 1, ..., H, with n_h = |{i: y_i ∈ τ_h}|, and Γ̂ = Ê(X|Y)^T D Ê(X|Y) with D = Diag(n_1/n, ..., n_H/n);
solving the eigendecomposition problem Γ̂a = λΣ̂a gives the eigenvectors a_1, ..., a_d ⇒ Â = (a_1, ..., a_d), a (p × d) matrix.
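The estimation steps above can be sketched in a few lines of NumPy/SciPy (a minimal illustration, not the talk's code; equal-size slices are an assumption):

```python
import numpy as np
from scipy.linalg import eigh

def sir(X, y, H=10, d=2):
    """Basic SIR estimation (n > p case): slice the range of Y, estimate
    Sigma-hat and Gamma-hat, then solve Gamma a = lambda Sigma a."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                     # centered design matrix
    Sigma = Xc.T @ Xc / n                       # Sigma-hat (empirical covariance)
    order = np.argsort(y)                       # H slices of ~equal size
    M = np.zeros((H, p))                        # slice means of (centered) X
    w = np.zeros(H)                             # slice proportions n_h / n
    for h, idx in enumerate(np.array_split(order, H)):
        M[h] = Xc[idx].mean(axis=0)
        w[h] = len(idx) / n
    Gamma = M.T @ (w[:, None] * M)              # Gamma-hat = E-hat(X|Y)^T D E-hat(X|Y)
    vals, vecs = eigh(Gamma, Sigma)             # generalized eigendecomposition
    return vecs[:, np.argsort(vals)[::-1][:d]]  # first d EDR directions, (p x d)
```

With a monotone link and a single index, the leading eigenvector recovers the true direction up to scaling.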
24. Equivalent formulations
SIR as a regression problem: [Li and Yin, 2008] shows that SIR is equivalent to the (double) minimization of
E(A, C) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖²
for X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, A a (p × d) matrix and each C_h a vector in R^d.
Rk: Given A, each C_h is obtained as the solution of an ordinary least squares problem...
SIR as a canonical correlation problem: [Li and Nachtsheim, 2008] shows that SIR rewrites as the double optimization problem max_{a_j, φ} Cor(φ(Y), a_j^T X), where φ is any function R → R and the (a_j)_j are Σ-orthonormal.
Rk: The solution is shown to satisfy φ(y) = a_j^T E(X|Y = y), and a_j is also obtained as the solution of the mean square error problem
min_{a_j} E[(φ(Y) − a_j^T X)²].
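The remark that, given A, C is an ordinary least squares solution can be made concrete. The sketch below (my own illustration under the notation above, not code from the talk) solves each C_h by least squares with the common design B = Σ̂A:

```python
import numpy as np

def optimal_C(Xbar_h, Xbar, Sigma, A):
    """Given A, each C_h minimizes || Xbar_h - Xbar - Sigma A C_h ||^2,
    an ordinary least-squares problem with common design B = Sigma A."""
    B = Sigma @ A                               # (p x d) design shared by all slices
    return np.array([np.linalg.lstsq(B, m - Xbar, rcond=None)[0] for m in Xbar_h])
```

Alternating this step with the update of A is what makes the regression formulation practical.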
28. SIR in large dimensions: problem
In large dimensions (or in Functional Data Analysis), n < p, so Σ̂ is ill-conditioned and has no inverse ⇒ Z = (X − 1_n X̄^T) Σ̂^(−1/2) cannot be computed.
Different solutions have been proposed in the literature, based on:
prior dimension reduction (e.g., PCA) [Ferré and Yao, 2003] (in the framework of FDA)
regularization (ridge...) [Li and Yin, 2008, Bernard-Michel et al., 2008]
sparse SIR [Li and Yin, 2008, Li and Nachtsheim, 2008, Ni et al., 2005]
30. SIR in large dimensions: ridge penalty / L2-regularization of Σ̂
Following [Li and Yin, 2008], which shows that SIR is equivalent to the minimization of E(A, C), [Bernard-Michel et al., 2008] propose to add a ridge penalty in a high-dimensional setting:
E_2(A, C) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖² + µ2 Σ_{h=1}^H p̂_h ‖A C_h‖²
They also show that this problem is equivalent to finding the eigenvectors of the generalized eigenvalue problem
Γ̂a = λ(Σ̂ + µ2 I_p)a.
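The regularized eigenproblem can be solved directly; here is a minimal sketch (my illustration), where the ridge term keeps the problem well posed even when Σ̂ is singular:

```python
import numpy as np
from scipy.linalg import eigh

def ridge_sir_directions(Gamma, Sigma, mu2, d):
    """Ridge SIR: eigenvectors of Gamma a = lambda (Sigma + mu2 I_p) a.
    The right-hand-side matrix is positive definite for any mu2 > 0,
    so the generalized eigendecomposition always exists."""
    p = Sigma.shape[0]
    vals, vecs = eigh(Gamma, Sigma + mu2 * np.eye(p))
    return vecs[:, np.argsort(vals)[::-1][:d]]  # first d directions
```

Even with a rank-deficient Σ̂ (the n < p situation), the leading direction is recovered.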
33. SIR in large dimensions: sparse versions
Specific issue when introducing sparsity in SIR: sparsity has to be put on a multiple-index model. Most authors use shrinkage approaches.
First version: sparse penalization of the ridge solution. If (Â, Ĉ) are the solutions of the ridge SIR described on the previous slide, [Ni et al., 2005, Li and Yin, 2008] propose to shrink this solution by minimizing
E_{s,1}(α) = Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ Diag(α) Â Ĉ_h‖² + µ1 ‖α‖_{L1}
(regression formulation of SIR).
Second version: [Li and Nachtsheim, 2008] derive the sparse optimization problem from the correlation formulation of SIR:
min_{a_j^s} Σ_{i=1}^n (P_{â_j}(X|y_i) − (a_j^s)^T x_i)² + µ_{1,j} ‖a_j^s‖_{L1},
in which P_{â_j} is the projection of Ê(X|Y = y_i) = X̄_h onto the space spanned by the solution of the ridge problem.
35. Characteristics of the different approaches and possible extensions
                          [Li and Yin, 2008]       [Li and Nachtsheim, 2008]
sparsity on               shrinkage coefficients   estimates
nb of optimization pbs    1                        d
sparsity                  common to all dims       specific to each dim
Extension to block-sparse SIR (as in PCA)?
37. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
38. IS-SIR: a two step approach
Background: back in the functional setting, we suppose that t_1, ..., t_p are split into D intervals I_1, ..., I_D.
First step: solve the ridge problem on the digitized functions (viewed as high-dimensional vectors) to obtain Â and Ĉ:
min_{A,C} Σ_{h=1}^H p̂_h ‖X̄_h − X̄ − Σ̂ A C_h‖² + µ2 Σ_{h=1}^H p̂_h ‖A C_h‖²
Second step: sparse shrinkage using the intervals. If P_Â(E(X|Y = y_i)) = (X̄_h − X̄)^T Â for the h such that y_i ∈ τ_h, and if P_i = (P_i^1, ..., P_i^d)^T and P^j = (P_1^j, ..., P_n^j)^T, we solve
argmin_{α ∈ R^D} Σ_{j=1}^d ‖P^j − (X Δ(â_j)) α‖² + µ1 ‖α‖_{L1}
with Δ(â_j) the (p × D) matrix whose entry in row l, column k is â_{jl} if t_l ∈ I_k, and 0 otherwise.
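The second step above can be sketched as follows (my own illustration: intervals are assumed to be given as lists of point indices, and a plain ISTA loop stands in for glmnet, which the talk actually uses):

```python
import numpy as np

def interval_design(a_j, intervals, p):
    """Delta(a_j): (p x D) matrix whose entry in row l, column k is a_j[l]
    when t_l belongs to interval I_k, and 0 otherwise."""
    Delta = np.zeros((p, len(intervals)))
    for k, idx in enumerate(intervals):
        Delta[idx, k] = a_j[idx]
    return Delta

def lasso_ista(Z, y, mu1, n_iter=2000):
    """Minimal ISTA solver for ||y - Z alpha||^2 / (2n) + mu1 ||alpha||_1."""
    n = len(y)
    L = np.linalg.norm(Z, 2) ** 2 / n           # Lipschitz constant of the gradient
    alpha = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        a = alpha - Z.T @ (Z @ alpha - y) / (n * L)   # gradient step
        alpha = np.sign(a) * np.maximum(np.abs(a) - mu1 / L, 0.0)  # soft threshold
    return alpha

def is_sir_shrinkage(X, P, A_hat, intervals, mu1):
    """Second IS-SIR step: a single LASSO in alpha over the D intervals,
    stacking the d regression problems P^j ~ (X Delta(a_hat_j)) alpha."""
    n, p = X.shape
    d = A_hat.shape[1]
    Z = np.vstack([X @ interval_design(A_hat[:, j], intervals, p) for j in range(d)])
    return lasso_ista(Z, np.concatenate([P[:, j] for j in range(d)]), mu1)
```

The output is one shrinkage coefficient per interval, so a zero coefficient removes an entire interval from every EDR direction at once.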
41. IS-SIR: characteristics
uses the approach based on the correlation formulation (because the dimensionality of the optimization problem is smaller);
uses a shrinkage approach and optimizes the shrinkage coefficients in a single optimization problem;
handles the functional setting by penalizing entire intervals and not just isolated points.
42. Parameter estimation
H (number of slices): SIR is usually known to be not very sensitive to the number of slices (> d + 1). We took H = 10 (i.e., 10/30 observations per slice);
µ2 and d (ridge estimate Â):
- L-fold CV for µ2 (for a d_0 large enough). Note that GCV as described in [Li and Yin, 2008] cannot be used, since the current version of the L2 penalty involves the use of an estimate of Σ^(−1);
- using again L-fold CV, ∀ d = 1, ..., d_0, an estimate of R(d) = d − E[Tr(Π_d Π̂_d)], in which Π_d and Π̂_d are the projector onto the first d dimensions of the EDR space and its estimate, is derived similarly as in [Liquet and Saracco, 2012]. The evolution of R̂(d) versus d is studied to select a relevant d.
µ1 (LASSO): glmnet is used, and µ1 is selected by CV along the regularization path.
46. An automatic approach to define intervals
1 Initial state: ∀ k = 1, ..., p, τ_k = {t_k}
2 Iterate:
- along the regularization path, select three values of µ1: P% of the coefficients are zero, P% of the coefficients are non-zero, best GCV;
- define D− ("strong zeros") and D+ ("strong non-zeros");
- merge consecutive "strong zeros" (resp. "strong non-zeros"), or "strong zeros" (resp. "strong non-zeros") separated by a small number of intervals of undetermined type;
until no more merges can be performed.
3 Output: a collection of models (the first with p intervals, the last with 1), M*_D (optimal for GCV) and the corresponding GCV_D versus D (number of intervals).
Final solution: minimize GCV_D over D.
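One merging pass of the iteration above can be sketched as follows (my illustration: labels '+', '-', '?' mark strong non-zeros, strong zeros and undetermined intervals, and max_gap is an assumed tolerance for the "few undetermined intervals" rule):

```python
def merge_intervals(labels, max_gap=1):
    """One merging pass: short undetermined runs (length <= max_gap) squeezed
    between two identical strong labels are relabelled, then consecutive
    identical labels are merged into blocks (label, number of intervals)."""
    labels = list(labels)
    k = 0
    while k < len(labels):
        if labels[k] == '?':
            j = k
            while j < len(labels) and labels[j] == '?':   # run of '?' at k..j-1
                j += 1
            if 0 < k and j < len(labels) and labels[k - 1] == labels[j] \
                    and j - k <= max_gap:
                for i in range(k, j):                     # absorb the short run
                    labels[i] = labels[j]
            k = j
        else:
            k += 1
    blocks = []                                           # merge equal neighbours
    for lab in labels:
        if blocks and blocks[-1][0] == lab:
            blocks[-1][1] += 1
        else:
            blocks.append([lab, 1])
    return [(lab, n) for lab, n in blocks]
```

Repeating such passes until nothing changes yields the decreasing collection of interval models from which GCV_D picks the final D.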
52. Outline
1 Background and motivation
2 Presentation of SIR
3 Our proposal
4 Simulations
53. Simulation framework
Data generated with Y = Σ_{j=1}^d log⟨X, a_j⟩, with X(t) = Z(t) + ε, in which Z is a Gaussian process with mean µ(t) = −5 + 4t − 4t² and the Matérn 3/2 covariance function with parameters σ = 0.1 and θ = 0.2/√3, and ε is a centered Gaussian variable, independent of Z, with standard deviation 0.1;
a_j(t) = sin((2 + j)πt/2 − (j − 1)π/3) 1_{I_j}(t);
two models: for (M1), d = 1 and I_1 = [0.2, 0.4]; for (M2), d = 3 and I_1 = [0, 0.1], I_2 = [0.5, 0.65], I_3 = [0.65, 0.78].
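A sketch of this data-generating process (my own illustration, not the talk's code: the inner products ⟨X, a_j⟩ are approximated by Riemann sums, the Matérn parameterization is an assumption, and the absolute value inside the log is added to keep the sketch well defined):

```python
import numpy as np

def matern32(s, t, sigma=0.1, theta=0.2 / np.sqrt(3)):
    """Matern 3/2 covariance with the parameters of the simulation setting
    (assumed form: sigma^2 (1 + r) exp(-r) with r = sqrt(3)|s - t| / theta)."""
    r = np.sqrt(3) * np.abs(s - t) / theta
    return sigma**2 * (1 + r) * np.exp(-r)

def simulate(n, p=100, intervals=((0.2, 0.4),), seed=0):
    """Model (M1)-style data: X(t) = Z(t) + eps, Z Gaussian with mean
    mu(t) = -5 + 4t - 4t^2, a_j(t) = sin((2+j) pi t / 2 - (j-1) pi / 3) on I_j."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, p)
    K = matern32(t[:, None], t[None, :])
    Z = rng.multivariate_normal(-5 + 4 * t - 4 * t**2, K, size=n)
    X = Z + 0.1 * rng.normal(size=(n, p))       # eps: centered Gaussian, sd 0.1
    A = np.zeros((p, len(intervals)))
    for j, (lo, hi) in enumerate(intervals, start=1):
        mask = (t >= lo) & (t <= hi)
        A[mask, j - 1] = np.sin((2 + j) * np.pi * t[mask] / 2 - (j - 1) * np.pi / 3)
    y = np.log(np.abs(X @ A) / p).sum(axis=1)   # Riemann-sum inner products
    return t, X, A, y
```

By construction, each a_j is supported on its interval I_j only, which is exactly what IS-SIR is expected to recover.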
63. Conclusion
IS-SIR:
sparse dimension reduction model adapted to the functional framework;
fully automated definition of relevant intervals in the range of the predictors.
Perspectives:
application to real data;
block-wise sparse SIR?
65. References
Aneiros, G. and Vieu, P. (2014). Variable selection in infinite-dimensional problems. Statistics and Probability Letters, 94:12–20.
Bernard-Michel, C., Gardes, L., and Girard, S. (2008). A note on sliced inverse regression with regularizations. Biometrics, 64(3):982–986.
Casadebaig, P., Guilioni, L., Lecoeur, J., Christophe, A., Champolivier, L., and Debaeke, P. (2011). SUNFLO, a model to simulate genotype-specific performance of the sunflower crop in contrasting environments. Agricultural and Forest Meteorology, 151(2):163–178.
Ferraty, F., Hall, P., and Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika, 97(4):807–824.
Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.
Fraiman, R., Gimenez, Y., and Svarc, M. (2015). Feature selection for functional data. Journal of Multivariate Analysis. In press.
Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics and Data Analysis, 90:15–35.
James, G., Wang, J., and Zhu, J. (2009). Functional linear regression that's interpretable. Annals of Statistics, 37(5A):2083–2108.
Li, L. and Nachtsheim, C. (2008). Sparse sliced inverse regression. Technometrics, 48(4):503–510.
Li, L. and Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64:124–131.
Liquet, B. and Saracco, J. (2012). A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches. Computational Statistics, 27(1):103–125.
Matsui, H. and Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Computational Statistics and Data Analysis, 55(12):3304–3310.
Ni, L., Cook, D., and Tsai, C. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247.