This document discusses feature selection methods for causal inference in bioinformatics. It describes how relevance and causality differ, with relevant features not always being causal. Information theory concepts like mutual information, conditional mutual information, and interaction information are introduced to quantify dependence and independence between variables. The min-Interaction Max-Relevance (mIMR) filter method is proposed to select features based on both relevance to the target and minimal interaction, approximating causal relationships. Experimental results on breast cancer gene expression datasets show mIMR outperforms conventional ranking in predictive performance, identifying a potential causal signature for survival.
Perspective of feature selection in bioinformatics
1. Perspectives of feature selection in bioinformatics:
from relevance to causal inference
Gianluca Bontempi
Machine Learning Group,
Interuniversity Institute of Bioinformatics in Brussels (IB)²
Computer Science Department
ULB, Université Libre de Bruxelles
http://mlg.ulb.ac.be
2. The long way from data to knowledge
Which information can be extracted from data?
1 Descriptive statistics.
2 Parameters of a given model (model fitting, parameter
estimation, least squares).
3 Best predictive model among a set of candidates (validation,
assessment, bias/variance).
4 Most relevant features (multivariate statistics, regularization,
search).
5 Causal information.
This is also a good outline for a statistical machine learning course
as well as my personal research journey.
3. Causality in science
A major goal of the scientific activity is to model real
phenomena by studying the dependency between entities,
objects or more in general variables.
Sometimes the goal of the modelling activity is simply
predicting future behaviours. Sometimes the goal is to
understand the causes of a phenomenon (e.g. a disease).
Understanding the causes of a phenomenon means
understanding the mechanisms by which the observed
variables take their values and predicting how the values of
those variables would change if the mechanisms were subject
to manipulations (what-if scenarios).
Applications: understanding which actions to perform on a
system to have a desired effect (e.g. understanding the causes
of a tumor, the causes of the activation of a gene, the causes of
different survival rates in a cohort of patients).
4. Causal knowledge
Most of human knowledge is causal and concerns how things
work in the world, about mechanisms, behaviors.
This knowledge is causal in the sense it is about the
mechanisms which bring from causes to effects.
Mechanism: it is characterized by some inputs and outputs;
the setting of the inputs determines the outputs, but not vice versa.
Causal discovery aims to understand the mechanism by which
variables came to take on the values they have and to predict
what the values of those variables would be if the naturally
occurring mechanisms were subject to manipulations.
Intelligent behaviour should be related to the ability of
inferring cause-effect relationships from observations.
5. Prediction by supervised learning
[Diagram: input and output linked by a stochastic dependency; a prediction model, fitted on data, maps the input to a prediction of the output, and the discrepancy defines the prediction error.]
6. Relevance vs. causality
The design of predictive models is one of the main
contributions of machine learning.
The design of a model able to predict the value of a target
variable (e.g. phenotype, survival time) requires the definition
of a set of input variables (e.g. genome expression, weight,
age, smoking habits, nationality, frequency of vacations) which
are relevant, in the sense that they provide information about
the target.
It is easy to observe that the features which are good predictors
are not always the causes of the variable to be predicted.
In other terms, while causal variables are always relevant, the
converse is not necessarily true. Sometimes effects appear to
be better predictors than causes. Sometimes good predictors
do not have a direct causal link with the target.
7. Relevance and causality: common cause pattern
[Diagram: common-cause pattern in which age points to both height and reading capability.]
The height of a child provides information (i.e. is relevant)
about his reading capability though it is not causing it.
Your child will not read better by pulling his legs...and reading
books doesn’t make him taller...
The variable age is called a confounding variable.
8. Other examples
Some examples of spurious correlation may serve to illustrate the
difference between relevance and causality. In all these examples
the input is informative about the output (i.e. relevant) though it is
not a cause.
Input: number of firemen intervening in an accident. Target:
number of casualties.
Input: amount of Cokes drunk per day by a person. Target:
her sport performance.
Input: sales of ice-cream in a country. Target: number of
drowning deaths.
Input: sleeping with shoes. Target: waking up with a headache.
Input: chocolate consumption. Target: life expectancy.
Input: expression of gene 1. Target: expression of coregulated
gene 2.
9. Large dimensionality and causality
The problem of finding causes is even more difficult in
high-dimensional tasks (as in bioinformatics), where the number of
features (e.g. number of probes, variants) is often very large with
respect to the number of samples.
Even when experimental interventions are possible, performing
thousands of experiments to discover causal relationships
between thousands of variables is often not practical.
Dimensionality reduction techniques have been largely
discussed in statistics and machine learning. However, most of
the time they focused on improving prediction accuracy.
Open issue: can these techniques be useful also for causal
feature selection? Is prediction accuracy compatible with
causal discovery?
10. Feature selection: state of the art
Filter methods: preprocessing methods which assess the
merits of features from the data, ignoring the effects of the
selected feature subset on the performance of the learning
algorithm: ranking, PCA or clustering.
Wrapper methods: assess subsets of variables according to
their usefulness to a given predictor. Search for a good subset
using the learning algorithm itself as part of the evaluation
function: stepwise methods in linear regression.
Embedded methods: perform variable selection as part of the
learning procedure and are specific to given learning machines:
classification trees, random forests, and methods based on
regularization techniques (e.g. the lasso).
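As an illustration (not part of the original slides), the three families can be sketched with scikit-learn on synthetic data; the dataset, the choice of ten retained features and the specific estimators below are arbitrary assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression, LassoCV

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Filter: score each feature from the data alone, ignoring the final learner.
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: search feature subsets using the learner itself (recursive elimination).
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside the learning procedure (lasso shrinks weights to zero).
emb = LassoCV(cv=5).fit(X, y)

print(filt.get_support(indices=True))       # filter selection
print(wrap.get_support(indices=True))       # wrapper selection
print(np.flatnonzero(emb.coef_ != 0))       # embedded selection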
11. Ranking: the simplest feature selection
The most common feature selection strategy in bioinformatics
is ranking, where each variable is scored by its univariate
association with the target, as returned by a measure of relevance
like mutual information, correlation, or a p-value.
Ranking is simple and fast but:
1 it cannot take into consideration higher-order interaction terms
(e.g. complementarity)
2 it disregards redundancy between features
3 it does not distinguish between causes and effects. This is due
to the fact that univariate correlation (or relevance) does not
imply causation
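The first limitation can be reproduced in a few lines: in the XOR-like toy example below (an illustrative assumption, not data from any study), each input is individually uninformative, so univariate ranking would score it near zero, even though the pair determines the target exactly.

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 100_000)
x2 = rng.integers(0, 2, 100_000)
y = x1 ^ x2                                  # target depends on the pair only

print(mutual_info_score(y, x1))              # ~0: ranking would discard x1
print(mutual_info_score(y, x2))              # ~0: ranking would discard x2
print(mutual_info_score(y, 2 * x1 + x2))     # ~log(2): the pair is fully informative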
Causality is not addressed either in multivariate feature
selection approaches since their cost function typically takes
into consideration accuracy but disregards causal aspects.
12. Causality vs. dependency in a stochastic setting
A variable x is dependent on a variable y if the distribution of y is
different from the marginal one when we observe the value x = x:
Prob {y|x = x} ≠ Prob {y}
Dependency is symmetric: if x is dependent on y, then y is
dependent on x,
Prob {x|y = y} ≠ Prob {x}
A variable x is a cause of a variable y if the distribution of y is
different from the marginal one when we set the value x = x:
Prob {y|set(x = x)} ≠ Prob {y}
Causality is asymmetric:
Prob {x|set(y = y)} = Prob {x}
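The asymmetry can be illustrated with a toy linear mechanism x → y (a purely assumed example): observing either variable changes the conditional distribution of the other, but only the intervention set(x = x) propagates to y.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)      # mechanism: x causes y

# Observational conditioning: dependency is symmetric.
print(y[x > 1].mean())              # far from E[y] = 0
print(x[y > 1].mean())              # far from E[x] = 0

# Intervention set(x = 2): rerun the mechanism with x clamped; y shifts.
print((2 * 2 + rng.normal(size=n)).mean())

# Intervention set(y = 2): x is generated upstream of y and does not change.
print(rng.normal(size=n).mean())    # still ~0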
13. Main properties of causal relationships
Given causes (inputs) x and effects (output) y
stochastic dependency: changing x is likely to end up with a
change in y, in probabilistic terms the effects y are dependent
on the causes x
asymmetry: changing y won’t modify (the distribution of) x
conditional independency: the effect y is independent of all
the other variables (apart from its effects) given the direct
causes x. In other words the direct causes screen off the
indirect causes from the effects. Note the analogy with the
notion of (Markov) state in (stochastic) dynamic systems.
temporality: the variation of y does not occur before x.
All this makes Directed Acyclic Graphs a convenient formalism to
represent causality.
14. Graphical model (excerpt from the Guyon et al. paper)
[Diagram: a causal graph over the variables Anxiety, Smoking, Genetic factor, Allergy, Hormonal factor, Lung cancer, Coughing, Metastasis, and Other cancers, with edge labels (a)-(d).]
15. Causation and data
Causation is much harder to measure than dependency (e.g.
correlation or mutual information). Correlations can be
estimated directly in a single uncontrolled observational study,
while causal conclusions are stronger with controlled
experiments.
Data may be collected in an experimental or in an observational setting.
Manipulation of variables is possible only in the experimental
setting. Two types of experimental configurations exist:
randomised and controlled. These are the typical settings
allowing causal discovery.
Most statistical studies are confronted with observational,
static settings. Nevertheless, causal knowledge is more
and more demanded by end users.
16. Entropy and conditional entropy
Consider a binary output class y ∈ {c1 = 0, c2 = 1}
The entropy of y is
H(y) = −p0 log p0 − p1 log p1
This quantity is greater than or equal to zero and measures the
uncertainty of y.
Once the conditional probabilities
Prob {y = 1|x = x} = p1(x), Prob {y = 0|x = x} = p0(x)
are introduced, we can define the conditional entropy for a given x
H[y|x] = −p0(x) log p0(x) − p1(x) log p1(x)
which measures the lack of predictability of y given x.
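A small numerical illustration (the probabilities are arbitrary): the entropy of the binary class, the conditional entropy for each value of a binary x, and their average over x, which is the usual conditional entropy H(y|x).

import numpy as np

def entropy(p1):
    """Entropy (in nats) of a binary variable with Prob{y=1} = p1."""
    p = np.array([1 - p1, p1])
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(entropy(0.5))                      # maximal uncertainty: log 2
print(entropy(0.05))                     # almost deterministic: close to 0

# Conditional entropy for each value of a binary x, then averaged over x.
p_x = np.array([0.5, 0.5])               # Prob{x = 0}, Prob{x = 1}
p1_given_x = np.array([0.1, 0.9])        # p1(x) for x = 0 and x = 1
h_given_x = np.array([entropy(p) for p in p1_given_x])
print(h_given_x)                         # H[y|x] for each x
print(float(p_x @ h_given_x))            # average over x: smaller than H(y)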
17. Information and dependency
Let us use the formalism of information theory to quantify the
dependency between variables.
Given two continuous rvs x1 ∈ X1, x2 ∈ X2, the mutual
information
I(x1; x2) = H(x1) − H(x1|x2)
measures stochastic dependence between x1 and x2.
In the case of Gaussian distributed variables
I(x1; x2) = −(1/2) log(1 − ρ²)
where ρ is the Pearson correlation coefficient.
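A quick sanity check of the Gaussian formula on simulated data (the correlation of 0.8 is an arbitrary choice): the value −(1/2) log(1 − ρ²) is computed from the sample correlation coefficient.

import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + np.sqrt(1 - 0.8**2) * rng.normal(size=n)   # corr(x1, x2) = 0.8

rho = np.corrcoef(x1, x2)[0, 1]
print(rho)                          # ~0.8
print(-0.5 * np.log(1 - rho**2))    # ~0.51 nats of mutual information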
18. Conditional information
The conditional mutual information
I(x1; x2|y) = H(x1|y) − H(x1|x2, y)
quantifies how the dependence between two variables depends
on the context.
The conditional mutual information is null iff x1 and x2 are
conditionally independent given y.
This is the case of the example with x1 =reading, x2 =height
and y =age.
The information that a (set of) variable(s) brings about
another is
1 conditional on the context (i.e. which other variables are
known).
2 non monotone: it can increase or decrease according to the
context.
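The reading/height/age case can be mimicked with a toy Gaussian common-cause model (all coefficients are assumptions); under the Gaussian approximation the conditional mutual information follows from the partial correlation and drops to roughly zero once the confounder is given.

import numpy as np

rng = np.random.default_rng(0)
n = 50_000
age = rng.normal(size=n)
height = age + rng.normal(size=n)        # age -> height
reading = age + rng.normal(size=n)       # age -> reading capability

def gauss_mi(a, b):
    r = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1 - r**2)

def gauss_cmi(a, b, c):
    # I(a; b | c) via the partial correlation of a and b given c
    ra, rb, rab = np.corrcoef(a, c)[0, 1], np.corrcoef(b, c)[0, 1], np.corrcoef(a, b)[0, 1]
    rp = (rab - ra * rb) / np.sqrt((1 - ra**2) * (1 - rb**2))
    return -0.5 * np.log(1 - rp**2)

print(gauss_mi(height, reading))          # > 0: marginally dependent (relevant)
print(gauss_cmi(height, reading, age))    # ~0: independent given the confounder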
19. Interaction information
The interaction information quantifies the amount of trivariate
dependence that cannot be explained by bivariate information.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2).
When it is different from zero, we say that x1, x2 and y
three-interact.
A non-zero interaction can be either negative, and in this case we
say that there is a synergy or complementarity between the
variables, or positive, and we say that there is redundancy.
I(x1; x2; y) = I(x1; y) − I(x1; y|x2) = I(x2; y) − I(x2; y|x1) = I(x1; x2) − I(x1; x2|y)
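Both signs can be seen on a small discrete example (constructed purely for illustration): an XOR target gives a negative interaction (synergy), whereas a target and a feature that are both noisy copies of the same variable give a positive interaction (redundancy).

import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    # plug-in I(a; b | c) for discrete arrays: average I(a; b) within each stratum of c
    vals, counts = np.unique(c, return_counts=True)
    return sum(w / len(c) * mutual_info_score(a[c == v], b[c == v])
               for v, w in zip(vals, counts))

rng = np.random.default_rng(0)
n = 100_000
flip = lambda a, p: (a ^ (rng.random(n) < p)).astype(int)

x1 = rng.integers(0, 2, n)
x2_syn = rng.integers(0, 2, n)
y_syn = x1 ^ x2_syn                     # synergy: only the pair is informative

x2_red = flip(x1, 0.1)                  # redundancy: x2 is a noisy copy of x1
y_red = flip(x1, 0.1)                   # ... and so is the target

for label, x2, y in [("synergy", x2_syn, y_syn), ("redundancy", x2_red, y_red)]:
    print(label, mutual_info_score(x1, y) - cond_mi(x1, y, x2))   # negative, then positive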
24. Joint information and interaction
Since
I((x1, x2); y) = I(x2; y) + I(x1; y|x2)
and
I(x1; y|x2) = I(x1; y) − I(x1; x2; y)
it follows that
I((x1, x2); y) [joint information] = I(x1; y) + I(x2; y) − I(x1; x2; y)
= I(x1; y) [relevance] + I(x2; y) [relevance] − [I(x1; x2) − I(x1; x2|y)] [interaction]    (1)
Note that the above relationships hold also when either x1 or x2 is a
vectorial random variable.
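The decomposition can be checked numerically on a small discrete example (the particular joint distribution is arbitrary): the joint information of the pair matches the sum of the two relevances minus the interaction term, up to sampling error.

import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    vals, counts = np.unique(c, return_counts=True)
    return sum(w / len(c) * mutual_info_score(a[c == v], b[c == v])
               for v, w in zip(vals, counts))

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.integers(0, 2, n)
x2 = (x1 ^ (rng.random(n) < 0.3)).astype(int)            # partly redundant with x1
y = ((x1 ^ x2) ^ (rng.random(n) < 0.2)).astype(int)      # noisy function of the pair

joint = mutual_info_score(y, 2 * x1 + x2)                 # I((x1, x2); y)
decomp = (mutual_info_score(y, x1) + mutual_info_score(y, x2)
          - (mutual_info_score(x1, x2) - cond_mi(x1, x2, y)))
print(joint, decomp)                                      # approximately equal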
25. min-Interaction Max-Relevance (mIMR) filter
Let X+ = {xi ∈ X : I(xi; y) > 0} be the subset of X containing all
variables having non-null mutual information (i.e. non-null
relevance) with y.
The mIMR forward step is
x*_{d+1} = arg max_{xk ∈ X+ \ XS} [ I(xk; y) − λ I(XS; xk; y) ]
≈ arg max_{xk ∈ X+ \ XS} [ I(xk; y) − (λ/d) Σ_{xi ∈ XS} I(xi; xk; y) ]
where λ measures the amount of causation that we want to take
into consideration.
Note that λ = 0 boils down to the conventional ranking approach.
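A sketch of one forward step, as one possible reading of the update above (not the authors' implementation): relevance and interaction terms are estimated with the Gaussian approximation used later in the experiments, and λ trades relevance against the average interaction with the already selected set XS.

import numpy as np

def gauss_mi(a, b):
    r = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(max(1 - r**2, 1e-12))

def gauss_cmi(a, b, c):
    # I(a; b | c) via partial correlation (Gaussian assumption)
    ra, rb, rab = np.corrcoef(a, c)[0, 1], np.corrcoef(b, c)[0, 1], np.corrcoef(a, b)[0, 1]
    rp = (rab - ra * rb) / np.sqrt((1 - ra**2) * (1 - rb**2))
    return -0.5 * np.log(max(1 - rp**2, 1e-12))

def interaction(xi, xk, y):
    # I(xi; xk; y) = I(xi; xk) - I(xi; xk | y)
    return gauss_mi(xi, xk) - gauss_cmi(xi, xk, y)

def mimr_step(X, y, selected, lam=1.0):
    """Index of the next feature chosen by the mIMR forward rule (X: samples x features)."""
    candidates = [k for k in range(X.shape[1])
                  if k not in selected and gauss_mi(X[:, k], y) > 0]
    def score(k):
        rel = gauss_mi(X[:, k], y)
        if not selected:
            return rel
        return rel - lam * np.mean([interaction(X[:, i], X[:, k], y) for i in selected])
    return max(candidates, key=score)

# Usage: greedily grow a signature of v features.
# selected = []
# for _ in range(v):
#     selected.append(mimr_step(X, y, selected, lam=1.0))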
26. Experimental setting
Goal: identification of a causal signature of breast cancer
survival.
Two steps:
1 compare the generalization accuracy of conventional ranking
with mIMR
2 interpret the causal signature
Each experiment was conducted in a meta-analytical and
cross-validation framework.
27. Datasets
6 public microarray datasets (n = 13,091 unique genes)
derived from different breast cancer clinical studies.
All the microarray studies are characterized by the collection of
gene expression data and survival data.
In order to adopt a classification framework, the survival of the
patients was transformed into a binary class (low or high risk)
based on the clinical outcome at five years.
28. Experiments
Two sets of meta-analysis validation experiments:
Holdout: 100 training-and-test repetitions.
Leave-one-dataset-out where for each dataset the features
used for classification are selected without considering the
patients of the dataset itself.
All the experiments were repeated for three sizes of the gene
signature (number of selected features): v = 20, 50, 100.
All the mutual information terms are computed by using the
Gaussian approximation.
29. Assessment
The quality of the selection is represented by the accuracy of a
Naive Bayes classifier measured by four different criteria to be
maximized:
1 the Area Under the ROC curve (AUC),
2 1-RMSE where RMSE stands for Root Mean Squared Error
3 the SAR (Squared error, Accuracy, and ROC score)
4 the precision-recall F score measure.
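For concreteness, the four criteria could be computed from predicted class-1 probabilities as in the sketch below; the definition of SAR as the average of accuracy, AUC and 1−RMSE and the 0.5 decision threshold are assumptions of this sketch.

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

def assessment(y_true, p1, threshold=0.5):
    """y_true: binary labels; p1: predicted probabilities of the positive class."""
    y_hat = (p1 >= threshold).astype(int)
    auc = roc_auc_score(y_true, p1)
    one_minus_rmse = 1 - np.sqrt(np.mean((y_true - p1) ** 2))
    sar = (accuracy_score(y_true, y_hat) + auc + one_minus_rmse) / 3
    return {"AUC": auc, "1-RMSE": one_minus_rmse, "SAR": sar, "F": f1_score(y_true, y_hat)}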
31. Causal interpretation
The introduction of a causality term leads to a prioritization of
the genes according to their causal role.
Since genes are not acting in isolation but rather in pathways,
we analyzed the gene rankings in terms of gene set enrichment
analysis (GSEA).
By quantifying how the causal ranking of genes diverges from the
conventional one (λ = 0) as λ increases, we can identify the
gene sets that are potential causes or effects of breast cancer.
32. Causal characterization of genes
Genes that remain among the top ranked ones for increasing
λ can be considered as individually relevant (i.e. they
contain predictive information about survival) and causal.
Genes whose rank increases for increasing λ are putative
causes: they have less individual relevance than other genes
(for example, those being direct effects) but they are causal
together with others. These genes would have been missed by
conventional ranking (false negatives).
Genes whose rank decreases for increasing λ are putative
effects, in the sense that they are individually relevant but
probably not causal. This set of genes could be erroneously
considered as causal (false positives) by conventional
ranking.
33. GSEA analysis
[Figure: three panels (A, B, C) plotting the Normalized Enrichment Score (NES, from −2 to 2) of GO terms as λ takes the values 0, 0.5, 1 and 2; the GO terms shown include microtubule cytoskeleton organization and biogenesis, coenzyme metabolic process, regulation of cyclin-dependent protein kinase activity, cellular defense response, inflammatory response, defense response, M phase, M phase of mitotic cycle, and DNA replication.]
The larger the NES of a GO term, the stronger the association of this
gene set with survival; the sign of the NES reflects the direction of
the association of the GO term with survival, a positive score meaning
that over-expression of the genes implies worse survival, and
inversely.
34. Individually causal genes
The first group of GO terms are implicated in cell movement and
division, cellular respiration and regulation of cell cycle. It was
shown that this family of proteins may cause dysregulation of cell
proliferation to promote tumor progression.
The second GO term represents the coenzyme metabolic process,
which includes proteins shown to be early indicators of breast
cancer; perturbation of these coenzymes might cause cancers by
compromising the structure of important enzyme complexes
implicated in mitochondrial functions.
The genes of the third GO term (regulation of cyclin-dependent protein
kinase activity) are key players in cell cycle regulation, and inhibition
of such kinases has been shown to block proliferation of human breast cancer
cells.
35. Jointly causal genes
Counterintuitively, the three GO terms in this category are related
to the immune system that is thought to be more an effect of the
tumor growth as lymphocytes strike cancer cells as they proliferate.
However, several findings support the idea that the immune system
might have a causal role in tumorigenesis.
There is strong evidence of interplay between the immune system and
tumors, since solid tumors are commonly infiltrated by immune cells;
in contrast to infiltration by cells responsible for chronic
inflammation, the presence of high numbers of lymphocytes,
especially T cells, has been reported to be an indicator of good
prognosis in many cancers, which concurs with the sign of the
enrichment.
36. Putative effects
The last group of GO terms is related to cell cycle and
proliferation.
In our previous research, we have shown that a quantitative
measurement of proliferation genes using mRNA gene expression
could provide an accurate assessment of prognosis of breast cancer
patients.
The enrichment of these proliferation-related genes seems to be a
downstream effect of the breast tumorigenesis instead of its cause.
37. Indistinguishable cases
mIMR shows that some causal patterns (e.g. open triplets or
unshielded colliders) can be discriminated by using notions
based on conditional independence.
These notions are exploited also by structural identification
approaches (e.g. PC algorithm in Bayesian networks) which
rely on notions of independence and conditional independence
to detect causal patterns in the data.
Unfortunately, these approaches cannot deal with
indistinguishable configurations like the two-variable setting
and the completely connected triplet configuration where it is
impossible to distinguish between cause and effects by means
of conditional or unconditional independence tests.
38. From dependency to causality
However indistinguishability does not prevent the existence of
statistical algorithms able to reduce the uncertainty about the
causal pattern even in indistinguishable configurations.
In recent years a series of approaches has appeared to deal with
the two-variable setting, like ANM and IGCI.
What is common to these approaches is that they use
alternative statistical features of the data to detect causal
patterns and reduce the uncertainty about their directionality.
A further important step in this direction has been represented
by the recent ChaLearn cause-effect pair challenge (YouTube
video "CauseEffectPairs" by I. Guyon).
39. ChaLearn cause-effect pair challenge
Hundreds of pairs of real variables with known causal
relationships from several domains (chemistry, climatology,
ecology, economy, engineering, epidemiology, genomics,
medicine).
Those were intermixed with controls (pairs of independent
variables and pairs of variables that are dependent but not
causally related) and semi-artificial cause-effect pairs (real
variables mixed in various ways to produce a given outcome).
The good accuracy obtained by the competitors shows
that learning strategies can infer with success (or at least
significantly better than random) indistinguishable
configurations.
We took part in the ChaLearn challenge and developed a
Dependency to Causality (D2C) learning approach for bivariate
settings, which ranked 8th on the final leaderboard.
42. The D2C approach
Given two variables, the D2C approach infers from a number
of observed statistical features of the bivariate distribution
(e.g. the empirical estimation of the copula) or the n-variate
distribution (e.g. dependencies between members of the
Markov Blankets) the probability of the existence and then of
the directionality of the causal link between two variables.
The approach is an example of how the problem of causal
inference can be formulated as a supervised machine learning
task where the inputs are features describing the
probabilistic dependency and the output is a class denoting the
existence (or not) of the causal link.
Once sufficient training data are made available, conventional
feature selection algorithms and classifiers can be used to
return a prediction.
43. The D2C approach: n > 2 variables
[Diagram: the two variables zi and zj with the members of their Markov blankets, denoted c(1)_i, c(2)_i, e(1)_i, s(1)_i around zi and c(1)_j, c(2)_j, e(1)_j, s(1)_j around zj (c for direct causes, e for direct effects, s for the remaining blanket members).]
44. Some asymmetrical relationships
By using d-separation we can write down a set of asymmetrical
relations between the members of the two Markov Blankets (taking zi
to be a direct cause of zj); for each pair, the table reports whether
independence holds unconditionally, conditioning on the effect zj,
and conditioning on the cause zi:

Pair (for all k)     Unconditional   Conditioning on the effect zj   Conditioning on the cause zi
c(k)_i , c(k)_j      independent     dependent                       independent
e(k)_i , c(k)_j      independent     dependent                       independent
c(k)_i , e(k)_j      dependent       independent                     independent
zi , c(k)_j          independent     dependent                       n/a
c(k)_i , zj          dependent       n/a                             independent
45. The algorithm
1 infers the Markov Blankets MBi = {m(ki), ki = 1, . . . , Ki} and
MBj = {m(kj), kj = 1, . . . , Kj} of zi and zj,
2 computes the positions Pi(ki) of each m(ki) of MBi in MBj and the
positions Pj(kj) of each m(kj) in MBi,
3 computes
1 I = [I(zi; zj), I(zi; zj|MBj \ zi), I(zi; zj|MBi \ zj)], where \
denotes the set difference operator,
2 Ii(ki, kj) = I(m(ki)_i; m(kj)_j|zi) and Ij(ki, kj) = I(m(ki)_i; m(kj)_j|zj),
where ki = 1, . . . , Ki and kj = 1, . . . , Kj,
4 creates a vector of descriptors
x = [Q(P̂i), Q(P̂j), I, Q(Îi), Q(Îj), N, n]
where P̂i and P̂j are the empirical distributions of Pi and Pj,
Îi and Îj are the empirical distributions of Ii(ki, kj) and
Ij(ki, kj) (ki = 1, . . . , Ki, kj = 1, . . . , Kj), and Q returns the
sample quantiles of a distribution.
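A heavily simplified sketch of the descriptor construction; the helpers markov_blanket and cmi are assumed to be provided and the quantile grid is an arbitrary choice, so this illustrates the structure of the vector x rather than the authors' implementation.

import numpy as np

QUANTS = [0.1, 0.25, 0.5, 0.75, 0.9]

def d2c_descriptor(Z, i, j, markov_blanket, cmi):
    """Z: (N, n) data matrix; markov_blanket(Z, i) -> array of column indices;
    cmi(a, b, C) -> estimated I(a; b | C) with C an (N, k) matrix or None."""
    mb_i, mb_j = markov_blanket(Z, i), markov_blanket(Z, j)

    # relative position of each blanket member inside the other blanket (1.0 if absent)
    pos_i = [np.flatnonzero(mb_j == m)[0] / len(mb_j) if m in mb_j else 1.0 for m in mb_i]
    pos_j = [np.flatnonzero(mb_i == m)[0] / len(mb_i) if m in mb_i else 1.0 for m in mb_j]

    # dependencies between zi and zj, unconditional and given the other blanket
    I = [cmi(Z[:, i], Z[:, j], None),
         cmi(Z[:, i], Z[:, j], Z[:, np.setdiff1d(mb_j, [i])]),
         cmi(Z[:, i], Z[:, j], Z[:, np.setdiff1d(mb_i, [j])])]

    # pairwise dependencies between blanket members, conditioned on zi and on zj
    Ii = [cmi(Z[:, a], Z[:, b], Z[:, [i]]) for a in mb_i for b in mb_j]
    Ij = [cmi(Z[:, a], Z[:, b], Z[:, [j]]) for a in mb_i for b in mb_j]

    return np.concatenate([np.quantile(pos_i, QUANTS), np.quantile(pos_j, QUANTS), I,
                           np.quantile(Ii, QUANTS), np.quantile(Ij, QUANTS),
                           [Z.shape[0], Z.shape[1]]])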
46. The algorithm in words
Asymmetries between MBi and MBj induce an asymmetry on
(P̂i, P̂j) and (Îi, Îj), and the quantiles provide information
about the directionality of the causal link (zi → zj or zj → zi).
The distribution of these variables should return useful
information about which variable is the cause and which is the effect.
These distributions would be more informative if we were able to
rank the terms of the Markov Blankets by prioritizing the
direct causes (i.e. the terms ci and cj), since these terms play
a major role in the asymmetries.
The D2C algorithm can then be improved by choosing mIMR
to prioritize the direct causes in the MB set.
48. Experimental validation
Training set made of D = 6000 pairs (xd, yd), obtained
by generating 750 DAGs and storing for each of them the
descriptors associated with 4 positive examples (i.e. pairs
where the node zi is a direct cause of zj) and 4 negative
examples (i.e. pairs where zi is not a direct cause of zj).
Dependency between children and parents is modelled by 3 types
of additive relationships (linear, quadratic, nonlinear).
A Random Forest classifier is trained on the balanced dataset
and assessed on the test set.
Test set made of 190 independent DAGs for the small
configuration and 90 for the large configuration. For each
DAG we select 4 positive examples (i.e. pairs where the node
zi is a direct cause of zj) and 6 negative examples (i.e. pairs
where the node zi is not a direct cause of zj).
Comparison with state-of-the-art approaches (ANM, DAGL, GS,
IAMB, PC, HC), most of them implemented in the bnlearn package.
51. Conclusions
The scientific community is demanding learning algorithms
able to detect, in a fast and reliable manner, subsets of
informative and causal features from observational data.
Pessimistic point of view: Correlation (or dependency) does
not imply causation.
Optimistic point of view: Causation implies correlation (or
dependency).
Causality leaves footprints on the patterns of stochastic
dependency which can be (hopefully) retrieved from
data.
This implies that inferring causes without designing
experiments is possible once we look for such constraints.
52. Quote from Scheines
Statisticians are often skeptical about causality but in almost every
case these same people (over beer later) confess their heresy by
concurring that their real ambitions are causal and their public
agnosticism is a prophylactic against the abuse of statistics by their
clients or less careful practitioners. (Scheines 2002)