This document provides an introduction to statistical learning methods. It begins with background information on statistical learning problems and discusses concepts like underfitting, overfitting, and consistency. It then summarizes decision trees and random forests, describing how they are learned from data and make predictions. Support vector machines and neural networks are also briefly mentioned. Key goals of statistical learning methods include accuracy on training data as well as generalization to new data.
Kernel methods for data integration in systems biology (tuxette)
This document provides an overview of a seminar presentation on kernel methods for data integration in systems biology. It begins with a short biography of the presenter, who is trained as a mathematician and statistician and applies these skills to research in human health and animal genomics using various omics data types. Examples are given of the presenter's past work inferring networks and integrating gene expression and lipid data, as well as expression and 3D DNA location data. The talk will discuss how to integrate multiple omics data from different sources and types using kernels. Kernels allow reducing high-dimensional data to similarity matrices and are not restricted to numeric data. They also allow embedding expert knowledge and provide a framework for statistical learning.
ACCOST is a method for differential analysis of Hi-C data between two conditions with replicates. It models Hi-C interaction counts with a negative binomial distribution that accounts for distance effects between loci through an offset term. ACCOST normalizes counts with ICE and estimates model parameters to obtain a p-value for each bin pair comparing the two conditions. It was validated on several datasets and shown to identify more differential contacts than other methods like diffHic and FIND, particularly at short genomic distances.
La statistique et le machine learning pour l'intégration de données de la bio... (tuxette)
This document summarizes a presentation on using statistics and machine learning for integrating high-throughput biological data. It discusses how biological data is large in volume, multi-scaled and heterogeneous in type, creating bottlenecks for analysis. It presents different methods for integrating multiple data tables, including multiple kernel learning to combine similarity matrices. An example application to TARA Oceans data is described, identifying Rhizaria abundance as structuring ocean differences. Interpretability of results is discussed along with prospects for deep learning and predicting phenotypes while understanding relationships.
This document provides an introduction to neural networks. It begins with an outline covering statistical machine learning concepts like underfitting, overfitting and consistency. It then discusses multi-layer perceptrons, the basic building blocks of neural networks. It covers how perceptrons are presented, their theoretical properties, and how learning occurs. Finally, it provides an overview of deep neural networks and convolutional neural networks. The goal is to introduce fundamental concepts in neural networks from statistical learning to modern deep learning architectures.
Reproducibility and differential analysis with Selfish (tuxette)
Selfish is a Python tool for identifying differentially interacting chromatin regions from Hi-C contact maps of two conditions with no replicates. It begins by distance-correcting the interaction frequencies. It then computes Gaussian filters over neighboring bins to capture spatial dependencies. It compares the evolution of these filters between conditions and assigns p-values assuming Gaussian differences. Selfish is faster than existing methods and shows enrichment for epigenetic markers near differential regions. However, its statistical justification could be improved as it does not model overdispersion like other methods.
Kernel methods and variable selection for exploratory analysis and multi-omic... (tuxette)
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Graph Neural Network for Phenotype Prediction (tuxette)
This document describes a study on using graph neural networks (GNNs) for phenotype prediction from gene expression data. The objectives are to determine if including network information can improve predictions, which network types work best, and if GNNs can learn network inferences. It provides background on GNNs and how they generalize convolutional layers to graph data. The authors implemented a GNN model from previous work as a starting point and tested it on different network types to see which network information is most useful for predictions. Their methodology involves comparing GNN performance to other methods like random forests using 10-fold cross validation.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Convolutional networks and graph networks through kernels (tuxette)
This presentation discusses how convolutional kernel networks (CKNs) can be used to model sequential and graph-structured data through kernels defined over sequences and graphs. CKNs define feature maps from substructures like n-mers in sequences and paths in graphs into high-dimensional spaces, which are then approximated to obtain low-dimensional representations that can be used for prediction tasks like classification. This approach is analogous to convolutional neural networks and can be extended to multiple layers. The presentation provides examples showing CKNs achieve good performance on problems involving protein sequences and social networks.
Machine Learning: Foundations Course Number 0368403401 (butest)
This machine learning foundations course will consist of 4 homework assignments with both theoretical and programming problems in Matlab, plus a final exam. Students will work in groups of 2-3 to take notes during classes in LaTeX format; these class notes will contribute 30% to the overall grade. The course will cover basic machine learning concepts like storage and retrieval, learning rules, and estimating flexible models, as well as applications in areas like control, medical diagnosis, and document retrieval.
This document summarizes kernel methods in machine learning. It begins with an introductory example of using a kernel function to perform binary classification in a reproducing kernel Hilbert space. It then defines positive definite kernels and shows how they allow representing algorithms as operating in linear dot product spaces while using nonlinear kernel functions. The document covers fundamental properties of kernels, provides examples, and discusses how kernels define reproducing kernel Hilbert spaces for regularization. It overviews various kernel-based machine learning approaches and modeling structured responses using statistical models in reproducing kernel Hilbert spaces.
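To make the kernel idea above concrete, here is a minimal NumPy sketch (an illustration, not code from the summarized document) that builds a Gaussian (RBF) kernel matrix, a standard positive definite kernel; the bandwidth gamma is an arbitrary choice:

import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # Gaussian (RBF) kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    # The result is symmetric positive semi-definite, hence a valid kernel.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(rbf_kernel_matrix(X).round(3))  # close points near 1, distant points near 0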
Machine Learning: Foundations Course Number 0368403401 (butest)
This machine learning course will cover theoretical and practical machine learning concepts. It will include 4 homework assignments and programming in Matlab. Lectures will be supplemented by student-submitted class notes in LaTeX. Topics will include learning approaches like storage and retrieval, rule learning, and flexible model estimation, as well as applications in areas like control, medical diagnosis, and web search. A final exam format has not been determined yet.
The document discusses prototype-based models in machine learning. It provides an overview of unsupervised learning techniques including vector quantization and self-organizing maps, which group similar data points together to form clusters or reduce dimensions. It also discusses supervised learning methods like learning vector quantization, which learns prototype vectors to classify new examples based on their distances to the prototypes. The document uses examples like clustering iris flower data with a self-organizing map to illustrate prototype-based modeling approaches.
The Advancement and Challenges in Computational Physics - Phdassistance (PhD Assistance)
For the last five decades, computational physics has been a valuable scientific tool in physics. Compared with using only theoretical and experimental approaches, it has enabled physicists to understand complex problems better. At the time, however, computational physics was mostly a research activity, with relatively few organised undergraduate programmes.
Kernel Methods and Relational Learning in Computational Biology (Michiel Stock)
This document discusses kernel methods and relational learning in computational biology. It begins with an introduction to kernel methods, describing how they can handle structured and heterogeneous biological data. It then provides overviews of various kernel techniques for dealing with sequences, graphs, and other objects. The document also discusses learning relationships between different types of objects using Kronecker kernels and conditional ranking algorithms. It gives an example application of predicting enzyme function and concludes that kernel methods are well-suited for computational biology challenges involving complex objects and relationships between objects.
RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING (butest)
This document provides an introduction to recent advances in predictive machine learning, specifically support vector machines and boosted decision trees. It begins with an overview of predictive learning and common methods. It then describes kernel methods, including how they were extended to support vector machines. Next, it discusses extending decision trees with boosting. The document concludes by comparing support vector machines and boosted decision trees, and noting they are not the only recent advances in machine learning.
This document provides an introduction to statistical model selection. It discusses various approaches to model selection including predictive risk, Bayesian methods, information theoretic measures like AIC and MDL, and adaptive methods. The key goals of model selection are to understand the bias-variance tradeoff and select models that offer the best guaranteed predictive performance on new data. Model selection aims to find the right level of complexity to explain patterns in available data while avoiding overfitting.
Learning for Optimization: EDAs, probabilistic modelling, or ... (butest)
Marcus Gallagher gave a talk on explicit modelling in metaheuristic optimization. He discussed estimation of distribution algorithms which use probabilistic models to represent promising regions of the search space. He provided examples of modelling approaches like PBIL, MIMIC, COMIT and BOA. Finally, he summarized that EDAs take an explicit modelling approach to optimization using existing statistical models and can solve challenging problems by visualizing the model.
Invited lecture on Machine Learning in Medicine at the joint "Integrated Omics" course of Hanze University and University Hospital UMCG, Groningen, The Netherlands
Intuition-Based Teaching Mathematics for Engineers (IDES Editor)
It is suggested that mathematics for engineers be taught through the development of mathematical intuition, thus combining conceptual and operational approaches. The main mathematical concepts are taught through discussion of carefully selected case studies, followed by the solving of algorithmically generated problems to help students master the appropriate mathematical tools. The former component develops mathematical intuition; the latter applies adaptive instructional technology to improve operational skills. The proposed approach is applied to teaching uniform convergence and to knowledge generation using object-oriented methodology from computer science.
The document describes the Mendel approach for understanding class hierarchies in object-oriented programs. Mendel uses a simple model and metrics to identify interesting classes based on their size, novelty within a hierarchy, and a combination of both. It also analyzes subclassing behaviors to distinguish between classes that mainly extend versus override functionality. The approach is demonstrated on example systems like JHotDraw and Azureus to provide insights into their key classes and inheritance usage.
The aim of this research is to find an accurate solution to Troesch's problem using a high-performance technique based on a parallel-processing implementation.
Design/methodology/approach – A feed-forward neural network is designed to solve an important type of differential equation that arises in many applied sciences and engineering applications. The design rests on choosing a suitable learning rate, transfer function, and training algorithm. The authors use backpropagation with a new implementation of the Levenberg-Marquardt training algorithm, together with a new idea for choosing the weights. The effectiveness of the suggested network design is demonstrated by using it to solve Troesch's problem in many cases.
Findings – A new idea for choosing the weights of the neural network and a new implementation of the Levenberg-Marquardt training algorithm help to speed up convergence, and the implementation of the suggested design demonstrates its usefulness in finding exact solutions.
Gabriella Casalino, Nicoletta Del Buono, Corrado Mencar (2011). Subtractive Initialization of Nonnegative Matrix Factorizations for Document Clustering. In Fuzzy Logic and Applications (WILF 2011), 188-195.
The 9th International Workshop on Fuzzy Logic and Applications, August 29-31 2011, Trani
The Bayesian paradigm provides a coherent approach for quantifying uncertainty given available data and prior information. Aspects of uncertainty that arise in practice include uncertainty regarding parameters within a model, the choice of model, and propagation of uncertainty in parameters and models for predictions. In this talk I will present Bayesian approaches for addressing model uncertainty given a collection of competing models including model averaging and ensemble methods that potentially use all available models and will highlight computational challenges that arise in implementation of the paradigm.
Introduction to Machine Learning Lectures (ssuserfece35)
This lecture discusses ensemble methods in machine learning. It introduces bagging, which trains multiple models on random subsets of the training data and averages their predictions, in order to reduce variance and prevent overfitting. Bagging is effective because it decreases the correlation between predictions. Random forests apply bagging to decision trees while also introducing more randomness by selecting a random subset of features to consider at each node. The next lecture will cover boosting, which aims to reduce bias by training models sequentially to focus on examples previously misclassified.
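To make the bagging recipe in this summary concrete, here is a minimal illustrative sketch (assuming scikit-learn is available; iris is an arbitrary stand-in for the training data): each tree is trained on a bootstrap resample, and predictions are aggregated by majority vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Bagging: train each tree on a bootstrap sample (drawn with replacement).
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote over the individual tree predictions.
votes = np.stack([t.predict(X) for t in trees])
y_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy:", (y_pred == y).mean())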
Predictive Modeling in Insurance in the context of (possibly) big data (Arthur Charpentier)
This document discusses predictive modeling in insurance in the context of big data. It begins with an introduction to the speaker and outlines some key concepts in actuarial science from both American and European perspectives. It then provides examples of common actuarial problems involving ratemaking, pricing, and claims reserving. The document reviews the history of actuarial models and discusses issues around statistical learning, machine learning, and their relationship to statistics. It also covers model evaluation and various loss functions used in modeling.
Module-2_Notes-with-Example for data science (pujashri1975)
The document discusses several key concepts in probability and statistics:
- Conditional probability is the probability of one event occurring given that another event has already occurred.
- The binomial distribution models the probability of success in a fixed number of binary experiments. It applies when there are a fixed number of trials, two possible outcomes, and the same probability of success on each trial.
- The normal distribution is a continuous probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation. Many real-world variables approximate a normal distribution.
- Other concepts discussed include range, interquartile range, variance, and standard deviation. The interquartile range describes the spread of a dataset's middle 50%.
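A short illustrative sketch of these concepts (assuming NumPy and SciPy; the numbers are invented for the example):

import numpy as np
from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5.
print(stats.binom.pmf(3, n=10, p=0.5))

# Normal: probability that a N(0, 1) variable falls below 1.96 (about 0.975).
print(stats.norm.cdf(1.96))

# Spread measures on a small sample: variance, standard deviation, and IQR.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.var(ddof=1), x.std(ddof=1))
q1, q3 = np.percentile(x, [25, 75])
print("IQR:", q3 - q1)  # spread of the middle 50% of the data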
This document provides an overview of Bayesian networks through a 3-day tutorial. Day 1 introduces Bayesian networks and provides a medical diagnosis example. It defines key concepts like Bayes' theorem and influence diagrams. Day 2 covers propagation algorithms, demonstrating how evidence is propagated through a sample chain network. Day 3 will cover learning from data and using continuous variables and software. The overview outlines propagation algorithms for singly and multiply connected graphs.
This document provides an overview of probability, statistics, and their applications in engineering. It defines key probability and statistics concepts like trials, outcomes, random experiments, and frequency distributions. It explains how engineers use statistics and probability to analyze data from tests and experiments to better understand product quality and failure rates. Examples are given of measures of central tendency like mean and median, measures of variation like standard deviation and variance, and the normal distribution curve. Engineering applications include using these analytical techniques to assess results from a class and compare two data histograms.
The document discusses using unusual data sources in insurance. It provides examples of using pictures, text, social media data, telematics, and satellite imagery in insurance. It also discusses challenges in analyzing complex and high-dimensional data from these sources and introduces machine learning tools like PCA, generalized linear models, and evaluating models using loss, risk, and cross-validation.
This document discusses classifier performance evaluation. It outlines different methods for evaluating classifier performance, including hold-out, k-fold cross-validation, and bootstrap aggregating. It emphasizes that evaluation should be treated as statistical hypothesis testing, using metrics like accuracy, precision, and recall calculated from a confusion matrix. Proper evaluation also requires partitioning data into separate training and test sets to avoid overfitting and to get an accurate estimate of a classifier's generalization performance.
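An illustrative sketch of this evaluation workflow (assuming scikit-learn; the dataset and model choices are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold-out: keep a separate test set to estimate generalization performance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print(confusion_matrix(y_te, y_pred))  # rows: true class, columns: predicted
print("precision:", precision_score(y_te, y_pred))
print("recall:", recall_score(y_te, y_pred))

# k-fold cross-validation: average accuracy over 5 train/test partitions.
cv_model = make_pipeline(StandardScaler(), LogisticRegression())
print("5-fold CV accuracy:", cross_val_score(cv_model, X, y, cv=5).mean())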
The document discusses key concepts in probability theory and statistical decision making under uncertainty. It covers topics like data generation processes being modelled as random variables, Bayes' rule for calculating conditional probabilities, discriminant functions for classification, and utility theory for making rational decisions. Bayesian networks and influence diagrams are introduced as graphical models for representing conditional independence between variables and making decisions. Finally, the document notes that future chapters will focus on estimating probabilities from data using parametric, semiparametric, and nonparametric approaches.
The document discusses multiple statistical comparisons and techniques for controlling error rates when performing multiple hypothesis tests on data. It introduces the concepts of family-wise error rate (FWER) and false discovery rate (FDR), and methods like the Sidak correction, Bonferroni correction, and Benjamini-Hochberg procedure for controlling FWER and FDR. It also discusses how p-value distributions can be used to estimate FDR and calculate q-values. Interactive demonstrations are provided to help illustrate key concepts like Type I and Type II errors.
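The Benjamini-Hochberg procedure mentioned above is short enough to sketch directly (a minimal illustration; the p-values are invented):

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Returns a boolean mask of hypotheses rejected at FDR level alpha.
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[:k + 1]] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74, 0.9]
print(benjamini_hochberg(pvals))  # only the smallest p-values survive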
Composing graphical models with neural networks for structured representatio... (Jeongmin Cha)
This presentation discusses the Structural Variational Autoencoder (SVAE) model, which combines graphical models and neural networks. SVAE uses neural networks to model observations and produce dense low-dimensional representations, while also explicitly representing discrete mixture components through a graphical model. This allows for structured probabilistic representations and fast exact inference. SVAE leverages conjugacy between the prior and the likelihood, which aids Bayesian inference and makes the marginal likelihood tractable. The model is demonstrated on a mouse behavior video segmentation task.
In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to some general purpose kernel methods like Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers as the hype around Deep Neural Network architectures seem to suggest or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity.” (“Entia non sunt multiplicanda praeter necessitatem.”) I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.
Machine Learning, Financial Engineering and Quantitative Investing (Shengyuan Wang Steven)
This document discusses machine learning applications in financial engineering and quantitative investing. It covers machine learning techniques for curve construction, model calibration, instrument valuation, and risk measurement in quantitative finance. Specifically, it discusses using machine learning methods for yield curve construction, volatility surface calibration, discount curve calibration, and model parameter estimation from historical data. The goal is to apply machine learning to automate quantitative finance tasks and improve the accuracy of pricing and risk models.
This document provides an overview of latent Gaussian models and the INLA methodology. It discusses how hierarchical Bayesian models can be represented as latent Gaussian models, with a latent Gaussian field and hyperparameters. Latent Gaussian models have computational benefits due to the sparse precision matrix encoding conditional independence. Several examples of latent Gaussian models are provided, including mixed effects models, time series models, and disease mapping models. The document outlines how the INLA method can be used for Bayesian computation with these types of models.
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf (AravindS199)
Sir Francis Galton was a prominent English statistician, anthropologist, eugenicist, and psychometrician in the 19th century. He produced over 340 papers and books, and created the statistical concepts of correlation and regression. As a pioneer in meteorology and differential psychology, he devised early weather maps, proposed theories of weather patterns, and developed questionnaires to study human communities and intelligence. The document discusses Galton's background and contributions to statistics, anthropology, meteorology, and psychometrics.
Similar to A short introduction to statistical learning (20)
Racines en haut et feuilles en bas : les arbres en maths (tuxette)
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Méthodes à noyaux pour l'intégration de données hétérogènes (tuxette)
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
Méthodologies d'intégration de données omiques (tuxette)
This document summarizes a presentation on multi-omics data integration methods given by Nathalie Vialaneix on December 13, 2023. The presentation discusses different types of omics data that can be integrated, both vertically across different levels of omics data on the same samples and horizontally across similar types of omics data on different samples. It also discusses different analysis approaches that can be taken, including supervised and unsupervised methods. The rest of the presentation focuses on unsupervised transformation-based integration methods using kernels.
The document discusses current and future work on analyzing Hi-C data and differential analysis of Hi-C matrices. It describes a clustering method developed to partition chromosomes based on Hi-C matrix similarity. It also introduces a new method called treediff for differential analysis of Hi-C data that calculates the distance between hierarchical clusterings. Current work includes reviewing differential analysis methods, investigating differential subtrees with multiple testing control, and inferring chromatin interaction networks.
Can deep learning learn chromatin structure from sequence? (tuxette)
This document discusses a deep learning model called ORCA that can predict chromatin structure from DNA sequence. The model uses a neural network with an encoder to extract features from sequence and a decoder to predict Hi-C matrices. It was trained on Hi-C data from multiple cell types and can predict interactions between regions at various resolutions. The model accurately captures features like CTCF-mediated loops and can predict effects of structural variants on chromatin structure. It allows for in silico mutagenesis to study how mutations may alter 3D genome organization.
Multi-omics data integration methods: kernel and other machine learning appro... (tuxette)
The document discusses multi-omics data integration methods, particularly kernel methods. It describes how kernel methods transform data into similarity matrices between samples rather than relying on variable space. Multiple kernel integration approaches are presented that combine multiple similarity matrices into a consensus kernel in an unsupervised manner, such as through a STATIS-like framework that maximizes the similarity between kernels. Examples of applications to datasets from the TARA Oceans expedition are given.
This document provides an overview of the MetaboWean and Idefics projects. MetaboWean aims to study the co-evolution of gut microbiota and epithelium during suckling-to-weaning transition in rabbits, using metabolomics, metagenomics, and single-cell RNA sequencing data. Idefics integrates multiple omics datasets from human skin samples to understand relationships between microorganisms and molecules and how they are structured in patient groups. The datasets include metagenomics, metabolomics, and proteomics from host and microbiota.
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ... (tuxette)
ASTERICS is an interactive and integrative data analysis tool for omics data. It uses Rserve and PyRserve with Flask and Vue.js in a Docker container to integrate omics data. The backend uses Rserve and PyRserve with Flask on the server side, while the frontend uses Vue.js. This architecture was chosen for its open source and light design. Data communication between Rserve and PyRserve is limited, requiring an object database. ASTERICS is deployed using three Docker containers for R, Python, and
Apprentissage pour la biologie moléculaire et l'analyse de données omiques (tuxette)
This document summarizes a scientific presentation about molecular biology and omics data analysis. The presentation covers topics related to analyzing large omics datasets using methods like kernel methods, graphical models, and neural networks to learn gene regulation networks and predict phenotypes. Key challenges addressed are handling big data, missing values, non-Gaussian data types like counts and compositional data. The goal is to better understand complex biological systems from multi-omics data.
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r... (tuxette)
The document summarizes preliminary results from evaluating methods for inferring gene regulatory networks from expression data in Bacillus subtilis. It finds that recall of the known network is generally poor (<20% for random forest), but inferred clusters still retain biological information about common regulators. It plans to confirm results, test restricting edges to sigma factors, and explore other inference methods like Bayesian networks and ARACNE.
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap... (tuxette)
The document discusses methods for integrating multi-scale omics data using kernel and machine learning approaches. It describes how omics data is large, heterogeneous, and multi-scaled, creating bottlenecks for analysis. Methods discussed for data integration include multiple kernel learning to combine different relational datasets in an unsupervised way. The methods are applied to integrate different datasets from the TARA Oceans expedition to identify patterns in ocean microbial communities. Improving interpretability of the methods and making them more accessible to biological users is discussed.
Journal club: Validation of cluster analysis results on validation data (tuxette)
This document presents a framework for validating cluster analysis results on validation data. It describes situations where clustering is inferential versus descriptive and recommends using validation data separate from the data used for clustering. A typology of validation methods is provided, including validation based on the clustering method or results, and evaluation using internal validation, external validation, visual properties, or stability measures.
The document discusses the differences between overfitting and overparametrization in machine learning models. It explores how random forests may exhibit a phenomenon known as "double descent" where test error initially decreases then increases with more parameters before decreasing again. While double descent has been observed in other models, the document questions whether it is directly due to model complexity in random forests since very large trees may be unable to fully interpolate extremely large datasets.
Selective inference and single-cell differential analysis (tuxette)
This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
SOMbrero : un package R pour les cartes auto-organisatrices (tuxette)
SOMbrero is an R package that implements self-organizing map (SOM) algorithms. It can handle numeric, non-numeric, and relational data. The package contains functions for training SOMs, diagnosing results, and plotting maps. It also includes tools like a shiny app and vignettes to aid users without programming experience. SOMbrero supports missing data imputation and extends SOM to relational datasets through non-Euclidean distance measures.
A short and naive introduction to using network in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
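A minimal NumPy sketch of the Laplacian machinery described above (the toy graph is invented for the illustration):

import numpy as np

# Adjacency matrix of a small graph: two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

# Unnormalized graph Laplacian: L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Eigenvectors associated with small eigenvalues reflect the graph structure:
# the second one (the Fiedler vector) separates the two triangles by sign.
eigvals, eigvecs = np.linalg.eigh(L)
print(eigvals.round(3))
print(eigvecs[:, 1].round(3))  # opposite signs on the two clusters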
This document summarizes different approaches for structure learning in graph neural networks. It discusses three main classes of methods: 1) metric-based learning which learns a similarity matrix between nodes, 2) probabilistic models which learn the parameters of a distribution over graphs, and 3) direct optimization which directly optimizes the graph adjacency matrix. The document provides examples of methods within each class and notes challenges such as the simplicity of probabilistic models and computational difficulties of direct optimization.
Anti-Universe And Emergent Gravity and the Dark Universe (Sérgio Sacani)
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf (Selcen Ozturkcan)
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Signatures of wave erosion in Titan's coasts (Sérgio Sacani)
The shorelines of Titan's hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan's seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan's seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B... (Creative-Biolabs)
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc... (PsychoTech Services)
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each interacting organism benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
The mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
The mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
The mutualistic relationship allows the interacting organisms to behave as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
[Diagram: Compound A (utilized by population 1) → Compound B (utilized by population 2) → Compound C (utilized by both populations 1 + 2) → Products]
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the cooperation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Together, populations 1 and 2 can carry out a metabolic sequence leading to the formation of an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arabinosus occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus in turn requires phenylalanine produced by E. faecalis.
5-8. Background
Purpose: predict Y from X.
What we have: n observations of (X, Y): (x1, y1), . . . , (xn, yn).
What we want: estimate the unknown Y for new values of X: xn+1, . . . , xm.
X can be:
numeric variables;
or factors;
or a combination of numeric variables and factors.
Y can be:
a numeric variable (Y ∈ R) ⇒ (supervised) regression;
a factor ⇒ (supervised) classification.
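A minimal sketch of this setup (assuming scikit-learn; the toy data are invented for the illustration): a regression model is fitted when Y is numeric, a classifier when Y is a factor, and both are then used to predict Y for new values of X.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# n observations (x_i, y_i), then predictions for new x_{n+1}, ..., x_m.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Y numeric => (supervised) regression.
y_num = np.array([0.1, 1.2, 1.9, 3.2, 3.8, 5.1])
print(LinearRegression().fit(x, y_num).predict([[6.0], [7.0]]))

# Y a factor => (supervised) classification.
y_fac = np.array(["low", "low", "low", "high", "high", "high"])
print(LogisticRegression().fit(x, y_fac).predict([[0.5], [4.5]]))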
9-14. Basics
From the observations (xi, yi)i, a machine Φn is defined such that
$\hat{y}_{\mathrm{new}} = \Phi_n(x_{\mathrm{new}})$.
If Y is numeric, Φn is called a regression function; if Y is a factor, Φn is called a classifier.
Φn is said to be trained or learned from the observations (xi, yi)i.
Desirable properties
accuracy to the observations: predictions made on known data are close to observed values;
generalization ability: predictions made on new data are also accurate.
Conflicting objectives!!
Underfitting / Overfitting
[Series of figures, all on the same simulated example: the function $x \mapsto y$ to be estimated; observations we might have; observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate estimation); a third estimation (overfitting).]
Errors
training error (measures the accuracy to the observations):
if $y$ is a factor: misclassification rate
$$\frac{\#\{i : \hat{y}_i \neq y_i,\ i = 1, \ldots, n\}}{n};$$
if $y$ is numeric: mean square error (MSE)
$$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.
test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
1 split the data into training/test sets (usually 80%/20%);
2 train $\Phi_n$ on the training dataset;
3 compute the test error on the remaining data.
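A minimal R sketch of this simple-validation scheme; the data frame d and the response y are hypothetical names, and any learner could replace the linear model used here:
set.seed(42)
n <- nrow(d)
train_idx <- sample(n, size = round(0.8 * n))   # 80% of rows for training
d_train <- d[train_idx, ]
d_test  <- d[-train_idx, ]
model <- lm(y ~ ., data = d_train)              # train Phi_n on the training set
mse <- function(y, yhat) mean((y - yhat)^2)
mse(d_train$y, predict(model, d_train))         # training error
mse(d_test$y,  predict(model, d_test))          # test error: estimates generalization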
Bias / Variance trade-off
This problem is also related to the well-known bias/variance trade-off:
bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor);
variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor).
The overall error is: $\mathbb{E}(\mathrm{MSE}) = \mathrm{Bias}^2 + \mathrm{Variance}$.
Consistency in the parametric/nonparametric case
Example in the parametric framework (linear methods): an assumption is made on the form of the relation between $X$ and $Y$:
$$Y = \beta^T X + \epsilon.$$
$\beta$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method, which computes an estimate $\beta_n$. The estimation is said to be consistent if $\beta_n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Example in the nonparametric framework: the form of the relation between $X$ and $Y$ is unknown:
$$Y = \Phi(X) + \epsilon.$$
$\Phi$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method, which computes an estimate $\Phi_n$. The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Consistency from the statistical learning perspective [Vapnik, 1995]
Question: are we really interested in estimating $\Phi$, or rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which measures an error),
$$\mathbb{E}\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right),$$
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.
Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right)$ and $L_\Phi = \mathbb{E}\left(R(\Phi(X), Y)\right)$.
Desirable properties from a mathematical perspective
Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification).
Learning process: choose a machine $\Phi_n$ in a class of functions $\mathcal{C} \subset \{\Phi : \mathcal{X} \to \mathbb{R}\}$ (e.g., $\mathcal{C}$ is the set of all functions that can be built using an SVM).
Error decomposition:
$$L_{\Phi_n} - L^* \leq \left[ L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \right] + \left[ \inf_{\Phi \in \mathcal{C}} L_\Phi - L^* \right],$$
where
$\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*$ is the richness of $\mathcal{C}$ (i.e., $\mathcal{C}$ must be rich to ensure that this term is small);
$L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}_n(\Phi) - L_\Phi|$, with $\hat{L}_n(\Phi) = \frac{1}{n} \sum_{i=1}^{n} R(\Phi(x_i), y_i)$, is the generalization capability of $\mathcal{C}$ (i.e., in the worst case, the empirical error must be close to the true error: $\mathcal{C}$ must not be too rich to ensure that this term is small).
Outline
1 Introduction
Background and notations
Underfitting / Overfitting
Consistency
2 CART and random forests
Introduction to CART
Learning
Prediction
Overview of random forests
Bootstrap/Bagging
Random forest
3 SVM
4 (not deep) Neural networks
Seminal references
Multi-layer perceptrons
Theoretical properties of perceptrons
Learning perceptrons
Learning in practice
Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method: no prior assumption needed;
can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method);
provides an intuitive interpretation.
Drawbacks
requires a large training dataset to be efficient;
as a consequence, trees are often too simple to provide accurate predictions.
Example
$X$ = (Gender, Age, Height) and $Y$ = Weight.
[Tree diagram: the root is split on Height (< 1.60 m vs. > 1.60 m); subsequent nodes are split on Gender (M/F) and on Age (< 30 vs. > 30); the leaves give the predictions $Y_1, \ldots, Y_5$. The diagram illustrates the terms root, split, node, and leaf (terminal node).]
CART learning process
Algorithm
Start from the root
repeat
  move to a "new" node
  if the node is homogeneous or small enough then
    STOP
  else
    split the node into two child nodes with maximal "homogeneity"
  end if
until all nodes are processed
Further details
Homogeneity?
if $Y$ is a numeric variable: variance of $(y_i)_i$ for the observations assigned to the node (the Gini index is also sometimes used);
if $Y$ is a factor: node purity, i.e., the percentage of observations assigned to the node whose $Y$ values are not the node's majority class.
Stopping criteria?
minimum node size (generally 1 or 5);
minimum node purity or variance;
maximum tree depth.
Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning.
Making new predictions
A new observation $x_{\mathrm{new}}$ is assigned to a leaf (straightforward); the corresponding prediction $\hat{y}_{\mathrm{new}}$ is:
if $Y$ is numeric, the mean value of the (training) observations assigned to the same leaf;
if $Y$ is a factor, the majority class of the (training) observations assigned to the same leaf.
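A minimal sketch in R with the rpart package (a standard CART implementation; the control settings below mirror the stopping criteria discussed above) covering both learning and prediction:
library(rpart)
data(iris)
# learn a classification tree; minsplit/maxdepth/cp are the stopping criteria
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 5, maxdepth = 5, cp = 0.01))
plot(fit); text(fit)    # the tree itself gives an intuitive interpretation
# prediction: each observation is routed to a leaf and gets its majority class
predict(fit, newdata = iris[c(1, 51, 101), ], type = "class")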
Advantages/Drawbacks
Random Forest: introduced by [Breiman, 2001].
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method (no prior assumption needed) and accurate;
can deal with a large number of input variables, either numeric variables or factors;
can deal with small samples.
Drawbacks
black-box model;
until now, supported by only a few mathematical results (consistency, ...).
Basic description
A fact: when the sample size is small, you might be unable to estimate the model properly.
This issue is commonly tackled by bootstrapping and, more specifically, by bagging (Bootstrap Aggregating), which reduces the variance of the estimator.
Bagging: combination of simple (and individually inefficient) regression (or classification) functions.
Random forest: bagging applied to CART trees.
Bootstrap
Bootstrap sample: random sampling (with replacement) of the training dataset; the samples have the same size as the original dataset.
It is a general (and robust) approach to solve several problems, e.g., estimating confidence intervals (for a statistic of $X$ with no prior assumption on the distribution of $X$):
1 build $P$ bootstrap samples from $(x_i)_i$;
2 use them to compute $P$ estimates of the statistic;
3 the confidence interval is based on the percentiles of the resulting empirical distribution.
Bootstrapping is also useful to estimate p-values, residuals, ...
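A minimal sketch of the percentile bootstrap in R, here for a confidence interval on the mean of a sample x (simulated only for illustration):
set.seed(1)
x <- rexp(50)                     # a sample; no distributional assumption is used below
P <- 1000                         # number of bootstrap samples
boot_means <- replicate(P, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # 95% CI from the empirical percentiles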
Bagging
Average the estimates of the regression (or classification) function obtained from $B$ bootstrap samples.
Bagging with regression trees
1: for $b = 1, \ldots, B$ do
2:   build a bootstrap sample $\xi_b$
3:   train a regression tree $\hat{\phi}_b$ on $\xi_b$
4: end for
5: estimate the regression function by
$$\hat{\Phi}_n(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{\phi}_b(x)$$
For classification, the predicted class is the majority-vote class.
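A minimal R sketch of this bagging algorithm with rpart regression trees; d and y are hypothetical names for a data frame and its numeric response:
library(rpart)
B <- 100
trees <- vector("list", B)
for (b in 1:B) {
  xi_b <- d[sample(nrow(d), replace = TRUE), ]   # bootstrap sample xi_b
  trees[[b]] <- rpart(y ~ ., data = xi_b)        # regression tree phi_b
}
# aggregated estimate: average of the B tree predictions
Phi_hat <- function(newdata) rowMeans(sapply(trees, predict, newdata = newdata))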
Random forests
CART bagging with two additional variations:
1 each node split is based on a random (and different) subset of $q$ variables (an advisable choice for $q$ is $\sqrt{p}$ for classification and $p/3$ for regression);
2 each tree is fully developed (overfitted).
Hyperparameters
those of the CART algorithm;
those specific to the random forest: $q$ and the number of trees.
Random forests are not very sensitive to hyperparameter settings: the default value for $q$ and 500-1000 trees should work in most cases.
Additional tools
OOB (Out-Of-Bag) error: error computed on the observations not included in the "bag". Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable, to help interpretation: for a given variable $X^{(j)}$,
1: randomize the values of the variable;
2: make predictions from this new dataset;
3: the importance is the mean decrease in accuracy (MSE or misclassification rate).
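In R, the randomForest package exposes all of this; a minimal sketch on iris:
library(randomForest)
data(iris)
set.seed(2)
# the default mtry is sqrt(p) for classification and p/3 for regression
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf               # prints the OOB error estimate
plot(rf)         # OOB error vs. number of trees: check stabilization
importance(rf)   # permutation importance (mean decrease in accuracy)
varImpPlot(rf)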
Basic introduction
Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$; a training set $(x_1, y_1), \ldots, (x_n, y_n)$ is given.
SVM is a kernel-based method. It is universally consistent, provided that the kernel is universal [Steinwart, 2002]. Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Optimal margin classification
[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]
$w$ is chosen such that:
$\min_w \|w\|^2$ (the margin is the largest),
under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect).
⇒ ensures a good generalization capability.
Soft margin classification
[Figure: separating hyperplane with margin $1/\|w\|_2$, support vectors, and a few observations allowed inside the margin.]
$w$ is chosen such that:
$\min_{w,\xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest),
under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect).
⇒ allowing a few errors improves the richness of the class.
Non linear SVM
[Figure: data that are not linearly separable in the original space $\mathcal{X}$ become linearly separable in the feature space $\mathcal{H}$ after a non linear mapping $\Psi$.]
$w \in \mathcal{H}$ is chosen such that ($P_{C,\mathcal{H}}$):
$\min_{w,\xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest),
under the constraints $y_i(\langle w, \Psi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
SVM from different points of view
A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow$
$$(P^2_{\lambda,\mathcal{H}}) : \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}},$$
where $f_w(x) = \langle \Psi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (the hinge loss function).
[Figure: errors versus $\hat{y}$ for $y = 1$; blue: hinge loss; green: misclassification error.]
A dual problem: $(P_{C,\mathcal{H}}) \Leftrightarrow$
$$(D_{C,\mathcal{X}}) : \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$.
There is no need to know $\Psi$ and $\mathcal{H}$:
choose a function $K$ with a few good properties;
use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
Which kernels?
Minimum properties that a kernel should fulfil:
symmetry: $K(u, u') = K(u', u)$;
positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i)_i \subset \mathbb{R}^N$, $\forall\, (x_i)_i \subset \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.
[Aronszajn, 1950]: then there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\Psi : \mathcal{X} \to \mathcal{H}$ such that $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
Examples
the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$);
the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^T x'$ (it is not universal).
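A minimal R sketch computing a Gaussian kernel matrix and checking the two properties above numerically:
data(iris)
x <- as.matrix(iris[1:10, 1:4])           # 10 observations in R^4
gamma <- 0.5
K <- exp(-gamma * as.matrix(dist(x))^2)   # K(x, x') = exp(-gamma ||x - x'||^2)
isSymmetric(K)                            # symmetry
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # >= 0: positivity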
In summary, what does the solution look like?
$$\Phi_n(x) = \sum_{i} \alpha_i y_i K(x_i, x),$$
where only a few $\alpha_i \neq 0$. The observations $i$ such that $\alpha_i \neq 0$ are the support vectors!
I'm almost dead with all this stuff on my mind!!! What in practice?
data(iris)
# keep a two-class problem: versicolor vs. virginica
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
With a linear kernel, tuning the cost parameter by cross validation:
library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     cost
#     0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#     kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21
Confusion matrix of the fitted model (5 training errors) and a plot of the decision boundary:
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
With the (default) Gaussian kernel, tuning both gamma and the cost:
res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     gamma cost
#       0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#     cost = 2^(2:4))
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: radial
#   cost: 4
#   gamma: 0.5
# Number of Support Vectors: 32
Confusion matrix (1 training error) and decision boundary:
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
What are (artificial) neural networks?
Common properties
(artificial) "neural networks": a general name for supervised and unsupervised methods developed in (vague) analogy to the brain;
a combination (network) of simple elements (neurons).
[Figure: graphical representation of a network, with input units on the left and output units on the right.]
Different types of neural networks
A neural network is defined by:
1 the network structure;
2 the neuron type.
Standard examples
multilayer perceptrons (MLP): dedicated to supervised problems (classification and regression);
radial basis function networks (RBF): same purpose but based on local smoothing;
self-organizing maps (SOM, also sometimes called Kohonen's maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
...
In this talk, the focus is on MLP.
MLP: Advantages/Drawbacks
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method: flexible;
good theoretical properties.
Drawbacks
hard to train (high computational cost, especially when $d$ is large);
overfits easily;
"black box" model (hard to interpret).
References
Advised references:
[Bishop, 1995, Ripley, 1996]: overviews of the topic from a learning (more than statistical) perspective;
[Devroye et al., 1996, Györfi et al., 2002]: dedicated chapters present the statistical properties of perceptrons.
Analogy to the brain
1 a neuron collects signals from neighboring neurons through its dendrites;
2 when the total signal is above a given threshold, the neuron is activated...
3 ... and a signal is sent to other neurons through the axon.
Connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron).
First model of an artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]
[Figure: inputs $x^{(1)}, \ldots, x^{(p)}$ are combined with weights $w_1, \ldots, w_p$ and a bias $w_0$, then thresholded to produce $f(x)$.]
$$f : x \in \mathbb{R}^p \mapsto \mathbb{1}_{\left\{ \sum_{j=1}^{p} w_j x^{(j)} + w_0 \geq 0 \right\}}$$
(artificial) Perceptron
Layers
MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$, or $y \in \{1, \ldots, K-1\}$ for classification) and several hidden layers;
no connections within a layer;
connections only between two consecutive layers (feedforward).
[Figure: a 2-hidden-layer MLP for regression: inputs $x = (x^{(1)}, \ldots, x^{(p)})$, layer 1 with weights $w^{(1)}_{jk}$, layer 2, and the output $y$.]
A neuron in MLP
[Figure: a neuron receives inputs $v_1, v_2, v_3$ weighted by $w_1, w_2, w_3$, adds a bias $w_0$, and applies an activation function.]
Standard activation functions
biologically inspired: the Heaviside function
$$h(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{otherwise,} \end{cases}$$
whose main issue is that it is not continuous;
the identity, $h(z) = z$, which however yields a linear model if used with one hidden layer: not flexible enough;
the logistic function
$$h(z) = \frac{1}{1 + \exp(-z)};$$
the rectified linear unit (ReLU), $h(z) = \max(0, z)$, another popular activation function (useful to model positive real numbers);
more generally, a sigmoid: a nondecreasing function $h : \mathbb{R} \to \mathbb{R}$ such that
$$\lim_{z \to +\infty} h(z) = 1 \qquad \lim_{z \to -\infty} h(z) = 0.$$
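These activation functions are one-liners in R; a quick sketch to visualize them:
heaviside <- function(z) ifelse(z < 0, 0, 1)
logistic  <- function(z) 1 / (1 + exp(-z))
relu      <- function(z) pmax(0, z)
z <- seq(-4, 4, length.out = 200)
plot(z, logistic(z), type = "l", ylab = "h(z)", ylim = c(0, 2))
lines(z, heaviside(z), lty = 2)
lines(z, relu(z), lty = 3)
legend("topleft", lty = 1:3, legend = c("logistic", "Heaviside", "ReLU"))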
Focus on one-hidden-layer perceptrons
Regression case:
$$f(x) = \sum_{k=1}^{Q} w^{(2)}_k h_k\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0,$$
with $h_k$ a (logistic) sigmoid.
Binary classification case: the same architecture computes
$$\psi(x) = h_0\!\left( \sum_{k=1}^{Q} w^{(2)}_k h_k\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \right),$$
with $h_0$ a logistic sigmoid or the identity, and the decision is
$$f(x) = \begin{cases} 0 & \text{if } \psi(x) < 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Extension to any classification problem in $\{1, \ldots, K-1\}$: straightforward with a multiple-output perceptron (number of output units equal to $K$) and a maximum-probability rule for the decision.
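A minimal R sketch of the forward pass $f(x)$ of a one-hidden-layer perceptron (regression case), with randomly drawn weights just for illustration:
logistic <- function(z) 1 / (1 + exp(-z))
forward <- function(x, W1, w10, w2, w20) {
  # W1: Q x p matrix of first-layer weights; w10: Q first-layer biases
  # w2: Q second-layer weights; w20: second-layer bias
  hidden <- logistic(W1 %*% x + w10)   # h_k(x' w_k^(1) + w_k^(0))
  sum(w2 * hidden) + w20               # sum_k w_k^(2) h_k(...) + w_0^(2)
}
set.seed(3)
p <- 2; Q <- 4
forward(c(0.5, -1), matrix(rnorm(Q * p), Q, p), rnorm(Q), rnorm(Q), rnorm(1))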
Theoretical properties of perceptrons
This section answers two questions:
1 can we approximate any function $g : [0,1]^p \to \mathbb{R}$ arbitrarily well with a perceptron?
2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair $(X, Y)$, is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)
Illustration of the universal approximation property
A simple example: a function to approximate, $g : [0,1] \to \mathbb{R}$, $g(x) = \sin\!\left( \frac{1}{x + 0.1} \right)$, and attempts to approximate it (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer.
Universal property from a theoretical point of view
Set of MLPs with a given size:
$$\mathcal{P}_Q(h) = \left\{ x \in \mathbb{R}^p \mapsto \sum_{k=1}^{Q} w^{(2)}_k h\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \;:\; w^{(2)}_k, w^{(0)}_k \in \mathbb{R},\; w^{(1)}_k \in \mathbb{R}^p \right\}$$
Set of all MLPs: $\mathcal{P}(h) = \cup_{Q \in \mathbb{N}} \mathcal{P}_Q(h)$.
Universal approximation [Pinkus, 1999]
If $h$ is a non-polynomial continuous function then, for any continuous function $g : [0,1]^p \to \mathbb{R}$ and any $\epsilon > 0$, there exists $f \in \mathcal{P}(h)$ such that:
$$\sup_{x \in [0,1]^p} |f(x) - g(x)| \leq \epsilon.$$
Remarks on universal approximation
continuity of the activation function is not required (see [Devroye et al., 1996] for a result with arbitrary sigmoids);
other versions of this property are given in [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different functional spaces for $g$;
none of the spaces $\mathcal{P}_Q(h)$, for a fixed $Q$, has this property;
this result can be used to show that perceptrons are consistent whenever $Q \log(n)/n \xrightarrow{n \to +\infty} 0$ [Farago and Lugosi, 1993, Devroye et al., 1996].
Empirical error minimization
Given i.i.d. observations $(X_i, Y_i)$ of $(X, Y)$, how do we choose the weights $w$?
Standard approach: minimize the empirical $L^2$ risk:
$$\hat{R}_n(w) = \sum_{i=1}^{n} [f_w(X_i) - Y_i]^2,$$
with
$Y_i \in \mathbb{R}$ for the regression case;
$Y_i \in \{0, 1\}$ for the classification case, with the associated decision rule $x \mapsto \mathbb{1}_{\{f_w(x) \geq 1/2\}}$.
But $\hat{R}_n(w)$ is not convex in $w$ ⇒ a general (non-convex) optimization problem.
Optimization with gradient descent
Method: initialize the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ (randomly or with some prior knowledge), then iterate.
Batch approach: for $t = 1, \ldots, T$,
$$w(t+1) = w(t) - \mu(t)\, \nabla_w \hat{R}_n(w(t)).$$
Online (or stochastic) approach: write
$$\hat{R}_n(w) = \sum_{i=1}^{n} \underbrace{[f_w(X_i) - Y_i]^2}_{= E_i(w)}$$
and, for $t = 1, \ldots, T$, randomly pick $i \in \{1, \ldots, n\}$ and update:
$$w(t+1) = w(t) - \mu(t)\, \nabla_w E_i(w(t)).$$
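A minimal R sketch of the batch version, on a one-hidden-layer perceptron for a 1D regression problem; finite-difference gradients are used here only to keep the code short (backpropagation, presented below, computes the same gradient analytically), and convergence is to a local minimum at best:
set.seed(4)
X <- runif(100)
Y <- sin(1 / (X + 0.1)) + rnorm(100, sd = 0.1)
Q <- 5
risk <- function(w) {   # empirical L2 risk of a one-hidden-layer MLP
  W1  <- w[1:Q]                   # first-layer weights (p = 1 here)
  w10 <- w[(Q + 1):(2 * Q)]       # first-layer biases
  w2  <- w[(2 * Q + 1):(3 * Q)]   # second-layer weights
  w20 <- w[3 * Q + 1]             # second-layer bias
  H <- 1 / (1 + exp(-(outer(X, W1) + rep(w10, each = length(X)))))
  mean((drop(H %*% w2) + w20 - Y)^2)
}
num_grad <- function(w, eps = 1e-6)   # central finite-difference gradient
  sapply(seq_along(w), function(j) {
    e <- replace(numeric(length(w)), j, eps)
    (risk(w + e) - risk(w - e)) / (2 * eps)
  })
w <- rnorm(3 * Q + 1, sd = 0.5)                 # w(0): random initialization
for (t in 1:2000) w <- w - 0.1 * num_grad(w)    # w(t+1) = w(t) - mu * grad
risk(w)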
Discussion about practical choices for this approach
the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of $\mu(t)$, but convergence can be slow;
the stochastic version is usually very inefficient but is useful for large datasets ($n$ large);
more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses higher-order derivatives (the BFGS algorithm);
in all cases, the solutions returned are, at best, local minima that strongly depend on the initialization: using more than one initialization state is advised.
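A minimal sketch with nnet (one hidden layer: size is $Q$; linout = TRUE requests a linear output unit for regression), keeping the best of several random initializations as advised:
library(nnet)
set.seed(5)
d <- data.frame(x = runif(100))
d$y <- sin(1 / (d$x + 0.1)) + rnorm(100, sd = 0.1)
fits <- lapply(1:10, function(i)   # 10 random initializations
  nnet(y ~ x, data = d, size = 5, linout = TRUE, maxit = 500, trace = FALSE))
best <- fits[[which.min(sapply(fits, function(f) f$value))]]   # lowest fitted risk
mean((predict(best, d) - d$y)^2)   # training MSE of the retained fit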
Gradient backpropagation method [Rumelhart and Mc Clelland, 1986]
The gradient backpropagation principle is used to easily calculate gradients in perceptrons (and in other types of neural networks). This way, stochastic gradient descent alternates:
a forward step, which aims at calculating the outputs for all observations $X_i$ given a value of the weights $w$;
a backward step, in which the gradient backpropagation principle is used to obtain the gradient for the current weights $w$.
Initialization and stopping of the training algorithm
1 How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$. In the R package nnet, weights are sampled uniformly in $[-0.5, 0.5]$, or in $\left[ -\frac{1}{\max_i X_i^{(j)}}, \frac{1}{\max_i X_i^{(j)}} \right]$ if $X^{(j)}$ is large.
2 When to stop the algorithm (gradient descent or alike)? Standard choices:
a bounded number of iterations $T$;
a target value of the error $\hat{R}_n(w)$;
a target value of the evolution $\hat{R}_n(w(t)) - \hat{R}_n(w(t+1))$.
In the R package nnet, a combination of the three criteria is used and is tunable.
Strategies to avoid overfitting
properly tune $Q$ with a CV or bootstrap estimation of the generalization ability of the method;
early stopping: for $Q$ large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase;
weight decay: for $Q$ large enough, penalize the empirical risk with a function of the weights, e.g., $\hat{R}_n(w) + \lambda\, w^T w$;
noise injection: modify the input data with random noise during the training.
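A minimal sketch of the first and third strategies with nnet and the tune wrappers from e1071 (assuming, as for tune.svm above, that tune.nnet accepts the formula interface; d is the data frame from the previous sketch): $Q$ (size) and the weight-decay penalty $\lambda$ (decay) are selected by cross-validation:
library(e1071)
library(nnet)
set.seed(6)
tuned <- tune.nnet(y ~ x, data = d, size = c(2, 5, 10),
                   decay = c(0, 0.01, 0.1),   # weight-decay penalty lambda
                   linout = TRUE, trace = FALSE,
                   tunecontrol = tune.control(sampling = "cross", cross = 10))
summary(tuned)
tuned$best.parameters   # retained (size, decay) pair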
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.
Breiman, L., Friedman, J., Olsen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory for Pattern Recognition. Springer-Verlag, New York, NY, USA.
Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146-1151.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069-1072.
Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115-133.
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, MA, USA.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768-791.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467-477.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA.