The document compares different machine learning methods for predicting N2O fluxes and N leaching from agricultural soils. It describes a dataset from a biogeochemical model containing 19,000 observations of 11 input variables and 2 output variables for corn cultivation across Europe. Several regression methods are tested, including linear models, multilayer perceptrons, support vector machines, random forests, and spline-based approaches. The goal is to create metamodels that can more quickly estimate outputs from the biogeochemical model inputs for integrated assessment modeling and scenario analysis.
A comparison of three learning methods to predict N2O fluxes and N leaching (tuxette)
The document compares three machine learning methods - multi-layer perceptrons (neural networks), support vector machines (SVMs), and random forests - for predicting N2O fluxes and N leaching from various data inputs. It provides background on machine learning for regression problems, describes the three methods and how they are trained and tuned, and discusses the methodology and results of a study comparing the performance of these methods.
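As a rough illustration of such a benchmark (not the study's actual data or settings), the three families of regressors can be compared on a synthetic regression task with scikit-learn; the dataset size and hyperparameters below are placeholders:

```python
# Illustrative sketch: comparing the three regressor families on synthetic data.
# Hyperparameters and data are placeholders, not those used in the study.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=11, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "MLP": MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
    "SVM": SVR(kernel="rbf", C=10.0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
}
errors = {name: mean_squared_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(errors)
```

In practice each method's hyperparameters (hidden-layer sizes, C and kernel width, number of trees) would be tuned, e.g. by cross-validation, before comparing test errors.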
This document summarizes and compares different graph-based semi-supervised learning methods. It presents an optimization framework that generalizes two existing approaches - the standard Laplacian method and normalized Laplacian method. It also introduces a new PageRank-based method. The framework provides classification functions in closed form and allows tuning two parameters. Random walk interpretations explain the methods and show their relationships to personalized PageRank. Examples are presented to compare the methods' performance.
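A minimal sketch of the kind of closed-form classification function such frameworks provide, assuming the standard (unnormalized) Laplacian variant and a toy four-node chain graph; the regularization parameter alpha is illustrative:

```python
# Minimal sketch of closed-form Laplacian label propagation on a toy graph.
import numpy as np

# Adjacency of a 4-node chain; nodes 0 and 3 are labeled (+1, -1).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
y = np.array([1.0, 0.0, 0.0, -1.0])     # 0 marks unlabeled nodes
alpha = 1.0                              # regularization parameter

f = np.linalg.solve(np.eye(4) + alpha * L, y)  # closed-form scores
print(f)
```

The scores diffuse the labels along the chain: nodes near the +1 seed get positive scores, nodes near the -1 seed get negative ones, symmetrically.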
Abstract : Motivated by the recovery and prediction of electricity consumption time series, we extend Nonnegative Matrix Factorization to take into account external features as side information. We consider general linear measurement settings, and propose a framework which models non-linear relationships between external features and the response variable. We extend previous theoretical results to obtain a sufficient condition on the identifiability of NMF with side information. Based on the classical Hierarchical Alternating Least Squares (HALS) algorithm, we propose a new algorithm (HALSX, or Hierarchical Alternating Least Squares with eXogeneous variables) which estimates NMF in this setting. The algorithm is validated on both simulated and real electricity consumption datasets as well as a recommendation system dataset, to show its performance in matrix recovery and prediction for new rows and columns.
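The classical HALS updates that HALSX builds on can be sketched as follows; this is plain NMF on random data, without the side-information extension that is the abstract's contribution:

```python
# Sketch of classical HALS for NMF: cyclically update one column of W and
# the matching row of H in closed form, with nonnegativity enforced.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 15))                 # data matrix to factorize
r = 4                                    # factorization rank
W, H = rng.random((20, r)), rng.random((r, 15))

for _ in range(200):
    for k in range(r):                   # HALS: one factor at a time
        R = V - W @ H + np.outer(W[:, k], H[k])   # residual excluding factor k
        W[:, k] = np.maximum(R @ H[k] / (H[k] @ H[k] + 1e-12), 0)
        H[k] = np.maximum(W[:, k] @ R / (W[:, k] @ W[:, k] + 1e-12), 0)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative error: {err:.3f}")
```

HALSX replaces the free factors with (possibly non-linear) functions of the exogenous features, which is what enables prediction for new rows and columns.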
New Surrogate-Assisted Search Control and Restart Strategies for CMA-ES (Ilya Loshchilov)
This document discusses surrogate-assisted CMA-ES algorithms. It begins with an introduction to CMA-ES and support vector machines. It then presents an algorithm called self-adaptive surrogate-assisted CMA-ES that uses a rank-based SVM as a surrogate model within CMA-ES. The algorithm learns the surrogate model from the rankings of solutions and directly optimizes the surrogate for a number of generations before evaluating on the true objective function. Results show the algorithm can provide speedups over directly optimizing the true objective.
Spatial Point Processes and Their Applications in Epidemiology (Lilac Liu Xu)
Spatial statistics can be used in epidemiology to analyze spatial point patterns of disease cases and controls. Common models include homogeneous and inhomogeneous Poisson processes, which describe patterns of complete spatial randomness and non-random clustering or dispersion. Descriptive statistics like the first-order intensity function λ(s) and second-order K-function can quantify clustering in a point pattern. A case-control study compares these statistics between case and control patterns to test for non-random spatial variations in disease risk. Monte Carlo simulations are used to calculate p-values when testing hypotheses about relative risk and clustering.
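The Monte Carlo idea can be illustrated with a toy random-labelling test: under the null hypothesis, case and control labels are exchangeable, so recomputing a clustering statistic over random relabelings yields a p-value. The points, sample sizes, and statistic below (mean nearest-neighbour distance rather than the K-function) are illustrative only:

```python
# Toy Monte Carlo labelling test: clustered cases should have a smaller
# mean nearest-neighbour distance than randomly relabeled point sets.
import numpy as np

rng = np.random.default_rng(1)
controls = rng.uniform(0, 1, size=(60, 2))       # spatially random controls
cases = rng.normal(0.5, 0.05, size=(30, 2))      # tightly clustered cases
pts = np.vstack([cases, controls])
labels = np.array([1] * 30 + [0] * 60)

def mean_nn_dist(p):
    d = np.linalg.norm(p[:, None] - p[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

obs = mean_nn_dist(pts[labels == 1])
sims = [mean_nn_dist(pts[rng.permutation(labels) == 1]) for _ in range(199)]
p_value = (1 + sum(s <= obs for s in sims)) / 200  # small NN dist = clustered
print(f"p = {p_value:.3f}")
```

The `+1` in numerator and denominator is the usual correction that counts the observed pattern among the simulations.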
A short and naive introduction to using network in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
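A small sketch of the second approach, a linear model regularized directly with the Laplacian; the data, graph, and penalty weight below are toy choices, not from the document:

```python
# Sketch of a network-regularized linear model: minimize
# ||y - X b||^2 + lam * b' L b, which has a closed-form solution and
# encourages connected features to have similar coefficients.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 1.0, 0.0, 0.0])   # connected features share effects
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Toy graph: feature 0 -- feature 1, feature 2 -- feature 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
lam = 5.0

beta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)  # closed-form solution
print(beta)
```

Here the penalty b' L b equals the sum of squared differences (b_i - b_j)^2 over graph edges, which is how the network information enters the fit.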
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ... (IJORCS)
This paper presents a new algorithm for solving large-scale global optimization problems based on a hybridization of simulated annealing and the Nelder-Mead algorithm. The new algorithm is called the simulated Nelder-Mead algorithm with random variables updating (SNMRVU). SNMRVU starts with a randomly generated initial solution, which is then divided into partitions. A neighborhood zone is generated, a random number of partitions is selected, and the variable-updating process starts in order to generate trial neighbor solutions. This process helps the SNMRVU algorithm explore the region around the current iterate. The Nelder-Mead algorithm is used in the final stage to improve the best solution found so far and accelerate convergence. The performance of the SNMRVU algorithm is evaluated using 27 scalable benchmark functions and compared with four algorithms. The results show that the SNMRVU algorithm is promising and produces high-quality solutions at low computational cost.
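The hybrid idea (global exploration by annealing, then local refinement) can be sketched as follows; this is a generic simulated-annealing loop followed by a Nelder-Mead polish, not the SNMRVU algorithm with its partition-based variable updating:

```python
# Sketch of the hybrid pattern on a simple sphere objective: simulated
# annealing explores, then Nelder-Mead refines the best solution found.
import numpy as np
from scipy.optimize import minimize

def sphere(x):
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=10)           # random initial solution
best, T = x.copy(), 1.0

for _ in range(2000):                     # annealing stage: explore
    cand = x + rng.normal(0, 0.5, size=10)
    delta = sphere(cand) - sphere(x)
    if delta < 0 or rng.random() < np.exp(-delta / T):
        x = cand                          # accept better or, rarely, worse moves
    if sphere(x) < sphere(best):
        best = x.copy()
    T *= 0.995                            # geometric cooling schedule

res = minimize(sphere, best, method="Nelder-Mead")  # final local refinement
print(sphere(best), res.fun)
```

Because Nelder-Mead starts from the annealing stage's best point, the refined value can only match or improve it.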
The document proposes two algorithms for dynamically summarizing large graphs over time: (1) kC assigns each node to a supernode at every time step, ignoring previous cluster assignments; (2) μC clusters nodes into microclusters, maintains their statistics over time, and periodically groups the microclusters into supernodes, allowing it to track changes over time more faithfully. The algorithms are evaluated on Twitter, network-flow, and synthetic datasets and are shown to scale to large graphs while maintaining low reconstruction error.
Splash: User-friendly Programming Interface for Parallelizing Stochastic Lear... (Turi, Inc.)
The document describes Splash, a programming interface and execution engine for parallelizing stochastic algorithms. Splash allows users to write single-threaded stochastic algorithms and handles parallelization automatically without requiring the user to manage communication, data partitioning, or other parallelization details. The execution engine runs the algorithm by proposing different levels of parallelism and combining partial updates from cores processing subsets of data to obtain a global update in each iteration. This approach avoids communication bottlenecks that plague naive parallelization attempts for stochastic algorithms.
Additive Smoothing for Relevance-Based Language Modelling of Recommender Syst... (Daniel Valcarce)
This document summarizes a presentation on additive smoothing for relevance-based language modelling of recommender systems. It discusses using pseudo-relevance feedback and relevance models for collaborative filtering recommendations. Specifically, it examines how different collection-based smoothing techniques like Dirichlet priors, Jelinek-Mercer, and absolute discounting can demote the desired IDF effect, which promotes less popular items. The document proposes using additive smoothing, which does not demote the IDF effect. Experiments on movie recommendation datasets show additive smoothing achieves better accuracy, diversity, and novelty than other smoothing methods.
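Additive smoothing itself is simple to state: add a pseudo-count α to every item before normalizing, so unseen items keep nonzero probability without popular items being further boosted. A toy sketch (the counts, catalogue, and α are made up):

```python
# Toy additive (Laplace) smoothing for an item-probability estimate:
# p(item) = (count(item) + alpha) / (N + alpha * |catalogue|).
from collections import Counter

ratings = ["a", "a", "a", "b"]           # items observed for one user profile
vocab = ["a", "b", "c"]                  # full item catalogue
alpha = 0.5                              # smoothing pseudo-count

counts = Counter(ratings)

def p_additive(item):
    return (counts[item] + alpha) / (len(ratings) + alpha * len(vocab))

probs = {i: p_additive(i) for i in vocab}
print(probs)
```

Unlike Dirichlet or Jelinek-Mercer smoothing, the added mass here is uniform over the catalogue rather than proportional to collection popularity, which is why the IDF-like promotion of less popular items survives.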
The document discusses estimation of multi-Granger network causal models from time series data. It proposes a joint modeling approach to estimate vector autoregressive (VAR) models for multiple time series datasets simultaneously. The key steps are:
1. Estimate the inverse covariance matrices for each dataset using a factor model approach.
2. Use the estimated inverse covariance matrices in a generalized fused lasso optimization to jointly estimate the VAR coefficient matrices for each dataset.
Simulation results show the joint modeling approach improves estimation of the VAR coefficients and reduces forecasting error compared to estimating the models separately, especially when the number of time points is small. The factor modeling approach also provides a better estimate of the inverse covariance than using the empirical estimate.
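As background for the steps above, the per-dataset building block is estimation of a VAR model; a minimal ordinary least-squares VAR(1) sketch on simulated data (the factor-model covariance step and the joint fused-lasso step are not shown):

```python
# Minimal OLS estimation of a VAR(1) model x_t = A x_{t-1} + noise:
# stack lagged observations and solve the least-squares normal equations.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2], [0.0, 0.4]])   # stable VAR coefficient matrix
T = 500
X = np.zeros((T, 2))
for t in range(1, T):
    X[t] = X[t - 1] @ A_true.T + 0.1 * rng.standard_normal(2)

# Regress x_t on x_{t-1}: A_hat' = (Z'Z)^{-1} Z'Y with Z = X[:-1], Y = X[1:].
Z, Y = X[:-1], X[1:]
A_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y).T
print(A_hat)
```

The joint approach described above replaces this per-dataset least squares with a fused-lasso objective that shrinks the coefficient matrices of related datasets toward each other.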
Regression and Classification: An Artificial Neural Network Approach (Khulna University)
This presentation introduces artificial neural networks (ANN) as a technique for regression and classification problems. It provides historical context on the development of ANN, describes common network structures and activation functions, and the backpropagation algorithm for training networks. Experimental results on 7 datasets show ANN outperformed other methods for both regression and classification across a variety of problem types and data characteristics. Limitations of ANN and areas for further research are also discussed.
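A from-scratch sketch of backpropagation for a one-hidden-layer network on the XOR toy problem; the architecture, learning rate, and iteration count are illustrative, not from the presentation:

```python
# Backpropagation sketch: one hidden sigmoid layer trained on XOR by
# gradient descent on the squared error.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
forward = lambda: sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
mse = lambda p: float(np.mean((p - y) ** 2))
lr = 0.5

mse_init = mse(forward())
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)      # delta at output (squared error)
    d_h = (d_out @ W2.T) * h * (1 - h)       # delta backpropagated to hidden
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;  b1 -= lr * d_h.sum(axis=0)
mse_final = mse(forward())
print(mse_init, mse_final)
```

Each iteration computes the forward pass, propagates the error deltas backward through the sigmoid derivatives, and takes a gradient step on all weights and biases.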
This chapter introduces the foundational principles of supervised learning, with particular emphasis on regression and classification techniques and on how these methodologies are applied in real-world scenarios. By the end of the chapter, learners will have a solid understanding of the core principles together with practical insight into the applications of supervised learning, equipping them to use this paradigm effectively within the broader field of machine learning.
This document provides an overview and plan for STATA tutorials covering topics like data management, regression analysis, and specialized models. The tutorials will be interactive, demonstrating STATA commands through example do-files. The first two lectures cover basic STATA usage and simple regressions. Subsequent lectures include more advanced topics like robust standard errors, instrumental variables models, and limited dependent variable models for binary and censored outcomes. Marginal effects are emphasized as more informative than coefficients for nonlinear models.
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa... (Rafael Nogueras)
This document discusses self-sampling strategies for multimemetic algorithms (MMAs) in unstable computational environments subject to churn. It proposes using probabilistic models to sample new individuals when populations need to be enlarged due to node failures. Experimental results show the bivariate model is superior for high churn, maintaining diversity and convergence better than random strategies. Future work aims to extend these self-sampling strategies to dynamic network topologies and more complex probabilistic models.
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING (VisionGEOMATIQUE2014)
The document discusses using machine learning techniques for satellite-guided water quality monitoring. It covers using machine learning algorithms to automatically develop empirical models from multimodal satellite and field data sets. Machine learning can help construct nonlinear mappings between satellite measurements and water quality products and optimize in-situ data collection through mission planning. Experimental results are shown applying these techniques to map water quality metrics like chlorophyll-a and total suspended solids using MODIS satellite images of Lake Winnipeg.
Simulation-based Optimization of a Real-world Travelling Salesman Problem Usi... (CSCJournals)
This paper presents a real-world case study of optimizing waste collection in Sweden. The problem, involving approximately 17,000 garbage bins served by three bin lorries, is approached as a travelling salesman problem and solved using simulation-based optimization and an evolutionary algorithm. To improve the performance of the evolutionary algorithm, it is enhanced with a repair function that adjusts its genome values so that shorter routes are found more quickly. The algorithm is tested using two crossover operators, i.e., the order crossover and the heuristic crossover, combined with different mutation rates. The results indicate that the order crossover is superior to the heuristic crossover, but that the driving force of the search process is the mutation operator combined with the repair function.
This document provides an overview of machine learning techniques for classification and regression, including decision trees, linear models, and support vector machines. It discusses key concepts like overfitting, regularization, and model selection. For decision trees, it explains how they work by binary splitting of space, common splitting criteria like entropy and Gini impurity, and how trees are built using a greedy optimization approach. Linear models like logistic regression and support vector machines are covered, along with techniques like kernels, regularization, and stochastic optimization. The importance of testing on a holdout set to avoid overfitting is emphasized.
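The splitting criteria mentioned above can be made concrete with a small sketch of entropy, Gini impurity, and a greedy threshold search on a single feature; the data are made up for illustration:

```python
# Entropy and Gini impurity, plus a greedy search for the best split
# threshold on one feature (the core step of decision-tree induction).
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - float(np.sum(p ** 2))

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

def best_split(x, y, criterion):
    best_t, best_gain = None, -1.0
    for t in (x[:-1] + x[1:]) / 2:               # candidate thresholds
        left, right = y[x <= t], y[x > t]
        child = (len(left) * criterion(left)
                 + len(right) * criterion(right)) / len(y)
        gain = criterion(y) - child              # impurity decrease
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_split(x, y, gini), best_split(x, y, entropy))
```

Both criteria pick the threshold between the two class clusters, since it yields pure children and therefore the maximum impurity decrease.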
Knowledge-based generalization for metabolic models (Anna Zhukova)
Genome-scale metabolic models describe the relationships between thousands of reactions and biochemical molecules and are used to improve our understanding of an organism's metabolism. They have found applications in the pharmaceutical, chemical, and bioremediation industries.
The complexity of metabolic models hampers many tasks that are important during the process of model inference, such as model comparison, analysis, curation and refinement by human experts. The abundance of details in large-scale networks can mask errors and important organism-specific adaptations. It is therefore important to find the right levels of abstraction that are comfortable for human experts. These abstract levels should highlight the essential model structure and the divergences from it, such as alternative paths or missing reactions, while hiding inessential details.
To address this issue, we defined a knowledge-based generalization that allows for production of higher-level abstract views of metabolic network models. We developed a theoretical method that groups similar metabolites and reactions based on the network structure and the knowledge extracted from metabolite ontologies, and then compresses the network based on this grouping. We implemented our method as a Python library, which is available for download from metamogen.gforge.inria.fr.
To validate our method, we applied it to 1,286 metabolic models from the Path2Model project and showed that it helps to detect organism- and domain-specific adaptations, as well as to compare models.
Based on discussions with users about their ways of navigating metabolic networks, we defined a 3-level representation of metabolic networks: the full-model level, the generalized level, and the compartment level. We combined our model generalization method with the zooming user interface (ZUI) paradigm and developed Mimoza, a user-centric tool for zoomable navigation and knowledge-based exploration of metabolic networks that produces this 3-level representation. Mimoza is available both as an on-line tool and for download at mimoza.bordeaux.inria.fr.
LNCS 5050 - Bilevel Optimization and Machine Learning (butest)
This document discusses using bilevel optimization and machine learning techniques to improve model selection in machine learning problems. It proposes framing machine learning model selection as a bilevel optimization problem, where the inner level problems involve optimizing models on training data and the outer level problem selects hyperparameters to minimize error on test data. This bilevel framing allows for systematic optimization of hyperparameters and enables novel machine learning approaches. The document illustrates the approach for support vector regression, formulating model selection as a Stackelberg game and solving the resulting mathematical program with equilibrium constraints.
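A toy stand-in for the two levels (a plain grid search rather than the paper's Stackelberg/equilibrium-constraint formulation): the inner problem fits ridge regression for a fixed hyperparameter, and the outer problem picks the hyperparameter that minimizes held-out error. Data, grid, and model are illustrative:

```python
# Two-level model selection sketch: inner level fits a model given alpha,
# outer level selects the alpha minimizing validation error.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 10))
y = X[:, 0] + 0.5 * rng.standard_normal(120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

def inner(alpha):                         # inner level: training problem
    return Ridge(alpha=alpha).fit(X_tr, y_tr)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]      # outer level: hyperparameter choice
val_err = {a: mean_squared_error(y_va, inner(a).predict(X_va)) for a in grid}
best_alpha = min(val_err, key=val_err.get)
print(best_alpha, val_err[best_alpha])
```

The bilevel formulation replaces this exhaustive outer search with a continuous optimization over hyperparameters, using the inner problem's optimality conditions as constraints.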
Sampling-Based Planning Algorithms for Multi-Objective Missions (Md Mahbubur Rahman)
Multi-objective path planning is in increasing demand in military missions, rescue operations, and construction job sites. There is a lack of robotic path-planning algorithms that trade off multiple objectives, and commonly no single solution optimizes all the objective functions. Here we modify the RRT and RRT* sampling-based algorithms.
This document summarizes a talk on inference on treatment effects after model selection. It discusses challenges with inferring treatment effects after refitting a model selected via a procedure like lasso. Specifically, refitting can lead to bias due to overfitting or underfitting the model. The document proposes using repeated data splitting to remove the overfitting bias. In each split, part of the data is used for model selection and the other part for estimating treatment effects without overfitting bias. This approach reduces bias compared to simply refitting the full model.
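The split-sample idea can be sketched as follows: select variables with the lasso on one half of the data, then estimate the treatment effect by OLS on the other half, so the selection step cannot overfit the estimation sample. The data are simulated with a true treatment effect of 2.0; all names and tuning values are illustrative:

```python
# Data-splitting sketch for post-selection inference: lasso selects on
# one half, OLS estimates the treatment effect on the held-out half.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(n)  # treatment = col 0

half = n // 2
sel = Lasso(alpha=0.1).fit(X[:half], y[:half])        # selection half
keep = np.flatnonzero(sel.coef_ != 0)
keep = np.union1d(keep, [0])                          # always keep treatment

ols = LinearRegression().fit(X[half:][:, keep], y[half:])  # estimation half
effect = ols.coef_[list(keep).index(0)]
print(f"estimated treatment effect: {effect:.2f}")
```

Repeating the split many times and aggregating the estimates, as the talk proposes, reduces the variance introduced by any single random split.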
This document presents a graph theoretic approach to optimizing the design of a software defined radio (SDR) system capable of supporting multiple standards. It describes representing an SDR system as a directed hypergraph, with blocks as vertices and implementation dependencies as hyperarcs. A cost function is defined based on building cost and computational cost of blocks. The optimization problem is to select a set of common operators to implement the system at minimum cost. This is proven to be an NP-problem if the number of levels in the graph is bounded by a constant. An algorithm called Minimum Cost Design is proposed to solve the problem using graph theory.
Roots at the top and leaves at the bottom: trees in maths (tuxette)
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Kernel methods for the integration of heterogeneous data (tuxette)
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
More Related Content
Similar to A comparison of learning methods to predict N2O fluxes and N leaching
A short and naive introduction to using network in prediction modelstuxette
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
Hybrid Simulated Annealing and Nelder-Mead Algorithm for Solving Large-Scale ...IJORCS
This paper presents a new algorithm for solving large scale global optimization problems based on hybridization of simulated annealing and Nelder-Mead algorithm. The new algorithm is called simulated Nelder-Mead algorithm with random variables updating (SNMRVU). SNMRVU starts with an initial solution, which is generated randomly and then the solution is divided into partitions. The neighborhood zone is generated, random number of partitions are selected and variables updating process is starting in order to generate a trail neighbor solutions. This process helps the SNMRVU algorithm to explore the region around a current iterate solution. The Nelder- Mead algorithm is used in the final stage in order to improve the best solution found so far and accelerates the convergence in the final stage. The performance of the SNMRVU algorithm is evaluated using 27 scalable benchmark functions and compared with four algorithms. The results show that the SNMRVU algorithm is promising and produces high quality solutions with low computational costs.
The document proposes two algorithms for dynamically summarizing large graphs over time: (1) kC clusters each node to supernodes at every time step, ignoring previous cluster assignments; (2) μC clusters nodes into microclusters, maintains statistics over time, and periodically clusters microclusters into supernodes, allowing it to better track changes over time. The algorithms are evaluated on Twitter, network flow, and synthetic datasets and are shown to scale to large graphs while maintaining low reconstruction error.
Splash: User-friendly Programming Interface for Parallelizing Stochastic Lear...Turi, Inc.
The document describes Splash, a programming interface and execution engine for parallelizing stochastic algorithms. Splash allows users to write single-threaded stochastic algorithms and handles parallelization automatically without requiring the user to manage communication, data partitioning, or other parallelization details. The execution engine runs the algorithm by proposing different levels of parallelism and combining partial updates from cores processing subsets of data to obtain a global update in each iteration. This approach avoids communication bottlenecks that plague naive parallelization attempts for stochastic algorithms.
Additive Smoothing for Relevance-Based Language Modelling of Recommender Syst...Daniel Valcarce
This document summarizes a presentation on additive smoothing for relevance-based language modelling of recommender systems. It discusses using pseudo-relevance feedback and relevance models for collaborative filtering recommendations. Specifically, it examines how different collection-based smoothing techniques like Dirichlet priors, Jelinek-Mercer, and absolute discounting can demote the desired IDF effect, which promotes less popular items. The document proposes using additive smoothing, which does not demote the IDF effect. Experiments on movie recommendation datasets show additive smoothing achieves better accuracy, diversity, and novelty than other smoothing methods.
The document discusses estimation of multi-Granger network causal models from time series data. It proposes a joint modeling approach to estimate vector autoregressive (VAR) models for multiple time series datasets simultaneously. The key steps are:
1. Estimate the inverse covariance matrices for each dataset using a factor model approach.
2. Use the estimated inverse covariance matrices in a generalized fused lasso optimization to jointly estimate the VAR coefficient matrices for each dataset.
Simulation results show the joint modeling approach improves estimation of the VAR coefficients and reduces forecasting error compared to estimating the models separately, especially when the number of time points is small. The factor modeling approach also provides a better estimate of the inverse covariance than using the empirical estimate.
Regression and Classification: An Artificial Neural Network ApproachKhulna University
This presentation introduces artificial neural networks (ANN) as a technique for regression and classification problems. It provides historical context on the development of ANN, describes common network structures and activation functions, and the backpropagation algorithm for training networks. Experimental results on 7 datasets show ANN outperformed other methods for both regression and classification across a variety of problem types and data characteristics. Limitations of ANN and areas for further research are also discussed.
In this chapter, our goal is to introduce the foundational principles of supervised learning. As we progress, we place particular emphasis on both regression and classification techniques, offering learners a more comprehensive perspective on the practical application of these methodologies in real-world scenarios. By the end of this chapter, learners will not only possess a robust understanding of the core principles but will also be armed with valuable insights into the tangible applications of supervised learning. This knowledge empowers them to skillfully navigate and leverage the full potential of this influential paradigm within the vast expanse of machine learning.
This document provides an overview and plan for STATA tutorials covering topics like data management, regression analysis, and specialized models. The tutorials will be interactive, demonstrating STATA commands through example do-files. The first two lectures cover basic STATA usage and simple regressions. Subsequent lectures include more advanced topics like robust standard errors, instrumental variables models, and limited dependent variable models for binary and censored outcomes. Marginal effects are emphasized as more informative than coefficients for nonlinear models.
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa...Rafael Nogueras
This document discusses self-sampling strategies for multimemetic algorithms (MMAs) in unstable computational environments subject to churn. It proposes using probabilistic models to sample new individuals when populations need to be enlarged due to node failures. Experimental results show the bivariate model is superior for high churn, maintaining diversity and convergence better than random strategies. Future work aims to extend these self-sampling strategies to dynamic network topologies and more complex probabilistic models.
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORINGVisionGEOMATIQUE2014
The document discusses using machine learning techniques for satellite-guided water quality monitoring. It covers using machine learning algorithms to automatically develop empirical models from multimodal satellite and field data sets. Machine learning can help construct nonlinear mappings between satellite measurements and water quality products and optimize in-situ data collection through mission planning. Experimental results are shown applying these techniques to map water quality metrics like chlorophyll-a and total suspended solids using MODIS satellite images of Lake Winnipeg.
Simulation-based Optimization of a Real-world Travelling Salesman Problem Usi...CSCJournals
This paper presents a real-world case study of optimizing waste collection in Sweden. The problem, involving approximately 17,000 garbage bins served by three bin lorries, is approached as a travelling salesman problem and solved using simulation-based optimization and an evolutionary algorithm. To improve the performance of the evolutionary algorithm, it is enhanced with a repair function that adjusts its genome values so that shorter routes are found more quickly. The algorithm is tested using two crossover operators, i.e., the order crossover and heuristic crossover, combined with different mutation rates. The results indicate that the order crossover is superior to the heuristics crossover, but that the driving force of the search process is the mutation operator combined with the repair function.
This document provides an overview of machine learning techniques for classification and regression, including decision trees, linear models, and support vector machines. It discusses key concepts like overfitting, regularization, and model selection. For decision trees, it explains how they work by binary splitting of space, common splitting criteria like entropy and Gini impurity, and how trees are built using a greedy optimization approach. Linear models like logistic regression and support vector machines are covered, along with techniques like kernels, regularization, and stochastic optimization. The importance of testing on a holdout set to avoid overfitting is emphasized.
Knowledge-based generalization for metabolic modelsAnna Zhukova
Genome-scale metabolic models describe the relationships between thousands of reactions and biochemical molecules, and are used to improve our understanding of an organism's metabolism. They have found applications in the pharmaceutical, chemical and bioremediation industries.
The complexity of metabolic models hampers many tasks that are important during the process of model inference, such as model comparison, analysis, curation and refinement by human experts. The abundance of details in large-scale networks can mask errors and important organism-specific adaptations. It is therefore important to find the right levels of abstraction that are comfortable for human experts. These abstract levels should highlight the essential model structure and the divergences from it, such as alternative paths or missing reactions, while hiding inessential details.
To address this issue, we defined a knowledge-based generalization that allows for the production of higher-level abstract views of metabolic network models. We developed a theoretical method that groups similar metabolites and reactions based on the network structure and the knowledge extracted from metabolite ontologies, and then compresses the network based on this grouping. We implemented our method as a Python library, which is available for download from metamogen.gforge.inria.fr.
To validate our method we applied it to 1,286 metabolic models from the Path2Model project, and showed that it helps to detect organism- and domain-specific adaptations, as well as to compare models.
Based on discussions with users about their ways of navigation in metabolic networks, we defined a 3-level representation of metabolic networks: the full-model level, the generalized level, the compartment level. We combined our model generalization method with the zooming user interface (ZUI) paradigm and developed Mimoza, a user-centric tool for zoomable navigation and knowledge-based exploration of metabolic networks that produces this 3-level representation. Mimoza is available both as an on-line tool and for download at mimoza.bordeaux.inria.fr.
LNCS 5050 - Bilevel Optimization and Machine Learningbutest
This document discusses using bilevel optimization and machine learning techniques to improve model selection in machine learning problems. It proposes framing machine learning model selection as a bilevel optimization problem, where the inner level problems involve optimizing models on training data and the outer level problem selects hyperparameters to minimize error on test data. This bilevel framing allows for systematic optimization of hyperparameters and enables novel machine learning approaches. The document illustrates the approach for support vector regression, formulating model selection as a Stackelberg game and solving the resulting mathematical program with equilibrium constraints.
Sampling-Based Planning Algorithms for Multi-Objective MissionsMd Mahbubur Rahman
Multi-objective path planning is in increasing demand in military missions, rescue operations and construction job-sites. There is a lack of robotic path planning algorithms that trade off multiple objectives, and commonly no single solution optimizes all the objective functions. Here we modify the RRT and RRT* sampling-based algorithms.
This document summarizes a talk on inference on treatment effects after model selection. It discusses challenges with inferring treatment effects after refitting a model selected via a procedure like lasso. Specifically, refitting can lead to bias due to overfitting or underfitting the model. The document proposes using repeated data splitting to remove the overfitting bias. In each split, part of the data is used for model selection and the other part for estimating treatment effects without overfitting bias. This approach reduces bias compared to simply refitting the full model.
This document presents a graph theoretic approach to optimizing the design of a software defined radio (SDR) system capable of supporting multiple standards. It describes representing an SDR system as a directed hypergraph, with blocks as vertices and implementation dependencies as hyperarcs. A cost function is defined based on building cost and computational cost of blocks. The optimization problem is to select a set of common operators to implement the system at minimum cost. This is proven to be an NP-problem if the number of levels in the graph is bounded by a constant. An algorithm called Minimum Cost Design is proposed to solve the problem using graph theory.
Similar to A comparison of learning methods to predict N2O fluxes and N leaching
Racines en haut et feuilles en bas : les arbres en mathstuxette
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Méthodes à noyaux pour l’intégration de données hétérogènestuxette
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
Méthodologies d'intégration de données omiquestuxette
This document summarizes a presentation on multi-omics data integration methods given by Nathalie Vialaneix on December 13, 2023. The presentation discusses different types of omics data that can be integrated, both vertically across different levels of omics data on the same samples and horizontally across similar types of omics data on different samples. It also discusses different analysis approaches that can be taken, including supervised and unsupervised methods. The rest of the presentation focuses on unsupervised transformation-based integration methods using kernels.
The document discusses current and future work on analyzing Hi-C data and differential analysis of Hi-C matrices. It describes a clustering method developed to partition chromosomes based on Hi-C matrix similarity. It also introduces a new method called treediff for differential analysis of Hi-C data that calculates the distance between hierarchical clusterings. Current work includes reviewing differential analysis methods, investigating differential subtrees with multiple testing control, and inferring chromatin interaction networks.
Can deep learning learn chromatin structure from sequence?tuxette
This document discusses a deep learning model called ORCA that can predict chromatin structure from DNA sequence. The model uses a neural network with an encoder to extract features from sequence and a decoder to predict Hi-C matrices. It was trained on Hi-C data from multiple cell types and can predict interactions between regions at various resolutions. The model accurately captures features like CTCF-mediated loops and can predict effects of structural variants on chromatin structure. It allows for in silico mutagenesis to study how mutations may alter 3D genome organization.
Multi-omics data integration methods: kernel and other machine learning appro...tuxette
The document discusses multi-omics data integration methods, particularly kernel methods. It describes how kernel methods transform data into similarity matrices between samples rather than relying on variable space. Multiple kernel integration approaches are presented that combine multiple similarity matrices into a consensus kernel in an unsupervised manner, such as through a STATIS-like framework that maximizes the similarity between kernels. Examples of applications to datasets from the TARA Oceans expedition are given.
This document provides an overview of the MetaboWean and Idefics projects. MetaboWean aims to study the co-evolution of gut microbiota and epithelium during suckling-to-weaning transition in rabbits, using metabolomics, metagenomics, and single-cell RNA sequencing data. Idefics integrates multiple omics datasets from human skin samples to understand relationships between microorganisms and molecules and how they are structured in patient groups. The datasets include metagenomics, metabolomics, and proteomics from host and microbiota.
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ...tuxette
ASTERICS is an interactive and integrative data analysis tool for omics data. It uses Rserve and PyRserve with Flask and Vue.js in a Docker container to integrate omics data. The backend uses Rserve and PyRserve with Flask on the server side, while the frontend uses Vue.js. This architecture was chosen for its open source and light design. Data communication between Rserve and PyRserve is limited, requiring an object database. ASTERICS is deployed using three Docker containers for R, Python, and
Apprentissage pour la biologie moléculaire et l’analyse de données omiquestuxette
This document summarizes a scientific presentation about molecular biology and omics data analysis. The presentation covers topics related to analyzing large omics datasets using methods like kernel methods, graphical models, and neural networks to learn gene regulation networks and predict phenotypes. Key challenges addressed are handling big data, missing values, non-Gaussian data types like counts and compositional data. The goal is to better understand complex biological systems from multi-omics data.
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r...tuxette
The document summarizes preliminary results from evaluating methods for inferring gene regulatory networks from expression data in Bacillus subtilis. It finds that recall of the known network is generally poor (<20% for random forest), but inferred clusters still retain biological information about common regulators. It plans to confirm results, test restricting edges to sigma factors, and explore other inference methods like Bayesian networks and ARACNE.
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...tuxette
The document discusses methods for integrating multi-scale omics data using kernel and machine learning approaches. It describes how omics data is large, heterogeneous, and multi-scaled, creating bottlenecks for analysis. Methods discussed for data integration include multiple kernel learning to combine different relational datasets in an unsupervised way. The methods are applied to integrate different datasets from the TARA Oceans expedition to identify patterns in ocean microbial communities. Improving interpretability of the methods and making them more accessible to biological users is discussed.
Journal club: Validation of cluster analysis results on validation datatuxette
This document presents a framework for validating cluster analysis results on validation data. It describes situations where clustering is inferential versus descriptive and recommends using validation data separate from the data used for clustering. A typology of validation methods is provided, including validation based on the clustering method or results, and evaluation using internal validation, external validation, visual properties, or stability measures.
The document discusses the differences between overfitting and overparametrization in machine learning models. It explores how random forests may exhibit a phenomenon known as "double descent" where test error initially decreases then increases with more parameters before decreasing again. While double descent has been observed in other models, the document questions whether it is directly due to model complexity in random forests since very large trees may be unable to fully interpolate extremely large datasets.
Selective inference and single-cell differential analysistuxette
This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
SOMbrero : un package R pour les cartes auto-organisatricestuxette
SOMbrero is an R package that implements self-organizing map (SOM) algorithms. It can handle numeric, non-numeric, and relational data. The package contains functions for training SOMs, diagnosing results, and plotting maps. It also includes tools like a shiny app and vignettes to aid users without programming experience. SOMbrero supports missing data imputation and extends SOM to relational datasets through non-Euclidean distance measures.
Graph Neural Network for Phenotype Predictiontuxette
This document describes a study on using graph neural networks (GNNs) for phenotype prediction from gene expression data. The objectives are to determine if including network information can improve predictions, which network types work best, and if GNNs can learn network inferences. It provides background on GNNs and how they generalize convolutional layers to graph data. The authors implemented a GNN model from previous work as a starting point and tested it on different network types to see which network information is most useful for predictions. Their methodology involves comparing GNN performance to other methods like random forests using 10-fold cross validation.
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
Equivariant neural networks and representation theory
A comparison of learning methods to predict N2O fluxes and N leaching
1. A comparison of learning methods to predict N2O fluxes and N leaching
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
Joint work with Marco (Follador & Ratto) and Adrian Leip (EC, Ispra, Italy)
April, 27th, 2012 - BIA, INRA Auzeville
SAMM (Université Paris 1) &
IUT de Carcassonne (Université de Perpignan)
Nathalie Villa-Vialaneix (April 27th, 2012) Comparison of metamodels SAMM & UPVD 1 / 27
3. DNDC-Europe model description
Outline
1 DNDC-Europe model description
2 Methodology
3 Results
5. DNDC-Europe model description
General overview
Modern issues in agriculture
• fight against the food crisis;
• while preserving the environment.
The EC needs simulation tools to
• link direct aids to compliance with standards ensuring proper management;
• quantify the environmental impact of European policies (“Cross Compliance”).
6. DNDC-Europe model description
Cross Compliance Assessment Tool
DNDC is a biogeochemical model.
7. DNDC-Europe model description
Zoom on DNDC-EUROPE
8. DNDC-Europe model description
Moving from DNDC-Europe to metamodeling
Needs for metamodeling
• easier integration into CCAT
• faster execution and more responsive scenario analysis
10. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE: ∼19,000 HSMU (Homogeneous Soil Mapping Units, nominally 1 km² each, but with quite variable area) used for corn cultivation:
• corn corresponds to 4.6% of UAA;
• HSMU for which at least 10% of the agricultural land was used for
corn were selected.
11. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE:
11 input (explanatory) variables (selected by experts and previous simulations):
• N FR (N input through fertilization; kg/ha/y);
• N MR (N input through manure spreading; kg/ha/y);
• Nfix (N input from biological fixation; kg/ha/y);
• Nres (N input from root residue; kg/ha/y);
• BD (Bulk Density; g/cm³);
• SOC (Soil organic carbon in topsoil; mass fraction);
• PH (Soil pH);
• Clay (Ratio of soil clay content);
• Rain (Annual precipitation; mm/y);
• Tmean (Annual mean temperature; °C);
• Nr (Concentration of N in rain; ppm).
12. DNDC-Europe model description
Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE:
2 outputs to be estimated (independently) from the inputs:
• N2O fluxes (a greenhouse gas);
• N leaching (one major cause of water pollution).
16. Methodology
Methodology
Purpose: comparison of several metamodeling approaches (accuracy, computational time...).
For every data set, every output and every method:
1 The data set was split into a training set and a test set (on an 80%/20% basis);
2 The regression function was learned from the training set (with a full validation process for hyperparameter tuning);
3 The performances were calculated on the test set: predictions were made from the inputs and compared to the true outputs.
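The three-step protocol can be sketched in plain Python. This is only a toy illustration: a hypothetical one-parameter grid stands in for the real hyperparameter search, and R² is the accuracy score.

```python
import random

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE/SST, the accuracy score used to compare metamodels."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst

def run_protocol(data, grid, seed=0):
    # 1) random 80%/20% train/test split
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(0.8 * len(data))
    train, test = data[:cut], data[cut:]
    # 2) hyperparameter tuning by simple validation inside the training set
    #    (here the "model" is y ~ a * x, and "a" plays the hyperparameter role)
    val = train[int(0.75 * len(train)):]
    best = min(grid, key=lambda a: sum((y - a * x) ** 2 for x, y in val))
    # 3) performance computed on the held-out test set only
    preds = [best * x for x, _ in test]
    return best, r_squared([y for _, y in test], preds)

# toy data with y = 2x: the protocol should recover slope 2 and test R^2 = 1
data = [(float(x), 2.0 * x) for x in range(100)]
best_a, r2 = run_protocol(data, grid=[0.5, 1.0, 2.0, 3.0])
```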
17. Methodology
Methods
• 2 linear models:
• one with the 11 explanatory variables;
• one with the 11 explanatory variables plus several nonlinear transformations of these variables (square, log, ...); stepwise AIC was used to train the model;
• MLP (multilayer perceptrons)
• SVM (support vector machines)
• RF (random forests)
• 3 approaches based on splines: ACOSSO (ANOVA splines), SDR (an improvement of the previous one) and DACE (a kriging-based approach).
18. Methodology
Regression
Consider the problem where:
• Y ∈ R has to be estimated from X ∈ R^d;
• we are given a learning set, i.e., N i.i.d. observations of (X, Y): (x_1, y_1), ..., (x_N, y_N).
Example: predict N2O fluxes from PH, climate, concentration of N in rain, fertilization, ... for a large number of HSMU.
19. Methodology
Multilayer perceptrons (MLP)
A “one-hidden-layer perceptron” takes the form:
Φ_w : x ∈ R^d → Σ_{i=1}^{Q} w_i^(2) G(x^T w_i^(1) + w_i^(0)) + w_0^(2)
where:
• the w are the weights of the MLP that have to be learned from the learning set;
• G is a given activation function: typically, G(z) = (1 − e^(−z)) / (1 + e^(−z));
• Q is the number of neurons on the hidden layer. It controls the flexibility of the MLP. Q is a hyperparameter that is usually tuned during the learning process.
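A minimal forward pass for this one-hidden-layer perceptron, in plain Python. The weight layout and argument names are illustrative, not from the slides:

```python
import math

def G(z):
    """Activation from the slide: G(z) = (1 - e^(-z)) / (1 + e^(-z))."""
    return (1.0 - math.exp(-z)) / (1.0 + math.exp(-z))

def mlp(x, w1, w0, w2, w2_0):
    """One-hidden-layer perceptron Phi_w(x) with Q = len(w1) hidden neurons.

    w1: list of Q weight vectors (one per hidden neuron, each of length d)
    w0: list of Q hidden-layer biases
    w2: list of Q output weights
    w2_0: output bias
    """
    hidden = [G(sum(xj * wj for xj, wj in zip(x, w1[i])) + w0[i])
              for i in range(len(w1))]
    return sum(w2[i] * hidden[i] for i in range(len(w2))) + w2_0
```

With all weights at zero, G(0) = 0 and the network outputs its bias, which gives an easy sanity check.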
20. Methodology
Symbolic representation of MLP
[Diagram: the d inputs x_1, ..., x_d feed Q hidden neurons through the first-layer weights w^(1); the neuron outputs are combined through the second-layer weights w^(2), plus biases w^(0), into φ_w(x).]
24. Methodology
Learning MLP
• Learning the weights: w are learned by a mean squared error minimization scheme penalized by a weight decay to avoid overfitting (ensure a better generalization ability):
w* = arg min_w Σ_{i=1}^{N} L(y_i, Φ_w(x_i)) + C ||w||^2.
Problem: the MSE is not quadratic in w and thus some solutions can be local minima.
• Tuning the hyperparameters C and Q: simple validation was used to tune C and Q.
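The penalized criterion can be illustrated with gradient descent on a 1-D linear stand-in for Φ_w (an assumption made so the minimizer has a checkable closed form; a real MLP objective is non-convex and its gradient is obtained by backpropagation):

```python
def fit_weight_decay(xs, ys, C=0.1, lr=0.01, steps=2000):
    """Gradient descent on sum_i (y_i - w*x_i)^2 + C*w^2, a 1-D linear
    stand-in for the penalized MSE criterion of the slide."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of the penalized criterion: -2*sum(x*(y - w*x)) + 2*C*w
        grad = sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) + 2.0 * C * w
        w -= lr * grad / n  # divide by n only to keep the step size stable
    return w

# for this quadratic stand-in the minimizer has a closed form:
# w* = sum(x*y) / (sum(x^2) + C)
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_decay = fit_weight_decay(xs, ys, C=0.1)
w_exact = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + 0.1)
```

The decay term shrinks the solution below the unpenalized slope (here 2), which is exactly the regularization effect the penalty is meant to have.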
25. Methodology
SVM
SVM is also an algorithm based on penalized error loss minimization:
1 Basic linear SVM for regression: Φ_(w,b) is of the form x → w^T x + b, with (w, b) the solution of
arg min_(w,b) Σ_{i=1}^{N} L_ε(y_i, Φ_(w,b)(x_i)) + λ ||w||^2
where:
• λ is a regularization (hyper)parameter (to be tuned);
• L_ε(y, ŷ) = max{|y − ŷ| − ε, 0} is the ε-insensitive loss function.
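The ε-insensitive loss is a one-liner; the example values below are hypothetical:

```python
def eps_insensitive(y, y_hat, eps=0.1):
    """L_eps(y, y_hat) = max(|y - y_hat| - eps, 0): errors smaller than eps
    cost nothing, which is what later produces sparse support vectors."""
    return max(abs(y - y_hat) - eps, 0.0)
```

A prediction within the ε-tube around the target incurs zero loss; outside the tube the loss grows linearly.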
28. Methodology
SVM
SVM is also an algorithm based on penalized error loss minimization:
1 Basic linear SVM for regression
2 Nonlinear SVM for regression is the same except that a fixed nonlinear transformation of the inputs is applied first: ϕ(x) ∈ H is used instead of x.
Kernel trick: in fact, ϕ is never made explicit but is used through a kernel K : R^d × R^d → R, with K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
Common kernel: the Gaussian kernel
K_γ(u, v) = e^(−γ ||u − v||^2)
is known to have good theoretical properties for both accuracy and generalization.
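A sketch of the Gaussian kernel and of the N × N kernel matrix whose computation is the potentially costly training step (pure Python, illustrative names):

```python
import math

def gaussian_kernel(u, v, gamma=1.0):
    """K_gamma(u, v) = exp(-gamma * ||u - v||^2)."""
    sq = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq)

def kernel_matrix(xs, gamma=1.0):
    """The N x N matrix K(x_i, x_j), i, j = 1, ..., N."""
    return [[gaussian_kernel(xi, xj, gamma) for xj in xs] for xi in xs]
```

The matrix is symmetric with ones on the diagonal, since K(x, x) = e^0 = 1.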
31. Methodology
Learning SVM
• Learning (w, b): w = Σ_{i=1}^{N} α_i K(x_i, ·) and b are calculated by an exact optimization scheme (quadratic programming). The only step that can be time consuming is the calculation of the kernel matrix K(x_i, x_j) for i, j = 1, ..., N.
The resulting estimator is known to be of the form
Φ̂_N(x) = Σ_{i=1}^{N} α_i K(x_i, x) + b
where only a few α_i are non-zero. The corresponding x_i are called support vectors.
• Tuning the hyperparameters C = 1/λ and γ: simple validation has been used. To save time, ε was not tuned in our experiments but set to the default value (1), which ensured 0.5N support vectors at most.
32. Methodology
From regression tree to random forest
Example of a regression tree
[Figure: example regression tree, splitting on SOCt, PH, FR and clay, with leaf predictions ranging from 2.685 to 59.330.]
Each split is made such that the two induced subsets have the greatest possible homogeneity. The prediction of a final node is the mean of the Y values of the observations belonging to this node.
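The split criterion and leaf prediction can be sketched for a single numeric variable (a simplified, assumed setup; a real regression tree also recurses and considers several variables):

```python
def leaf_mean(ys):
    """Prediction of a final node: mean of the Y values falling in it."""
    return sum(ys) / len(ys)

def best_split(xs, ys):
    """Choose the threshold minimising the summed squared error of the two
    induced subsets -- i.e. maximising their homogeneity.
    Assumes at least two distinct x values."""
    def sse(sub):
        if not sub:
            return 0.0
        m = sum(sub) / len(sub)
        return sum((y - m) ** 2 for y in sub)
    candidates = sorted(set(xs))[1:]  # thresholds between observed values
    def cost(t):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        return sse(left) + sse(right)
    return min(candidates, key=cost)
```

On data with two clearly separated groups, the chosen threshold falls between them and both children become perfectly homogeneous.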
34. Methodology
Random forest
Basic principle: combination of a large number of under-efficient regression trees (the prediction is the mean prediction over all trees).
For each tree, two simplifications of the original method are performed:
1 A given number of observations is randomly chosen from the training set: this subset of the training data is called the in-bag sample, whereas the other observations are called out-of-bag and are used to control the error of the tree;
2 For each node of the tree, a given number of variables is randomly chosen among all possible explanatory variables.
The best split is then calculated on the basis of these variables and the chosen observations. The chosen observations are the same for a given tree, whereas the variables taken into account change for each split.
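The two randomizations and the averaging can be sketched as follows (function names are illustrative; real implementations such as the randomForest R package combine them with full tree growing):

```python
import random

def draw_in_bag(n, seed=0):
    """Bootstrap the in-bag sample for one tree; the remaining observations
    are out-of-bag and can be used to monitor that tree's error."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def pick_split_vars(d, mtry, seed=0):
    """At each node, only a random subset of the d variables is examined."""
    rng = random.Random(seed)
    return rng.sample(range(d), mtry)

def forest_predict(trees, x):
    """Forest prediction = mean of the (under-efficient) individual trees."""
    return sum(tree(x) for tree in trees) / len(trees)
```

With, say, 11 input variables and mtry = 3, a fresh triple of candidate variables is drawn for every split, while the in-bag sample is drawn once per tree.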
36. Methodology
Learning a random forest
Random forest are not very sensitive to hyper-parameters (number of
observations for each tree, number of variables for each split): the default
values have been used.
The number of trees should be large enough for the mean squared error computed on the out-of-bag observations to stabilize:
[Figure: out-of-bag (training) and test errors versus the number of trees (0 to 500).]
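This stabilization check can be sketched with scikit-learn's `warm_start` option, growing the same forest in stages and recording the out-of-bag mean squared error after each stage (synthetic data and hypothetical parameter values, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (the study's DNDC simulations are not reproduced here)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.2, size=300)

# Grow the forest incrementally and record the out-of-bag MSE:
# it should decrease and then stabilize, as in the error curve on the slide.
rf = RandomForestRegressor(n_estimators=25, oob_score=True,
                           warm_start=True, random_state=0)
oob_mse = []
for n in (25, 100, 500):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)  # warm_start: only the additional trees are fitted
    oob_mse.append(np.mean((y - rf.oob_prediction_) ** 2))
```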
39. Results
Influence of the training sample size
[Figure: R2 for N leaching prediction versus log training sample size (5 to 9), for LM1, LM2, Dace, SDR, ACOSSO, MLP, SVM and RF.]
40. Results
Computational time
Use          LM1        LM2      Dace     SDR       ACOSSO
Train        <1 s       50 min   80 min   4 hours   65 min
Prediction   <1 s       <1 s     90 s     14 min    4 min

Use          MLP        SVM      RF
Train        2.5 hours  5 hours  15 min
Prediction   1 s        20 s     5 s
Time for DNDC: about 200 hours on a desktop computer and about 2 days using a cluster!
41. Results
Further comparisons
Evaluation of the different steps (time/difficulty)

         Training  Validation  Test
LM1      ++        +
LM2      +         +
ACOSSO   =         +           -
SDR      =         +           -
DACE     =         -           -
MLP      -         -           +
SVM      =         -           -
RF       +         +           +
43. Results
Understanding which inputs are important
Importance: a measure of the importance of the input variables can be defined as follows:
• for a given input variable, randomly permute its values and compute the predictions from these permuted inputs;
• compare the accuracy of these predictions to that of the predictions obtained with the true inputs: the increase in mean squared error is called the importance.
This comparison is made on data that were not used to train the machine: either the validation set or the out-of-bag observations.
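The permutation procedure above can be sketched as follows; `permutation_importance` and `FirstInputModel` are hypothetical names introduced for illustration, not the authors' implementation:

```python
import numpy as np

def permutation_importance(model, X, y, rng=None):
    """Permute one input column at a time on held-out data and record
    the resulting increase in mean squared error (the importance)."""
    rng = rng or np.random.default_rng(0)
    base_mse = np.mean((y - model.predict(X)) ** 2)
    importances = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between input j and y
        importances.append(np.mean((y - model.predict(Xp)) ** 2) - base_mse)
    return np.array(importances)

# Toy check with a model that only uses the first input (synthetic data):
# permuting input 0 should increase the MSE, permuting the others should not.
class FirstInputModel:
    def predict(self, X):
        return 3 * X[:, 0]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
imp = permutation_importance(FirstInputModel(), X, 3 * X[:, 0])
```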
44. Results
Understanding which inputs are important
Example (N2O, RF):
[Figure: importance (mean decrease in MSE) versus rank for the 11 inputs: pH, Nr, N_MR, Nfix, N_FR, clay, Nres, Tmean, BD, rain, SOC.]
The variables SOC and pH are the most important for accurate predictions.
45. Results
Understanding which inputs are important
Example (N leaching, SVM):
[Figure: importance (decrease in MSE) versus rank for the 11 inputs: N_FR, Nres, pH, Nr, clay, rain, SOC, Tmean, Nfix, BD, N_MR.]
The variables N_MR, N_FR, Nres and pH are the most important for
accurate predictions.
46. Results
Thank you for your attention
Any questions?
47. Results
Villa-Vialaneix, N., Follador, M., Ratto, M., and Leip, A. (2012). A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops. Environmental Modelling and Software, 34:51–66.