Presentation at the International Conference on Hybrid Artificial Intelligent Systems (HAIS) 2018 of a preliminary study of diversity in ensembles, applied to Extreme Learning Machines (ELM)
Scaling Multinomial Logistic Regression via Hybrid Parallelism, by Parameswaran Raman
Distributed algorithms in machine learning follow two main paradigms: data parallel, where the data is distributed across multiple workers and model parallel, where the model parameters are partitioned across multiple workers. The main limitation of the first approach is that the model parameters need to be replicated on every machine. This is problematic when the number of parameters is very large, and hence cannot fit in a single machine. The drawback of the latter approach is that the data needs to be replicated on each machine. Such replications limit the scalability of machine learning algorithms, since in several real-world tasks it is observed that the data and model sizes typically grow hand in hand. In this talk, I will present Hybrid-Parallelism, a new paradigm that partitions both, the data as well as the model parameters simultaneously in a completely de-centralized manner. As a result, each worker only needs access to a subset of the data and a subset of the parameters while performing parameter updates. Next, I will present a case-study showing how to apply these ideas to reformulate Multinomial Logistic Regression to achieve Hybrid Parallelism (DSMLR: Doubly-Separable Multinomial Logistic Regression). Finally, I will demonstrate the versatility of DS-MLR under various scenarios in data and model parallelism, through an empirical study consisting of real-world datasets.
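A minimal sketch of the double-partitioning idea (illustrative only; this is not the DSMLR update rule, just the data/parameter layout the talk describes): with P workers, the n examples and the K class-weight vectors are each split into P blocks, so no machine holds the full data or the full model.

```python
import numpy as np

# Hybrid partitioning layout: worker p holds only a shard of the rows of X
# and a shard of the columns of W (the per-class parameters).
n, d, K, P = 1000, 50, 200, 4
X = np.random.randn(n, d)
W = np.zeros((d, K))

row_blocks = np.array_split(np.arange(n), P)
class_blocks = np.array_split(np.arange(K), P)

for p in range(P):
    X_local = X[row_blocks[p]]           # worker p's data shard
    W_local = W[:, class_blocks[p]]      # worker p's parameter shard
    print(f"worker {p}: data {X_local.shape}, params {W_local.shape}")
```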
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
Inria Tech Talk - Clustering complex data with MASSICCC, by Stéphanie Roger
MASSICCC - A SaaS platform for the clustering of complex, heterogeneous, and incomplete data.
In this Tech Talk, come discover, test, and learn to master MASSICCC (Massive Clustering in Cloud Computing), a user-oriented SaaS platform, along with its three families of #classification algorithms, the fruit of the latest advances of Inria's Modal & Celeste research teams, for analyzing and learning from your "Big Data" (e.g., real estate, predictive maintenance, healthcare, open data, etc.).
MASSICCC also offers:
- Free access for testing and research at https://massiccc.lille.inria.fr
- A "one for all" approach to clustering
- Highly interpretable results (with its plots)
- A SaaS mode that lets you keep track of your experiments (running or completed)
- And open-source algorithms that can be reused independently.
The document outlines a presentation on evolutionary algorithms and their performance on optimization problems. It introduces evolutionary algorithms and describes classical evolutionary programming (CEP), fast evolutionary programming (FEP), and improved fast evolutionary programming (IFEP). It then discusses benchmark functions used to evaluate the algorithms' performance, including sphere, Schwefel, Rastrigin, and Rosenbrock functions. Tables show the results of applying CEP, FEP, and IFEP to optimize 10 standard benchmark functions, finding that IFEP generally performs best.
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model, by weekendsunny
This document summarizes the author's approach to predicting short-term stock price movements in the 2010 INFORMS Data Mining Contest. The author began with support vector machines and logistic regression, then tried LASSO (logistic regression with variable selection) and other methods, eventually settling on a two-stage variable selection method with LASSO on lagged data to select variables for a generalized linear model, which achieved 3rd place. The document outlines the basic analysis, the variable selection methods explored, including traditional approaches and the L1-penalized LASSO, and the results of using future information against the evaluation criteria.
Traffic flow modeling on road networks using Hamilton-Jacobi equations, by Guillaume Costeseque
This document discusses traffic flow modeling using Hamilton-Jacobi equations on road networks. It motivates the use of macroscopic traffic models based on conservation laws and Hamilton-Jacobi equations to describe traffic flow. These models can capture traffic behavior at an aggregate level based on density, flow, and speed. The document outlines different orders of macroscopic traffic models, from first-order Lighthill-Whitham-Richards models to higher-order models that account for additional traffic attributes. It also discusses the relationship between microscopic car-following models and the emergence of macroscopic behavior through homogenization.
This document discusses challenges in comparing structured models using cross-validation. It presents a decision-theoretic framework for model assessment and selection based on expected predictive loss. Different methods for estimating predictive loss are discussed, including k-fold cross-validation which is used to estimate the predictive loss of various multilevel models for a dataset from the Cooperative Congressional Election Survey with deeply nested demographic variables.
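For concreteness, here is a minimal sketch (my own illustration, not the talk's code) of estimating expected predictive loss with k-fold cross-validation, using squared error and ordinary least squares as stand-ins for the loss and the model being assessed:

```python
import numpy as np

# Generic k-fold estimate of expected predictive loss (here: squared error),
# the quantity a decision-theoretic framework compares across candidate models.
def kfold_predictive_loss(X, y, fit, predict, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    losses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        losses.append(np.mean((predict(model, X[fold]) - y[fold]) ** 2))
    return np.mean(losses)

# Example with ordinary least squares as the model being assessed.
X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
print(kfold_predictive_loss(X, y, fit, predict))
```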
Molodtsov's Soft Set Theory and its Applications in Decision Making, by inventionjournals
Molodtsov's soft set theory was originally proposed as a general mathematical tool for dealing with uncertainty. In this paper, we apply the theory of soft set to solve a decision making problem in terms of rough mathematics.
This document compares the performance of genetic algorithms and niching methods for clustering undirected weighted graphs. It discusses how genetic algorithms can converge prematurely on local optima for complex problems like clustering that have many potential solutions. Niching methods like deterministic crowding are introduced to maintain population diversity and allow the search of multiple peaks in parallel. The paper applies genetic algorithms and deterministic crowding to the graph clustering problem and compares their results on test graphs, finding that deterministic crowding is more computationally demanding but provides better optimization.
In this lecture, you will learn two of the most popular methods for classifying data points into a finite set of categories. Both methods are based on representing a classifier via its decision boundary which is a hyperplane. The parameters of the hyperplane are learned from training data by minimizing a particular loss function.
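As a hedged sketch of the idea (the lecture does not name its two methods here; logistic regression is one standard instance), the following learns the hyperplane parameters by gradient descent on the logistic loss; swapping in the hinge loss would give a linear SVM instead:

```python
import numpy as np

# Learn a separating hyperplane w.x + b = 0 by minimizing the logistic loss
# log(1 + exp(-y * (w.x + b))) with plain gradient descent on synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    margins = y * (X @ w + b)
    grad_common = -y / (1 + np.exp(margins))      # d(log-loss)/d(margin)
    w -= lr * (X * grad_common[:, None]).mean(axis=0)
    b -= lr * grad_common.mean()

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```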
Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach is to approximate and use the tensor in the (so-called) canonical tensor format. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.
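A small illustration (with invented dimensions) of why the canonical format helps: evaluating a tensor stored as a sum of R rank-one products needs only d factor matrices of size n x R instead of the full n**d array.

```python
import numpy as np

# Canonical (CP) format: f(i1,...,id) ~ sum_r A1[i1,r] * A2[i2,r] * ... * Ad[id,r].
d, n, R = 6, 10, 3
factors = [np.random.rand(n, R) for _ in range(d)]   # storage: d*n*R, not n**d

def cp_eval(indices):
    prod = np.ones(R)
    for A, i in zip(factors, indices):
        prod *= A[i]        # pick row i of each factor, multiply rank-wise
    return prod.sum()

print(cp_eval([1, 2, 3, 4, 5, 6]))
```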
This document proposes applying boosting techniques to attraction-based demand models that are popular in pricing optimization. It formulates a multinomial likelihood for a semiparametric demand choice model (DCM) where product utility is specified without a fixed functional form. Gradient boosting is used to maximize the likelihood and estimate the nonparametric utility functions. The boosted tree-based approach flexibly models utility as a sum of trees, addressing limitations of existing DCMs like non-stationary demand and nonlinear attribute effects.
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data, by Jay (Jianqiang) Wang
This document presents a boosted tree-based multinomial logit model for estimating aggregated market demand from mobile computer sales data. It discusses challenges in modeling high-dimensional choice data with interactions among attributes and price. The proposed model uses gradient boosted trees to flexibly estimate utility functions without specifying a functional form, allowing for varying coefficient and nonparametric specifications. The model is shown to outperform elastic net regularized estimation on Australian mobile computer sales data, with the nonparametric model achieving the best test set performance while capturing complex attribute interactions.
This document discusses important issues in machine learning for data mining, including the bias-variance dilemma. It explains that the difference between the optimal regression and a learned model can be measured by looking at bias and variance. Bias measures the error between the expected output of the learned model and the optimal regression, while variance measures the error between the learned model's output and its expected output. There is a tradeoff between bias and variance: decreasing one typically increases the other. This is known as the bias-variance dilemma. Cross-validation and confusion matrices are also introduced as evaluation techniques.
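The decomposition is easy to verify numerically. The sketch below (an illustration with an invented sine target and polynomial fits, not taken from the document) estimates bias squared and variance of a learned model at a single test point by refitting on many training samples:

```python
import numpy as np

# Empirical bias-variance decomposition at a test point x0:
# E[(f_hat(x0) - f(x0))^2] = bias^2 + variance (noise-free target here).
rng = np.random.default_rng(0)
f = lambda x: np.sin(x)
x0, degree, trials = 1.0, 3, 2000

preds = []
for _ in range(trials):
    x = rng.uniform(0, np.pi, 20)
    y = f(x) + rng.normal(0, 0.3, 20)          # fresh noisy training sample
    coef = np.polyfit(x, y, degree)
    preds.append(np.polyval(coef, x0))

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print(f"bias^2={bias2:.4f}  variance={var:.4f}  sum={bias2 + var:.4f}")
```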
The document discusses using linear programming and the simplex method in R to solve transportation and assignment problems. It presents a transportation problem with 5 origins and 5 destinations and shows the code to solve it in R, obtaining an optimal solution. It also presents an assignment problem with 8 workers and 8 work hours categories and the R code to solve it, minimizing costs.
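The document's code is in R; as an assumed-equivalent illustration, the same kind of balanced transportation LP can be solved in Python with scipy.optimize.linprog (the cost matrix, supplies, and demands below are invented, not the document's 5x5 instance):

```python
import numpy as np
from scipy.optimize import linprog

cost = np.array([[4, 6, 8], [5, 3, 7]])     # 2 origins x 3 destinations
supply = [60, 40]
demand = [30, 40, 30]

# Variables x_ij flattened row-wise; equality rows: supplies, then demands.
A_eq, b_eq = [], []
for i in range(2):                           # each origin ships all its supply
    row = np.zeros(6); row[i * 3:(i + 1) * 3] = 1
    A_eq.append(row); b_eq.append(supply[i])
for j in range(3):                           # each destination receives its demand
    row = np.zeros(6); row[j::3] = 1
    A_eq.append(row); b_eq.append(demand[j])

res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(res.fun, res.x.reshape(2, 3))
```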
In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to some general purpose kernel methods like Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers as the hype around Deep Neural Network architectures seem to suggest or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity.” (“Entia non sunt multiplicanda praeter necessitatem.”) I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.
Improving Performance of Back propagation Learning Algorithm, by ijsrd.com
The standard back-propagation algorithm is one of the most widely used algorithms for training feed-forward neural networks. One major drawback of this algorithm is that it may fall into local minima and suffer from a slow convergence rate. Natural gradient descent, a principled method for nonlinear optimization, is presented and combined with the modified back-propagation algorithm, yielding a new fast multilayer training algorithm. This paper describes a new approach to natural gradient learning in which the number of parameters necessary is much smaller than in the natural gradient algorithm. This new method exploits the algebraic structure of the parameter space to reduce the space and time complexity of the algorithm and improve its performance.
Paper Summary of Disentangling by Factorising (Factor-VAE), by 준식 최
The paper proposes Factor-VAE, which aims to learn disentangled representations in an unsupervised manner. Factor-VAE enhances disentanglement over the β-VAE by encouraging the latent distribution to be factorial (independent across dimensions) using a total correlation penalty. This penalty is optimized using a discriminator network. Experiments on various datasets show that Factor-VAE achieves better disentanglement than β-VAE, as measured by a proposed disentanglement metric, while maintaining good reconstruction quality. Latent traversals qualitatively demonstrate disentangled factors of variation.
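A rough sketch of the total correlation mechanism (shapes, network sizes, and the gamma weight are illustrative; this is not the paper's code): the discriminator is trained to separate true latent samples from dimension-wise permuted ones, and its logit on true samples serves as the TC estimate added to the VAE objective.

```python
import torch
import torch.nn.functional as F

# Permuting each latent dimension independently across the batch destroys
# dependence between dimensions, giving samples from the product of marginals.
def permute_dims(z):
    out = torch.empty_like(z)
    for j in range(z.size(1)):
        out[:, j] = z[torch.randperm(z.size(0)), j]
    return out

D = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

z = torch.randn(128, 10)                     # stand-in for encoder samples
# Discriminator step: classify true z (label 1) vs permuted z (label 0).
logits = torch.cat([D(z.detach()), D(permute_dims(z).detach())])
labels = torch.cat([torch.ones(128, 1), torch.zeros(128, 1)])
loss_D = F.binary_cross_entropy_with_logits(logits, labels)
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# VAE step: the mean logit estimates log q(z)/prod_j q(z_j), i.e. the TC,
# and is added (gamma-weighted) to the usual ELBO loss.
gamma = 6.0
tc_penalty = gamma * D(z).mean()
print(float(loss_D), float(tc_penalty))
```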
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
Dynamic Feature Induction: The Last Gist to the State-of-the-Art, by Jinho Choi
We introduce a novel technique called dynamic feature induction that keeps inducing high dimensional features automatically until the feature space becomes `more' linearly separable. Dynamic feature induction searches for the feature combinations that give strong clues for distinguishing certain label pairs, and generates joint features from these combinations. These induced features are trained along with the primitive low dimensional features. Our approach was evaluated on two core NLP tasks, part-of-speech tagging and named entity recognition, and showed the state-of-the-art results for both tasks, achieving an accuracy of 97.64% and an F1-score of 91.00 respectively, with about a 25% increase in the feature space.
The document discusses machine learning techniques for clustering and segmentation. It introduces Dirichlet process mixtures and the Chinese restaurant process as nonparametric Bayesian models that allow for an infinite number of clusters. It describes how these models can be used for problems like image segmentation, object recognition, population clustering from genetic data, and evolutionary document clustering over time. Approximate inference methods like Markov chain Monte Carlo sampling are used to analyze these models.
Classification with decision trees from a nonparametric predictive inference p..., by NTNU
An application of nonparametric predictive inference for multinomial data (NPI) to classification tasks is presented. This model is applied to an established procedure for building classification trees using imprecise probabilities and uncertainty measures, thus far used only with the imprecise Dirichlet model (IDM), which is defined through the use of a parameter expressing previous knowledge. The accuracy of that classification procedure depends significantly on the value of the parameter used when the IDM is applied. A detailed study involving 40 data sets shows that the procedure using the NPI model (which has no parameter dependence) obtains a better trade-off between accuracy and tree size than the procedure with the IDM, whatever the choice of parameter. In a bias-variance study of the errors, it is shown that the procedure with the NPI model has a lower variance than the one with the IDM, implying a lower level of over-fitting.
This document discusses dimensionality reduction techniques for machine learning. It introduces Fisher Linear Discriminant analysis, which seeks projection directions that maximize separation between classes while minimizing within-class variance. It describes using the means and scatter measures of each class to define a cost function that is maximized to find the optimal projection direction. Principal Component Analysis is also briefly mentioned as another technique for dimensionality reduction.
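For the two-class case the Fisher direction has a closed form, w proportional to Sw^{-1}(m1 - m0); a small numpy sketch with synthetic Gaussian classes (my own illustration, not the document's example) is:

```python
import numpy as np

# Fisher's linear discriminant: the projection maximizing between-class
# separation over within-class scatter is w = Sw^{-1} (m1 - m0).
rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, (100, 2))
X1 = rng.normal([3, 1], 1.0, (100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter = sum of (unnormalized) class covariances.
Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)
print("projection direction:", w)
```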
An optimal design of current conveyors using a hybrid-based metaheuristic alg..., by IJECEIAES
This paper focuses on the optimal sizing of a positive second-generation current conveyor (CCII+), employing a hybrid algorithm named DE-ACO, which is derived from the combination of differential evolution (DE) and ant colony optimization (ACO) algorithms. The basic idea of this hybridization is to apply the DE algorithm for the ACO algorithm’s initialization stage. Benchmark test functions were used to evaluate the proposed algorithm’s performance regarding the quality of the optimal solution, robustness, and computation time. Furthermore, the DE-ACO has been applied to optimize the CCII+ performances. SPICE simulation is utilized to validate the achieved results, and a comparison with the standard DE and ACO algorithms is reported. The results highlight that DE-ACO outperforms both ACO and DE.
The document discusses genetic algorithms and genetic programming. It explains that genetic algorithms perform a parallel search of the hypothesis space to optimize a fitness function, mimicking biological evolution. New hypotheses are generated through mutation and crossover of existing hypotheses. Genetic programming similarly evolves computer programs represented as trees through genetic operators. An example shows a genetic programming approach for stacking blocks to spell a word.
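A bare-bones sketch of that loop (fitness function, population size, and operators invented for illustration), maximizing a toy bit-counting fitness:

```python
import numpy as np

# Minimal genetic algorithm: selection, one-point crossover, bit-flip mutation.
rng = np.random.default_rng(0)
pop = rng.integers(0, 2, (30, 20))            # 30 individuals, 20 bits each

def fitness(pop):
    return pop.sum(axis=1)                    # toy fitness: count of ones

for gen in range(50):
    f = fitness(pop)
    # Tournament selection: keep the fitter of two random individuals.
    a, b = rng.integers(0, len(pop), (2, len(pop)))
    parents = np.where((f[a] > f[b])[:, None], pop[a], pop[b])
    # One-point crossover between consecutive parents.
    cut = rng.integers(1, pop.shape[1], len(pop))
    mask = np.arange(pop.shape[1]) < cut[:, None]
    children = np.where(mask, parents, np.roll(parents, 1, axis=0))
    # Bit-flip mutation with small probability.
    flip = rng.random(children.shape) < 0.01
    pop = np.where(flip, 1 - children, children)

print("best fitness:", fitness(pop).max())
```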
This document discusses using evolutionary techniques like genetic algorithms and particle swarm optimization to reduce the order of large-scale linear systems. Specifically, it examines using GA and PSO to find stable reduced order models by minimizing the integral squared error between the original and reduced system responses. Both techniques guarantee stability if the original system is stable. The paper provides background on model order reduction techniques and describes how GA and PSO are applied, including representing systems as chromosomes and using selection, crossover, and mutation operators. Results from a numerical example are compared to a conventional method.
LNCS 5050 - Bilevel Optimization and Machine Learning, by butest
This document discusses using bilevel optimization and machine learning techniques to improve model selection in machine learning problems. It proposes framing machine learning model selection as a bilevel optimization problem, where the inner level problems involve optimizing models on training data and the outer level problem selects hyperparameters to minimize error on test data. This bilevel framing allows for systematic optimization of hyperparameters and enables novel machine learning approaches. The document illustrates the approach for support vector regression, formulating model selection as a Stackelberg game and solving the resulting mathematical program with equilibrium constraints.
A simple framework for contrastive learning of visual representations, by Devansh16
Link: https://machine-learning-made-simple.medium.com/learnings-from-simclr-a-framework-contrastive-learning-for-visual-representations-6c145a5d8e99
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
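A compact sketch of the NT-Xent contrastive loss at the heart of SimCLR (my own minimal implementation of the published formula, with random vectors standing in for the encoder-plus-projection-head outputs):

```python
import torch
import torch.nn.functional as F

# NT-Xent for a batch of N images: rows 0..N-1 and N..2N-1 of z hold the two
# augmented views; view i's positive is row i+N, everything else is a negative.
def nt_xent(z, temperature=0.5):
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temperature                  # cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-pairs
    n = z.size(0) // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z = torch.randn(64, 128)    # 32 images x 2 views, 128-d projections (stand-in)
print(float(nt_xent(z)))
```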
Comments: ICML'2020. Code and pretrained models at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as: arXiv:2002.05709 [cs.LG]
(or arXiv:2002.05709v3 [cs.LG] for this version)
Submission history
From: Ting Chen [view email]
[v1] Thu, 13 Feb 2020 18:50:45 UTC (5,093 KB)
[v2] Mon, 30 Mar 2020 15:32:51 UTC (5,047 KB)
[v3] Wed, 1 Jul 2020 00:09:08 UTC (5,829 KB)
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL, by ijcsit
Predicting student performance is a great concern for higher education management. This prediction helps to identify and to improve students' performance. Several factors may improve this performance. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher educational system. Recently, a new direction has emerged for improving classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning algorithm using an AdaBoost ensemble with a simple genetic algorithm, called "Ada-GA", where the genetic algorithm is demonstrated to successfully improve the accuracy of the combined classifier. The Ada-GA algorithm proved to be of considerable usefulness in identifying at-risk students early, especially in very large classes. This early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results showed that this algorithm successfully improved detection accuracy while reducing the computational complexity.
This document summarizes the application of computational intelligence techniques like genetic algorithms and particle swarm optimization for solving economic load dispatch problems. It first applies a real-coded genetic algorithm to minimize generation costs for a 6-generator test system with continuous fuel cost equations, showing superiority over quadratic programming. It then uses particle swarm optimization to minimize costs for a 10-generator system with each generator having discontinuous fuel options, showing better results than other published methods. The document provides background on economic load dispatch problems and optimization techniques like quadratic programming, genetic algorithms, and particle swarm optimization.
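As a hedged illustration of the PSO side only (a toy quadratic fuel-cost curve and box limits; the actual dispatch problem adds a power-balance constraint and the discontinuous fuel options described above):

```python
import numpy as np

# Minimal particle swarm optimizer of the kind applied to economic load dispatch.
rng = np.random.default_rng(0)
cost = lambda P: (0.01 * P**2 + 2.0 * P + 10.0).sum(axis=-1)  # per-generator quadratic

n_particles, dim = 30, 6                      # 6 generators
x = rng.uniform(10, 100, (n_particles, dim))  # candidate outputs (MW)
v = np.zeros_like(x)
pbest, pbest_f = x.copy(), cost(x)
gbest = pbest[pbest_f.argmin()]

for _ in range(200):
    r1, r2 = rng.random((2, n_particles, dim))
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
    x = np.clip(x + v, 10, 100)               # respect generator limits
    f = cost(x)
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    gbest = pbest[pbest_f.argmin()]

print("best cost found:", pbest_f.min())
```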
Flavours of Physics Challenge: Transfer Learning approach, by Alexander Rakhlin
Presentation for the "Heavy Flavour Data Mining workshop", February 18-19, University of Zurich. I discuss the solution that won the Physics Prize of the Flavours of Physics challenge organized by CERN, Yandex, and Intel on Kaggle.
Optimization of Mechanical Design Problems Using Improved Differential Evolut..., by IDES Editor
Differential Evolution (DE) is a novel evolutionary approach capable of handling non-differentiable, non-linear, and multi-modal objective functions. DE has been consistently ranked as one of the best search algorithms for solving global optimization problems in several case studies. This paper presents an Improved Constraint Differential Evolution (ICDE) algorithm for solving constrained optimization problems. The proposed ICDE algorithm differs from the unconstrained DE algorithm only in the initialization, the selection of particles for the next generation, and the sorting of the final results. We also applied the new idea to five versions of the DE algorithm. The performance of the ICDE algorithm is validated on four mechanical engineering problems. The experimental results demonstrate the performance of the ICDE algorithm in terms of final objective function value, number of function evaluations, and convergence time.
MIMO system order reduction using real-coded genetic algorithm, by Cemal Ardil
This document describes a method for reducing the order of multi-input multi-output (MIMO) systems using real-coded genetic algorithms. The method aims to minimize the integral square error between the transient responses of the original and reduced order models. It treats both the numerator and denominator parameters of the reduced order model as free parameters to be optimized. A real-coded genetic algorithm is used to search for the parameter values that minimize the error. The method is illustrated with an example and shown to produce results comparable to other established order reduction techniques while guaranteeing stability of the reduced model.
Multimodal Biometrics Recognition by Dimensionality Diminution Method, by IJERA Editor
A multimodal biometric system uses two or more biometric modalities, e.g., face, ear, fingerprint, signature, or palmprint, to improve the recognition accuracy of conventional unimodal methods. We propose a new dimensionality reduction method called Dimension Diminish Projection (DDP) in this paper. DDP can not only preserve local information by capturing the intra-modal geometry, but also effectively extract between-class relevant structures for classification. Experimental results show that our proposed method performs better than other algorithms, including PCA, LDA, and MFA.
Chap 8. Optimization for training deep models, by Young-Geun Choi
Internal lab seminar material: a summary of, and excerpts from, Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press, introducing the methods commonly used to optimize objective functions when training deep neural networks.
Machine learning in science and industry — day 1, by arogozhnikov
A course of machine learning in science and industry.
- notions and applications
- nearest neighbours: search and machine learning algorithms
- roc curve
- optimal classification and regression
- density estimation
- Gaussian mixtures and EM algorithm
- clustering, an example of clustering in the OPERA experiment
MULTIPROCESSOR SCHEDULING AND PERFORMANCE EVALUATION USING ELITIST NON DOMINA..., by ijcsa
Task scheduling plays an important part in the improvement of parallel and distributed systems. The problem of task scheduling has been shown to be NP-hard, and deterministic techniques take considerable time to solve it. Algorithms have been developed to schedule tasks in distributed environments, but they focus on a single objective. The problem becomes more complex when two objectives are considered. This paper presents a bi-objective independent task scheduling algorithm using the elitist non-dominated sorting genetic algorithm (NSGA-II) to minimize makespan and flowtime. The algorithm generates Pareto global optimal solutions for this bi-objective task scheduling problem. NSGA-II is implemented using a set of benchmark instances, and the experimental results show that it generates efficient optimal schedules.
Self-sampling Strategies for Multimemetic Algorithms in Unstable Computationa..., by Rafael Nogueras
This document discusses self-sampling strategies for multimemetic algorithms (MMAs) in unstable computational environments subject to churn. It proposes using probabilistic models to sample new individuals when populations need to be enlarged due to node failures. Experimental results show the bivariate model is superior for high churn, maintaining diversity and convergence better than random strategies. Future work aims to extend these self-sampling strategies to dynamic network topologies and more complex probabilistic models.
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU..., by IJCNCJournal
This paper presents a method for constructing intrusion detection systems based on efficient fuzzy rule-based classifiers. The design process of a fuzzy rule-based classifier from a given input-output data set can be presented as a feature selection and parameter optimization problem. For parameter optimization of fuzzy classifiers, differential evolution is used, while the binary harmonic search algorithm is used for the selection of relevant features. The performance of the designed classifiers is evaluated using the KDD Cup 1999 intrusion detection dataset. The optimal classifier is selected based on the Akaike information criterion. The optimal intrusion detection system has a 1.21% type I error and a 0.39% type II error. A comparative study with other methods was accomplished. The results obtained showed the adequacy of the proposed method.
Uncertainty-quantification tasks are often "many query" in nature, as they require repeated evaluations of a model that often corresponds to a parameterized system of nonlinear equations (e.g., arising from the spatial discretization of a PDE). To make this task tractable for large-scale models, low-fidelity models (e.g., reduced-order models, coarse-mesh solutions) must be employed. However, such approximations introduce additional error, which may be treated as a source of epistemic uncertainty that must be quantified to ensure rigor in the ultimate UQ result. We present a new approach to quantify the error (i.e., epistemic uncertainty) introduced by these low-fidelity model approximations. The approach (1) engineers features that are informative of the error using concepts related to dual-weighted residuals and rigorous error bounds, and (2) applies machine learning regression techniques (e.g., artificial neural networks, random forests, support vector machines) to construct a statistical model of the error from these features. We consider both (signed) errors in quantities of interest, as well as global state-space error norms. We present several examples to demonstrate the effectiveness of the proposed approach compared to more conventional feature and regression choices. In each of the examples, the predicted errors have a coefficient of determination value of at least 0.998.
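A minimal sketch of the regression step alone (synthetic features and errors standing in for the dual-weighted-residual features and true quantity-of-interest errors; sklearn's random forest as one of the regressors the abstract lists):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a statistical model of the low-fidelity error from informative features.
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 4))            # e.g., residual-based indicators
error = features @ np.array([0.5, -1.0, 0.2, 0.0]) + 0.05 * rng.normal(size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features[:400], error[:400])
pred = model.predict(features[400:])
ss_res = np.sum((error[400:] - pred) ** 2)
ss_tot = np.sum((error[400:] - error[400:].mean()) ** 2)
print("R^2 on held-out points:", 1 - ss_res / ss_tot)
```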
The document discusses applications of machine learning for robot navigation and control. It describes how surrogate models can be used for predictive modeling in engineering applications like aircraft design. Dimension reduction techniques are used to reduce high-dimensional design parameters to a lower-dimensional space for faster surrogate model evaluation. For robot navigation, regression models on image manifolds are used for visual localization by mapping images to robot positions. Manifold learning is also applied to find low-dimensional representations of valid human hand poses from images to enable easier robot control.
Similar to A preliminary study of diversity in ELM ensembles (HAIS 2018) (20)
Talk for PyConES 2018 in Málaga on how to build song recommendation systems using Fourier analysis, unsupervised machine learning tools, and Python.
Sentiment analysis as a reputational indicator - Master's thesis (TFM), by Carlos Perales
Master's thesis (Mathematical Engineering, Universidad Complutense de Madrid) written by Carlos Perales. It deals with classification algorithms (machine learning) applied to tweets and financial reputation.
Can we predict whether Twitter will sink a bank?, by Carlos Perales
Talk from PyCon ES 2016 in Almería on text classifiers and reputational risk applied to the financial sector.
Sentiment analysis is a tool for exploring opinions about a product through the automatic scoring of social media messages. It was applied to Twitter to extract a metric of a financial institution's reputation and to estimate losses due to reputational risk.
Study and numerical simulation of the shallow water equations, by Carlos Perales
Bachelor's thesis by Carlos Perales González on the shallow water equations, a simplification of the Navier-Stokes equations.
Some of the videos mentioned in the results section can be viewed here: https://www.youtube.com/playlist?list=PLkjZXk8AWCPW18dZUjr093jvvx3NUvY5l
Registered with Safe Creative: https://www.safecreative.org/work/1507284738401-tfg-simulacion-de-aguas-poco-profundas
A numerical study of the Mach number, by Carlos Perales
This document presents the computation of the critical Mach number using the bisection and secant methods. The Mach number and the different flow regimes are defined. The equation to be solved is described and the functions are plotted. Matlab programs implement both numerical methods to approximate the solution, yielding values of 0.738 (bisection) and 0.7396 (secant) for a given pressure coefficient, with the secant method requiring fewer iterations.
Photovoltaic energy in Spain and the world (2004-2008), by Carlos Perales
A short study of photovoltaic production in Spain, the installed capacity, and the energy obtained, comparing the growth of this sector with that of other parts of the world.
Spread of a disease in dynamic populations, by Carlos Perales
A simplified study of the mathematical behavior of growing population models undergoing an epidemic. Written by Carlos Perales for a course in the Physics degree at the UCO (Universidad de Córdoba).
On Cherenkov radiation (presentation), by Carlos Perales
Presentation on the Cherenkov radiation work, extended here: http://www.slideshare.net/CarlosPerales/radiacin-cherenkov-carlos-perales-1 . Written by Carlos Perales for a course in the Physics degree at the UCO (Universidad de Córdoba).
On Cherenkov radiation and cosmic rays, by Carlos Perales
A theoretical foundation and a summary of the applications of Cherenkov radiation, especially in astrophysics. Written by Carlos Perales, a student at the Universidad de Córdoba (UCO), for the Physics degree.
Resumes, Cover Letters, and Applying Online, by Bruce Bennett
This webinar showcases resume styles and the elements that go into building your resume. Every job application requires unique skills, and this session will show you how to improve your resume to match the jobs to which you are applying, giving you the best chance of success when applying for a new position. Additionally, we will discuss cover letters and ideas to include in them. Learn how to take advantage of all the features when uploading a job application to a company's applicant tracking system.
5 Common Mistakes to Avoid During the Job Application Process, by Alliance Jobs
The journey toward landing your dream job can be both exhilarating and nerve-wracking. As you navigate through the intricate web of job applications, interviews, and follow-ups, it’s crucial to steer clear of common pitfalls that could hinder your chances. Let’s delve into some of the most frequent mistakes applicants make during the job application process and explore how you can sidestep them. Plus, we’ll highlight how Alliance Job Search can enhance your local job hunt.
Job Finding Apps Everything You Need to Know in 2024SnapJob
SnapJob is revolutionizing the way people connect with work opportunities and find talented professionals for their projects. Find your dream job with ease using the best job finding apps. Discover top-rated apps that connect you with employers, provide personalized job recommendations, and streamline the application process. Explore features, ratings, and reviews to find the app that suits your needs and helps you land your next opportunity.
A Guide to a Winning Interview June 2024Bruce Bennett
This webinar is an in-depth review of the interview process. Preparation is a key element to acing an interview. Learn the best approaches from the initial phone screen to the face-to-face meeting with the hiring manager. You will hear great answers to several standard questions, including the dreaded “Tell Me About Yourself”.
A preliminary study of diversity in ELM ensembles (HAIS 2018)
1. A preliminary study of diversity in Extreme Learning Machines ensembles
Carlos Perales-González¹, Mariano Carbonero-Ruz¹, David Becerra-Alonso¹, Francisco Fernández-Navarro¹
¹Universidad Loyola Andalucía
HAIS 2018
2. Overview
1 Introduction
Abstract
Extreme Learning Machine
2 Diverse ELM
Other ensembles
Diversity as metric
DELM loss function
3 Experiments, results and conclusions
Description
Results
Conclusions and future work
3. Abstract
In this paper, the neural network version of the Extreme Learning Machine (ELM) is used as a base learner for an ensemble meta-algorithm that promotes diversity explicitly in the ELM loss function. The proposed cost function encourages orthogonality (measured via the scalar product) in the parameter space. Other ensemble-based meta-algorithms from the AdaBoost family are used for comparison purposes. Both the accuracy and the diversity of our proposal are competitive, reinforcing the idea of introducing diversity explicitly.
4. ELM I
Extreme Learning Machine, a.k.a. ridge classification [1], [2], [3]. The first-layer connections are random; mathematically, the classification is a multi-output regression:
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\,\boldsymbol{\beta} \qquad (1)$$
[Figure: network diagram; input features $\mathbf{x}$, hidden mapping $\mathbf{h}(\mathbf{x})$, output weights $\boldsymbol{\beta}$, 1-of-J output.]
5. ELM II
where:
$\mathbf{x} \in \mathbb{R}^m$ is the vector of attributes, and $m$ is the dimension of the input space;
$\mathbf{h}: \mathbb{R}^m \to \mathbb{R}^d$ is the mapping function, and $d$ is the number of hidden nodes (the dimension of the transformed space);
$\boldsymbol{\beta} = (\boldsymbol{\beta}_j,\ j = 1, \dots, J) \in \mathbb{R}^{d \times J}$ is the ELM weight matrix.
The predicted label $y$ is obtained from the function $f(\mathbf{x})$ as
$$y(\mathbf{x}) = \arg\max_{j=1,\dots,J} f(\mathbf{x})_j. \qquad (2)$$
6. ELM III
We worked with the neural version of ELM, so $\mathbf{h}(\mathbf{x}_i)$ is defined as
$$\mathbf{h}(\mathbf{x}_i) = (\phi(\mathbf{x}_i; \mathbf{w}_j, b_j),\ j = 1, \dots, d), \qquad (3)$$
where
$$\phi(\cdot\,; \mathbf{w}_j, b_j): \mathbb{R}^m \to \mathbb{R} \qquad (4)$$
is the activation function of the $j$-th hidden node. In this case, a sigmoid
was used.
7. ELM IV
Learning problem: let us also denote $\mathbf{H} = (\mathbf{h}(\mathbf{x}_i),\ i = 1, \dots, n) \in \mathbb{R}^{n \times d}$ as the transformation of the training set, and $\mathbf{Y} \in \mathbb{R}^{n \times J}$ as the matrix of "1-of-J"-encoded labels:
$$\min_{\boldsymbol{\beta} \in \mathbb{R}^{d \times J}} \|\boldsymbol{\beta}\|^2 + C\,\|\mathbf{H}\boldsymbol{\beta} - \mathbf{Y}\|^2, \qquad (5)$$
where $C \in \mathbb{R}^+$ is a cross-validated regularization hyper-parameter. The matrix solution is
$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1} \mathbf{H}^{T}\mathbf{Y}. \qquad (6)$$
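To make Eqs. (3)-(6) concrete, the following NumPy sketch trains and evaluates a neural ELM under the slides' assumptions (sigmoid activations, one-hot labels). All variable names (n_hidden, C, etc.) are ours, for illustration only.

```python
import numpy as np

def elm_fit(X, Y, n_hidden=50, C=1.0, rng=None):
    """Train a neural ELM: random first layer (Eq. 3), ridge solution (Eq. 6)."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    W = rng.normal(size=(m, n_hidden))      # random input weights w_j
    b = rng.normal(size=n_hidden)           # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid hidden layer, H in R^{n x d}
    # beta = (I/C + H^T H)^{-1} H^T Y   (Eq. 6)
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predicted label: arg max over the J outputs (Eq. 2)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

# Tiny usage example with random data (3 classes, one-hot encoded)
X = np.random.default_rng(0).normal(size=(100, 5))
y = np.random.default_rng(1).integers(0, 3, size=100)
Y = np.eye(3)[y]
W, b, beta = elm_fit(X, Y, n_hidden=20, C=10.0, rng=0)
print((elm_predict(X, W, b, beta) == y).mean())  # training accuracy
```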
8. Other ensembles
The two main approaches to combining several classifiers into one predictive model:
Bagging (bootstrap aggregating): several versions of a base learner are trained on subsets sampled from the training set [4]. Random data sampling.
Boosting: combines base learners over several iterations to generate a weighted majority hypothesis [5]. Data sampling depends on performance.
9. Diversity as metric
These ensembles seek diversity through data sampling.
Our proposal: promote diversity explicitly using orthogonality. For two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$,
$$d(\mathbf{u}, \mathbf{v}) = 1 - \frac{\langle \mathbf{u}, \mathbf{v} \rangle^2}{\|\mathbf{u}\|^2 \,\|\mathbf{v}\|^2}. \qquad (7)$$
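Eq. (7) is one minus the squared cosine similarity: it is 1 for orthogonal vectors and 0 for collinear ones. A small NumPy check (our code, not from the paper):

```python
import numpy as np

def pairwise_diversity(u, v):
    """Eq. (7): 1 minus squared cosine similarity; 1 = orthogonal, 0 = collinear."""
    return 1.0 - np.dot(u, v) ** 2 / (np.dot(u, u) * np.dot(v, v))

print(pairwise_diversity(np.array([1.0, 0.0]), np.array([0.0, 2.0])))  # 1.0
print(pairwise_diversity(np.array([1.0, 1.0]), np.array([2.0, 2.0])))  # 0.0
```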
10. Our proposal I
Loss function for explicitly diverse ELM:
$$\min_{\boldsymbol{\beta}^{(s)} \in \mathbb{R}^{d \times J}} \frac{1}{2}\|\boldsymbol{\beta}^{(s)}\|^2 + C\,\|\mathbf{H}\boldsymbol{\beta}^{(s)} - \mathbf{Y}\|^2 + \left(D + \frac{n}{s}\right) \sum_{j=1}^{J} \sum_{k=1}^{s-1} \langle \boldsymbol{\beta}^{(s)}_j, \mathbf{u}^{(k)}_j \rangle^2 \qquad (8)$$
where:
$\mathbf{u}^{(k)} \in \mathbb{R}^{d \times J}$ is the column-by-column normalized $\boldsymbol{\beta}^{(k)}$ from iteration $k$ of the ensemble;
$D > 0$ is a hyperparameter, like $C$.
11. Our proposal II
Hence, $\boldsymbol{\beta}^{(s)}_j$ can be obtained analytically as
$$\boldsymbol{\beta}^{(s)}_j = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H} + \frac{1}{C}\left(D + \frac{n}{s}\right)\mathbf{M}^{(s)}_j\right)^{-1} \mathbf{H}^{T}\mathbf{Y}_j, \quad j = 1, \dots, J, \qquad (9)$$
where $\mathbf{M}^{(s)}_j$ is defined as
$$\mathbf{M}^{(s)}_j \equiv \sum_{k=1}^{s-1} \mathbf{u}^{(k)}_j \mathbf{u}^{(k)T}_j. \qquad (10)$$
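The following sketch assembles Eqs. (9)-(10) into a training loop for the ensemble. Note that the $(D + n/s)$ weighting follows our reconstruction of the slide layout and should be checked against the paper; all names are illustrative.

```python
import numpy as np

def delm_fit(H, Y, S=5, C=1.0, D=0.1):
    """Sketch of the DELM ensemble update (Eqs. 9-10), as we read the slides.

    H: hidden-layer outputs, shape (n, d); Y: one-hot labels, shape (n, J).
    Returns the list of beta^(s) matrices, one per ensemble member.
    """
    n, d = H.shape
    J = Y.shape[1]
    HtH, HtY = H.T @ H, H.T @ Y
    betas, U = [], []                       # U[k][:, j] = normalized beta_j^(k)
    for s in range(1, S + 1):
        beta = np.empty((d, J))
        for j in range(J):
            # M_j^(s) = sum_k u_j^(k) u_j^(k)^T  (Eq. 10); zero for the first learner
            M = sum(np.outer(u[:, j], u[:, j]) for u in U) if U else np.zeros((d, d))
            A = np.eye(d) / C + HtH + (D + n / s) / C * M   # Eq. (9), reconstructed
            beta[:, j] = np.linalg.solve(A, HtY[:, j])
        betas.append(beta)
        U.append(beta / np.linalg.norm(beta, axis=0))       # column-wise normalization
    return betas
```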
12. Datasets I
To sum up:
The datasets were extracted from the UCI Machine Learning repository [6] and from mldata.org.
10 datasets, 5 of them with ≥ 1000 instances.
Experimental design using 10-fold cross-validation, with a nested 5-fold cross-validation for the hyperparameters (see the sketch after this list).
Nominal variables were transformed into binary variables.
Features were standardized.
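The nested cross-validation scheme might look as follows, using scikit-learn's KFold. Here fit, score and grid are placeholder callables and a placeholder hyperparameter grid, not code from the study:

```python
import numpy as np
from sklearn.model_selection import KFold

def nested_cv(X, y, fit, score, grid, outer_k=10, inner_k=5, seed=0):
    """10-fold outer CV; an inner 5-fold CV picks the hyperparameter per fold.

    fit(X, y, c) returns a trained model; score(model, X, y) returns accuracy.
    """
    outer = KFold(n_splits=outer_k, shuffle=True, random_state=seed)
    scores = []
    for tr, te in outer.split(X):
        inner = KFold(n_splits=inner_k, shuffle=True, random_state=seed)

        def inner_score(c):
            # Mean inner-CV score of hyperparameter value c on the outer-train split
            return np.mean([score(fit(X[tr][i], y[tr][i], c), X[tr][v], y[tr][v])
                            for i, v in inner.split(X[tr])])

        best = max(grid, key=inner_score)
        scores.append(score(fit(X[tr], y[tr], best), X[te], y[te]))
    return np.mean(scores)
```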
13. Datasets II
Table: Characteristics of the data sets, ordered by size and number of classes

Dataset          Size  #Attr.  #Classes  Class distribution
car              1728  21      4         (1210, 384, 69, 65)
winequality-red  1599  11      6         (10, 53, 681, 638, 199, 18)
ERA              1000  4       9         (92, 142, 181, 172, 158, 118, 88, 31, 18)
LEV              1000  4       5         (93, 280, 403, 197, 27)
SWD              1000  10      4         (32, 352, 399, 217)
newthyroid       215   5       3         (30, 150, 35)
automobile       205   71      6         (3, 22, 67, 54, 32, 27)
squash-stored    52    51      3         (23, 21, 8)
squash-unstored  52    52      3         (24, 24, 4)
pasture          36    25      3         (12, 12, 12)
14. Metrics I
Acc, accuracy rate: the number of successful hits relative to the total number of classifications,
$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} I(\tilde{y}(\mathbf{x}_i) = y_i). \qquad (11)$$
Div, diversity: for an ELM ensemble with $s$ individuals, this metric is obtained by applying Eq. (7) to the $\boldsymbol{\beta}^{(1)}, \dots, \boldsymbol{\beta}^{(s)}$ matrices:
$$d(\boldsymbol{\beta}^{(1)}, \dots, \boldsymbol{\beta}^{(s)}) = \frac{1}{J \binom{s}{2}} \sum_{k<l} \sum_{j=1}^{J} d(\boldsymbol{\beta}^{(k)}_j, \boldsymbol{\beta}^{(l)}_j). \qquad (12)$$
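A compact sketch of Eq. (12), averaging the pairwise metric of Eq. (7) over all member pairs and all J columns (our code, for illustration):

```python
import numpy as np
from itertools import combinations

def ensemble_diversity(betas):
    """Eq. (12): average pairwise column diversity (Eq. 7) over all member pairs."""
    def div(u, v):
        return 1.0 - np.dot(u, v) ** 2 / (np.dot(u, u) * np.dot(v, v))
    J = betas[0].shape[1]
    pairs = list(combinations(range(len(betas)), 2))   # the (s choose 2) pairs k < l
    total = sum(div(betas[k][:, j], betas[l][:, j])
                for k, l in pairs for j in range(J))
    return total / (J * len(pairs))
```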
15. Algorithms
Three boosting ensembles were compared against our proposal:
AELM: AdaBoost Extreme Learning Machine, defined in [7].
BRELM: Boosting Ridge Extreme Learning Machine, explained in [8].
NCELM: Negative Correlation Extreme Learning Machine, defined in [9].
DELM: Diverse Extreme Learning Machine, our proposal.
16. Results I
Accuracy (Acc)

Dataset          DELM      AELM      BRELM     NCELM
car              0.929711  0.834618  0.901805  0.905111
winequality-red  0.853687  0.840085  0.839670  0.837363
ERA              0.829479  0.822201  0.828019  0.828428
LEV              0.836345  0.786404  0.792371  0.798220
SWD              0.787940  0.764487  0.759893  0.760442
newthyroid       0.932035  0.817172  0.812035  0.819509
automobile       0.867376  0.834618  0.841636  0.846499
squash-stored    0.694286  0.751429  0.694063  0.711937
squash-unstored  0.814286  0.830952  0.813810  0.812381
pasture          0.833333  0.766667  0.811111  0.826667
20. Conclusions
Diversity is introduced explicitly while improving accuracy.
Improved diversity of solutions / classifiers.
Sensitivity analysis (this could help to establish a ranking of good solutions).
21. Future work
Reduce computational time:
Remove the hyperparameter D.
Reduce the number of matrix inversions.
Extend the experiments (more datasets and algorithms).
Study the performance on unbalanced data.
22. References I
[1] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, no. 1, pp. 55-67, 1970.
[2] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[3] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513-529, 2012.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
23. References II
[5] Y. Freund and R. E. Schapire, "A short introduction to boosting," Journal of Japanese Society for Artificial Intelligence, vol. 14, no. 5, pp. 771-780, 1999.
[6] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017.
[7] A. Riccardi, F. Fernández-Navarro, and S. Carloni, "Cost-sensitive AdaBoost algorithm for ordinal regression based on extreme learning machine," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1898-1909, 2014.
[8] Y. Ran, X. Sun, H. Sun, L. Sun, and X. Wang, "Boosting ridge extreme learning machine," Proceedings - 2012 IEEE Symposium on Robotics and Applications (ISRA 2012), pp. 881-884, 2012.
24. References III
[9] S. Wang, H. Chen, and X. Yao, "Negative correlation learning for classification ensembles," Proceedings of the International Joint Conference on Neural Networks, 2010.