The document summarizes research on improving the training of multilayer perceptron (MLP) neural networks. It proposes using multiple optimal learning factors (MOLF) during training, which is shown to be equivalent to optimally transforming the net function vector in the MLP. For large networks, the MOLF Hessian matrix can become large, so the paper develops a method to compress the matrix into a smaller, well-conditioned form. Simulation results show the proposed algorithm performs almost as well as Levenberg-Marquardt but with the computational complexity of a first-order method.
A comparison-of-first-and-second-order-training-algorithms-for-artificial-neu...Cemal Ardil
This document compares first and second order training algorithms for artificial neural networks. It summarizes that feedforward network training is a special case of functional minimization where no explicit model of the data is assumed. Gradient descent, conjugate gradient, and quasi-Newton methods are discussed as first and second order training methods. Conjugate gradient and quasi-Newton methods are shown to outperform gradient descent methods experimentally using share rate data. The backpropagation algorithm and its variations are described for finding the gradient of the error function with respect to the network weights. Conjugate gradient techniques are discussed as a way to find the search direction without explicitly computing the Hessian matrix.
The document summarizes a presentation on minimizing tensor estimation error using alternating minimization. It begins with an introduction to tensor decompositions including CP, Tucker, and tensor train decompositions. It then discusses nonparametric tensor estimation using an alternating minimization method. The method iteratively updates components while holding other components fixed, achieving efficient computation. The analysis shows that after t iterations, the estimation error is bounded by the sum of a statistical error term and an optimization error term decaying exponentially in t. Real data analysis uses the method for multitask learning.
The document summarizes radial basis function (RBF) networks. Key points:
- RBF networks use radial basis functions as activation functions and can universally approximate continuous functions.
- They are local approximators compared to multilayer perceptrons which are global approximators.
- Learning involves determining the centers, widths, and weights. Centers can be randomly selected or via clustering. Widths are usually different for each basis function. Weights are typically learned via least squares or gradient descent methods.
The document discusses Radial Basis Function (RBF) networks. It describes the architecture of an RBF network which has three layers - an input layer, a hidden layer of radial basis functions, and a linear output layer. It also discusses types of radial basis functions like Gaussian, training algorithms for determining hidden unit centers and radii, and provides an example of how an RBF network can learn the XOR problem.
This document summarizes a study on pattern recognition and learning in networks of coupled bistable units. The network is composed of N oscillators moving in a double-well potential, with pair-wise interactions between all elements. Two methods are used for training the network: (1) constructing the coupling matrix using Hebb's rule based on stored patterns, and (2) iteratively updating the matrix to minimize error between applied and desired patterns. Graphs show the learning rate converges as mean squared error and coupling strengths decrease over iterations.
This document summarizes an academic paper that proposes modifying well-known local linear models for system identification by replacing their original recursive learning rules with outlier-robust variants based on M-estimation. It describes three existing local linear models - local linear map (LLM), radial basis function network (RBFN), and local model network (LMN) - and then introduces the concept of M-estimation as a way to make the learning rules of these models more robust to outliers. The performance of the proposed outlier-robust variants is evaluated on three benchmark datasets and is found to provide considerable improvement in the presence of outliers compared to the original models.
Training and Inference for Deep Gaussian ProcessesKeyon Vafa
The document discusses training and inference for deep Gaussian processes (DGPs). It introduces the Deep Gaussian Process Sampling (DGPS) algorithm for learning DGPs. The DGPS algorithm relies on Monte Carlo sampling to circumvent the intractability of exact inference in DGPs. It is described as being more straightforward than existing DGP methods and able to more easily adapt to using arbitrary kernels. The document provides background on Gaussian processes and motivation for using deep Gaussian processes before describing the DGPS algorithm in more detail.
A scalable collaborative filtering framework based on co clusteringAllenWu
This document proposes a scalable collaborative filtering framework based on co-clustering. It introduces collaborative filtering and discusses limitations of existing methods. The framework uses co-clustering to simultaneously obtain user and item neighborhoods and generate predictions based on average ratings. Experimental results show the approach provides high quality predictions with lower computational cost than other methods.
A comparison of efficient algorithms for scheduling parallel data redistributionIJCNCJournal
This document compares algorithms for scheduling parallel data redistribution. It proposes a new algorithm called the Split Graph Algorithm (SGA) that splits the data into two clusters based on size and schedules large and small data differently. SGA was tested on 1000 test cases and found to produce schedules closer to the theoretical lower bound and faster than two existing algorithms, demonstrating its efficiency.
This document discusses kernel methods and radial basis function (RBF) networks. It begins with an introduction and overview of Cover's theory of separability of patterns. It then revisits the XOR problem and shows how it can be solved using Gaussian hidden functions. The interpolation problem is explained and how RBF networks can perform strict interpolation through a set of training data points. Radial basis functions that satisfy Micchelli's theorem allowing for a nonsingular interpolation matrix are presented. Finally, the structure and training of RBF networks using k-means clustering and recursive least squares estimation is covered.
The document summarizes a study on fractal image compression of satellite images using range and domain techniques. It discusses fractal image compression methods, including partitioning images into range and domain blocks. Affine transformations are applied to domain blocks to match range blocks. Peak signal-to-noise ratio (PSNR) values are calculated for reconstructed rural and urban satellite images after 4 iterations, showing PSNR of around 17.0 for rural images and 22.0 for urban images. The proposed algorithm partitions the original image into non-overlapping range blocks and selects domain blocks twice the size of range blocks.
Incremental collaborative filtering via evolutionary co clusteringAllen Wu
This document summarizes an incremental collaborative filtering approach via evolutionary co-clustering. It introduces incremental collaborative filtering and discusses existing approaches. It then proposes an incremental evolutionary co-clustering method that assigns new users and items to clusters during the online phase to make more accurate predictions. The method uses an ensemble of co-clustering solutions and an evolutionary algorithm to improve performance. Experimental results on a movie rating dataset show the proposed approach achieves better accuracy than other incremental collaborative filtering methods.
This document discusses regularization techniques for neural networks to prevent overfitting. It describes how the number of hidden units controls complexity and can lead to underfitting or overfitting. Early stopping is introduced as an alternative to regularization where training is stopped when validation error starts to increase. Consistency of regularization terms under linear transformations of data is discussed. Different approaches for building invariances like data augmentation, tangent propagation, and convolutional network structures are outlined.
Restricted Boltzman Machine (RBM) presentation of fundamental theorySeongwon Hwang
The document discusses restricted Boltzmann machines (RBMs), a type of neural network that can learn probability distributions over its input data. It explains that RBMs define an energy function over hidden and visible units, with no connections between units within the same group. This conditional independence allows efficient computation of conditional probabilities. RBMs are trained using maximum likelihood, minimizing the negative log-likelihood of the training data by gradient descent.
A general multiobjective clustering approach based on multiple distance measuresMehran Mesbahzadeh
The document describes a general multiobjective clustering approach based on multiple distance measures. It proposes two clustering methods, MOECDM and MOEACDM, that use multiple distance matrices and multiobjective evolutionary algorithms to partition datasets into clusters in different distance spaces simultaneously. The key steps include: (1) generating initial populations using different distance measures and pre-clustering, (2) designing objective functions to minimize distances within clusters in different spaces, and (3) applying evolutionary operators like crossover and mutation to optimize the objective functions.
The document discusses several approaches for learning molecular representations using deep learning networks, including:
1. Improved Neural Fingerprint (A-NFP), which addresses limitations in the original Neural Fingerprint approach by differentiating between bond types. A-NFP achieves better performance than NFP on toxicity, solubility, and drug efficacy prediction tasks.
2. Message Passing Neural Networks (MPNN), which provide a general framework for graph-structured data like molecules. MPNNs contain message passing and readout phases. CNNs for learning molecular fingerprints is a specific case of MPNNs.
3. Mol2vec, which learns molecular representations without labels using a skip-gram model similar to word2vec and
Density-based clustering finds clusters of arbitrary shape by looking for dense regions of points separated by low-density regions. It includes DBSCAN, which defines clusters based on core points that have many nearby neighbors and border points near core points. DBSCAN has parameters for neighborhood size and minimum points. OPTICS is a density-based algorithm that computes an ordering of all objects and their reachability distances without fixing these parameters.
Adaptive Channel Equalization using Multilayer Perceptron Neural Networks wit...IOSRJVSP
This document presents a neural network approach to channel equalization using a multilayer perceptron with a variable learning rate parameter. Specifically, it proposes modifying the backpropagation algorithm to allow the learning rate to adapt at each iteration in order to achieve faster convergence. The equalizer structure is a decision feedback equalizer modeled as a neural network with an input, hidden and output layer. Simulation results show the proposed variable learning rate approach improves bit error rate and convergence speed compared to a standard backpropagation algorithm.
- POSTECH EECE695J, "Fundamentals of Deep Learning and Its Application to Steel Manufacturing Processes", Week 5
- Contents: Restricted Boltzmann Machine (RBM), various activation functions, data preprocessing, regularization methods, training of a neural network
- Video: https://youtu.be/v4rGPl-8wdo
Tutorial on Markov Random Fields (MRFs) for Computer Vision ApplicationsAnmol Dwivedi
The goal of this mini-project is to implement a pairwise binary label-observation Markov Random Field
model for bi-level image segmentation. Specifically, two inference algorithms, i.e., the Iterative
Conditional Mode (ICM) and Gibbs sampling methods will be implemented to perform image segmentation.
Inference & Learning in Linear Chain Conditional Random Fields (CRFs)Anmol Dwivedi
This mini-project will consider performing inference and learning in Linear Chain CRFs. In particular, it will consider an application to hand-written word recognition. Handwritten word recognition is a task many have explored with different methods of machine learning. Some written characters can be evaluated individually or as a whole word to account for the context in characters. In this mini-project, we use linear chain CRF models to account for context between the characters of a word to improve word recognition accuracy.
This document describes a virtual point light (VPL) rendering technique that uses a Metropolis sampling algorithm to efficiently approximate global illumination in scenes. The technique converts indirect lighting computed by a path tracer into virtual point lights that are rendered in real-time using a GPU. This allows for interactive rendering speeds while still achieving results close to a conventional path tracer. The key aspects of the VPL technique and Metropolis sampling algorithm are discussed, and details of the implementation including VPL generation, shadow mapping, and a shader chain for flexible material modeling are provided.
Classification of handwritten characters by their symmetry featuresAYUSH RAJ
The document presents a technique for classifying handwritten characters based on their symmetry features. The Generalized Symmetry Transform is applied to digits from the USPS dataset to extract symmetry magnitude and orientation maps. These features are used to train Probabilistic Neural Networks, which are then compared to a network trained on the original data. The symmetry-trained networks classify the training data perfectly but generalize more poorly than the original-data network, achieving 87.2% and 72.2% accuracy respectively, compared to 95.17% for the original-data network. While symmetry features can classify characters, the original data leads to better generalization performance.
This document summarizes the DBSCAN clustering algorithm. DBSCAN finds clusters based on density, requiring only two parameters: Eps, which defines the neighborhood distance, and MinPts, the minimum number of points required to form a cluster. It can discover clusters of arbitrary shape. The algorithm works by expanding clusters from core points, which have at least MinPts points within their Eps-neighborhood. Points that are not part of any cluster are classified as noise. Applications include spatial data analysis, image segmentation, and automatic border detection in medical images.
(DL輪読)Matching Networks for One Shot LearningMasahiro Suzuki
1. Matching Networks is a neural network architecture proposed by DeepMind for one-shot learning.
2. The network learns to classify novel examples by comparing them to a small support set of examples, using an attention mechanism to focus on the most relevant support examples.
3. The network is trained using a meta-learning approach, where it learns to learn from small support sets to classify novel examples from classes not seen during training.
1. Molecular dynamics (MD) simulations solve equations of motion to generate long time trajectories of molecular systems. However, simulations are limited by the small time step required for integration.
2. Parallelization using domain decomposition and MPI distribution allows MD simulations to be accelerated by distributing computational work across multiple processors. Key algorithms like the fast Fourier transform (FFT) used in particle mesh Ewald (PME) methods for long-range interactions must also be parallelized efficiently.
3. Two common parallelization schemes are the replicated data approach and domain decomposition. Domain decomposition distributes spatial domains across processors and has better parallel efficiency but is more complex to implement. Hybrid parallelization using MPI and OpenMP can further improve performance.
The document presents a new batch training algorithm called Multiple Optimal Learning Factors (MOLF) for training a multi-layer perceptron neural network. MOLF uses a two-stage approach per iteration: 1) Newton's method is used to find a vector of optimal learning factors, one for each hidden unit, which is used to update the input weights. 2) Linear equations are solved to update the output weights. The paper analyzes how linear dependencies among inputs and hidden units can impact training, and develops an improved version of MOLF that is not affected by these dependencies. In experiments, the improved MOLF performs better than first-order methods with minimal overhead, and almost as well as second-order methods.
MODIFIED LLL ALGORITHM WITH SHIFTED START COLUMN FOR COMPLEXITY REDUCTIONijwmn
Multiple-input multiple-output (MIMO) systems are playing an important role in recent wireless communication. The complexity of the different system models challenges researchers to find a good complexity-to-performance balance. Lattice reduction techniques and the Lenstra-Lenstra-Lovàsz (LLL) algorithm provide further avenues to investigate and can contribute to complexity reduction. In this paper, we look to modify the LLL algorithm to reduce the number of computation operations by exploiting the structure of the upper triangular matrix without significant performance degradation. Basically, the first columns of the upper triangular matrix contain many zeroes, so the algorithm performs several operations with very limited benefit. We present a performance and complexity study, and our proposal shows that we can gain in terms of complexity while the performance results remain almost the same.
Improving Performance of Back propagation Learning Algorithmijsrd.com
The standard back-propagation algorithm is one of the most widely used algorithms for training feed-forward neural networks. Its major drawbacks are that it can fall into local minima and that it has a slow convergence rate. Natural gradient descent, a principled method for nonlinear optimization, is presented and combined with the modified back-propagation algorithm, yielding a new fast training algorithm for multilayer networks. This paper describes a new approach to natural gradient learning in which the number of parameters necessary is much smaller than in the natural gradient algorithm. The new method exploits the algebraic structure of the parameter space to reduce the space and time complexity of the algorithm and improve its performance.
This document presents an implementation and analysis of parallelizing the training of a multilayer perceptron neural network. It describes distributing the calculations across processor nodes by assigning each processor responsibility for a fraction of nodes in each layer. Theoretical speedups of 1-16x are estimated and experimental speedups of 2-10x are observed for networks with over 60,000 nodes trained on up to 16 processors. Node parallelization provides near linear speedup for training multilayer perceptrons.
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKSIJCI JOURNAL
Sliding window sums are widely used for string indexing, hashing and time series analysis. We have
developed a family of the generic vectorized sliding sum algorithms that provide speedup of O(P/w) for
window size w and number of processors P. For a sum with a commutative operator the speedup is
improved to O(P/log(w)). Even more important, our algorithms exhibit efficient memory access patterns. In
this paper we study the application of sliding sum algorithms to the training and inference of Deep Neural
Networks. We demonstrate how both pooling and convolution primitives could be expressed as sliding
sums and evaluated by the compute kernels with a shared structure. We show that the sliding sum
convolution kernels are more efficient than the commonly used GEMM kernels on CPUs and could even
outperform their GPU counterparts.
Drobics, M. 2001: Data mining using synergies between self-organising maps and...ArchiLab 7
The document describes a three-stage approach to data mining that uses self-organizing maps, clustering, and fuzzy rule induction. In the first stage, a self-organizing map is used to reduce the data size while preserving topology. In the second stage, clustering identifies regions of interest. In the third stage, fuzzy rules are generated to describe the clusters. The approach was tested on image and real-world datasets and produced intuitive results.
An improved spfa algorithm for single source shortest path problem using forw...IJMIT JOURNAL
We present an improved SPFA algorithm for the single source shortest path problem. For a random graph,
the empirical average time complexity is O(|E|), where |E| is the number of edges of the input network.
SPFA maintains a queue of candidate vertices and adds a vertex to the queue only if that vertex is relaxed.
In the improved SPFA, MinPoP principle is employed to improve the quality of the queue. We theoretically
analyse the advantage of this new algorithm and experimentally demonstrate that the algorithm is efficient.
This document presents a new adaptive algorithm for an adaptive decision feedback equalizer (ADFE) that has lower computational complexity than existing algorithms. The proposed block-based normalized least mean square (BBNLMS) algorithm with set-membership filtering for the ADFE achieves similar bit error rate performance and convergence speed as conventional algorithms like set-membership normalized least mean square (SM-NLMS), but with significantly fewer computations. Simulation results show the new algorithm provides comparable equalization performance to SM-NLMS while realizing about a 70% reduction in computational operations, especially at high signal-to-noise ratios, making it suitable for high-speed decision feedback equalization applications.
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...IJDKP
Quantum clustering (QC) is a data clustering algorithm based on quantum mechanics which is
accomplished by substituting each point in a given dataset with a Gaussian. The width of the Gaussian is a
σ value, a hyper-parameter which can be manually defined and manipulated to suit the application.
Numerical methods are used to find all the minima of the quantum potential as they correspond to cluster
centers. Herein, we investigate the mathematical task of expressing and finding all the roots of the
exponential polynomial corresponding to the minima of a two-dimensional quantum potential. This is an
outstanding task because normally such expressions are impossible to solve analytically. However, we
prove that if the points are all included in a square region of size σ, there is only one minimum. This bound
is not only useful for limiting the number of solutions to search for by numerical means; it also allows us to propose a new
numerical approach "per block". This technique decreases the number of particles by approximating some
groups of particles by weighted particles. These findings are not only useful to the quantum clustering
problem but also for the exponential polynomials encountered in quantum chemistry, Solid-state Physics
and other applications.
Fuzzy clustering algorithms cannot obtain a good clustering effect when the sample characteristics are not obvious, and they require the number of clusters to be determined in advance. For this reason, this paper proposes an adaptive fuzzy kernel clustering algorithm. The algorithm first uses an adaptive clustering-number function to calculate the optimal number of clusters, then maps the samples of the input space to a high-dimensional feature space using a Gaussian kernel and performs clustering in the feature space. Matlab simulation results confirm that the algorithm's performance is greatly improved over classical clustering algorithms, with faster convergence and more accurate clustering results.
MLPfit is a tool for designing and training multi-layer perceptrons (MLPs) for tasks like function approximation and classification. It implements stochastic minimization as well as more powerful methods like conjugate gradients and BFGS. MLPfit is designed to be simple, precise, fast and easy to use for both standalone and integrated applications. Documentation and source code are available online.
This document summarizes a directed research report on using singular value decomposition (SVD) to reconstruct images with missing pixel values. It describes how images can be represented as matrices and SVD is commonly used for matrix completion problems. The report explores using an alternating least squares (ALS) algorithm based on SVD to fill in missing pixel values by finding feature matrices that approximate the rank k reconstruction of an image matrix. The ALS algorithm works by alternating between optimizing one feature matrix while holding the other fixed, minimizing the reconstruction error between the known pixel values and predicted values from multiplying the feature matrices.
DSP_FOEHU - MATLAB 01 - Discrete Time Signals and SystemsAmr E. Mohamed
This document provides an overview of discrete-time signals and systems in MATLAB. It defines discrete signals as sequences represented by x(n) and shows how they can be implemented as vectors in MATLAB. It describes various types of sequences such as unit sample, unit step, exponential, sinusoidal, and random sequences. It also covers operations on sequences like addition, multiplication, scaling, shifting, folding, and correlation. Discrete-time systems are defined as operators that transform an input sequence x(n) into an output sequence y(n). Key properties discussed are linearity, time-invariance, stability, and causality, along with the use of convolution to represent the output of a linear time-invariant system. Examples are provided to demonstrate the various concepts.
Black-box modeling of nonlinear system using evolutionary neural NARX modelIJECEIAES
Nonlinear systems with uncertainty and disturbance are very difficult to model using mathematical approaches. Therefore, a black-box modeling approach without any prior knowledge is necessary. Several modeling approaches have been used to develop black-box models, such as fuzzy logic, neural networks, and evolutionary algorithms. In this paper, an evolutionary neural network, formed by combining a neural network with a modified differential evolution algorithm, is applied to model a nonlinear system. The feasibility and effectiveness of the proposed modeling are tested on a piezoelectric actuator SISO system and an experimental quadruple-tank MIMO system.
This document presents results from a lattice QCD calculation of the proton isovector scalar charge (gs) at two light quark masses. The calculation uses domain-wall fermions and Iwasaki gauge actions on a 323x64 lattice with a spacing of 0.144 fm. Ratios of three-point to two-point correlation functions are formed and fit to a plateau to extract gs. Values of gs are obtained for quark masses of 0.0042 and 0.001, and all-mode averaging is used for the lighter mass. Chiral perturbation theory will be used to extrapolate gs to the physical quark mass. Preliminary results for gs at the unphysical quark masses are reported in lattice units.
Low Power Adaptive FIR Filter Based on Distributed ArithmeticIJERA Editor
This paper aims at the implementation of a low-power adaptive FIR filter based on distributed arithmetic (DA) with low power, high throughput, and low area. The Least Mean Square (LMS) algorithm is used to update the weights and decrease the mean square error between the current filter output and the desired response. The pipelined distributed arithmetic table reduces switching activity and hence reduces power. Power consumption is further reduced by keeping the bit-clock used in carry-save accumulation much faster than the clock for the rest of the operations. We implemented the design in Quartus II and found reductions in the total power and the core dynamic power of 31.31% and 100.24% respectively when compared with the architecture without the DA table.
Fig. 1. A fully connected Multilayer Perceptron

The bypass weights Woi relieve the hidden units of the burden of approximating the linear part of the desired mapping, which reduces the necessary number of hidden units. The vector of hidden layer net functions np and the actual output of the network yp can be written as

n_p = W \cdot x_p, \qquad y_p = W_{oi} \cdot x_p + W_{oh} \cdot o_p    (1)
where the kth element of the hidden unit activation vector op is calculated as op(k) = f(np(k)), and f(·) denotes the hidden layer activation function. Training an MLP typically involves minimizing the mean squared error between the desired and actual network outputs, defined as

E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [\, t_p(i) - y_p(i) \,]^2 = \frac{1}{N_v} \sum_{p=1}^{N_v} E_p    (2)

E_p = \sum_{i=1}^{M} [\, t_p(i) - y_p(i) \,]^2

where Ep is the cumulative squared error for pattern p.
B. Discussion of Back-Propagation
BP training [13, 22, 23, 24] is a gradient-based learning algorithm which involves changing system parameters so that the output error is reduced. For an MLP, the expression for the actual output found from the input is given by equation (1). The weights of the MLP are changed by BP so that the error E of equation (2), which computes the sum of the squared errors between the actual and desired outputs, decreases. The weights are updated using gradients of the error with respect to the corresponding weights.

The required negative error gradients are calculated using the chain rule, utilizing delta functions, which are negative gradients of Ep with respect to the net functions. For the pth pattern, the output and hidden layer delta functions [13][21] are respectively found as

\delta_{po}(i) = 2\,[\, t_p(i) - y_p(i) \,]    (3)

\delta_p(k) = f'(n_p(k)) \sum_{i=1}^{M} \delta_{po}(i)\, w_{oh}(i,k)
Now, the negative gradient of E with respect to w(k,n) is

g(k,n) = \frac{-\partial E}{\partial w(k,n)} = \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_p(k)\, x_p(n)    (4)

The matrix of negative partial derivatives can therefore be written as

G = \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_p \cdot x_p^T    (5)

where \delta_p = [\delta_p(1), \delta_p(2), \ldots, \delta_p(N_h)]^T.
If steepest descent is used to modify the hidden weights, W is updated in a given iteration as

W \leftarrow W + \Delta W, \qquad \Delta W = z \cdot G    (6)

where z is the learning factor, which can be obtained optimally from the expression

z = \frac{-\partial E / \partial z}{\partial^2 E / \partial z^2}    (7)
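To ground the notation, the following is a minimal NumPy sketch of equations (1) and (3)-(7), assuming a single-hidden-layer MLP with sigmoid activations; the array shapes, the random toy data, and the fixed learning factor z are illustrative assumptions rather than values from the paper.

```python
import numpy as np

# Toy dimensions (assumed): Nv patterns, N inputs, Nh hidden units, M outputs.
Nv, N, Nh, M = 100, 4, 8, 2
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])  # x_p with bias, shape (Nv, N+1)
T = rng.normal(size=(Nv, M))                                  # desired outputs t_p
W   = 0.1 * rng.normal(size=(Nh, N + 1))                      # input weights
Woi = 0.1 * rng.normal(size=(M, N + 1))                       # bypass weights
Woh = 0.1 * rng.normal(size=(M, Nh))                          # output weights

f  = lambda n: 1.0 / (1.0 + np.exp(-n))    # hidden layer activation
df = lambda n: f(n) * (1.0 - f(n))         # its derivative

Np = X @ W.T                   # net functions n_p, eq. (1)
Op = f(Np)                     # hidden activations o_p
Y  = X @ Woi.T + Op @ Woh.T    # network outputs y_p, eq. (1)

delta_o = 2.0 * (T - Y)              # output deltas, eq. (3)
delta_h = df(Np) * (delta_o @ Woh)   # hidden deltas, eq. (3)
G = (delta_h.T @ X) / Nv             # negative gradient matrix, eq. (5)

z = 0.01                 # a fixed learning factor; eq. (7) would instead pick z optimally
W = W + z * G            # steepest-descent update of the input weights, eq. (6)
```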
III. REVIEW OF MOLF ALGORITHM
The MOLF algorithm makes use of the BP procedure to update the input layer weights W, where a different learning factor zk is used to update the weights feeding into the kth hidden unit. The output weights Woh and bypass weights Woi are found using output weight optimization (OWO).

Output weight optimization (OWO) is a technique for finding the weights connected to the outputs of the network. Since the outputs have linear activations, finding the weights connected to the outputs is equivalent to solving a system of linear equations. The expression for the outputs given in (1) can be re-written as

y_p = W_o \cdot x_p    (8)

where the augmented input vector x_p = [x_p^T : o_p^T]^T has length Nu, with Nu = N + Nh + 1, and Wo is formed as [Woi : Woh] and has size M by Nu. The output weights can be solved for by setting \partial E / \partial W_o = 0, which leads to the set of linear equations

C_a = R_a \cdot W_o^T    (9)

where

C_a = \frac{1}{N_v} \sum_{p=1}^{N_v} x_p \cdot t_p^T, \qquad R_a = \frac{1}{N_v} \sum_{p=1}^{N_v} x_p \cdot x_p^T

Equation (9) is most easily solved using orthogonal least squares (OLS) [22].
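As a concrete illustration of the OWO step in equations (8)-(9), here is a minimal NumPy sketch; np.linalg.lstsq is used as a stand-in for the orthogonal least squares solver cited above, and the function name owo and its array shapes are assumptions made for this sketch.

```python
import numpy as np

def owo(Xa, T):
    """Output weight optimization, eqs. (8)-(9).

    Xa: augmented inputs [x_p : o_p], shape (Nv, Nu) with Nu = N + Nh + 1.
    T:  desired outputs, shape (Nv, M).
    Returns Wo = [Woi : Woh], shape (M, Nu).
    """
    Nv = Xa.shape[0]
    Ra = (Xa.T @ Xa) / Nv            # autocorrelation matrix, eq. (9)
    Ca = (Xa.T @ T) / Nv             # cross-correlation matrix, eq. (9)
    Wo_T, *_ = np.linalg.lstsq(Ra, Ca, rcond=None)   # solve Ra · Wo^T = Ca
    return Wo_T.T
```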
The input weight connecting the nth input to the kth hidden unit is updated using

w(k,n) \leftarrow w(k,n) + z_k \cdot g(k,n)    (10)

where zk denotes the learning factor corresponding to the kth hidden unit.
The vector z containing the different learning factors zk can be found using OLS from the relation

H_{molf} \cdot z = g_{molf}    (11)

g_{molf}(j) = \frac{-\partial E}{\partial z_j}, \qquad h_{molf}(k,j) = \frac{\partial^2 E}{\partial z_k \partial z_j}
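The following minimal NumPy sketch shows how the MOLF update of equations (10)-(11) could be applied once the gradient and Hessian have been accumulated; molf_update is a hypothetical helper name, and np.linalg.lstsq again stands in for the OLS solve.

```python
import numpy as np

def molf_update(W, G, H_molf, g_molf):
    """One MOLF input-weight update, eqs. (10)-(11).

    W, G:    input weights and negative gradients, shape (Nh, N+1).
    H_molf:  Hessian with respect to the learning factors, shape (Nh, Nh).
    g_molf:  negative gradient with respect to the learning factors, shape (Nh,).
    """
    z, *_ = np.linalg.lstsq(H_molf, g_molf, rcond=None)  # solve H_molf · z = g_molf, eq. (11)
    return W + z[:, None] * G                            # row k of G scaled by z_k, eq. (10)
```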
IV. ANALYSIS OF THE MOLF ALGORITHM
In this section, the structure of the MLP is analyzed in detail using equivalent networks. The relationship between MOLF and linear transformations of the net function vector is investigated.

A. Discussion of Equivalent Networks

Let MLP 1 have a net function vector np and output vector yp after back-propagation training, as discussed previously in section II-B. In a second network, MLP 2, trained similarly, the net function vector and the output vector are respectively denoted by n'p and y'p. The bypass and output weight matrices, along with the hidden unit activation functions, are considered to be equal for MLP 1 and MLP 2. Based on this information, the output vectors for MLP 1 and MLP 2 are respectively

y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f(n_p(k))    (12)
y'_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f(n'_p(k))    (13)

In order to make MLP 2 strongly equivalent to MLP 1, their respective output vectors yp and y'p have to be equal. Based on equations (12) and (13), this can be achieved only when their respective net function vectors are equal [23]. MLP 2 can be made strongly equivalent to MLP 1 by linearly transforming its net function vector n'p before passing it as an argument to the activation function f(·), i.e.,

n_p = C \cdot n'_p    (14)

where C is a linear transformation matrix of size Nh by Nh. On applying the linear transformation to the net function vector in equation (13) we get

y'_p(i) = y_p(i) = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_p(n) + \sum_{k=1}^{N_h} w_{oh}(i,k)\, f\!\left( \sum_{m=1}^{N_h} c(k,m)\, n'_p(m) \right)    (15)
After making MLP 2 strongly equivalent to MLP 1, the net function vector np can be related to the input weights W and the net function vector n'p as

n_p(k) = \sum_{n=1}^{N+1} w(k,n)\, x_p(n) = \sum_{m=1}^{N_h} c(k,m)\, n'_p(m)    (16)

Similarly, the MLP 2 net function vector can be related to its weights W' as

n'_p(k) = \sum_{n=1}^{N+1} w'(k,n)\, x_p(n)    (17)

On substituting (17) into (16) we get

n_p(k) = \sum_{m=1}^{N_h} c(k,m) \sum_{n=1}^{N+1} w'(m,n)\, x_p(n)    (18)

It is observed above that the net function of MLP 1 is found by a linear transformation of the input weight matrix W' of MLP 2, given by

W = C \cdot W'    (19)

If the elements of the C matrix are found through an optimality criterion, then optimal input weights W' can be computed from the input weight matrix W of MLP 1.
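The following is a small numerical check of the strong-equivalence argument in equations (14)-(19), with toy shapes assumed for this sketch: transforming MLP 2's net functions by C is the same as giving MLP 1 the input weights W = C · W', so the two networks produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
Nv, N, Nh, M = 5, 3, 4, 2
X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])   # inputs with bias
W_prime = rng.normal(size=(Nh, N + 1))                         # MLP 2 input weights W'
C = rng.normal(size=(Nh, Nh))                                  # Nh x Nh transformation, eq. (14)
Woi = rng.normal(size=(M, N + 1))                              # shared bypass weights
Woh = rng.normal(size=(M, Nh))                                 # shared output weights
f = np.tanh                                                    # any common activation

# MLP 2 output with transformed net functions, eq. (15): f(C · n'_p)
y_mlp2 = X @ Woi.T + f((X @ W_prime.T) @ C.T) @ Woh.T
# MLP 1 output with input weights W = C · W', eq. (19)
y_mlp1 = X @ Woi.T + f(X @ (C @ W_prime).T) @ Woh.T

assert np.allclose(y_mlp1, y_mlp2)   # the two networks are strongly equivalent
```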
B. Optimal transformation of net function
The discussion of equivalent networks in section IV-A suggests that an optimal set of net functions can be obtained by a linear transformation of the input weights W. In this section, the input weight update equations are derived for OWO-BP training, using the linear transformation matrix C from MLP 2.

MLP 1 and MLP 2 are strongly equivalent as before. Then the output of the two networks after the transformation is given as in equation (15). The elements of the negative gradient matrix G' of MLP 2, found from equations (2) and (15), are defined as

g'(u,v) = \frac{-\partial E}{\partial w'(u,v)} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [\, t_p(i) - y_p(i) \,] \cdot \frac{\partial y_p(i)}{\partial w'(u,v)}    (20)

\frac{\partial y_p(i)}{\partial w'(u,v)} = \sum_{k=1}^{N_h} w_{oh}(i,k)\, o'_p(k)\, c(k,u)\, x_p(v)

Rearranging terms in (20) results in

g'(u,v) = \sum_{k=1}^{N_h} c(k,u) \left\{ \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [\, t_p(i) - y_p(i) \,]\, w_{oh}(i,k)\, o'_p(k)\, x_p(v) \right\} = \sum_{k=1}^{N_h} c(k,u)\, \frac{-\partial E}{\partial w(k,v)}    (21)

which is abbreviated as

G' = C^T \cdot G    (22)

The input weight update equation for MLP 2, based on its gradient matrix G', is given by

W' = W' + z \cdot G'    (23)

On pre-multiplying this equation by C and using equations (19) and (22) we obtain

W = W + z \cdot G''    (24)

where

G'' = R \cdot G, \qquad R = C \cdot C^T
Equation (24) is simply the weight update equation of MLP 1 which would result in the network being strongly equivalent to MLP 2. Thus, using equation (24), the input weights of MLP 1 can be updated so that its net functions are optimal, as in MLP 2.

Lemma 1: For a given R matrix, there is an uncountably infinite number of C matrices.

The transformed gradient G'' of MLP 1 is given in terms of the original gradient G as

g''(u,v) = \sum_{k=1}^{N_h} r(u,k)\, g(k,v)    (25)

g(k,v) = \frac{-\partial E}{\partial w(k,v)} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} [\, t_p(i) - y_p(i) \,]\, w_{oh}(i,k)\, o'_p(k)\, x_p(v)    (26)

Equations (24) and (25) suggest that MLP 1 could be trained with optimal net functions using only knowledge of the linear transformation matrix R.
C. Multiple optimal learning factors
Equations (24) and (25) give a method for optimally transforming the net functions of an MLP by using the transformation matrix R. In this subsection, the MOLF approach is derived from equations (24) and (25). Let the R matrix in equation (24) be diagonal. In this case equation (24) becomes

w(k,n) = w(k,n) + z \cdot r(k)\, g(k,n)    (27)

where r(k) denotes the kth diagonal element of R. On comparing equations (10) and (27), the expression for the optimal learning factors zk can be given as

z_k = z \cdot r(k)    (28)

Equation (28) suggests that using the MOLF algorithm for training an MLP is equivalent to optimally transforming the net function vector using a diagonal transformation matrix.
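A minimal check of the claim in equations (27)-(28), using assumed toy arrays: when R is diagonal, the transformed update W + z·(R·G) scales row k of G by z·r(k), which is exactly the MOLF update with per-unit learning factors z_k = z·r(k).

```python
import numpy as np

rng = np.random.default_rng(2)
Nh, Nin = 4, 6
W = rng.normal(size=(Nh, Nin))
G = rng.normal(size=(Nh, Nin))                 # negative gradient matrix
r = rng.uniform(0.5, 1.5, size=Nh)             # diagonal elements r(k) of R
z = 0.1                                        # single optimal learning factor

W_transformed = W + z * (np.diag(r) @ G)       # eq. (24) with a diagonal R
W_molf        = W + (z * r)[:, None] * G       # eq. (27): z_k = z * r(k)

assert np.allclose(W_transformed, W_molf)
```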
V. EFFECTS OF COLLAPSING THE MOLF HESSIAN
This section presents a computational analysis of the
existing MOLF algorithm followed by a proposed method to
collapse the MOLF Hessian matrix to create fewer optimal
learning factors.
A. Computational Analysis of the MOLF algorithm
In the existing MOLF algorithm, a significant amount of the computational burden is attributed to inverting the Hessian matrix to find the optimal learning factors, usually through the OLS procedure. The size of the Hessian matrix, which is determined by the number of hidden units in the network, directly affects the computational load of the MOLF algorithm.
The gradient and Hessian equations of the MOLF algorithm are given in equation (11). The computation of the multiple optimal learning factor vector z requires that equation (11) be solved. The number of multiplies required for computing z through OLS is,

M_ols-molf = Nh(Nh + 1)(Nh + 2)    (29)
Equation (29) shows that the number of multiplications
required increases cubically with the number of hidden units.
For networks having large values of Nh, this can be time consuming, since equation (11) must be solved in every iteration of the training algorithm. If some hidden units have linearly dependent outputs [20], the MOLF Hessian matrix Hmolf can be ill-conditioned, which degrades MLP training.
Thus having the number of optimal learning factors equal
to the number of hidden units can result in a high
computational load and also could lead to an ill-conditioned
MOLF Hessian matrix.
B. Training with several hidden units per OLF
In this section, we modify the MOLF approach so that
each OLF is assigned to one or more hidden units. The
Hessian for this new approach is compared to Hmolf.
Let NOLF be the number of optimal learning factors used
for training in the proposed variable optimal learning factors
(VOLF) method. NOLF is selected such that it divides Nh,
with no remainder. Each optimal learning factor zv then
applies to Nh/NOLF hidden units. The MLP output based on
these conditions is given by,
yp
m = woi
N+1
n=1
m,n xp n +
woh m,k f(
Nh
k=1
(w k,n +zvg(k,n)xp n )
N+1
n=1
) 30
v = ceil ( k · (NOLF/Nh) )
where the function ceil() rounds non-integer arguments up to
the next integer.
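For example, with Nh = 6 hidden units and NOLF = 3 (illustrative values), the mapping v = ceil(k · NOLF/Nh) assigns hidden units 1-2 to z_1, units 3-4 to z_2, and units 5-6 to z_3:

import math

Nh, NOLF = 6, 3
groups = [math.ceil(k * NOLF / Nh) for k in range(1, Nh + 1)]
print(groups)   # [1, 1, 2, 2, 3, 3]: two hidden units share each learning factor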
The negative gradient of the error in equation (2) with respect to each of the optimal learning factors z_v, denoted as g_volf, is computed based on equation (30) as,

g_{volf}(j) = -\frac{\partial E}{\partial z_j} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{m=1}^{M} \left[ t'_p(m) - \sum_{k=1}^{N_h} w_{oh}(m,k)\, o_p(z_v,k) \right] \cdot \sum_{c=1}^{N_h/N_{OLF}} w_{oh}(m,k(j,c))\, o'_p(k(j,c))\, n'_p(k(j,c))    (31)

where,

t'_p(m) = t_p(m) - \sum_{n=1}^{N+1} w_{oi}(m,n)\, x_p(n),

k(j,c) = c + (j-1)\,\frac{N_h}{N_{OLF}},

o_p(z_v,k) = f\!\left( \sum_{n=1}^{N+1} \big(w(k,n) + z_v\, g(k,n)\big)\, x_p(n) \right)
The Hessian matrix Hvolf is derived from (2), (30) and (31) as,

h_{volf}(i,j) = \frac{\partial^2 E}{\partial z_i\, \partial z_j} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{m=1}^{M} \left[ \sum_{d=1}^{N_h/N_{OLF}} w_{oh}(m,k(i,d))\, o'_p(k(i,d))\, n'_p(k(i,d)) \right] \cdot \left[ \sum_{c=1}^{N_h/N_{OLF}} w_{oh}(m,k(j,c))\, o'_p(k(j,c))\, n'_p(k(j,c)) \right]    (32)
The vector z of variable optimal learning factors is found from the negative gradient vector and Hessian matrix using the relation,
Hvolf · z = gvolf (33)
where Hvolf is NOLF by NOLF and gvolf is a column vector of
dimension NOLF. Thus by varying the number of optimal
learning factors required for training, the vector z is found
using the method discussed in section III.
The computation required for solving equation (33) can be
adjusted by choosing the number of optimal learning factors.
When NOLF equals one, the procedure is similar to using a single fixed optimal learning factor, as discussed in section II; it requires less computation but is also less effective. When NOLF equals Nh, the algorithm reduces to the MOLF algorithm described in section III. Thus, by varying the number of optimal learning factors between one and Nh, the algorithm interpolates between the OLF and MOLF cases.
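A minimal sketch of one VOLF input-weight update follows, assuming the Hessian Hvolf and negative gradient gvolf of equations (31) and (32) have already been accumulated; the stand-in arrays and the plain linear solve are illustrative only (the paper solves equation (33) by OLS).

import numpy as np

rng = np.random.default_rng(3)
Nh, N, NOLF = 6, 3, 3                       # NOLF must divide Nh
W = rng.normal(size=(Nh, N + 1))            # current input weights
G = rng.normal(size=(Nh, N + 1))            # negative input-weight gradient

# Stand-ins for H_volf (NOLF x NOLF) and g_volf (NOLF) from equations (31)-(32).
A = rng.normal(size=(NOLF, NOLF))
H_volf = A @ A.T + NOLF * np.eye(NOLF)      # symmetric positive definite stand-in
g_volf = rng.normal(size=NOLF)

# Equation (33): solve for the vector of variable optimal learning factors.
z = np.linalg.solve(H_volf, g_volf)

# Hidden unit k uses the factor of its group v = ceil(k * NOLF / Nh).
v = np.ceil(np.arange(1, Nh + 1) * NOLF / Nh).astype(int) - 1
W_new = W + z[v][:, None] * G               # grouped form of the update in (30)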
C. Collapsing the MOLF Hessian
Occasionally the MOLF approach can fail because of
distortion in Hmolf due to linearly dependent inputs [20] or
because it is ill-conditioned due to linearly dependent hidden
units [20]. In these cases we do not need to redo the current training iteration. Instead, we can collapse Hmolf and gmolf down to a smaller size and use the approach of the previous subsection. The elements of the original MOLF Hessian matrix Hmolf for an MLP with Nh hidden units are denoted as,
H_{molf} =
\begin{bmatrix}
h_{molf}(1,1) & \cdots & h_{molf}(1,N_h) \\
h_{molf}(2,1) & & \vdots \\
\vdots & \ddots & \\
h_{molf}(N_h,1) & \cdots & h_{molf}(N_h,N_h)
\end{bmatrix}    (34)
On collapsing the Hessian matrix Hmolf to a matrix Hmolf1 of size NOLF by NOLF we get,

H_{molf1} =
\begin{bmatrix}
h_{molf1}(1,1) & \cdots & h_{molf1}(1,N_{OLF}) \\
h_{molf1}(2,1) & & \vdots \\
\vdots & \ddots & \\
h_{molf1}(N_{OLF},1) & \cdots & h_{molf1}(N_{OLF},N_{OLF})
\end{bmatrix}    (35)
The relationship between the elements of Hmolf1 and Hmolf is
described by,
h_{molf1}(m,n) = \sum_{c=1}^{N_h/N_{OLF}} \sum_{d=1}^{N_h/N_{OLF}} h_{molf}\big(k(m,c),\, k(n,d)\big)    (36)
Equations (11) and (32) show that collapsing the MOLF
Hessian matrix to size NOLF by NOLF results in a VOLF
Hessian matrix of that size. Thus the Hmolf1 Hessian matrix
of equation (35) is equivalent to the VOLF Hessian
matrix Hvolf of equation (32).
On collapsing the negative gradient vector gmolf to a vector
gmolf1 having NOLF elements we get,
g_{molf1} = [\, g_{molf1}(1),\ g_{molf1}(2),\ \ldots,\ g_{molf1}(N_{OLF})\, ]^T    (37)

The relationship between gmolf1 and gmolf is described by,

g_{molf1}(m) = \sum_{c=1}^{N_h/N_{OLF}} g_{molf}\big(k(m,c)\big)    (38)
Equations (11) and (31) show that collapsing the MOLF
gradient vector to size NOLF by 1 results in the VOLF
gradient vector.
Therefore collapsing the MOLF Hessian matrix Hmolf and
gradient vector gmolf is equivalent to training the MLP with
the VOLF algorithm.
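The collapse in equations (36) and (38) amounts to summing Hmolf over (Nh/NOLF) × (Nh/NOLF) blocks and gmolf over groups of Nh/NOLF elements. A minimal NumPy sketch with illustrative data, assuming Hmolf and gmolf have already been computed:

import numpy as np

rng = np.random.default_rng(4)
Nh, NOLF = 6, 3
group = Nh // NOLF                      # hidden units per optimal learning factor

# Stand-ins for the MOLF Hessian (Nh x Nh) and negative gradient (Nh).
A = rng.normal(size=(Nh, Nh))
H_molf = A @ A.T
g_molf = rng.normal(size=Nh)

# Equation (36): sum each (group x group) block of H_molf.
H_molf1 = H_molf.reshape(NOLF, group, NOLF, group).sum(axis=(1, 3))

# Equation (38): sum the gradient over each group of hidden units.
g_molf1 = g_molf.reshape(NOLF, group).sum(axis=1)

# The collapsed system is the VOLF system of equation (33).
z = np.linalg.solve(H_molf1, g_molf1)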
VI. RESULTS
A. Computational Burden of Different Algorithms
In this section the computational burden is calculated for
different MLP training algorithms in terms of required
multiplies. All the algorithms were implemented in
Microsoft Visual Studio 2005.
Let Nu = N + Nh + 1 denote the number of weights
connected to each output. The total number of weights in the
network is denoted as Nw = M(N + Nh + 1) + Nh(N + 1). The number of multiplies required to solve for output weights using Orthogonal Least Squares [22] is given by,

M_ols = Nu(Nu + 1)(M + Nu/2 + 3/4)    (39)
The numbers of multiplies required per training iteration
using BP, OWO-BP and LM are respectively given by,
M_bp = Nv[ M·Nu + 2Nh(N + 1) + M(N + 6Nh + 4) ] + Nw    (40)

M_owo-bp = Nv[ 2Nh(N + 2) + M(Nu + 1) + Nu(N + 1)/2 + M(N + 6Nh + 4) ] + M_ols + Nh(N + 1)    (41)

M_lm = M_bp + Nv[ M·Nu(Nu + 3Nh(N + 1)) + 4Nh²(N + 1)² + Nw² ] + Nw³    (42)
Equation (29) gives the multiplies required for computing the optimal learning factors of the MOLF algorithm.
Similarly the number of multiplies required for the VOLF
algorithm is given as,
M_ols-volf = NOLF(NOLF + 1)(NOLF + 2)    (43)
The total number of multiplies required for the MOLF and
VOLF algorithms are respectively given as,
M_molf = M_owo-bp + Nv[ Nh(N + 4) - M(7N - Nh + 4) ] + Nh³ + M_ols-molf    (44)

M_volf = M_owo-bp + Nv[ Nh(N + 4) - M(7N - Nh + 4) ] + Nh³ + M_ols-volf    (45)
Equations (44) and (45) show that the number of multiplies required by the MOLF and VOLF algorithms consists of the OWO-BP multiplies, the multiplies needed to compute the Hessian matrix and negative gradient vector, and the multiplies needed to invert the Hessian. The matrices Hmolf and Hvolf are inverted in the MOLF and VOLF algorithms, respectively.
B. Experimental results
Here the performance of VOLF is compared with those of
MOLF, OWO-BP, LM and CG. In CG and LM, all weights
are varied in every iteration. In MOLF, VOLF and OWO-BP
we first solve linear equations for the output weights and
subsequently update the input weights.
The data sets used for the simulations are listed in Table I.
Training for all the data sets is done on inputs normalized to zero mean and unit variance.
The optimal number of hidden units to be used in the
MLP is determined by network pruning using the method of
[24]. Then the k-fold validation procedure is used to obtain
the average training and validation errors. In k-fold
validation, the data set is split into k non-overlapping parts
of equal size, and (k − 1) parts are used for training and the
remaining one part is used for validation. The procedure is repeated k times, picking a different part for use as the validation set every time. For the simulations, k is chosen as 10. The number of iterations used for OWO-BP, CG, MOLF and VOLF is chosen as 4000. For the LM algorithm 300 iterations are performed. For the VOLF algorithm NOLF = Nh/2.

Fig. 2. Twod data set: average error vs. number of iterations.
Fig. 3. Twod data set: average error vs. number of multiplies.
The average training error and the number of multiplies are calculated for every iteration on a particular data set using the
different training algorithms. These measurements are then
plotted to provide a graphical representation of the
efficiency and quality of the different training algorithms.
These plots for different datasets are shown below.
For the Twod.tra data file [25], the MLP is trained with 30
hidden units. In Fig. 2, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale). In Fig. 3, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). The LM algorithm shows the
lowest error of all the algorithms used for training, but its
computational burden may make it unsuitable for training
purposes. It is noted that the average error of the VOLF
algorithm lies between that of the OWO-BP and MOLF
algorithms for every iteration.
Fig. 4. Single2 data set: average error vs. number of iterations.
Fig. 5. Single2 data set: average error vs. number of multiplies.
TABLE I
Data set description
Data Set Name No. of Inputs No. of Outputs No. of Patterns
Twod 8 7 1768
Single2 16 3 10000
Power12 12 1 1414
Concrete 8 1 1030
Fig. 6. Power12 data set: average error vs. number of iterations.
Fig. 7. Power12 data set: average error vs. number of multiplies.
For the Single2.tra data file [25], the MLP is trained with
20 hidden units. In Fig. 4, the average mean square error
(MSE) for training from 10-fold validation is plotted versus
the number of iterations for each algorithm (shown on a
log10 scale). In Fig. 5, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). In the initial iterations,
the MOLF algorithm shows the least error. But overall, as
for the previous dataset, the LM algorithm shows the lowest
error of all the algorithms used for training. As before, the
average error of the VOLF algorithm lies between that of the
OWO-BP and MOLF algorithms.
For the Power12trn.tra data file [25], the MLP is trained
with 25 hidden units. In Fig. 6, the average mean square
error (MSE) for training from 10-fold validation is plotted
versus the number of iterations for each algorithm (shown on
a log10 scale). In Fig. 7, the average training MSE from 10-
fold validation is plotted versus the required number of
multiplies (shown on a log10 scale). For this dataset the
MOLF algorithm performs better than all the other algorithms. As seen previously, the computational requirements of the LM algorithm are very high. As before, the average error of the VOLF algorithm lies between that of the OWO-BP and MOLF algorithms for each iteration.

Fig. 8. Concrete data set: average error vs. number of iterations.
Fig. 9. Concrete data set: average error vs. number of multiplies.
For the Concrete data file [26], the MLP is trained with 15
hidden units. In Fig. 8, the average mean square error (MSE)
for training from 10-fold validation is plotted versus the
number of iterations for each algorithm (shown on a log10
scale).
In Fig. 9, the average training MSE from 10-fold
validation is plotted versus the required number of multiplies
(shown on a log10 scale). As before, the average error of the
VOLF algorithm lies between that of the OWO-BP and
MOLF algorithms for every iteration.
Table II compares the average training and validation
errors of the MOLF and VOLF algorithms with the other
algorithms for different data files. For each data set, the
average training and validation errors are found after 10-fold
validation.
From the plots and the table presented, it can be inferred that the error performance of the VOLF algorithm lies between those of the MOLF and OWO-BP algorithms. The VOLF and MOLF algorithms are also found to produce good results approaching those of the LM algorithm, with computational requirements only on the order of those of first order training algorithms.
VII. CONCLUSION
In this paper we have put the MOLF training algorithm on
a firm theoretical footing by showing its derivation from an
optimal linear transformation of the hidden layer's net
function vector. We have also developed the VOLF training
algorithm, which associates several hidden units with each
optimal learning factor. This modification of the MOLF
approach will allow us to train much larger networks
efficiently.
REFERENCES
[1] R. C. Odom, P. Pavlakos, S.S. Diocee, S. M. Bailey, D. M. Zander,
and J. J. Gillespie, “Shaly sand analysis using density-neutron
porosities from a cased-hole pulsed neutron system,” SPE Rocky
Mountain regional meeting proceedings: Society of Petroleum
Engineers, pp. 467-476, 1999.
[2] A. Khotanzad, M. H. Davis, A. Abaye, D. J. Maratukulam, “An
artificial neural network hourly temperature forecaster with
applications in load forecasting,” IEEE Transactions on Power
Systems, vol. 11, No. 2, pp. 870-876, May 1996.
[3] S. Marinai, M. Gori, G. Soda, “Artificial neural networks for
document analysis and recognition,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 27, No. 1, pp. 23-35, 2005.
[4] J. Kamruzzaman, R. A. Sarker, R. Begg, “Artificial Neural Networks:
Applications in Finance and Manufacturing,” Idea Group Inc (IGI),
2006.
[5] L. Wang and X. Fu, Data Mining With Computational Intelligence,
Springer-Verlag, 2005.
[6] K. Hornik, M. Stinchcombe, and H.White, “Multilayer feedforward
networks are universal approximators.” Neural Networks, Vol. 2, No.
5, 1989, pp.359-366.
[7] K. Hornik, M. Stinchcombe, and H. White, “Universal approximation
of an unknown mapping and its derivatives using multilayer
feedforward networks,” Neural Networks,vol. 3,1990, pp.551-560.
[8] C. Cortes and V. Vapnik. Support-vector networks. Machine
Learning, 20:273-297, November 1995.
[9] G. Cybenko, “Approximations by superposition of a sigmoidal
function,” Mathematics of Control, Signals, and Systems (MCSS),
vol. 2, pp. 303-314, 1989.
[10] D. W. Ruck et al., “The multi-layer perceptron as an approximation to
a Bayes optimal discriminant function,” IEEE Transactions on Neural
Networks, vol. 1, No. 4, 1990.
[11] Michael T.Manry, Steven J.Apollo, and Qiang Yu, “Minimum mean
square estimation and neural networks,” Neurocomputing, vol. 13,
September 1996, pp.59-74.
[12] Q. Yu, S.J. Apollo, and M.T. Manry, “MAP estimation and the multi-
layer perceptron,” Proceedings of the 1993 IEEE Workshop on
Neural Networks for Signal Processing, pp. 30-39, Linthicum Heights,
Maryland, Sept. 6-9, 1993.
[13] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning internal
representations by error propagation,” in D.E. Rumelhart and J.L.
McClelland (Eds.), Parallel Distributed Processing, vol. I, Cambridge,
Massachusetts: The MIT Press, 1986.
[14] Changhua Yu, Michael T. Manry, and Jiang Li, "Effects of
nonsingular pre-processing on feed-forward network training ".
International Journal of Pattern Recognition and Artificial Intelligence
, Vol. 19, No. 2 (2005) pp. 217-247.
[15] J.P. Fitch, S.K. Lehman, F.U. Dowla, S.Y. Lu, E.M. Johansson, and
D.M. Goodman, "Ship Wake-Detection Procedure Using Conjugate
Gradient Trained Artificial Neural Networks," IEEE Trans. on
Geoscience and Remote Sensing, Vol. 29, No. 5, September 1991, pp.
718-726.
[16] S. McLoone and G. Irwin, “A variable memory Quasi-Newton
training algorithm,” Neural Processing Letters, vol. 9, pp. 77-89,
1999.
[17] A. J. Shepherd, Second-Order Methods for Neural Networks, Springer-
Verlag New York, Inc., 1997.
[18] K. Levenberg, “A method for the solution of certain problems in least
squares,” Quart. Appl. Math., Vol. 2, pp. 164-168, 1944.
[19] D. Marquardt, “An algorithm for least-squares estimation of nonlinear
parameters,” SIAM J. Appl. Math., Vol. 11, pp. 431-441, 1963.
[20] S. S. Malalur, M. T. Manry, "Multiple optimal learning factors for
feed-forward networks," accepted by The SPIE Defense, Security and
Sensing (DSS) Conference, Orlando, FL, April 2010
[21] T. H. Kim, “Development and evaluation of multilayer perceptron training algorithms,” Ph.D. dissertation, The University of Texas at Arlington, 2001.
[22] W. Kaminski, P. Strumillo, “Kernel orthonormalization in radial basis
function neural networks,” IEEE Transactions on Neural Networks,
vol. 8, Issue 5, pp. 1177 - 1183, 1997
[23] R. P. Lippman, “An introduction to computing with Neural Nets,” IEEE ASSP Magazine, April 1987.
[24] Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang
Li, Francisco Maldonado, “An integrated growing-pruning method for
feedforward network training,” Neurocomputing vol. 71, pp. 2831–
2847, 2008.
[25] Univ. of Texas at Arlington, Training Data Files –
http://wwwee.uta.edu/eeweb/ip/training_data_files.html.
[26] Univ. of California, Irvine, Machine Learning Repository -
http://archive.ics.uci.edu/ml/
TABLE II
Average 10-fold training and validation error

Data Set   Error  OWO-BP   CG       VOLF     MOLF     LM
Twod       Etrn   0.22     0.20     0.18     0.17     0.16
           Eval   0.25     0.22     0.21     0.20     0.17
Single2    Etrn   0.83     0.63     0.51     0.16     0.03
           Eval   1.02     0.99     0.64     0.19     0.11
Power12    Etrn   6959.89  6471.98  5475.01  4177.45  4566.34
           Eval   7937.21  7132.13  6712.34  5031.23  5423.54
Concrete   Etrn   38.98    19.88    21.81    21.56    16.98
           Eval   65.12    56.12    37.23    37.21    33.12