The objective of this project was to classify a given set of events as either the tau-tau decay of a Higgs boson or as background noise. The project was completed as part of the Machine Learning module. We came up with an ensemble model using XGBoost and Random Forest classifiers to solve this problem.
Higgs Boson Machine Learning Challenge
Group Project CS4622

Team Members:
100112V Edirisinghe E.A.S.D.
100132G Fernando W.V.D.
100440A Ranasinghe R.H.T.D.
100498G Senaratne H.H.
100559V Vithana Y.G.K.
100577A Weerasinghe L.A.
Table of Contents
1. Introduction
2. Background
3. Approach Followed
3.1 Preprocessing
3.1.1 Understanding the nature of the given variables
3.1.2 Handling missing values
3.1.3 Converting Data Types
3.1.4 Data Normalization
3.1.5 Feature Selection and Deriving Features
3.2 Training Techniques
3.2.1 Random Forest Classifier
3.2.2 Gradient Boost Classifier
3.2.3 Neural Networks
3.2.4 XGBoost Classifier
4. Results and Discussion
5. References
6. Appendix
1. Introduction
This report describes the procedure used by the team to solve the "Higgs Boson Machine Learning Challenge" hosted on the Kaggle site.

The initial parts of this report provide some background to the problem, which is closely related to particle physics. Later sections describe how we modeled and preprocessed the data, which machine learning techniques and procedures we used to solve the problem, and what results we were able to obtain with the approaches we followed. Finally, we analyse and discuss the methods we followed and the outputs we obtained.
2. Background
The discovery of the Higgs boson, an elementary particle of particle physics, was recently claimed by the ATLAS and CMS experiments. The discovery was acknowledged by the 2013 Nobel Prize in Physics, awarded to François Englert and Peter Higgs. The related experiments run at the Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research) in Geneva, Switzerland, which began operating in 2009 after about 20 years of design and construction.

The Higgs boson decays through several processes, producing other particles. In physics, a channel is the term used to denote the decay of a particle into other specific particles. The ATLAS experiment recently reported the first evidence of the Higgs boson decaying through the tau tau channel; the observed signal of this decay is small and buried in background noise.

What is expected from the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment by classifying a given event into the correct region, 'signal' or 'background'. That is, deciding whether the results of a certain event are due to the tau tau decay of a Higgs boson (signal) or due to other background noise (background).
The training set consists of several primary and derived attributes related to this event classification, along with signal/background labels and weights. The weights are related to the normalizations of the signals and backgrounds. The test set consists of the same variables as the training set, but without the labels and weights. The required solution should contain the fields EventID (a unique identifier for each event), RankOrder (a permutation of the integer list from 1 to the test set size) and Class (either "b" or "s"). Higher ranks indicate more signal-like events and lower ranks indicate more background-like events. Since the rank can be calculated from the weight values, the objective is to find a function of the weights, or in simple terms to predict the weights for the test set after training a machine learning model on the training set. Depending on the value of the weight it is possible to predict the event's class, because two different ranges of weights fall into the two different classes.

Figure 1: Graphical representation of a Higgs boson decaying to two tau particles in the ATLAS detector
3. Approach Followed
In this section we discuss how we preprocessed the training data before feeding it to a machine learning model, and which machine learning techniques we used for training.

3.1 Preprocessing
Data preprocessing plays an important part in any machine learning challenge. In the Higgs Boson Machine Learning Challenge we used several data preprocessing methods, which are described in this section.

3.1.1 Understanding the nature of the given variables
Before starting the preprocessing work, we tried to figure out any directly visible relationships between the classification and the variables. To do so, we plotted the data graphically to reveal any information directly associated with the classification. The following figures (Figure 2 to Figure 5) show how the classification is distributed with respect to the range of values of a few of the variables.
Figure 2: Classification relative to the distribution of the variable DER_lep_eta_centrality
Figure 3: Classification relative to the distribution of the variable Weight
Figure 4: Classification relative to the distribution of the variable DER_mass_MMC
Figure 5: Classification relative to the distribution of the variable PRI_lep_eta

Through these visualizations we figured out that there is no variable directly associated with the classification except for the weight. From this we learned that if we predict the weight for the given test scenarios, we will be able to do both the classification and the ranking at the same time.
3.1.2 Handling missing values
In the data given for the competition, missing values are stored as −999. Exploring the data, we discovered that it contains a lot of missing values.

Figure 6: Variable statistics

As can be seen in Figure 6, many columns such as DER_deltaeta_jet_jet and DER_massdelta_jet_jet have −999 values for more than half of their entries (more than the median). It was clear that dropping the training rows that contain missing values was not an option, because we also need to predict for the test entries, which contain the same missing values. So as the first approach to handling missing values, we tried dropping the variables in which missing values are present. This did not improve the results, due to the large number of missing values present: after dropping the variables there was not enough data to predict from, and important relationships and variables tended to disappear for the sake of handling missing values. It is therefore not a good approach for handling missing values.

The next approach we considered was traditional imputation, but the results were not good. In this case we substituted missing values with the average of the corresponding variable column, ignoring the missing values in the calculation. The main reason for the lack of improvement is that the missing values are "actually" missing: a value for that feature cannot exist in that particular training instance. So the best way to handle the missing values is to interpret −999 as a special missing value and use algorithms that treat −999 values as a special category.
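As an illustration, a minimal pandas sketch of the mean-imputation approach we tried (the file path is a placeholder, and column handling is simplified):

```python
import numpy as np
import pandas as pd

# Load the training data; "training.csv" is a placeholder path.
train = pd.read_csv("training.csv")

# The competition encodes missing values as -999; mark them as NaN first.
train = train.replace(-999.0, np.nan)

# Mean imputation: fill each numeric column's NaNs with the column mean,
# which pandas computes while ignoring the missing entries.
numeric_cols = train.select_dtypes(include="number").columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())
```

As noted above, this did not help, precisely because the sentinel marks values that cannot exist for the event rather than values that were merely unrecorded.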
3.1.3 Converting Data Types
In order to apply the XGBoost and gradient boosting techniques, the value of the label has to be numeric, so we converted the Label column to 0/1 during data preprocessing: 0 if the label is equal to "b" and 1 if it is equal to "s".
3.1.4 Data Normalization
As Figure 6 shows, the distribution of values varies greatly between columns. For example, the column DER_pt_h varies from 0 to 2835, while DER_met_phi_centrality varies only from −1.4 to +1.4. To guarantee stable convergence of the weights and biases in our model we had to normalize all the columns. In this competition we used min-max normalization, where each value in a column is mapped to a value between 0 and 1.
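A minimal sketch of these two steps (the 0/1 label encoding and min-max normalization), assuming the data has been loaded into a `train` DataFrame as in the earlier sketch; the excluded column names are assumptions about the data layout:

```python
# Encode the label numerically: "b" (background) -> 0, "s" (signal) -> 1.
train["Label"] = (train["Label"] == "s").astype(int)

# Min-max normalization: map every feature column onto the [0, 1] range.
feature_cols = [c for c in train.columns
                if c not in ("EventId", "Weight", "Label")]
mins = train[feature_cols].min()
maxs = train[feature_cols].max()
train[feature_cols] = (train[feature_cols] - mins) / (maxs - mins)
```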
3.1.5 Feature Selection and Deriving Features
Figure 7 represents the correlation between the label and the other features in the data set.

Figure 7: The correlation between the label and other features in the data set

As can be seen in Figure 7, some variables such as PRI_tau_eta can be dropped when building the model, since they are insignificant to the Label value. The other thing to notice in the diagram is that no single variable can be considered significant to the Label value that we have to predict. So deriving new features was required.
We identified four features[1] which could be important to our model:

assymenj = (MET − MHT) / (MHT + MET)
dijet = sum of the two jet masses
deltaphi = jet1_phi − jet2_phi
deltaphimet = (jet1_phi + jet2_phi) / 2

The feature dijet is already included as a variable in the data set, as DER_mass_jet_jet. We derived the other three variables from the available data as follows. Since MHT (the missing energy calculated from the jets) was not readily available, we used a derived variable (estimatedMHT) which is proportional to this quantity.

estimatedMHT = PRI_jet_all_pt − PRI_jet_leading_pt − PRI_jet_subleading_pt
assymenj = (PRI_met − estimatedMHT) / (PRI_met + estimatedMHT)
deltaphi = PRI_jet_leading_phi − PRI_jet_subleading_phi
deltaphimet = (PRI_jet_leading_phi + PRI_jet_subleading_phi) / 2

Using a greedy approach we also identified a variable that had a 0.2 correlation with the label:

Special = DER_mass_MMC × DER_pt_ratio_lep_tau / (DER_sum_pt + 0.0000001)
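A minimal pandas sketch of these derivations, applied to a freshly loaded `train` DataFrame (as in our initial version, no special handling of −999 inputs is done here):

```python
eps = 1e-7  # the small constant from the Special formula; avoids division by zero

train["estimatedMHT"] = (train["PRI_jet_all_pt"]
                         - train["PRI_jet_leading_pt"]
                         - train["PRI_jet_subleading_pt"])
train["assymenj"] = ((train["PRI_met"] - train["estimatedMHT"])
                     / (train["PRI_met"] + train["estimatedMHT"]))
train["deltaphi"] = (train["PRI_jet_leading_phi"]
                     - train["PRI_jet_subleading_phi"])
train["deltaphimet"] = (train["PRI_jet_leading_phi"]
                        + train["PRI_jet_subleading_phi"]) / 2
train["Special"] = (train["DER_mass_MMC"] * train["DER_pt_ratio_lep_tau"]
                    / (train["DER_sum_pt"] + eps))
```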
These newly added columns in the preprocessing stage improved our score on the public leaderboard for submissions using the XGBoost algorithm. The initial version of these new variables simply calculated the values from the relevant columns of data, without considering whether any of those columns contain −999. In that case the variable creation algorithm simply took −999 as a valid value for the respective field and calculated the result, even though −999 is not a valid value, only an indicator that the value is not available.

With the above in mind, we decided to filter out the entries which have invalid inputs. We changed the variable creation algorithm to output −999 in cases where at least one of the inputs has an invalid value. Unfortunately the results did not turn out as expected. From an analysis of the change and the results, we concluded that the success rate decreased due to the elimination of diversity. To clarify this, consider three example entries:
EventID   Value 1   Value 2   New variable neglecting −999   New variable considering −999
1         −999      2.14      −996.86                        −999
15        1.52      −999      1000.52                        −999
122       −999      −999      0                              −999
For the above three entries, the variable which does not consider the invalid input gives three different values, and their value range is directly associated with the combination of invalid inputs. In the variable created with consideration of the invalid inputs, all three entries have the same value, −999. This clearly shows why the success rate decreased with the new variable: it eliminated the variability of the previous variable, hiding a lot of information that is very important for classification.
The new variables computed without regard to the invalid inputs seemed to introduce new measurements of the relations among the variables when taken as groups. Encouraged by this improvement, we explored the possibility of creating more derived variables to impose measurements of the collective relationship among the primitive variables for a specific result.

We came up with a few more columns by randomly combining primitive variables. The intention was to see whether we could improve the success rate by introducing new variables that combine information from other variables, but these decreased the success rate. We concluded that introducing variables with a known relationship to the classification may improve the success rate, while others may decrease it by introducing unimportant relationships.
3.2 Training Techniques

3.2.1 Random Forest Classifier
We used a random forest classifier for the Higgs boson challenge in the earlier stages. We developed the solution with the scikit-learn package for Python; the Random Forest Classifier comes under the sklearn.ensemble package in scikit-learn.

The basic functionality of a random forest is as follows[2]. It creates a number of classification trees instead of a single classification tree. When a new input needs to be classified, it is given to all the classifier trees and their answers are collected; the final answer is obtained by a voting mechanism, where the result from each tree counts as a vote and the answer with the most votes is selected.

When building the trees in the random forest, some guidelines are followed. First, if there are N cases in the training set, then N sample cases drawn with replacement are used to train each tree. Second, at each node, m variables are selected randomly from the total M input variables. Third, the trees are grown without any pruning.

One major feature of the random forest classifier is that it runs efficiently on large data sets and can handle a large number of input variables. It handles missing values effectively and can maintain accuracy when a large proportion of the data is missing. Furthermore, it can identify the variables that are most important and the relationships between variables. It also does not easily overfit to the inputs.

When training the trees in a random forest classifier, about one third of the data is left out of each tree's sample and used as out-of-bag data to obtain running unbiased error estimates and the importance of variables; the rest of the data is used as a bootstrap sample to train the trees. The out-of-bag data for each tree is put back through it to get a classification, and finally the class which got the most votes from the out-of-bag data is taken. That is used as an error estimate for the random forest classifier.
Measuring the importance of variables is also an important feature of random forest classification. This is done by running the out-of-bag data down each tree in the forest and counting the number of votes for the correct class. The values of the variable under examination are then changed, the data is run down the trees again, and the votes for the correct class are counted once more. Subtracting the votes of the changed input from those of the original, and averaging this difference over the forest, gives a score for the importance of the variable. If the number of variables in the data set is very high, the forest can be run once with all the variables and then again with only the most important ones.

Proximities are another important feature of the random forest classifier. They are formed by creating an N×N matrix over all the data, including the training and out-of-bag data. Since it is not feasible to hold an N×N matrix for large data sets, an N×T matrix is formed instead, where T is the number of trees in the forest.

The random forest classifier has two methods for filling missing values in the data set. The faster way is to fill the missing values with the median. The more accurate way is to initially fill the missing values with rough estimates and then run the forest to compute the proximities.

Outliers are identified by the random forest method through the proximity values: if there are entries in a class with small proximities, those entries are identified as outliers.
The random forest classifier in the scikit-learn package has several parameters which can be used to tune the results[3]. The n_estimators parameter specifies the number of trees in the forest. max_depth specifies the maximum depth of the trees; its default value is None, in which case nodes are expanded until all leaves are pure. oob_score is a boolean parameter that specifies whether to use out-of-bag samples to score the model.

It also provides several methods for the prediction work. The fit method builds the forest from the training set, and the predict method predicts the results for the test data. There is also a method called transform which can be used to reduce the input data matrix to the most important features.
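A minimal sketch of this workflow, assuming `X_train`, `y_train` and `X_test` have already been prepared as described in the preprocessing section (the parameter values are illustrative, with n_estimators matching our best run):

```python
from sklearn.ensemble import RandomForestClassifier

# 150 trees with out-of-bag scoring enabled; other parameters left at defaults.
clf = RandomForestClassifier(n_estimators=150, oob_score=True)
clf.fit(X_train, y_train)

print("Out-of-bag score:", clf.oob_score_)
print("Feature importances:", clf.feature_importances_)

predictions = clf.predict(X_test)
```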
Initially we used the random forest classifier to predict the Label value of the data as signal (s) or background (b) directly, without predicting the weight value; that way we did not have a rank value for the test data. For the initial submissions we also removed the derived features from the training data, adding them back later. We then made submissions that replaced the −999 values with the column averages, and also tried removing those columns from the training and test data sets. The random forest classifier did not give very good results with either of those methods: the maximum we were able to score with the random forest method was 2.90576 on the private leaderboard, with an n_estimators value of 150. When we then tried to estimate the weight value using the random forest classifier, it failed because it required a huge amount of memory. So we decided to move to other available options to get better results.
3.2.2 Gradient Boost Classifier
Another classifier we tested in the initial stages was the gradient boost classifier. Gradient boosting algorithms use an ensemble of weak decision trees built to optimize a customizable loss function; the trees are built by boosting in a staged manner. Gradient boosting can be used for both regression and classification, can handle data of mixed types, and is very robust to outliers.

We used the gradient boosted regression trees algorithm from the scikit-learn library in Python for this problem. This model used all the features in the data set to train the classifier. To improve the accuracy we used hyperparameter tuning along with stratified cross validation to set the best values for the parameters.

We also tried multiple loss functions, such as the default 'deviance' function as well as the AMS function used in this competition. Using the AMS function as the loss function improved our results slightly. Even with all this effort we were unable to match the performance we got from the XGBoost algorithm, so this approach was dropped.
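For reference, a minimal sketch of the scikit-learn classifier together with the AMS metric as defined by the competition; `X_train`, `y_train`, the validation split and the event `weights` are placeholders, and plugging AMS into scikit-learn's loss machinery, as we did, takes more plumbing than is shown here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance, the competition's evaluation metric.
    s: sum of weights of selected true signals, b: of selected true backgrounds."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

clf = GradientBoostingClassifier()  # hyperparameters tuned via grid search
clf.fit(X_train, y_train, sample_weight=weights)

# Score a held-out split: select events predicted as signal, then sum the
# weights of the true signals and true backgrounds among them.
pred = clf.predict(X_valid)
s = weights_valid[(pred == 1) & (y_valid == 1)].sum()
b = weights_valid[(pred == 1) & (y_valid == 0)].sum()
print("AMS:", ams(s, b))
```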
3.2.3 Neural Networks
Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. Algorithms such as backpropagation use gradient descent to tune network parameters to best fit a training set of input-output pairs. Neural network learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies.

We used the PyBrain[5] Python library to build a neural network trained with the backpropagation algorithm. While training the neural network, we faced a number of problems:
1. Number of hidden layers to be used

Number of Hidden Layers   Result
none                      Only capable of representing linearly separable functions or decisions.
1                         Can approximate any function that contains a continuous mapping from one finite space to another.
2                         Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy.

The table above summarizes the knowledge we acquired by going through various research papers. Unfortunately we were unable to find a specific method to determine the number of hidden layers, and hence we tested various numbers of hidden layers, ranging from 2 to 50. We were unable to increase the number of hidden layers further due to the huge amount of time taken by the network training phase.
2. Number of neurons for each hidden layer
We were unable to find any specific formula to calculate the number of neurons in a particular hidden layer, although we found many rule-of-thumb methods for determining a reasonable number of neurons for the hidden layers, such as the following:
● The number of hidden neurons should be between the size of the input layer and the size of the output layer.
● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
● The number of hidden neurons should be less than twice the size of the input layer.
We applied the above rules, and additionally tried deciding the number of neurons in a hidden layer based on a combination of the prime number series and the Fibonacci number series.
3. Neural network training time
Training the neural network took a lot of time. As a last resort we tried using genetic algorithms[6] and pruning algorithms[7][8] to optimize the neural network, but the result was not satisfactory.

4. How to decide the cutoff mark for signal or background noise
The output of the neural network was a floating point value between 0 and 1, where values closer to 1 indicate signal and values closer to 0 indicate background noise. Using 10-fold cross validation, we found that floating point values above 0.65 should be considered signal and values below 0.65 should be considered background noise.

However, all the prediction results obtained through the neural network model performed poorly when compared to the other models during cross validation.
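A minimal PyBrain sketch of the kind of network we experimented with; the hidden layer sizes and epoch count are illustrative, `X_train`, `y_train` and `X_test` are assumed to be prepared as before, and only the 0.65 cutoff is taken from our actual results:

```python
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork

n_features = X_train.shape[1]
net = buildNetwork(n_features, 20, 10, 1)   # two hidden layers, one output neuron

ds = SupervisedDataSet(n_features, 1)
for x, y in zip(X_train, y_train):
    ds.addSample(x, (y,))

trainer = BackpropTrainer(net, ds)
for epoch in range(20):
    error = trainer.train()                 # one full pass over the dataset
    print("epoch", epoch, "error", error)

# Classify: outputs above the 0.65 cutoff are treated as signal.
scores = [net.activate(x)[0] for x in X_test]
predictions = ["s" if score > 0.65 else "b" for score in scores]
```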
3.2.4 XGBoost Classifier
We used the xgboost package[4] for the R language, which implements an extreme gradient boosting classifier. Extreme gradient boosting is an efficient and scalable implementation of the gradient boosting framework described earlier. The package includes an efficient linear model solver and tree learning algorithms, and it can automatically parallelize computation with OpenMP, which makes it more than 10 times faster than the gradient boosting implementation we used previously. XGBoost supports various objective functions, including regression, classification and ranking. We used the "rank" objective in order to rank the probabilities of the events being due to signals, as required for the submissions. The two classes, "s" and "b", were then separated using a threshold value, carefully chosen after analysis, applied to the test entries sorted according to their probabilities of being signals.

Unlike the gradient boost classifier, the XGBoost classifier provides a special way of handling missing values: it automatically learns the best direction to take when a value is missing. We fed −999 as the missing value to the XGBoost classifier, and by doing so we improved the scores we got on the public leaderboard.

We then tuned the parameters. We used GridSearch from the scikit-learn package[9] to select a better set of parameters than the default values. With the following set of parameters we obtained good results. Since gradient boosting dramatically improves the model's generalization ability at lower learning rates (heavy shrinkage), we reduced the default value of eta; and since lower learning rates need more iterations, increasing the nround variable had a positive impact on our results. It is a known fact that lower learning rates reduce overfitting.
eta = 0.05
max_depth = 6
silent = 1
nthread = 16
nround = 500
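We ran XGBoost from R, but for consistency with the earlier sketches the equivalent call in the Python xgboost API looks roughly as follows (rank:pairwise is our assumption for the "rank" objective mentioned above; missing=-999.0 tells XGBoost to treat the sentinel as a missing value):

```python
import xgboost as xgb

# -999 entries are passed through as the missing-value indicator.
dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999.0)
dtest = xgb.DMatrix(X_test, missing=-999.0)

params = {
    "eta": 0.05,                   # lowered learning rate (heavy shrinkage)
    "max_depth": 6,
    "silent": 1,
    "nthread": 16,
    "objective": "rank:pairwise",  # ranking objective (assumed)
}
booster = xgb.train(params, dtrain, num_boost_round=500)

# Sort the test events by score, then threshold the sorted list into "s"/"b".
scores = booster.predict(dtest)
```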
4. Results and Discussion
As mentioned in the previous parts of this document, we tried multiple classification systems to solve this challenge, with varying results:
1. Gradient Boosting
2. Random Forests
3. Neural Networks
4. XGBoost
We found that the XGBoost algorithm produced the best results for this problem. Using the XGBoost algorithm as described above, we were able to get a final score of 3.64655, which gave us a rank of 437.

While we were able to get higher scores on the public leaderboard, those scores misled us about the overall predictive ability of our models. Our best submission, which scored 3.64672 and was ranked 223 on the public leaderboard, was a result of overfitting, which caused us to drop on the private leaderboard, which determines the final positions.

The biggest issue with our process in this competition was the lack of good cross validation. We relied too heavily on the public leaderboard to assess the quality of our models, and so were unable to avoid overfitting our predictions to the leaderboard. As the public leaderboard was computed using only 18% of the data, relying on it to gauge improvements led to overfitting. A valuable lesson learned through this contest is the importance of maintaining good standards of cross validation for our predictions, which would have allowed us to perform much better.
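In hindsight, a minimal sketch of the kind of local validation that would have helped, reusing the `ams` function, `params` and the xgboost import from the earlier sketches (the 85th-percentile selection threshold is illustrative, and fold weights are not renormalized, so the fold AMS values are only indicative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

fold_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx], missing=-999.0)
    booster = xgb.train(params, dtrain, num_boost_round=500)
    scores = booster.predict(xgb.DMatrix(X[valid_idx], missing=-999.0))

    # Select the highest-scoring ~15% of events as signal.
    pred = scores > np.percentile(scores, 85)
    s = weights[valid_idx][pred & (y[valid_idx] == 1)].sum()
    b = weights[valid_idx][pred & (y[valid_idx] == 0)].sum()
    fold_scores.append(ams(s, b))

print("mean AMS:", np.mean(fold_scores), "+/-", np.std(fold_scores))
```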
One of the major challenges we faced during this competition was coming up with derived features. Since we had no knowledge of the field of high energy particle physics, we had to read a couple of research papers in that area in order to come up with the features mentioned in section 3.1.5, and in the process we gained a considerable amount of knowledge of the field.

This competition was a valuable opportunity for us to learn important machine learning and data mining concepts while contributing to a very important scientific cause. Through this competition we gained a better understanding of the challenges in the field, as well as the methodologies practically used to overcome them.
5. References
[1] http://www.lps.ens.fr/~laetitia/HIGGS.pdf
[2] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[3] http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[4] http://cran.r-project.org/web/packages/xgboost/index.html
[5] Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., ... & Schmidhuber, J. (2010). PyBrain. The Journal of Machine Learning Research, 11, 743–746.
[6] Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. Neural Networks, IEEE Transactions on, 1(2), 239–242.
[7] Leung, F. H. F., Lam, H. K., Ling, S. H., & Tam, P. K. S. (2003). Tuning of the structure and parameters of a neural network using an improved genetic algorithm. Neural Networks, IEEE Transactions on, 14(1), 79–88.
[8] http://www.pybrain.org/docs/api/optimization/optimization.html#populationbased
[9] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
6. Appendix
Appendix A: Public and private scores for Random Forest models and Logistic Regression models
Appendix B: Public and private scores for Gradient Boosting models
Appendix C: Public and private scores for some XGBoost models
Appendix D: Public and private scores for some Neural Network models