This document presents an entropy-based adaptive genetic algorithm approach for learning classification rules from data. The key aspects are:
1. Binary encoding is used to represent classification rules, with the consequent determined by the majority of examples a rule matches.
2. The fitness function considers error rate, entropy, rule consistency, and hole ratio to evaluate individuals. Entropy measures the homogeneity of examples matched by rules.
3. An adaptive asymmetric mutation operator is applied, with the mutation inversion probability self-adapting during the algorithm run.
4. The approach is tested on credit, voting, and heart disease databases, outperforming decision trees, neural networks, and naive Bayes methods.
This document describes a genetic algorithm approach for automatically generating fuzzy rules for classification problems. The genetic algorithm uses rule importance as the fitness criteria, calculated as the rule's support for each class. The algorithm encodes rules using fuzzy membership set numbers for antecedents and consequents. It iterates for a set number of generations or until a minimum number of rules are fired, selecting high-fitness rules to generate offspring via crossover. The offspring replace lower-fitness rules, and the new rule population is evaluated in the next generation. The approach aims to consistently generate optimal fuzzy rules for classification using genetic search.
Relationships Among Classical Test Theory and Item Response Theory Frameworks...AnusornKoedsri3
This document discusses relationships between classical test theory (CTT) and item response theory (IRT) frameworks. It reviews prior literature that has found the frameworks to be quite comparable in estimating person and item parameters. However, previous studies only used IRT to generate data and did not consider CTT models with an underlying normal variable assumption. The current study aims to compare item and person statistics from IRT and CTT models with an underlying normal variable assumption using simulated data generated from both frameworks. A simulation was conducted varying test length (20, 40, 60 items) and number of examinees (500, 1000) to compare one-parameter logistic, two-parameter logistic, parallel, and congeneric models.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
This document discusses using genetic programming (GP) for data classification. It begins with an introduction to classification problems and algorithms like decision trees, neural networks, and Bayesian classification. It then provides background on genetic algorithms and genetic programming, explaining that GP uses tree-based representations of computer programs to evolve solutions. The document presents a case study on using GP for classification, discussing fitness measures, creating training sets, and an interleaved data format to address class imbalance. The goal is to automatically generate classification models for data sets using GP.
Automated alphabet reduction with evolutionary algorithms for protein structu...kknsastry
This paper focuses on automated procedures to reduce the dimensionality of protein structure prediction datasets by simplifying the way in which the primary sequence of a protein is represented. The potential benefits of this procedure are faster and easier learning process as well as the generation of more compact and human-readable classifiers. The dimensionality reduction procedure we propose consists on the reduction of the 20-letter amino acid (AA) alphabet, which is normally used to specify a protein sequence, into a lower cardinality alphabet. This reduction comes about by a clustering of AA types accordingly to their physical and chemical similarity. Our automated reduction procedure is guided by a fitness function based on the Mutual Information between the AA-based input attributes of the dataset and the protein structure feature that being predicted. <br />
To search for the optimal reduction, the Extended Compact Genetic Algorithm (ECGA) was used, and afterwards the results of this process were fed into (and validated by) BioHEL, a genetics-based machine learning technique. BioHEL used the reduced alphabet to induce rules for protein structure prediction features. BioHEL results are compared to two standard machine learning systems. Our results show that it is possible to reduce the size of the alphabet used for prediction from twenty to just three letters resulting in more compact, i.e. interpretable, rules. Also, a protein-wise accuracy performance measure suggests that the loss of accuracy accrued by this substantial alphabet reduction is not statistically significant when compared to the full alphabet.
This document describes two machine learning techniques, particle swarm optimization with support vector machines (PSO-SVM) and recursive feature elimination with support vector machines (RFE-SVM), that were used to classify autism neuroimaging data from the Autism Brain Imaging Data Exchange database. PSO-SVM was used to select discriminative features for classification, while RFE-SVM ranked features by importance. Both techniques aimed to improve classification accuracy and reduce overfitting by selecting optimal feature subsets from the high-dimensional neuroimaging data. The results could help develop brain-based diagnostic criteria for autism.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Performance analysis of linkage learning techniques in genetic algorithmseSAT Journals
Abstract One variance of Genetic Algorithms is a Linkage Learning Genetic Algorithm (LLGA) enhances the efficiencies of Simple Genetic Algorithm (SGA) while solving NP hard Problems. Discovery of Linkage Learning Technique is an important task in GA. Almost all existing Linkage Learning Techniques follow either random approach or probabilistic approaches. This makes repeated passes over the population to determine the relationship between individuals. SGA with random linkage technique is simple but may take long time to converge to the optimal solutions. This paper uses a linkage learning operator called Gene Silencing which is an inspired mechanism from biological systems. The Gene Silencing mechanism is used to improve the linkages by preserving the building blocks in an individual from the disruption of recombination processes such as Crossover and Mutation. It converges quickly to the optimal solution without compromising the diversification on search spaces. To prove this phenomenon, the Travelling Sales Person problem (TSP) has been chosen to retain the order of cities in a tour. Experiments carried out on different TSP benchmark instances taken from TSPLIB which is a standard library for TSP problems. These benchmark instances have also been applied on various linkage learning techniques and analyses the performance of these techniques with Gene Silencing (GS) mechanism. The performance analysis has been made on experimental results with respect to optimal solution and convergence speed. Index Terms: Linkage Learning, Gene Silencing, Building Blocks, Genetic Algorithm, TSPLIB, Performance Analysis
This document describes a genetic algorithm approach for automatically generating fuzzy rules for classification problems. The genetic algorithm uses rule importance as the fitness criteria, calculated as the rule's support for each class. The algorithm encodes rules using fuzzy membership set numbers for antecedents and consequents. It iterates for a set number of generations or until a minimum number of rules are fired, selecting high-fitness rules to generate offspring via crossover. The offspring replace lower-fitness rules, and the new rule population is evaluated in the next generation. The approach aims to consistently generate optimal fuzzy rules for classification using genetic search.
Relationships Among Classical Test Theory and Item Response Theory Frameworks...AnusornKoedsri3
This document discusses relationships between classical test theory (CTT) and item response theory (IRT) frameworks. It reviews prior literature that has found the frameworks to be quite comparable in estimating person and item parameters. However, previous studies only used IRT to generate data and did not consider CTT models with an underlying normal variable assumption. The current study aims to compare item and person statistics from IRT and CTT models with an underlying normal variable assumption using simulated data generated from both frameworks. A simulation was conducted varying test length (20, 40, 60 items) and number of examinees (500, 1000) to compare one-parameter logistic, two-parameter logistic, parallel, and congeneric models.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
This document discusses using genetic programming (GP) for data classification. It begins with an introduction to classification problems and algorithms like decision trees, neural networks, and Bayesian classification. It then provides background on genetic algorithms and genetic programming, explaining that GP uses tree-based representations of computer programs to evolve solutions. The document presents a case study on using GP for classification, discussing fitness measures, creating training sets, and an interleaved data format to address class imbalance. The goal is to automatically generate classification models for data sets using GP.
Automated alphabet reduction with evolutionary algorithms for protein structu...kknsastry
This paper focuses on automated procedures to reduce the dimensionality of protein structure prediction datasets by simplifying the way in which the primary sequence of a protein is represented. The potential benefits of this procedure are faster and easier learning process as well as the generation of more compact and human-readable classifiers. The dimensionality reduction procedure we propose consists on the reduction of the 20-letter amino acid (AA) alphabet, which is normally used to specify a protein sequence, into a lower cardinality alphabet. This reduction comes about by a clustering of AA types accordingly to their physical and chemical similarity. Our automated reduction procedure is guided by a fitness function based on the Mutual Information between the AA-based input attributes of the dataset and the protein structure feature that being predicted. <br />
To search for the optimal reduction, the Extended Compact Genetic Algorithm (ECGA) was used, and afterwards the results of this process were fed into (and validated by) BioHEL, a genetics-based machine learning technique. BioHEL used the reduced alphabet to induce rules for protein structure prediction features. BioHEL results are compared to two standard machine learning systems. Our results show that it is possible to reduce the size of the alphabet used for prediction from twenty to just three letters resulting in more compact, i.e. interpretable, rules. Also, a protein-wise accuracy performance measure suggests that the loss of accuracy accrued by this substantial alphabet reduction is not statistically significant when compared to the full alphabet.
This document describes two machine learning techniques, particle swarm optimization with support vector machines (PSO-SVM) and recursive feature elimination with support vector machines (RFE-SVM), that were used to classify autism neuroimaging data from the Autism Brain Imaging Data Exchange database. PSO-SVM was used to select discriminative features for classification, while RFE-SVM ranked features by importance. Both techniques aimed to improve classification accuracy and reduce overfitting by selecting optimal feature subsets from the high-dimensional neuroimaging data. The results could help develop brain-based diagnostic criteria for autism.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Performance analysis of linkage learning techniques in genetic algorithmseSAT Journals
Abstract One variance of Genetic Algorithms is a Linkage Learning Genetic Algorithm (LLGA) enhances the efficiencies of Simple Genetic Algorithm (SGA) while solving NP hard Problems. Discovery of Linkage Learning Technique is an important task in GA. Almost all existing Linkage Learning Techniques follow either random approach or probabilistic approaches. This makes repeated passes over the population to determine the relationship between individuals. SGA with random linkage technique is simple but may take long time to converge to the optimal solutions. This paper uses a linkage learning operator called Gene Silencing which is an inspired mechanism from biological systems. The Gene Silencing mechanism is used to improve the linkages by preserving the building blocks in an individual from the disruption of recombination processes such as Crossover and Mutation. It converges quickly to the optimal solution without compromising the diversification on search spaces. To prove this phenomenon, the Travelling Sales Person problem (TSP) has been chosen to retain the order of cities in a tour. Experiments carried out on different TSP benchmark instances taken from TSPLIB which is a standard library for TSP problems. These benchmark instances have also been applied on various linkage learning techniques and analyses the performance of these techniques with Gene Silencing (GS) mechanism. The performance analysis has been made on experimental results with respect to optimal solution and convergence speed. Index Terms: Linkage Learning, Gene Silencing, Building Blocks, Genetic Algorithm, TSPLIB, Performance Analysis
Enhancing Keyword Query Results Over Database for Improving User Satisfaction ijmpict
Storing data in relational databases is widely increasing to support keyword queries but search results does not gives effective answers to keyword query and hence it is inflexible from user perspective. It would be helpful to recognize such type of queries which gives results with low ranking. Here we estimate prediction of query performance to find out effectiveness of a search performed in response to query and features of such hard queries is studied by taking into account contents of the database and result list. One relevant problem of database is the presence of missing data and it can be handled by imputation. Here an inTeractive Retrieving-Inferring data imputation method (TRIP) is used which achieves retrieving and inferring alternately to fill the missing attribute values in the database. So by considering both the prediction of hard queries and imputation over the database, we can get better keyword search results.
siRNA has become an indispensible tool for silencing gene expression. It can act as an antiviral agent in RNAi pathway against plant diseases caused by plant viruses. However, identification of appropriate features for effective siRNA design has become a pressing issue for researchers which need to be resolved. Feature selection is a vital pre-processing technique involved in bioinformatics data set to find the most discriminative information not only for dimensionality reduction and detection of relevance features but also for minimizing the cost associated with features to design an accurate learning system. In this paper, we propose an ANN based feature selection approach using hybrid GA-PSO for selecting feature subset by discarding the irrelevant features and evaluating the cost of the model training. The results showed that the performance of proposed hybrid GA-PSO model outperformed the results of general PSO.a
Improving the effectiveness of information retrieval system using adaptive ge...ijcsit
The document describes research into improving the effectiveness of information retrieval systems using an adaptive genetic algorithm. A genetic algorithm with variable crossover and mutation probabilities (adaptive GA) is investigated. The adaptive GA is tested on 242 Arabic abstracts using three information retrieval models: vector space model, extended Boolean model, and language model. Results show the adaptive GA approach improves retrieval effectiveness over traditional genetic algorithms and baseline information retrieval systems, as measured by average recall and precision. Key aspects of the adaptive GA used include variable crossover and mutation probabilities tuned during the search process, and fitness functions based on document retrieval order.
SURVEY ON MODELLING METHODS APPLICABLE TO GENE REGULATORY NETWORKijbbjournal
Gene Regulatory Network (GRN) plays an important role in knowing insight of cellular life cycle. It gives
information about at which different environmental conditions genes of particular interest get over
expressed or under expressed. Modelling of GRN is nothing but finding interactive relationships between
genes. Interaction can be positive or negative. For inference of GRN, time series data provided by
Microarray technology is used. Key factors to be considered while constructing GRN are scalability,
robustness, reliability and maximum detection of true positive interactions between genes. This paper
gives detailed technical review of existing methods applied for building of GRN along with scope for
future work.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
This document describes a proposed modified cluster-based fuzzy-genetic data mining algorithm. The algorithm aims to mine both association rules and membership functions from quantitative transaction data. It uses a genetic algorithm approach that represents each set of membership functions as a chromosome. Chromosomes are clustered using a modified k-means approach to reduce computational costs. The representative chromosome of each cluster is used to calculate fitness values. Offspring are produced through genetic operators and selected through roulette wheel selection. The algorithm iterates until obtaining a set of membership functions with high fitness. These are then used to mine multilevel fuzzy association rules from the transaction data. The algorithm is illustrated through a simple example involving transaction data containing purchases of items like milk, bread, etc
A clonal based algorithm for the reconstruction of genetic network using s sy...eSAT Journals
Abstract Motivation: Gene regulatory network is the network based approach to represent the interactions between genes. DNA microarray is the most widely used technology for extracting the relationships between thousands of genes simultaneously. Gene microarray experiment provides the gene expression data for a particular condition and varying time periods. The expression of a particular gene depends upon the biological conditions and other genes. In this paper, we propose a new method for the analysis of microarray data. The proposed method makes use of S-system, which is a well-accepted model for the gene regulatory network reconstruction. Since the problem has multiple solutions, we have to identify an optimized solution. Evolutionary algorithms have been used to solve such problems. Though there are a number of attempts already been carried out by various researchers, the solutions are still not that satisfactory with respect to the time taken and the degree of accuracy achieved. Therefore, there is a need of huge amount further work in this topic for achieving solutions with improved performances. Results: In this work, we have proposed Clonal selection algorithm for identifying optimal gene regulatory network. The approach is tested on the real life data: SOS Ecoli DNA repairing gene expression data. It is observed that the proposed algorithm converges much faster and provides better results than the existing algorithms. Index Terms: Microarray analysis, Evolutionary Algorithm, Artificial Immune System, S-system, Gene Regulatory Network, SOS Ecoli DNA repairing, Clonal Selection Algorithm.
This document summarizes a research paper that proposes using a genetic algorithm to optimize the placement of Fiber Bragg Grating sensors for structural health monitoring. It begins by introducing FBG sensors and the need for sensor placement optimization when resources are limited. It then provides background on genetic algorithms and describes how they can be applied to the sensor placement problem by coding sensor locations into chromosomes, evaluating fitness, and using genetic operators like selection, crossover and mutation to evolve optimized sensor configurations. The document explains the genetic algorithm approach in detail through sections on coding, initial population, selection, crossover and mutation operations.
The document proposes a new approach to compare stock market patterns to DNA sequences using compression techniques. Stock market data is converted to binary sequences representing increases and decreases, which are then encoded into DNA nucleotides. These nucleotide sequences are divided and matched against human genome sequences using BLAST. The analysis found certain sub-sequences of the stock market patterns matched 100% to the human genome, suggesting this approach could potentially predict stock market behavior.
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Ronak Shah
Protein-Protein interactions discovered by the existing high-throughput techniques contain very high amount of false positives. Here we present an SVM based approach to generate a model that is built on sequence and non-sequence based information of the interacting proteins. This model is used to assess the reliability of given protein-protein interactions. It was run on the interaction data of a pathogenic bacterium; Treponema pallidum (causes Syphilis in humans) obtained from Yeast two hybrid experiments. Various kernels were used for building the model and of all, Sigmoid kernel performed well when used with all the features combined with area under the receiver operating curve (ROC) as 0.53.
K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imp...IOSR Journals
This document compares the performance of K-means clustering and K-nearest neighbor (K-NN) classification for imputing missing values. It finds that K-NN performs better than K-means clustering in terms of accuracy when imputing missing values at rates from 2% to 20%. The document simulates datasets with various missing value rates, uses each method to group the data and impute missing values via mean substitution, and compares the results to the original complete dataset to calculate accuracy. K-NN achieved an average accuracy of 67% compared to 62% for K-means clustering across the different missing value rates tested.
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...cscpconf
Optimization problems are dominantly being solved using Computational Intelligence. One of
the issues that can be addressed in this context is problems related to attribute subset selection
evaluation. This paper presents a computational intelligence technique for solving the
optimization problem using a proposed model called Modified Genetic Search Algorithms
(MGSA) that avoids local bad search space with merit and scaled fitness variables, detecting
and deleting bad candidate chromosomes, thereby reducing the number of individual
chromosomes from search space and subsequent iterations in next generations. This paper aims
to show that Rotation forest ensembles are useful in the feature selection method. The base
classifier is multinomial logistic regression method integrated with Haar wavelets as projection
filter and reproducing the ranks of each features with 10 fold cross validation method. It also
discusses the main findings and concludes with promising result of the proposed model. It
explores the combination of MGSA for optimization with Naïve Bayes classification. The result
obtained using proposed model MGSA is validated mathematically using Principal Component
Analysis. The goal is to improve the accuracy and quality of diagnosis of Breast cancer disease
with robust machine learning algorithms. As compared to other works in literature survey,
experimental results achieved in this paper show better results with statistical inferenc
Supervised WSD Using Master- Slave Voting Techniqueiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
Genetic Algorithm based Optimization of Machining ParametersAngshuman Pal
Optimization of a process output with reference to multiple input parameters is an important aspect of any manufacturing or machining process. This project examines and formulates a mechanism to optimize a given process output using the optimization technique of Genetic Algorithm. A new code for this purpose is formulated in MATLAB environment. The code is tested against some standardized functions and some case studies are performed to validate its performance against existing literature. The algorithm is then run over some real process performance data obtained from milling operation, and the optimum input parameters under given constraints required for achieving minimum surface roughness is proposed.
This report summarizes Francisco Reinaldo's work in the first year of his PhD, including proposing a cognitive agent architecture called AFRANCI and developing a framework to implement it. AFRANCI is inspired by the mammalian brain and aims to enable agents to achieve goals, coordinate activities, and resolve conflicts through a hierarchical structure of cognitive processes. The framework will provide base classes for an artificial central nervous system to develop heterogeneous agents. Reinaldo plans to evaluate AFRANCI by implementing rescue agents in the RoboCup Rescue simulation system over the next years of his PhD research.
This document discusses the role of consistency controlled future generating models in strategic management. It argues that applying principles of combinatorics can automate the process of searching for the best model to forecast the future. However, defining the goal function for what constitutes the "best" model raises philosophical issues. The document also examines how consistency controlled modeling can help improve strategic decision making at both the social and enterprise levels, and discusses the role of the CIO in collecting and organizing the internal and external data needed to build such models for agricultural enterprises.
This document discusses the role of users in the decision tree data mining process. Some key points:
1. Building decision trees is often viewed as an automatic process, but the document argues that users play an important role at various steps, including data preparation, choosing attribute selection criteria and pruning methods, and interpreting results.
2. There are many options for attribute selection criteria and pruning methods, and the best choice depends on factors like the certainty of the data and desired interpretability. Users must make choices at these steps.
3. Associated tasks like data cleaning, attribute encoding, and analyzing results are also important but often overlooked. These tasks require significant user input and influence the final results. The document aims to emphasize
This document provides a summary of Dragomir R. Radev's research interests, employment history, education, funding, patents, and publications. Specifically:
- Radev's research interests include information retrieval, natural language processing, text summarization, and artificial intelligence.
- He is currently an Associate Professor at the University of Michigan across several departments and centers. He has also held positions at IBM and Columbia University.
- Radev received his Ph.D. in Computer Science from Columbia University and has received funding from several agencies including NSF and NIH.
- He holds one U.S. patent and has one additional patent pending. Radev has authored or co-authored over 15
This document provides a curriculum vitae for Ido Dagan. It outlines his personal details, education history, employment history, teaching experience, research interests, and publications. Ido Dagan received his PhD in Computer Science from Technion - Israel Institute of Technology in 1992. He has held various academic and industrial positions, including Vice President of Technology at LingoMotors and Founding CTO of FocusEngine. His research interests include natural language processing with a focus on empirical and machine learning methods. He has numerous publications in journals, books, and conferences.
Enhancing Keyword Query Results Over Database for Improving User Satisfaction ijmpict
Storing data in relational databases is widely increasing to support keyword queries but search results does not gives effective answers to keyword query and hence it is inflexible from user perspective. It would be helpful to recognize such type of queries which gives results with low ranking. Here we estimate prediction of query performance to find out effectiveness of a search performed in response to query and features of such hard queries is studied by taking into account contents of the database and result list. One relevant problem of database is the presence of missing data and it can be handled by imputation. Here an inTeractive Retrieving-Inferring data imputation method (TRIP) is used which achieves retrieving and inferring alternately to fill the missing attribute values in the database. So by considering both the prediction of hard queries and imputation over the database, we can get better keyword search results.
siRNA has become an indispensible tool for silencing gene expression. It can act as an antiviral agent in RNAi pathway against plant diseases caused by plant viruses. However, identification of appropriate features for effective siRNA design has become a pressing issue for researchers which need to be resolved. Feature selection is a vital pre-processing technique involved in bioinformatics data set to find the most discriminative information not only for dimensionality reduction and detection of relevance features but also for minimizing the cost associated with features to design an accurate learning system. In this paper, we propose an ANN based feature selection approach using hybrid GA-PSO for selecting feature subset by discarding the irrelevant features and evaluating the cost of the model training. The results showed that the performance of proposed hybrid GA-PSO model outperformed the results of general PSO.a
Improving the effectiveness of information retrieval system using adaptive ge...ijcsit
The document describes research into improving the effectiveness of information retrieval systems using an adaptive genetic algorithm. A genetic algorithm with variable crossover and mutation probabilities (adaptive GA) is investigated. The adaptive GA is tested on 242 Arabic abstracts using three information retrieval models: vector space model, extended Boolean model, and language model. Results show the adaptive GA approach improves retrieval effectiveness over traditional genetic algorithms and baseline information retrieval systems, as measured by average recall and precision. Key aspects of the adaptive GA used include variable crossover and mutation probabilities tuned during the search process, and fitness functions based on document retrieval order.
SURVEY ON MODELLING METHODS APPLICABLE TO GENE REGULATORY NETWORKijbbjournal
Gene Regulatory Network (GRN) plays an important role in knowing insight of cellular life cycle. It gives
information about at which different environmental conditions genes of particular interest get over
expressed or under expressed. Modelling of GRN is nothing but finding interactive relationships between
genes. Interaction can be positive or negative. For inference of GRN, time series data provided by
Microarray technology is used. Key factors to be considered while constructing GRN are scalability,
robustness, reliability and maximum detection of true positive interactions between genes. This paper
gives detailed technical review of existing methods applied for building of GRN along with scope for
future work.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
This document describes a proposed modified cluster-based fuzzy-genetic data mining algorithm. The algorithm aims to mine both association rules and membership functions from quantitative transaction data. It uses a genetic algorithm approach that represents each set of membership functions as a chromosome. Chromosomes are clustered using a modified k-means approach to reduce computational costs. The representative chromosome of each cluster is used to calculate fitness values. Offspring are produced through genetic operators and selected through roulette wheel selection. The algorithm iterates until obtaining a set of membership functions with high fitness. These are then used to mine multilevel fuzzy association rules from the transaction data. The algorithm is illustrated through a simple example involving transaction data containing purchases of items like milk, bread, etc
A clonal based algorithm for the reconstruction of genetic network using s sy...eSAT Journals
Abstract Motivation: Gene regulatory network is the network based approach to represent the interactions between genes. DNA microarray is the most widely used technology for extracting the relationships between thousands of genes simultaneously. Gene microarray experiment provides the gene expression data for a particular condition and varying time periods. The expression of a particular gene depends upon the biological conditions and other genes. In this paper, we propose a new method for the analysis of microarray data. The proposed method makes use of S-system, which is a well-accepted model for the gene regulatory network reconstruction. Since the problem has multiple solutions, we have to identify an optimized solution. Evolutionary algorithms have been used to solve such problems. Though there are a number of attempts already been carried out by various researchers, the solutions are still not that satisfactory with respect to the time taken and the degree of accuracy achieved. Therefore, there is a need of huge amount further work in this topic for achieving solutions with improved performances. Results: In this work, we have proposed Clonal selection algorithm for identifying optimal gene regulatory network. The approach is tested on the real life data: SOS Ecoli DNA repairing gene expression data. It is observed that the proposed algorithm converges much faster and provides better results than the existing algorithms. Index Terms: Microarray analysis, Evolutionary Algorithm, Artificial Immune System, S-system, Gene Regulatory Network, SOS Ecoli DNA repairing, Clonal Selection Algorithm.
This document summarizes a research paper that proposes using a genetic algorithm to optimize the placement of Fiber Bragg Grating sensors for structural health monitoring. It begins by introducing FBG sensors and the need for sensor placement optimization when resources are limited. It then provides background on genetic algorithms and describes how they can be applied to the sensor placement problem by coding sensor locations into chromosomes, evaluating fitness, and using genetic operators like selection, crossover and mutation to evolve optimized sensor configurations. The document explains the genetic algorithm approach in detail through sections on coding, initial population, selection, crossover and mutation operations.
The document proposes a new approach to compare stock market patterns to DNA sequences using compression techniques. Stock market data is converted to binary sequences representing increases and decreases, which are then encoded into DNA nucleotides. These nucleotide sequences are divided and matched against human genome sequences using BLAST. The analysis found certain sub-sequences of the stock market patterns matched 100% to the human genome, suggesting this approach could potentially predict stock market behavior.
Protein-Protein Interaction using SVM based kernel,Jacob Coefficient and Gene...Ronak Shah
Protein-Protein interactions discovered by the existing high-throughput techniques contain very high amount of false positives. Here we present an SVM based approach to generate a model that is built on sequence and non-sequence based information of the interacting proteins. This model is used to assess the reliability of given protein-protein interactions. It was run on the interaction data of a pathogenic bacterium; Treponema pallidum (causes Syphilis in humans) obtained from Yeast two hybrid experiments. Various kernels were used for building the model and of all, Sigmoid kernel performed well when used with all the features combined with area under the receiver operating curve (ROC) as 0.53.
K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imp...IOSR Journals
This document compares the performance of K-means clustering and K-nearest neighbor (K-NN) classification for imputing missing values. It finds that K-NN performs better than K-means clustering in terms of accuracy when imputing missing values at rates from 2% to 20%. The document simulates datasets with various missing value rates, uses each method to group the data and impute missing values via mean substitution, and compares the results to the original complete dataset to calculate accuracy. K-NN achieved an average accuracy of 67% compared to 62% for K-means clustering across the different missing value rates tested.
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...cscpconf
Optimization problems are dominantly being solved using Computational Intelligence. One of
the issues that can be addressed in this context is problems related to attribute subset selection
evaluation. This paper presents a computational intelligence technique for solving the
optimization problem using a proposed model called Modified Genetic Search Algorithms
(MGSA) that avoids local bad search space with merit and scaled fitness variables, detecting
and deleting bad candidate chromosomes, thereby reducing the number of individual
chromosomes from search space and subsequent iterations in next generations. This paper aims
to show that Rotation forest ensembles are useful in the feature selection method. The base
classifier is multinomial logistic regression method integrated with Haar wavelets as projection
filter and reproducing the ranks of each features with 10 fold cross validation method. It also
discusses the main findings and concludes with promising result of the proposed model. It
explores the combination of MGSA for optimization with Naïve Bayes classification. The result
obtained using proposed model MGSA is validated mathematically using Principal Component
Analysis. The goal is to improve the accuracy and quality of diagnosis of Breast cancer disease
with robust machine learning algorithms. As compared to other works in literature survey,
experimental results achieved in this paper show better results with statistical inferenc
Supervised WSD Using Master- Slave Voting Techniqueiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
Genetic Algorithm based Optimization of Machining ParametersAngshuman Pal
Optimization of a process output with reference to multiple input parameters is an important aspect of any manufacturing or machining process. This project examines and formulates a mechanism to optimize a given process output using the optimization technique of Genetic Algorithm. A new code for this purpose is formulated in MATLAB environment. The code is tested against some standardized functions and some case studies are performed to validate its performance against existing literature. The algorithm is then run over some real process performance data obtained from milling operation, and the optimum input parameters under given constraints required for achieving minimum surface roughness is proposed.
This report summarizes Francisco Reinaldo's work in the first year of his PhD, including proposing a cognitive agent architecture called AFRANCI and developing a framework to implement it. AFRANCI is inspired by the mammalian brain and aims to enable agents to achieve goals, coordinate activities, and resolve conflicts through a hierarchical structure of cognitive processes. The framework will provide base classes for an artificial central nervous system to develop heterogeneous agents. Reinaldo plans to evaluate AFRANCI by implementing rescue agents in the RoboCup Rescue simulation system over the next years of his PhD research.
This document discusses the role of consistency controlled future generating models in strategic management. It argues that applying principles of combinatorics can automate the process of searching for the best model to forecast the future. However, defining the goal function for what constitutes the "best" model raises philosophical issues. The document also examines how consistency controlled modeling can help improve strategic decision making at both the social and enterprise levels, and discusses the role of the CIO in collecting and organizing the internal and external data needed to build such models for agricultural enterprises.
This document discusses the role of users in the decision tree data mining process. Some key points:
1. Building decision trees is often viewed as an automatic process, but the document argues that users play an important role at various steps, including data preparation, choosing attribute selection criteria and pruning methods, and interpreting results.
2. There are many options for attribute selection criteria and pruning methods, and the best choice depends on factors like the certainty of the data and desired interpretability. Users must make choices at these steps.
3. Associated tasks like data cleaning, attribute encoding, and analyzing results are also important but often overlooked. These tasks require significant user input and influence the final results. The document aims to emphasize
This document provides a summary of Dragomir R. Radev's research interests, employment history, education, funding, patents, and publications. Specifically:
- Radev's research interests include information retrieval, natural language processing, text summarization, and artificial intelligence.
- He is currently an Associate Professor at the University of Michigan across several departments and centers. He has also held positions at IBM and Columbia University.
- Radev received his Ph.D. in Computer Science from Columbia University and has received funding from several agencies including NSF and NIH.
- He holds one U.S. patent and has one additional patent pending. Radev has authored or co-authored over 15
This document provides a curriculum vitae for Ido Dagan. It outlines his personal details, education history, employment history, teaching experience, research interests, and publications. Ido Dagan received his PhD in Computer Science from Technion - Israel Institute of Technology in 1992. He has held various academic and industrial positions, including Vice President of Technology at LingoMotors and Founding CTO of FocusEngine. His research interests include natural language processing with a focus on empirical and machine learning methods. He has numerous publications in journals, books, and conferences.
The document provides an overview of symbolic machine learning approaches. It discusses how machine learning allows systems to learn intelligent behavior from experience rather than being explicitly programmed. It outlines different types of symbolic learning approaches that will be covered, including inductive learning, rote learning, nearest neighbor classification, and Bayesian classification. Key algorithms like nearest neighbor and examples of their applications are also summarized.
Venture capital is high-risk financing that seeks high returns of 50% or more annually. It is best for businesses with heavy R&D needs, large market opportunities, or those requiring a lot of working capital like biotech, semiconductors, or Amazon. However, VC financing greatly dilutes the founder's ownership and gives investors control. It is a lengthy process that is not suitable for startups that are too early, small, or without barriers to entry. Alternative options include bootstrapping, self-financing, funding from friends and family, finding partners, or acquiring an existing business to avoid high dilution from venture capital.
world history.docx (19K) - Get Online Tutoring butest
The document contains 37 multiple choice questions about world history. The questions cover a wide range of topics including social conditions during the Renaissance, achievements of ancient civilizations like Greece and Rome, the development of early civilizations and agriculture, major religious and intellectual movements like the Protestant Reformation and Scientific Revolution, imperialism and colonialism, and the causes and consequences of the two World Wars. The questions test understanding of key historical periods, influential individuals, social and political systems, as well as economic and environmental factors that have shaped world history.
This document provides contact information for a designer who creates landmark calendars and posters. It lists the designer's email and phone number and indicates they are a supervisor in the hi-res department and senior designer.
This document summarizes the ICOM project which researched computational intelligence, its principles, and applications. The project developed and implemented neural, symbolic, and hybrid systems including theory refinement systems, ANN compilers, genetic algorithms, and applications in various domains. Key developments included the CIL2P system which combines logic programming and neural networks, and rule extraction methods to explain neural network decisions. The combinatorial neural model was also investigated as a way to integrate neural and symbolic processing for classification tasks.
Mr. Arsalan Siddiq worked as an Android intern from May to July 2016, where he created tasks to learn Android basics and logic building. During his internship, his employer found him to be hard-working and intelligent. The employer wished Mr. Siddiq best of luck in his future career pursuits.
This document discusses genetic algorithms and how they are used for concept learning. It explains that genetic algorithms are inspired by biological evolution and use selection, crossover, and mutation to iteratively update a population of hypotheses. It then describes how genetic algorithms work, including representing hypotheses, genetic operators like crossover and mutation, fitness functions, and selection methods. Finally, it provides an example of a genetic algorithm called GABIL that was used for concept learning tasks.
This document describes a genetic algorithm approach for automatically generating fuzzy rules for classification. It proposes using genetic algorithms to evolve an optimal set of rules, with rule importance as the fitness criteria. Rule importance is calculated based on the rule's support for each class. The algorithm encodes rules using fuzzy membership values, and stops when the minimum number of fired rules is reached. It was shown to generate consistent results with an optimal set of rules for classification problems.
Methods of Combining Neural Networks and Genetic AlgorithmsESCOM
1. The document discusses methods for combining neural networks and genetic algorithms. It describes three main approaches: evolving connection weights, evolving network architectures, and evolving learning rules.
2. In the first approach, a genetic algorithm is used as the learning rule to optimize neural network weights. The second approach uses a genetic algorithm to select structural parameters of the network and trains the network separately. The third approach evolves parameters of the network's learning rule.
3. Combining neural networks and genetic algorithms can help compensate for each techniques' weaknesses in search, and may lead to highly successful adaptive systems by integrating learning and evolutionary search.
The Evaluated Measurement of a Combined Genetic Algorithm and Artificial Immu...IJECEIAES
This paper demonstrates a hybrid between two optimization methods which are the Artificial Immune System (AIS) and Genetic Algorithm (GA). The novel algorithm called the immune genetic algorithm (IGA), provides improvement to the results that enable GA and AIS to work separately which is the main objective of this hybrid. Negative selection which is one of the techniques in the AIS, was employed to determine the input variables (populations) of the system. In order to illustrate the effectiveness of the IGA, the comparison with a steady-state GA, AIS, and PSO were also investigated. The testing of the performance was conducted by mathematical testing, problems were divided into single and multiple objectives. The five single objectives were then used to test the modified algorithm, the results showed that IGA performed better than all of the other methods. The DTLZ multi-objective testing functions were then used. The result also illustrated that the modified approach still had the best performance.
This document discusses fuzzy genetic algorithms (FGAs), which combine fuzzy logic and genetic algorithms. It provides definitions of fuzzy logic and genetic algorithms. Fuzzy logic handles imprecise variables between 0 and 1, while genetic algorithms use techniques like selection, crossover and mutation to evolve solutions. The document notes that FGAs use fuzzy logic techniques to improve genetic algorithm behavior and components. It describes different FGA approaches and lists application sectors like engineering and economics.
This document provides an introduction to the differential evolution algorithm (DEA) and its implementation in MATLAB. It defines DEA as a population-based, direct search algorithm used to optimize global functions. The basic steps of DEA are described as initializing a population, evaluating it, and then iteratively mutating, recombining, and selecting new candidate solutions until a termination criterion is met. Key aspects of DEA covered include its mutation operation based on differences between random vectors, advantages like simplicity and robustness, and how it uses populations of candidate solutions. Implementation details discussed include population structure/parameter limits, the mutation, selection, and recombination processes. The document concludes by noting how to implement this algorithm in MATLAB.
rule refinement in inductive knowledge based systemsarteimi
This document summarizes a paper about refining inductive knowledge-based systems through empirical methods. It describes a system called ARIS that interleaves learning and performance evaluation to allow accurate classifications on real-world datasets. ARIS employs an ordering of rules according to their learned weights, which are calculated using Bayes' theorem. The system focuses rule analysis and refinement by heuristics. It has been used to refine knowledge bases from C4.5 and RIPPER, showing improved accuracy and the ability to gradually enhance the knowledge base during refinement.
This document provides information about genetic algorithms including:
1. Definitions of genetic algorithms from Grefenstette and Goldberg that describe genetic algorithms as search algorithms based on biological evolution and natural selection.
2. An overview of genetic algorithms including the basic concepts of populations, chromosomes, genes, fitness functions, selection, crossover, and mutation.
3. Examples of genetic representations like binary encoding and permutation encoding.
4. Descriptions of genetic operators like selection, crossover, and mutation that maintain genetic diversity between generations.
A Genetic Algorithm on Optimization Test FunctionsIJMERJOURNAL
ABSTRACT: Genetic Algorithms (GAs) have become increasingly useful over the years for solving combinatorial problems. Though they are generally accepted to be good performers among metaheuristic algorithms, most works have concentrated on the application of the GAs rather than the theoretical justifications. In this paper, we examine and justify the suitability of Genetic Algorithms in solving complex, multi-variable and multi-modal optimization problems. To achieve this, a simple Genetic Algorithm was used to solve four standard complicated optimization test functions, namely Rosenbrock, Schwefel, Rastrigin and Shubert functions. These functions are benchmarks to test the quality of an optimization procedure towards a global optimum. We show that the method has a quicker convergence to the global optima and that the optimal values for the Rosenbrock, Rastrigin, Schwefel and Shubert functions are zero (0), zero (0), -418.9829 and -14.5080 respectively
An integrated mechanism for feature selectionsai kumar
This document discusses an integrated mechanism for feature selection and fuzzy rule extraction for classification problems. It aims to select a useful set of features that can solve the classification problem while designing an interpretable fuzzy rule-based system. The mechanism is an embedded feature selection method, meaning feature selection is integrated into the rule base formation process. This allows it to account for possible nonlinear interactions between features and between features and the modeling tool. The authors demonstrate the effectiveness of the proposed method on several datasets.
Predictive job scheduling in a connection limited system using parallel genet...Mumbai Academisc
The document discusses predictive job scheduling in a connection limited system using parallel genetic algorithms. It introduces the problem of job scheduling in parallel computing systems and describes existing non-predictive greedy algorithms. The proposed approach uses genetic algorithms to develop a predictive model for job scheduling that learns from previous experiences to improve scheduling efficiency over time. The goal is to schedule jobs in a way that optimizes system metrics like utilization and throughput while minimizing user metrics like turnaround time.
The document summarizes an optimization problem that uses genetic algorithms to optimize De Jong's function. De Jong's function is a commonly used benchmark function for testing continuous optimization algorithms. The genetic algorithm is applied using different selection schemes like roulette wheel selection, random selection, and best fit selection. The performance of these selection methods is analyzed by running the genetic algorithm with each method for various numbers of iterations and recording the resulting fitness values. Best fit selection consistently produced the best fitness values compared to the other selection methods.
Gene Selection for Sample Classification in Microarray: Clustering Based MethodIOSR Journals
This document describes a clustering-based method for gene selection to classify samples in microarray data. It involves calculating the relevance of each gene to class labels and the redundancy between genes using mutual information. Genes are clustered based on their relevance, with the most relevant gene selected as the cluster representative. Min-hash clustering is then applied to reduce redundant genes and cluster size. The goal is to select a minimal set of non-redundant genes that can accurately classify samples by reducing noise from irrelevant genes.
Biology-Derived Algorithms in Engineering OptimizationXin-She Yang
The document discusses biology-derived algorithms and their applications in engineering optimization. It describes several biology-inspired algorithms including genetic algorithms, photosynthetic algorithms, neural networks, and cellular automata. Genetic algorithms and photosynthetic algorithms are discussed in more detail. The document also provides examples of how these algorithms can be applied to problems in engineering optimization, such as parameter estimation and inverse analysis.
An interactive approach to multiobjective clustering of gene expression patternsRavi Kumar
This document describes an interactive genetic algorithm-based multi-objective approach to cluster gene expression patterns. The proposed Interactive Multi-Objective Clustering (IMOC) algorithm simultaneously evolves the set of validity measures to optimize and finds the clustering solution. It takes input from a human decision maker during execution to learn the best validity measures and clustering for the gene expression data. The algorithm is applied to benchmark gene expression datasets and shows more biologically significant clusters than other clustering algorithms.
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsDrjabez
1. The document proposes a genetic-fuzzy based method for automatic intrusion detection using network datasets. It combines fuzzy set theory with genetic algorithms to extract rules for both discrete and continuous attributes to detect normal and intrusion patterns.
2. The method was tested on KDD99 Cup and DARPA98 network intrusion detection datasets and showed high detection rates with low false alarm rates for both misuse detection and anomaly detection.
3. By extracting many rules to represent normal network behavior patterns, the proposed genetic-fuzzy approach can detect new or unknown intrusions based on anomalies without requiring prior domain expertise on intrusion patterns.
Multi-Population Methods with Adaptive Mutation for Multi-Modal Optimization ...ijscai
This paper presents an efficient scheme to locate multiple peaks on multi-modal optimization problems by
using genetic algorithms (GAs). The premature convergence problem shows due to the loss of diversity,
the multi-population technique can be applied to maintain the diversity in the population and the
convergence capacity of GAs. The proposed scheme is the combination of multi-population with adaptive
mutation operator, which determines two different mutation probabilities for different sites of the
solutions. The probabilities are updated by the fitness and distribution of solutions in the search space
during the evolution process. The experimental results demonstrate the performance of the proposed
algorithm based on a set of benchmark problems in comparison with relevant algorithms.
Dynamic Radius Species Conserving Genetic Algorithm for Test Generation for S...ijseajournal
This document summarizes a research paper that proposes a new approach called Dynamic-radius Species-conserving Genetic Algorithm (DSGA) for generating structural test cases using a genetic algorithm. DSGA aims to generate a complete test suite with a single run by finding test cases that cover different areas of the program structure. It begins by finding test cases that cover some areas, then excludes those areas to search for test cases covering other uncovered areas, similar to how humans generate structural test cases. The paper evaluates DSGA on the Triangle Classification algorithm and finds it able to generate a complete test suite without limitations of other genetic algorithm approaches for structural test case generation.
This paper proposes a parallel evolutionary algorithm to solve single variable optimization problems. Specifically:
- It presents a genetic algorithm approach that runs in parallel using a master-slave model, where the master performs genetic operations and distributes individuals to slaves for evaluation.
- The algorithm is tested on single variable optimization problems to find minimum/maximum values.
- Experimental results show the parallel genetic algorithm is effective at finding optimal solutions to these problems and represents an efficient parallel approach for optimization.
Este documento analiza el modelo de negocio de YouTube. Explica que YouTube y otros sitios de video online representan un nuevo modelo de negocio para contenidos audiovisuales debido al cambio en los hábitos de consumo causado por las nuevas tecnologías. Describe cómo YouTube aprovecha la participación de los usuarios para mejorar continuamente y atraer una audiencia diferente a la de los medios tradicionales.
The defense was successful in portraying Michael Jackson favorably to the jury in several ways:
1) They dressed Jackson in ornate costumes that conveyed images of purity, innocence, and humility.
2) Jackson was shown entering the courtroom as if on a red carpet, emphasizing his celebrity status.
3) Jackson appeared vulnerable, childlike, and in declining health during the trial, eliciting sympathy from jurors.
4) Defense attorney Tom Mesereau effectively presented a coherent narrative of Jackson as a victim and portrayed Neverland as a place of refuge, undermining the prosecution's arguments.
Michael Jackson was born in 1958 in Gary, Indiana and rose to fame in the 1960s as the lead singer of The Jackson 5, topping music charts in the 1970s. As a solo artist in the 1980s, his album Thriller broke music records. In the 1990s and 2000s, Jackson faced several legal issues related to child abuse allegations while continuing to release music. He married Lisa Marie Presley and Debbie Rowe and had two children before his death in 2009.
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
This document appears to be a list of popular books from various authors. It includes over 150 book titles across many genres such as fiction, non-fiction, memoirs, and novels. The books cover a wide range of topics from politics to cooking to autobiographies.
The prosecution lost the Michael Jackson trial due to several key mistakes and weaknesses in their case:
1) The lead prosecutor, Thomas Sneddon, was too personally invested in the case against Jackson, having pursued him for over a decade without success.
2) Sneddon's opening statement was disorganized and weak, failing to effectively outline the prosecution's case.
3) The accuser's mother was not credible and damaged the prosecution's case through her erratic testimony, history of lies and con artist behavior.
4) Many prosecution witnesses were not credible due to prior lawsuits against Jackson, debts owed to him, or having been fired by him. Several witnesses even took the Fifth Amendment.
Here are three examples of public relations from around the world:
1. The UK government's "Be Clear on Cancer" campaign which aims to raise awareness of cancer symptoms and encourage early diagnosis.
2. Samsung's global brand marketing and sponsorship activities which aim to increase brand awareness and favorability of Samsung products worldwide.
3. The Brazilian government's efforts to improve its international image and relations with other countries through strategic communication and diplomacy.
The three most important functions of public relations are:
1. Media relations because the media is how most organizations reach their key audiences. Strong media relationships are crucial.
2. Writing, because written communication is at the core of public relations and how most information is
Michael Jackson Please Wait... provides biographical information about Michael Jackson including his birthdate, birthplace, parents, height, interests, idols, favorite foods, films, and more. It discusses his background, career highlights including influential albums like Thriller, and films he appeared in such as The Wiz and Moonwalker. The document contains photos and details about Jackson's life and illustrious music career.
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
The document discusses the process of manufacturing celebrity and its negative byproducts. It argues that celebrities are rarely the best in their individual pursuits like singing, dancing, etc. but become famous due to being products of a system controlled by wealthy elites. This system stifles opportunities for worthy artists and creates feudalism. The document also asserts that manufactured celebrities should not be viewed as role models due to behaviors like drug abuse and narcissism that result from the celebrity-making process.
Michael Jackson was a child star who rose to fame with the Jackson 5 in the late 1960s and early 1970s. As a solo artist in the 1970s and 1980s, he had immense commercial success with albums like Off the Wall, Thriller, and Bad, which featured hit singles and groundbreaking music videos. However, his career and public image were plagued by controversies related to allegations of child sexual abuse in the 1990s and 2000s. He continued recording and performing but faced ongoing media scrutiny into his private life until his death in 2009.
Social Networks: Twitter Facebook SL - Slide 1butest
The document discusses using social networking tools like Twitter and Facebook in K-12 education. Twitter allows students and teachers to share short updates and can be used to give parents a window into classroom activities. Facebook allows targeted advertising that could be used to promote educational activities. Both tools could help facilitate communication between schools and communities if used properly while managing privacy and security concerns.
Facebook has over 300 million active users who log on daily, and allows brands to create public profile pages to interact with users. Pages are for brands and organizations only, while groups can be made by any user about any topic. Pages do not show admin names and have no limits on fans, while groups display admin names and are limited to 5,000 members. Content on pages should aim to provoke action from subscribers and establish a regular posting schedule using a conversational tone.
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Hare Chevrolet is a car dealership located in Noblesville, Indiana that has successfully used social media platforms like Twitter, Facebook, and YouTube to create a positive brand image. They invest significant time interacting directly with customers online to foster a sense of community rather than overtly advertising. As a result, Hare Chevrolet has built a large, engaged audience on social media and serves as a model for how brands can use online presences strategically.
Welcome to the Dougherty County Public Library's Facebook and ...butest
This document provides instructions for signing up for Facebook and Twitter accounts. It outlines the sign up process for both platforms, including filling out forms with name, email, password and other details. It describes how the platforms will then search for friends and suggest people to connect with. It also explains how to search for and follow the Dougherty County Public Library page on both Facebook and Twitter once signed up. The document concludes by thanking participants and providing a contact for any additional questions.
Paragon Software announces the release of Paragon NTFS for Mac OS X 8.0, which provides full read and write access to NTFS partitions on Macs. It is the fastest NTFS driver on the market, achieving speeds comparable to native Mac file systems. Paragon NTFS for Mac 8.0 fully supports the latest Mac OS X Snow Leopard operating system in 64-bit mode and allows easy transfer of files between Windows and Mac partitions without additional hardware or software.
This document provides compatibility information for Olympus digital products used with Macintosh OS X. It lists various digital cameras, photo printers, voice recorders, and accessories along with their connection type and any notes on compatibility. Some products require booting into OS 9.1 for software compatibility or do not support devices that need a serial port. Drivers and software are available for download from Olympus and other websites for many products to enable use with OS X.
To use printers managed by the university's Information Technology Services (ITS), students and faculty must install the ITS Remote Printing software on their Mac OS X computer. This allows them to add network printers, log in with their ITS account credentials, and print documents while being charged per page to funds in their pre-paid ITS account. The document provides step-by-step instructions for installing the software, adding a network printer, and printing to that printer from any internet connection on or off campus. It also explains the pay-in-advance printing payment system and how to check printing charges.
The document provides an overview of the Mac OS X user interface for beginners, including descriptions of the desktop, login screen, desktop elements like the dock and hard disk, and how to perform common tasks like opening files and folders. It also addresses frequently asked questions for Windows users switching to Mac OS X, such as where documents are stored, how to save or find documents, and what the equivalent of the C: drive is in Mac OS X. The document concludes with sections on file management tasks like creating and deleting folders, organizing files within applications, using Spotlight search, and an overview of the Dashboard feature.
This document provides a checklist for securing Mac OS X version 10.5, focusing on hardening the operating system, securing user accounts and administrator accounts, enabling file encryption and permissions, implementing intrusion detection, and maintaining password security. It describes the Unix infrastructure and security framework that Mac OS X is built on, leveraging open source software and following the Common Data Security Architecture model. The checklist can be used to audit a system or harden it against security threats.
This document summarizes a course on web design that was piloted in the summer of 2003. The course was a 3 credit course that met 4 times a week for lectures and labs. It covered topics such as XHTML, CSS, JavaScript, Photoshop, and building a basic website. 18 students from various majors enrolled. Student and instructor evaluations found the course to be very successful overall, though some improvements were suggested like ensuring proper software and pairing programming/non-programming students. The document also discusses implications of incorporating web design material into existing computer science curriculums.
1. An Entropy-based Adaptive Genetic Algorithm for Learning
Classification Rules
Linyu Yang Dwi H. Widyantoro Thomas Ioerger John Yen
Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science
Texas A&M University Texas A&M University Texas A&M University Texas A&M University
College Station, TX 77843 College Station, TX 77843 College Station, TX 77843 College Station, TX 77843
lyyang@cs.tamu.edu dhw7942@cs.tamu.edu ioerger@cs.tamu.edu yen@cs.tamu.edu
Abstract- Genetic algorithm is one of the commonly used instead, the consequent of a rule is determined by the
approaches on data mining. In this paper, we put majority of training examples it matches.
forward a genetic algorithm approach for classification Four important factors are considered in our evaluation
problems. Binary coding is adopted in which an functions. Error rate is calculated by the predicting results
individual in a population consists of a fixed number of on the training examples. Entropy is used to measure the
rules that stand for a solution candidate. The evaluation homogeneity of the examples that a certain rule matches.
function considers four important factors which are Rule consistency is a measure on the consistency of
error rate, entropy measure, rule consistency and hole classification conclusions of a certain training example
ratio, respectively. Adaptive asymmetric mutation is given by a set of rules. Finally, hole ratio is used to evaluate
applied by the self-adaptation of mutation inversion the percentage of training examples that a set of rules does
probability from 1-0 (0-1). The generated rules are not not cover. We try to include related information as complete
disjoint but can overlap. The final conclusion for as possible in the evaluation function so that the overall
prediction is based on the voting of rules and the performance of a rule set can be better.
classifier gives all rules equal weight for their votes. An adaptive asymmetric mutation operator is applied in
Based on three databases, we compared our approach our reproduction step. When a bit is selected to mutate, the
with several other traditional data mining techniques inversion probability from 1-0 (0-1) is not 50% as usual.
including decision trees, neural networks and naive The value of this probability is asymmetric and self-
bayes learning. The results show that our approach adaptive during the running of the program. This is made to
outperformed others on both the prediction accuracy reach the best match of a certain rule to training examples.
and the standard deviation. For crossover, two-point crossover is adopted in our
approach.
Keywords: genetic algorithm, adaptive asymmetric We used three real databases to test our approach: credit
mutation, entropy, voting-based classifier database, voting database and heart database. We compared
our performance with four well-known methods from data
1. Introduction mining, namely Induction Decision Trees (ID3) (Quinlan,
1986), ID3 with Boosting (Quinlan, 1996), Neural
Genetic algorithms have been successfully applied to a wide Networks, and Naive Bayes (Mitchell, 1997). The
range of optimization problems including design, appropriate state-of-the-art techniques are incorporated in
scheduling, routing, and control, etc. Data mining is also these non-GA methods to improve their performances. The
one of the important application fields of genetic algorithms. results show that our GA approach outperformed other
In data mining, GA can be used to either optimize approaches on both the prediction accuracy and the standard
parameters for other kinds of data mining algorithms or deviation.
discover knowledge by itself. In this latter task the rules that
In the rest of the paper, we will first give a brief overview
GA found are usually more general because of its global
of related work. Our GA approach is then discussed in
search nature. In contrast, most other data mining methods
detail. The results of our application on three real databases
are based on the rule induction paradigm, where the
and the comparison with other data mining methods are
algorithm usually performs a kind of local search. The
followed. Finally we make some concluding remarks.
advantage of GA becomes more obvious when the search
space of a task is large.
In this paper we put forward a genetic algorithm approach 2. Related works
for classification problems. First we use binary coding in Many researchers have contributed to the application of GA
which an individual solution candidate consists of a fixed on data mining. In this section, we will give a brief
number of rules. In each rule, k bits are used for the possible overview on a few representative works.
k values of a certain attribute. Continuous attributes are In early 90’s, De Jong et al. implemented a GA-based
modified to threshold-based boolean attributes before system called GABIL that continually learns and refines
coding. Rule consequent is not explicitly coded in the string, concept classification rules (De Jong, 1993). An individual
2. is a variable-length string representing a set of fixed-length classification problems. Some parameter values used in the
rules. Traditional bit inversion is used on mutation. In algorithm might be task dependent.
crossover, corresponding crossover points in the two parents
3.1 Individual’s encoding
must semantically match. The fitness function contains only
Each individual in the population consists of a fixed number
the percent of correct examples classified by an individual
of rules. In other words, the individual itself is a complete
rule set. They compared the performance of GABIL with
solution candidate. In our current implementation, we set
that of four other traditional concept learners on a variety of
this fixed number as 10 which well satisfies the requirement
target concepts.
of our testing databases. The antecedent of a certain rule in
In GIL (Janikow, 1993), an individual is also a set of
the individual is formed by a conjunction of n attributes,
rules, but attributes values are encoded directly, rather than
where n is number of attributes being mined. K bits will be
bits. GIL has special genetic operators for handling rule
used to stand for an attribute if this attribute has k possible
sets, rules, and rule conditions. The operators can perform
values. Continuous attributes will be partitioned to
generalization, specialization or other operations. Besides
threshold-based boolean attribute in which the threshold is a
the correctness, the evaluation function of GIL also includes
boundary (adjacent examples across this boundary differ in
the complexity of a rule set. Therefore it favors correct,
their target classification) that maximizes the information
simple (short) rules.
gain. Therefore two bits will be used for a continuous
Greene and Smith put forward a GA-based inductive
attribute. The consequent of a rule is not explicitly encoded
system called COGIN (Greene, 1993). In contrast to the
in the string. In contrast, it is automatically given based on
above two approaches, the system’s current model at any
the proportion of positive/negative examples it matches in
point during the search is represented as a population of
the training set. We will illustrate the encoding method by
fixed length rules. The population size (i.e., the number of
the following example.
rules in the model) will vary from cycle to cycle as a
Suppose our task has three attributes, and they have 4, 2,
function of the coverage constraint is applied. The fitness
5 possible values respectively. Then an individual in the
function contains the information gain of a rule R and a
population can be represented as following:
penalty of the number of misclassifications made by R.
A1 A2 A3 A1 A2 A3 A1 A2 A3
Entropy measure is used to calculate the information gain of
0110 11 10110 1110 01 10011 …… 1100 11 01110
rule R based on the number of examples rule R Matched
Rule 1 Rule 2 …… Rule 10
and Unmatched. However, it couldn’t evaluate the entropy
Ten rules are included in this individual. The architecture
measure of the entire partition formed by the classification
of each rule is same. We will use rule 1 to explain the
due to its encoding method.
meaning of encoding. In this example, the meaning of the
In view of the situation that most of the data mining work
antecedent of rule 1 is:
emphasizes only on the predictive accuracy and
If (A1=value 2 OR value 3) AND (A2=value 1 OR value 2)
comprehensibility, Noda et al. (Noda, 1999) put forward a
AND (A3=value 1 OR value 3 OR value 4)
GA approach designed to discover the interesting rules. The
If all the bits belong to one attribute are 0s, it means that
fitness function consists of two parts. The first one measures
attribute can not equal to any possible value therefore this is
the degree of interestingness of the rule, while the second
meaningless. To avoid this, we add one step before the
measures its predictive accuracy. The computation of the
evaluation of the population. We will check each rule in
consequent’s degree of interestingness is based on the
each individual one by one, if the above case happens, we
following idea: the larger the relative frequency (in the
will randomly select one bit of that attribute and change it to
training set) of the value being predicted by the consequent,
one.
the less interesting it is. In other words, the rarer a value of a
The consequent of a rule is not encoded in the string. It
goal attribute, the more interesting a rule predicting it is.
will be determined by the proportion situation of the training
Since the values of the goal attribute in the databases we
examples that rule matches. Suppose i is one of the
tested are not unevenly distributed seriously, we didn’t
classifications, the consequent of a rule will be i if
specially consider interestingness of a rule in our current
N matched _ i N training _ i
implementation but mainly focus on including related > , (1)
factors as complete as possible to improve the predicting N matched N training
accuracy. However, we did consider how to treat uneven N matched _ i
where is the number of examples whose
distribution of goal attribute values somehow. We will
discuss this in details in the next section. classification is i and matched by that rule, N matched is the
total number of examples that the rule matches; N training _ i is
3. Our GA approach the number of training examples whose classification is i,
In this section we present our GA approach for N training is the total number of training examples.
classification problem. The key idea of the algorithm is For example, if the distribution of the positive examples
general and should be applicable for various kinds of and negative examples in the training set is 42% and 58%,
and among the examples of rule 1 matches positive /
3. negative examples are half to half, then the consequent of n
rule 1 should be positive because 0.5 > 0.42. Since the Entropy ( R ) = − ∑(p i log 2 ( p i )) (4)
testing databases we use at this time don’t have a very i =1
uneven distribution on the classification of training where n is the number of target classifications.
examples, in our current implementation we didn’t specially While an individual consists of a number of rules, the
consider the interestingness of rules but use this strategy to entropy measure of an individual is calculated by averaging
keep enough rules to match examples with minor the entropy of each rule:
NR
classification. Our encoding method is not limited to two-
category classification but applicable to multi target value ∑ Entropy( R )
i =1
i
(5)
problems. Entropy (individual ) =
NR
3.2 Fitness function
where NR is number of rules in the individual (in our current
It is very important to define a good fitness function that
implementation it is 10).
rewards the right kinds of individuals. We try to consider
The rationale of using entropy measure in fitness function
affecting factors as complete as possible to improve the
is to prefer those rules that match less examples whose
results of classification. Our fitness function is defined as
target values are different from rule’s consequent. High
following:
accuracy does not implicitly guarantee the entropy measure
Fitness = Error rate + Entropy measure + Rule consistency
is good because the final classification conclusion of a
+ Hole ratio (2)
certain training example is based on the comprehensive
We will elaborate each part in the fitness function in the
results of a number of rules. It is very possible that each rule
following.
in the individual has a bad entropy measure but the whole
1) Error rate
rule set still gives the correct classification. Keeping low
It is well known that accuracy is the most important and
entropy value of an individual will be helpful to get better
commonly used measure in the fitness function as the final
predicting results for untrained examples.
goal of data mining is to get good prediction results. Since
3) Rule consistency
our objective function here is minimization, we use error
As stated in the above sections, the final predicted
rate to represent this information. It is calculated as:
classification of a training example is the majority
Error rate = percent of misclassified examples (3)
classification made by rules in an individual. Let’s consider
If a rule matches a certain example, the classification it
the following classifications made by two individuals on an
gives is its consequent part. If it doesn’t match, no
example:
classification is given. An individual consists of a set of
Individual a: six rules +, four rules -, final
rules, the final classification predicted by this rule set is
classification: +
based on the voting of those rules that match the example.
Individual b: nine rules +, one rule -, final
The classifier gives all matching rules equal weight. For
classification: +
instance, in an individual (which has ten rules here), one
We will prefer the second individual since it is less
rule doesn’t match, six rules give positive classification and
ambiguous. To address this rule consistency issue, we add
three rules give negative classification on a training
another measure in the fitness function. The calculation is
example, then the final conclusion given by this individual
similar to the entropy measure. Let Pcorrect be the proportion
on that training example is positive. If a tie happens (i.e.,
of rules in one individual whose consequent is same with
four positive classifications and four negative
the target value of the training example, then
classifications), the final conclusion will be the majority
Rule consistency (individual ) = − p correct log 2 p correct −
classification in the training examples. If none of the rules in (6)
the individual matches that example, the final conclusion (1 − p correct ) log 2 (1 − p correct )
will also be the majority classification in the training We should notice that this formula will give the same rule
examples. The error rate measure of this individual is the consistency value when pcorrect and (1-pcorrect) switch each
percent of misclassified examples among all training other. Therefore a penalty will be given when pcorrect is
examples. smaller than 0.5. In this case Ruleconsistency = 2 - Ruleconsistency.
2) Entropy measure The above calculation is based on the predicting results for
Entropy is a commonly used measure in information theory. one training example. The complete measure of rule
Originally it is used to characterize the (im)purity of an consistency of an individual should be averaged by the
arbitrary collection of examples. In our implementation number of training examples.
entropy is used to measure the homogeneity of the examples 4) Hole ratio:
that a rule matches. The last element in the fitness function is the hole ratio. It is
Given a collection S, containing the examples that a a measure of rule’s coverage on training examples.
certain rule R matches, let Pi be the proportion of examples Coverage is not a problem for traditional inductive learning
in S belonging to class i, then the entropy Entropy(R) related methods like decision trees, since the process of creating
to this rule is defined as: trees guarantees that all the training examples will be
4. covered in the tree. However, this also brings a new next generation and calculate the average fitness of the new
problem that it may be sensitive to noise. GA approach does generation.
not guarantee that the generated rules will cover all the 3) If the fitness is better (value is smaller), continue on
training examples. This allows flexibility and may be this direction and the amount of change is:
potentially useful for future prediction. In real ∆p = max{0.05, (1- fitnessnew / fitnessold) * 0.1} (8)
implementation we still hope the coverage should reach a If the fitness is worse (value is larger), reverse the
certain point. For instance, if a rule only matches one direction and the amount of change is:
training example and its consequent is correct, the accuracy ∆p = max{0.05, (fitnessnew / fitnessold - 1) * 0.05} (9)
and entropy measure of this rule are both excellent but we Use the new probability to produce the next generation
do not prefer this rule because its coverage is too low. and calculate the average fitness of the new generation.
In our fitness function the hole ratio equals to 1-coverage, Repeat step 3 until the program ends.
in which the latter is calculated by the union of examples
that are matched and also correctly predicted by the rules in 4. Results and discussions
an individual. Totally misclassified examples (not classified
correctly by any rule in the individual) will not be included We tested our approach on three real databases. We
even though they are matched by some rules. The following compared our approach with four other traditional data
is the formula to calculate the hole ratio for binary mining techniques. This section will present the testing
classification problem (positive, negative). results.
4.1 The information of databases
P
i
i
+
+ N
i
i
−
(7) 1) Credit database
Hole = 1 − This database concerns credit card applications. All attribute
S names and values have been changed to meaningless
+
where Pi stands for those examples whose target value is symbols to protect confidentiality of the data. This database
is interesting because there is a good mix of attributes ----
positive and classified as positive by at least one rule in the
continuous, nominal with small numbers of values, and
individual, N i− stands for those examples whose target nominal with larger numbers of values. There are 15
value is negative and classified as negative by at least one attributes plus one target attribute. Total number of
rule in the individual. S is the total number of training instances is 690.
2) Voting database
examples.
This database saves 1984 United States Congressional
3.3 Adaptive asymmetric mutation Voting Records. The data set includes votes for each of the
In our reproduction step, traditional bit inversion is used on U.S. House of Representatives Congressmen on the 16 key
mutation. However, we found many examples will not be votes identified by the CQA. The CQA lists nine different
matched if we keep number of 1’s and 0’s approximately types of votes: voted for, paired for, and announced for
equivalent in an individual (i.e., the inversion probability (these three simplified to yea), voted against, paired against,
from 1-0 and 0-1 are both 50%). The learning process will and announced against (these three simplified to nay), voted
become a majority guess if there are too many unmatched present, voted present to avoid conflict of interest, and did
examples. Therefore, we put forward a strategy of adaptive not vote or otherwise make a position known (these three
asymmetric mutation in which the inversion probability simplified to an unknown disposition). There are 16
from 1-0 (0-1) is self-adaptive during the process of run. attributes plus one target attribute. Total number of
The asymmetric mutation biases the population toward instances is 435 (267 democrats, 168 republicans).
generating rules with more coverage on training examples. 3) Heart database
The self-adaptation of inversion probability makes the This database concerns heart disease diagnosis. The data
optimal mutation parameter be automatically adjusted. was provided by V.A. Medical Center, Long Beach and
We presented an adaptive simplex genetic algorithm Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
before (Yang, 2000) in which the percentage of simplex There are 14 attributes plus one target attribute. Total
operator is self-adaptive during the process of run. Similar number of instances is 303.
idea is adopted here that average of fitness is used as a
4.2 The description of non-GA approaches
feedback to adjust the inversion probability. The process of
We used four well-known methods from machine learning,
self-adaptation is described as following:
namely Induction Decision Trees (ID3) (Quinlan, 1986),
1) An initial inversion probability is set (e.g., 0.5 for 1-0).
Decision Trees with Boosting (Quinlan, 1996), Neural
Use this probability on mutation to produce a new
Networks, and Naïve Bayes (Mitchell, 1997), to compare
generation. Calculate the average fitness of this generation.
the performance of our improved GA. Appropriate state-of-
2) Randomly select the direction of changing this
the-art techniques have been incorporated in most of the
probability (increase, decrease). Modify the probability
non-GA methods to improve their performance. The
along that direction with a small amount (0.02 in our current
implementation). Use the new probability to produce the
5. following is a description of the non-GA approaches we instance class is selected from the maximum weighted
used for the performance comparison studies. average of the predicted class over all decision trees.
1) Induction Decision Trees 3) Neural Networks
The construction of a decision tree is divided into two Inspired in part by biological learning systems, neural
stages. First, creating an initial, large decision tree using a networks approach is built from a densely interconnected set
set of training set. Second, pruning the initial decision tree, of simple units. Since this technique offers many design
if applicable, using a validation set. Given a noise-free selections, we fixed some of them to the ones that had been
training set, the first stage will generate a decision tree that well proven to be good or acceptable design choices. In
can classify correctly all examples in the set. Except that the particular, we use a network architecture with one hidden
training set covers all instances in the domain, the initial layer, the back-propagation learning algorithm (Rumelhart,
decision tree generated will over fit the training data, which Hinton & William, 1986), the delta-bar-delta adaptive
then reduce its performance in the test data. The second learning rates (Jacobs, 1988), and the Nguyen-Widrow
stage helps alleviate this problem by reducing the size of the weight initialization (Nguyen & Widrow, 1990). Discrete-
tree. This process has an effect in generalizing the decision valued attributes fed into the networks input layer is
tree that hopefully could improve its performance in the test represented as 1-of-N encoding using bipolar values to
data. denote the presence (e.g., value 1) and the absence (e.g.,
During the construction of an initial decision tree, the value –1) of an attribute value. Continuous-valued attribute
selection of the best attribute is based on either the is scaled into a real value in the range [–1, 1]. We varied the
information gain (IG) or gain ratio (GR). A binary split is design choices for batch versus incremental learning, and 1-
applied in nodes with continuous-valued attributes. The best of-N versus single network output encoding.
cut-off value of a continuous-valued attribute is locally 4) Naive Bayes
selected within each node in the tree based on the remaining Naive Bayes classifier is a variant of the Bayesian learning
training examples. The tree’s node expansion stops either that manipulates directly probabilities from observed data
when the remaining training set is homogeneous (e.g., all and uses these probabilities to make an optimal decision.
instances have the same target attribute values) or when no This approach assumes that attributes are conditionally
attribute remains for selection. The decision on a leaf independent given a class. Based on this simplifying
resulting from the latter case is determined by selecting the assumption, the probability of observing attributes’
majority of the target attribute value in the remaining conjunction is the product of the probabilities for the
training set. individual attributes. Given an instance with a set of
Decision tree pruning is a process of replacing sub-trees attribute-value pairs, the Naive Bayes approach will choose
with leaves to reduce the size of the decision tree while a class that maximizes the conditional probability of the
retaining and hopefully increasing the accuracy of tree's class given the conjunction of attributes values. Although in
classification. To obtain the best result from the induction practice the independence assumption is not entirely correct,
decision tree method, we varied the use of pruning it does not necessarily degrade the system performance
algorithm to the initial decision tree. We considered using (Domingos & Pazzani, 1997).
the following decision trees pruning algorithms: critical- We also assume that the values of continuous-valued
value pruning (Mingers, 1987), minimum-error pruning attributes follow Gaussian distribution. Hence, once the
(Niblett & Bratko, 1986), pessimistic pruning, cost- mean and the standard deviation of these attributes are
complexity pruning and reduced-error pruning (Quinlan, obtained from the training examples, the probability of the
1987),. corresponding attributes can be calculated from the given
2) Induction Decision Trees with Boosting attribute values. To avoid zero frequency count problem that
Decision Tree with Boosting is a method that generates a can dampen the entire probabilities calculation, we use an
sequence of decision trees from a single training set by re- m-estimate approach in calculating probabilities of discrete-
weighting and re-sampling the samples in the set (Quinlan, valued attributes (Mitchell, 1997).
1996; Freund & Schapire, 1996). Initially, all samples in the
4.3 Comparison results
training set are equally weighted so that their sum is one.
For each database, k-fold cross-validation method is used
Once a decision tree has been created, the samples in the
for evaluation. In this method, a data set is divided equally
training set are re-weighted in such a way that misclassified
into k disjoint subsets. k experiments are then performed
examples will get higher weights than the ones that are
using k different training-test set pairs. A training-test set
easier to classify. The new samples weights are then
pair used in each experiment is generated by using one of
renormalized, and next decision tree is created using higher-
the k subsets as the test set, and using the remaining subsets
weighted samples in the training set. In effect, this process
as the training set. Given k disjoint subsets, for example, the
enforces the more difficult samples to be learned more
first experiment takes the first subset for the test set, and
frequently by decision trees. The trees generated are then
uses the second subset through the kth subset for the training
given weights in accordance with their performance in the
set. The second experiment uses subset 1 and subset 3
training examples (e.g., their accuracy in correctly
through subset k for the training set; and takes subset 2 for
classifying the training data). Given a new instance, the new
6. the test set, and so on. All results from a particular database boosting, we present only the best experimental results after
are averaged along with its variance over k experiment runs. varying the data splitting methods as well as the pruning
Based on their size, the credit database and voting algorithms described earlier. Similarly, results from neural
database are partitioned into 10 disjoint subsets, the heart networks approach are selected from the ones that provide
database is partitioned into 5 subsets. Table 1, 2, 3 show the the best performance after varying the use of different
performance comparison results of different approaches on network output encoding and batch versus incremental
these three databases. For decision trees with and without learning methods.
Table 1 The comparison results on the prediction accuracy and standard deviation (%) of credit database.
Our GA Decision trees Decision trees Neural networks Naive Bayes
approach (IG, Min-Err) with boosting (1-of-N, batch
(RG, Cost-Com, learning)
21 trees)
Run 1 90.77 87.69 89.23 89.23 66.15
Run 2 89.23 84.62 86.15 86.15 78.46
Run 3 89.23 89.23 90.77 90.77 84.62
Run 4 92.31 90.77 90.77 89.23 81.54
Run 5 86.15 81.54 81.54 84.62 75.38
Run 6 89.23 87.69 87.69 87.69 80.00
Run 7 84.62 81.54 84.62 84.62 73.85
Run 8 87.69 86.15 87.69 86.15 83.08
Run 9 90.77 86.15 89.23 87.69 76.92
Run 10 86.76 88.24 91.18 86.76 75.00
Average 88.68 86.36 87.89 87.29 77.50
Standard deviation 2.37 3.06 3.08 2.03 5.36
In above table, the decision tree is generated using data splitting and cost-complexity pruning algorithm. Batch
information gain for data splitting and minimum-error learning and 1-of-N output encoding are used in neural
pruning. Decision trees with boosting generates 21 different networks.
decision trees, each is constructed by using gain ratio for
Table 2 The comparison results on the prediction accuracy and standard deviation (%) of voting database.
Our GA Decision trees Decision trees Neural networks Naive Bayes
approach (IG, Red-Err) with boosting (1-of-N, batch
(IG, Red-Err, learning)
3 trees)
Run 1 95.35 95.35 95.35 93.02 93.02
Run 2 97.67 93.02 97.67 97.67 90.70
Run 3 97.67 97.67 97.67 97.67 93.02
Run 4 95.35 95.35 97.67 97.67 88.37
Run 5 95.35 95.35 95.35 93.02 88.37
Run 6 97.67 97.67 97.67 97.67 88.37
Run 7 95.35 93.02 93.02 95.35 93.02
Run 8 97.67 95.35 95.35 95.35 90.70
Run 9 100.00 100.00 97.67 95.35 90.70
Run 10 95.83 93.75 97.92 95.03 85.42
Average 96.79 95.65 96.54 95.86 90.17
Standard deviation 1.59 2.23 1.67 1.46 2.53
In table 2, the best results from both decision trees with Only three decision trees are needed in the decision trees
and without boosting are obtained from using information with boosting.
gain for data splitting and reduced-error pruning algorithm.
7. Table 3 The comparison results on the prediction accuracy and standard deviation (%) of heart database.
Our GA Decision trees Decision trees Neural networks Naive Bayes
approach (IG, Red-Err) with boosting (1-of-N, incr
(IG, Red-Err, learning)
21 trees)
8. Run 1 89.83 81.36 88.14 81.36 89.83
Run 2 83.05 77.97 84.75 84.75 79.66
Run 3 81.36 76.27 76.27 74.58 83.05
Run 4 88.14 76.27 83.05 89.83 83.05
Run 5 83.33 66.67 71.67 81.67 80.00
Average 85.14 75.71 80.77 82.44 83.12
Standard deviation 3.64 5.46 5.54 5.66 4.08
In table 3, the best results from neural networks are Domingos, P. and Pazzani, M. (1997) On the Optimality of
obtained from applying incremental learning and 1-of-N the Simple Bayesian Classifier under Zero-One Loss.
network output encoding. Machine Learning, 29, 103-130.
From the above results we can see that our GA approach Freund, Yoav, and Schapire, R. E. (1996) Experiments with
outperformed other approaches on both the average a new boosting algorithm. In Machine Learning:
prediction accuracy and the standard deviation. The Proceedings of the Thirteen International Conference, pp.
advantage of our GA approach becomes more obvious on 148-156.
heart database, which is most difficult to learn among the Greene, D., P. and Smith, S. F. (1993) Competition-based
three. During the process of running, we also found that the induction of decision models from examples. Machine
training accuracy and testing accuracy of GA approach are Learning, 13, 229-257.
basically in the same level, while the training accuracy is Jacobs, R.A. (1988) Increased Rates of Convergence
often much higher than testing accuracy for other Through Learning Rate Adaptation. Neural Networks, 1(4):
approaches. This proves that GA approach is less sensitive 295-307.
to noise and might be more effective for future prediction. Janikow, C. Z. (1993) A knowledge-intensive genetic
algorithm for supervised learning. Machine Learning, 13,
5. Conclusions and future work 189-228.
Mingers, J. (1987) Expert Systems – Rule Induction with
In this paper we put forward a genetic algorithm approach Statistical Data. Journal of the Operational Research
for classification problem. An individual in a population is a Society, 38, 39-47.
complete solution candidate that consists of a fixed number Mitchell, Tom. (1997) Machine Learning. New York:
of rules. Rule consequent is not explicitly encoded in the McGraw-Hill.
string but determined by the match situation on training Niblett, T. (1986) Constructing Decision Trees in Noisy
examples of the rule. To consider the performance affecting Domains. In I. Bratko and N. Lavrac (Eds). Progress in
factors as complete as possible, four elements are included Machine Learning. England: Sigma Press.
in the fitness function which are predicting error rate, Noda, E., Freitas, A. A. and Lopes, H. S. (1999)
entropy measure, rule consistency and hole ratio, Discovering interesting prediction rules with a genetic
respectively. Adaptive asymmetric mutation and two-point algorithm. In Proceedings of 1999 Congress on
crossover are adopted in reproduction step. The inversion Evolutionary Computation (CEC’ 99), pp. 1322-1329.
probability of 1-0 (0-1) in mutation is self-adaptive by the Nguyen, D. and Widrow, B. (1990) Improving the Learning
feedback of average fitness during the run. The generated Speed of Two-Layer Networks by Choosing Initial Values
classifier after evolution is voting-based. Rules are not of the Adaptive Weights. International Joint Conference on
disjoint but allowed to overlap. Classifier gives all rules the Neural Networks, San Diego, CA, III:21-26.
equal weight for their votes. We tested our algorithm on Quinlan, J.R. (1986) Induction of Decision Trees. Machine
three real databases and compared the results with four Learning, 1, 81-106.
other traditional data mining approaches. It is shown that Quinlan, J.R. (1987) Simplifying Decision Trees.
our approach outperformed other approaches on both International Journal of Man-Machine Studies, 27, 221-234.
prediction accuracy and the standard deviation. Quinlan, J. R. (1996) Bagging, Boosting, and C4.5. In
Further testing on various databases is in progress to test Proceedings of the Thirteenth National Conference on
the robustness of our algorithm. Splitting continuous Artificial Intelligence, pp. 725-730.
attribute into multiple intervals rather than just two intervals Rumelhart, D. E., Hinton, G.E., and William, R. J. (1986)
based on a single threshold is also considered to improve the Learning Representations by Back-Propagation Error.
performance. Nature, 323:533-536.
Yang, L. and Yen, J. (2000) An adaptive simplex genetic
Bibliography algorithm. In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO 2000), July 2000, pp.
De Jong, K. A., Spears, W. M., and Gordon, D. F. (1993)
379.
Using genetic algorithms for concept learning. Machine
Learning, 13, 161-188.