A survey of Evolutionary Algorithms for Data Mining and Knowledge ...Presentation Transcript
A survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery Alex Freitas (2001) A look at Genetic Algorithms(GA) and Genetic Programming(GP) also from the point-of-view of the user. An overview of Data Mining and Knowledge Discovery The three desirable properties of KD: 1)Accuracy. We would like a high predictive accuracy rate for our data mining task of classification 2)Comprehensible.The user should not be presented with a “black box” predictor that spouts out a number of IF_THEN rules and say this is the output. Prediction rules should be comprehensible. 3)Interesting. In many cases the user should be presented with an ensemble of discovered rules not all of which will be interesting in practice.
Show the user some schematics:
Data Set#3---------->Data Integration----->Pre-Processing
Data Set#4 Data Mining
Data Set#5 Post-Processing
Data Set#6 Output Knowledge
“ We believe that ideally a combination of subjective and objective approaches should be used to try to solve the very hard problem of returning interesting knowledge to the user”
Objective approaches: clearly the algorithms and computer processing give us high-level knowledge
Subjective approaches: user helps to define and re-define those formulas or IF_THEN rules that are of interest to the knowledge discovery.
Data Mining Task: Classification
Many of the KD procedures are to predict the value (the class) of a user-specified goal attribute based on the predicting attributes.
E.G. IF( Unpaid_Loan =“NO”) AND (Overdrafts=“Yes”) THEN ( Credit = “Bad”)
For a comprehensive discussion about how to measure predictive accuracy of classification rules we are referred to  Hand, DJ , Construction and Assessment of Classification Rules
Tom Mitchell’s book also has good information.
Data Mining : Other tasks
Dependence Modelling :an extension or generalization of classification
Discovery of Association Rules: more than one item in the consequent attribute is possible; the classification task may be assymmetric with respect to the predicting attributes and the consequent attribute
The Knowledge Discovery Process
Discretization: transforming a “continuous” attribute into a discrete one. (For example, “low”, “medium”, “high”)
Attribute Selection: selecting a set of attributes relevant for classification.
The motivation for attribute selection may be obvious: it has been found that irrelavant attributes can somehow “confuse” the data mining algorithm, leading to the discovery of inaccurate or useless knowledge. ( A wrapper method can help select attributes.)
Discovered Knowledge Postprocessing
Two main motivations:
First , when the discovered rule set is large, we want to simplify it. (The user may or may not help at this stage.) Some other techniques will be addressed.
Second, we often want to extract a subset of interesting rules, among all the discovered ones. We will look at some objective methods later on(GA) but subjective methods involving a user/data miner collaboration may also be important.
Genetic Algorithms (GA) for Rule Discovery
Michigan approach: population consists of individuals(“chromosomes”) where each individual encodes a single prediction rule.
Pittsburgh approach: each individual encodes a set of prediction rules
Pluses and minuses: the Pittsburgh approach directly takes into account rule interaction when computing the fitness function of an individual. This approach leads to syntactically-longer individuals. In the Michigan approach the individuals are simpler and syntactically shorter. It simplifies the design of genetic operators. (Interactions in Michigan approach are not taken into account.)
Take the rule: IF cond#1 AND cond#2 AND …cnd#n….THEN class= c(i)
Representation of the rule antecedent
Representation of rule consequent ( the THEN part)
The Rule Antecedent (Using GA)
Often there is a conjunction of conditions.
Usually use binary encoding.
A given attribute can take on k discrete values. Encoding can consist of k bits.( for “on” or “off”) Allows for internal disjunctions.
0 0 1 1 1 0 1 1 0 0………0
All bits can be turned into “1” ’s in order to “turn off” this condition.
Non-binary encoding is possible. Variable-length individuals will arise. May have to modify crossover to be able to cope with variable-length individuals.
Representing the Rule Consequent (Predicted Class)
Three ways of representing the predicted class. (the THEN part)
First, encode it in the genome of an individual (possibly making it subject to evolution.)
Second, associate all individuals of the population with the same predicted class, which is never modified during the running of the algorithm.
Third, choose the predicted class most suitable for a rule (a deterministic way) as soon as the corresponding rule antecedent is formed. ( E.G. Maximize fitness.)
Author believes the third possibility to be the most sound overall.
Genetic Operators for Rule Discovery
Selection: Each individual represents a single rule.(Michigan approach). An approach called REGAL can be used. Individuals to be “mated” are “elected” by training examples. (Use a fitness operator or a probabilistic model.)
Generalizing/specializing crossover: basic idea of this special kind of crossover is to generalize or specialize a given rule, depending on whether it is currently overfitting or underfitting the data.
Generalizing/specializing-condition operator. The g/s of a rule can be done in a way independent of crossover. (Tweaking the antecedent conditions especially if contnuous conditions exist.)
How to get these 3 rule quality criteria incorporated into a fitness function.
Let a rule be IF A THEN C , the calculate the confidence factor CF = TP / ( TP + FP) using chart:
Given Actual Class
. C not C
. Predicted C TP FP
. Class not C FN TN
Comp=completeness measure = TP / (TP + FN)
Fitness = CF * COMP = (TP)(TP) / (TP+FP)(TP+FN)
Fitness = w1 X (CF * COMP) + w2 X (Simp) where Simp is a measure of rule simplicity 0< Simp<1 and w1 and w2 are weights
Genetic Algorithms (GAs) for Pre-processing
“ The use of GAs for attribute selection seems natural. The main reason is that the major source of difficulty in attribute selection is attribute interaction, and one of the strengths of Gas is that they usually cope well with attribute interactions.”
We can use very simple genetic encoding where each individual represents a candidate attribute subset. A candidate attribute subset can be represented as a string with m binary genes where m is the number of attributes and each gene can take on a “0” or “1”.
Follow crossover and mutation procedures.
GA can be used with nearest neighbour algorithms (NNA) to “tweak” for better results.
Genetic Algorithms (GAs) for Post-processing
GAs can be used in the post-processing step when there is an ensemble of classifiers (e.g. rule sets) created.
Generating an ensemble of classifiers is useful since it has been shown that in several cases an ensemble of classifiers has a better predictive accuracy than a single classifier.
A fitness function may be created using weights for each classifier in the ensemble. (A user may help.) There are also GA schemes to optimize the weights of the classifiers.
Genetic Programming (GP) for Rule Discovery
Individual representation: attributes are often numeric
Functions such as +,-,*,/ , < , >,=, AND, OR,… are used as well as input arguments.
An individual is often represented as a tree diagram.
Once we apply the functions in the internal nodes of a GP individual to the values of the attributes in the leaf nodes of the individual, the system computes a numerical value that is output at the root of the tree.
Discovering comprehensible rules using GP: these rules could be similar to GA rules but there are some othersuch as:
This could lead to discovery of short, simple rules that may be required in the Medical field.
Genetic Programming (GP) for Data Pre-Processing
A Major problem in attribute construction is that the search space tends to be huge. If the search can be accomplished especially with the relational operator “>” then many good candidate operations may evolve.
Sometimes there are GA/GP methods for pre-processing.
In his Chapter on evolutionary algorithms Alex Freitas has shown us where the emphasis should be laid in Knowledge learning. His goal of “transparency” of pre-processing, rule learning and post-processing methods should be taken into account. A user would like to know where the classifier or the IF_THEN rule came from and if he can help influence the process to get a more intelligent result.
High Classification Accuracy does not imply Effective Genetic Search (Tim Kovacs;Manfred Kerber)
The authors are publishing their experimental results which they believe will help clear up some of the limitations of GA methods. In particular they examine XCS, a popular classification system which uses genetic algorithms.
The paper by K+K refers us to work by Stewart Wilson(1995) entitled “ Classifier Fitness Based on Accuracy”. In XCS each classifier maintains a prediction of expected payoff, but the classifier’s fitness is given by a measure of the predictor’s accuracy. Wilson’s example shows some individuals in the population P:
. p є F
Table of #011:01 43 .01 99
Population P 11##:00 32 .13 9
. #0##:11 14 .05 5
Where p=prediction є = prediction error F= fitness parameter
XCS and XCS-NGA are compared
The authors XCS-NGA (XCS with no GA) uses XCS modified so that genetic search does not operate on the initial rule population. In all other respects XCS-NGA functions as XCS.
XCS classifies data points by a vote among the rules which match it, with each vote weighted both by the rule’s fitness. In this way a low-accuracy rule and a high-accuracy rule is given the classification of the high-accuracy rule.
In XCS, the rules (region shapes and sizes) are adapted by the GA.
XCS-NGA lacks a GA and its region shapes and sizes do not change.
XCS-NGA relies on there being enough rules to adequately solve the rule improvement problem (rule discovery) by random chance.
Roughly speaking, XCS-NGA’s approach is to generate many random rules and ignore those which happen to have low accuracy.
High Accuracy implications of XCS-NGA
The authors have obtained very good results with XCS-NGA. They agree that this alternative has its limitations but they want to address the publication of classification accuracy as a goal onto itself.
Many published papers use 6-bit multiplexer functions only. (Strings are of length L= k + 2^k so that if k=2bits then L=6 . For a 70-bit multiplexer we have : k=6 bits, and L= 70.)
Because of its random nature the XCS-NGA thrives better with more initial rules and then give excellent results.
The claim is that they argue that only those studies which claim effective genetic research based on results with small functions are demonstrated invalid by their results with XCS-NGA.
A More Powerful Metric for Evaluating Genetic Search
This metric is symbolized by %[O] where %[O] = the proportion of the Optimal solution in the rule population on a given time step.
This metric has been shown to have greater discriminatory power than the performance metric introduced by Wilson.
Wilson’s “performance” is defined as a moving average of the proportion of the last n trials in which the system has responded with the correct action. ( n is traditionally equal to 50.)
There are often 400 rules to start with. The XCS-NGA does better with more rules.
%[O] is better able to discern the progress of genetic search than the performance metric.
The new metric extends the utility of small tests.
%[O] has disadvantages including the need to compute the optimal solution in advance as well as the computational expense of evaluating it.
Finally, replacing GA with an interactive random rule generator would provide a baseline against which to compare genetic search.